Introduction
Websites that aggregate content are becoming more and more popular, because we want all the information available in one place, fast and easy to access. From time to time you come across tasks that require retrieving data from a password-protected area. For example, I use Guru.com for my job search. They post lots of projects there, but they don’t offer convenient listing and filtering. So I developed my own tool that grabs everything from there and ranks it the way I need. In this post we’ll take a look at different types of password-protected areas and see how to deal with each of them.
Cookie-based authorization
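We’ll start with the most popular type of authorization.
First, let’s look at how data flows between the user and the server: the user submits the login form, the server checks the credentials and responds with a cookie (usually carrying a session ID), and the browser then sends that cookie back with every subsequent request.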
From this process it’s evident that to pull data from the password-protected area, we basically have to “catch” the cookie and then pass it along with every request. It sounds complex, but luckily we have the cURL PHP extension, which does all the dirty work.
So all we have to do is:
- Send login request
- Catch cookie
- Send all data requests we need
The second step is handled automatically by cURL, so we can just focus on authentication and data retrieval.
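To make the “catch cookie” step less abstract, here is a minimal sketch of roughly what cURL does for us behind the scenes. The helper name extractCookies and the sample headers are made up for illustration, and real cookie handling also tracks domains, paths and expiry, which this sketch ignores.

```php
<?php
/**
 * Hypothetical helper (not part of the code in this post): pulls cookie
 * name/value pairs out of raw HTTP response headers.
 */
function extractCookies($rawHeaders)
{
    $cookies = array();
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        // Only "Set-Cookie:" lines matter; header names are case-insensitive
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        // Drop the header name and keep only the leading "name=value" pair;
        // attributes such as Path or HttpOnly follow after the first ";"
        $pair = trim(substr($line, strlen('Set-Cookie:')));
        $pair = explode(';', $pair, 2)[0];
        $parts = explode('=', $pair, 2);
        if (count($parts) === 2) {
            $cookies[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $cookies;
}

// Sample response headers, invented for this example
$headers = "HTTP/1.1 200 OK\r\n"
         . "Set-Cookie: session_id=abc123; Path=/; HttpOnly\r\n"
         . "Set-Cookie: lang=en; Path=/\r\n"
         . "Content-Type: text/html\r\n";

$cookies = extractCookies($headers);

// Build the Cookie header a client would send back with later requests
$pairs = array();
foreach ($cookies as $name => $value) {
    $pairs[] = $name . '=' . $value;
}
$cookieHeader = 'Cookie: ' . implode('; ', $pairs);
echo $cookieHeader, "\n";
```

With the cookie file options used later in this post you never have to write this yourself; the sketch only shows what travels back and forth.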
Implementation
We’ll basically need 3 functions:
- requestContent – this will make the actual HTTP request to the URL we provide, with the parameters we provide (POST and GET variables, the HTTP Referer, etc.)
- authenticate – this will send the authentication info to the server using the requestContent function
- getPage – this will actually get the data we need using the same requestContent function. You could call requestContent directly, but you’ll almost certainly want to post-process the result, for example to extract some information from it, so it’s better to define a separate function for that.
So let’s go! I’ve created 3 files: config.php, where we’ll store config data; functions.php, where we’ll define our functions; and index.php, which actually does the job.
config.php follows:
<?php
/**
 * Specifies the user agent. You can put anything you want here; I'm just
 * showing that with cURL you can fake anything. I just copied my User-Agent
 * string. Make sure you don't have line breaks in it!
 */
define('GR_USER_AGENT', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 FirePHP/0.2.4');

/**
 * Path to the file where cURL will store cookies. cURL behaves like a browser:
 * it stores all the cookies it gets somewhere and then submits them with each
 * HTTP request.
 * (The path below is a placeholder; use any file PHP is allowed to write to.)
 */
define('GR_COOKIE_FILE', dirname(__FILE__) . '/cookies.txt');

/**
 * URL where you log in to your service. We'll be logging in to Wikipedia.
 * I took this path from the action parameter of the <form> tag on this page:
 * http://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Special:UserLogout
 *
 * You should just take a look at the source code of the page you're going to
 * log in from, find the login form and see the action parameter there.
 */
define('GR_LOGIN_URL', 'http://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login');

/**
 * Your username
 */
define('GR_USERNAME', 'your_username'); // placeholder value

/**
 * Your password
 */
define('GR_PASSWORD', 'your_password'); // placeholder value
?>
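One thing worth checking before going further is whether PHP can actually write to the cookie file. As far as I know, cURL does not raise an error when it cannot save cookies, so a wrong path shows up only as logins that never “stick”. Here is a small sketch; the function name cookieFileIsUsable and the example path are my own assumptions, not part of the original files.

```php
<?php
/**
 * Hypothetical helper: returns true if the cookie file either exists and is
 * writable, or can be created in its (existing, writable) directory.
 */
function cookieFileIsUsable($path)
{
    if (file_exists($path)) {
        return is_file($path) && is_writable($path);
    }
    $dir = dirname($path);
    return is_dir($dir) && is_writable($dir);
}

// Example path inside the system temp directory, chosen only for illustration
$path = sys_get_temp_dir() . '/gr_cookies_example.txt';
echo cookieFileIsUsable($path)
    ? "cookie file is usable\n"
    : "cookie file is NOT writable\n";
```

In the real script you would call it with GR_COOKIE_FILE right after including config.php.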
In functions.php I define our 3 functions. requestContent looks like this:
function requestContent($url, $postData = null, $referer = null, $headers = null)
{
    $ch = curl_init(); // create a cURL handle
    curl_setopt($ch, CURLOPT_URL, $url); // set our URL
    curl_setopt($ch, CURLOPT_HEADER, 0); // don't include response headers in the result
    if (defined('GR_USER_AGENT') && GR_USER_AGENT) {
        curl_setopt($ch, CURLOPT_USERAGENT, GR_USER_AGENT); // set the user agent
    }
    if ($postData) { // if we have POST data
        curl_setopt($ch, CURLOPT_POST, 1); // tell cURL we'll be using POST
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postData); // set the POST data
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // force cURL to follow any redirects
    if ($referer) {
        curl_setopt($ch, CURLOPT_REFERER, $referer); // if a referer is specified, set it
    }
    if (defined('GR_COOKIE_FILE') && GR_COOKIE_FILE) { // if there is a cookie file
        // the next 2 options are needed in order to use cookies correctly everywhere:
        // COOKIEFILE is where cookies are read from, COOKIEJAR is where they are saved
        curl_setopt($ch, CURLOPT_COOKIEFILE, GR_COOKIE_FILE);
        curl_setopt($ch, CURLOPT_COOKIEJAR, GR_COOKIE_FILE);
    }
    if ($headers) { // if there are additional headers
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers); // set them
    }
    $result = curl_exec($ch); // execute the cURL request
    curl_close($ch); // close the cURL handle
    return $result; // return the response body (false on failure)
}
Basically, here we initialize cURL and get a handle, much like fopen() returns a file handle. Then we set all the options we need, and finally execute the HTTP request and return the result.
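Note that curl_exec() returns false when the request fails, and requestContent passes that value straight back to the caller. If you would rather fail loudly, something along these lines might work. The function name fetchOrFail and the timeout values are my own assumptions; it is a sketch, not a replacement for the function above.

```php
<?php
/**
 * Hypothetical variant: fetches a URL and throws an exception with
 * cURL's own error message instead of returning false.
 */
function fetchOrFail($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Give up if the connection can't be established within 10 seconds
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    // Give up if the whole transfer takes longer than 30 seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    $result = curl_exec($ch);
    if ($result === false) {
        // Read the error message before the handle is closed
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException('cURL request failed: ' . $error);
    }
    curl_close($ch);
    return $result;
}

// Usage (not executed here, since it needs network access):
// try {
//     $html = fetchOrFail('http://en.wikipedia.org/');
// } catch (RuntimeException $e) {
//     echo $e->getMessage();
// }
```

The same check can be added to requestContent itself if you prefer to keep a single function.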
Authentication builds on the previous function, which does all the low-level work. We only have to examine the login form and determine which variables to send. So, open the login page of the service you want to access, find the <form> tag and copy the login URL from its action attribute; this goes into the config.php file. Then see what fields the form has. Note that there may be hidden fields as well; write down their names and create an associative array for them (see the following code). If you’re not sure which variables are actually sent, you can use the Live HTTP Headers plugin for Firefox.
Usage is simple. Open the login page and fill in the form. Then start the plugin (Tools -> Live HTTP headers). Submit the form and you’ll see a number of requests listed. You only need the first one: this is your login request.
[Screenshot: the Live HTTP Headers window showing the login request]
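If you prefer to collect the form fields programmatically instead of reading the page source by hand, a sketch like the following may help. It assumes the DOM extension is available; the function name extractFormFields and the sample form, including its hidden token field, are invented for illustration.

```php
<?php
/**
 * Hypothetical helper: returns name => value for every named <input>
 * found in the given HTML, hidden fields included.
 */
function extractFormFields($html)
{
    // Real-world HTML is rarely valid, so collect parser warnings silently
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = array();
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// A made-up login form; the "token" field stands in for any hidden field
$html = '<form action="/login" method="post">'
      . '<input type="text" name="wpName" value="">'
      . '<input type="password" name="wpPassword" value="">'
      . '<input type="hidden" name="token" value="d41d8c">'
      . '<input type="submit" name="wpLoginAttempt" value="Log in">'
      . '</form>';

$fields = extractFormFields($html);
print_r($fields);
```

You would then overwrite the username and password entries with your own values and send the whole array with the login request.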
And now the code for authenticate():
function authenticate()
{
    // fields of the login form, including the submit button
    $post_info = array(
        'wpName' => GR_USERNAME,
        'wpPassword' => GR_PASSWORD,
        'wpLoginAttempt' => 'Log in');
    // URL-encode the names and values and join them with "&"
    $postData = http_build_query($post_info);
    $result = requestContent(GR_LOGIN_URL, $postData, GR_LOGIN_URL);
    // here you may check if you were logged in successfully
    return $result;
}
Once you have the result, you can parse it to see whether the login succeeded. After this request completes, you also have the session ID stored in a cookie; you can open the cookie file and take a look yourself. From now on you can send any request to the service and it will respond as if you were a logged-in user. The trick is that cURL sends it the cookies from the cookie file, and that file holds the session ID or whatever else the site uses to identify the user.
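The cookie file is plain text in the Netscape cookie format: one cookie per line, seven tab-separated fields, with the name and value last. Here is a rough sketch for reading it from PHP. The function name parseCookieJar and the sample cookie names are assumptions made up for this example.

```php
<?php
/**
 * Hypothetical helper: parses the contents of a Netscape-format cookie
 * file (as written by cURL) into a name => value array.
 */
function parseCookieJar($contents)
{
    $cookies = array();
    foreach (preg_split('/\r\n|\n/', $contents) as $line) {
        if (strpos($line, '#HttpOnly_') === 0) {
            // HttpOnly cookies are stored with this prefix on the domain
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue; // blank line or comment
        }
        // domain, subdomain flag, path, secure flag, expiry, name, value
        $fields = explode("\t", $line);
        if (count($fields) === 7) {
            $cookies[$fields[5]] = $fields[6];
        }
    }
    return $cookies;
}

// Sample file contents, invented for illustration
$sample = "# Netscape HTTP Cookie File\n"
        . "\n"
        . "en.wikipedia.org\tFALSE\t/\tFALSE\t0\tenwiki_session\tabc123\n"
        . "#HttpOnly_.wikipedia.org\tTRUE\t/\tFALSE\t1893456000\tenwikiUserID\t42\n";

$jar = parseCookieJar($sample);
print_r($jar);
```

In the real script you would pass it file_get_contents(GR_COOKIE_FILE) and check that the cookie your site uses for the session is present.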
So my getPage() function is a simple wrapper around requestContent():
function getPage()
{
    // fetch a page; the cookies saved during login are sent automatically
    return requestContent('http://en.wikipedia.org/');
}
Basically, that’s all you need in order to get into the protected area. Now you’re free to do anything you want. Regular expressions are the most common way to pull data out of a page, so you may want a tool for testing your regexps. I really recommend “The Regex Coach”; it greatly simplifies debugging.
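As a small illustration of the regexp approach, here is a sketch that extracts links and their texts from a piece of HTML. The markup is made up, and the pattern is deliberately simple: it assumes href is the first attribute and is double-quoted, so treat it as a starting point only.

```php
<?php
// Made-up HTML standing in for a page returned by getPage()
$html = '<ul>'
      . '<li><a href="/wiki/PHP">PHP</a></li>'
      . '<li><a href="/wiki/CURL" title="cURL">c<b>URL</b></a></li>'
      . '</ul>';

// Capture the href value and the inner HTML of every <a> tag;
// "i" makes the match case-insensitive, "s" lets "." span line breaks
preg_match_all('~<a\s+href="([^"]+)"[^>]*>(.*?)</a>~is', $html, $matches, PREG_SET_ORDER);

$links = array();
foreach ($matches as $match) {
    // $match[1] is the URL, $match[2] the link text (nested tags stripped)
    $links[$match[1]] = strip_tags($match[2]);
}
print_r($links);
```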
HTTP authentication
Another approach to authentication, sometimes used in control panels, is HTTP authentication. You have probably seen it: the browser shows a standard dialog prompting for a username and password. The username and password are then sent with each request; you’re not prompted for them every time only because the browser remembers them. To access such an area you don’t need any authenticate() function. You just pass the login and password with each request.
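To show what “sent with each request” means in practice, here is a sketch of the header used by the Basic scheme, which is the most common flavour of HTTP authentication. The function name basicAuthHeader is hypothetical; cURL builds this header for you, so this is purely for illustration.

```php
<?php
/**
 * Hypothetical helper: builds the Authorization header for the Basic
 * scheme, i.e. "username:password" encoded with base64.
 */
function basicAuthHeader($username, $password)
{
    return 'Authorization: Basic ' . base64_encode($username . ':' . $password);
}

// Example credentials, invented for illustration
$header = basicAuthHeader('Aladdin', 'open sesame');
echo $header, "\n";
```

Other schemes such as Digest work differently, so this sketch only covers the Basic case.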
Here is the modification you need to make to requestContent() above:
function requestContent($url, $postData = null, $referer = null, $headers = null)
{
    $ch = curl_init(); // create a cURL handle
    curl_setopt($ch, CURLOPT_URL, $url); // set our URL
    curl_setopt($ch, CURLOPT_HEADER, 0); // don't include response headers in the result
    curl_setopt($ch, CURLOPT_USERPWD, GR_USERNAME . ':' . GR_PASSWORD); // <-- here is the change: you pass your auth data each time
    if (defined('GR_USER_AGENT') && GR_USER_AGENT) {
        curl_setopt($ch, CURLOPT_USERAGENT, GR_USER_AGENT); // set the user agent
    }
    if ($postData) { // if we have POST data
        curl_setopt($ch, CURLOPT_POST, 1); // tell cURL we'll be using POST
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postData); // set the POST data
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // force cURL to follow any redirects
    if ($referer) {
        curl_setopt($ch, CURLOPT_REFERER, $referer); // if a referer is specified, set it
    }
    if (defined('GR_COOKIE_FILE') && GR_COOKIE_FILE) { // if there is a cookie file
        // the next 2 options are needed in order to use cookies correctly everywhere
        curl_setopt($ch, CURLOPT_COOKIEFILE, GR_COOKIE_FILE);
        curl_setopt($ch, CURLOPT_COOKIEJAR, GR_COOKIE_FILE);
    }
    if ($headers) { // if there are additional headers
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers); // set them
    }
    $result = curl_exec($ch); // execute the cURL request
    curl_close($ch); // close the cURL handle
    return $result; // return the response body (false on failure)
}
That’s all I wanted to share with you today. I hope you find it helpful! If you have any questions or comments, feel free to add them!
Quick Note on OOP
I didn’t create a Curl class or present the solution as a class because I don’t think it would make my explanations clearer or the code easier to use. This is not a full solution, just a technique, and I tried to present the code in such a way that you can use it with minimal changes. You can easily put these functions into a class.
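For those who do want a class, here is one possible shape it could take. The class name ProtectedAreaClient and its method names are my own assumptions; the main difference from the functions above is that the cookie file and user agent are passed to the constructor instead of being read from constants.

```php
<?php
/**
 * Hypothetical class wrapping the same technique: one cookie file per
 * client object, shared by the login request and all later requests.
 */
class ProtectedAreaClient
{
    private $cookieFile;
    private $userAgent;

    public function __construct($cookieFile, $userAgent = '')
    {
        $this->cookieFile = $cookieFile;
        $this->userAgent = $userAgent;
    }

    /** Sends a GET request, or a POST request when $postData is given. */
    public function request($url, $postData = null, $referer = null)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        // Read cookies from, and save them to, the same file
        curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookieFile);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookieFile);
        if ($this->userAgent) {
            curl_setopt($ch, CURLOPT_USERAGENT, $this->userAgent);
        }
        if ($postData) {
            curl_setopt($ch, CURLOPT_POST, 1);
            curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
        }
        if ($referer) {
            curl_setopt($ch, CURLOPT_REFERER, $referer);
        }
        $result = curl_exec($ch);
        curl_close($ch);
        return $result;
    }

    /** Posts the login form fields; the session cookie lands in the cookie file. */
    public function login($loginUrl, array $fields)
    {
        return $this->request($loginUrl, http_build_query($fields), $loginUrl);
    }
}

// Usage (not executed here, since it needs network access and real credentials):
// $client = new ProtectedAreaClient('/path/to/cookies.txt');
// $client->login($loginUrl, array('wpName' => $user, 'wpPassword' => $pass));
// $html = $client->request('http://en.wikipedia.org/');
```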
Onelableva says:
Hi, courteous posts there
thank’s concerning the compelling advice
May 24, 2009, 23:25
Konstantin Mirin says:
Thanks
I hope this will be really helpful for my readers.
May 25, 2009, 11:03
soxBeerry says:
Hi, Congratulations to the site owner for this marvelous work you’ve done. It has lots of useful and interesting data.
June 5, 2009, 21:13
Rings says:
Thank you very much for that great article
August 3, 2009, 01:42
babafisa says:
There’s a lot of information here. I’ll be back again.
August 3, 2009, 17:17
Crasty says:
Are you a professional journalist? You write very well.
August 6, 2009, 23:39
Konstantin Mirin says:
Thanks. I’m working for you
August 21, 2009, 13:53
Konstantin Mirin says:
You are welcome. Subscribe to RSS so you don’t miss new posts.
August 21, 2009, 13:54
Konstantin Mirin says:
Thanks. Have you subscribed to RSS?
August 21, 2009, 13:54
Konstantin Mirin says:
Thanks for such a compliment
No, I am a professional developer and stand far from journalism. Furthermore, English is not my native language. But I do my best.
August 21, 2009, 13:56
8 Firefox plugins I use every day | Programmer's Notes says:
[...] go here and there while you’re requesting the page. It’s a great thing when you need to grab some content from the password-protected area or inspect what’s going on when you request [...]
March 16, 2010, 08:13
ScallioXTX says:
Works like a charm, I only added curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); to prevent the script from hanging indefinitely in case the connection to the server could not be established.
May 4, 2010, 10:33
Monah says:
Hmm… Something’s wrong with the links…
June 5, 2010, 01:27
DREE says:
Hi,
I tried to learn these things, but am a beginner to all this. Wanted to request you to help me login to a specific site using curl and php.
Please let me know if you have some time to help me.
Regards,
D
July 18, 2010, 13:24