
Introduction

Websites that aggregate content are becoming more and more popular, because we want all the information in one place, quickly and easily accessible. From time to time you come across tasks that require retrieving data from a password-protected area. For example, I use Guru.com for my job search. They post lots of projects there, but they don’t offer convenient listing and filtering. So I developed my own tool that grabs everything from there and ranks it the way I need. So we’ll take a look at different types of password-protected areas and see how to deal with each of them.

Cookie-based authentication

We’ll start with the most popular type of authentication :) First, let’s take a look at how data travels between the user and the server:

[Figure: Cookie-based authentication]


From the process it is evident that to pull data from the password-protected area, we basically have to “catch” the cookie and then pass it along with every subsequent request. It may seem complex, but luckily we have the cURL PHP extension, which does all the dirty work.

So all we have to do is:

  1. Send login request
  2. Catch cookie
  3. Send all data requests we need

The second step is handled automatically by cURL, so we can just focus on authentication and data retrieval.

Implementation

We’ll basically need 3 functions:

  • requestContent – makes the actual HTTP request to the URL we provide, with the parameters we provide (POST data, GET variables, HTTP referer etc.)
  • authenticate – sends the authentication info to the server using the requestContent function
  • getPage – gets the data we need using the same requestContent function. You could call requestContent directly instead, but you’ll almost certainly want some post-processing of the result, for example extracting some info from it. So it’s better to define a separate function that does it.

So let’s go! I’ve created 3 files: config.php, where we’ll store the config data; functions.php, where we’ll define our functions; and index.php, which actually carries out the job.
config.php follows:

    <?php
    /**
     * Specifies the user agent. You can put anything you want here; I'm just showing
     * that with cURL you can fake anything. I just copied my User-Agent string. Make
     * sure you don't have line breaks in it!
     */
    define('GR_USER_AGENT', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 FirePHP/0.2.4');
    /**
     * Path to the file where cURL will store cookies. cURL behaves like a browser:
     * it stores all the cookies it gets somewhere and then submits them with each
     * HTTP request.
     */
    define('GR_COOKIE_FILE', 'cookies.txt');
    /**
     * URL where you log in to your service. We'll be logging in to Wikipedia.
     * I took this path from the action parameter of the <form> tag on this page:
     * http://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Special:UserLogout
     *
     * You should just take a look at the source code of the page you're going to
     * log in from, find the login form and see the action parameter there.
     */
    define('GR_LOGIN_URL', 'http://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login');
    /**
     * Your username
     */
    define('GR_USERNAME', 'KJedi');
    /**
     * Your password
     */
    define('GR_PASSWORD', 'did you think I\'ll give you my pass? :)');
    ?>

In functions.php I define our 3 functions. requestContent looks as follows:

    function requestContent($url, $postData = array(), $referer = '', $headers = array())
    {
        $ch = curl_init();//creating cURL instance
        curl_setopt($ch, CURLOPT_URL, $url);//setting our URL
        curl_setopt($ch, CURLOPT_HEADER, 0);//don't include response headers in the result
        if(GR_USER_AGENT) curl_setopt($ch, CURLOPT_USERAGENT, GR_USER_AGENT);//setting user agent
        if($postData)//if we have post data (an array or an already encoded string)
        {
            curl_setopt($ch, CURLOPT_POST, 1);//tell cURL we'll be using POST
            curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);//set post data
        }
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);//return the response instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);//force cURL to follow any redirects
        if($referer) curl_setopt($ch, CURLOPT_REFERER, $referer);//if there is a referer specified, set it
        if(GR_COOKIE_FILE)//if there is a cookie file
        {
            //the next 2 options are needed in order to use cookies correctly everywhere:
            //COOKIEFILE is where cookies are read from, COOKIEJAR is where they are saved
            curl_setopt($ch, CURLOPT_COOKIEFILE, GR_COOKIE_FILE);
            curl_setopt($ch, CURLOPT_COOKIEJAR, GR_COOKIE_FILE);
        }
        if ($headers)//if there are additional headers
        {
            curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);//set them
        }
        $result = curl_exec($ch);//execute cURL query (returns false on failure)
        curl_close($ch);//close cURL
        return $result;//return cURL result
    }

Basically, here we initialize cURL and get a handle, much like with fopen(). Then we set all the options we need, and finally execute the HTTP query and return the result.
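
One thing requestContent() does not do is set timeouts or report errors: curl_exec() simply returns false when something goes wrong. Here is a minimal sketch of how that might be handled. fetchWithTimeout() is a hypothetical helper that is not part of the original code, the timeout values are arbitrary, and it only covers plain GET requests.

```php
<?php
/**
 * Hypothetical helper (not part of the original code): performs a GET request
 * with timeouts and reports cURL errors instead of failing silently.
 */
function fetchWithTimeout($url, $connectTimeout = 10, $totalTimeout = 30)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Give up if the connection cannot be established in time
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $connectTimeout);
    // Give up if the whole transfer takes too long
    curl_setopt($ch, CURLOPT_TIMEOUT, $totalTimeout);

    $body   = curl_exec($ch);                         // false on failure
    $errno  = curl_errno($ch);                        // 0 when there was no error
    $error  = curl_error($ch);                        // human-readable message
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);  // 0 if no response arrived
    // The handle is released when $ch goes out of scope at the end of the function

    return array(
        'ok'     => ($body !== false),
        'status' => $status,
        'body'   => ($body === false) ? '' : $body,
        'error'  => ($errno === 0) ? '' : $errno . ': ' . $error,
    );
}
```

The same two timeout options and the curl_errno()/curl_error() check could be added to requestContent() itself, right around the curl_exec() call.
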
Authentication is built on the previous function, which carries out all the low-level work. We only have to examine the login form and determine which variables to send. So open the login page of the service you want to access, find the <form> tag and copy the login URL from its action attribute. You’ll have to put it into the config.php file. Then see what fields are there. Note that there may also be hidden fields; note their names and create an associative array for all of them (see the following code). If you’re not sure which variables are actually sent, you can use the Live HTTP Headers plugin for Firefox.
Usage is simple. Open the login page and fill it in. Then start the plugin (Tools -> Live HTTP headers). Submit the form and you’ll see a number of requests listed. You actually need only the first one: this is your login request. See the screenshot:

[Screenshot: HTTP headers]


And now the code:

    function authenticate()
    {
        $post_info = array(
            'wpName' => GR_USERNAME,
            'wpPassword' => GR_PASSWORD,
            'wpLoginAttempt' => 'Log in');
        //http_build_query() URL-encodes every value (the space becomes "+")
        //and joins the name=value pairs with "&"
        $postData = http_build_query($post_info, '', '&');
        $result = requestContent(GR_LOGIN_URL, $postData, GR_LOGIN_URL);
        //here you may check if you were logged in successfully
        return $result;
    }
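
The field names in authenticate() are hard-coded from reading the form by hand. If the login form carries hidden fields (a token, for instance), they could also be collected from the login page's HTML automatically. Here is a rough sketch, assuming you have already fetched that HTML; extractHiddenFields() is a hypothetical helper, the form markup is made up for illustration, and the regular expressions only handle quoted attribute values.

```php
<?php
/**
 * Hypothetical helper: collects name => value pairs of all
 * <input type="hidden"> fields found in a chunk of HTML.
 * A regex-based sketch; a real HTML parser is more robust.
 */
function extractHiddenFields($html)
{
    $fields = array();
    // Find every <input ...> tag
    preg_match_all('/<input\b[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        // Skip inputs that are not hidden
        if (!preg_match('/\stype\s*=\s*["\']hidden["\']/i', $tag)) {
            continue;
        }
        // Attributes may come in any order, so match them separately
        if (!preg_match('/\sname\s*=\s*["\']([^"\']*)["\']/i', $tag, $name)) {
            continue;
        }
        $value = '';
        if (preg_match('/\svalue\s*=\s*["\']([^"\']*)["\']/i', $tag, $m)) {
            $value = html_entity_decode($m[1], ENT_QUOTES);
        }
        $fields[$name[1]] = $value;
    }
    return $fields;
}

// Made-up form markup, for illustration only
$html = '<form action="/login" method="post">'
      . '<input type="text" name="wpName">'
      . '<input name="wpLoginToken" type="hidden" value="abc123">'
      . '<input type="hidden" name="wpRemember" value="1">'
      . '</form>';

// Hidden fields first, then the credentials we fill in ourselves
$postInfo = extractHiddenFields($html) + array(
    'wpName'     => 'someuser',
    'wpPassword' => 'secret',
);
$postData = http_build_query($postInfo, '', '&');
echo $postData;
```

The resulting string can be passed as the $postData argument of requestContent(), just like the one built in authenticate().
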

After getting the result, you may parse it to see whether the login succeeded. Once this request has been performed and the response received, you also have the session ID stored in a cookie. You may open the cookie file and take a look yourself. After this you can send any requests to the service, and it will respond as if you were a logged-in user. The trick is that cURL sends the server the cookies from the cookie file, and that file holds the session ID or whatever else the site uses to identify the user.
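
If you would rather inspect the cookie file from code than open it by hand, something like the following might do. cURL writes the file in the tab-separated Netscape cookie format; parseCookieJar() is a hypothetical helper, and the sample contents (including the cookie names) are made up.

```php
<?php
/**
 * Hypothetical helper: reads cookie names and values from the contents of a
 * cookie file in the Netscape format that cURL writes.
 */
function parseCookieJar($contents)
{
    $cookies = array();
    foreach (preg_split('/\r\n|\n|\r/', $contents) as $line) {
        // cURL marks HttpOnly cookies with this prefix; strip it
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        }
        // Skip blank lines and comments
        if (trim($line) === '' || $line[0] === '#') {
            continue;
        }
        // Fields: domain, subdomain flag, path, secure, expiry, name, value
        $parts = explode("\t", $line);
        if (count($parts) === 7) {
            $cookies[$parts[5]] = $parts[6];
        }
    }
    return $cookies;
}

// Made-up sample of what cookies.txt might contain after logging in
$sample = "# Netscape HTTP Cookie File\n"
        . "\n"
        . "#HttpOnly_en.wikipedia.org\tFALSE\t/\tFALSE\t0\texample_session\tabc123\n"
        . "en.wikipedia.org\tFALSE\t/\tFALSE\t1893456000\texampleUserName\tKJedi\n";

print_r(parseCookieJar($sample));
```

With the real file you would call parseCookieJar(file_get_contents(GR_COOKIE_FILE)) after authenticate() has run; an empty result is a hint that the login did not set any cookies.
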
So my getPage() function is a simple wrapper around requestContent():

    function getPage()
    {
        return requestContent('http://en.wikipedia.org/');
    }

Basically, that’s all you need in order to get into the protected area. Now you’re free to do anything you want. Regular expressions are the most common way to extract data from a page, so you may want a tool for testing your regexps. I really recommend “The Regex Coach”; it greatly simplifies debugging :)
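
As a small illustration of the regular-expression approach, here is a sketch that pulls the page title out of fetched HTML. extractTitle() is a hypothetical helper and the HTML string is made up; in practice you would feed it the result of getPage().

```php
<?php
/**
 * Hypothetical helper: pulls the text of the <title> tag out of a page.
 * Returns null when the page has no title.
 */
function extractTitle($html)
{
    // "s" lets "." match newlines, "i" ignores case, ".*?" is non-greedy
    if (preg_match('/<title[^>]*>(.*?)<\/title>/si', $html, $matches)) {
        return trim(html_entity_decode($matches[1], ENT_QUOTES));
    }
    return null;
}

// Made-up page, for illustration only
$html = "<html><head><title>\n Main Page &amp; News </title></head><body></body></html>";
echo extractTitle($html); // prints: Main Page & News
```

Regular expressions like this are fine for grabbing a field or two, but they break easily when the markup changes, so keep the patterns as narrow as you can.
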

HTTP authentication

Another approach to authentication, sometimes used in various control panels, is HTTP authentication. You have probably seen it: you’re presented with a standard browser dialog prompting for a username and password. The username and password are sent with each request. You are not prompted for them every time only because the browser remembers them. To access such an area, you don’t need any authenticate() function. You just pass the login and password with each request.
Here are the modifications you should make to the above code:

    function requestContentHttpAuth($url, $postData = array(), $referer = '', $headers = array())
    {
        $ch = curl_init();//creating cURL instance
        curl_setopt($ch, CURLOPT_URL, $url);//setting our URL
        curl_setopt($ch, CURLOPT_HEADER, 0);//don't include response headers in the result
        curl_setopt($ch, CURLOPT_USERPWD, GR_USERNAME.':'.GR_PASSWORD);//<-- here is the change: you pass your auth data with every request
        if(GR_USER_AGENT) curl_setopt($ch, CURLOPT_USERAGENT, GR_USER_AGENT);//setting user agent
        if($postData)//if we have post data
        {
            curl_setopt($ch, CURLOPT_POST, 1);//tell cURL we'll be using POST
            curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);//set post data
        }
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);//return the response instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);//force cURL to follow any redirects
        if($referer) curl_setopt($ch, CURLOPT_REFERER, $referer);//if there is a referer specified, set it
        if(GR_COOKIE_FILE)//if there is a cookie file
        {
            //the next 2 options are needed in order to use cookies correctly everywhere
            curl_setopt($ch, CURLOPT_COOKIEFILE, GR_COOKIE_FILE);
            curl_setopt($ch, CURLOPT_COOKIEJAR, GR_COOKIE_FILE);
        }
        if ($headers)//if there are additional headers
        {
            curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);//set them
        }
        $result = curl_exec($ch);//execute cURL query (returns false on failure)
        curl_close($ch);//close cURL
        return $result;//return cURL result
    }
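
To see what actually travels over the wire: with cURL's default settings, CURLOPT_USERPWD results in HTTP Basic authentication, which is just one extra request header. The sketch below builds that header by hand; basicAuthHeader() is a hypothetical helper, and the credentials are the well-known example pair from the HTTP specification.

```php
<?php
/**
 * Hypothetical helper: builds the header that HTTP Basic authentication
 * sends with every request.
 */
function basicAuthHeader($username, $password)
{
    // Credentials are joined with a colon and Base64-encoded (not encrypted!)
    return 'Authorization: Basic ' . base64_encode($username . ':' . $password);
}

echo basicAuthHeader('Aladdin', 'open sesame');
// prints: Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Since requestContent() accepts extra headers, passing array(basicAuthHeader(GR_USERNAME, GR_PASSWORD)) as its fourth argument should have much the same effect as the CURLOPT_USERPWD line. Base64 is trivially reversible, so this kind of authentication is only safe over HTTPS.
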

That’s all I wanted to share with you today. Hope you find it helpful! If you have any questions or comments, feel free to add them!

Quick Note on OOP

I didn’t create a Curl class or present the solution as a class because I don’t think it would make my explanations clearer or the code easier to use. This is not a complete solution, just a technique, and I tried to present the code in such a way that you can use it with minimal changes. You can easily move these functions into a class :)

Further reading

Download code for the post


Comments

  1. Onelableva says:

    Hi, courteous posts there :-) thank’s concerning the compelling advice

  2. Konstantin Mirin says:

    Thanks :) I hope this will be really helpful for my readers.

  3. soxBeerry says:

    Hi, Congratulations to the site owner for this marvelous work you’ve done. It has lots of useful and interesting data.

  4. Rings says:

    Thank you very much for that great article

  5. babafisa says:

    There’s a lot of information here. I’ll be back again.

  6. Crasty says:

    Are you a professional journalist? You write very well.

  7. Konstantin Mirin says:

    Thanks. I’m working for you :)

  8. Konstantin Mirin says:

    You are welcome. Subscribe to the RSS feed so you don’t miss new posts.

  9. Konstantin Mirin says:

    Thanks. Have you subscribed to RSS?

  10. Konstantin Mirin says:

    Thanks for such compliment :) No, I am a professional developer and stand far from journalism. Furthermore, English is not my native language :)
    But I do my best.

  11. 8 Firefox plugins I use every day | Programmer's Notes says:

    [...] go here and there while you’re requesting the page. It’s a great thing when you need to grab some content from the password-protected area or inspect what’s going on when you request [...]

  12. ScallioXTX says:

    Works like a charm, I only added curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); to prevent the script from hanging indefinitely in case the connection to the server could not be established.

  13. DREE says:

    Hi,
    I tried to learn these things, but I am a beginner at all this. I wanted to ask you to help me log in to a specific site using cURL and PHP.

    Please let me know if you have some time to help me.

    Regards,
    D
