PHP Page Scraper - Keyword Harvester

March 23rd, 2009

Working with PHP, I’ve had to find creative ways to scrape data off other websites on more than one occasion. So I’ve decided to add a tutorial on how to create web scrapers using DOM and XPath in PHP.

A long time ago, before PHP supported DOM, I used to do the same with regular expressions. It was doable and, since it was the only method, a must if you wanted to do any kind of web scraping. There was never any reason to like or dislike the technique: it was very effective, and in my Perl days it’s how we used to do it.

When PHP 5 came out with DOM and XPath built in, I gave it a go. I have to say, using regular expressions seemed like starting a fire with two sticks compared to the ease of the XPath object and its methods.

With regular expressions, there was some trial and error in finding a unique string to mark the beginning and end of the area I wanted to capture. Then it was a matter of stripping out the HTML to get to the data I wanted.

But DOM and XPath take a lot of the guessing game out of web scraping because they were designed for handling documents and querying for data within them. Basically, they treat a web page like an XML document that you can query for nodes and attributes, which allows the developer to focus on the data and not the matching expressions.
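To make "query for nodes and attributes" concrete, here is a minimal sketch. The markup is a made-up sample rather than a real page, and the variable names are only illustrative.

```php
<?php
// Hypothetical sample markup; not taken from any real site.
$html = '<html><body>'
      . '<table class="bluetable"><tr><td><a href="/search?q=php+dom">php dom</a></td></tr></table>'
      . '<p><a href="/about">About</a></p>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Query for nodes: every anchor that sits inside a table.
$links = $xpath->query('//table//a');

// Read the text content and an attribute from the first match.
$linkText = $links->item(0)->nodeValue;
$linkHref = $links->item(0)->getAttribute('href');

echo $linkText . ' -> ' . $linkHref . PHP_EOL;
```

The anchor in the paragraph is skipped because the expression only asks for anchors under a table. No pattern for "where the link starts and ends" is needed.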

Ok, let’s get started.

The tools you will need are:

  1. The cURL extension. Most hosting companies have this preinstalled, so there shouldn’t be a problem. Run phpinfo() and make sure the cURL extension is installed on your server.
  2. The DOM extension. Check the same phpinfo() output to make sure it is installed.

That’s all you need to get started.
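If you would rather check from a script than read through the phpinfo() output, something like the following should work. It is only a sketch; the $required and $missing names are my own.

```php
<?php
// Report whether each extension this tutorial relies on is loaded.
$required = array('curl', 'dom');
$missing = array();
foreach ($required as $extension) {
    if (!extension_loaded($extension)) {
        $missing[] = $extension;
    }
}

echo $missing
    ? 'Missing extensions: ' . implode(', ', $missing) . PHP_EOL
    : 'cURL and DOM are both available.' . PHP_EOL;
```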

First of all, I have to note that cURL is not necessary for web scraping with DOM, because you can feed DOM the web page directly. However, I like using it because cURL lets you send headers and, in advanced situations, provide login authentication in order to scrape data behind some security, as on Facebook.
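For reference, feeding DOM the page directly could look roughly like this. The sketch loads a temporary local file so it runs offline; as far as I know, loadHTMLFile() also accepts a URL when allow_url_fopen is enabled, but you lose the control over headers that cURL gives you.

```php
<?php
// Write a small sample page to a temporary file so the sketch runs offline.
// With allow_url_fopen enabled, a URL could be passed to loadHTMLFile() instead.
$path = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($path, '<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>');

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$loaded = $dom->loadHTMLFile($path);
libxml_clear_errors();

$title = $dom->getElementsByTagName('title')->item(0)->nodeValue;
echo $title . PHP_EOL;

unlink($path);
```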

So the first thing we do is set the useragent header.

  $userAgent = "IE 7 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)";

Now decide which website to scrape. I’ve selected the GigaBlast search engine and will scrape the keywords it displays when you do a search. I like this example because this simple function takes a keyword and returns 5-10 related keywords, which you can harvest and store for future use in SEO or in a keyword suggestion tool.

  $target_url = "http://www.gigablast.com/search?k5d=14907&s=0&q=" . urlencode($keyword);
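Keywords with spaces or an ampersand need to be encoded before they go into the query string. One way to handle that is to let http_build_query() assemble the parameters. The build_search_url() helper below is a hypothetical name of mine, and the k5d and s values are simply copied from the URL above.

```php
<?php
// Hypothetical helper: builds the GigaBlast search URL for a keyword.
function build_search_url($keyword)
{
    $params = array(
        'k5d' => 14907,   // value copied from the tutorial's URL; its meaning is not documented here
        's'   => 0,
        'q'   => $keyword,
    );
    // http_build_query() URL-encodes each value for us.
    return 'http://www.gigablast.com/search?' . http_build_query($params, '', '&');
}

echo build_search_url('web scraping & php') . PHP_EOL;
// prints http://www.gigablast.com/search?k5d=14907&s=0&q=web+scraping+%26+php
```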

Now the setup for cURL is pretty straightforward.

 $ch = curl_init();
 curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
 curl_setopt($ch, CURLOPT_URL, $target_url);
 curl_setopt($ch, CURLOPT_FAILONERROR, true);
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
 curl_setopt($ch, CURLOPT_AUTOREFERER, true);
 curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
 curl_setopt($ch, CURLOPT_TIMEOUT, 10);
 $html = curl_exec($ch);

A short error handler for the cURL call:

 if (!$html) {
  echo "cURL error number: " . curl_errno($ch) . "\n";
  echo "cURL error: " . curl_error($ch) . "\n";
  exit;
 }
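If you want to reuse the fetch step, the setup and the error check can be wrapped in one function that throws instead of calling exit. This is only a sketch: fetch_html() is a hypothetical name, and the made-up URL scheme in the usage part is there purely to trigger a failure without touching the network.

```php
<?php
// Hypothetical wrapper around the cURL calls used in this tutorial.
function fetch_html($url, $userAgent)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_FAILONERROR    => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_AUTOREFERER    => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
    ));

    $html = curl_exec($ch);
    if ($html === false) {
        // Report the failure to the caller instead of calling exit.
        throw new RuntimeException(
            'cURL error ' . curl_errno($ch) . ': ' . curl_error($ch)
        );
    }
    return $html;
}

try {
    // A made-up scheme should fail right away, without any network access.
    fetch_html('nosuchscheme://example.invalid/', 'Mozilla/4.0 (compatible)');
    $message = 'no error';
} catch (RuntimeException $e) {
    $message = $e->getMessage();
}
echo $message . PHP_EOL;
```

Throwing an exception leaves it to the calling code to decide whether to retry, log, or stop.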

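Now we get to the meat. We create the DOM object and assign it to the $dom variable. The @ in front of the loadHTML() call suppresses the warnings the parser raises on malformed markup, which real-world pages are full of.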

 $dom = new DOMDocument();
 @$dom->loadHTML($html);
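As an alternative to the @ operator, libxml can buffer the parser warnings so you can inspect or discard them. The snippet below is a sketch of that approach, not a required change. The sample markup is deliberately sloppy and made up.

```php
<?php
// Deliberately sloppy sample markup (unclosed tags), standing in for a real page.
$html = '<html><body><table class="bluetable"><tr><td><a href="/a">alpha</a></table></body>';

$previous = libxml_use_internal_errors(true);  // buffer warnings instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML($html);
$warnings = libxml_get_errors();               // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous);         // restore the earlier setting

$anchorCount = $dom->getElementsByTagName('a')->length;
echo count($warnings) . ' parser warning(s), ' . $anchorCount . ' anchor(s) found' . PHP_EOL;
```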

Now create the XPath object and assign it to the variable $xpath.

 $xpath = new DOMXPath( $dom );

Now you’re ready to start the scraping process. First we query for the table, using a matching pattern that references its class attribute. We do this by calling the query method with the pattern "//table[@class='bluetable']".

"//table" tells the XPath query to look only at tables. "[@class='bluetable']" is the distinguishing feature: in this case, only the table with the keywords uses this class.

 $key = $xpath->query( "//table[@class='bluetable']" );
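One thing to keep in mind is that [@class='bluetable'] compares the whole attribute value, so a table with class="bluetable wide" would not match. The sketch below, on made-up markup, contrasts the exact match with a common XPath 1.0 idiom for matching a single class among several.

```php
<?php
// Hypothetical markup: three tables, one of which carries a second class name.
$html = '<html><body>'
      . '<table class="bluetable"><tr><td>one</td></tr></table>'
      . '<table class="bluetable wide"><tr><td>two</td></tr></table>'
      . '<table class="other"><tr><td>three</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact match: the class attribute must be exactly "bluetable".
$exact = $xpath->query("//table[@class='bluetable']");

// Token match: "bluetable" may be one of several space-separated classes.
$token = $xpath->query(
    "//table[contains(concat(' ', normalize-space(@class), ' '), ' bluetable ')]"
);

echo $exact->length . ' exact, ' . $token->length . ' token' . PHP_EOL;
// prints "1 exact, 2 token"
```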

Next, I create an empty array where I will be adding the keywords I find.

 $return = array();

We now have everything we need, so all that’s left is to look at the data and extract what we want.

The variable $key has everything in it that we need. To go through the data and get the keywords, I loop through it and create another DOM object for each item, so the sub-HTML is treated as a new document.

 foreach ( $key as $item ) {
  $newDom = new DOMDocument;
  $newDom->appendChild($newDom->importNode($item, true));

  $xpath = new DOMXPath( $newDom );

  // Look for anchor tags only. $a->length tells us how many there are.
  $a = $xpath->evaluate("//a");

  // Loop through each anchor tag and get the keywords and, if we want, the href value.
  for ($x = 0; $x < $a->length; $x++) {
   $ahref = $a->item($x);
   $keywords = trim($ahref->nodeValue);
   $url = $ahref->getAttribute('href');
   $return[$x]['keywords'] = $keywords; // keyword
   $return[$x]['url'] = $url;           // URL
  }

 }

Then just return the variable $return.
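Put together, the parsing half could be wrapped in a single function along these lines. Treat it as a sketch rather than the exact script from this post: extract_keywords() is a hypothetical name, the sample markup is made up, and instead of building a second DOMDocument it passes each table to query() as a context node, which is a swapped-in variation on the loop above.

```php
<?php
// Hypothetical helper; the name and return shape are illustrative only.
function extract_keywords($html)
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $return = array();

    foreach ($xpath->query("//table[@class='bluetable']") as $table) {
        // ".//a" is relative to $table, so only anchors inside it match.
        foreach ($xpath->query('.//a', $table) as $anchor) {
            $return[] = array(
                'keywords' => trim($anchor->nodeValue),
                'url'      => $anchor->getAttribute('href'),
            );
        }
    }
    return $return;
}

// Sample markup standing in for a fetched results page.
$sample = '<html><body>'
        . '<a href="/home">Home</a>'
        . '<table class="bluetable"><tr>'
        . '<td><a href="/search?q=php+dom"> php dom </a></td>'
        . '<td><a href="/search?q=xpath">xpath</a></td>'
        . '</tr></table></body></html>';

print_r(extract_keywords($sample));
```

Appending with $return[] also keeps results from a second matching table from overwriting the first, should a page ever contain more than one.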

All done. Now you have an array of all the keywords, which you can store in a database or output for additional keyword lookups.

admin PHP How Tos, PHP Scripts, PHP Tips & Tricks

  1. Joe
    March 24th, 2009 at 02:44 | #1

    Great script, thanks for posting. How can I get the full function and maybe some help?

  2. Emily
    June 16th, 2009 at 22:11 | #2

    Amazing post, it helped me a lot :)

    One thing though; I think you have a typo in the last section of your script. Replacing the “$i” with the “$x” in the line “$ahref = $a->item($i);” made the script function correctly for me.

  3. June 17th, 2009 at 04:06 | #3

    Thanks. Good catch. I’ve updated and fixed that error.

  4. mcllain
    August 15th, 2009 at 13:40 | #4

    working on a site!

    Could not find the download link for this script.

    Please advise?

  5. Bob
    August 18th, 2009 at 07:23 | #5

    I am new to XPath. Is there any setting to set UTF? The strings I am getting have some weird characters in them (dunno where it gets them from, as the link names are basically ASCII on the web page that I am trying to scrape).

  6. November 23rd, 2009 at 06:17 | #6

    Hi…

    I’d like to develop a Google-like search engine, in which a user inputs a keyword or phrase and it finds all the websites or pages on the web that contain that keyword.
    Can someone help with that? It can be a PHP script.

  7. November 23rd, 2009 at 06:29 | #7

    hi, again

    I’d like to crawl multiple web pages to get the URL, title and description of each one. Can someone help me with that, using a PHP script?
