
How to Use PHP's DOMDocument to Scrape a Web Page

Posted by Dave Rogers // February 1, 2013 // in Blog

I've been working on an SEO addon for concrete5. The issue I'm trying to solve right now is how to strip all irrelevant tags and content from the HTML and return just the text, a.k.a. web page scraping.

Bang head on desk.

Many Different Paths

First, I tried PHP's strip_tags function. Um, no. It removes the tags but keeps whatever was inside them, script and style contents included, so that just doesn't work well.
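
To make that concrete, here is a minimal sketch of the problem. The sample markup is made up for illustration:

```php
<?php
// Hypothetical sample markup, not taken from any real page.
$html = '<p>One</p><p>Two</p><script>var x = 1;</script>';

// strip_tags() removes the tags themselves but keeps everything between
// them, so the script body survives and the two paragraphs run together.
$text = strip_tags($html);

echo $text, PHP_EOL;
```

The JavaScript source ends up mixed into the text, and nothing separates the words of neighbouring elements.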

Next, I tried regular expressions. They were really clumsy, long-winded and not 100% effective.
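
As a rough illustration of why a regex falls short, consider the simplest possible tag-stripping pattern. Both the pattern and the sample markup are assumptions for this sketch, not the expressions I actually used:

```php
<?php
// Hypothetical sample markup: a ">" inside a quoted attribute value.
$html = '<a title="a > b">link</a>';

// A naive "anything between < and >" pattern stops at the first ">",
// even when that character sits inside an attribute value.
$text = preg_replace('/<[^>]*>/', '', $html);

var_dump($text);
```

Part of the attribute leaks into the result. Patching cases like this one by one is what makes the regex route so long-winded.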

Then I tried PHP's Document Object Model classes. They seemed magical.

PHP's Document Object Model Classes

The DOM (Document Object Model) classes allow you to:

  • Parse HTML and XML documents;
  • Traverse the DOM of those documents;
  • Add and remove nodes within the DOM;
  • Query the DOM using XPath.

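
The four capabilities above can be sketched in a few lines. This is only an illustration on a made-up document, not part of the scraper itself:

```php
<?php
// Hypothetical sample document for illustration.
$html = '<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>';

// Parse: load the markup into a DOMDocument.
$dom = new DOMDocument();
$dom->loadHTML($html);

// Traverse: walk the children of <body>.
$body = $dom->getElementsByTagName('body')->item(0);
$names = array();
foreach ($body->childNodes as $child) {
    $names[] = $child->nodeName;
}

// Add and remove: append a new <p>, then drop the <h1>.
$body->appendChild($dom->createElement('p', 'Third'));
$h1 = $dom->getElementsByTagName('h1')->item(0);
$h1->parentNode->removeChild($h1);

// Query: collect the text of every <p> with XPath.
$xPath = new DOMXPath($dom);
$paragraphs = array();
foreach ($xPath->query('//p') as $p) {
    $paragraphs[] = $p->textContent;
}

print_r($names);
print_r($paragraphs);
```
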
So after some time working with the DOM classes, I created this function to scrape the text from a real HTML document:

    function getTextFromHTML($html = '') {
        // An array of words that should be removed
        // from the resultant text
        $stopWords = array(' ');

        // Initially remove the script tags using regex (there were some
        // issues if I didn't do this)
        $html = preg_replace('/<script.*?script>/is', '', $html);

        // Nothing left to parse: loadHTML() rejects an empty string
        if (trim((string) $html) === '') {
            return '';
        }

        // Load the $html into a DOMDocument object, collecting parser
        // warnings from real-world markup instead of printing them
        $dom = new DOMDocument();
        $dom->preserveWhiteSpace = false;
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML(strtolower($html));
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        // Strip out any remaining scripts and the style blocks. The node
        // lists are live, so copy them to arrays before removing anything
        foreach (array('script', 'style') as $tagName) {
            $nodes = iterator_to_array($dom->getElementsByTagName($tagName));
            foreach ($nodes as $node) {
                $node->parentNode->removeChild($node);
            }
        }

        // Go through the resultant $html and get all text nodes
        $xPath = new DOMXPath($dom);
        $textNodes = $xPath->evaluate('//text()');
        $text = "";
        foreach ($textNodes as $textNode) {
            // Do some magic on the gathered text
            $nodeValue = strtolower($textNode->nodeValue);
            $nodeValue = str_replace($stopWords, ' ', $nodeValue);
            $nodeValue = preg_replace("/[.:()\/\$\'\#]/", ' ', $nodeValue);
            // Hyphen goes last so it is a literal, not a character range
            $nodeValue = preg_replace('/[^a-z0-9 ._-]/', '', $nodeValue);
            $nodeValue = trim($nodeValue);
            // Compare against '' so a node containing just "0" is kept
            if ($nodeValue !== '') {
                $text .= $nodeValue." ";
            }
        }
        return $text;
    }
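
One thing to watch when removing nodes in a loop: getElementsByTagName() returns a live list, so removing nodes while iterating over it can skip some of them. A minimal sketch of one workaround, using a made-up sample document, is to copy the list into an array first:

```php
<?php
// Hypothetical sample document with three adjacent script tags.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><script>a</script><script>b</script><script>c</script><p>Text</p></body></html>');

// Copy the live list into a plain array, then remove nodes from the copy.
// The array does not shrink as nodes leave the document.
$scripts = iterator_to_array($dom->getElementsByTagName('script'));
foreach ($scripts as $script) {
    $script->parentNode->removeChild($script);
}

// Count what is left in the document after the loop.
$remaining = $dom->getElementsByTagName('script')->length;
echo $remaining, PHP_EOL;
```
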

I'm almost sure there is a better way to do this, so if you have any suggestions, let me know in the comments.
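
One possible simplification, offered as a sketch rather than a tested replacement: let XPath skip the text inside script and style elements, so nothing has to be removed from the DOM first. The function name getVisibleText is hypothetical, and the character clean-up from the function above is left out to keep the example short:

```php
<?php
// Hypothetical helper name; the XPath filter replaces the removal loops.
function getVisibleText($html)
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every text node that has no <script> or <style> ancestor.
    $xPath = new DOMXPath($dom);
    $nodes = $xPath->query('//text()[not(ancestor::script or ancestor::style)]');

    $parts = array();
    foreach ($nodes as $node) {
        $value = trim($node->nodeValue);
        if ($value !== '') {
            $parts[] = $value;
        }
    }
    return implode(' ', $parts);
}

// Made-up sample page for illustration.
$sample = '<html><head><style>p{color:red}</style></head>'
        . '<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>';
echo getVisibleText($sample), PHP_EOL;
```
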

About the Author

Dave Rogers

Dave is the founder of Sky Blue Sofa Web Design. He enjoys working out, spending time with his wife and dogs and programming. He grew up and currently resides in the Illinois Quad Cities. You can find his personal blog at strength/reliance.com.
