Web scraping with Python (part II)

The first part of this article dealt with retrieving HTML pages from the web with the help of a mechanize-propelled web crawler. Now your HTML pieces are safely saved locally on your hard drive and you want to extract structured data from them. This is part 2, HTML parsing with Python. For this task, I adopted a slightly more imaginative approach than for my crawling hacks. I designed a data extraction technology based on HTML templates. Maybe this could be called « reverse-templating » (or something like template-based reverse-web-engineering).

You may be familiar with HTML templates for producing HTML pages. An HTML template plus structured data can be transformed into a set of HTML pages with the help of a proper templating engine. One well-known technology for HTML templating is Zope Page Templates (so called because this kind of template is used within the Zope application server). ZPTs use a special set of additional HTML tags and attributes referred to by the "tal:" namespace. One advantage of ZPT over competing technologies is that ZPTs render nicely in WYSIWYG HTML editors. Web designers thus produce HTML mockups of the screens to be generated by the application, and web developers insert tal: attributes into these mockups so that the templating engine knows which parts of the HTML template have to be replaced by which pieces of data (usually pumped from a database). As an example, web designers write <title>Camcorder XYZ</title>, web developers modify this into <title tal:content="camcorder_name">Camcorder XYZ</title>, and the templating engine produces <title>Camcorder Canon MV6iMC</title> when it processes the "MV6iMC" record in your database (it replaces the content of the title element with the value of the camcorder_name variable retrieved from the current database record). This technology is used to merge structured data with HTML templates in order to produce web pages.
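
To make the forward direction concrete, here is a minimal sketch of the "template plus data gives a page" idea, reduced to plain Python string substitution. This is obviously not ZPT itself, just an illustration of the principle:

from string import Template

# A trivial stand-in for a templating engine: one template plus one data
# record yields one rendered page. ZPT does the same job with tal:
# attributes instead of $placeholders.
page_template = Template("<title>$camcorder_name</title>")
record = {"camcorder_name": "Camcorder Canon MV6iMC"}
print(page_template.substitute(record))
# prints: <title>Camcorder Canon MV6iMC</title>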

I took inspiration from this technology to design parsing templates. The idea is to reverse the use of HTML templates. In the parsing context, HTML templates are still produced by web developers, but the templating engine is replaced by a parsing engine (known as web_parser.py, see below for the code of this engine). This engine takes HTML pages (the ones you previously crawled and retrieved) plus ZPT-like HTML templates as input, and outputs structured data. First your crawler saved <title>Camcorder Canon MV6iMC</title>. Then you wrote <title tal:content="camcorder_name">Camcorder XYZ</title> into a template file. Eventually the engine outputs camcorder_name = "Camcorder Canon MV6iMC".
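
For readers who prefer code to prose, here is a minimal, self-contained sketch of that reverse direction. It is not the engine's actual code: it uses the modern bs4 package instead of the modified BeautifulSoup shipped with web_parser.py, it only handles the pseudotal:content case, and it locates elements by tag name rather than by the paths, anchors and conditions the real engine computes.

from bs4 import BeautifulSoup

TEMPLATE = '<html><head><title pseudotal:content="camcorder_name">Camcorder XYZ</title></head></html>'
PAGE = '<html><head><title>Camcorder Canon MV6iMC</title></head></html>'

def extract_variables(template_html, page_html):
    # Walk the template; for every element carrying a pseudotal:content
    # attribute, read the text of the corresponding element in the real page.
    template = BeautifulSoup(template_html, "html.parser")
    page = BeautifulSoup(page_html, "html.parser")
    data = {}
    for element in template.find_all(attrs={"pseudotal:content": True}):
        variable = element["pseudotal:content"]
        match = page.find(element.name)  # the real engine uses paths/anchors/conditions
        if match is not None:
            data[variable] = match.get_text()
    return data

print(extract_variables(TEMPLATE, PAGE))
# prints: {'camcorder_name': 'Camcorder Canon MV6iMC'}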

In order to trigger the engine, you just have to write a small launch script that defines several setup variables (a sketch of such a script follows the list), such as:

  • the URL of your template file,
  • the list of URLs of the HTML files to be parsed,
  • whether or not you would like to pre-process these files with an HTML tidying library (this is useful when the engine complains about badly formed HTML),
  • an arbitrary keyword defining the domain of your parsing operation (it may be the name of the web site your HTML files come from),
  • the charset these HTML files are encoded with (no automatic detection at the moment, sorry…),
  • the output format (CSV-like file or Semantic Web document),
  • an optional separator character or string if you choose the CSV-like output format.
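
Here is a hypothetical sketch of what such a launch script might look like. Every variable name below is a placeholder, not the engine's actual API; the authoritative reference is parser_dvspot.py in the ZIP distribution.

# Hypothetical launch script: all names are placeholders; copy and adapt
# parser_dvspot.py for the real setup variables and the real call sequence.
TEMPLATE_URL = "file:///home/me/templates/template_dvspot.zpt"
PAGE_URLS = [
    "file:///home/me/pages/camcorder_001.html",
    "file:///home/me/pages/camcorder_002.html",
]
USE_TIDY = True          # pre-process pages with the HTML tidying library
DOMAIN = "dvspot"        # arbitrary keyword naming the parsing domain / source site
CHARSET = "iso-8859-1"   # no automatic charset detection at the moment
OUTPUT_FORMAT = "csv"    # "csv" for a CSV-like file, "owl" for a Semantic Web document
SEPARATOR = "\t"         # only meaningful for the CSV-like output

# web_parser.py is then imported and driven with these settings; see
# parser_dvspot.py for how the engine is actually invoked.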

The easiest way to go is to copy and modify my example launch script (parser_dvspot.py) included in the ZIP distribution of this web_parser.

Let's summarize the main steps to go through:

  1. install utidylib into your python installation
  2. copy and save my modified version of BeautifulSoup into your python libraries directory (usually …/Lib/site-packages)
  3. copy and save my engine (web_parser.py) into your local directory or into your python libraries directory
  4. choose a set of HTML files on your hard drive or directly on a web site,
  5. save one of these files as your template,
  6. edit this template file and insert the required pseudotal attributes (see below for the pseudotal instructions, and see the example dvspot template, template_dvspot.zpt),
  7. copy and edit my example launch script so that you define the proper setup variables in it (the example parser_dvspot.py contains more detailed instructions than the above), and save it as my_script.py,
  8. launch your script with python my_script.py > output_file.owl (or python my_script.py > output_file.csv),
  9. enjoy yourself and your fresh output_file.owl or output_file.csv (import it within Excel)
  10. give me some feedback about your reverse-templating experience (preferably as a comment on this blog)

This is just my first attempt at building such an engine, and I don't want to create confusion between real (and mature) tal attributes and my pseudo-tal instructions, so I adopted pseudotal as my main namespace. At some point in the future, when the specification of these reverse-templating instructions is somewhat more stabilized (and if ever the "tal" guys agree), I might adopt tal as the namespace. Please also note that the engine is somewhat badly written: the code and internals are rather clumsy. There is much room for future improvement and refactoring.

The current version of this reverse-templating engine supports the following template attributes/instructions (see the source code for further updates and documentation); a hypothetical worked example follows the list:

  • pseudotal:content gives the name of the variable that will contain the content of the current HTML element
  • pseudotal:replace gives the name of the variable that will contain the entire current HTML element
  • (NOT SUPPORTED YET) pseudotal:attrs gives the name of the variable that will contain the (specified?) attribute(s?) of the current HTML element
  • pseudotal:condition gives the condition(s) that must be verified for the parser to be sure that the current HTML element is the one being looked for; the condition is constructed as a list modelled on BeautifulSoup fetch arguments: a Python dictionary giving detailed conditions on the HTML attributes of the current HTML element, some content to be found in the current HTML element, and the search scope for the current HTML element (recursive search or not)
  • pseudotal:from_anchor gives the name of the pseudotal:anchor that is used to build the relative path leading to the current HTML element; when no from_anchor is specified, the path used to locate the current HTML element is calculated from the root of the HTML file
  • pseudotal:anchor specifies a name for the current HTML element; this element can then be used by a pseudotal:from_anchor attribute as the starting point for building the path to the element that carries the pseudotal:from_anchor; usually used in conjunction with a pseudotal:condition; the default anchor is the root of the HTML file
  • pseudotal:option describes some optional behaviors of the HTML parser; it is a list of constants: it contains NOTMANDATORY if the parser should not raise an error when the current element is not found (it does by default), and FULL_CONTENT when the data being looked for is the whole content of the current HTML element (the default is the last part of the content of the current HTML element, i.e. either the last HTML tags or the last string included in the current element)
  • pseudotal:is_id_part: a special 'id' variable is automatically built for every parsed resource; this id variable is made of several concatenated parts, and pseudotal:is_id_part gives the index at which the current variable is used when building the id of the current resource; usually used in conjunction with pseudotal:content, pseudotal:replace or pseudotal:attrs
  • (NOT SUPPORTED YET) pseudotal:repeat specifies the scope of the HTML tree that describes ONE resource (useful when several resources are described in one HTML file, such as in a list of items); the value of this attribute gives the name of a class that will instantiate each parsed resource scope, plus the name of a list containing all the parsed resources
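
To make the pseudotal:anchor, pseudotal:from_anchor and pseudotal:condition instructions above more concrete, here is a hypothetical template fragment together with the kind of BeautifulSoup fetch the condition roughly maps to. The markup and attribute values are invented for the sake of the example; check template_dvspot.zpt for the exact syntax the engine expects.

# Hypothetical template fragment (held in a Python string here for convenience):
# anchor on the cell labelled "Model name", then extract the value cell
# relative to that anchor.
TEMPLATE_FRAGMENT = """
<td class="specname" pseudotal:anchor="model_row"
    pseudotal:condition="[{'class': 'specname'}, 'Model name', True]">Model name</td>
<td class="specvalue" pseudotal:from_anchor="model_row"
    pseudotal:content="camcorder_name">Camcorder XYZ</td>
"""

# A condition list such as [{'class': 'specname'}, 'Model name', True] mirrors
# the arguments of a BeautifulSoup fetch: an attribute dictionary, some content
# to be found in the element, and the search scope (recursive or not); roughly
# the equivalent of:
#
#   soup.find("td", attrs={"class": "specname"}, text="Model name", recursive=True)
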

The current version of the engine can output structured data either as a CSV-like file (tab-delimited, for example) or as an RDF/OWL document (of Semantic Web fame). Both formats can easily be imported and further processed with Excel. The RDF/OWL format gives you the ability to process the data with all the powerful tools that are emerging from the Semantic Web effort. If you feel adventurous, you may thus import your RDF/OWL file into Stanford's Protege semantic modeling tool (or into Eclipse with its SWEDE plugin) and further process your data with the help of a SWRL rule-based inference engine. The future Semantic Web Rule Language will help further process this output so that you can powerfully compare RDF data coming from distinct sources (web sites). To be more productive in terms of fancy buzzwords, let's say that this reverse-templating technology is some sort of web semantizer: it produces semantically rich data out of flat web pages.
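
If you want to have a quick look at the output from Python rather than from Excel or Protege, something like the sketch below will do. It assumes the CSV-like file is tab-delimited and the OWL file is serialized as RDF/XML, and it uses the third-party rdflib package, which is not part of this distribution.

import csv

# Tab-delimited, CSV-like output.
with open("output_file.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        print(row)

# RDF/OWL output; requires rdflib (pip install rdflib).
from rdflib import Graph

graph = Graph()
graph.parse("output_file.owl", format="xml")
for subject, predicate, obj in graph:
    print(subject, predicate, obj)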

The current version of the engine makes extensive use of BeautifulSoup. Maybe it should have been based on a more XML-ish approach instead (using XML paths?). But that would have implied turning the HTML templates and the HTML files to be processed into XHTML first. The problem is that I would then have had to rely on uTidylib, and this library breaks some malformed HTML pages so badly that they are no longer usable.

Current known limitation: there is currently no way to properly handle situations where you need to distinguish between two similar anchors. In some cases, two HTML elements that you want to use as distinct anchors have exactly the same attributes and content. This is not a problem as long as these two anchors are always positioned at the same place in all the HTML pages you parse. But as soon as one of the anchors is not mandatory, or is located after a non-mandatory element, the engine can get lost and either confuse the two anchors or complain that one is missing. At the moment, I don't know how to handle this kind of situation. Example: long lists of specifications with similar names where some specifications are optional (see Canon camcorders as an example: the difference between the LCD number of pixels and the viewfinder number of pixels). The worst-case scenario is a flat list of HTML paragraphs. The engine will try to identify these risks and should output some warnings in this kind of situation.
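
As a contrived illustration of this limitation (the markup below is invented, not taken from a real page), both "Number of pixels" cells carry exactly the same attributes and content, so they can only be told apart by their position, which breaks down as soon as the optional LCD block is missing:

# Invented HTML fragment showing two anchors that look exactly alike.
AMBIGUOUS_FRAGMENT = """
<tr><td class="specname">LCD screen</td><td class="specvalue">...</td></tr>
<tr><td class="specname">Number of pixels</td><td class="specvalue">...</td></tr>
<tr><td class="specname">Viewfinder</td><td class="specvalue">...</td></tr>
<tr><td class="specname">Number of pixels</td><td class="specvalue">...</td></tr>
"""
# If the LCD rows are absent from some pages, an anchor conditioned on
# [{'class': 'specname'}, 'Number of pixels', True] may bind to the viewfinder
# row instead, or be reported as missing.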


Here are the contents of the ZIP distribution of this project (distributed under the General Public License):

  • web_parser.py: this is the web parser engine.
  • parser_dvspot.py: this is an example launch script to be used if you want to parse HTML files coming from the dvspot.com web site.
  • template_dvspot.zpt: this is the example template file corresponding to the excellent dvspot.com site.
  • BeautifulSoup.py: this is MY version of BeautifulSoup. Indeed, I had to modify Leonard Richardson's official one, and I haven't been able to obtain an answer from him yet regarding my suggested modifications. I hope he will answer soon and maybe include my modifications in the official version, or help me overcome my temptation to fork. My modifications are based on the official 1.2 release of BeautifulSoup: I added "center" as a nestable tag and added the ability to match the content of an element with the help of wildcards. You should save this BeautifulSoup.py file into the "Lib\site-packages" folder of your python installation.
  • README.html is the file you are currently reading, also published on my blog.

23 thoughts on "Web scraping with Python (part II)"

  1. kedai

    Nice. Thanks for the link to Beautiful Soup. Probably will try and use that as a parser for KebasData (a Zope product to scrape the web, what else).

    Currently, KebasData uses regex, and as noted by Leonard Richardson, regex is a double-edged sword; it can help and it can also cause more trouble ;)

  2. taprackbang

    This is great. After spending a great deal of time setting up all the prerequisite packages, I was able to scrape Amazon's web page to get the book title and price easily. However, Amazon's web page has a mismatched form tag which caused tidy to choke, so I had to manually remove all the form tags before the tidy and web_parser function calls.

    It works right out of the box, and made web scraping so much easier. Great work. Thanks!!

  3. Sig (post author)

    taprackbang: I should package this stuff so that it gets installed more easily, but I don't have any precise idea about how to do this. Regarding malformed HTML (Amazon's mismatched form tag), I suggest that you write an appropriate regex in a "MyProcessor"-like pre-processor class, as suggested in the first part of this article. Your pre-processor will help crawl Amazon and save its pages in a more adequate format.

    Thanks all for your positive feedback: keep on commenting, I enjoy it a lot! :-)

  4. Sig (post author)

    Quick notice: Leonard Richardson (BeautifulSoup's author) gave me positive feedback about my suggested modifications. He is to include these features in the version 2.0 of BeautifulSoup he is working on. I don't plan on making any additional contributions to BeautifulSoup in the near future, but I will certainly update this web parser as soon as v2.0 of BS is available. Thank you Leonard!

  5. taprackbang

    The option wasn't working for me. Then I put one line:

    try:
        option = finish['pseudotal:option']
        path.append(option)  # my change

    near the end of template.shortest_path(), and it works, because extract() looks for that option value. What do you think?

  6. Sig (post author)

    taprackbang: thank you for your suggestion; it looks better than the current version. But why do you say that the option wasn't working for you? Please describe the problem you were encountering.

  7. taprackbang

    Since the title and authors are mixed together in the div node, I want to get the whole div tag and process it myself. In my template, I have the tag shown in the next comment, but my script still treated title_authors normally and did not return the whole HTML string. After I put in my change, it works and returns the whole div tag.

  8. taprackbang

    missing template part:
    <div class="buying" pseudotal:content="title_authors" pseudotal:option="FULL_CONTENT">

  9. alex

    Thanks. Currently working further on a similar idea: compare two pages from the same site to find the fields that vary. With very little programmed intelligence you can then get article description, date, title, price… à la Froogle.

  10. Sig (post author)

    Alex: it sounds nice. Do you have some piece of code so that we can try it?

  11. Sig (post author)

    Adam: yes, you may. This piece of code does not care about the technology that generates your web pages. Indeed, once your ASP.Net code has been accessed and run by your web server, the output sent to the web browser is pure HTML. The only technology it may have problems with is Javascript, when there is too much of it in a web page (i.e. when most of the HTML is generated at run-time on the client side). Another limitation: when your HTML is really, really far from valid, you may have to tidy it with an included pre-processor (see the source code for more explanation on this point).

    So anyway, yes, you can certainly parse HTML generated by ASP.Net pages.

  12. adam

    Dear Sig,
    Thanks for your hard work!
    I have trouble when I use the project's ZIP distribution.
    I installed the packages following your guide step by step, but when I run web_parser, I get "ImportError: DLL load failed".

    The error message output to the shell console is the following:
    Traceback (most recent call last):
    File "D:\download\python\web_parser\web_parser.py", line 8, in ?
    import tidy
    File "D:\Python24\Lib\site-packages\tidy\__init__.py", line 38, in ?
    from tidy.lib import parse, parseString
    File "D:\Python24\Lib\site-packages\tidy\lib.py", line 16, in ?
    import ctypes
    File "D:\Python24\lib\site-packages\tidy\pvt_ctypes\ctypes.zip\ctypes\__init__.py", line 13, in ?
    ImportError: DLL load failed: 找不到指定的模块。 (The specified module could not be found.)

    Can you help me? Thanks.

  13. Sig (post author)

    Adam,

    I apologize for having forgotten to answer you sooner (job switch + holidays in-between)…

    The traceback you provided says that your tidy lib complains about some DLL that can't be loaded. I suggest you check your uTidylib installation. Maybe try to uninstall/reinstall and see if that fixes your problem.

    If that does not work, I suggest you ask the uTidylib project team for support.

    Once again, I sincerely apologize, and I hope you can fix this problem.

  14. Juancho

    I know this is way after the fact, but Adam, I had that problem as well. uTidylib wraps a C library (I think) and therefore uses the ctypes package to execute C code from Python.

    If your installation doesn't have ctypes installed, then uTidylib has its own version that it tries to use (in pvt_ctypes). But that version didn't work on my computer either – I think it is designed for Python 2.3.

    Solution: just install the latest ctypes library from SourceForge on your computer, and then uTidylib will use that version, and not its own private version. I hope that helps.

    Sig – thx much for the code. I’ve been playing with it extensively for some time now.

  15. Sig (post author)

    Juancho: thx for your thx. :)

    I hope I will have some opportunity to refresh this code a bit and to extend its functionality. Unfortunately, at the moment, I am struggling with these clauses in my new job contract that let my employer kind-of own anything I create (in case there would be some software patents to produce out of it…)… :(

  16. emrinho

    Wow pal, awesome tutorial. Thanks for the information, but I kind of got stuck at step 6: I couldn't properly add the attributes. Well, thanks anyway. It was a different experience for me.

  17. Sig (post author)

    Of course. It parses HTML, whatever the application engine behind the pages is.

  18. Sig (post author)

    DF asks me (by email):

    I read your articles on Web Scraping with Python and I’m wondering if over the years you’ve come across more advanced ways to solve your problems.

    More specifically, do you know of any practical way to extract the same type of data (i.e. camera products) from multiple websites of varying structure, without having to write custom code for each website (or with perhaps very minimal custom code)?

    I know there are some websites capable of doing this all automatically. Take for instance http://www.vast.com – they have loads of data.

    Do you have any insights into this kind of technology?

    Thanks Sig – I’d love to hear what you have to say…

    Here is my answer:

    First of all, the ideal solution remains to have the sites publish this data in a structured way (say RDF/OWL or JSON, for instance). Most often, if they don't publish in such a structured way, it may mean that they don't allow you to scrape their data, and you may get into legal trouble because of copyright laws.

    That being said, there have been attempts at easing the process of custom-coding the scraping of specific sites. The two most interesting solutions I played with (a couple of years ago) are Openkapow and Dapper.

    The advantage of Openkapow over custom script-based scraping is that it offers a rich scrape-robot development environment (GUI) which eases the process of analyzing the HTML structure. But running and exploiting these robots has turned out to be not as flexible and easy as running your own homemade scrapers.

    Dapper has a significant strength: it allows structure to be learnt by the machine from examples. You provide Dapper with several sample pages to extract data from, and it "automagically" identifies the recurrent HTML patterns which allow it to extract data. There must be some machine learning algorithm behind it, AFAICS. But the drawbacks of Dapper are that its algorithms are OK for 80% of cases but the other 20% won't be parseable by Dapper, and that Dapper requires the page to be a list of many items (think of a paginated list of search results). Dapper does not seem to be suitable for the technical sheet of a camera, for instance. And Dapper scrapers can't easily be combined: you can't easily script the navigation in a complex site unless you combine Dapper with things like Yahoo Pipes.

    As a conclusion, I would say that simple and easily accessible paginated lists of results deserve some dappering without hesitation. Openkapow is the tool to use if you can't script by yourself. But the definitive answer to complex and robust scraping remains homemade scripts.

    There may be other valuable alternatives I don't know about. I have not spent much time on scraping since I wrote this article.

    Please come back and share the results of your own experiments as further comments!

  19. Tripp Lilley

    « I am struggling with these clauses in my new job contract that let my employer kind-of own anything I create (in case there would be some software patents to produce out of it…)… :( »

    Don't worry about patentability… Template::Extract, a Perl module available on CPAN, predates this by a year, and so should give your employer no grounds for claiming novelty of the invention.

    Version 0.36, the first release available at CPAN:
    http://search.cpan.org/~autrijus/Template-Extract-0.36/

    I, too, thought that "reverse templating" would be a good way to approach this set of problems. I was partway into the "thinking about how it would work" process when I found Template::Extract, which freed me up to think about other problems :-)

Comments are closed.