WWW::Search::Scraper - framework for scraping results from search engines.
WWW::Search::Scraper('engineName');
``Scraper'' is a framework for issuing queries to a search engine, and scraping the data from the resultant multi-page responses.
As a framework, it allows you to get these results using only slight knowledge of HTML and Perl. (All you need to know you can learn by reading this document.)
A Perl script, ``Scraper.pl'', uses Scraper.pm to investigate the ``advanced search page'' of a search engine, issue a user-specified query, and parse the results. (Scraper.pm can be used by itself to support more elaborate searching Perl scripts.) Scraper.pl and Scraper.pm have only a limited amount of intelligence for figuring out how to interpret the search page and its results. That's where your human intelligence comes in. You need to supply hints to Scraper to help it find the right interpretation, and that is why you need some limited knowledge of HTML and Perl.
The front-end of Scraper is the part that figures out the search page and issues a query. There are three ways to implement this end.
WWW::Search::Sherlock can read Sherlock
plugins and access those sites the same way.
This is the simplest way to get up and running on a search engine. You just name the plugin, provide your query,
and watch it fly! You do not even need a sub-module associated with the search engine for this approach.
See WWW::Search::Sherlock for examples of how this is done.
There are about a hundred plugins available at http://sherlock.mozdev.org/source/browse/sherlock/www/, contributed by many in the Open Source community.
There are a few drawbacks to this approach.
If you run into these limitations, then you may want to use one of the following approaches.
native_setup_search() method of your search module,
find the URL of the ACTION= in the <FORM>, and plug that into your search module's {_option}{_base_url} attribute.
Also plug the METHOD= value from the <FORM> into the {_http_method} attribute.
You'll find the input fields in the <FORM> as <INPUT> elements. Supply values to
these fields via the {'option'=>'value'} parameter of the next_result() method, and you're on your way.
See the EXAMPLES below for these two latter approaches.
The back-end of Scraper.pm receives the response from the search engine, handles the multiple pages of which it may be composed, parses the results, and returns to the caller an appropriate Perl representation of these results (``appropriate'' means an array of hash tables of type WWW::Search::SearchResults). Scraper.pl (or some other Perl client) further processes this data, or presents it in some human-readable form.
There are a few common ways in which search engines return their results in the HTML response. These could be detected by Scraper.pm if it were intelligent enough, but unfortunately most search engines add so much administrative clutter, banner ads, ``join'' options, and so forth to the result that Scraper.pm usually needs some help in locating the real data.
The Scraper scripting language consists of both HTML parsing and string searching commands. While a strict HTML parse should produce the most reliable results, as a practical matter it is sometimes extremely difficult to grok just what the HTML structure of a response is (remember, these responses are composed by increasingly complex application server programs). Therefore, it is necessary to provide some hints as to where to start an interpretation by giving Scraper some kind of string searching command.
The string searching commands (BODY, COUNT, NEXT) will point Scraper to approximately the right place in the response page, while HTML parsing commands (TABLE, TR, TD, etc.) will extract the exact data. There are also ways to make callbacks into your sub-module to do exactly the type of parsing your engine requires.
Scraper performs its function by viewing the entire response page at once. Whenever a particular section of the page is recognized, it will process that section according to your instructions, then discard the recognized text from the page. It will repeat until no further sections are recognized.
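The recognize, process, discard cycle described above can be pictured with a few lines of plain Perl. This is only an illustrative sketch, not Scraper's actual implementation; the sub name consume_sections and the sample page are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Scraper): repeatedly recognize a section
# of the remaining page, hand its captured text to a handler, and discard
# everything up to and including the recognized text.
sub consume_sections {
    my ($page, $pattern, $handler) = @_;
    my @results;
    # s/// removes the recognized text, so each pass sees a shorter page.
    while ($page =~ s/^.*?$pattern//s) {
        push @results, $handler->($1);
    }
    return \@results;
}

my $page  = 'junk <li>first</li> ads <li>second</li> footer';
my $items = consume_sections($page, qr{<li>(.*?)</li>}, sub { uc $_[0] });
print join(', ', @$items), "\n";
```

The loop stops as soon as the pattern no longer matches, which mirrors "repeat until no further sections are recognized".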
We'll illustrate the exact syntax of this language in later examples, but the commands in this language include:
This is a quick way to get rid of a lot of administrative clutter. Either of the parameters is optional, but one should be supplied or else it's a no-op.
Both start-string and end-string are treated as ``regular expressions''. If you don't know anything about regular expressions, that's ok. Just treat them as strings that you would search for in the result page; see the examples.
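To make the start-string and end-string idea concrete, here is a small sketch in plain Perl. It assumes the general technique is "keep what lies between two regular expressions"; it is not a copy of Scraper's code, and the sub name trim_between and the sample page are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the text after the first match of $start
# and before the next match of $end. Either pattern may be empty.
sub trim_between {
    my ($page, $start, $end) = @_;
    if (defined $start && length $start) {
        return unless $page =~ /$start/s;
        $page = substr($page, $+[0]);      # drop everything through the start match
    }
    if (defined $end && length $end) {
        return unless $page =~ /$end/s;
        $page = substr($page, 0, $-[0]);   # drop the end match and what follows
    }
    return $page;
}

my $html   = '<FORM>search box</FORM><p>found 2 entries</p><HR>footer';
my $both   = trim_between($html, '</FORM>', '<HR>');
my $noend  = trim_between($html, '</FORM>', '');
print "$both\n$noend\n";
```

With an empty end-string the sketch keeps everything after the start match, and with both strings empty it returns the page unchanged, which matches the "no-op" remark above.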
It is a regular expression (see comments above). See the examples for some self-explanatory illustrations.
DATA or REGEX on these.)
The first parameter on the TD, DT or DD command names the field in which the garnered data will be placed.
A second parameter provides a reference to an optional subroutine for further processing of the data.
There are two forms of the A command since some loosely coded HTML will supply the hyperlink without the quote marks.
This creates some disturbing results sometimes, so if your data is in an anchor where the HREF is
provided without quotes, then the AN operation will parse it more reliably.
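The difference between quoted and unquoted HREF values is easy to see with two regular expressions. The patterns below are illustrative assumptions, not the ones Scraper uses internally, and the sample anchors are invented.

```perl
use strict;
use warnings;

# Illustrative patterns only; they are not taken from Scraper's source.
my $quoted   = qr{<a\s+href="([^"]*)"[^>]*>(.*?)</a>}is;    # HREF="..."
my $unquoted = qr{<a\s+href=([^\s>"]+)[^>]*>(.*?)</a>}is;   # HREF=...

my @found;
for my $html ('<A HREF="/job/1">Perl hacker</A>', '<a href=/job/2>Editor</a>') {
    if    ($html =~ $quoted)   { push @found, "quoted url=$1 title=$2" }
    elsif ($html =~ $unquoted) { push @found, "unquoted url=$1 title=$2" }
}
print "$_\n" for @found;
```

A pattern that insists on quote marks simply fails on the second anchor, which is why a separate form of the command is useful for loosely coded pages.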
The first parameter on the REGEX command is a regular expression. The remaining parameters are a list naming the fields in which the matched variables of this regex ($1, $2, $3, etc.) will be placed.
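The mapping from $1, $2, $3 to named fields can be sketched in plain Perl as follows. The sub name regex_to_fields and the sample line are hypothetical, and this is an assumption about the general idea rather than Scraper's own code.

```perl
use strict;
use warnings;

# Hypothetical helper: apply $regex once and store $1, $2, ... under the
# given field names, in order.
sub regex_to_fields {
    my ($text, $regex, @fields) = @_;
    my @captures = $text =~ /$regex/s or return;   # no match: return nothing
    my %hit;
    @hit{@fields} = @captures[0 .. $#fields];
    return \%hit;
}

my $line = 'Mar 3 - <a href=/jobs/42.html>Perl developer</a> (Oakland) <br>';
my $hit  = regex_to_fields(
    $line,
    '(.*?)-.*?<a href=([^>]+)>(.*?)</a>',
    'date', 'url', 'title',
);
print "$_ => $hit->{$_}\n" for sort keys %$hit;
```

Note that the captured text is stored exactly as matched, so surrounding whitespace (for example after the date) is kept unless a later step trims it.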
See the code for WWW::Search::Sherlock for an illustration of how this works. Sherlock uses this
method for almost all its parsing. A sample Sherlock scraper frame is also listed below in the EXAMPLES.
Scraper accepts its command script as a reference to a Perl array. You don't need to know how to build a Perl array; just follow these simple steps.
As noted above, every script begins with an HTML command
[ 'HTML' ]
You put the command in square brackets, and the name of the command in single quotes. HTML will have a single parameter, which is a reference to a Scraper script (in other words, another array).
[ 'HTML', [ ...Scraper script... ] ]
(You can see this is going to get messy with all these square brackets.)
Suppose we want to parse for just the NEXT button.
[ 'HTML',
  [
    [ 'NEXT', '<B>Next' ]
  ]
]
The basic syntax for a command is a set of square brackets enclosing the command name in single quotes. Following that command name may be one or two parameters, and following those parameters may be another list of commands. That list is itself within a set of square brackets, so often you will see two opening brackets together. At the end you will see a lot of closing brackets together (get used to counting brackets!).
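Because a script is just nested Perl arrays, a few lines of Perl can print its outline, which helps when counting brackets. The sub name outline is hypothetical and not part of Scraper; it assumes, as a simplification, that any parameter which is an array of arrays is a nested list of commands.

```perl
use strict;
use warnings;

# Hypothetical helper: return one indented line per command in a script.
sub outline {
    my ($command, $depth) = @_;
    my ($name, @rest) = @$command;
    my $indent = '  ' x $depth;
    my @lines  = ("$indent$name");
    for my $param (@rest) {
        # Plain strings, numbers and code references are ordinary parameters.
        next unless ref($param) eq 'ARRAY';
        push @lines, map { outline($_, $depth + 1) }
                     grep { ref($_) eq 'ARRAY' } @$param;
    }
    return @lines;
}

my $script =
    [ 'HTML',
      [
        [ 'NEXT', '<B>Next' ],
        [ 'HIT*',
          [
            [ 'TD', 'title' ],
          ]
        ],
      ]
    ];
my @outline = outline($script, 0);
print "$_\n" for @outline;
```

If the brackets of a script are unbalanced, Perl will refuse to compile it, so running a script through a helper like this is also a quick syntax check.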
Most search engines will not require you to use REGEX. We've used CraigsList here not to illustrate REGEX, but to illustrate the structure of the Scraper scripting syntax more clearly. Just ignore the REGEX command in this script; realize that it parses a data string and puts the results in the fields named there.
[ 'HTML',
  [
    [ 'BODY', '</FORM>', '' ,
      [
        [ 'COUNT', 'found (\d+) entries'] ,
        [ 'HIT*' ,
          [
            [ 'REGEX', '(.*?)-.*?<a href=([^>]+)>(.*?)</a>(.*?)<.*?>(.*?)<',
              'date', 'url', 'title', 'location', 'description' ]
          ]
        ]
      ]
    ]
  ]
]
This tells Scraper to skip ahead, just past the first ``</FORM>'' string (it's only a coincidence that this string is also an HTML end-tag). In the remainder of the result page, Scraper will find the approximate COUNT in the string ``found (\d+) entries'' (the '\d+' means to find at least one digit), then the HITs will be found by applying the regular expression repeatedly to the rest.
[ 'HTML',
  [
    [ 'COUNT', '\d+ - \d+ of (\d+) matches' ] ,
    [ 'NEXT', 1, '<b>Next ' ] ,
    [ 'HIT*' ,
      [
        [ 'BODY', '<input type="checkbox" name="check_', '',
          [ [ 'A', 'url', 'title' ] ,
            [ 'TD' ],
            [ 'TABLE', '#0',
              [
                [ 'TD' ] ,
                [ 'TD', 'payrate' ],
                [ 'TD' ] ,
                [ 'TD', 'company' ],
                [ 'TD' ] ,
                [ 'TD', 'locations' ],
                [ 'TD' ] ,
                [ 'TD', 'description' ]
              ]
            ]
          ]
        ]
      ]
    ]
  ]
]
Note that the initial BODY command that was used in CraigsList is optional. We don't use it here since most of JustTechJobs' result page is data, with very little administrative clutter.
We pick up the COUNT right away, with a simple regular expression. Then the NEXT button is located and stashed. The rest of the result page is rich with content, so the actual data starts right away.
Because of the extreme complexity of this page (due to its automated generation) the simplest way to locate a data record is by scanning for a particular string. In this case, the string '<input type . . .check_' identifies a checkbox that starts each data record on the JustTechJobs page. We put this BODY command inside of a HIT* so that it is executed as many times as required to pick up all the data records on the page.
Within the area specified by the BODY command, you will find a table that contains the data. The first parameter of the TABLE command, '#0', means to skip zero tables and to just read the first one. The second parameter of the TABLE is a script telling Scraper how to interpret the data in the table. The primitive data in this table is contained in TD elements, as are labels for each of the data elements. We throw away those labels by specifying no destination field for the data.
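The idea of skipping a number of tables and then discarding label cells can be sketched in plain Perl. The helpers nth_table and cells_to_fields are hypothetical, the sample HTML is invented, and nested tables are deliberately not handled; Scraper's real HTML parsing is more careful than these regular expressions.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of the table found after
# skipping $skip tables (0 means the first table).
sub nth_table {
    my ($html, $skip) = @_;
    my @tables = $html =~ m{<table\b[^>]*>(.*?)</table>}gis;
    return $tables[$skip];
}

# Hypothetical helper: pair each TD cell with a field name, in order.
# An undefined field name means the cell (for example a label) is discarded.
sub cells_to_fields {
    my ($table, @fields) = @_;
    my @cells = $table =~ m{<td\b[^>]*>(.*?)</td>}gis;
    my %hit;
    for my $i (0 .. $#fields) {
        next unless defined $fields[$i];
        $hit{ $fields[$i] } = $cells[$i];
    }
    return \%hit;
}

my $html = '<table><tr><td>Pay:</td><td>$50/hr</td>'
         . '<td>Company:</td><td>Acme</td></tr></table>';
my $hit = cells_to_fields(nth_table($html, 0), undef, 'payrate', undef, 'company');
print "$_ => $hit->{$_}\n" for sort keys %$hit;
```

The undef entries play the same role as the bare [ 'TD' ] commands in the script above: the cell is consumed but its contents go nowhere.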
The page, as composed by Lotus-Domino, literally consists of a form, containing several tables, one of which contains another table, which in turn contains data elements which are themselves two tables in which each of the job listings is presented in various forms (I think). Given such a complex page, this Scraper script is remarkably simple for interpreting it.
[ 'HTML',
  [
    [ 'BODY', ' matching your query', '' ,
      [
        [ 'NEXT', 1, '<img src="/images/rightarrow.gif" border=0>' ]
       ,[ 'COUNT', 'Jobs [-0-9]+ of (\d+) matching your query' ]
       ,[ 'HIT*' ,
          [
            [ 'DL',
              [
                [ 'DT', 'title', \&addURL ]
               ,[ 'DD', 'location', \&touchupLocation ]
               ,[ 'RESIDUE', 'residue' ]
              ]
            ]
          ]
        ]
      ]
    ]
  ]
]
We'll leave this as an exercise for the reader (note that this is the ``brief'' form of the response page).
WWW::Search::Sherlock, to illustrate how Sherlock uses the Scraper framework.
If you point Sherlock to the Yahoo plugin, it will generate the following Scraper frame to parse the result page.
[
  'HTML',
  [
    [
      'CALLBACK', \&resultList,
      'Inside Yahoo! Matches',
      'Yahoo! Category Matches',
      [
        [
          'HIT*',
          [
            [
              'CALLBACK', \&resultItem,
              '<b>',
              '<br>',
              [
                [
                  'CALLBACK',
                  \&resultData,
                  '<b>',
                  ':</b>',
                  'result_name'
                ]
              ],
              undef
            ]
          ],
          'result'
        ]
      ]
    ],
    [
      'CALLBACK',
      \&resultList,
      'Yahoo! Category Matches',
      'Yahoo! News Headline Matches',
      [
        [
          'HIT*',
          [
            [
              'CALLBACK', \&resultItem,
              '<dt><font face=arial size=-1>',
              '</a></li><p></dd>',
              [
                [
                  'CALLBACK', \&resultData,
                  '<li>',
                  '</a></li><p></dd>',
                  'result_name'
                ]
              ],
              undef
            ]
          ],
          'category'
        ]
      ]
    ]
  ]
]
You'll notice that there are three callback functions in here (six invocations). These are named after the parts-of-speech specified in the Sherlock technotes. These callback functions will process the data a little differently than the standard Scraper functions would.
In Sherlock, for instance, the start and end strings are considered part of the data, so throwing them away causes unfortunate results. Our callbacks handle the data more in the way that Sherlock's creators intended.
'resultList' corresponds to Scraper's BODY, 'resultItem' corresponds to Scraper's TABLE, and 'resultData' corresponds to Scraper's DATA. The next two parameters of each CALLBACK operation indicate the start and end strings for that callback function. (A fourth parameter allows you to pass more specific information from the Scraper frame to the callback function, as desired.) Of course, these callback functions then handle the data in the ``Sherlock way'', rather than the ``Scraper way''.
Note that the start string for the second resultList is the same as the end string of the first resultList. This is but one illustration of how Sherlock handles things differently than Scraper. But by using the CALLBACK operation, just about any type of special treatment can be created for Scraper.
We refer you to the code for WWW::Search::Sherlock for further education on how to compose your own CALLBACK functions.
Glenn Wood, glenwood@alumni.caltech.edu.
Copyright (C) 2001 Glenn Wood. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.