Web extraction tutorial - retrieving articles from www.prweb.com

In this tutorial we show how we solved a problem specified by one of our customers. The requirements were as follows:

  1. search www.prweb.com for any keyword (e.g. "black coffee")
  2. parse all PR articles found, not just the first 10 results but also the following result pages (e.g. the search for "black coffee" found more than 1600 articles)
  3. open each PR article found and store its content in a file named {$year}-{$month}-{$id}.html

The customer asked us whether this task could be done with Unit Miner. We answered with a big "YES", because this is exactly what we developed Unit Miner for. So let's start the tutorial and see how to do it.

Step 1 - open search

As the first step we define a Logger tag, because during development it is always good to know what is going on inside Unit Miner. Even though I know Unit Miner well, I do this every time I write a new script or have to track down a problem in one.

In this step we also create a small skeleton of the main section, which should be easy to understand. Create a script named main.w, e.g. in the directory Wrapper/samples/prweb:


#defines logger, which will log all debug messages
<Logger File>
    Global
    
    #store output of logger to file
    FileName output-debug.txt
    Level debug
</Logger>

#main section
<Section>
    Name main

  

</Section>

Main main

The main story is highlighted in the page source code:

Step 2 - HTML form

In this step I will show you how to compose a small HTML document that displays a form with one search input field. I will also show you how to start the Unit Miner script when a search string is submitted.

So let's create the file index.php in the Wrapper/samples/prweb directory. Create the directories if they don't exist. The file contains an HTML section, which prints the small form, and a PHP section, which executes the Unit Miner script named main.w.

We want to use only the title of the main story and the short text below the picture.
As we can see, the title is enclosed in the tags <div class="cnnMainT1Hd"></div> and the short text is within the tags <div class="cnnMainT1"></div>


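<form>
Search string: <input type="text" name="search" value="<?php echo htmlspecialchars(isset($_REQUEST['search']) ? $_REQUEST['search'] : '', ENT_QUOTES); ?>">

<input type="submit" name="Execute">
</form>

<?php
// run the Unit Miner script only when a non-empty search string was submitted
if (isset($_REQUEST['search']) && strlen($_REQUEST['search'])) {
    require_once('../../../QUnit/Global.class.php');
    
    $executor = QUnit_Global::newobj('Wrapper_Executor', 'main.w');
    $executor->execute();
}

?>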

Step 3 - execute search on www.prweb.com and grab first page with results

Now we return to our new script main.w.

To the empty skeleton we add a command that loads the submitted data (the value of the field "search"). For this purpose we use the <Action Template> tag, which lets us execute any PHP code. In this tag we compose the URL used for searching www.prweb.com.

Next, we load the content returned by the search. For this we use the <Action ContentURL> tag. Finally, we execute a new script (which we don't have yet) named "process_result_page.w".

This new script should process everything on the result page, which means grabbing the URLs of the PR articles.
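
To make the URL composition concrete, here is a minimal Python sketch of the same idea, written outside Unit Miner. The function name build_search_url is hypothetical. The base address and the query parameters are the ones used in the script in this step. The sketch URL-encodes the keyword, which is an assumption about what the search script expects for keywords containing spaces or special characters.

```python
from urllib.parse import urlencode

# Base address of the search script, as used in main.w in this tutorial.
SEARCH_BASE = "http://www.prweb.com/cgi-bin/search/search.pl"


def build_search_url(search: str) -> str:
    """Return the search URL for one keyword (hypothetical helper).

    urlencode() escapes spaces and special characters in the keyword,
    so a keyword such as "black coffee" still yields a valid URL.
    """
    params = {
        "Terms": search,
        "Match": "0",
        "Realm": "prweb_inject",
        "submit": "Search",
    }
    return SEARCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("black coffee"))
```

Without the encoding step, a keyword containing a space or an ampersand would change the meaning of the query string.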

Every script contains a main section that downloads the page from the specified URL and then performs some action on the downloaded content. We will see how to match the data in the next step.

<Logger File>
    Global
    FileName output-main.txt
    Level debug
</Logger>

<Section>

    Name main

    #set variable $url (search URL built from the REQUEST value of the form field "search")
    <Action Template>
        TemplateText $context->setVariable('$url', \
        "http://www.prweb.com/cgi-bin/search/search.pl?Terms=" . urlencode($_REQUEST['search']) . \
        "&Match=0&Realm=prweb_inject&submit=Search");
    </Action>

    <Action ContentURL>

        URL {$url}
    </Action>
    
    #process found resultset
    <Action Eval>
        File process_result_page.w
    </Action>


</Section>


#start execution with section named: main
Main main

Step 4 - grab urls of PR articles

In the previous step we executed a script named process_result_page.w. Let's see how to prepare this script.

Create a file named process_result_page.w in the same directory as main.w. The script loads the content of the result page (the URL is stored in the context variable $url). It contains a While section, which iterates over all occurrences of a pattern in the document. Inside the While section we insert a Pattern tag, which searches the loaded content for occurrences of the defined pattern and loads the matched values (the number and URL of each PR article) into variables. Finally, we execute the next script, named "process_pr_page.w", which processes each detail page separately.
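
The iteration described above can be sketched in plain Python as follows. This is only an illustration of the matching logic, not of how Unit Miner works internally. The function name extract_article_links and the sample HTML are hypothetical, and the sketch assumes that {$link_number} stands for a run of digits and that RemoveNewLine simply strips line breaks.

```python
import re
from typing import List, Tuple

# Rough regular-expression equivalent of the pattern
#   <dt><b>{$link_number}. <a href="{$link_url}">
# (assumption: the link number is a run of digits).
LINK_PATTERN = re.compile(r'<dt><b>(?P<number>\d+)\. <a href="(?P<url>[^"]+)">')


def extract_article_links(page_html: str) -> List[Tuple[int, str]]:
    """Return (number, url) for every article link on one result page."""
    # Strip line breaks first, mirroring the RemoveNewLine option.
    flat = page_html.replace("\r", "").replace("\n", "")
    return [
        (int(match.group("number")), match.group("url"))
        for match in LINK_PATTERN.finditer(flat)
    ]


# Invented sample markup, shaped like the pattern above.
SAMPLE = (
    "<dl>\n"
    '<dt><b>1. <a href="http://www.prweb.com/releases/2005/1/prweb212133.htm">First</a></b>\n'
    '<dt><b>2. <a href="http://www.prweb.com/releases/2005/2/prweb300000.htm">Second</a></b>\n'
    "</dl>"
)

if __name__ == "__main__":
    for number, url in extract_article_links(SAMPLE):
        print(number, url)
```

Each tuple returned here corresponds to one pass through the While section, in which $link_number and $link_url are set and process_pr_page.w is executed.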

To match the data we have to specify a matching pattern. The following pattern tells the system to match everything between the tags <div class="cnnMainT1Hd"><h2><a*>...</a></h2></div> and store the matched value in the variable $main_title

Then we define an action that does something with the value in this variable. In our case it only prints the result to the standard output using the Text command, but you can also save the value to a file, insert it into a database, or use the value to load and mine another page.


<Section>

    Name process_result_page
    
    #load content from url defined in context variable $url
    <Action ContentURL>

        URL {$url}
        
        #remove line breaks from the retrieved content
        RemoveNewLine
    </Action>
    
    #iterate in page until you find all links to PR articles
    <Section While>
    
        #match pattern which contains link to PR article and store link into context variable $link_url
        <Pattern>

            RegExp <dt><b>{$link_number}. <a href="{$link_url}">

        </Pattern>
        
        #execute script, which will process matched URL of PR article
        <Action Eval>
            # some pages have different format and they don't match defined patterns, 
            # therefore continue in execution also if any page fails
            Optional
            
            File process_pr_page.w
        </Action>

 
    </Section>
</Section>

#start execution with section process_result_page
Main process_result_page

Here we defined one more pattern that will match everything between the tags <div class="cnnMainT1">...</div> and store the matched value into the variable $short_text.

Note that we used special modifier :re(.*?) after the variable.

:re tells the system to use regular expression to match the text

.*? is a regular expression that matches every character until the character ‘<’.

We also defined one more action that will print the matched text to the standard output.

Step 5 - grab PR article and store it to file

Now we are close to finishing our task. In this step we write the content loaded from the URL of a PR article to a file, as specified. So let's create a file named "process_pr_page.w" in the same directory as all the previous files.

This script loads the content from the URL matched into the variable $link_url by the previous script. It also matches the data of the PR article (title and text). As the last action, it stores the content in a file in the new format.
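
As an illustration of how the file name is derived from the article URL, here is a small Python sketch. The function name output_filename is hypothetical, and the sketch assumes that year, month and id are always digits, which matches the example file name 2005-1-212133.html used in this tutorial.

```python
import re

# Rough regular-expression equivalent of the pattern
#   releases/{$year}/{$month}/prweb{$id}.htm
# (assumption: all three parts consist of digits only).
URL_PATTERN = re.compile(
    r"releases/(?P<year>\d+)/(?P<month>\d+)/prweb(?P<id>\d+)\.htm"
)


def output_filename(link_url: str) -> str:
    """Build the {$year}-{$month}-{$id}.html file name from an article URL."""
    match = URL_PATTERN.search(link_url)
    if match is None:
        # Some pages use a different format; the caller decides whether to skip them.
        raise ValueError("URL does not look like a PR article: " + link_url)
    return "{year}-{month}-{id}.html".format(**match.groupdict())


if __name__ == "__main__":
    print(output_filename("http://www.prweb.com/releases/2005/1/prweb212133.htm"))
```

Raising an error for an unexpected URL plays the same role as the Optional attribute in process_result_page.w: the caller can catch it and continue with the next article.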


<Section>

    Name process_pr_page

    #load content of page containing PR article
    <Action ContentURL>

        #url is stored in context variable $link_url
        URL {$link_url}
        
        #remove line breaks from loaded content
        RemoveNewLine
    </Action>
    
    #match title of article
    <Pattern>
        RegExp <h1 class="h1format">{$pr_title}</h1>

    </Pattern>

    
    #match text of PR article
    <Pattern>
        RegExp <div align="left">{$pr_text:regexp(.*?)}</div>
    </Pattern>

    #load variable $link_url as content, because we like to parse from url year, month and id, 
    #which we will use as filename
    <Action ContentVariable>

        Variable $link_url
    </Action>
    
    #match year, month and id in content
    <Pattern>
        RegExp releases/{$year}/{$month}/prweb{$id}.htm
    </Pattern>

    
    #store matched data to a file whose name has the format e.g. 2005-1-212133.html
    #the Text attribute defines the content of the file
    <Action Print>
        FileName {$year}-{$month}-{$id}.html
        Text <HTML><BODY><H1>{$pr_title}</H1><br><br>{$pr_text}</BODY></HTML>

    </Action>
    
</Section>

#start execution in this script with section process_pr_page
Main process_pr_page

Earlier I said that we were close to the end.

We are, but we forgot one small part of the specification: we also have to iterate through the following pages of the result set. So let's do this last step of the tutorial.

Step 6 - iterate through all result pages

We have to edit the script main.w again, because this is the script that loads the first result page.

In this script we match the URLs of all the following search pages and process them with the same script that we used for the first page.
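
The paging logic can be sketched in Python as follows. The function name next_page_urls, the sample markup and its query parameters are hypothetical. The sketch assumes that the page links sit between the text "Results Pages:" and the next "[" character (our reading of the EndAt attribute), and that the href values are relative and HTML-escaped, which is why the script below decodes the entities and prepends the search directory.

```python
import html
import re
from typing import List

# Directory that the relative page links are resolved against.
SEARCH_DIR = "http://www.prweb.com/cgi-bin/search/"

# Rough equivalent of the pattern <a href="{$url}">{$page_number}</a>
PAGE_LINK = re.compile(r'<a href="(?P<href>[^"]+)">(?P<page>\d+)</a>')


def next_page_urls(page_html: str) -> List[str]:
    """Return absolute URLs of the further result pages, in page order."""
    start = page_html.find("Results Pages:")
    if start == -1:
        return []  # single-page result set: nothing more to fetch
    end = page_html.find("[", start)
    region = page_html[start:end] if end != -1 else page_html[start:]
    # html.unescape() turns "&amp;" back into "&", as the PHP code does
    # with the flipped translation table.
    return [
        SEARCH_DIR + html.unescape(match.group("href"))
        for match in PAGE_LINK.finditer(region)
    ]


# Invented sample markup; the real page layout may differ.
SAMPLE = (
    'Results Pages: <a href="search.pl?Terms=coffee&amp;Start=10">2</a> '
    '<a href="search.pl?Terms=coffee&amp;Start=20">3</a> '
    '[<a href="search.pl?Terms=coffee&amp;Start=10">Next</a>]'
)

if __name__ == "__main__":
    for url in next_page_urls(SAMPLE):
        print(url)
```

Stopping at the "[" character keeps a trailing navigation link, such as the invented "Next" link in the sample, from being processed as one more page.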

<Logger File>
    Global
    FileName output-main.txt
    Level debug
</Logger>

<Section>
    Name main

    #set variable $url (search URL built from the REQUEST value of the form field "search")
    <Action Template>
        TemplateText $context->setVariable('$url', \
        "http://www.prweb.com/cgi-bin/search/search.pl?Terms=" . urlencode($_REQUEST['search']) . \
        "&Match=0&Realm=prweb_inject&submit=Search");
    </Action>

    <Action ContentURL>
        URL {$url}
    </Action>
    
    #process found resultset
    <Action Eval>
        File process_result_page.w
    </Action>


    #process also next pages of search result
    <Section>
        Name subpages
        Optional
        
        <Pattern>
            RegExp Results Pages:
        </Pattern>

        #iterate through all next pages of search resultset
        <Section While>

            Name Next_Pages
            EndAt [
            
            #match pattern of next page link and load link into context variable $url
            <Pattern>
                RegExp <a href="{$url}">{$page_number}</a> 
            </Pattern>
            
            #define URL in correct way, because in HTML it's not complete
            <Action Template>

                TemplateText    $trans_tbl = get_html_translation_table(HTML_ENTITIES);\
                                $trans_tbl = array_flip($trans_tbl);\
                                $context->setVariable('$url', "http://www.prweb.com/cgi-bin/search/" . \
                                strtr($context->getVariable('$url'), $trans_tbl));
            </Action>

            
            #execute script, which processes the results as was done with the first page
            <Action Eval>
                File process_result_page.w
            </Action>
        </Section>

    </Section>
  
</Section>


#start execution with section named: main
Main main

© 2004-2012 QualityUnit.com, All rights reserved