
Step 1 - identifying the pattern

First, we need to identify the part of the HTML code that contains the information we want. The screenshot below shows which information we want to mine from the selected source (marked with a red arrow).


Screenshot - CNN website

This is the main story in the page source code:

<div style="background-color: #EAEFF4;"> <div class="cnnMainT1Hd"><h2><a href="/2004/ALLPOLITICS/12/20/bush.ap/index.html" style="color:#000;"> Bush looks to slash deficit</a></h2></div><div style="background-color:#fff;"><img src="http://i.cnn.net/cnn/img/1.gif" alt="" width="1" height="10"></div> <!-- REAP --><a href="/2004/ALLPOLITICS/12/20/bush.ap/index.html"><img src="http://i.a.cnn.net/cnn/2004/ALLPOLITICS/12/20/bush.ap/top.bush.flag.ap.jpg"width="280" height="210" alt="Bush looks to slash deficit" border="0" hspace="0" vspace="0"></a><!-- PURGE: /2004/ALLPOLITICS/12/20/bush.ap/top.bush.flag.ap.jpg --><!-- KEEP --><!-- /REAP --><div class="cnnMainT1"><p> President Bush today covered a range of issues at a news conference in Washington. Bush said:<li> He would submit a budget that cuts the deficit in half and maintains strict spending discipline </li><li> Iraqi elections in January "are the beginning of a process, and it is important for the American people to understand that" </li><li> Defense Secretary Donald Rumsfeld is "doing a very fine job" </li></p><p><a href="/2004/ALLPOLITICS/12/20/bush.ap/index.html" class="cnnt1link">FULL STORY</a></p><p> <b><span class="cnnBodyText" style="font-weight:bold;color:#333;">Video: </span><img src="http://i.cnn.net/cnn/.element/img/1.0/misc/premium.gif" alt="premium content" width="9" height="11" hspace="0" vspace="0" border="0" align="absmiddle"></b> <a href="javascript:LaunchVideo('/politics/2004/12/20/sot.bush.domestic.agenda.cnn.','300k');">Bush sets out economic agenda</a><br> <b><span class="cnnBodyText" style="font-weight:bold;">CNN/Money: </span></b> <a href="/money/2004/12/20/retirement/bush_pressconf/index.htm">Social security questions</a><br> <b>Transcript:</b> <a href="/2004/ALLPOLITICS/12/20/bush.transcript.ap/index.html">Bush news conference</a><br> <b><span class="cnnBodyText" style="font-weight:bold;color:#333;">Special Report: </span></b> <a href="/SPECIALS/2004/bush.term/">Bush: The Second 
Term</a><br></p></div></div>

We want to use only the title of the main story and the short text below the picture.
As we can see, the title is enclosed in the tags <div class="cnnMainT1Hd"></div>, and the short text is within the tags <div class="cnnMainT1"></div>.

Step 2 - first version of the script

Now we can start writing the script. First, we define the main section, which downloads the content of the main page of www.cnn.com.

#main section of script
<Section>
    #define name of section
    Name ourMainSection


    # Load content
    <Action ContentURL>
        #load content from the following URL
        URL http://www.cnn.com
        
        #removes newlines from downloaded content for easier matching 
        RemoveNewLine
    </Action>

</Section>

#run section with name “ourMainSection”
Main ourMainSection

Every script contains a main section that downloads the page from the specified URL and then performs some action on the downloaded content. We will see how to match the data in the next step.
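
As a rough illustration of why RemoveNewLine makes matching easier, here is a small Python sketch. It is not Unit Miner code: the helper name remove_newlines is made up for this example, and it is only an assumption that Unit Miner strips line breaks in the same way.

```python
import re


def remove_newlines(content: str) -> str:
    """Drop every carriage return and line feed from the content.

    Hypothetical helper: it only approximates what RemoveNewLine is
    described as doing in the script above.
    """
    return content.replace("\r", "").replace("\n", "")


# A link whose text is split across two lines, like the end of the
# CNN snippet shown in Step 1.
fragment = '<a href="/SPECIALS/2004/bush.term/">Bush: The Second \nTerm</a>'

# By default "." does not match a line break, so this pattern fails on
# the raw fragment and succeeds once the line breaks are removed.
pattern = re.compile(r"<a [^>]*>(.*?)</a>")

before = pattern.search(fragment)
after = pattern.search(remove_newlines(fragment))

print(before)          # no match: the link text spans two lines
print(after.group(1))  # the link text, now on a single line
```

With the line breaks gone, a pattern written on one line can match content that was originally spread over several lines.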

Step 3 - matching the title

In step 2 we loaded the content of the web page, so now we can try to match the title of the main story and print it to the default output.

#main section of script
<Section>

    #define name of section
    Name ourMainSection

    # Load content
    <Action ContentURL>
        #load content from the following URL
        URL http://www.cnn.com
        
        #removes newlines from downloaded content for easier matching 
        RemoveNewLine
    </Action>

    #this pattern should match main story title
    <Pattern>

       #defines expression which should match the data
       RegExp <div class="cnnMainT1Hd"><h2><a*>\
       	  {$main_title}</a></h2></div>

    </Pattern>

    #print matched data to default output
    <Action Print>
        Text Story of the day: {$main_title}\n
    </Action>

</Section>

#run section with name “ourMainSection”
Main ourMainSection


To match the data, we have to specify a matching pattern. The pattern above tells the system to match everything between the tags <div class="cnnMainT1Hd"><h2><a*>...</a></h2></div> and store the matched value in the variable $main_title.

Then we defined an action that does something with the value in this variable. In our case it only prints the result to the standard output using the Text command, but you could also save the value to a file, insert it into a database, or use the value to load and mine another page.
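
For readers who know standard regular expressions, the title pattern can be approximated in Python as shown below. This is only a sketch. It assumes that <a*> acts as a wildcard for the attributes of the anchor tag and that {$main_title} captures the text up to the closing </a>; the exact Unit Miner matching rules may differ. The names TITLE_PATTERN and extract_title are made up for this example.

```python
import re
from typing import Optional

# Assumed equivalent of the Unit Miner pattern:
#   <div class="cnnMainT1Hd"><h2><a*>{$main_title}</a></h2></div>
TITLE_PATTERN = re.compile(
    r'<div class="cnnMainT1Hd"><h2><a[^>]*>(?P<main_title>.*?)</a></h2></div>'
)


def extract_title(html: str) -> Optional[str]:
    """Return the main story title, or None if the pattern does not match."""
    match = TITLE_PATTERN.search(html)
    return match.group("main_title").strip() if match else None


# Excerpt of the page source shown in Step 1.
sample = (
    '<div class="cnnMainT1Hd"><h2>'
    '<a href="/2004/ALLPOLITICS/12/20/bush.ap/index.html" style="color:#000;">'
    ' Bush looks to slash deficit</a></h2></div>'
)

# Mirrors the Print action of the script.
print("Story of the day:", extract_title(sample))
```

Returning None when nothing matches keeps a layout change on the source page from going unnoticed.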


Step 4 - matching the text under the picture

As the last step, we need to add one more Pattern tag that will match the short text of the main story.

#main section of script
<Section>
    #define name of section
    Name ourMainSection

    # Load content
    <Action ContentURL>
        #load content from the following URL
        URL http://www.cnn.com
        
        #removes newlines from downloaded content for easier matching 
        RemoveNewLine
    </Action>

    #this pattern should match main story title
    <Pattern>

       #defines expression which should match the data
       RegExp <div class="cnnMainT1Hd"><h2><a*>\
       	  {$main_title}</a></h2></div>
    </Pattern>

    #print matched data to default output
    <Action Print>

        Text Story of the day: {$main_title}\n
    </Action>

    #match short text from main story
    <Pattern>
       RegExp <div class="cnnMainT1">\
       {$short_text:re(.*?)}</div>
    </Pattern>

    #print matched short text
    <Action Print>
        Text Text: {$short_text}\n
    </Action>

</Section>

#run section with name “ourMainSection”
Main ourMainSection

Here we defined one more pattern that will match everything between the tags <div class="cnnMainT1">...</div> and store the matched value into the variable $short_text.

Note that we used the special modifier :re(.*?) after the variable name.

:re tells the system to use a regular expression to match the text.

.*? is a non-greedy regular expression: it matches any characters, but as few as possible, so the match ends at the first </div> that follows.

We also defined one more action that prints the matched text to the standard output.
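
The effect of the .*? expression used in the :re modifier can be illustrated with Python's re module. This is a sketch that assumes :re(...) follows the usual Perl-style regular expression semantics. The sample string is shortened, and the variable names are made up for this example.

```python
import re

# Shortened sample: the short-text block followed by an unrelated block.
html = (
    '<div class="cnnMainT1"><p> President Bush today covered a range of issues.'
    '<li> He would submit a budget </li></p></div>'
    '<div class="other">unrelated</div>'
)

# Non-greedy (.*?): take as few characters as possible before a </div>.
lazy = re.search(r'<div class="cnnMainT1">(.*?)</div>', html)

# Greedy (.*): take as many characters as possible before a </div>.
greedy = re.search(r'<div class="cnnMainT1">(.*)</div>', html)

print(lazy.group(1))
print(greedy.group(1))
```

In this sketch the non-greedy match keeps inner tags such as <li> and ends just before the first </div>, while the greedy match runs on into the unrelated block.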