Step 1 - identifying the pattern
First we need to identify the part of the HTML code that contains our information. The picture shows which information we want to mine from the selected source (marked with a red arrow).
The main story is highlighted in the page source code:
We want to use only the title of the main story and the short text below the picture.
As we can see, the title is enclosed in the tags <div class="cnnMainT1Hd"> … </div>
and the short text is enclosed in the tags <div class="cnnMainT1"> … </div>
Step 2 - first version of the script
Now we will start with the script. First we define the main section, which downloads the content of the main page of www.cnn.com:
#main section of script
<Section>
#define name of section
Name ourMainSection
# Load content
<Action ContentURL>
#load content from the following URL
URL http://www.cnn.com
#removes newlines from downloaded content for easier matching
RemoveNewLine
</Action>
</Section>
#run section with name “ourMainSection”
Main ourMainSection
Every script contains a main section that downloads the page from the specified URL and then performs some action on the downloaded content. We will see how to match the data in the next step.
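To make the effect of RemoveNewLine more concrete, here is a minimal Python sketch of the same idea. It is not part of the script language. The function name remove_newlines and the sample fragment are hypothetical, and the sketch assumes that RemoveNewLine simply deletes line-break characters.

```python
def remove_newlines(content: str) -> str:
    """Delete CR and LF characters so a pattern can span former line breaks."""
    return content.replace("\r", "").replace("\n", "")


# Hypothetical fragment standing in for downloaded page content.
sample = '<div class="cnnMainT1Hd">\n<h2><a href="/story">Example headline</a></h2>\n</div>'

flattened = remove_newlines(sample)
print(flattened)
```

Once the content is on a single line, a pattern such as <div ...><h2> can match even if the original page had a line break between the two tags.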
Step 3 - matching the title
In step 2 we loaded the content of the web page, so now we can try to match the title of the main story and print it to the default output:
#main section of script
<Section>
#define name of section
Name ourMainSection
# Load content
<Action ContentURL>
#load content from the following URL
URL http://www.cnn.com
#removes newlines from downloaded content for easier matching
RemoveNewLine
</Action>
#this pattern should match main story title
<Pattern>
#defines expression which should match the data
RegExp <div class="cnnMainT1Hd"><h2><a*>\
{$main_title}</a></h2></div>
</Pattern>
#print matched data to default output
<Action Print>
Text Story of the day: {$main_title}\n
</Action>
</Section>
#run section with name “ourMainSection”
Main ourMainSection
To match the data, we have to specify a matching pattern. The pattern above tells the system to match everything between the tags <div class="cnnMainT1Hd"><h2><a*>...</a></h2></div> and store the matched value in the variable $main_title.
Then we defined an action that does something with the value in this variable. In our case it only prints the result to the standard output using the Text command, but you could also save the value to a file, insert it into a database, or use it to load and mine another page.
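For readers more familiar with general-purpose regular expressions, the title pattern can be approximated in Python roughly as follows. This is only a sketch under assumptions. The sample HTML is hypothetical, and it assumes that <a*> in the script means an opening <a> tag with arbitrary attributes and that {$main_title} captures the text up to the closing </a>.

```python
import re

# Hypothetical, already-flattened HTML; the real page markup may differ.
html = '<div class="cnnMainT1Hd"><h2><a href="/story">Example headline</a></h2></div>'

# <a[^>]*> stands in for <a*> (assumed: any attributes on the opening tag).
# The named group stands in for the {$main_title} variable.
title_re = re.compile(
    r'<div class="cnnMainT1Hd"><h2><a[^>]*>(?P<main_title>.*?)</a></h2></div>'
)

match = title_re.search(html)
main_title = match.group("main_title") if match else None

# Counterpart of the Print action: only print when the pattern matched.
if main_title is not None:
    print(f"Story of the day: {main_title}")
```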
Step 4 - matching the text under the picture
As a last step, we need to add one more Pattern tag that will match the short text of the main story:
#main section of script
<Section>
#define name of section
Name ourMainSection
# Load content
<Action ContentURL>
#load content from the following URL
URL http://www.cnn.com
#removes newlines from downloaded content for easier matching
RemoveNewLine
</Action>
#this pattern should match main story title
<Pattern>
#defines expression which should match the data
RegExp <div class="cnnMainT1Hd"><h2><a*>\
{$main_title}</a></h2></div>
</Pattern>
#print matched data to default output
<Action Print>
Text Story of the day: {$main_title}\n
</Action>
#match short text from main story
<Pattern>
RegExp <div class="cnnMainT1">\
{$short_text:re(.*?)}</div>
</Pattern>
#print matched short text
<Action Print>
Text Text: {$short_text}\n
</Action>
</Section>
#run section with name “ourMainSection”
Main ourMainSection
Here we defined one more pattern that matches everything between the tags <div class="cnnMainT1">...</div> and stores the matched value in the variable $short_text.
Note that we used the special modifier :re(.*?) after the variable name.
:re tells the system to use a regular expression to match the text.
.*? is a non-greedy regular expression: it matches as few characters as possible, so the match ends at the first </div> that follows.
We also defined one more action that prints the matched text to the standard output.
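The difference between the non-greedy .*? and the greedy .* can be illustrated with Python's re module. The HTML below is a hypothetical sample, and Python is used only for illustration. The script language itself may handle some details differently.

```python
import re

# Hypothetical flattened HTML with a second <div> after the one we want.
html = '<div class="cnnMainT1">First text</div><div class="other">Second text</div>'

# Non-greedy: stops at the first </div> after the opening tag.
lazy = re.search(r'<div class="cnnMainT1">(.*?)</div>', html).group(1)

# Greedy: runs on to the last </div> in the content.
greedy = re.search(r'<div class="cnnMainT1">(.*)</div>', html).group(1)

print("non-greedy:", lazy)
print("greedy:    ", greedy)
```

With the non-greedy form only "First text" is captured, while the greedy form also swallows the closing tag and the following <div>, which is why .*? is the safer choice on a page flattened to a single line.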