Step 1 - identifying the pattern

This second example comes from a web administrator’s daily work. Suppose you need to match the lines of an Apache access.log and transfer the records into a database on your server. The following short snippet, captured by our server, serves as the example content:

70.242.222.162 - - [01/Jan/2005:21:57:28 +0100] "GET /unitminer/ HTTP/1.1" 200 26080 "http://www.webradev.com/?p=CustomDev" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; YPC 3.0.1; .NET CLR 1.1.4322; yplus 4.1.00b)"

70.242.222.162 - - [01/Jan/2005:21:57:29 +0100] "GET /css/test.css HTTP/1.1" 200 5651 "http://www.unitminer.com/unitminer/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; YPC 3.0.1; .NET CLR 1.1.4322; yplus 4.1.00b)"

70.242.222.162 - - [01/Jan/2005:21:57:33 +0100] "GET /img/qu_logo.png HTTP/1.1" 200 3731 "http://www.unitminer.com/unitminer/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; YPC 3.0.1; .NET CLR 1.1.4322; yplus 4.1.00b)"

 

As a first step, notice that all lines in the file share the same structure. The log format is described in the Apache documentation:

%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"

Where:
%h - IP address of the client
%l - Identity of the client as reported by identd; a hyphen in the output indicates that the requested piece of information is not available
%u - The userid of the person requesting the document, as determined by HTTP authentication
%t - The time of the request, in the form [day/month/year:hour:minute:second zone]
%r - The request line from the client
%>s - The status code that the server sends back to the client
%b - The size of the object returned to the client, not including the response headers
%{Referer}i - The HTTP referrer, that is, the page the client reports having come from
%{User-agent}i - The identifying information that the client browser reports about itself
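To make the mapping between directives and values concrete, here is a rough Python sketch (Python is used only for illustration, it is not part of UnitMiner). It splits the first sample line into nine tokens and pairs each with its directive. The names DIRECTIVES, TOKEN and split_fields are made up for this example, and the tokenizer assumes that quoted fields contain no escaped quotes.

```python
import re

# Directives of the log format, in the order they appear on each line.
# (The HTTP header itself is spelled "Referer".)
DIRECTIVES = ["%h", "%l", "%u", "%t", "%r", "%>s", "%b",
              "%{Referer}i", "%{User-agent}i"]

# A token is a [bracketed] span, a "quoted" span, or a run of non-blanks.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

def split_fields(line):
    """Return a dict mapping each directive to its raw value in `line`."""
    tokens = [t.strip('[]"') for t in TOKEN.findall(line)]
    if len(tokens) != len(DIRECTIVES):
        raise ValueError("unexpected number of fields: %d" % len(tokens))
    return dict(zip(DIRECTIVES, tokens))

SAMPLE = ('70.242.222.162 - - [01/Jan/2005:21:57:28 +0100] '
          '"GET /unitminer/ HTTP/1.1" 200 26080 '
          '"http://www.webradev.com/?p=CustomDev" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '
          'YPC 3.0.1; .NET CLR 1.1.4322; yplus 4.1.00b)"')

fields = split_fields(SAMPLE)
for directive in DIRECTIVES:
    print(directive, "=>", fields[directive])
```

The two hyphens in the sample line end up as the values of %l and %u, which shows that neither identd information nor an authenticated user was available for this request.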

We will use the <Section While> structure, which repeats execution of the sub-elements inside the section as long as each of them evaluates to true. Let’s write a small framework for our script (the best approach is usually to copy it from an existing script):

#define main section of script
<Section>
    #define name of section
    Name AccessLog

    # load content from file on disk
    <Action ContentFile>

        #load content from following file
        File c:/temp/access.log
    </Action>

    <Section While>

      #later we will put code that matches and processes log row here

    </Section>
</Section>

#start execution of Section with name AccessLog
Main AccessLog


Step 2 - adding matching pattern

Now we add a pattern that matches one line of the access log. The pattern goes inside the <Section While> loop:


<Section>
    Name AccessLog

    # load content from file on disk
    <Action ContentFile>
        #load content from following file
        File c:/temp/access.log
    </Action>

    <Section While>

        #match one line inside log file with following pattern
        <Pattern>

            Name LogRow
            RegExp ^{$client_ip:regexp(\S+)} \
                    {$ident:regexp(\S+)} \
                    {$userid:regexp(\S+)} \
                    [{$date:regexp(([^:]+):(\d+:\d+:\d+) ([^\]]+))}] \
                    "{$request:regexp(.+?)}" \
                    {$status:regexp(\S+)} \
                    {$size:regexp(\S+)} \
                    "{$referer:regexp(.+?)}" \
                    "{$client_type:regexp(.+?)}"{:regexp(\s+)}
        </Pattern>    

    </Section>

</Section>

#start execution of Section with name AccessLog
Main AccessLog
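If you want to check the regular expressions before running the script, the pattern can be approximated in plain Python. This is only a sketch: the named groups mirror the script variables, but the {$name:regexp(...)} syntax is replaced by Python's (?P<name>...) groups, and the trailing whitespace is made optional so that a single line without a newline also matches. LOG_ROW and SAMPLE are names chosen for this example.

```python
import re

# Rough Python counterpart of the LogRow pattern; the group names
# mirror the script variables ($client_ip, $ident, ...).
LOG_ROW = re.compile(
    r'^(?P<client_ip>\S+) '
    r'(?P<ident>\S+) '
    r'(?P<userid>\S+) '
    r'\[(?P<date>[^:]+:\d+:\d+:\d+ [^\]]+)\] '
    r'"(?P<request>.+?)" '
    r'(?P<status>\S+) '
    r'(?P<size>\S+) '
    r'"(?P<referer>.+?)" '
    r'"(?P<client_type>.+?)"\s*$'
)

SAMPLE = ('70.242.222.162 - - [01/Jan/2005:21:57:28 +0100] '
          '"GET /unitminer/ HTTP/1.1" 200 26080 '
          '"http://www.webradev.com/?p=CustomDev" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '
          'YPC 3.0.1; .NET CLR 1.1.4322; yplus 4.1.00b)"')

match = LOG_ROW.match(SAMPLE)
if match:
    # Show every captured variable with its value.
    for name, value in match.groupdict().items():
        print(f"{name}: {value}")
```

The quoted fields use the non-greedy .+? so that each one stops at the nearest closing quote that still lets the rest of the line match.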

Step 3 - displaying the matched data

In the previous step we defined the matching pattern, but we do not see any result yet. So we will create an action that prints the matched variables to the default output.

Any text can be printed with <Action Print>. The text to print is given as the value of the Text attribute.


<Section>

    Name AccessLog

    # load content from file on disk
    <Action ContentFile>
        #load content from following file
        File c:/temp/access.log
    </Action>

    <Section While>

        #match one line inside log file with following pattern
        <Pattern>

            Name LogRow
            RegExp ^{$client_ip:regexp(\S+)} \
                    {$ident:regexp(\S+)} \
                    {$userid:regexp(\S+)} \
                    [{$date:regexp(([^:]+):(\d+:\d+:\d+) ([^\]]+))}] \
                    "{$request:regexp(.+?)}" \
                    {$status:regexp(\S+)} \
                    {$size:regexp(\S+)} \
                    "{$referer:regexp(.+?)}" \
                    "{$client_type:regexp(.+?)}"{:regexp(\s+)}
        </Pattern>

        #print what we matched in pattern
        <Action Print>
            Text {$client_ip}, {$ident}, {$userid}, \
                    {$date}, {$request}, {$status}, {$size}, \
                    {$referer}, {$client_type}<br>

        </Action>


    </Section>
</Section>

#start execution of Section with name AccessLog
Main AccessLog

The defined action will print the matched data to standard output.
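The loop formed by the pattern and the print action can be imitated in Python as follows. This is a sketch under the assumption that the While section keeps consuming the content row by row and stops as soon as the pattern no longer matches. The function name print_rows is hypothetical, and the user agent strings are shortened for brevity.

```python
import re

# Same idea as the LogRow pattern; MULTILINE lets ^ match at each row start.
ROW = re.compile(
    r'^(?P<client_ip>\S+) (?P<ident>\S+) (?P<userid>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>.+?)" '
    r'(?P<status>\S+) (?P<size>\S+) '
    r'"(?P<referer>.+?)" "(?P<client_type>.+?)"\s*',
    re.MULTILINE,
)

def print_rows(content):
    """Consume `content` row by row; return the number of rows printed."""
    pos = 0
    count = 0
    while True:
        match = ROW.match(content, pos)
        if match is None:      # pattern failed: leave the loop
            break
        print(", ".join(match.group(
            "client_ip", "ident", "userid", "date", "request",
            "status", "size", "referer", "client_type")))
        pos = match.end()      # continue right after the matched row
        count += 1
    return count

CONTENT = (
    '70.242.222.162 - - [01/Jan/2005:21:57:28 +0100] '
    '"GET /unitminer/ HTTP/1.1" 200 26080 '
    '"http://www.webradev.com/?p=CustomDev" "Mozilla/4.0"\n'
    '\n'
    '70.242.222.162 - - [01/Jan/2005:21:57:29 +0100] '
    '"GET /css/test.css HTTP/1.1" 200 5651 '
    '"http://www.unitminer.com/unitminer/" "Mozilla/4.0"\n'
)

rows = print_rows(CONTENT)
```

Because the pattern ends by swallowing trailing whitespace, blank lines between log entries are skipped and each new match starts at the beginning of the next row.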

Step 4 - inserting the matched data to the database

The matched data are now parsed from our access.log and displayed on the default output. As the last step we will also store the data from access.log in a database. Let’s say we have a MySQL database named server_stat containing a table named accesslog with the following columns: clientip, request_time, request and created.

To insert data into the database we can use the <Action SaveDbRow> tag:

<Section>
    Name AccessLog

    # load content from file on disk
    <Action ContentFile>
        #load content from following file
        File c:/temp/access.log
    </Action>

    <Section While>

        #match one line inside log file with following pattern
        <Pattern>
            Name LogRow
            RegExp ^{$client_ip:regexp(\S+)} \
                    {$ident:regexp(\S+)} \
                    {$userid:regexp(\S+)} \
                    [{$date:regexp(([^:]+):(\d+:\d+:\d+) ([^\]]+))}] \
                    "{$request:regexp(.+?)}" \
                    {$status:regexp(\S+)} \
                    {$size:regexp(\S+)} \
                    "{$referer:regexp(.+?)}" \
                    "{$client_type:regexp(.+?)}"{:regexp(\s+)}
        </Pattern>

        #store matched data also to database table
        <Action SaveDbRow>
            # Optional tells the system that if database insert 
            # fails, script can continue execution
            Optional

            #defines name of action
            Name save-to-db
        
            #database server name
            Server localhost

            #database type
            DBType mysql
            
            #Database name
            Database server_stat

            Username root
            Password root

            #store data to table with name accesslog            
            TableName accesslog

            #define mapping between column names and matched variable names
            ColumnDef clientip, $client_ip
            ColumnDef request_time, $date
            ColumnDef request, $request

            # predefined variable $_NOW will return current datetime
            ColumnDef created, $_NOW
        </Action>


    </Section>
</Section>

#start execution of Section with name AccessLog
Main AccessLog

The defined action inserts each matched row into the accesslog table.
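For comparison, the same column mapping can be sketched in Python. Several things here are assumptions made for the sake of a self-contained example: an in-memory SQLite database stands in for the MySQL server_stat database, the column types are guessed (the tutorial gives only the column names), and the Apache timestamp is converted to ISO format before it is stored. The function name save_row is hypothetical.

```python
import sqlite3
from datetime import datetime

def save_row(conn, client_ip, date, request):
    """Insert one matched log row; mirrors the ColumnDef mapping."""
    # Apache timestamps look like 01/Jan/2005:21:57:28 +0100.
    request_time = datetime.strptime(date, "%d/%b/%Y:%H:%M:%S %z")
    conn.execute(
        "INSERT INTO accesslog (clientip, request_time, request, created) "
        "VALUES (?, ?, ?, ?)",
        (client_ip, request_time.isoformat(), request,
         # plays the role of the predefined $_NOW variable
         datetime.now().isoformat(timespec="seconds")),
    )

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accesslog ("
    "clientip TEXT, request_time TEXT, request TEXT, created TEXT)"
)

save_row(conn, "70.242.222.162", "01/Jan/2005:21:57:28 +0100",
         "GET /unitminer/ HTTP/1.1")
conn.commit()

stored = conn.execute(
    "SELECT clientip, request_time, request FROM accesslog").fetchall()
print(stored)
```

Passing the values as query parameters rather than building the SQL string by hand matters here, because the request line and referrer come straight from clients and may contain quotes.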

© 2004-2012 QualityUnit.com, All rights reserved