Web Extraction Tutorial 2 - parsing access.log
In this second example we look at a simple task from a web administrator's daily work. Imagine that you need to match the lines of an Apache access.log and transfer these records into a database on your server.
Here is a short snippet from an access log captured by our server; we will use it as the example content:
70.242.222.162 - - [01/Jan/2005:21:57:28 +0100] "GET /unitminer/ HTTP/1.1" 200 26080 "http://www.webradev.com/?p=CustomDev" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; YPC 3.0.1; .NET CLR 1.1.4322; yplus 4.1.00b)"
70.242.222.162 - - [01/Jan/2005:21:57:29 +0100] "GET /css/test.css HTTP/1.1" 200 5651 "http://www.unitminer.com/unitminer/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; YPC 3.0.1; .NET CLR 1.1.4322; yplus 4.1.00b)"
70.242.222.162 - - [01/Jan/2005:21:57:33 +0100] "GET /img/qu_logo.png HTTP/1.1" 200 3731 "http://www.unitminer.com/unitminer/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; YPC 3.0.1; .NET CLR 1.1.4322; yplus 4.1.00b)"
First, notice that all lines in the file have the same structure.
The log format is described in the Apache documentation:
%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
Where:
%h - IP address of the client
%l - The identity of the client as reported by identd; a hyphen in the output indicates that the requested piece of information is not available
%u - The userid of the person requesting the document, as determined by HTTP authentication
%t - The time that the server finished processing the request
%r - The request line from the client
%>s - The status code that the server sends back to the client
%b - The size of the object returned to the client, not including the response headers
%{Referer}i - The HTTP Referer request header (the header name is spelled with a single "r")
%{User-agent}i - The identifying information that the client browser reports about itself
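To make the field layout concrete, here is a minimal Python sketch (not part of the extraction tool) that splits one line of the sample log into the nine fields listed above. The group names and the parse_line helper are illustrative assumptions, and the pattern assumes that quoted fields contain no embedded double quotes.

```python
import re

# Hypothetical Python counterpart of the log format above; the group names
# are illustrative and not defined by Apache or by the extraction tool.
LOG_LINE = re.compile(
    r'^(?P<client_ip>\S+) '       # %h
    r'(?P<ident>\S+) '            # %l
    r'(?P<userid>\S+) '           # %u
    r'\[(?P<date>[^\]]+)\] '      # %t
    r'"(?P<request>[^"]*)" '      # %r
    r'(?P<status>\d{3}) '         # %>s
    r'(?P<size>\S+) '             # %b
    r'"(?P<referer>[^"]*)" '      # %{Referer}i
    r'"(?P<client_type>[^"]*)"$'  # %{User-agent}i
)

def parse_line(line):
    """Return a dict of the nine fields, or None if the line does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None

sample = ('70.242.222.162 - - [01/Jan/2005:21:57:28 +0100] '
          '"GET /unitminer/ HTTP/1.1" 200 26080 '
          '"http://www.webradev.com/?p=CustomDev" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '
          'YPC 3.0.1; .NET CLR 1.1.4322; yplus 4.1.00b)"')
fields = parse_line(sample)
print(fields["client_ip"], fields["status"], fields["size"])
```

Every field is returned as a string; converting the status or size to a number is left to the caller, because %b may be a hyphen when no body was sent.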
We will use the <Section While> structure, which repeats the execution of the sub-elements inside the section as long as each of them evaluates to true.
Let’s write a small skeleton for our script (the best approach is usually to copy it from an existing script):
#define main section of script
<Section>
#define name of section
Name AccessLog
# load content from file on disk
<Action ContentFile>
#load content from following file
File c:/temp/access.log
</Action>
<Section While>
#later we will put code that matches and processes log row here
</Section>
</Section>
#start execution of Section with name AccessLog
Main AccessLog
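The looping behaviour of a While section can be pictured with a short Python sketch. This is only an analogy: the run_while helper, the stand-in ROW pattern, and the way the current position in the content is tracked are assumptions made for illustration, not a description of how the tool works internally.

```python
import re

# Stand-in pattern: one whole line plus the newline characters that end it.
ROW = re.compile(r'([^\n]+)\n+')

def run_while(content, pattern):
    """Repeat matching at the current position until the pattern fails."""
    position = 0
    rows = []
    while True:
        match = pattern.match(content, position)
        if match is None:        # the sub-element evaluated to false: stop looping
            break
        rows.append(match.group(1))
        position = match.end()   # continue right after the consumed text
    return rows, position

content = "first row\nsecond row\nthird row\n"
rows, consumed = run_while(content, ROW)
print(rows)
```

The loop stops as soon as the pattern no longer matches, which here happens once the whole content has been consumed.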
Step 2 - adding matching pattern
Now we add a pattern that matches one line of the access log. We put this pattern inside the <Section While> loop.
<Section>
Name AccessLog
# load content from file on disk
<Action ContentFile>
#load content from following file
File c:/temp/access.log
</Action>
<Section While>
#match one line inside log file with following pattern
<Pattern>
Name LogRow
RegExp ^{$client_ip:regexp(\S+)} \
{$ident:regexp(\S+)} \
{$userid:regexp(\S+)} \
[{$date:regexp(([^:]+):(\d+:\d+:\d+) ([^\]]+))}] \
"{$request:regexp(.+?)}" \
{$status:regexp(\S+)} \
{$size:regexp(\S+)} \
"{$referer:regexp(.+?)}" \
"{$client_type:regexp(.+?)}"{:regexp(\s+)}
</Pattern>
</Section>
</Section>
#start execution of Section with name AccessLog
Main AccessLog
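The $date variable above is matched with three nested groups: the day, the time, and the time zone offset. As a hedged illustration of what those groups capture, the following Python sketch applies the same sub-pattern to a timestamp from the sample log and converts it into a datetime value. The parse_log_date helper is hypothetical, and the conversion assumes English month abbreviations.

```python
import re
from datetime import datetime

# Same three groups as the sub-pattern inside $date: day, time, zone offset.
DATE_PARTS = re.compile(r'([^:]+):(\d+:\d+:\d+) ([^\]]+)')

def parse_log_date(raw):
    """Split an access-log timestamp and convert it to an aware datetime."""
    match = DATE_PARTS.fullmatch(raw)
    if match is None:
        raise ValueError(f"unexpected timestamp: {raw!r}")
    day, clock, zone = match.groups()
    # %b depends on the locale; English month names are assumed here.
    return datetime.strptime(f"{day} {clock} {zone}", "%d/%b/%Y %H:%M:%S %z")

stamp = parse_log_date("01/Jan/2005:21:57:28 +0100")
print(stamp.isoformat())  # prints 2005-01-01T21:57:28+01:00
```

A conversion like this is useful when the target database column has a date type, since the raw log timestamp is not in a format that databases usually accept directly.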
Step 3 - displaying the matched data
In the previous step we defined the matching pattern, but we do not see any result yet. So we will create an action that prints some of the matched variables to the default output.
Any text can be printed with <Action Print>. The text to print is given as the value of the Text attribute.
<Section>
Name AccessLog
# load content from file on disk
<Action ContentFile>
#load content from following file
File c:/temp/access.log
</Action>
<Section While>
#match one line inside log file with following pattern
<Pattern>
Name LogRow
RegExp ^{$client_ip:regexp(\S+)} \
{$ident:regexp(\S+)} \
{$userid:regexp(\S+)} \
[{$date:regexp(([^:]+):(\d+:\d+:\d+) ([^\]]+))}] \
"{$request:regexp(.+?)}" \
{$status:regexp(\S+)} \
{$size:regexp(\S+)} \
"{$referer:regexp(.+?)}" \
"{$client_type:regexp(.+?)}"{:regexp(\s+)}
</Pattern>
#print what we matched in pattern
<Action Print>
Text {$client_ip}, {$ident}, {$userid}, \
{$date}, {$request}, {$status}, {$size}, \
{$referer}, {$client_type}<br>
</Action>
</Section>
</Section>
#start execution of Section with name AccessLog
Main AccessLog
The defined action will print the matched data to standard output.
Step 4 - inserting the matched data to the database
The matched data are now parsed from our access.log and displayed on the default output. As the last step we will also store the data from access.log in a database. Let’s say that we have a MySQL database named server_stat containing a table named accesslog with the following columns: clientip, request_time, request and created.
To insert data into the database we can use the <Action SaveDbRow> tag.
<Section>
Name AccessLog
# load content from file on disk
<Action ContentFile>
#load content from following file
File c:/temp/access.log
</Action>
<Section While>
#match one line inside log file with following pattern
<Pattern>
Name LogRow
RegExp ^{$client_ip:regexp(\S+)} \
{$ident:regexp(\S+)} \
{$userid:regexp(\S+)} \
[{$date:regexp(([^:]+):(\d+:\d+:\d+) ([^\]]+))}] \
"{$request:regexp(.+?)}" \
{$status:regexp(\S+)} \
{$size:regexp(\S+)} \
"{$referer:regexp(.+?)}" \
"{$client_type:regexp(.+?)}"{:regexp(\s+)}
</Pattern>
#store matched data also to database table
<Action SaveDbRow>
# Optional tells the system that if database insert
# fails, script can continue execution
Optional
#defines name of action
Name save-to-db
#database server name
Server localhost
#database type
DBType mysql
#Database name
Database server_stat
Username root
Password root
#store data to table with name accesslog
TableName accesslog
#define mapping between column names and matched variable names
ColumnDef clientip, $client_ip
ColumnDef request_time, $date
ColumnDef request, $request
# predefined variable $_NOW will return current datetime
ColumnDef created, $_NOW
</Action>
</Section>
</Section>
#start execution of Section with name AccessLog
Main AccessLog
The defined action will insert each matched row into the accesslog table of the server_stat database.
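For readers who want to see the column mapping in a general-purpose language, here is a rough Python sketch of the same idea. It uses an in-memory SQLite database as a stand-in for MySQL so that it runs without a server. The save_db_row helper and its optional flag are hypothetical names that imitate the ColumnDef and Optional settings; they are not the tool's API.

```python
import sqlite3
from datetime import datetime

# Mapping that mirrors the ColumnDef lines: column name -> matched variable.
COLUMN_DEF = {"clientip": "client_ip", "request_time": "date", "request": "request"}

def save_db_row(conn, table, row, optional=True):
    """Insert one matched row; with optional=True a failed insert is skipped."""
    values = {col: row[var] for col, var in COLUMN_DEF.items()}
    # Counterpart of $_NOW: the current date and time.
    values["created"] = datetime.now().isoformat(sep=" ", timespec="seconds")
    columns = ", ".join(values)
    marks = ", ".join("?" for _ in values)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f"INSERT INTO {table} ({columns}) VALUES ({marks})",
                         list(values.values()))
        return True
    except sqlite3.Error:
        if optional:
            return False  # like Optional: report the failure and carry on
        raise

conn = sqlite3.connect(":memory:")  # SQLite stand-in for the MySQL server
conn.execute("CREATE TABLE accesslog "
             "(clientip TEXT, request_time TEXT, request TEXT, created TEXT)")
row = {"client_ip": "70.242.222.162",
       "date": "01/Jan/2005:21:57:28 +0100",
       "request": "GET /unitminer/ HTTP/1.1"}
saved = save_db_row(conn, "accesslog", row)
print(saved, conn.execute("SELECT COUNT(*) FROM accesslog").fetchone()[0])
```

The values are passed as query parameters rather than pasted into the SQL text, so quotes inside a request line or user agent cannot break the statement.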