
Example - news from the Chinese WALL STREET JOURNAL

News items (linked URL, title, and description) from the left column of http://chinese.wsj.com/gb/strhrd.asp are extracted into an HTML file. The output is refreshed every 15 minutes.

Output:

Script source code:

# File: expekt_main.w
# Name: WALL STREET JOURNAL in Chinese
# Description: retrieves all news from the left column at http://chinese.wsj.com/gb/strhrd.asp and writes them to an HTML file
# Input: URL [http://chinese.wsj.com/gb/strhrd.asp]
# Output format: HTML file
# Output fields: linked url, title, text(description)

#<Logger File>
#    Global
#    FileName wsj_log.log
#    # log all messages up to debug messages
#    Level debug
#</Logger>

<Section>
    Name wsj_main
	
    # define name of output file
    Define $output_file wsj_output.html

    # load content    
    <Action ContentURL>
        URL http://chinese.wsj.com/gb/strhrd.asp
        RemoveNewLine
        AutoRetryHTTPErrors 502
        AutoRetryNoContent 10 10
    </Action>
	
    <Action Php>
        Code $context->setVariable('$output', $context->getVariable('$output')\
            .'<html><head><meta http-equiv="Content-Type" content="text/html; charset=GB2312"></head><body>\n'); 
    </Action>
	
    <Pattern>
        RegExp <div id="t2lnews2">
    </Pattern>
    
    # loop over all news items in the left column
    <Section While>
        EndAt <div id="top2right">
        NoContext
		
        # pattern for linked url
        <Pattern>
            RegExp <a href="{$relative_link_url:re([^"]*)}" target=_blank><img style
        </Pattern>
		
        # pattern of title
        <Pattern>
            Optional
            RegExp <span style="font-weight:bolder;">{$title:re([^<]*)}</span></div></a>
        </Pattern>
		
        # pattern of text under title
        <Pattern>
            Optional
            RegExp <div style="font-size:14px;line-height:140%;padding-bottom:10px;\
                   text-decoration:none;color:#555;  font-weight:normal;">{$text:re([^<]*)}<span
        </Pattern>
        
        <Action Php>
            Code $context->setVariable('$output', $context->getVariable('$output')\
                .'<a href="http://chinese.wsj.com/gb/'.$context->getVariable('$relative_link_url')\
                .'"><b>'.$context->getVariable('$title').'</b></a> - '\
                .$context->getVariable('$text').' \n<br/><br/>'); 
        </Action>
    </Section> 

    <Action Php>
        Code $context->setVariable('$output', $context->getVariable('$output')\
            .'</body></html>\n'); 
    </Action>	

    # clean output file
    <Action Print>
        FileName {$output_file}
        FileMode Write  
    </Action>

    # write result to output file
    <Action Print>
        FileName {$output_file}
        Text {$output}
    </Action>

</Section>

Main wsj_main    
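
For readers who do not know the script syntax, the extraction logic above can be approximated in Python. This is only a rough sketch under stated assumptions, not how UnitMiner works internally: it assumes the While section repeats its patterns until the EndAt marker is reached, and that an Optional pattern leaves its variable empty when it does not match. The function names `extract_news` and `render` are hypothetical, and the description regex is shortened to the tail of the original style attribute.

```python
import re

BASE_URL = "http://chinese.wsj.com/gb/"

# Regular expressions mirroring the three <Pattern> blocks of the script.
LINK_RE = re.compile(r'<a href="([^"]*)" target=_blank><img style')
TITLE_RE = re.compile(r'<span style="font-weight:bolder;">([^<]*)</span></div></a>')
# Assumption: shortened to the end of the original style attribute.
TEXT_RE = re.compile(r'font-weight:normal;">([^<]*)<span')


def extract_news(page):
    """Return a list of dicts (url, title, text) for the left news column."""
    start = page.find('<div id="t2lnews2">')
    if start == -1:
        return []
    end = page.find('<div id="top2right">', start)
    region = page[start:end] if end != -1 else page[start:]

    links = list(LINK_RE.finditer(region))
    items = []
    for i, link in enumerate(links):
        # An item's title and text must appear before the next item's link.
        chunk_end = links[i + 1].start() if i + 1 < len(links) else len(region)
        chunk = region[link.end():chunk_end]
        title = TITLE_RE.search(chunk)
        text = TEXT_RE.search(chunk)
        items.append({
            "url": BASE_URL + link.group(1),
            "title": title.group(1) if title else "",  # optional pattern
            "text": text.group(1) if text else "",     # optional pattern
        })
    return items


def render(items):
    """Build the same kind of HTML page that the script accumulates in $output."""
    parts = ['<html><head><meta http-equiv="Content-Type" '
             'content="text/html; charset=GB2312"></head><body>\n']
    for item in items:
        parts.append('<a href="%s"><b>%s</b></a> - %s \n<br/><br/>'
                     % (item["url"], item["title"], item["text"]))
    parts.append('</body></html>\n')
    return "".join(parts)
```

Downloading the page, retrying on HTTP errors, and writing the result to wsj_output.html are left out of the sketch; the script handles those steps with its ContentURL and Print actions.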
© 2010 QualityUnit.com, All rights reserved