Web-Harvest scraping tool - dayang2001911 - JavaEye technical site

Source: Baidu Wenku  Editor: 偶看新闻  Time: 2024/05/05 05:29:50

Web-Harvest scraping tool

Category: Internet



Overview

 

This section describes the motivation, notions and concepts behind Web-Harvest.

Rationale

 

The World Wide Web, though by far the largest knowledge base, is rarely regarded as a database in the traditional sense, that is, as a source of information used for further computing. Web-Harvest is inspired by the practical need to have the right data at the right time. Very often, the Web is the only source that publicly provides the wanted information.

Basic concept

 

The main goal behind Web-Harvest is to empower the use of already existing extraction technologies. Its purpose is not to propose a new method, but to provide a way to easily use and combine the existing ones. Web-Harvest offers a set of processors for data handling and control flow. Each processor can be regarded as a function: it has zero or more input parameters and gives a result after execution. Processors can be combined in a pipeline, forming a chain of execution. For easier manipulation and data reuse, Web-Harvest provides a variable context where named variables are stored. The following diagram describes one pipeline execution:

 

 

 

The result of an extraction is available in the files created during execution or, if Web-Harvest is used programmatically, from the variable context.
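
The idea of processors as functions chained into a pipeline, with results stored under names in a variable context, can be sketched in plain Java. This is only a conceptual illustration: the class PipelineSketch, its methods and the three stand-in processors are hypothetical and are not part of the Web-Harvest API, and the "download" step does no network access.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Conceptual sketch only: hypothetical names, NOT the Web-Harvest API.
class PipelineSketch {

    // A processor is modelled as a function; a pipeline chains processors in order.
    @SafeVarargs
    static <T> Function<T, T> pipeline(Function<T, T>... processors) {
        Function<T, T> chain = Function.identity();
        for (Function<T, T> processor : processors) {
            chain = chain.andThen(processor);
        }
        return chain;
    }

    // Runs a made-up three-step pipeline and stores the result as a named variable.
    static Map<String, Object> run(String url) {
        // Stand-in for a download step: no network access is performed.
        Function<String, String> download = u -> "  <html>" + u + "</html>  ";
        Function<String, String> clean = String::trim;
        Function<String, String> wrap = s -> "<result>" + s + "</result>";

        Map<String, Object> context = new LinkedHashMap<>(); // the "variable context"
        context.put("page", pipeline(download, clean, wrap).apply(url));
        return context;
    }
}
```

Calling PipelineSketch.run("http://example.com") returns a context whose "page" entry holds the output of the last processor in the chain, which mirrors how a caller would read a result back from the variable context.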

Configuration language

 

Every extraction process is defined in one or more configuration files, using a simple XML-based language. Each processor is described by a specific XML element or a structure of XML elements. For illustration, here is an example of a configuration file:

 

 

 

This configuration contains two pipelines. The first pipeline performs the following steps:

  1. The HTML content at http://news.bbc.co.uk is downloaded,
  2. HTML cleaning is performed on the downloaded content, producing XHTML,
  3. An XPath expression is evaluated, giving the sequence of URLs of the page images,
  4. A new variable named "urlList" is defined, containing the sequence of image URLs.

 

The second pipeline uses the result of the previous execution in order to collect all page images:

  1. The loop processor iterates over the URL sequence and, for every item:
       1. downloads the image at the current URL,
       2. stores the image on the file system.

 

 

This example illustrates some procedural-language elements of Web-Harvest, such as variable definition and list iteration, a few data management processors (file and http), and a couple of HTML/XML processing instructions (the html-to-xml and xpath processors).

For a slightly more complex example of image download, where some other features of Web-Harvest are used, see the Examples page. For technical coverage of the supported processors, see the User manual.

Data values

All data produced and consumed during the extraction process in Web-Harvest have three representations: text, binary and list. There is also a special data value, empty, whose textual representation is an empty string, binary representation an empty byte array, and list representation a zero-length list. Which form of the data is used depends on the processor that consumes it. In the previous configuration, the html-to-xml processor uses the downloaded content as text in order to transform it into XHTML, the loop processor uses the variable urlList as a list in order to iterate over it, and the file processor treats the downloaded images as binary data when saving them to files. In most cases Web-Harvest chooses the proper representation of the data. In some situations, however, it must be explicitly stated which one to use. One example is the file processor, whose default data type is text, so binary content must be explicitly specified with type="binary".
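
To make the three representations concrete, here is a minimal Java sketch of a value that can be read as text, as binary or as a list, including the special empty value. The class DataValueSketch is hypothetical and does not reproduce Web-Harvest's own variable classes. Two details are assumptions of the sketch rather than statements from the text: text is encoded as UTF-8, and a non-empty single value is exposed as a one-element list.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;

// Conceptual sketch only: a hypothetical value type, NOT Web-Harvest's variable classes.
class DataValueSketch {
    private final byte[] bytes; // null marks the special "empty" value

    private DataValueSketch(byte[] bytes) {
        this.bytes = bytes;
    }

    // Assumption of this sketch: text is stored as UTF-8 bytes.
    static DataValueSketch ofText(String text) {
        return new DataValueSketch(text.getBytes(StandardCharsets.UTF_8));
    }

    static DataValueSketch empty() {
        return new DataValueSketch(null);
    }

    // Text representation: an empty string for the empty value.
    String asText() {
        return bytes == null ? "" : new String(bytes, StandardCharsets.UTF_8);
    }

    // Binary representation: an empty byte array for the empty value.
    byte[] asBinary() {
        return bytes == null ? new byte[0] : bytes.clone();
    }

    // List representation: a zero-length list for the empty value,
    // otherwise (an assumption of this sketch) a single-element list.
    List<DataValueSketch> asList() {
        if (bytes == null) {
            return Collections.emptyList();
        }
        return Collections.singletonList(this);
    }
}
```

A consumer picks the accessor it needs, in the same way that each processor decides which representation of its input to use.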

Variables

Web-Harvest provides a variable context for storing and using variables. Unlike most programming languages, it imposes no special convention on variable names. Thus, names like arr[1], 100 or #$& are valid. However, if such variables were used in scripts or templates (see the next section), where expressions are dynamically evaluated, an exception would be thrown. It is therefore recommended to follow the usual programming-language naming rules in order to avoid any difficulties.

When Web-Harvest is used programmatically (from Java code, not from the command line), the variable context may be initially set by the user in order to add custom values and functionality. Similarly, after execution, the variable context is available for reading variables from it.

When user-defined functions are called (see the User manual), a separate local variable context is created (as in many programming languages, including Java). The valid way to exchange data between the caller and the called function is through the function parameters.
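
The scoping rule for user-defined functions can be illustrated with a small Java sketch: the called function receives a fresh local context that holds only its parameters, so a variable defined by the caller is not visible inside it. The class FunctionScopeSketch and the variable names "secret" and "pageUrl" are hypothetical and serve only this illustration. They are not part of Web-Harvest.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Conceptual sketch only: hypothetical names, NOT the Web-Harvest API.
class FunctionScopeSketch {

    // Calls a "user-defined function" with a fresh local context that
    // contains only the passed parameters, never the caller's variables.
    static Object call(Function<Map<String, Object>, Object> body,
                       Map<String, Object> params) {
        Map<String, Object> local = new HashMap<>(params);
        return body.apply(local);
    }

    // Returns { whether the function sees the caller's variable, the parameter value it sees }.
    static Object[] demo() {
        Map<String, Object> callerContext = new HashMap<>();
        callerContext.put("secret", "caller-only"); // deliberately not passed as a parameter

        Map<String, Object> params = new HashMap<>();
        params.put("pageUrl", "http://example.com");

        Object seesSecret = call(local -> local.containsKey("secret"), params);
        Object seesParam = call(local -> local.get("pageUrl"), params);
        return new Object[] { seesSecret, seesParam };
    }
}
```

In demo(), the function body cannot see "secret" because it was never passed in, while "pageUrl" arrives through the parameters, which is the only exchange channel the text describes.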

Scripting and templating

Before Web-Harvest 0.5, the templating mechanism was based on OGNL (Object-Graph Navigation Language). From version 0.5, OGNL is replaced by BeanShell, and starting from version 1.0, multiple scripting languages are supported, giving developers the freedom to choose their favourite one.

Besides the set of powerful text and XML manipulation processors, Web-Harvest supports real scripting languages, whose code can be easily integrated within scraper configurations. The languages currently supported are BeanShell, Groovy and Javascript. BeanShell is probably the closest to Java in syntax and power, but Groovy and Javascript have some other advantages. It is up to the developer to use the preferred language, or even to mix different languages in a single configuration.

Templating allows the evaluation of marked parts of the text (text "islands" surrounded by ${ and }). Evaluation is performed using the chosen scripting language. In Web-Harvest, all elements' attributes are implicitly passed to the templating engine. In the configuration above, there are two places where the templater does its job:

  • path="images/${i}.gif" in the file processor, producing file names based on the loop index,
  • url="${sys.fullUrl('http://news.bbc.co.uk', link)}" in the http processor, where built-in functionality is called to calculate the full URL of the image (see the User manual for all built-in objects).
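
The mechanics of replacing ${...} islands can be sketched in a few lines of Java. This is a deliberately reduced illustration: Web-Harvest evaluates each island with the chosen scripting language, whereas the hypothetical TemplateSketch.evaluate below only looks up plain variable names in a map, so an expression such as the sys.fullUrl call would not work here. Replacing unknown names with an empty string is also an assumption of the sketch.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Conceptual sketch only: hypothetical helper, NOT the Web-Harvest templating engine.
class TemplateSketch {
    // Matches one text island of the form ${...}.
    private static final Pattern ISLAND = Pattern.compile("\\$\\{([^}]*)\\}");

    // Replaces every island with the value of the named variable from the context.
    static String evaluate(String template, Map<String, ?> context) {
        Matcher matcher = ISLAND.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            Object value = context.get(matcher.group(1).trim());
            // Assumption of this sketch: unknown names become an empty string.
            String text = value == null ? "" : value.toString();
            matcher.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

With a context that maps i to 3, evaluating the path attribute value images/${i}.gif from the first bullet yields images/3.gif, and text outside the islands is copied unchanged.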