Information on Web data extraction techniques

Web data extraction is a widely used technique that harvests unstructured information from web pages and converts it into a structured format for various purposes. To navigate Deep Web sites and collect information from them automatically, researchers use web wrappers and data extraction techniques. Web wrappers are the simplest way to perform web research, but they have two drawbacks: their implementation is site-specific, and their source code requires regular updates to keep up with changes on the target website.
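
To make the idea of a site-specific wrapper concrete, here is a minimal sketch in Python using only the standard library. The sample markup, the class names "item", "name" and "price", and the names ItemWrapper and extract_items are assumptions made up for illustration; they do not come from any real site or tool.

```python
from html.parser import HTMLParser

# Hypothetical markup: the class names "item", "name" and "price" are assumptions.
SAMPLE_PAGE = """
<ul>
  <li class="item"><span class="name">Lamp</span><span class="price">19.99</span></li>
  <li class="item"><span class="name">Desk</span><span class="price">120.00</span></li>
</ul>
"""


class ItemWrapper(HTMLParser):
    """Site-specific wrapper: it only understands the markup shown above."""

    def __init__(self):
        super().__init__()
        self.records = []   # structured output, one dict per item
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "item":
            self.records.append({})      # a new item starts here
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class      # the next text node belongs to this field

    def handle_data(self, data):
        if self._field and self.records:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_items(html):
    """Turn the unstructured page into a list of structured records."""
    wrapper = ItemWrapper()
    wrapper.feed(html)
    return wrapper.records


items = extract_items(SAMPLE_PAGE)
```

Because the class names are hard-coded, the wrapper silently returns nothing if the site renames "item" to something else, which is exactly the maintenance burden described above.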

Today, commercial websites such as eBay and Amazon offer comprehensive development tools (called APIs) that allow their users to create simple applications that access their services. These services can also be used with mashup tools such as Yahoo Pipes and Google Mashup Editor. However, many websites do not offer APIs because they are designed only for use by human visitors.

Operating web wrappers requires developers with sufficient domain knowledge. Wrappers are site-specific services that depend heavily on the structure of the pages they access, so they need ongoing maintenance to keep up with changes on the accessed site. Modern web wrappers are evolving to address these drawbacks and become more user-friendly.

The latest web wrappers try to define a formal model of the accessed site's structure that can be used for semantic annotation of Deep Web sites. This allows a clear-cut representation of the site's navigation, along with tools for user development. The result is that wrappers are easy to implement against the site model rather than against the raw page markup, which reduces their coupling to the website's structure.
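
One way to picture a wrapper built against a site model is to keep the description of the site as data, separate from the extraction code. The sketch below is only an illustration under stated assumptions: the SITE_MODEL layout, the regular expressions, the sample pages and the extract function are all hypothetical, and real systems use much richer models than a pair of patterns.

```python
import re

# Hypothetical site model: it describes where records and fields live,
# and is kept separate from the extraction code.
SITE_MODEL = {
    "record": r'<li class="item">(.*?)</li>',
    "fields": {
        "name": r'<span class="name">(.*?)</span>',
        "price": r'<span class="price">(.*?)</span>',
    },
}

# Hypothetical page that matches the model above.
PAGE = (
    '<ul>'
    '<li class="item"><span class="name">Lamp</span><span class="price">19.99</span></li>'
    '<li class="item"><span class="name">Desk</span><span class="price">120.00</span></li>'
    '</ul>'
)


def extract(html, model):
    """Generic extractor: everything site-specific comes from the model."""
    records = []
    for block in re.findall(model["record"], html, flags=re.DOTALL):
        record = {}
        for field, pattern in model["fields"].items():
            match = re.search(pattern, block, flags=re.DOTALL)
            record[field] = match.group(1).strip() if match else None
        records.append(record)
    return records


# After a hypothetical redesign, only the model is edited; extract() is untouched.
REDESIGNED_PAGE = PAGE.replace("<li ", "<div ").replace("</li>", "</div>")
REDESIGNED_MODEL = dict(SITE_MODEL, record=r'<div class="item">(.*?)</div>')

before = extract(PAGE, SITE_MODEL)
after = extract(REDESIGNED_PAGE, REDESIGNED_MODEL)
```

The design choice here is that a change on the site leads to an edit of the model rather than of the code, so the people maintaining the model do not need to be the developers of the extractor.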

Basic factors to consider in web data extraction techniques:

  • The navigation model should support changes in the site's content, including content loaded dynamically by widely used client-side techniques such as Flash and AJAX.
  • The model should not interfere with the site's structure and must not require any change to it.
  • End users must collaborate in generating the site annotations, since they will not have access to the server afterwards.

Source: http://hubpages.com/hub/Information-on-Web-data-extraction-techniques
