Web Data Extraction and Wrapping Techniques

If your organization wants to design and develop a comprehensive information system, the first challenge you face is extracting data from the World Wide Web. The issues that arise include extraction, validation and management of the large amount of data available on the internet. This data is typically of low quality, with format mismatches and content mistakes that make things more difficult.

The most popular techniques in practice for effective web data extraction are regular expressions and wrappers. They offer flexible and scalable mechanisms to harvest the necessary data from various web resources such as directories, forums, blogs, etc. Because all these web sources are quite assorted, it is nearly impossible to build and maintain a huge database for business intelligence and market research purposes without them.

Wrappers are dedicated applications that automatically harvest data from online documents and store the information in a specified structured format. A wrapper application first downloads HTML pages from the internet, scans them for the data to extract and then stores this data in MS Excel, CSV, MySQL or another structured format to facilitate further refinement.
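The three steps described above (download, extract, store) can be sketched in Python using only the standard library. This is a minimal illustration, not a production wrapper: the function names, the sample HTML and the column names "name" and "city" are all assumptions made for the example, and the download step is defined but not called so the sketch runs without network access.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def download(url):
    """Step 1: fetch a page's HTML (defined for completeness, not called below)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class ListingParser(HTMLParser):
    """Step 2: collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._in_cell = False
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside a table cell is kept.
        if self._in_cell:
            self._row[-1] += data.strip()


def extract_rows(html):
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows, header):
    """Step 3: store the extracted rows in a structured format (CSV text)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical page content standing in for a downloaded directory listing.
SAMPLE_HTML = """
<table>
  <tr><td>Acme Tools</td><td>London</td></tr>
  <tr><td>Blue Widgets</td><td>Leeds</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
csv_text = to_csv(rows, ["name", "city"])
print(csv_text)
```

In a real run, the string passed to extract_rows would come from download(url), and the CSV text would be written to a file or loaded into a database for further refinement.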

The most common approach to building wrappers is manual, i.e. identify a set of patterns in the HTML by hand and then code the harvesting of particular data manually. However, this is a very inefficient technique, because a small modification in the database makes the wrapper fail in a big way.
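The fragility of a hand-written wrapper can be seen in a small Python sketch. It assumes the modification shows up as slightly different page markup. The pattern, the class names and both page snippets are hypothetical, chosen only to illustrate the failure.

```python
import re

# Hand-written pattern tied to one exact piece of markup (hypothetical).
PRICE_PATTERN = re.compile(r'<span class="price">\$(\d+\.\d{2})</span>')

original_page = '<div><span class="price">$19.99</span></div>'
# The same data after a small change: an extra class and a wrapping <b> tag.
redesigned_page = '<div><span class="price sale"><b>$19.99</b></span></div>'


def extract_price(html):
    """Return the price as a string, or None if the pattern no longer matches."""
    match = PRICE_PATTERN.search(html)
    return match.group(1) if match else None


before = extract_price(original_page)
after = extract_price(redesigned_page)
print(before, after)
```

The price is still on the page after the change, but the wrapper returns nothing for it, and it does so without raising an error, so the failure is easy to miss.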

A regular expression is an intuitive approach to discovering a pattern in particular data or information. A regular expression, or simply regex, is a convenient way for many text editors and programming languages to search and reuse text-based information. A wrapper comes with generic operators
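As a small illustration of pattern discovery with regular expressions, the Python sketch below pulls e-mail addresses and phone numbers out of free text. The sample text and both patterns are assumptions made for the example. The patterns are deliberately simple and would not cover every valid address or phone format.

```python
import re

# Hypothetical text, as it might appear on a scraped contact page.
TEXT = "Contact sales@example.com or support@example.org, phone 555-0100 or 555-0199."

# Character classes ([...]), quantifiers (+, {n}) and grouping ((?:...)) are
# the basic regex operators used here.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}-\d{4}\b")

emails = EMAIL.findall(TEXT)
phones = PHONE.findall(TEXT)
print(emails)
print(phones)
```

The same two patterns can be reused across many pages, which is what makes regular expressions convenient for harvesting text-based information.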

Find more details at http://www.outsourcingwebresearch.com/data-extraction.php

Source: Hubpages
