Monday 2 September 2013

How Your Online Information is Stolen - The Art of Web Scraping and Data Harvesting

Web scraping, also known as web or internet harvesting, involves the use of a computer program that extracts data from another program's display output. The main difference between standard parsing and web scraping is that the output being scraped is meant for display to human viewers rather than as input to another program.

Therefore, it generally isn't documented or structured for practical parsing. Web scraping usually requires ignoring binary data (typically images or other multimedia) and stripping out the formatting that would otherwise obscure the desired goal: the text data. In this sense, optical character recognition software is actually a form of visual scraper.

A transfer of data between two programs usually relies on data structures designed to be processed automatically by computers, saving people from having to do this tedious job themselves. These formats and protocols have rigid structures and are therefore easy to parse, well documented, compact, and designed to minimize duplication and ambiguity. In fact, they are so "computer-based" that they are often not even readable by humans.
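
To make the contrast concrete, here is a minimal Python sketch. The JSON payload, the display sentence, and the regular expression are all made up for illustration; they are assumptions, not taken from any real site or protocol. The structured record parses in one unambiguous step, while the human-oriented text needs a hand-written pattern that breaks as soon as the wording changes.

```python
import json
import re

# Hypothetical machine-oriented payload: rigid structure, trivial to parse.
machine_payload = '{"item": "Blue widget", "price": 4.99, "currency": "USD"}'
record = json.loads(machine_payload)

# Hypothetical human-oriented display text carrying the same information.
display_text = "Blue widget - only $4.99 while stocks last!"

# The pattern below is an assumption about how this one sentence is worded;
# a scraper has to guess at such conventions because nothing documents them.
match = re.search(r"^(?P<item>.+?) - .*?\$(?P<price>\d+\.\d{2})", display_text)
scraped = {"item": match.group("item"), "price": float(match.group("price"))}

print(record["item"], record["price"])
print(scraped["item"], scraped["price"])
```

Both print statements show the same item and price, but only the first would keep working if the seller reworded the sentence.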

When the data is available only in human-readable form, the only automated way to accomplish this kind of data transfer is scraping. Originally, this meant reading the text data from the display screen of a computer. It was usually accomplished by reading the memory of the terminal via its auxiliary port, or through a connection between one computer's output port and another computer's input port.

By extension, the term has come to describe parsing the HTML text of web pages. A web scraping program is designed to process the text data that is of interest to the human reader, while identifying and removing any unwanted data, images, and formatting used for the web design.
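
As a rough illustration of that idea, the sketch below uses Python's standard html.parser module to keep the visible text of a page and drop tags, images, scripts, and style rules. The class name TextExtractor, the helper extract_text, and the sample page are hypothetical and exist only for this example; a real scraper would also need to fetch the page and cope with far messier markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects human-readable text; ignores tags and script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Tags such as <img> or <b> carry no text themselves, so they are
        # simply dropped; only script/style need tracking to skip their bodies.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page used only to exercise the sketch.
sample = """
<html><head><title>Price list</title><style>p { color: red; }</style></head>
<body><h1>Widgets</h1><img src="widget.png" alt="a widget">
<p>Blue widget: <b>$4.99</b></p><script>track();</script></body></html>
"""
print(extract_text(sample))
```

The image, the style rule, and the script call are all discarded, leaving only the words a human visitor would have read.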

Though web scraping is often done for legitimate reasons, it is also frequently performed in order to swipe data of "value" from another person's or organization's website and reuse it on someone else's, or to sabotage the original text altogether. Webmasters are now putting many measures in place to prevent this form of theft and vandalism.



Source: http://ezinearticles.com/?How-Your-Online-Information-is-Stolen---The-Art-of-Web-Scraping-and-Data-Harvesting&id=923976
