Friday 3 July 2015

Web Scraping : To Scrape or Not to Scrape?

Web scraping is on the rise and its legality is being debated. The future of big data could hang in the balance.

I decided that I might want to stop writing about all these successful dot-com businesses and get into the act myself. I mean, how hard could it be, aside from the fact that I have no expertise in any particular vertical, no technological knowledge, and no money? That last was a bit of a problem because I was going to need big data, and big data doesn't come cheap.

So, one day I'm talking to a direct-tech wizard and he says, “Why don't you just find a business you want to be in, find the most successful company, go to their website, and scrape some of their data?”

Scrape? My dad was a house painter. I used to help him during summers. The only scraping I knew was done with a putty knife. But that's what God invented the Internet for. Google turned up endless Web scraping services and I went to one called Automation Anywhere.

Its homepage told me not to try scraping on my own, that I could pay them as little as $1,995 for a program that would have me scraping away in minutes with no programming expertise. A video showed me how. Suppose I wanted to locate all the assisted living facilities in Detroit? (Bedpans! There's a wide-open Web business!) Automation Anywhere showed me how I could request a data pattern—name, address, phone, and service area of targets—apply their program to a rehab facility listing, and minutes later be in possession of tidy customer list on a spreadsheet. A box popped up asking if I had any questions for a live account manager. I did.

“I'm interested in this, but is it totally legal?” I asked Shine, the account rep.

Shine was slow to respond and came back disappointingly noncommittal: “You need to install the software at your end. Hence, you will need to check at your end for legal documents for the website.”

Shine had obviously been reading the same European news sites that I had. Last year Irish airline Ryanair filed suit against PR Aviation, a Dutch airfare comparison site, charging it with copyright infringement and breach of contract for scraping flight data from its site. The Court of Justice of the European Union (ECJ) dismissed the suit, saying the scraping amounted to “normal use” of a website. However, the ECJ did potentially leave the door open for businesses with unprotected databases, such as Ryanair, to establish contractual limitations on use of their databases by third parties. That opening, should it be entered into by airlines, might have businesses such as Expedia, Orbitz, and Priceline reimagining their business plans.

And it could have any other businesses that load up third-party data files through scraping activities doing the same. The reality, however, is that that's a big “and.”

“The airlines have been halfway successful taking travel agents to court, but it can take five years and then they lose,” said Gus Cunningham, CEO of ScrapeSentry, one of only “three and a half” companies, in his words, that block scrapers from websites.

Professional scrapers are not only out-front and plentiful—as the Google search demonstrated—they're also nimble and expensive to chase away. Basically, Cunningham said, it's a matter of stopping scrapers at the website door among the airlines, e-coms, real estate sites, and online gambling companies that are scraped the most. Cunningham's company monitors inbound Web traffic and uses an analysis engine to block suspect visitors per parameters set down by clients. ScanSentry has a nine-year-old database that helps it identify bad actors, much like Interpol with its criminal database. Then a human element must enter the process.

Some of Cunningham's clients feed bogus information to competitors identified as scrapers to ferret them out. In many cases, though, they opt to turn their heads. “Airlines have some flights where they just want to get as many butts as they can in the seats, so they won't concentrate their blocking efforts on those. They'll concentrate on the routes that are always jammed,” Cunningham said.

Unlike botnets that steal money by, say, serving bogus websites to siphon off programmatic ad dollars, scraping is not overtly criminal. In the wide sphere of digital commerce, it's probably most common that the scrap-ee is also a scrap-er. How vigilant vulnerable industries become, and how protective courts and law enforcement agencies grow, will depend on how much scraping activity increases. Cunningham said it's growing fast. More than one fifth of visitors to client websites last year were scrapers, according to a ScanSentry study. Among travel companies, meanwhile, scrapers doubled from 15% in 2013 to 33% last year.

“And,” Cunningham noted, “It is stealing.”

Source: http://www.dmnews.com/direct-line-blog/to-scrape-or-not-to-scrape/article/422662/

No comments:

Post a Comment