Sunday, 30 November 2014

The Roots of Web Scraping and the Wisdom behind It

You may be wondering how data mining came into existence. This effective and innovative trend in business and research is commendable, and the thinking behind it deserves real credit. To get a clear view of the origin of web scraping, consider the following factors that contributed to the rise of data collection, or web scraping.

Foundations

Unlike many other innovations, data mining has no specific date that can be clearly pointed out as its birthdate. It came into existence as a result of several problem-solving processes in major data gathering and handling situations. It appears that cyber technology has opened a Pandora’s box of “anything can happen” experiences. Moreover, the shift from physical to virtual data collection has resulted in a bulk of data that needed to be organized, analyzed and utilized.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/roots-web-scraping-wisdom-behind/

Thursday, 27 November 2014

Scraping Online Communities for your Outreach Campaigns

Online communities offer a wealth of intelligence for blog owners and business owners alike.

Exploring the data within popular communities will help you to understand who the major influencers are, what content is popular and who are the key content aggregators within your niche.

This is all well and good to talk about, but is it feasible to manually sort through online communities to find this information? Probably not.

This is where data scraping comes in.

What is Scraping and What Can it do?

I’m not going to go into great detail on what data scraping actually means, but to simplify this, here’s a definition from the Wikipedia page:

    “Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program.”

Let me explain this with a little example…

Imagine a huge community full of individuals within your industry. Each person within the community has a personal profile page that contains information about their interests, contact details, social profiles, etc.

If you were tasked with gathering all of this data on all of the individuals then you might start hyperventilating at the thought of all the copying and pasting you’d need to do.

Well, an alternative is to scrape all of this content so that you can automate all of this process and easily export all of this information into a manageable, more consumable format in a matter of seconds. It’d be pretty awesome, right?

Luckily for you, I’m going to show you how to do just that!

The Example of Inbound.org

Recently, I wanted to gather a list of digital marketers that were fairly active on social media and shared a lot of content online within communities. These people were going to be some of my core targets to get content from the blog in front of.

To do this, I first found some active communities online where these types of individuals hang out. Being a digital marketer myself, this process was fairly easy and I chose Inbound.org as my starting place.

Scoping out Data Requirements
Each community is different and you’ll be able to gather varying information within each.

The place to look for this information is within the individual user profile pages. This is usually where the contact information or links to social media accounts are likely to be displayed.

For this particular exercise, I wanted to gather the following information:

    Full name
    Job title
    Company name and URL
    Location
    Personal website URL
    Twitter URL, handle and follower/following stats
    Google+ URL, follower count and list of contributor URLs
    Profile image URL
    Facebook URL
    LinkedIn URL

With all of this information I’ll be able to get a huge amount of intelligence about the community members. I’ll also have a list of social media accounts to add and engage with.
On top of this, with all the information on their websites and sites that they write for, I’ll have a wealth of potential link building prospects to work on.

Inbound.org Profiles

You’ll see in the above screenshot that a few of the pieces of data are available to see on the Inbound.org user profiles. We’ll need to get the other bits of information from the likes of Twitter and Google+, but this will all stem from the scraping of Inbound.org.

Scraping the Data

The idea behind this is that we can set up a template based on one of the user profiles and then automate the data gathering across the rest of the profiles on the site.

This is where you’ll need to install the SEO Tools plugin for Excel (it’s free). If you’ve not used this plugin before, don’t worry – I’ve put together a full tutorial here.

Once you’ve installed the plugin, you’re good to go on the actual scraping side of things…
Quick Note: Don’t worry if you don’t have a good knowledge of coding – you don’t need it. All you’ll need is a very basic understanding of reading some code and some basic Excel skills.

To begin with, you’ll need to do a little Excel admin. Simply add in some column titles based around the data that you’re gathering. For example, with my example of Inbound.org, I had, ‘Name’, ‘Position’, ‘Company’, ‘Company URL’, etc. which you can see in the screenshot below. You’ll also want to add in a sample profile URL to work on building the template around.

spreadsheet admin
Now it’s time to start getting hands on with XPath.
How to Use XPathOnURL()

This handy little formula is made possible within Excel by the SEO Tools plugin. Now, I’m going to keep this very basic because there are loads of XPath tutorials available online that can go into the very advanced queries that are possible to use.

For this, I’m simply going to show you how to get the data we want and you can have a play around yourself afterwards (you can download the full template at the end of this post).

Here’s an example of an XPath query that gathers the name of the person within the profile that we’re scraping:

=XPathOnUrl(A2, "//*[@id='user-profile']/h2")

A2 is simply referencing the cell that contains the URL that we’re scraping. You’ll see in the screenshot above that this is Jason Acidre’s profile page.

The next part of the formula is the XPath.

What this essentially says is to scrape through the HTML to find a tag that has ‘user-profile’ id attached to it. This could be a div, span, a or whatever.

Once it’s found this tag, it then needs to look at the first h2 tag within this area and grab the text within it. This is Jason’s name, which you’ll see in the screenshot below of the code:

website code
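For readers who would like to see the same query outside Excel, here is a minimal Python sketch of what the XPath is doing. It is only an illustration: the markup below is a made-up, simplified stand-in for a profile page (not Inbound.org’s real HTML), the helper name first_text is my own, and Python’s built-in ElementTree only understands a small subset of XPath and needs well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified profile markup (an assumption, not the real page).
SAMPLE_HTML = """
<html>
  <body>
    <div id="user-profile">
      <h2>Jason Acidre</h2>
      <p>Example job title</p>
    </div>
  </body>
</html>
"""


def first_text(markup, xpath):
    """Return the text of the first element matching xpath, or None."""
    root = ET.fromstring(markup.strip())
    node = root.find(xpath)
    if node is None or node.text is None:
        return None
    return node.text.strip()


# ElementTree wants a leading "." so the search starts from the root element.
# "*" matches any tag (div, span, a...), and [@id='user-profile'] filters by id.
name = first_text(SAMPLE_HTML, ".//*[@id='user-profile']/h2")
print(name)
```

The query reads the same way as the Excel version: find any tag carrying the ‘user-profile’ id, then take the h2 directly inside it.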

Don’t be put off at this stage because you don’t need to go manually trawling through code to build these queries, there’s a much simpler way.

The easiest way to do this is by right-clicking on the element you want to scrape on the webpage (within Chrome); for example, on Inbound.org, this would be the profile name. Now click ‘Inspect element’.

inspect element

The developer tools window should now appear at the bottom of your browser (or in a separate window). Within that, you should see the element that you’ve drilled down on.

All you need to do now is right-click on it and press ‘Copy XPath’.
copy XPath

This will now copy the XPath code for your Excel formula to the clipboard. You’ll just need to add in the first part of the query, i.e. =XPathOnUrl(A2,

You can then paste in the copied XPath after this and add a closing bracket.

Note: When you use ‘Copy XPath’ it will wrap some parts of the code in double quotation marks (") which you’ll need to change to single quotation marks ('). You’ll also need to wrap the whole copied XPath in double quotation marks.

Your finished code will look like this:
=XPathOnUrl(A2, "//*[@id='user-profile']/h2")
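If you end up building a lot of these formulas, the quote swapping can be scripted. The snippet below is a rough sketch rather than part of the SEO Tools plugin: build_xpath_formula is a hypothetical helper name, and it assumes the copied XPath only contains double quotes around attribute values.

```python
def build_xpath_formula(cell, copied_xpath):
    """Turn an XPath copied from Chrome into an XPathOnUrl formula string."""
    # Excel wraps the whole query in double quotes, so any double quotes
    # inside the copied XPath are swapped for single quotes first.
    inner = copied_xpath.replace('"', "'")
    return '=XPathOnUrl({}, "{}")'.format(cell, inner)


# Chrome's 'Copy XPath' typically produces double quotes around the id value.
formula = build_xpath_formula("A2", '//*[@id="user-profile"]/h2')
print(formula)
```

You can then paste the printed formula straight into the spreadsheet cell.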

You can then apply this formula against any Inbound.org profile and it will automatically grab the user’s full name. Pretty good, right?

Check out the full video tutorial below that I’ve put together that talks you through this whole process:


XPath Examples for Grabbing Other Data

As you’re probably starting to see, this technique could be scaled across any website online. This makes big data much more attainable and gives you the kind of results that an expensive paid tool would offer without any of the cost – bonus!

Here’s a few more examples of XPath that you can use in conjunction with the SEO Tools plugin within Excel to get some handy information.

Twitter Follower Count

If you want to grab the number of followers for a Twitter user then you can use the following formula. Simply replace A2 with the cell that holds the Twitter profile URL of the user you want data on. Just a quick word of warning with this one: it looks really long and complicated, but really I’ve just wrapped the query in Excel’s RIGHT and LEN functions to drop the first 10 characters of the returned text, which hold the ‘followers’ label.

=RIGHT(XPathOnUrl(A2,"//li[@class='ProfileNav-item ProfileNav-item--followers']"),LEN(XPathOnUrl(A2,"//li[@class='ProfileNav-item ProfileNav-item--followers']"))-10)
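The same clean-up step can be sketched in Python. This is an illustration under assumptions: the sample text ‘Followers 12,959’ is my guess at the shape of what the query returns, and parse_follower_count is a hypothetical helper name. Instead of counting characters, it simply picks the first number out of the text.

```python
import re


def parse_follower_count(raw_text):
    """Pull the first whole number out of text such as 'Followers 12,959'."""
    match = re.search(r"\d[\d,]*", raw_text)
    if match is None:
        return None
    # Remove thousands separators before converting to an integer.
    return int(match.group(0).replace(",", ""))


# Assumed shape of the scraped text: a label, some whitespace, then the count.
count = parse_follower_count("Followers\n  12,959")
print(count)
```

Note that abbreviated counts (for example a value written with a ‘K’ suffix) are not handled by this sketch and would need extra logic.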

Google+ Follower Count

Like with the Twitter follower formula, you’ll need to replace A2 with the cell that holds the full Google+ profile URL of the user you want this data for.

=XPathOnUrl(A2,"//span[@class='BOfSxb']")

List of ‘Contributor to’ URLs

I don’t think I need to tell you the value of pulling in a list of websites that someone contributes content to. If you do want to know then check out this post that I wrote.

This formula is a little more complex than the rest. This is because I’m pulling in a list of URLs as opposed to just one entity. This requires me to use the StringJoin function to separate all of the outputs with a comma (or whatever character you’d like).

Also, you may notice that there is an additional section to the XPath query, “href”. This pulls in the link within the specific code block instead of the text.

As you’ll see in the full Inbound.org scraper template that I’ve made, this is how I pull in the Twitter, Google+, Facebook and LinkedIn profile links.

You’ll want to replace A2 with the Google+ profile URL of the person you wish to gather data on.

=StringJoin(", ",XPathOnUrl(A2,"//a[@rel='contributor-to nofollow']","href"))
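To make the two moving parts clearer (pulling an attribute instead of the text, then joining several results), here is a small Python sketch of the same idea. The markup and URLs are invented for illustration, join_attribute is a hypothetical helper name, and ElementTree again stands in for the plugin’s XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with two 'contributor to' links and one unrelated link.
SAMPLE = """
<div>
  <a rel="contributor-to nofollow" href="http://example.com/blog">Example Blog</a>
  <a rel="me" href="http://example.com/me">Me</a>
  <a rel="contributor-to nofollow" href="http://example.org/news">Example News</a>
</div>
"""


def join_attribute(markup, xpath, attribute, separator=", "):
    """Collect one attribute from every matching element and join the values."""
    root = ET.fromstring(markup.strip())
    values = [node.get(attribute) for node in root.findall(xpath)]
    # Skip elements that do not carry the attribute at all.
    return separator.join(value for value in values if value)


urls = join_attribute(SAMPLE, ".//a[@rel='contributor-to nofollow']", "href")
print(urls)
```

Only the links whose rel attribute matches exactly are returned, which is why the ‘me’ link is left out.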

Twitter Profile Image URL

If you want to get a large version of someone’s Twitter profile image then I’ve got just the thing for you.

Again, you’ll just need to substitute A2 with their Twitter profile URL.

=XPathOnUrl(A2,"//*[@class='profile-picture media-thumbnail js-tooltip']","data-resolved-url-large")


Some Findings from the Data I’ve Gathered

Any big data set will throw up some interesting findings. Here are a few little things that I’ve found from the top 100 influential users on Inbound.org.

average followers chart

The chart above maps out the average number of followers that the top 100 users have on both Twitter (12,959) and Google+ (9,601). As well as this, it shows the average number of users that they follow on Twitter (1,363).

The next thing that I’ve looked at is the job titles of the top 100 users. You can see the most common occurrences of terms within the tag cloud below:

Job titles

Finally, I had a look through all of the domains listed within each of the top 100 Inbound.org users’ Google+ ‘contributor to’ sections and mapped out the most frequently mentioned sites.

Here’s the spread of domains that were the most popular to be contributed to:

domain frequency
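If you would rather count the domains with a script than by hand, a frequency table like the one described above can be sketched in a few lines of Python. The URLs here are placeholders rather than data from my scrape, and domain_counts is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(urls):
    """Count how often each domain appears in a list of URLs."""
    domains = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Treat www.example.com and example.com as the same site.
        if host.startswith("www."):
            host = host[4:]
        if host:
            domains.append(host)
    return Counter(domains)


# Placeholder URLs standing in for scraped 'contributor to' links.
sample_urls = [
    "http://www.example.com/a",
    "http://example.com/b",
    "https://example.org/post",
]
counts = domain_counts(sample_urls)
print(counts.most_common(2))
```

Feeding it the comma-separated output of the ‘contributor to’ formula (split back into a list) would give you the same kind of ranking.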
It Doesn’t Stop There

As you’ve probably gathered, this can be scaled out across pretty much any community/forum/directory/website online.

With this kind of intelligence in your armoury, you’ll be able to gather more intelligence on your targets and increase the effectiveness of your outreach campaigns dramatically.

Also, as promised, you can download my full Inbound.org scraper template below:

Download: http://www.matthewbarby.com/goodies/MatthewBarby-Inbound-Scraper.xlsx

TL;DR

    Online communities hold valuable data on your target audiences – use it!
    Scale out your intelligence gathering by brushing up on your XPath.
    Download my Inbound.org scraper template and let it work its magic.

Source: http://www.matthewbarby.com/scraping-communities-with-xpath/

Wednesday, 26 November 2014

Web Data Extraction: driving data your way

Most businesses rely on the web to gather data such as product specifications, pricing information, market trends, competitor information, and regulatory details. More often than not, companies collect this data manually—a process that not only takes a significant amount of time, but also has the potential to introduce costly errors.

By automating data extraction, you're able to free yourself (and your pointer finger) from hours of copy/pasting, eliminate human errors, and focus on the parts of your job that make you feel great.

Web data extraction: What it is, why it's used, and how to get it right on an ongoing basis

Web data extraction, screen scraping, web harvesting—while these terms may have different connotations they all essentially point to the same thing: plucking data from the web and populating it in an organized way in another place for further analysis or more focused use. In an era where “big data” has become a commonplace concept, the appeal of web data extraction has grown; it’s an extremely efficient alternative to web browsing, and culls very specific data for a focused purpose.

How it's used

While each company’s needs vary, data extraction is often used for:

    Competitive intelligence, including web popularity, social perception, other sites linking to them, and placement of competitor advertisements

    Gathering financial data including stock market movement, product pricing, and more

    Creating continuity between price sheets and online websites, catalogs, or inventory databases

    Capturing product specifications like dimensions, color, and materials

    Pulling tabular data from multiple sources for in-depth analysis

Interestingly, some people even find that web data extraction can aid them in their leisure time as well, pulling data from blogs and websites that pertain to their hobbies or interests, and creating their own library of organized information on a topic. Perhaps, for instance, you want a list of all the designers that George Clooney wears (hey, we won’t question what you do in your free time). By using web scraping tools, you could automatically extract this type of data from, say, a fashion blogger who follows celebrity style, and create your own up-to-date shopping list of items.

How it's done

When you think of gathering data from the web, you should mentally juxtapose two different images: one of gathering a bucket of sand one grain at a time, and one of filling a bucket with a shovel that has the perfect scoop size to fill it completely in one sitting. While clearly the second method makes the most sense, the majority of web data extraction happens much like the first process--manually, and slowly.

Let’s take a look at a few different ways organizations extract data today.

The least productive way: manually

While this method is the least efficient, it’s also the most widespread. On the plus side, you need to learn absolutely nothing except “Ctrl+C/V” to use this method, which explains why it is the generally preferred method, despite the hours of time it can take. Imagine, for instance, managing a sales spreadsheet that keeps inventory up to date so that the information can be properly disseminated to a global sales team. Not only does it take a significant amount of time to update the spreadsheet with information from, say, your internal database and manufacturer’s website, but information may change rapidly, leaving sales reps with inaccurate information regardless.

Finding someone in the organization with a talent for programming languages like Python

Generally, automating a task without dedicated automation software requires programming, and therefore an internal resource with a solid familiarity with programming languages to create the task and corresponding script. While most organizations do, in fact, have a resource in IT or engineering with this type of ability, it often doesn’t seem like a worthy time investment for that person to derail the initiatives he or she is working on to automate web data extraction. Additionally, if companies do choose to automate using in-house resources, that person will find himself beholden to a continuing obligation, since he or she will need to adjust scripting if web objects and attributes change, disabling the task.

Outsourcing via Elance or oDesk

Unless there is a dedicated resource ready to automate and maintain data extraction processes (and most organizations wouldn’t necessarily choose to use their in-house employee time this way), companies might turn to outsourcing companies such as Elance or oDesk to hire contract help. While this is an effective way to automate a task using a resource that has a level of acumen in automation, it represents an additional cost--be it one time or on a regular basis as data extraction requirements change or increase.

Using Excel web queries

Since data extracted from the web is, more often than not, populated into an Excel spreadsheet, it’s no wonder that Excel includes web query tools expressly for that purpose. These tools are particularly useful in pulling tabular data from a website (such as product specifications, legal codes, stock prices, and a host of other information) and automatically pushing the data into a spreadsheet. Excel queries do have limitations and a learning curve, however, particularly when creating dynamic web queries. And clearly, if you’re hoping to populate the information in other sources, such as external databases, there is yet another level of difficulty to navigate.
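For comparison, the scripted route mentioned earlier can do the same tabular pull in a few dozen lines. The following is a minimal Python sketch, not a description of how Excel web queries work internally: the HTML table is an invented sample, TableParser and table_to_csv are hypothetical names, and a real page would first need to be downloaded.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Convert the table rows found in html into CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Invented sample table (an assumption, not taken from any real site).
SAMPLE_TABLE = (
    "<table><tr><th>Product</th><th>Price</th></tr>"
    "<tr><td>Widget</td><td>9.99</td></tr></table>"
)
csv_text = table_to_csv(SAMPLE_TABLE)
print(csv_text)
```

The resulting CSV text can be saved to a file and opened directly in Excel.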

How automation simplifies web data extraction

Culling web data quickly

Using automation is the simplest way to extract web data. As you execute the steps necessary to perform the task one time, a macro recorder captures each action, automatically generates an easily-editable script, and lets you specify how often you would like to repeat the task, and at what speed.

Maintaining the highest level of accuracy

With humans copy/pasting data, or comparing between multiple screens and entering data manually into a spreadsheet, you’re likely to run into accuracy issues (sometimes directly proportionate to the amount of time spent on the task and amount of coffee in the office!) Automation software ensures that “what you see is what you get,” and that data is picked up from the web and put back down where you want it without a hitch.

Storing web data in your preferred format

Not only can you accurately transfer data with automation software, you can also ensure that it’s populated into spreadsheets or databases in the format you prefer. Rather than simply dumping the data into a spreadsheet, you can ensure that the right information is put into the proper column, row, field, and style (think, for instance, of the difference between writing a birth date as “03/13/1912” and “12/3/13”).
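As a small illustration of what enforcing a format means in practice, the Python sketch below rewrites a date into one unambiguous layout. It assumes the incoming dates are written month/day/year, and normalize_date is a hypothetical helper name rather than a feature of any particular automation product.

```python
from datetime import datetime


def normalize_date(text, input_format="%m/%d/%Y"):
    """Parse a date in a known input format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, input_format).date().isoformat()


# Assumes the source writes dates as month/day/year with a four-digit year.
iso_date = normalize_date("03/13/1912")
print(iso_date)
```

Dates that do not match the expected input format raise a ValueError, which is usually preferable to silently storing a misread date.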

Simplifying data analysis

Automation software allows you to aggregate data from disparate sources or enormous stockpiles of structured or unstructured data in a way that makes sense for your business analysis needs. This way, the majority of employees in an organization can perform some level of analysis on their own, making it easier to surface information that informs business decisions.

Reacting to changes without a hitch

Because automation software is built to recognize icons, images, symbols, and other objects regardless of their position on a screen, it can automate processes in a self-perpetuating manner. For example, let’s say you automate data retrieval from a certain chart on a retailer’s website without automation software. If the retailer decides to move that object to another area of the screen, your task would no longer produce accurate results (or work at all), leaving you to make changes to the script (or find someone who can), or re-record the task altogether. With image recognition capabilities, however, the system “memorizes” the object itself, not merely its coordinates, so that the task can continue to run irrespective of changes.

The wide sweeping appeal of automation software

Companies often pick a comprehensive automation solution not only because of its ability to effectively automate any web data extraction task, but also because it goes beyond data extraction. Automation software can permeate into other areas of the business as well, making tasks such as application integration, data migration, IT processes, Excel automation, testing, and routine tasks such as launching applications or formatting files faster and more accurate. Because it requires no programming experience to use, adoption rates are higher and businesses get more “bang for their buck.”

Almost any organization can benefit from using automation software, particularly as they grow and scale. If you are looking to quit “moving grains of sand” and start claiming back time in your day, there are a few steps you can take:

 Watch a short video that shows how web data extraction is done with automation software

 Download a free trial and start reaping the benefits of automating even just a couple of tasks today.

 See how tasks are automated with our short, step-by-step how-to-sheets (and then give it a try yourself!)

Source: https://www.automationanywhere.com/web-data-extraction

Sunday, 23 November 2014

Outsourcing Data Mining is a Wise Business Decision

Most businesses nowadays have a large volume of raw data that is never processed, because of the lack of time or resources. If your business is facing a similar situation, then you are missing out on valuable information. Without the right information, your company will be unable to make accurate business decisions.

The right information can play a key role in promoting the growth of your business. When unprocessed data is entered, filtered, classified and converted into a workable format, it can be used to maximize your profits, ameliorate your risks and run a seamless workflow.

Over the years, data mining has proved to be extremely useful in various industries, be it healthcare, direct marketing, e-commerce, finance, customer relationship management or telecommunications. With the right information, companies have been able to make fast and effective business decisions.

Why outsource data mining?

Data mining requires the expertise of professional business and financial analysts who understand how to acquire important information from vast amounts of data. If data mining is done in-house, it can become expensive and time consuming. It can also shift your focus away from core business activities. Outsourcing data mining, on the other hand, is faster and more cost-effective, and can give you access to professional services.

4 commonly outsourced data mining functions

Most companies outsource one or more of the following data mining functions to India:

1. Data congregation: Data is extracted from various web pages and websites, by using methods like web and screen scraping. The collected data is then entered into a database.

2. Contact data collection: Different websites are searched and information concerning contacts is collected.

3. E-commerce data: Data about varied online stores are collected, taking into account information about prices, discounts and products.

4. Data about competitors: Data about business competitors are collected to help a company gauge itself against its competition. With such valuable data, you can effectively re-design your marketing strategy and pricing matrix.

8 advantages of outsourcing data mining to India

With data mining out of your hands, your business can make huge savings in terms of time, money and infrastructure. The following are some of the benefits that you can leverage by outsourcing data mining to India:

    Get qualified and highly skilled data mining experts to work for you at an extremely affordable cost

    Be assured of the quality of information, as Indian data entry companies only extract information from reliable websites and databases

    Save on the cost of investing on the latest data mining software and technology, as your Indian service provider will be making these investments

    Get your data processed within a short turnaround time of 3, 6 or 12 hours, as Indian data mining companies can provide efficient data mining within a few hours

    When compared to in-house data mining, outsourcing data mining can be a lot cheaper and also bring you better results

    Stay assured about the complete privacy, security and confidentiality of your valuable data as Indian data mining companies use the latest technology to ensure 100% safety

    Get access to data with a wide market coverage as your Indian data mining provider will be serving many businesses with varied data mining needs

    Improve your overall productivity and generate more profits by making informed decisions about your business

Have you outsourced data mining before? If yes, which data mining service did you outsource? Did you find outsourcing more advantageous than in-house data mining? Let us know.

Source: http://blog.flatworldsolutions.com/outsourcing-data-mining-is-a-wise-business-decision/

Wednesday, 19 November 2014

Web Scraping for SEO with these Open-Source Scrapers

When conducting Search Engine Optimization (SEO), we’re required to scrape websites for data for our campaigns and for reports for our clients. At the lowest level we use scraping to keep track of rankings on search engines like Google, Bing, and Yahoo, and even to keep track of links on websites so we know when a link has completed its lifespan. Beyond that, we’ve used scrapers to aggregate data from APIs, RSS feeds, and websites for data mining, looking for patterns that help us become more competitive.

So scraping is a function the majority of companies (SEOmoz, Raventools, and Google among them) have to perform, whether to save money, protect intellectual property, track trends, etc. Businesses can find endless uses for scraping tools; it just depends on whether you’re a printed circuit board manufacturer looking for ideas for your e-mail marketing campaign or an Orange County based business trying to keep an eye on the competition. That is why we’ve created a comprehensive list of open source scrapers to help businesses out. Just keep in mind we haven’t used all of them!

A word of caution: web scrapers require knowledge specific to the language they are written in, such as PHP and cURL. Take into consideration issues like cookie management, fault tolerance, organizing the data properly, not crashing the website being scraped, and making sure the website doesn’t prohibit scraping.
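One practical way to check the last of those points is to read the site’s robots.txt before crawling. Here is a minimal Python sketch using the standard library’s robots.txt parser. The rules and URLs are invented examples, is_allowed is a hypothetical helper name, and passing a robots.txt check is not the same as having legal permission to scrape.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; a real one would be fetched from the site.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_text, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.strip().splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-scraper", "http://example.com/products"))
print(is_allowed(ROBOTS_TXT, "my-scraper", "http://example.com/private/data"))
```

Running a check like this before each request, together with a pause between requests, goes a long way towards not overloading the site being scraped.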

If you’re ready, here’s the list…

Erlang

    eBot

Java

    Heritrix
    Nutch
    Piggy Bank
    WebSPHINX
    WebHarvest

PHP

    PHPCrawl
    Snoopy
    SpiderMonkey

Python

    BeautifulSoup
    HarvestMan
    Scrape.py
    Scrapemark
    Scrapy **
    Mechanize

Ruby

    Anemone
    scRUBYt

We’ll come back and update this list as we encounter more! If you would like to submit a solution we missed, feel free. Also we’re looking for guides related to each of these, so if you know of any or would be interested in guest blogging about one, let us know!

Source: http://www.annexcore.com/blog/web-scraping-for-seo-with-these-open-source-scrapers/

Monday, 17 November 2014

How to scrape data without coding? A step by step tutorial on import.io

Import.io (pronounced import-eye-oh) lets you scrape data from any website into a searchable database. It is perfect for gathering, aggregating and analysing data from websites without the need for coding skills. As Sally Hadadi, from Import.io, told Journalism.co.uk: the idea is to “democratise” data. “We want journalists to get the best information possible to encourage and enhance unique, powerful pieces of work and generally make their research much easier.” Different uses for journalists, supplemented by case studies, can be found here.

A beginner’s guide

After downloading and opening import.io browser, copy the URL of the page you want to scrape into the import.io browser. I decided to scrape the search results website of orphanages in London:

001 Orphanages in London

After opening the website, press the tiny pink button in top right corner of the browser and follow up with “Let’s get cracking!” in the bottom right menu which has just appeared.

Then, choose the type of scraping you want to perform. In my case, it’s a Crawler (we’ll be getting data from multiple similar pages on the same site):

crawler

And confirm the URL of the website you want to scrape by clicking “I’m there”.

As advised, choose “Detect optimal settings” and confirm the following:

data

In the menu “Rows per page” select the format in which data appears on the website, whether it is “single” or “multiple”. I’m opting for multiple as my URL is a listing of multiple search results:

multiple

Now, the time has come to “train your rows” i.e. mark which part of the website you are interested in scraping. Hover over an entire “entry” or “paragraph”:

hover over entry

…and the entry will be highlighted in pink or blue. Press “Train rows”.

train rows

Repeat the operation with the next entry/paragraph so that the scraper gets the hang of the pattern of your selections. Two examples should suffice. Scroll down to the bottom of your website to make sure that all entries until the last one are selected (=highlighted in pink or blue alternately).

If they are, press “I’ve got all 50 rows” (the number depends on how many rows you have selected).

Now it’s time to focus on particular chunks of data you would like to extract. My entries consist of a name of the orphanage, address, phone number and a short description so I will extract all those to separate columns. Let’s start by adding a column “name”:

add column

Next, highlight the name of the first orphanage in the list and press “Train”.

highlight

train

Your table should automatically fill in with the names of all orphanages in the list:

table name

If it didn’t, try tweaking your selection a bit. Then add another column “address” and extract the address of the orphanage by highlighting the two lines of addresses and “training” the rows.

Repeat the operation for a “phone number” and “description”. Your table should end up looking like this:

table final

*Before passing on to the next column it is worth checking that all the rows have filled up. If not, highlighting and training of the individual elements might be necessary.

Once you’ve grabbed all that you need, click “I’ve got what I need”. The menu will now ask you if you want to scrape more pages. In this case, the search yielded two pages of search results so I will add another page. In order to do this, go back to your website in your regular browser, choose page 2 (or any next one) of your search results and copy the URL. Paste it into the import.io browser and confirm by clicking “I’m there”:

i'm there

The scraper should automatically fill in your table for page 2. Click “I’ve got all 45 rows” and “I’ve got what I needed”.

You need to add at least 5 pages, which is a bit frustrating with a smaller data set like this one. The way around it is to add page 2 a couple of times and delete the unnecessary rows in the final table.

Once the cheating is done, click “I’m done training!” and “Upload to import.io”.

upload

Give a name to your Crawler, e.g. “Orphanages in London”, and wait for import.io to upload your data. Then, run the crawler:

run crawler

Make sure that the page depth is 10 and then click “Go”. If you’re scraping a huge dataset with several pages of search results, you can copy your URLs to Excel, highlight them and drag down with the black cross (bottom right of the cell) to obtain a comprehensive list. Paste it into the “Where to start?” window and press “Go”.

go
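If you prefer not to build the list of page URLs by dragging cells in Excel, the same list can be generated with a short Python sketch. The URL pattern below is a placeholder (the real one depends on how the site numbers its result pages), and paginated_urls is a hypothetical helper name.

```python
def paginated_urls(template, first_page, last_page):
    """Build one URL per results page by filling {page} in the template."""
    return [template.format(page=number) for number in range(first_page, last_page + 1)]


# Placeholder pattern; copy the real one from page 2 of your own search results.
urls = paginated_urls("http://example.com/search?q=orphanages&page={page}", 1, 3)
print("\n".join(urls))
```

The printed lines can be pasted straight into the “Where to start?” window.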

After the crawling is complete, you can download your data in Excel, HTML, JSON or CSV format.

As a result, we obtain a data set which can be easily turned into a map of orphanages in London, e.g. using Google Fusion Tables.

Source:http://www.interhacktives.com/2014/03/06/scrape-data-without-coding-step-step-tutorial-import-io/

Saturday, 15 November 2014

Is Web Scraping Legal?

Web scraping might be one of the best ways to aggregate content from across the internet, but it comes with a caveat: It’s also one of the hardest tools to parse from a legal standpoint.

For the uninitiated, web scraping is a process whereby an automated piece of software extracts data from a website by “scraping” through the site’s many pages. While search engines like Google and Bing do a similar task when they index web pages, scraping engines take the process a step further and convert the information into a format which can be easily transferred over to a database or spreadsheet.

It’s also important to note that a web scraper is not the same as an API. While a company might provide an API to allow other systems to interact with its data, the quality and quantity of data available through APIs is typically lower than what is made available through web scraping. In addition, web scrapers provide more up-to-date information than APIs and are much easier to customize from a structural standpoint.

The applications of this “scraped” information are widespread. A journalist like Nate Silver might use scrapers to monitor baseball statistics and create numerical evidence for a new sports story he’s working on. Similarly, an eCommerce business might bulk scrape product titles, prices, and SKUs from other sites in order to further analyze them.

While web scraping is an undoubtedly powerful tool, it’s still undergoing growing pains when it comes to legal matters. Because the scraping process appropriates pre-existing content from across the web, there are all kinds of ethical and legal quandaries that confront businesses who hope to leverage scrapers for their own processes.

In this “wild west” environment, where the legal implications of web scraping are in a constant state of flux, it helps to get a foothold on where the legal needle currently falls. The following timeline outlines some of the biggest cases involving web scrapers in the United States, and allows us to achieve a greater understanding of the precedents that surround the court rulings.

Terms of Use Tug-of-War—2000-2009

For years after they first came into use, web scrapers went largely unchallenged from a legal standpoint. In 2000, however, the use of scrapers came under heavy and consistent fire when eBay fired the first shot against an auction data aggregator called Bidder’s Edge. In this very early case, eBay argued that Bidder’s Edge was using scrapers in a way that violated Trespass to Chattels doctrine. While the lawsuit was settled out of court, the judge upheld eBay’s original injunction, stating that heavy bot traffic could very well disrupt eBay’s service.

Then in 2003’s Intel Corp. v. Hamidi, the California Supreme Court overturned the basis of eBay v. Bidder’s Edge, ruling that Trespass to Chattels could not extend to the context of computers if no actual damage to personal property occurred.

So in terms of legal action against web scraping, Trespass to Chattels no longer applied, and things were back to square one. This began a period in which the courts consistently rejected Terms of Service as a valid means of prohibiting scrapers, including cases like Perfect 10 v. Google, and Cvent v. Eventbrite.

The Takeaway: The earliest cases against scrapers hinged on Trespass to Chattels law, and were successful. However, that doctrine is no longer a valid approach.

2009—Facebook Steps In

In 2009, Facebook turned the tides of the web scraping war when Power.com, a site which aggregated multiple social networks into one centralized site, included Facebook in their service. Because Power.com was scraping Facebook’s content instead of adhering to their established standards, Facebook sued Power on grounds of copyright infringement.

In denying Power.com’s motion to dismiss the case, the Judge ruled that scraping can constitute copying, however momentary that copying may be. And because Facebook’s Terms of Service don’t allow for scraping, that act of copying constituted an infringement on Facebook’s copyright. With this decision, the waters regarding the legality of web scrapers began to shift in favor of the content creators.

The Takeaway: Even if a web scraper ignores infringing content on its way to freely-usable content, it might qualify as copyright infringement by virtue of having technically “copied” the infringing content first.

2011-2014— U.S. v Auernheimer

In 2010, hacker Andrew “Weev” Auernheimer found a security flaw in AT&T’s website, which would display the email addresses of users who visited the site via their iPads. By exploiting the flaw using some simple scripts and a scraper, Auernheimer was able to gather thousands of emails from the AT&T site.

Although these email addresses were publicly available, Auernheimer’s exploit led to his 2012 conviction, where he was charged with identity fraud and conspiracy to access a computer without authorization.

Earlier this year, the court vacated Auernheimer’s conviction, ruling that the trial’s New Jersey venue was improper. But even though the case turned out to be mostly inconclusive, the court noted the fact that there was no evidence to show that “any password gate or code-based barrier was breached.” This seems to leave room for the web scraping of publicly-available personal information, although it’s still very much open to interpretation and not set in stone.

The Takeaway: Using a web scraper to aggregate sensitive personal information can lead to a conviction, even if that information was technically available to the public. While there is hope in the court’s observation that no passwords or barriers were broken to retrieve this information, the waters here are still very volatile.

2013—Associated Press vs. Meltwater

Meltwater is a software company whose “Global Media Monitoring” product uses scrapers to aggregate news stories for paying clients. The Associated Press took issue with Meltwater’s scraping of their original stories, some of which had been copyrighted. In 2012, AP filed suit against Meltwater for copyright infringement and hot news misappropriation.

While it’s already been established that facts cannot be copyrighted, the court decided that the AP’s copyrighted articles—and more specifically, the way in which the facts within those articles were arranged—were not fair game for copying. On top of this, Meltwater’s use of the articles failed to meet the established fair use standards, and could not be defended on that front either.

The Takeaway: Fair use is limited when it comes to web scrapers, and copyrighted content is not always open to be scraped.

~~

By closely observing the outcomes of previous rulings, you’ll find that there are a few guidelines that a scraper should attempt to adhere to:

    Content being scraped is not copyright protected
    The act of scraping does not burden the services of the site being scraped
    The scraper does not violate the Terms of Use of the site being scraped
    The scraper does not gather sensitive user information
    The scraped content adheres to fair use standards
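The second guideline, not burdening the site being scraped, is commonly approached by spacing requests out. The following is a minimal sketch of that idea only; the `PoliteDelay` class and the one-second interval are assumptions chosen for illustration, not values taken from any of the rulings above.

```java
class PoliteDelay {

    // Returns how many milliseconds to wait before the next request so that
    // consecutive requests are at least minIntervalMillis apart.
    static long waitMillis(long lastRequestAtMillis, long nowMillis, long minIntervalMillis) {
        long elapsed = nowMillis - lastRequestAtMillis;
        return Math.max(0L, minIntervalMillis - elapsed);
    }

    public static void main(String[] args) throws InterruptedException {
        long minInterval = 1000L; // assumed spacing; choose a value the target site can tolerate
        long last = System.currentTimeMillis();
        for (int i = 0; i < 3; i++) {
            Thread.sleep(waitMillis(last, System.currentTimeMillis(), minInterval));
            last = System.currentTimeMillis();
            System.out.println("request " + i + " would be sent now");
        }
    }
}
```

Keeping the wait calculation in a pure function makes it easy to check without actually sleeping or sending anything.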

While all of these guidelines are important to understand before using scrapers, there are other ways to acclimate to the legal nuances. In many cases, you’ll find that a simple conversation with a business software developer or consultant will lead to some satisfying conclusions: Odds are, they’ve used scrapers in the past and can shed light on any snags they’ve hit in the process. And of course, talking with a lawyer is always an ideal course of action when treading into questionable legal territory.

Source:http://blog.icreon.us/2014/09/12/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/

Thursday, 13 November 2014

Interactive Crawls for Scraping AJAX Pages on the Web

Crawling pages on the web has become an everyday affair for most enterprises. We also often come across offline businesses that would like data gathered from the web for internal analyses, all of it eventually to serve customers faster and better. At times, when the crawl job is both complex and large in scale, businesses also consider DaaS providers to supplement their efforts.

However, the web landscape too has evolved with newer technologies that provide fancy experiences to web users. AJAX elements are one such common aid that leave even the DaaS providers perplexed. They come in various forms from a user’s point of view-

1. Load more results on the same page

2. Filter results based on various selection criteria

3. Submit forms, etc.

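When crawling a non-AJAX page, simple GET requests do the job. AJAX pages, however, load their data through background requests, often POST requests whose parameters travel in the request body rather than in the URL, and these are not easy for a normal bot to trace.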
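To make the difference concrete, here is a minimal sketch that builds one request of each kind with the `java.net.http` API that ships with Java 11 and later. The URLs, the form parameters and the `GetVsPost` class are hypothetical; the requests are only constructed and inspected, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class GetVsPost {

    // A plain page fetch: everything the server needs is in the URL.
    static HttpRequest pageRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // An AJAX-style call: the parameters travel in the request body.
    static HttpRequest ajaxRequest(String url, String formBody) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical endpoints, used only to show the shape of each request.
        HttpRequest get = pageRequest("http://www.example.com/results?page=1");
        HttpRequest post = ajaxRequest("http://www.example.com/ajax/load-more", "offset=20&limit=20");
        System.out.println(get.method() + " " + get.uri());
        System.out.println(post.method() + " " + post.uri());
    }
}
```

A bot that only follows links sees the first kind of request; the second has to be discovered by watching what the page sends when a user clicks or scrolls.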

[Figure: GET vs. POST requests]

At PromptCloud, from our experience with a number of AJAX sites on the web, we’ve crossed the tech barrier. Below is a quick review about the challenges that come with AJAX crawling and its indicative solutions-

1. Javascript Emulations- A bot essentially emulates human browsing to fetch pages. When this needs to be done for Javascript components on a page, it gets tricky. A headless browser, which emulates human interaction with a web page without a graphical interface, is the current approach. These browsers click on the various elements and dropdown lists that are driven by Javascript code and capture the responses so that they can be passed on to programs. Which headless browser to pick depends on what fits well into your current stack.

2. Fetch Bandwidths- Unlike GET requests, which complete pretty quickly, POST requests take quite a bit of time due to the number of events involved per fetch. Hence a good amount of bandwidth needs to be allocated in order to receive the response. For the same reason, wait times need to be taken care of too, or else you might end up with incomplete responses.

3. .NET Architectures- This is a more complex scenario related to maintaining the View State. Most of the postbacks come with an event and its validation. The bot needs to track the view state and pass validations for the event to occur so that the code can be executed and results captured. This is achieved by adopting a mechanism to restore states if things break midway.

4. Page Encoding- Request and response headers need to be taken care of on AJAX pages. The request needs to be sent in the exact format as expected by the server (Content-type or media type, accept fields, etc.) and similarly responses need to be parsed based on the content-type.
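Points 3 and 4 can be illustrated together. ASP.NET Web Forms pages carry their state in hidden form fields such as `__VIEWSTATE` and `__EVENTVALIDATION`, and a postback has to send them back as a URL-encoded form body. The sketch below is not PromptCloud's implementation; it is a simplified illustration that assumes the hidden inputs list their `type`, `name` and `value` attributes in that order, and the `ViewStateSketch` class and the control name in `main` are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ViewStateSketch {

    // Matches <input type="hidden" name="..." ... value="...">.
    // Assumes the attributes appear in this order; a real crawler would use an HTML parser.
    private static final Pattern HIDDEN_INPUT = Pattern.compile(
            "<input[^>]*type=\"hidden\"[^>]*name=\"([^\"]+)\"[^>]*value=\"([^\"]*)\"");

    // Collects the hidden fields of a page in document order.
    static Map<String, String> hiddenFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = HIDDEN_INPUT.matcher(html);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    // Encodes the fields as an application/x-www-form-urlencoded body.
    static String formBody(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String html = "<input type=\"hidden\" name=\"__VIEWSTATE\" value=\"/wEPDw==\" />";
        Map<String, String> fields = hiddenFields(html);
        // Hypothetical control name: the element whose event the postback should trigger.
        fields.put("__EVENTTARGET", "ctl00$ddlCity");
        System.out.println(formBody(fields));
    }
}
```

The body produced this way would be sent with a `Content-Type` of `application/x-www-form-urlencoded`, and the hidden fields have to be re-read from every response because the server issues new values after each postback.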

A Use Case

One of our clients, who sells event tickets at discounted rates, had us crawl one of the ticketing sites on the web weekly; it was one of the most complex AJAX crawls we’ve dealt with so far. For the data that was to be extracted, multiple AJAX fetches were needed depending on the selections made. Requests had to be made for each combination of items from the dropdown boxes, and these came with cookies and session IDs. To add to the challenge, the site was extremely dynamic and changed its structure every week, making it difficult for us to follow what data was where on the page.
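Enumerating every combination of dropdown selections amounts to a Cartesian product over the option lists. The sketch below shows only that enumeration step under assumed inputs; the dropdown values and the `SelectionCombos` class are made up for illustration, and in a real crawl each combination would become one AJAX request.

```java
import java.util.ArrayList;
import java.util.List;

class SelectionCombos {

    // Returns every combination that takes exactly one option from each dropdown,
    // keeping the dropdowns in the order given.
    static List<List<String>> combinations(List<List<String>> dropdowns) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>());
        for (List<String> options : dropdowns) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> partial : result) {
                for (String option : options) {
                    List<String> extended = new ArrayList<>(partial);
                    extended.add(option);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical dropdown values.
        List<List<String>> dropdowns = List.of(
                List.of("London", "Paris"),
                List.of("Concert", "Theatre", "Sports"));
        for (List<String> combo : combinations(dropdowns)) {
            System.out.println(combo);
        }
    }
}
```

The number of requests grows as the product of the list sizes, which is one reason crawls of this kind need more compute and bandwidth than plain page fetches.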

We developed an AJAX crawler specific to this site to take care of all the dynamics. Response times were taken care of so that we didn’t miss any relevant information. We included an ML component to improve the crawler which is now pretty stable irrespective of changes on the site.

Overall, AJAX crawling requires more compute power in addition to the tech expertise. And because there’s no uniformity on the web, there’s always a new challenge to overcome in this landscape. It wouldn’t be an overstatement to say that we’ve done a good job at that so far and have developed the knack :)

Reach out to us for any kind of web scraping/ crawling- either AJAX or not. We’ll take care of the complexities.

Source: https://www.promptcloud.com/blog/web-scraping-interactive-ajax-crawls/

Wednesday, 12 November 2014

Web scraping services-importance of scraped data

Web scraping services are provided by computer software which extracts the required facts from a website. Web scraping services mainly aim at converting unstructured data collected from websites into structured data which can be stored and analyzed in a centralized database. Web scraping services therefore have a direct influence on the outcome of whatever purpose the data was collected for.

It is not very easy to scrape data from different websites because of the terms of service in place. There are also legal rules that protect the personal information held on websites. These ‘rules’ must be followed to the letter and have, to some extent, limited web scraping services.

Owing to the high demand for web scraping, various firms have been set up to provide efficient and reliable web scraping services, so that the information acquired is correct and conforms to security requirements. These firms have also developed different software that makes web scraping much easier.

Importance of web scraping services

Web scraping services have definitely gone a long way in providing very useful information to various organizations, but businesses are the ones that benefit most from them. Some of the benefits associated with web scraping services are:

    They help firms to easily send notifications to their customers about price changes, promotions, the introduction of a new product into the market, etc.
    They enable firms to compare their product prices with those of their competitors.
    They help meteorologists to monitor weather changes and thus forecast weather conditions more efficiently.
    They assist researchers with extensive information about people’s habits, among many other things.
    They have promoted e-commerce and e-banking services, where stock exchange rates, banks’ interest rates, etc. are updated automatically in the customer’s catalog.

Advantages of web scraping services

The following are some of the advantages of using web scraping services:

    Automation of the data

    Web scraping can retrieve both static and dynamic web pages

    Page contents of various websites can be transformed

    It allows formulation of vertical aggregation platforms thus even complicated data can still be extracted from different websites.

    Web scraping programs recognize semantic annotation

    All the required data can be retrieved from their websites

    The data collected is accurate and reliable

Web scraping services mainly aim at collecting, storing and analyzing data. The data analysis is facilitated by various web scrapers that can extract information and transform it into forms that are useful and easy to interpret.

Challenges facing web scraping

    High volume of web scraping can cause regulatory damage to the pages

    Scale of measure; the scales of the web scraper can differ with the units of measure of the source file thus making it somewhat hard for the interpretation of the data

    Level of source complexity; if the information being extracted is very complicated, web scraping will also be paralyzed.

It is clear that, besides providing useful data and information, web scraping faces a number of challenges. The good thing is that web scraping service providers are always improving their techniques to ensure that the information gathered is accurate, timely, reliable and treated with the highest levels of confidentiality.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/191-web-scraping-services-importance-of-scraped-data/

Monday, 10 November 2014

How to scrape Amazon with WebDriver in Java

Here is a real-world example of using Selenium WebDriver for scraping.
This short program is written in Java and scrapes book titles and authors from the Amazon webstore.
This code scrapes only one page, but you can easily make it scrape all the pages by adding a couple of lines.

You can download the source here.

import java.io.*;
import java.util.*;
import java.util.regex.*;

import org.openqa.selenium.*;
import org.openqa.selenium.firefox.FirefoxDriver;


public class FetchAllBooks {

    public static void main(String[] args) throws IOException {

        WebDriver driver = new FirefoxDriver();

        try {
            driver.navigate().to("http://www.amazon.com/tag/center%20right?ref_=tag_dpp_cust_itdp_s_t&store=1");

            // Authors and titles are located by CSS class name and paired by position.
            List<WebElement> allAuthors = driver.findElements(By.className("tgProductAuthor"));
            List<WebElement> allTitles = driver.findElements(By.className("tgProductTitleText"));

            // The title is read from the title attribute in the element's inner HTML.
            final Pattern pattern = Pattern.compile("title=(.+?)>");
            StringBuilder fileText = new StringBuilder();

            // Stop at the shorter list so a missing title cannot cause an out-of-bounds error.
            int count = Math.min(allAuthors.size(), allTitles.size());
            for (int i = 0; i < count; i++) {
                String authorName = allAuthors.get(i).getText();
                String innerHtml = (String) ((JavascriptExecutor) driver).executeScript(
                        "return arguments[0].innerHTML;", allTitles.get(i));
                Matcher matcher = pattern.matcher(innerHtml);
                // Skip rows without a title attribute instead of failing on group(1).
                if (!matcher.find()) {
                    continue;
                }
                fileText.append(authorName).append(",").append(matcher.group(1)).append("\n");
            }

            // Write one "author,title" line per book.
            try (Writer writer = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream("books.csv"), "utf-8"))) {
                writer.write(fileText.toString());
            }
        } finally {
            // quit() ends the browser session even if scraping fails.
            driver.quit();
        }
    }
}
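One thing to watch in the program above: author and title are joined with a bare comma, so a title that itself contains a comma would shift the columns in books.csv. The usual CSV convention (described in RFC 4180) is to wrap such fields in double quotes and to double any embedded quotes. The helper below is a small sketch of that convention; the `CsvField` class is not part of the original program, and the sample author and title are made up.

```java
class CsvField {

    // Quotes a field if it contains a comma, a double quote, or a line break,
    // doubling any embedded double quotes.
    static String escape(String value) {
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        if (!needsQuotes) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Builds one "author,title" line with both fields escaped.
    static String row(String author, String title) {
        return escape(author) + "," + escape(title) + "\n";
    }

    public static void main(String[] args) {
        // Made-up sample values.
        System.out.print(row("Jane Doe", "Markets, Morals, and Media"));
    }
}
```

In the scraper, the line that builds each output row could call `CsvField.row(authorName, title)` instead of concatenating the two strings directly.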

Source: http://scraping.pro/scraping-amazon-webdriver-java/

Friday, 7 November 2014

Web Scraping Enters Politics

Web scraping is becoming an essential tool for gaining an edge in almost any field. This is shown by international news on US political campaigns, where it has been used specifically to identify wealthy donors. As is commonly known, election campaigns must follow rules that limit the amount of money each candidate may spend. As a result, much of the campaign activity must be paid for by supporters and sponsors.

It is not a surprise then that even politics is lured to make use of the dynamic and ever growing data mining processes. Once again, web mining has proven to be an essential component of almost all levels of human existence, the society, and the world as a whole. It proves its extraordinary capacity to dig precious information to reach the much aspired for goals of every individual.

Mining for personal information

CBC News online very recently disclosed that the US Republican presidential candidate Mitt Romney has used data mining in order to identify rich donors. It is reported that obtaining personal information such as buying history and church attendance was vital in this effort. Through this information, the party was able to identify prospective rich donors and indeed tap them. As a businessman himself, Romney knows exactly how to fish and where the fat fish are. Moreover, what is unique about the identified donors is that they had never donated before.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-enters-politics/

Wednesday, 5 November 2014

Web Scraping: The Invaluable Decision Making Tool

Business decisions are mandatory in any company. They reflect and directly influence the future of the company. It is important to realize that decisions must be made in any business situation. The generation of new ideas calls for new actions, and these in turn call for decisions. Decisions can only be made when there is adequate information or data regarding the problem and the course of action to be taken. Web scraping offers the best opportunity to get the information that will enable management to make a wise and sound decision.

Web scraping is therefore an important part of producing the practical insights that feed the business decision making process. Since businesses take many courses of action, the following areas call for adequate web scraping in order to make outstanding decisions.

1. Suppliers. Even if you are running an offline business, you need information about your suppliers. There are two situations here: the first concerns your current suppliers and the second concerns the possibility of acquiring new ones. Web scraping gives you the opportunity to gather information about your suppliers. You need to know which other businesses they supply and what kind of discounts and prices they offer them. Another important aspect of suppliers is the periods when they have a surplus, which helps you determine purchasing prices.

Web scraping can also provide information about new suppliers, which can give you a cutting edge in purchasing. You can find new suppliers that have reasonable prices, and this will go a long way towards ensuring a profitable business. Web scraping is therefore an integral step that should be taken before making a vital decision concerning suppliers.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-invaluable-decision-making-tool/