Naughty Google!
Google has once again been in the news… and not for good reasons. Recently it has been accused of scraping content from websites and publishing it directly in search results. It’s not the first time, and on this occasion the accuser is song lyrics website Genius, which had a “genius” way of catching Google red-handed: using Morse code to track its lyrics online. A recent Wall Street Journal report describes how Genius collected the evidence against Google. Google has denied the accusations, saying it does not knowingly scrape content from websites. But while the investigation continues, I thought it’d be a good opportunity to revive the content scraping debate.
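As reported, Genius’s trick was typographic: it alternated straight and curly apostrophes in its lyrics so that, read as dots and dashes, the sequence spelled out a hidden Morse-code message. A toy Python sketch of the idea (my own illustration, not Genius’s actual code) might look like this:

```python
# Toy illustration of the reported watermarking trick: alternating straight (')
# and curly (’) apostrophes in a lyric encode a hidden Morse-code pattern.
STRAIGHT, CURLY = "'", "\u2019"  # straight quote = dot, curly quote = dash

def extract_pattern(text: str) -> str:
    """Read the apostrophe sequence out of a block of text as Morse symbols."""
    return "".join(
        "." if ch == STRAIGHT else "-"
        for ch in text
        if ch in (STRAIGHT, CURLY)
    )

lyric = "I\u2019m on my way, don't stop, it\u2019s late, can't wait"
print(extract_pattern(lyric))  # prints "-.-." (Morse for C); a verbatim copy leaks the pattern
```

Any site that copies the lyrics character for character copies the watermark with them, which is exactly how Genius claims to have traced its text into Google’s results.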
What is content scraping?
Content scraping (aka web scraping) is simply the act of copying content from other websites and publishing it somewhere else. Original content is stolen mainly because of its value, but these days even ordinary content gets lifted simply because it’s easy to grab.
Content scraping is widely considered illegitimate when it’s done without the consent of the original source or author. Content scrapers typically lift entire pieces of content and republish them as their own. In the Genius case, however, Google is accused of copying content and displaying it right on the search results page. The case is also unusual because Genius does not hold the rights to the songs; it will be interesting to see how the case develops and what the outcome will be.
How does content scraping work?
Content scraping involves extracting data from websites, usually transforming unstructured page markup into structured data for analysis and/or repurposing. The job is carried out by a piece of code commonly called a ‘scraper’ (a minimal sketch of one follows the list below). In most cases, scrapers are programmed to target a specific type of website, the most common being:
- Online publications (original content)
- E-commerce websites (product prices, descriptions, reviews)
- Educational content (informative content)
- Blogs and articles (news, opinions)
- Directories (offices, employee details, partners)
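To make the mechanics concrete, here is a minimal sketch of what a scraper does, written in Python with the widely used requests and BeautifulSoup libraries; the URL and the HTML elements it targets are placeholders, since real scrapers are tailored to the markup of the site they copy from.

```python
# Minimal sketch of a content scraper: fetch a page, parse the HTML,
# and turn unstructured markup into structured data.
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

def scrape_article(url: str) -> dict:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Placeholder selectors: a real scraper targets the specific
    # markup of the site it is copying from.
    return {
        "title": soup.find("h1").get_text(strip=True),
        "body": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
    }

# article = scrape_article("https://example.com/some-post")
```

A scraper like this, pointed at a list of URLs and run on a schedule, is all it takes to mirror a site’s content wholesale.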
Impact of content scraping on your business
The impact this kind of activity can have on your business is wider-reaching and far more damaging than many realise. Below are three key ways scraping can have a detrimental effect on your bottom line.
Lower Click Volumes – Web scraping is an issue that can deeply impact online publishers. It’s an increasingly common complaint that Google’s rich snippets and information boxes take traffic away from the original publishers. In Genius’s case, song lyrics scraped from its site were displayed in full on Google’s results pages, eliminating the need for users to click through to the site and cutting its organic traffic.
Revenue Loss – Stolen content is commonly associated with loss of ranking positions on Google, and this can affect businesses negatively, especially when the website is the primary source of revenue via online traffic and transactions. It is well known that duplicate content on the web can trigger website devaluation, which in turn causes drops in ranking positions and click-throughs to the site. Less traffic means fewer conversions, and fewer conversions mean less revenue for the business.
Loss of Competitive Advantage – Competitors also love web scraping as a way to compare product prices. Real estate and travel websites, for example, typically see a large share of their bot activity driven by price comparison. In addition, competitors scraping your website can gain a further advantage if the content they republish elsewhere ends up credited to them.
Does this sound familiar?
The legal debate
Web crawling and scraping aren’t illegal in themselves. Scraping is actually an essential part of how the internet works: search engines use scraping techniques to build their search indexes, so not all bots are malicious. After all, someone could crawl or scrape their own site for their own benefit. Issues arise when a website is scraped without prior permission from the content author, or in disregard of the site’s terms of service, and they worsen when the content taken is original work.
How to detect content scrapers
If you suspect your site is being scraped for content, here’s what you should look out for:
- Bandwidth Monitoring – Monitoring your server’s bandwidth can give you an early alert about malicious bots scraping your site. Unusual bandwidth consumption that causes throughput problems is normally a red flag.
- Searching for Your Own Content – Check whether your content shows up on other websites indexed by Google. Search for exact phrases, titles and sentences from your content and see if they’ve been republished elsewhere.
- Monitoring Rankings and Search Traffic – If traffic to specific content has dropped considerably, someone may have republished it on a larger website that now outranks yours, causing a drop in ranking positions and a loss of organic clicks to your site.
- Monitoring IP Addresses – A high number of requests from the same IP can indicate bots coming from a single source. Often these requests arrive at near-equal intervals, which should raise suspicion (see the log-analysis sketch after this list).
- Monitoring Requested Files – If certain visitors never request specific files, that can also indicate web scraping. Typical files that bots skip, but real browsers fetch automatically, are favicon.ico, CSS and JavaScript files (also covered in the sketch after this list).
- Implementing ‘Honeypot’ Pages – These are webpages that humans will not normally visit because their links are hidden via CSS. Scrapers, however, tend to follow every link on a page, hidden ones included. Once a visitor reaches the honeypot page, you can log its details and block it (a sketch follows this list).
- Monitoring Incoming Links – Using a backlink tool to analyse links pointing to your site can reveal whether your content is being scraped and republished elsewhere: if your content contains internal links, those links will point back to your website from the scraper’s site.
- Monitoring Mentions – Use content-monitoring software to get notified of new search results matching your specific content. This way you can easily detect where your original content has been republished.
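To illustrate the IP and requested-file checks above, here is a rough Python sketch that scans a web server access log; the log format, field layout and thresholds are assumptions that would need adapting to your own server.

```python
# Rough sketch: flag IPs that hit pages at machine-like, near-constant
# intervals and never fetch the assets (CSS, JS, favicon) a real browser
# would load. Log format and thresholds are assumptions.
import re
from collections import defaultdict
from datetime import datetime
from statistics import pstdev

# Matches the common Apache/Nginx combined-log prefix:
# 203.0.113.7 - - [10/Jul/2019:13:55:36 +0000] "GET /page HTTP/1.1" ...
LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+)')
ASSETS = (".css", ".js", "favicon.ico")

def parse_ts(ts: str) -> float:
    return datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z").timestamp()

def suspicious_ips(log_lines, min_requests=50, max_jitter=0.5):
    times, paths = defaultdict(list), defaultdict(list)
    for line in log_lines:
        m = LINE.match(line)
        if m:
            ip, ts, path = m.groups()
            times[ip].append(parse_ts(ts))
            paths[ip].append(path)
    flagged = []
    for ip, stamps in times.items():
        if len(stamps) < min_requests:
            continue
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        regular = pstdev(gaps) < max_jitter          # near-identical intervals
        no_assets = not any(p.endswith(ASSETS) for p in paths[ip])
        if regular or no_assets:
            flagged.append(ip)
    return flagged
```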
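And here is one way the honeypot idea could be wired up, sketched with Flask (an assumed choice of framework; the route name and blocklist handling are illustrative only):

```python
# Sketch of a honeypot trap using Flask (assumed framework; the route and
# blocklist handling are illustrative only).
from flask import Flask, request, abort

app = Flask(__name__)
BLOCKED = set()  # in production this would live in a datastore or firewall

# In your page templates, include a link humans never see:
#   <a href="/trap-page" style="display:none">archive</a>
# Real visitors never follow it; naive scrapers that walk every href do.

@app.route("/trap-page")
def honeypot():
    BLOCKED.add(request.remote_addr)  # record the scraper's IP
    return "", 204

@app.before_request
def refuse_known_scrapers():
    if request.remote_addr in BLOCKED:
        abort(403)
```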
Protecting your Content
The following guidelines should offer some help with content protection.
- Copyright Notice – Include a clear copyright notice on your website, with guidelines specifying what kind of use you allow. This can avoid confusion and deter some people from copying and reusing your content
- Blocking Suspected Proxies – Once you’ve identified the sources sending malicious bots to your site, block their IPs or proxy ranges to prevent further scraping
- Request Limitation – Limit the number of requests a user agent from a single IP can make within a given time window. This can filter out scrapers (a minimal rate-limiter sketch follows this list)
- Implementing CAPTCHA – Implementing CAPTCHA, where possible, can make life harder for content scrapers. CAPTCHAs separate humans from bots by posing problems only real humans can easily solve, e.g. reCAPTCHA (a server-side verification sketch follows this list)
- Internal Linking – Making sure your content links to other parts of your site can help detect scrapers, because those internal links will still point to your site once the content is republished elsewhere. This isn’t totally foolproof, as a scraper may go to the trouble of removing all internal links from your content, but it at least makes the job more difficult
- RSS Configuration – If you have an RSS feed, configure it to display post summaries rather than full posts. This adds an extra layer of protection against scrapers
- Image Protection – If you publish original images, adding watermarks to them demonstrates ownership (a watermarking sketch follows this list)
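As promised above, here is a minimal sliding-window rate limiter sketch in Python for the request-limitation idea; the window size and limit are arbitrary example values, and production setups usually enforce this at the proxy or CDN layer.

```python
# Minimal sketch of per-IP request limiting: allow at most LIMIT requests
# per IP within any WINDOW-second span. Values are illustrative.
import time
from collections import defaultdict, deque

WINDOW, LIMIT = 60, 100  # e.g. at most 100 requests per minute per IP

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow_request(ip: str) -> bool:
    now = time.monotonic()
    hits = _hits[ip]
    while hits and now - hits[0] > WINDOW:  # drop requests outside the window
        hits.popleft()
    if len(hits) >= LIMIT:
        return False  # over the limit: throttle or block this client
    hits.append(now)
    return True
```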
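For the CAPTCHA step, the server-side half of a reCAPTCHA check looks roughly like this; the secret key is a placeholder for the one issued in the reCAPTCHA admin console.

```python
# Sketch of server-side reCAPTCHA verification: the token the browser widget
# submits with a form is checked against Google's verification endpoint.
# Requires: pip install requests. "YOUR_SECRET_KEY" is a placeholder.
import requests

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"

def is_human(token: str, secret: str = "YOUR_SECRET_KEY") -> bool:
    resp = requests.post(
        VERIFY_URL,
        data={"secret": secret, "response": token},
        timeout=10,
    )
    return resp.json().get("success", False)
```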
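Finally, watermarking an image can be as simple as this Pillow sketch; the file names, position and opacity are arbitrary choices.

```python
# Sketch: stamp a semi-transparent text watermark onto an image with Pillow.
# Requires: pip install Pillow. File names and styling are illustrative.
from PIL import Image, ImageDraw

def watermark(src: str, dst: str, text: str = "© example.com") -> None:
    base = Image.open(src).convert("RGBA")
    overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    # Bottom-right corner, semi-transparent white text.
    draw.text((base.width - 200, base.height - 40), text, fill=(255, 255, 255, 128))
    Image.alpha_composite(base, overlay).convert("RGB").save(dst)

# watermark("photo.jpg", "photo_marked.jpg")
```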
Content scraping techniques and technologies evolve every year, so it’s now more important than ever to keep an eye on your valuable content.
If you’d like to find out more about Content Protection and SEO, please get in touch with one of our experts.
Joe Volcy
Joe Volcy is the founder and CEO of Volvox Digital. Joe is an award-winning SEO and Content Marketing expert with over ten years’ experience in maximising organic performance and developing inbound marketing strategies for both B2B and B2C clients.