PHPMatters Help You Better Hosting Your PHP-based Sites

How to Deal With Website Scraping to Better Protect Your Site Content

Website scraping, also called content scraping, is a major source of frustration for webmasters. Imagine that you work hard on your blog posts: researching hot topics, writing carefully, going through a strict review process, and publishing the posts with well-organized graphics. Then, a few hours or even a few minutes later, someone copies your work and publishes it on their own website without your permission. This practice is called website scraping, and it is a serious and annoying problem for website owners.

Many webmasters do not know how to deal with this kind of theft, so we have listed some tips below on how to handle website scraping and better safeguard your web content.

Find Out Who Is Scraping Your Site

Website scraping not only frustrates you because your work is stolen without your knowledge, but can also hurt your website's SEO, ranking, traffic, and revenue. If your site shows any of the following symptoms, you should check whether someone has scraped it.

  • Increased bandwidth utilization – Content scraping can lead to hotlinking. When someone scrapes your posts, the inner HTML usually stays unchanged, including the links to images hosted on your domain. Every time your posts are viewed on the scraper's site, the images are downloaded from your server, which uses up your monthly bandwidth much faster.
  • Decreased search engine ranking – Some spam sites that scrape your content can even outrank your original posts.
  • Decreased traffic and page views – Once your content is stolen and the scrapers outrank you, your traffic and daily visits are the next things to drop.
  • Loss of revenue and subscriptions – With less traffic, you are likely to lose advertising income and subscribers as well.
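
The hotlinking symptom in the first bullet can be checked from your own request data. The following is a minimal sketch, not a complete log analyzer: it assumes you have already extracted (path, referer) pairs from your access log, and the function name, the sample hosts, and the list of image extensions are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed set of image extensions; adjust to what your site actually serves.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


def count_hotlinking_referrers(requests, own_host):
    """Count image requests per external referring host.

    `requests` is an iterable of (path, referer) pairs, for example
    parsed from a web server access log.
    """
    counts = Counter()
    for path, referer in requests:
        # Only image downloads matter for hotlinking.
        if not path.lower().endswith(IMAGE_EXTENSIONS):
            continue
        host = urlparse(referer).hostname  # None when the referer is empty
        if host and host != own_host:
            counts[host] += 1
    return counts


# Hypothetical sample data for illustration only.
sample = [
    ("/images/chart.png", "https://example.com/post"),
    ("/images/chart.png", "https://copycat.example/stolen-post"),
    ("/images/photo.jpg", "https://copycat.example/another"),
    ("/index.html", "https://copycat.example/"),
    ("/images/chart.png", ""),
]
print(count_hotlinking_referrers(sample, "example.com").most_common())
# prints [('copycat.example', 2)]
```

A host that shows up with many image requests but is not one of your own domains is a candidate scraper worth checking by hand.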

Naturally, you will want to know who is scraping your site and causing these losses. You can follow the steps below.

  • Use CopyScape. This is a paid tool that detects plagiarized content precisely. Enter your blog post into the check box and click the Premium Search button, and within a few seconds you can see which sites have similar or identical content, making it easy to identify the scraper.
  • Use Google Alerts. If your website is large and has plenty of posts, CopyScape may not be cost-effective. In that case, we recommend Google Alerts, which is free of charge. Again, put text from your posts into the search box, and pages with the same content are listed automatically. You can also create alerts that notify you when other websites publish your content. The checking frequency and the delivery email address are up to you.
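
Once one of these tools points you to a suspicious page, you may want a rough number for how much of your post it reuses. The sketch below compares two texts by their overlapping word sequences (Jaccard similarity over word shingles). It is only an illustration of the idea and is not how CopyScape or Google Alerts work internally; the function names and the shingle size of 5 are assumptions.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences of length `size`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Very short text: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a, b, size=5):
    """Shared shingles divided by all shingles; 0.0 means no overlap, 1.0 identical."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


original = "Website scraping frustrates webmasters because their posts are copied without permission"
suspect = "WEBSITE scraping frustrates webmasters because their posts are copied without permission!"
unrelated = "The quick brown fox jumps over the lazy dog near the old river bank"

print(jaccard_similarity(original, suspect))    # prints 1.0
print(jaccard_similarity(original, unrelated))  # prints 0.0
```

A high score on a page you did not publish suggests a copy, while a low score may simply mean that the page quotes a sentence or two.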


Remove the Scraped Content

In the past, many large and popular websites did not care about scraping. They simply kept producing quality posts, trusting that Google recognized their authority. Since the Google Panda update, however, more and more webmasters have found that Google flags them as the scrapers and treats the real scraping sites as the source pages.

To avoid this situation, contact the scrapers directly and ask them to remove the copied content. You may run into one of two cases.

  • You can find a contact channel on their site. In this case, send your request directly.
  • You cannot find any contact details. In this case, use a WHOIS lookup to find the contact information of their domain registrar and web host.

After that, there are again two possible outcomes.

  • The scrapers agree to remove the copied content.
  • They refuse or simply ignore you. In this case, contact their web host, submit a DMCA complaint, and ask the host to take the content down. If the web host does not act, you can send the same complaint to Google, which can remove the scraping pages from its search results.

Prevent Website Scraping

Even after you have dealt with every existing case of scraping on your site, you are not completely safe. The next step is to prevent the problem from happening again. Here are some tips that can help.

Block the IP addresses of scrapers

To stop the same scrapers from stealing your content again, you can identify their IP addresses, for example through a WHOIS lookup or your server access logs, and block them. Add the following line to your .htaccess file, replacing the placeholder with the address to block.

Deny from xxx.xxx.xxx
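
If you have more than a handful of addresses to block, typing the lines by hand is error-prone. The following is a minimal sketch that validates a list of addresses and prints one Deny line per entry in the same syntax as above. The function name and the sample addresses (taken from the reserved documentation ranges) are hypothetical, and the output still has to be pasted into your .htaccess file yourself.

```python
import ipaddress


def build_deny_rules(addresses):
    """Return one 'Deny from ...' line per valid IP address or CIDR range."""
    rules = []
    for raw in addresses:
        try:
            # strict=False accepts ranges written with host bits set,
            # e.g. 198.51.100.5/24 is normalized to 198.51.100.0/24.
            network = ipaddress.ip_network(raw.strip(), strict=False)
        except ValueError:
            continue  # skip anything that is not a valid address
        if network.num_addresses == 1:
            target = str(network.network_address)  # single host, no prefix
        else:
            target = network.with_prefixlen        # range in CIDR notation
        rules.append(f"Deny from {target}")
    return rules


# Hypothetical input: one single address, one typo, one range.
for line in build_deny_rules(["203.0.113.7", "not-an-ip", "198.51.100.0/24"]):
    print(line)
# prints:
# Deny from 203.0.113.7
# Deny from 198.51.100.0/24
```

Invalid entries are skipped silently here to keep the example short; in practice you would probably want to log them so that a typo does not leave a scraper unblocked.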

Set up the login for accessing the content

Normally, scrapers do not need to identify themselves when they access your site. If you require visitors to register and log in before they can read your content, every request has to carry some identifying information, which makes it much easier to trace who is scraping you. This also lowers your chances of being scraped at all, because many scrapers find the registration process too time-consuming.

However, this method has a major drawback: it is inconvenient for your ordinary readers.

Fight against the scraping tools

Many dishonest webmasters rely on tools that automatically collect content from your site. You can create Honey Pot pages to fight against these tools.

A Honey Pot page is a special webpage that human visitors never see, for example because it is reachable only through a hidden link, so only robots and crawlers end up requesting it. Once you find that a client has visited one of your Honey Pot pages, you can be fairly sure that it is not a human visitor but possibly a scraping tool looking to steal your content. You can then block all requests from that client.

You can also embed your content in graphical or media objects to defeat scraping tools. Most of these tools extract text strings from the HTML, so if you render the text inside media objects such as images, it is much harder to steal. The drawback is that your pages may load more slowly.
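
The Honey Pot check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the Honey Pot path, the function names, and the sample addresses are hypothetical, requests are assumed to be available as (client IP, path) pairs, and well-behaved search engine crawlers are assumed to be kept away from the page (for example through robots.txt) so that they are not flagged by mistake.

```python
# Hypothetical URL path of the Honey Pot page.
HONEYPOT_PATH = "/do-not-follow/"


def find_honeypot_visitors(requests, honeypot_path=HONEYPOT_PATH):
    """Return the sorted client IPs that requested the Honey Pot page."""
    return sorted({ip for ip, path in requests if path.startswith(honeypot_path)})


def filter_requests(requests, blocked):
    """Drop every request that comes from a blocked client IP."""
    return [(ip, path) for ip, path in requests if ip not in blocked]


# Hypothetical request data for illustration only.
sample = [
    ("203.0.113.7", "/blog/post-1"),
    ("203.0.113.7", "/do-not-follow/trap.html"),
    ("198.51.100.20", "/blog/post-1"),
    ("203.0.113.7", "/blog/post-2"),
]

blocked = find_honeypot_visitors(sample)
print(blocked)                           # prints ['203.0.113.7']
print(filter_requests(sample, blocked))  # prints [('198.51.100.20', '/blog/post-1')]
```

The flagged addresses could then be fed into whatever blocking mechanism you already use, such as the .htaccess rules described earlier.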

Summary

Fighting website scraping takes time and patience, and no prevention method is free of drawbacks. If you are tired of dealing with it yourself, you can pay for a tool that protects your site on a daily basis. We think ShieldSquare is one option.

This tool sorts your clients into three groups: ordinary visitors, search engines, and bad bots. When it receives a request, it automatically lets the first two groups into your site and blocks the last group permanently.

The drawback of this option is the price. ShieldSquare has six plans in total, and even the cheapest one, Squad, costs $59 per month. If your website is large and commercial, this option may be worth it. If not, you are better off with the free methods described above.