Web scraping, the copying and indexing of content from websites, is always a tricky issue to discuss because there’s a significant grey area between good robot (bot) activities and malicious ones.
Legitimate web crawlers, like Googlebot, can benefit websites. Other beneficial bots include media monitoring and measurement spiders that scan the internet for brand mentions and other keyword terms for PR and marketing personnel and measure the effectiveness of PR campaigns for companies.
Malicious bots can damage websites and sometimes cause long-term and even permanent harm. Differentiating between good and bad bots and managing them properly can be a challenge for webmasters.
For PR professionals and marketers, content is arguably the most important asset, and protecting it should be a top priority. Understanding how to prevent malicious web scraping will protect both your website content and your hard-earned brand reputation.
We will discuss best practices for keeping your content safe from illegal scrapers while still allowing beneficial crawlers to help your site.
1. Investing in a Bot Management Solution
The most effective practice is to invest in a comprehensive bot detection and management solution. Today’s sophisticated gen-4 bots are getting better at mimicking human behaviors by utilizing AI and machine learning technologies. Detecting their presence is challenging because they rotate between thousands, if not hundreds of thousands, of IP addresses. (An IP or internet protocol address identifies the source of the bot.) They also use proxy IPs to camouflage the source.
An AI-powered, machine learning-based web scraping detection and protection solution can counter this phenomenon. The bot management solution should be able to analyze and filter your traffic in real-time without sacrificing your site’s performance.
2. Basic Preventive Methods for Web Scraping
While there is no one-size-fits-all solution to prevent web scraping, here are some basic preventive measures to block unwanted scraping bots while also managing traffic from good web scrapers:
- Install a well-written robots.txt file that lets you block all bots or selectively block unwanted ones. Malicious bots may not follow the directives in your robots.txt, but it can help control the activities of good bots.
- Don’t make your URLs easily scrapable. For example, if you have URLs ending with blog/1, blog/2, blog/10, and so on, bots can easily and quickly scrape the content of your site.
- Don’t list all your content on a single page (e.g., a page listing all your blog posts). Let users use the search function to find your content, and limit the number of search results.
- Limit the repetitive and/or excessive activity from one IP address. Demand authentication (CAPTCHA, 2-factor authentication, etc.) if necessary.
- Monitor your traffic log regularly to track any suspicious activity.
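As an illustration of the per-IP limit described in the list above, here is a minimal sliding-window sketch in Python. The class name, the default thresholds (100 requests per 60 seconds), and the in-memory storage are all assumptions made for this example, not recommendations; a production site would more likely rely on its web server, CDN, or bot management tool for this.

```python
from collections import defaultdict, deque
import time


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client IP.

    Hypothetical helper for illustration; state is kept in memory only.
    """

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now=None):
        """Return True if the request is within the limit, False otherwise."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the caller decides what to do next
        hits.append(now)
        return True
```

A request handler could call `limiter.allow(client_ip)` on every request and, when it returns False, ask for a CAPTCHA or other authentication rather than blocking outright.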
3. Block Hotlinking
Hotlinking happens when another website displays your content by linking directly to resources hosted on your server, often framing them so your content appears under their masthead.
By preventing hotlinking, you prevent the thief from serving resources hosted on your server. A simple but effective way to do this is to replace the image or resource the attacker is targeting with another image. You can even replace it with an image that reads “this is stolen,” or something similar, by using a rewrite rule like this one in your Apache configuration (for example, in an .htaccess file):
RewriteRule \.(jpg|jpeg|png|gif)$ http://www.yourdomain.com/imagename.png [NC,R,L]
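Simply replace yourdomain.com with your own domain and imagename.png with the image file of your choice. (The image must be hosted on your site.) Note that, on its own, this rule matches every image request; it is normally paired with RewriteEngine On and RewriteCond lines that check the Referer header, so that only requests coming from other sites are redirected.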
Preventing hotlinking ensures that even if your content is scraped, the copies won’t burden your server’s resources.
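The decision behind hotlink protection can be sketched in a few lines of Python. This is only an illustration of the logic, assuming the placeholder domain and image name used above; the function name and the choice to allow requests that carry no Referer header are assumptions for this example. Keep in mind that the Referer header can be omitted or forged, so this check deters casual hotlinking rather than determined scrapers.

```python
from urllib.parse import urlparse

# Placeholders for illustration; substitute your own domain and image.
ALLOWED_HOSTS = {"yourdomain.com", "www.yourdomain.com"}
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")
REPLACEMENT_PATH = "/imagename.png"


def resolve_image_path(path, referer):
    """Return the path to serve for a request, given its Referer header."""
    if not path.lower().endswith(IMAGE_EXTENSIONS):
        return path  # not an image, so hotlink protection does not apply
    if path == REPLACEMENT_PATH:
        return path  # always serve the replacement itself to avoid a loop
    if not referer:
        return path  # no Referer (direct visit, privacy tools): allow
    host = (urlparse(referer).hostname or "").lower()
    if host in ALLOWED_HOSTS:
        return path  # the request comes from one of your own pages
    return REPLACEMENT_PATH  # embedded by another site: swap the image
```

For example, `resolve_image_path("/img/logo.png", "https://scraper.example/page")` returns the replacement image path, while the same request coming from a page on your own domain returns the original path.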
4. CAPTCHA
The good old CAPTCHA, which stands for Completely Automated Public Turing test to tell Computers and Humans Apart, is still a basic but effective approach to blocking bot activities while allowing legitimate users. However, there are two key concerns when using CAPTCHAs to prevent web scraping:
- Requiring too many CAPTCHAs on your website can significantly lower your site’s user experience (UX), and might cause a high bounce rate.
- With the presence of CAPTCHA farms, hackers can ask a real human to solve the CAPTCHA before the scraper bot takes over, rendering CAPTCHA useless.
So, although CAPTCHAs can still be effective, they are not a one-size-fits-all answer to web scraping. Use them sparingly and try to find the right balance between security and usability.
5. Obfuscation and Restrictions
The typical modus operandi of a web scraper bot is to download the HTML of a URL and then extract the target content. By obfuscating your valuable data and making your web code excessively complicated, you can make this process more difficult without sacrificing your site’s performance.
Another way is to encode the data in images, so the scraper bot needs to use OCR (Optical Character Recognition) techniques before it can extract the data, which can be difficult to do accurately. However, images also force every visitor’s browser to download more data, which degrades the experience for legitimate users.
Most webmasters of standard corporate websites find these techniques too difficult for the benefits they produce. They matter more when protecting semi-private data. Standard username/password access can also prevent bot access to sensitive content.
6. Fictitious Error Messages
When you do block or limit bot activities, don’t tell the website visitor what actually caused the block, or the operator can use that information to fix their scripts and reattempt the attack. For example, you shouldn’t provide error messages like “UA header not present” or “too many requests from your IP address.”
Instead, use a fictitious error message to:
- make the experience more user-friendly (and less scary) for legitimate users, and
- prevent the web scraper’s operator from knowing why you blocked them.
You can say something like “Sorry! Something went wrong. You can contact support at __________ if the problem persists.” This way, legitimate users who are blocked by mistake can contact you, which limits the damage from false positives.
Another consideration is to show a CAPTCHA instead of implementing a hard block immediately, which might be effective in cases when you are not 100% sure whether it is a legitimate user or a bot.
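The idea of hiding the real reason while still recording it for yourself could look like the following Python sketch. Everything here is hypothetical and for illustration only: the function name, the bot-confidence score between 0 and 1, the 0.9 threshold for a hard block, and the short reference code added to the message so support staff can look up the incident.

```python
import logging
import uuid

logger = logging.getLogger("bot_filter")

GENERIC_MESSAGE = "Sorry! Something went wrong. Please contact support if the problem persists."


def build_block_response(internal_reason, confidence):
    """Return (action, public_message, incident_id) for a suspicious request.

    `internal_reason` is written to the log but never shown to the visitor.
    `confidence` is a hypothetical 0..1 score that the visitor is a bot.
    """
    incident_id = uuid.uuid4().hex[:8]
    # The real cause stays in your own logs.
    logger.warning("incident=%s reason=%s confidence=%.2f",
                   incident_id, internal_reason, confidence)
    # Hard-block only when very sure; otherwise fall back to a CAPTCHA.
    action = "block" if confidence >= 0.9 else "captcha"
    public_message = f"{GENERIC_MESSAGE} (Reference: {incident_id})"
    return action, public_message, incident_id
```

Calling `build_block_response("UA header not present", 0.5)` yields the "captcha" action and a message that mentions only the reference code, never the missing header.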
The basic principles in preventing web scraping attacks are to identify and monitor visitors with a high level of activity, while also enforcing terms and conditions that stop malicious web scraping (e.g., robots.txt). It’s important to remember that unsavory web scrapers will attempt to disguise their activity as that of legitimate users, and detecting this should be the main focus.
Having a proper bot management solution that can differentiate between legitimate traffic and bot activity in real-time is the most effective approach to preventing unwanted web scraping. Some AI-powered solutions can use both fingerprinting-based and behavioral-based approaches to detect and manage malicious bot activities.