With 1.8 billion websites out there, you would think humanity is busy constantly creating and consuming internet content. In actuality, only about 200 million of those 1.8 billion websites (roughly 11%) are active, and over 60% of all traffic is generated by bots. To be clear: when you detect unusual activity on your website, you should worry. In all likelihood, it is bots attempting to steal your data or content, in a process known as web scraping.
What is web scraping?
Web scraping is the process of extracting or retrieving data from a website. This can be done manually or automatically. The lowest form of web scraping is the manual copy-and-paste model. At a more “professional” level, there are countless tools, from paid software to free Python libraries. Automated scripts discover your URLs, masquerade as regular users and start hitting those URLs to extract the data. This intensive burst of activity often impairs site performance and can easily lead to brand deterioration.
What is web scraping used for?
Web scrapers can be used for a number of legitimate purposes, among which are:
- tracking user behavior for research or marketing purposes;
- market analysis (competitor monitoring, news aggregation, price intelligence, etc.);
- brand monitoring;
- collecting and aggregating free information (such as data mining performed on public data repositories, real estate listings and weather apps collecting information from internet sources);
- tracking website changes.
On the other hand, unethical web scraping retrieves information for purposes such as:
- collecting contact information illegally;
- stealing content;
- degrading a website’s performance;
- draining website resources.
Nowadays, creating original content is not enough. You must also actively protect both your content and the information your website holds from such omnipresent threats. For that, you must understand what web scraping is – and what it is not.
Web scraping vs web crawling
Web crawling is the activity a bot (an automated script) performs to retrieve and index information about a web page. Search engines are able to deliver search results because they crawl and index pretty much the entire internet in search of keyword matches, authority signals, etc.
Web crawling is meant to discover full generic data sets in order to index information about websites. Web scraping, on the other hand, goes further to extract specific data sets in order to analyze and exploit them for a specific purpose.
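To illustrate the distinction, here is a minimal sketch in Python, assuming the third-party requests and beautifulsoup4 libraries; the URL and the CSS selector are hypothetical placeholders.

```python
# Crawling discovers pages generically; scraping targets specific data.
# Hypothetical URL and selector; requires requests and beautifulsoup4.
import requests
from bs4 import BeautifulSoup

def crawl(url):
    """Crawling: discover every link on a page, to index pages later."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

def scrape(url):
    """Scraping: extract one specific data set, e.g. prices, to exploit it."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(".price")]

print(crawl("https://example.com"))   # generic discovery
print(scrape("https://example.com"))  # targeted extraction
```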
Web scraping vs data mining
Data mining is the act of finding and extracting large amounts of raw data from virtually any source, with the purpose of analyzing data sets to derive information and knowledge.
Web scraping can be used for data mining purposes, too. However, data can also be mined from a variety of other sources, such as private or public data sets (databases) or cookies. Data mining can give information about disease trends, buyer behavior and marketing success, the weather and more.
Web scraping vs screen scraping
Web scraping extracts specific information from inside a website via HTTP requests and HTML parsing. Screen scraping gathers pixels, i.e., screen display data: it detects the visual data displayed on the screen and collects the on-screen elements (text or images).
Screen scraping is often used to track a user’s activity or journey on a website, to extract information about a company’s webpage, or to steal private user information.
How is web scraping done?
Copy/paste is not a scalable scraping technique. The real threat comes from more advanced, cheaper and less resource-intensive forms of scraping.
Programming languages, Python in particular, are often used to extract information with simple regular expressions or grep-style commands. The script requests a page, parses the HTML it receives, then decodes and reassembles the information into a legible format (a minimal sketch follows below).
The whole process can take between minutes and hours, depending on the quantity of data.
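To make this concrete, here is a minimal sketch of that loop using only Python's standard library. The URL and the regular expression are hypothetical placeholders; real scrapers typically use dedicated parsing libraries rather than regex alone.

```python
# Minimal regex-based scraping sketch, standard library only.
# The URL and the price pattern below are hypothetical.
import re
import urllib.request

url = "https://example.com/products"
with urllib.request.urlopen(url, timeout=10) as response:
    html = response.read().decode("utf-8")  # raw bytes -> text

# "Parse" the HTML with a simple regular expression:
# grab every price-like string such as $19.99.
prices = re.findall(r"\$\d+(?:\.\d{2})?", html)

# Reassemble the extracted data into a legible format.
for price in prices:
    print(price)
```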
How to protect against web scraping
How do you make sure that Google, for instance, is still able to crawl and index your website, but that your content is safe and remains unique? You need to remain alert and actively work to protect your data, using whichever of the solutions described below is most appropriate for you.
Dedicated bot management software
- CAPTCHA technology, such as reCAPTCHA Enterprise, adds a security layer that prevents automated scripts from accessing content.
- Cloudflare provides not only CDN and DDoS protection but also bot mitigation security.
- Imperva (which acquired Distil Networks) is a service that polices web traffic, detecting and neutralizing malicious bots.
- DataDome is another service that offers protection against scraping, scalping, credential stuffing, DDoS attacks and carding fraud.
Other anti-scraping techniques
- Robots.txt is a file that tells crawlers which parts of your site they may crawl and index. Keep in mind that it is advisory: legitimate bots honor it, while suspicious scripts can simply ignore it (see the example after this list).
- SSL/TLS certificates encrypt the traffic between your website and its users, protecting user information in transit. They are useful not only against web scraping but also as a minimum level of general security.
- Detect bot-like browsing patterns, such as unusual volumes of item views, monitor the offending accounts and block their IP addresses. For this, you can use geolocation data or look the address up in a DNS-based block list.
- Block HTTP requests that carry unwanted User-Agent headers (the sketch after this list combines this check with honeypots and throttling).
- Change your HTML often, at least at the level of id and class attributes. Since scrapers rely on parsing your HTML patterns and markup structure, even small changes can throw them off track.
- Add honeypots to trap the scrapers. This is usually done by creating fake pages that only a bot would visit. If you detect activity on these pages, you can safely block the originating IP.
- Throttle requests, i.e., limit the number of requests/actions in a certain time frame.
- Enforce Terms and Conditions by requiring users to tick a box.
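As an illustration of the robots.txt entry above, here is a minimal example; the bot names and paths are hypothetical, and remember that these rules are only honored by well-behaved bots.

```
# Hypothetical robots.txt: let Google's crawler in,
# keep every bot out of the private section.
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /private/
```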
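And here is a minimal sketch of the User-Agent, honeypot and throttling ideas above, written with the Flask microframework. The blocklist entries, the honeypot path and the rate limits are hypothetical and far simpler than what production bot management does.

```python
# Toy Flask app combining three defenses from the list above:
# User-Agent blocking, per-IP throttling and a honeypot URL.
# All names and thresholds are hypothetical.
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

BLOCKED_AGENTS = ("python-requests", "scrapy", "curl")  # hypothetical list
RATE_LIMIT = 30       # max requests per IP...
WINDOW_SECONDS = 60   # ...within this many seconds
banned_ips = set()
recent_hits = defaultdict(deque)  # IP -> timestamps of recent requests

@app.before_request
def filter_bots():
    ip = request.remote_addr
    if ip in banned_ips:
        abort(403)

    # Block requests whose User-Agent matches a known scraping tool.
    agent = (request.headers.get("User-Agent") or "").lower()
    if any(bad in agent for bad in BLOCKED_AGENTS):
        abort(403)

    # Throttle: discard timestamps outside the window, then count.
    now = time.time()
    hits = recent_hits[ip]
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    hits.append(now)
    if len(hits) > RATE_LIMIT:
        abort(429)  # Too Many Requests

# Honeypot: no human is ever linked here, so any visitor is a bot.
@app.route("/honeypot-do-not-follow")
def honeypot():
    banned_ips.add(request.remote_addr)
    abort(403)

@app.route("/")
def index():
    return "Hello, human!"

if __name__ == "__main__":
    app.run()
```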
Conclusion
Data is the new gold mine — and it is incredibly easy to steal. Any of the solutions mentioned above will help protect it. The first step, though, is to be aware and alert. Right now, bots are hitting your URLs in search of usable data. Are you prepared to fend them off? After all, your brand’s health depends on how well you protect your website’s content and your users’ information.