Notes On Designing Web Scraping System

July 8, 2018 By: Wasim Akram Khan

Web scraping serves many purposes. You could be looking for text, images, videos, and so on. At Paaila Technology, we mainly look for text when scraping the web. Over time, we have learned that a web scraping system should have the following features, listed in order of importance.

Avoid Duplicate Links

The worst thing that can happen while scraping the internet is getting stuck in a loop, visiting the same page over and over and never moving past it. This is why you need to keep a list of visited links. Visited links serve a dual purpose. They keep you from visiting a page multiple times, which wastes computing resources. They also let you resume operation if the program exits for any reason.
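As a minimal sketch of the visited-links idea, the crawl below keeps a set of URLs it has already seen and skips any link that is already in it. The `crawl` function, the `get_links` callback, and the `SITE` dictionary are hypothetical names used only for illustration; the dictionary stands in for real pages so the example runs without a network connection. Dropping the `#fragment` part of a URL is one assumed way of treating near-identical links as the same page.

```python
from collections import deque
from urllib.parse import urldefrag


def crawl(start, get_links):
    """Breadth-first crawl that visits each URL at most once."""
    visited = set()
    queue = deque([start])
    order = []
    while queue:
        # Drop any "#section" suffix so variants of one page compare equal.
        url, _fragment = urldefrag(queue.popleft())
        if url in visited:
            continue  # already handled, do not loop back
        visited.add(url)
        order.append(url)
        queue.extend(get_links(url))
    return order


# Hypothetical two-page site whose pages link back to each other.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/#top"],
    "http://example.com/a": ["http://example.com/", "http://example.com/a"],
}

pages = crawl("http://example.com/", lambda url: SITE.get(url, []))
```

Without the `visited` set, the two pages above would keep sending the crawler back and forth forever.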

Resume Operation : The web scraping program can halt for a myriad of reasons: unhandled exceptions, internet outages, erratic power supply, and more. You would not want to start scraping from the beginning when that happens, because that would be time-consuming and inefficient in many respects. To resume, you will need to store the list of visited links somewhere that survives a restart, such as a file or a database.
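One simple way to make the visited list survive a crash is to append each link to a file as soon as it is visited and reload that file on startup. The sketch below assumes a plain text file with one URL per line; the `VisitedStore` class and the `visited.txt` file name are hypothetical, not the system we actually run.

```python
import os
import tempfile


class VisitedStore:
    """Append-only log of visited URLs that is reloaded on startup."""

    def __init__(self, path):
        self.path = path
        self.seen = set()
        if os.path.exists(path):
            # Reload links recorded by a previous run.
            with open(path, encoding="utf-8") as f:
                self.seen = {line.strip() for line in f if line.strip()}

    def add(self, url):
        """Record url; return False if it was already visited."""
        if url in self.seen:
            return False
        self.seen.add(url)
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(url + "\n")
        return True


# A temporary directory keeps the example self-contained.
log_path = os.path.join(tempfile.mkdtemp(), "visited.txt")

first_run = VisitedStore(log_path)
first_run.add("http://example.com/")
first_run.add("http://example.com/a")

# Creating a new store on the same file simulates a restart after a crash.
second_run = VisitedStore(log_path)
resumed = sorted(second_run.seen)
```

Opening the file on every write is slow but keeps the log safe if the process dies between pages. A longer-running scraper would probably keep the file handle open or use a small database instead.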

Resource Consumption : The idea is to keep it simple. Do not download media files when they are not part of your output requirement. Make full use of the bandwidth, CPU, and memory that are available. Use threading or multiprocessing to achieve this: threads suit the network-bound downloading, while multiple processes help with CPU-heavy parsing.
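The two ideas here, skipping media and downloading in parallel, can be sketched together. The example assumes that a file extension is a good enough hint that a link points to media; the extension list is illustrative rather than exhaustive. `looks_like_media`, `fetch_all`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP request so the example runs offline.

```python
import posixpath
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

# Illustrative list only; extend it to match your own output requirement.
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".mp3", ".mp4", ".pdf"}


def looks_like_media(url):
    """Guess from the URL path whether the link points to a media file."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return extension in MEDIA_EXTENSIONS


def fetch_all(urls, fetch, max_workers=8):
    """Fetch the non-media URLs concurrently and map each URL to its result."""
    wanted = [url for url in urls if not looks_like_media(url)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input list.
        return dict(zip(wanted, pool.map(fetch, wanted)))


def fake_fetch(url):
    # Placeholder for a real download.
    return "<html>" + url + "</html>"


results = fetch_all(
    [
        "http://example.com/a",
        "http://example.com/logo.PNG",
        "http://example.com/b.html",
    ],
    fake_fetch,
)
```

An extension check costs nothing but can be fooled. Where accuracy matters, checking the `Content-Type` response header before reading the body is a more reliable filter.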

Clean scraped data : The internet is a wild place. Most of the time you receive more than you asked for, so you have to filter out what you do not need. For instance, while scraping for text you may receive PDF, music, image, and video files. Just discard them. Even the text itself can contain characters from multiple languages, which are usually not required. We use regular expressions to clean out the unwanted text.
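As a rough illustration of regular-expression cleaning, the sketch below assumes the wanted text is English and therefore drops every character outside printable ASCII. That rule is an assumption for the example only; a scraper targeting another language would keep a different character range. The `clean_text` function and the sample string are hypothetical.

```python
import re

# Assumption: the target text is English, so anything that is not
# printable ASCII (or a newline) is treated as unwanted.
UNWANTED = re.compile(r"[^\x20-\x7E\n]+")
EXTRA_SPACES = re.compile(r" +")


def clean_text(raw):
    """Replace unwanted characters with a space, then collapse repeated spaces."""
    text = UNWANTED.sub(" ", raw)
    text = EXTRA_SPACES.sub(" ", text)
    return text.strip()


# English text mixed with a Devanagari word and a CJK full stop.
sample = "Robots are fun \u0930\u094b\u092c\u094b\u091f  and useful\u3002"
cleaned = clean_text(sample)
```

Replacing unwanted runs with a space rather than deleting them keeps the neighbouring words from being glued together.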
