Behind the UI/UX of Real Time Data Dashboard

Open Knowledge Nepal


Mon Aug 05 2019

The speed at which information is delivered is crucial to citizens, government, and enterprises alike. Data should provide insight into day-to-day operations and lives, and help citizens make decisions about those operations and act accordingly. Real-time data adds the flexibility to analyze events as they happen and to trace them back to the source to generate valuable insights. Open Knowledge Nepal has been collecting, harvesting, and disseminating datasets in open formats via a central portal called Open Data Nepal, and we have realized that fresh, real-time data is among the most sought-after.

To get started, we prioritized the potential datasets based on demand and selected three sources: Air Quality Data, River Watch Data, and Fruits and Vegetables prices. The selected government websites display real-time information on the page but do not provide archival data in an open format, and there is no API endpoint to retrieve the datasets. So, we decided to develop a real-time data scraper to retrieve the data from the websites.

Real-time scraping is an automated process in which a software application watches a webpage at a fixed time interval, processes the HTML to extract the data, converts it to another format, and copies it into a local database for later retrieval or analysis.
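The loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual Open Data Nepal pipeline: the regex, the table schema, and the `pm25` column name are all assumptions made for the example.

```python
import re
import sqlite3

# Hypothetical sketch of the scraping step: pull values out of fetched
# HTML and copy them into a local database for later retrieval.

def extract_readings(html: str) -> list[tuple[str, float]]:
    """Extract (station, value) pairs from markup shaped like
    <td class="station">NAME</td><td class="pm25">VALUE</td>."""
    pattern = re.compile(
        r'<td class="station">([^<]+)</td>\s*<td class="pm25">([\d.]+)</td>'
    )
    return [(name, float(value)) for name, value in pattern.findall(html)]

def store(db: sqlite3.Connection, readings, read_time: float) -> None:
    """Copy one batch of readings into the local archive."""
    db.executemany(
        "INSERT INTO readings (station, value, read_time) VALUES (?, ?, ?)",
        [(name, value, read_time) for name, value in readings],
    )
    db.commit()
```

In the real crawler this extraction would run on every polling tick, with the read time recorded alongside each value.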

Challenges and hurdles faced during development

  • DOM /Ajax Based Websites 

By analyzing the web sources before carrying out the scraping, we found that two of the selected websites are DOM (Document Object Model) and AJAX-based. Their content is produced dynamically by the browser, and an ordinary crawling system is not able to see any content that is created dynamically.

  • Website Structure Change 

Small changes in website structure can halt the crawler, and we have seen that such changes are quite frequent on the web sources we chose for data archival. It is hard to predict when a government website will go down because of a design change or a server malfunction, and this directly affects data retrieval. A complete shutdown is the most challenging case: data from the shutdown period can be neither retrieved nor archived.
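One common way to soften the structure-change problem is to try several known extraction patterns and fail loudly when none matches, so a redesign is noticed instead of silently archiving garbage. The selectors below are invented for illustration; they are not the patterns used on the actual sites.

```python
import re

# Illustrative guard against small layout changes: try the current
# pattern first, fall back to an older known pattern, and raise a
# clear error if the page no longer matches anything we recognize.

PATTERNS = [
    re.compile(r'<span id="aqi-value">([\d.]+)</span>'),   # current layout
    re.compile(r'<div class="aqi">\s*([\d.]+)\s*</div>'),  # previous layout
]

def extract_value(html: str) -> float:
    for pattern in PATTERNS:
        match = pattern.search(html)
        if match:
            return float(match.group(1))
    raise ValueError("page structure changed: no known pattern matched")
```

Raising an explicit error here lets the crawler alert its operators rather than storing empty or wrong values after a redesign.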

  • Scrape Latency  

Many factors can produce high latency during crawling: slow page loads, a slow network connection, slow machine performance, and the extraction and cleaning of large amounts of data all cause delays during scraping.
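To find out which of those factors dominates, each stage of the pipeline can be timed individually. A simple sketch (the `clean_rows` stage is a stand-in for any real step):

```python
import time
from functools import wraps

# Timing decorator to locate where scrape latency comes from
# (page load vs. extraction vs. cleaning). Stage names are illustrative.

def timed(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        result = fn(*args, **kwargs)
        elapsed = time.monotonic() - start
        print(f"{fn.__name__} took {elapsed:.3f}s")
        return result
    return wrapper

@timed
def clean_rows(rows):
    """Example pipeline stage: strip whitespace from scraped cells."""
    return [r.strip() for r in rows]
```

Wrapping each stage this way makes it obvious whether the delay sits in the network, the browser, or the data cleaning.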

  • Getting Blacklisted / Blocked

Our goal is to scrape real-time data, and to do that we need to send frequent requests to the websites. Requesting too fast puts unnecessary load on the web server, which directly affects its performance. To prevent this kind of activity, web servers are often configured to block a client IP if too many requests arrive from the same IP within a short period of time.
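A polite crawler can reduce the risk of being blocked by enforcing a minimum gap between its own requests. A minimal client-side rate limiter, as an illustration of the idea:

```python
import time

# Minimal client-side rate limiter: guarantee at least `min_interval`
# seconds between consecutive requests to the same host, keeping load
# on the server low and reducing the chance of an IP block.

class RateLimiter:
    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return the
        number of seconds actually slept."""
        now = time.monotonic()
        delay = max(0.0, self._last + self.min_interval - now)
        if delay:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay
```

Calling `wait()` before every request spaces them out evenly, which is also what the fixed 60-second polling interval described later achieves.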


Tools used to crawl data from the website sources

  • Python – Selenium

Selenium is a web browser automation tool, used primarily for automating and testing web applications. The websites we intended to scrape are AJAX- and DOM-based: they need to execute JavaScript to produce their content in the browser dynamically. To handle that, we used Selenium to automate the browser and run the websites headlessly.

  • Firefox web driver

As explained above, we use Selenium to automate the browser, so we need a separate browser component, called a Selenium WebDriver, to process and perform headless automation actions in the web browser. We used the Gecko WebDriver (geckodriver) for Firefox.

  • Distributed Crawler System

We have distributed the crawler across two different machines, which reduces crawler latency and also mitigates the IP blacklisting issue. Both crawler systems request the data source servers synchronously at a 60-second interval. For one of the sources, scraped data is timestamped with the crawler's read time rather than the station's exact update time, so timestamps may vary by 60 to 120 seconds. The other sources label their data with the exact update date and time on the website itself; although the read time differs, we map each record to the actual date and time shown on the original website.
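The 60-second schedule can be kept aligned across both machines by sleeping until the next interval boundary rather than sleeping a fixed amount after each crawl. A small sketch of that scheduling calculation (the interval value matches the deployment described above; the function name is ours):

```python
import time

INTERVAL = 60  # seconds between requests, as in the deployed crawlers

def seconds_until_next_tick(now: float, interval: int = INTERVAL) -> float:
    """How long to sleep so the next request lands on an interval
    boundary. Aligning to boundaries keeps two independent crawlers
    roughly synchronous without any coordination between them."""
    return interval - (now % interval)
```

For example, a crawler waking at 130 seconds past the epoch would sleep 50 seconds and fire at 180, the same boundary the other machine targets.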