Saturday 27 January 2018

StockReader: Respecting Robots.txt and Initial Scraping

Robots have rules to follow, or to break if they are nasty. Those rules are defined in the robots.txt file at the root of the web site in question.

I'll start by creating a stub Python script that accesses a web page and prints it to the console.

First, I need to install a module: requests. To do that, I need a Python package manager, pip:
Download get-pip.py from https://bootstrap.pypa.io/ and run it with Python.

To install a package:
Navigate to the folder where pip.exe is located.
Run pip install requests

Now requests can be imported and used.
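
As a rough idea of what the stub script could look like, here is a minimal sketch. The function name fetch_page and the 10 second timeout are my own choices for illustration, not something requests prescribes.

```python
import requests


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its body as text.

    fetch_page is a hypothetical helper name used only in this sketch.
    """
    # A timeout keeps the script from hanging on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    return response.text
```

Calling print(fetch_page("https://example.com/")) would then write the page source to the console; example.com is only a placeholder address here.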

In my case, I'll use DI.se. First I'll analyze the site's robots.txt file. If that file allows it, I'll download the stock list and analyze the stocks in the list.
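
The robots.txt check itself can be sketched with the standard library module urllib.robotparser. The rules in SAMPLE_ROBOTS_TXT, the user agent name "StockReader" and the example.com URLs are invented for illustration; they are not DI.se's actual rules, and is_allowed is just a helper name I made up.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration only; a real run would use the text of
# the site's own robots.txt file instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a sequence of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "StockReader" is a hypothetical user agent name for this project.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "StockReader", "https://example.com/stocks/"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "StockReader", "https://example.com/private/data"))
```

With these sample rules the first URL is allowed and the second is not. Keeping the parsing separate from the download makes it possible to try the check against saved robots.txt text before touching the real site.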
