Saturday 25 July 2020

StockReader: Handling Broken Scaping

I've seen that the web scraper StockReader script sometimes halts due to various reasons. For example. it can be timeouts from the server.
The connection error is seen occasionally.

I'll handle this by running the script again and again until I get the data I need, while skipping the records that I've already collected.

The first step is to make StockReader check for existing files:

The second step is to acquire the stock names along with the stock ID's for the source web page. Using the names, I can tell whether the particular stock has a record or not. I did that using regular expressions in the script.



The third step is to open the file in read mode (if it exists), and store the contents of that file. After that, I'll close the file.

The fourth step is to open the file in append mode and iterate over all stock names, and fetch the stock data for the missing stock records.

The program works best if I run it some five times. Then it is likely that I'll catch most, if not all stock records for that day.

Issue:
For long stock names, the name is truncated in the web page that lists all stocks:
From the list of active stocks: A3 Allmänna IT- och..
From the in stock info web page: A3 Allmänna IT- och Telekomaktiebolaget
I resolved that issue by adding a second list of truncated stock names. If a stock was found in the truncated stock name list, I iterated over all stock names that already has a stock record. If one of those matched the stock name, the program didn't query that stock information.

No comments:

Post a Comment