Cute Trains and Wide Dreams: July 2020

Saturday, 25 July 2020

StockReader: Handling Broken Scraping

I've seen that the StockReader web scraper script sometimes halts for various reasons. For example, requests to the server can time out.

A connection error also occurs occasionally.

I'll handle this by running the script again and again until I get the data I need, while skipping the records that I've already collected.

The first step is to make StockReader check for existing files:
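A minimal sketch of such a check, assuming the daily files are named by date in the same style as the 20170509.csv files used by StockAnalyzer. The function names and the data directory parameter are hypothetical, not taken from the actual script.

```python
import os
from datetime import date


def record_file_for(day: date, data_dir: str = ".") -> str:
    """Build the path of the daily record file, e.g. 20200725.csv."""
    return os.path.join(data_dir, day.strftime("%Y%m%d") + ".csv")


def has_existing_records(path: str) -> bool:
    """True if an earlier run already left a non-empty file behind."""
    return os.path.isfile(path) and os.path.getsize(path) > 0
```

An empty file is treated the same as a missing one, so a run that crashed before writing anything does not count as progress.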

The second step is to acquire the stock names along with the stock IDs from the source web page. Using the names, I can tell whether a particular stock already has a record or not. I did that using regular expressions in the script.
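The extraction could look roughly like the sketch below. The markup it matches (a link of the form /stock/<id> wrapping the name) is purely an assumption for illustration; the real page has its own structure, and the pattern would have to be adapted to it.

```python
import re

# Hypothetical markup: <a href="/stock/1234">Eniro</a>. The real page differs.
STOCK_LINK = re.compile(r'<a href="/stock/(\d+)">([^<]+)</a>')


def extract_stocks(html):
    """Map stock name -> stock ID for every matching link in the page."""
    return {name.strip(): stock_id for stock_id, name in STOCK_LINK.findall(html)}
```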

The third step is to open the file in read mode (if it exists), and store the contents of that file. After that, I'll close the file.

The fourth step is to open the file in append mode and iterate over all stock names, and fetch the stock data for the missing stock records.
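Steps three and four can be sketched as follows. This assumes that each record is one CSV line starting with the stock name, and that `fetch` is a hypothetical callable returning such a line for a stock ID. Unlike a script that simply halts, this sketch notes the failure and moves on; running it again picks up whatever is still missing.

```python
import os


def recorded_names(path):
    """Step three: read the file (if it exists) and collect recorded names."""
    if not os.path.isfile(path):
        return set()
    with open(path, encoding="utf-8") as f:  # closed automatically on exit
        # Assumption: the stock name is the first comma-separated field.
        return {line.split(",", 1)[0] for line in f if line.strip()}


def append_missing(path, stocks, fetch):
    """Step four: append a record for every stock not yet in the file.

    `stocks` maps name -> ID. `fetch(stock_id)` is a hypothetical callable
    that returns one CSV line (without newline) or raises on failure.
    Returns the names that could not be fetched in this run.
    """
    done = recorded_names(path)
    failed = []
    with open(path, "a", encoding="utf-8") as f:
        for name, stock_id in stocks.items():
            if name in done:
                continue  # already collected in an earlier run
            try:
                f.write(fetch(stock_id) + "\n")
            except OSError:  # covers timeouts and connection errors
                failed.append(name)
    return failed
```

Calling `append_missing` in a loop until it returns an empty list would replace the manual reruns, though a cap on the number of attempts is advisable.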

The program works best if I run it about five times. After that, I have likely caught most, if not all, stock records for that day.

Issue:
For long stock names, the name is truncated on the web page that lists all stocks:
From the list of active stocks: A3 Allmänna IT- och..
From the stock info web page: A3 Allmänna IT- och Telekomaktiebolaget
I resolved that issue by adding a second list of truncated stock names. If a stock was found in the truncated stock name list, I iterated over all stock names that already had a stock record. If one of those matched the stock name, the program didn't query that stock's information.
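The matching can be sketched as a prefix comparison. This version assumes that a truncated name always ends in ".." as in the example above, and it detects truncation from that suffix rather than keeping a separate list; the function name is hypothetical.

```python
def is_recorded(listed_name, recorded):
    """Check a possibly truncated listed name against recorded full names."""
    if listed_name in recorded:
        return True
    if listed_name.endswith(".."):
        # Drop the ".." marker and compare what is left as a prefix.
        prefix = listed_name[:-2].rstrip()
        return any(full.startswith(prefix) for full in recorded)
    return False
```

For example, `is_recorded("A3 Allmänna IT- och..", {"A3 Allmänna IT- och Telekomaktiebolaget"})` is true. One caveat: two stocks sharing the same truncated prefix would be indistinguishable with this approach.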

Saturday, 11 July 2020

StockAnalyzer: Chasing errors

The work on StockAnalyzer will focus on two tracks in parallel:

Fixing existing data to fit into the database (the scope of this blog post)
Visualizing the data that is in the database

I see errors when the program tries to parse values for insertion into the database. For example:

2017-06-05 - 2017-08-14: Unable to parse '2017-08-15'.

2017-08-19 - 2017-11-01: Unable to parse '2017-11-07'

2017-11-06 and later: Unable to parse '2018-02-22'

The issue is that the program tries to parse a date as an integer, because the date is in the wrong position in the file.
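The failure is easy to reproduce in isolation. The sketch below assumes comma-separated records; the separator and the helper names are assumptions, and the column layout of the real files is not shown here. It reports where date-like fields sit in a line, which helps spot records whose date has shifted into a numeric column.

```python
import re

DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def parses_as_int(value):
    """True if int() accepts the value, False if it raises ValueError."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def date_positions(line, sep=","):
    """Return the indexes of all fields that look like YYYY-MM-DD dates."""
    fields = line.split(sep)
    return [i for i, field in enumerate(fields) if DATE.match(field.strip())]
```

Here `parses_as_int("2017-08-15")` is false, which is the situation behind the "Unable to parse" messages.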

The first step is to check which stocks might have the faulty data. I can easily do that using grep in my WSL2 environment on my Windows 10 computer.

The faulty data for June contains the string "2017-08-15".
This string is found in the following stocks:
Eniro, Concordia Maritime, Clavister Holding, MindMancer and Tethys Oil

I imported the different stock records into a spreadsheet to see which of the stocks are faulty.

Tethys Oil seems to have some missing data.

I did some maths on the remaining data to check if it is possible to recover the missing bits, but it wasn't. I'll simply remove the records for Tethys Oil for the missing period of time. A new grep instance gives:

This means that the stock records for the company Tethys Oil are corrupted between May 9th, 2017 and November 9th, 2017. Now, I want to remove those lines for the files ranging from 20170509.csv up to 20171109.csv.

For a given file, it is easy to remove lines containing "Tethys Oil":

sed '/Tethys Oil/d' -i file.txt

To identify the files in the desired date span, I tried to find a way to do lexicographic comparisons of the file names in the bash shell. It turned out not to be trivial, so I created a Python script instead.

The output is a list of commands that remove all lines containing Tethys Oil from the files in the interval. After pasting them into bash, the erroneous lines have been removed from the files.
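A script along these lines could look like the sketch below. It is not the original script; the function name and defaults are assumptions. It relies on the fact that file names of the form YYYYMMDD.csv sort chronologically under plain string comparison.

```python
import os
import shlex


def sed_commands(filenames, first="20170509.csv", last="20171109.csv",
                 pattern="Tethys Oil"):
    """Return one sed command per .csv file whose name is in [first, last]."""
    # String comparison is enough because the names are zero-padded dates.
    selected = sorted(
        n for n in filenames if n.endswith(".csv") and first <= n <= last
    )
    expr = shlex.quote("/" + pattern + "/d")  # quoted for pasting into bash
    return [f"sed -i {expr} {shlex.quote(n)}" for n in selected]


if __name__ == "__main__":
    # Print the commands for the current directory instead of running them,
    # so they can be reviewed before being pasted into the shell.
    for cmd in sed_commands(os.listdir(".")):
        print(cmd)
```

Printing the commands rather than executing them keeps the destructive step manual, which matches the paste-into-bash workflow described above.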