Saturday 27 July 2019

StockAnalyzer: Lessons Learned from Web Scraping

After spending too many hours fixing flawed data, I've learnt some lessons:

Use English when describing the data
Using another language is a common mistake among non-English natives, especially when one believes that the data will have a very limited audience. Applications have a tendency to grow and involve more people.

Use standardized formats for input data
I've seen some strange data files for other applications, and they are a nightmare to decode. Unless there are strong reasons to save storage space, I recommend plain text formats such as the ones below (a small sketch follows the list):

  • csv - Compact. However, a missing separator will shift all following values to the left, which can be difficult to recover from.
  • json - Quite compact, but slightly harder to implement. On the other hand, it is clear what the different values represent.
  • xml - This generates some overhead and may be difficult to implement, unless there is a handy xml class that can handle the encoding.
  • Database - Feeding the scraped data directly into a database. This is a bit more complex but has some advantages: errors show up immediately, the data labeling is clear and the storage is compact. One disadvantage is that it takes some time to set up the database.
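A minimal Python sketch of the difference between the first two formats (the field names here are made up for illustration, not the real StockAnalyzer format):

    import csv
    import json

    record = {"date": "2018-06-13", "ticker": "AAK", "price": 151.5, "earnings": 6.7}

    # csv: compact, but the meaning of each column lives outside the file
    with open("stocks.csv", "a", newline="") as f:
        csv.writer(f).writerow([record["date"], record["ticker"], record["price"], record["earnings"]])

    # json (one object per line): larger, but every value is labeled
    with open("stocks.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")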

Handle bad or missing data directly
If there are flaws in the web scraper or the web server, the data will be flawed. This becomes a big problem if the scraping continues unchecked: big chunks of data may end up missing or erroneous.

Test the data and the format
This helps catch flawed data early. Tests may check the number of data points, the type of each value (integer/float/name) and how the data changes over time.
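A Python sketch of such a test (the field count and column index are assumptions, not the real format):

    def check_record(fields, expected_count=10):
        # Right number of data points?
        if len(fields) != expected_count:
            return False
        # Does the price column actually parse as a number? (index 2 is made up)
        try:
            float(fields[2])
        except ValueError:
            return False
        return True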

Cross-verify numbers
It can be useful to scrape some extra data for cross-checks. For example, scraping Price, Earnings and Price/Earnings will make it easy to check that those values are consistent with each other.
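A sketch of that check in Python (the 5 % tolerance is an arbitrary choice):

    def cross_check(price, earnings, pe, tolerance=0.05):
        # The scraped P/E should match Price divided by Earnings
        if earnings == 0 or pe == 0:
            return False
        return abs(price / earnings - pe) <= tolerance * abs(pe)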

Handle changes in the data
This is quite obvious. The web services that you scrape will change their interfaces once in a while, and that will break your web scraper. As mentioned above, you need to notice that immediately; otherwise you'll have tons of data that needs to be fixed.

Automate the web scraping
After all, the purpose of computers is to liberate people from manual labor. This applies to web scraping too. If the scraping isn't started automatically, it is easy to forget.

It is easy to schedule the web scraping script using cron (Linux) or the Task Scheduler (Windows). I strongly recommend checking that the tasks really start as they should; in my experience, at least the Task Scheduler isn't totally reliable.
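On Linux, a crontab entry that runs the scraper every weekday evening could look like this (the paths are just illustrations):

    # min hour day month weekday  command
    30   18   *    *     1-5     /usr/bin/python3 /home/user/stockanalyzer/scrape.py >> /home/user/scrape.log 2>&1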

Make the data database-friendly
It is very likely that you will send the data to a database sooner or later. Considering how to do that early will save you a lot of effort.

Side note:
One very common issue is missing yield values. That can span several months of data. I've created a Python script that resolves the issue by inserting an empty field for each missing piece of information, roughly as sketched below.
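The core of that script looks roughly like this (a sketch; the separator, field count and yield position are assumptions):

    def pad_missing_yield(line, expected_fields=10, yield_index=7, sep=";"):
        # A record missing the yield is one field short; insert an empty
        # field so the parser stores NULL instead of shifting the data.
        fields = line.rstrip("\n").split(sep)
        if len(fields) == expected_fields - 1:
            fields.insert(yield_index, "")
        return sep.join(fields)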

Saturday 20 July 2019

StockAnalyzer: Fixing Flawed Data (continued)

My work on the program StockAnalyzer is currently in an iterative cycle:
  • The program adds records to the database until it crashes because of some flaw in the input data
  • Fix that issue 
  • Repeat
This means that most of the blog posts for now focus on handling new cases of flawed input data. Please bear with me.

Each record is a line of comma-separated values. For some records, one value is missing. By looking at the remaining data, I can see what data is missing and replace it manually.

If I add a semicolon before the profit margin (in this case), the program is able to populate the missing data with a null value.
Fixing hundreds or thousands of records by hand isn't feasible, so I'll need a script that can re-populate the missing data. I'll develop that script in the next blog post.

Saturday 13 July 2019

StockAnalyzer: Recovering Missing Data

In this blog post, I'll add code that protects the program when parsing strings to float values and I'll also add code that tries to recover missing information.

First, I'll add try-catch clauses to handle parsing errors from strings to floats.
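In a Python sketch, the guard looks like this:

    def parse_float(text):
        # Return None instead of crashing when the string isn't a valid number
        try:
            return float(text)
        except (ValueError, TypeError):
            return None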

After that, I'll add functions to verify and populate null entries of the key numbers.

For the triplet of P, P/E and E, there are eight possible scenarios:
  • P = null, P/E = null and E = null - Impossible to recover
  • P = null, P/E = null and E != null - Impossible to recover
  • P = null, P/E != null and E = null - Impossible to recover
  • P = null, P/E != null and E != null - Recover P
  • P != null, P/E = null and E = null - Impossible to recover
  • P != null, P/E = null and E != null - Recover P/E
  • P != null, P/E != null and E = null - Recover E
  • P != null, P/E != null and E != null - No need to recover

I'll start by counting how many null values I have. If there are two or more null values, I can't recover any data. If there are no null values at all, I don't need to recover anything.
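A Python sketch of that logic, with None playing the role of null and a flag telling whether anything was recovered:

    def recover_triplet(p, pe, e):
        nulls = [p, pe, e].count(None)
        # Two or more nulls: impossible. Zero nulls: nothing to do.
        if nulls != 1:
            return p, pe, e, False
        if p is None:
            return pe * e, pe, e, True      # P = (P/E) * E
        if pe is None and e != 0:
            return p, p / e, e, True        # P/E = P / E
        if e is None and pe != 0:
            return p, pe, p / pe, True      # E = P / (P/E)
        return p, pe, e, False              # division by zero: give up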

I will add two boolean flags later: one will indicate whether any data in the stock record has been recovered, and another will indicate whether the data is so bad that it shouldn't be sent to the database.
The screenshot below illustrates how the earnings value is recovered (example from 2018-06-13, AAK, a Swedish company that produces vegetable oils and fats).


Similar functions are added for the Price, Price per Capital (P/JEK) and the Capital per share.

Now, only the profit margin is missing. Since that is calculated as the profit divided by the total revenue (which is not recorded), I can't recover that value. I'll simply pass NULL to the database if it is missing.



After fixing this, I was able to scan stocks until 2019-02-15. For that date, most of the information is missing.

The fix here is to add a flag that indicates whether the stock record contains sufficient information to be added to the database. 
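A sketch of such a flag (which fields count as essential is an assumption):

    def has_sufficient_data(record):
        # Reject records where essential fields are still null after recovery
        essential = ("price", "earnings", "capital_per_share")
        return all(record.get(key) is not None for key in essential)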

Side note: The board of AAK decided to perform a 6:1 share split, where one old share generated six new shares. This explains why some data was missing for that date. The next project will detect and handle share splits.
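One possible way to detect a split, not necessarily what the project will end up using, is to flag records where the price drops by a near-integer factor from one day to the next:

    def looks_like_split(prev_price, price, max_factor=10):
        # A 6:1 split makes the price drop to roughly one sixth overnight
        if not prev_price or not price:
            return False
        ratio = prev_price / price
        return any(abs(ratio - n) < 0.1 for n in range(2, max_factor + 1))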



Saturday 6 July 2019

StockAnalyzer: Validating the Raw Data - Broken input data and Missing Report date

My Stock project is progressing, and I'm currently adding fixes for flaws in the input data before it goes into a SQL database. In my previous blog post, I added a fix for records that have no information about the stock price.

In this blog post, I address some more issues:
Input Data Replaced by HTML Tag
The program tries to parse the string 'ass="BrodTextWhite" width="12">?' to a float number. I don't know why this HTML fragment has replaced the data.

I will handle this by simply removing the bad record. I haven't seen this for other stocks, and the string contains no information from which the missing data could be recovered.

Invalid Data for Dividend or Report Dates
Another issue I saw was that some records had an invalid date for the report. I solved that by checking whether the string can be parsed to a date; if it can't, I feed a NULL value to the database.
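The check itself is a strptime call inside a try/except (a Python sketch; the date format is an assumption):

    from datetime import datetime

    def parse_report_date(text):
        # Return None (stored as NULL) when the string isn't a valid date
        try:
            return datetime.strptime(text, "%Y-%m-%d").date()
        except (ValueError, TypeError):
            return None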

Now, the database has a NULL value when the date is missing.


Empty Profit Margin
Now, I was able to scan records for three stocks covering almost five years (2013-07-28 to 2018-06-13). For AAK, the program is trying to parse an empty string to a value.

There are some empty values: Earnings per share, Capital per share, Dividend per share and Profit Margin. I can recover everything except the Profit Margin, which requires information about sales in relation to production costs. Since the record is still valuable, I will add the stock information but set the profit margin to NULL.

Populating Missing Data?
I will create functions that fill in missing data whenever related values make it possible to calculate it.

For the AAK record of 2018-06-13, the Earnings per share is missing.

I will handle this in two steps (next blog post):
First, I'll add try-catch clauses to handle parsing errors from strings to floats.
After that, I'll add functions to verify and populate null entries of the key numbers.