Saturday 29 August 2020

StockAnalyzer: Identifying Stock Splits

I've spent the last few weeks handling missing data and other inconsistencies in the raw data files that I've collected through web scraping over the years. Data that I am confident I can restore is restored (for example earnings per share, a number that only changes every third month). The rest I delete if I can't recover it.

Being able to use some Linux commands (sed, grep) in my Windows environment has saved me hours of manual work.

I am now able to get more than 250 000 stock records into the database, so I have enough to start running some simple scripts for analyzing the data.

The features X are the existing data, such as key figures and time to the next report. The target y will be the daily increase in the stock price over one week.

Identify Splits:
The first task is to identify stock splits. If I don't take splits into account, the y values will be skewed and mess up my machine learning: if the stock price falls because of a split, I may train the system on a price drop that never happened.
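
To show how much a split can distort y, here is a small sketch with made-up prices around a 3:1 split. The function name, the prices and the split factor are all assumptions for illustration, not values from my database.

```python
def simple_return(old_price, new_price):
    """Relative price change from old_price to new_price."""
    return (new_price - old_price) / old_price


# Hypothetical prices around a 3:1 split: the company is worth the same,
# but each old share has become three new shares.
price_before = 300.0
price_after = 100.0
split_factor = 3  # new shares per old share (assumed known here)

# Without any adjustment the split looks like a crash.
naive = simple_return(price_before, price_after)

# Scaling the post-split price back to pre-split terms removes the false drop.
adjusted = simple_return(price_before, price_after * split_factor)

print(f"naive y:    {naive:.1%}")
print(f"adjusted y: {adjusted:.1%}")
```

The naive value is a drop of roughly two thirds, while the adjusted value is zero, which is what actually happened to the value of the holding.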

I'll compare two cases to illustrate the difference between a split and an actual price drop:

Both Stock A and Stock B are originally valued at $100 per share. The earnings per share are $10 and the capital per share is $40. Thus, the P/E is 10 and the P/C is 2.5.

After a 2:1 stock split, the new shares A' are valued at $50 per share. The earnings per share are now only $5, and the capital per share is $20. The P/E is still 10 and the P/C is still 2.5.

Stock B has a price drop to $50. The earnings per share are still $10 and the capital per share is still $40. The P/E is now 5 and the P/C is now 1.25.

The Troax stock had a 3:1 split (one share became three) on 2019-06-18.



After a stock split, I expect the P/E and P/C to be roughly the same. So if there is a significant change in stock price without corresponding changes in the P/E and P/C values, but with changes in the E and C values, I have likely found a split.
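
The rule above can be sketched in a few lines of Python, using the Stock A and Stock B numbers from the example. This is a minimal sketch, not my actual script: the Snapshot class, the function names and both threshold values are assumptions, and it assumes that E and C are non-zero.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    price: float  # share price
    eps: float    # earnings per share (E), assumed non-zero
    cps: float    # capital per share (C), assumed non-zero

    @property
    def pe(self):
        return self.price / self.eps

    @property
    def pc(self):
        return self.price / self.cps


def rel_change(old, new):
    """Absolute relative change from old to new."""
    return abs(new - old) / abs(old)


def looks_like_split(before, after, price_threshold=0.3, ratio_tolerance=0.1):
    """True if the price moved a lot while P/E and P/C stayed roughly
    the same and E and C themselves changed. Thresholds are assumed."""
    price_moved = rel_change(before.price, after.price) > price_threshold
    ratios_stable = (
        rel_change(before.pe, after.pe) < ratio_tolerance
        and rel_change(before.pc, after.pc) < ratio_tolerance
    )
    per_share_moved = (
        rel_change(before.eps, after.eps) > ratio_tolerance
        and rel_change(before.cps, after.cps) > ratio_tolerance
    )
    return price_moved and ratios_stable and per_share_moved


original = Snapshot(price=100, eps=10, cps=40)
stock_a_after = Snapshot(price=50, eps=5, cps=20)    # split
stock_b_after = Snapshot(price=50, eps=10, cps=40)   # real price drop

print(looks_like_split(original, stock_a_after))  # True
print(looks_like_split(original, stock_b_after))  # False
```

Stock A passes all three checks, while Stock B fails because its P/E goes from 10 to 5.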

The Code:
The script starts by iterating over all unique stock names in the database.

For each stock, I query all stock records and check whether the price has changed by more than a predefined threshold.

If it has changed, I check whether the capital and earnings per share have changed as well. If they have, I record the date and stock name in a list for later use. It can be hard to make an accurate estimate of the split ratio, so I don't save that for now. The splits are saved to a comma-separated file.
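
The steps above could look roughly like the sketch below. It is not the real script: a list of dictionaries stands in for the database query, the field names (name, date, price, eps, cps) and the threshold are assumptions, and for simplicity the same threshold is used for the price and for the per-share values.

```python
import csv
import io
from collections import defaultdict

PRICE_THRESHOLD = 0.3  # assumed value, not the one used in the real script


def changed(old, new, threshold):
    """True if the relative change from old to new exceeds threshold."""
    return old != 0 and abs(new - old) / abs(old) > threshold


def find_split_candidates(records, threshold=PRICE_THRESHOLD):
    """Return (date, name) pairs where a split is suspected."""
    by_stock = defaultdict(list)
    for rec in records:
        by_stock[rec["name"]].append(rec)

    candidates = []
    for name in sorted(by_stock):
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        history = sorted(by_stock[name], key=lambda r: r["date"])
        for prev, curr in zip(history, history[1:]):
            if not changed(prev["price"], curr["price"], threshold):
                continue
            if (changed(prev["eps"], curr["eps"], threshold)
                    and changed(prev["cps"], curr["cps"], threshold)):
                candidates.append((curr["date"], name))
    return candidates


def write_candidates(candidates, out_file):
    """Write the suspected splits as comma-separated rows."""
    writer = csv.writer(out_file)
    writer.writerow(["date", "name"])
    writer.writerows(candidates)


# Made-up records: A is split, B has a real price drop.
records = [
    {"name": "A", "date": "2020-01-02", "price": 100.0, "eps": 10.0, "cps": 40.0},
    {"name": "A", "date": "2020-01-03", "price": 50.0, "eps": 5.0, "cps": 20.0},
    {"name": "B", "date": "2020-01-02", "price": 100.0, "eps": 10.0, "cps": 40.0},
    {"name": "B", "date": "2020-01-03", "price": 50.0, "eps": 10.0, "cps": 40.0},
]

candidates = find_split_candidates(records)
buffer = io.StringIO()  # stands in for the output file
write_candidates(candidates, buffer)
print(buffer.getvalue())
```

Only stock A ends up in the output, since the earnings and capital per share of stock B did not change along with its price.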

To verify that the splits are OK, I manually looked up several of them at Skatteverket (the Swedish Tax Agency). As expected, the reported splits were correct.

Now the pre-processing of the data is done. It took much more time than I had assumed, mainly because I didn't check the data when scraping.