Being able to use some Linux commands (sed, grep) in my Windows environment has saved me hours of manual work.
I now have more than 250,000 stock records in the database, which is enough to start running some simple scripts for analyzing the data.
The features X are the existing data, such as key figures and the time to the next report. The results y will be the daily increase in the stock price over one week.
Identify Splits:
The first task is to identify stock splits. If I don't take splits into account, the y values will be skewed and mess up my machine learning (if the stock price falls because of a split, I may train the system on a price drop that never happened).
I'll compare two cases for a stock to illustrate the difference between a split and an actual price drop for a stock:
Both Stock A and Stock B are originally valued at $100 per share. The earnings per share are $10 and the capital per share is $40. Thus, the P/E is 10 and the P/C is 2.5.
After a stock split, the new shares A' are valued at $50 per share. The earnings per share are now only $5, and the capital per share is $20. The P/E is still 10 and the P/C is still 2.5.
Stock B has a price drop to $50. The earnings per share are still $10 and the capital per share is still $40. The P/E is now 5 and the P/C is now 1.25.
The Troax stock had a 3:1 split (one share became three) on 2019-06-18.
After a stock split, I expect the P/E and P/C to be roughly the same. So if there is a significant change in stock price without corresponding changes in the P/E and P/C values, but with changes in the E and C values, I have likely found a split.
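The rule above can be sketched as a small check. This is only an illustration, not the actual script: the function names, the dictionary keys and both thresholds (30% for the price move, 10% tolerance on the ratios) are my own assumptions, and it assumes earnings and capital per share are non-zero.

```python
def rel_change(old, new):
    """Relative change from old to new, e.g. 100 -> 50 gives -0.5."""
    return (new - old) / old


def looks_like_split(before, after, price_threshold=0.3, ratio_tolerance=0.1):
    """Return True if a large price move left P/E and P/C roughly unchanged.

    before/after are dicts with 'price', 'eps' (earnings per share)
    and 'cps' (capital per share). Thresholds are assumed values.
    """
    price_move = rel_change(before["price"], after["price"])
    if abs(price_move) < price_threshold:
        return False  # price did not move enough to be interesting

    pe_move = rel_change(before["price"] / before["eps"],
                         after["price"] / after["eps"])
    pc_move = rel_change(before["price"] / before["cps"],
                         after["price"] / after["cps"])

    # If the price moved a lot while P/E and P/C stayed put,
    # E and C must have moved along with the price: a likely split.
    return abs(pe_move) <= ratio_tolerance and abs(pc_move) <= ratio_tolerance


original = {"price": 100, "eps": 10, "cps": 40}
stock_a_after = {"price": 50, "eps": 5, "cps": 20}    # the split case
stock_b_after = {"price": 50, "eps": 10, "cps": 40}   # the real price drop

print(looks_like_split(original, stock_a_after))
print(looks_like_split(original, stock_b_after))
```

With the numbers from the Stock A and Stock B example, only the first case is flagged as a split.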
The Code:
The script starts by iterating over all unique stock names in the database.
For each stock, I query all stock records and I check whether the price has changed more than a predefined threshold.
If it has changed, I check whether the capital and earnings per share have changed. If they have, I record the date and stock name in a list for later use (it's hard to estimate the split ratio, so I don't save that for now). The splits are saved to a comma-separated file.
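A minimal sketch of these steps could look like this. It is not the real script: the SQLite in-memory database, the table name stock_records, its columns (name, date, price, eps, cps), the two thresholds and the sample rows are all assumptions made up for the illustration, and it assumes no stored value is zero.

```python
import csv
import io
import sqlite3

PRICE_THRESHOLD = 0.3  # assumed: flag price moves larger than 30%
VALUE_THRESHOLD = 0.1  # assumed: E and C count as "changed" above 10%


def changed(old, new, threshold):
    """True if the relative change from old to new exceeds the threshold."""
    return abs(new - old) / abs(old) > threshold


def find_splits(conn):
    """Return (date, name) pairs where a split is suspected."""
    splits = []
    names = [row[0] for row in conn.execute(
        "SELECT DISTINCT name FROM stock_records ORDER BY name")]
    for name in names:
        rows = conn.execute(
            "SELECT date, price, eps, cps FROM stock_records "
            "WHERE name = ? ORDER BY date", (name,)).fetchall()
        # Compare each record with the previous one for the same stock.
        for prev, cur in zip(rows, rows[1:]):
            if not changed(prev[1], cur[1], PRICE_THRESHOLD):
                continue
            if (changed(prev[2], cur[2], VALUE_THRESHOLD)
                    and changed(prev[3], cur[3], VALUE_THRESHOLD)):
                splits.append((cur[0], name))
    return splits


def write_splits(splits, out):
    """Write the suspected splits as comma-separated rows."""
    writer = csv.writer(out)
    writer.writerow(["date", "name"])
    writer.writerows(splits)


# Made-up sample data: A splits 2:1, B has a real price drop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_records "
             "(name TEXT, date TEXT, price REAL, eps REAL, cps REAL)")
conn.executemany("INSERT INTO stock_records VALUES (?, ?, ?, ?, ?)", [
    ("A", "2020-01-02", 100, 10, 40),
    ("A", "2020-01-03", 50, 5, 20),
    ("B", "2020-01-02", 100, 10, 40),
    ("B", "2020-01-03", 50, 10, 40),
])

splits = find_splits(conn)
buffer = io.StringIO()  # stands in for the output file
write_splits(splits, buffer)
print(buffer.getvalue())
```

In a real run, the StringIO buffer would be replaced by a file opened with newline="" so the csv module controls the line endings.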
Sometimes, it can be hard to make an accurate estimate of the split ratio.
To verify that the splits are OK, I manually looked up several of them with Skatteverket (the Swedish Tax Agency). As expected, the reported splits were correct.
Now the pre-processing of the data is done. It took much more time than I had assumed, mainly because I didn't check the data while scraping.