Saturday, 12 September 2020

StockPredictor: Selecting Algorithm and Some Design Considerations

This is the third part of the stock project, where I will use the validated data to train a machine learning system to predict the performance of a stock based on the data that was available at the date of the stock record.

The first step is to make some design considerations for the algorithm. The first version of the program will look only at the current stock record for features (X), and the result (y) will be the daily change of the stock price.

If I set the prediction window to one week, one example of features and results would be:


The results from a query in the database will look like this:

The X data can consist of the values of P, P/E, P/C, Yield, PMI, RSI and the time to/since the last dividend. The y data for the stock record of 2018-10-31 will be the change (%) over the prediction window divided by the number of days in the window.
The daily increase of ABB's stock price was 0.054% during one week in November 2018.
It would have been more mathematically correct to take the seventh root of the quotient (the future price divided by the current price), but this time I'll prefer simplicity.
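To make the difference between the two calculations concrete, here is a minimal sketch that compares the simple average with the seventh-root (compound) version. The two prices are hypothetical values chosen only to give an average daily change of about 0.054%; they are not ABB's actual prices, and the function names are made up for this illustration.

```python
def simple_daily_change(p_now, p_future, days):
    """Total percentage change divided by the number of days."""
    return (p_future / p_now - 1.0) * 100.0 / days


def compound_daily_change(p_now, p_future, days):
    """Daily percentage change that compounds to the total change
    (the days-th root of the price quotient)."""
    return ((p_future / p_now) ** (1.0 / days) - 1.0) * 100.0


# Hypothetical prices, not real ABB quotes.
p_now, p_future, days = 100.0, 100.378, 7

simple = simple_daily_change(p_now, p_future, days)
compound = compound_daily_change(p_now, p_future, days)

print(f"simple:   {simple:.4f}%")
print(f"compound: {compound:.4f}%")
```

For a change this small over one week, the two values differ only marginally, which is why the simpler version is a reasonable shortcut here.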

For training data to be useful, there must be a price record in the future that can be used as y.
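A sketch of how records without a future price could be filtered out with pandas. The column names, the prices and the exact-date lookup are all assumptions made for this example, not the project's actual schema.

```python
import pandas as pd

WINDOW_DAYS = 7  # one-week prediction window

# Hypothetical price records for a single stock (illustrative values only).
records = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2018-10-31", "2018-11-01", "2018-11-07", "2018-11-08", "2018-11-09"]
        ),
        "price": [100.0, 100.5, 100.378, 101.0, 100.8],
    }
)

prices = records.set_index("date")["price"]

# Look up the price WINDOW_DAYS after each record; dates without a record become NaN.
future_dates = prices.index + pd.Timedelta(days=WINDOW_DAYS)
future_prices = prices.reindex(future_dates).to_numpy()

# Daily change in percent: total change over the window divided by the days.
records["y"] = (future_prices / records["price"].to_numpy() - 1.0) * 100.0 / WINDOW_DAYS

# Only records with a known future price can be used for training.
training = records.dropna(subset=["y"]).reset_index(drop=True)
print(training)
```

A real version would also have to decide what to do about weekends and holidays, when no price exists exactly seven days later.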

I'm saving the data to a huge pandas dataframe. Populating it from the database (with some necessary processing) takes ~35 minutes. To avoid repeating this cumbersome process every time, I save the dataframe to a CSV file that can be used directly to train the machine learning algorithm. Loading that file takes less than a second.
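The caching step can be sketched as below. The tiny dataframe and the file name are placeholders for illustration; the real dataframe comes from the database.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the large dataframe built from the database.
df = pd.DataFrame(
    {"P": [100.0, 100.5], "P/E": [20.0, 20.1], "y": [0.054, 0.021]}
)

# The file name is an assumption; a temporary directory keeps the sketch self-contained.
csv_path = os.path.join(tempfile.mkdtemp(), "stock_records.csv")

df.to_csv(csv_path, index=False)  # done once, after the slow database step
loaded = pd.read_csv(csv_path)    # done on every later run

print(loaded)
```

Passing index=False keeps the row index out of the file, so the loaded dataframe has the same columns as the saved one.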

The Data Set
I suspect that some of the features are highly correlated with each other. For example, Price, P/E and P/C are correlated in the short term, since earnings and capital per share seldom change.
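One way to check this suspicion is to look at the correlation matrix of the features. The sketch below uses made-up prices and a constant earnings-per-share figure to show why P and P/E move together when earnings don't change.

```python
import pandas as pd

# Hypothetical prices; earnings per share is held constant over the period.
EARNINGS_PER_SHARE = 5.0

df = pd.DataFrame({"P": [100.0, 102.0, 101.0, 105.0, 107.0]})
df["P/E"] = df["P"] / EARNINGS_PER_SHARE

# Pearson correlation between every pair of columns.
corr = df.corr()
print(corr)
```

With constant earnings, P/E is just a scaled copy of P, so the two columns are perfectly correlated and the second one adds no new information.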

Sometimes, the data contains zeros or null values. 
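A quick way to see how common this is would be to count zeros and nulls per column. The values below are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical feature columns with a few zeros and missing values.
df = pd.DataFrame(
    {
        "P/E": [20.0, 0.0, np.nan, 18.5],
        "Yield": [3.1, 0.0, 2.9, np.nan],
    }
)

null_counts = df.isna().sum()   # missing values per column
zero_counts = (df == 0).sum()   # exact zeros per column

print(pd.DataFrame({"nulls": null_counts, "zeros": zero_counts}))
```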

Regression or Neural Networks

Both regression and neural networks have some advantages for this dataset.

Linear regression

A linear regression would be intuitive for predicting the change in stock price. But there are some zeros in the underlying data that I fear will skew the results. It also seems to be tricky to handle XOR relationships between features. 
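The XOR problem can be illustrated with a toy example, here using scikit-learn's LinearRegression on the four XOR points (this is only a demonstration of the limitation, not part of the stock data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The four XOR input combinations and their outputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

print("coefficients:", model.coef_)
print("intercept:   ", model.intercept_)
print("predictions: ", predictions)
```

The best a straight-line fit can do here is to predict the mean of y for every input, so the model cannot tell the four cases apart.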

Multi-layer neural networks can model complex relationships such as XOR relations and zeroed records. I'll start with this approach.

SciKit-Learn

Sklearn has a neural-network-ish regressor that I will investigate.
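scikit-learn's multi-layer perceptron regressor is MLPRegressor in sklearn.neural_network. Below is a minimal sketch of how it could be tried out. The data is synthetic, and the layer sizes, the scaling step and the train/test split are assumptions for illustration, not tuned choices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 200 records, 7 features
# (e.g. P, P/E, P/C, Yield, PMI, RSI, days to/since dividend).
X = rng.normal(size=(200, 7))
# Synthetic target: a small "daily change" driven by two features plus noise.
y = 0.05 * X[:, 0] - 0.02 * X[:, 1] + rng.normal(scale=0.01, size=200)

# Scale the features first, then fit a small multi-layer perceptron.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=1000, random_state=0),
)

model.fit(X[:150], y[:150])           # train on the first 150 records
predictions = model.predict(X[150:])  # predict the remaining 50

print(predictions[:5])
```

Scaling is included because neural networks tend to train poorly when the features have very different ranges, which is likely for a mix of prices, ratios and day counts.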

