Tuesday, December 31, 2019

Predicting the next decade in the stock market


Making accurate predictions from the vast amount of data produced by the stock markets and the economy itself is difficult. In this post we will examine the performance of five different machine learning models that predict the future ten-year return of the S&P 500, using state-of-the-art libraries such as caret, xgboostExplainer and patchwork. We will use data from Shiller, Goyal and the BLS. The training data spans the years 1948 to 1991, and the test set runs from 1991 only until 2009, because the target variable is the return over the following ten years.

Different investing strategies tend to work at different times, and you should expect the accuracy of any model to move in cycles; sometimes the connection with returns is very strong, and sometimes very weak. Value investing is a great example of a strategy that has not really worked for the past twelve years (source, pdf). Spurious correlations are another source of trouble: two stocks might, for example, move in tandem by pure chance. This highlights the need for some manual selection of intuitive features.

We will use eight different predictors: P/E, P/D, P/B, the CAPE ratio, total return CAPE, inflation, the unemployment rate and the 10-year US government bond rate. All five valuation measures are calculated for the entire S&P 500. Let's start by inspecting the correlation clusters of the different predictors and the future ten-year return (with dividends), which is used as the target.

As expected, the different valuation measures are strongly correlated with each other. All except P/B have a very strong negative correlation with the future ten-year return. CAPE and total return CAPE, a newer measure that also accounts for reinvested dividends, are very strongly correlated with each other. Total return CAPE is, however, slightly less correlated with the future ten-year return than the regular CAPE.

The machine learning models

First, we will create a naïve model which predicts the future return to be the same as the average return in the training set. After training the five models we will also combine them into one ensemble model to see if it can reach a higher accuracy than any single model, which is usually the case.
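A baseline like this is a one-liner in base R. A minimal sketch, where `train_y` is a hypothetical vector of ten-year returns, not the actual data:

```r
# Naive baseline: predict the training-set mean return for every observation
train_y <- c(0.08, 0.11, 0.05, 0.14, 0.09)  # hypothetical annualized ten-year returns
naive_pred <- mean(train_y)                  # the single constant prediction
```

Every more complex model below has to beat this constant prediction to be worth using.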

The models we are going to use are quite different from each other. The glmnet model is just like the linear model, except that it shrinks the coefficients according to a penalty to avoid overfitting. It therefore has very low flexibility and also performs automated feature selection (unless the alpha hyperparameter is exactly zero, as in ridge regression). K-nearest-neighbors makes its predictions by averaging the outcomes of the most similar observations. MARS, on the other hand, captures nonlinearities in the data and also considers interactions between the features. XGBoost is a tree model, which likewise captures both nonlinearities and interactions. It improves on a plain tree ensemble by building each tree on the residuals of the previous trees (boosting), which may lead to better accuracy. Both MARS and SVM (support vector machines) are very flexible and may therefore overfit quite easily, especially when the data set is small. The XGBoost model is also quite flexible but does not overfit as easily, since it performs regularization and pruning.

Finally, we have the ensemble model which simply gives the mean of the predictions of all the models. Ensemble models are a quite popular strategy in machine learning competitions to reach accuracies beyond the accuracy of any single model.
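Since the ensemble is just the row-wise mean of the individual predictions, it is trivial to compute. A sketch with made-up predictions (the numbers are placeholders, not the models' actual output):

```r
# Hypothetical predictions from the five models for three test observations
preds <- data.frame(
  glmnet = c(0.04, 0.05, 0.06),
  knn    = c(0.05, 0.06, 0.05),
  mars   = c(0.02, 0.08, 0.10),
  svm    = c(0.07, 0.03, 0.04),
  xgb    = c(0.06, 0.05, 0.06)
)

# Unweighted average across models, one ensemble prediction per observation
ensemble_pred <- rowMeans(preds)
```

A weighted average (e.g. weighting by validation accuracy) is a common refinement, but the simple mean is what is used here.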

The models will be built using the caret wrapper, and the optimal hyperparameters are chosen using time slicing, a cross-validation technique suitable for time series. We will use five time slices to capture as many periods as possible while still having enough observations in each of them. The cross-validation is done on the training data, which consists of 70 percent of the observations, while the remaining 30 percent is kept as a test set. The results are shown below:
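In caret, time slicing is configured through `trainControl`. A sketch of the setup; the window sizes below are illustrative assumptions, not the exact values used in the post:

```r
library(caret)

# Rolling-origin ("time slice") cross-validation: each resample trains on a
# window of consecutive periods and validates on the period right after it
ctrl <- trainControl(
  method        = "timeslice",
  initialWindow = 24,    # length of the first training window (assumed)
  horizon       = 6,     # length of each validation window (assumed)
  fixedWindow   = TRUE,  # slide a fixed-size window forward in time
  skip          = 5      # gap between slices, keeping the slice count small
)

# A model is then fitted with e.g.
# fit <- train(ret10y ~ ., data = train_df, method = "glmnet", trControl = ctrl)
```

Because validation data always lies after the training window, this avoids the look-ahead bias that ordinary k-fold cross-validation would introduce for time series.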

Results


The predictions are less accurate after the red line, which separates the training and test sets. The model has not seen the data on the right side of the line, so its accuracy there can be thought of as a proxy for how well the model would perform in the future.

We will examine the model accuracies on the test set using two measures: mean absolute error (MAE) and R-squared (R²). The results are shown in the table below:

Model        MAE       R²
Naive model  5,16 %    -
Ensemble     2,15 %    48,2 %
GLMNET       3,00 %    29,7 %
KNN          3,37 %    10,6 %
MARS         10,70 %   90,2 %
SVM          10,80 %   13,1 %
XGBoost      2,17 %    60,1 %
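For reference, both metrics are easy to compute by hand. A base-R sketch with made-up vectors, not the actual predictions; note that caret reports R² as the squared correlation between predictions and actual values:

```r
# Mean absolute error and R-squared as used in the table above
mae       <- function(actual, pred) mean(abs(actual - pred))
r_squared <- function(actual, pred) cor(actual, pred)^2

actual <- c(0.10, 0.06, 0.03, 0.12)  # made-up ten-year returns
pred   <- c(0.08, 0.04, 0.05, 0.10)  # made-up model predictions
mae(actual, pred)                    # 0.02
```

The squared-correlation form of R² explains how a model can combine a high R² with a large MAE: predictions that track the direction of the outcome but are badly scaled still correlate well.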

The two most flexible models, MARS and SVM, behave wildly on the test set and show signs of overfitting. Both have mean absolute errors roughly twice that of the naïve model. Even though MARS has a high R-squared, its mean absolute error is high, which is why you cannot trust R-squared alone. Glmnet produces quite plausible predictions until the year 2009, most likely because of the rapid growth of the P/E ratio. K-nearest-neighbors has not reacted to the data much but still achieves a quite low MAE. Of the single models, XGBoost has performed the best. The ensemble model, however, has performed slightly better as measured by the MAE. It also seems to be the most stable model, which is expected since it combines the predictions of the other models.

Let's then look at the feature importances. They are calculated in different ways for the different model types but should still be somewhat comparable. The plotting is done using the patchwork library, which allows plots to be combined simply by adding them together with a plus sign.
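patchwork composes ggplots with ordinary arithmetic operators. A minimal sketch; the data frames and plots here are placeholders, not the actual importance plots:

```r
library(ggplot2)
library(patchwork)

# Two hypothetical variable-importance tables
imp1 <- data.frame(feature = c("CAPE", "P/B"),  importance = c(0.9, 0.1))
imp2 <- data.frame(feature = c("P/E", "Rate"),  importance = c(0.6, 0.4))

p1 <- ggplot(imp1, aes(importance, feature)) + geom_col()
p2 <- ggplot(imp2, aes(importance, feature)) + geom_col()

p1 + p2            # plus sign places the plots side by side
# (p1 + p2) / p1   # the division operator stacks rows
```

This operator syntax is what makes it easy to tile one importance plot per model into a single figure.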

Upon closer inspection of the feature importances, we see that the MARS model uses just the CAPE ratio as a feature, while the rest of the models use the features more evenly. Most of the models perform some sort of feature selection, which can also be seen from the plot.

Future predictions

Lastly, we will predict the next ten years in the stock market and compare the predictions of the different models. We will also look closer at the best performing single model, XGBoost, by inspecting the composition of the prediction. The current values of the features are mostly obtained from the sources listed in the first chapter, but also from Trading Economics and multpl.

Model     10-year CAGR prediction
Ensemble  2,20 %
GLMNET    1,47 %
KNN       4,04 %
MARS      -9,85 %
SVM       6,46 %
XGBoost   8,86 %

The MARS model is the most pessimistic, with a return prediction that is quite strongly negative. The model should however not be trusted too much since it uses only one variable and does not behave well on the test data. The XGBoost model is surprisingly optimistic, with a prediction of almost nine percent per year. The prediction of the ensemble model is quite low but would be three percentage points higher without the MARS model.

Let's then look at the XGBoost model more closely using the xgboostExplainer library. The resulting plot is a waterfall chart showing the composition of a single prediction, in this case the predicted CAGR (plus one) for the next ten years. The high CAPE ratio reduces the predicted CAGR by seven percentage points, but the P/B ratio increases it by six percentage points; this is because the model contains interactions between the CAPE and P/B ratios. The effect of the interest rate level is slightly positive at two percentage points, but the currently high P/E ratio cancels it out. The rest of the features have only a very small effect on the prediction.
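A similar per-feature breakdown can also be obtained directly from the xgboost package with `predcontrib = TRUE`, which is useful for checking the waterfall chart. A self-contained sketch on toy data, not the post's actual model:

```r
library(xgboost)

# Toy data standing in for the real features; this only illustrates the mechanics
set.seed(1)
X <- matrix(rnorm(200), ncol = 2, dimnames = list(NULL, c("CAPE", "PB")))
y <- 0.15 - 0.01 * X[, "CAPE"] + rnorm(100, sd = 0.01)

bst <- xgboost(data = X, label = y, nrounds = 20, verbose = 0)

# Per-feature contributions to a single prediction; the "BIAS" column plays
# the role of the intercept in the waterfall chart, and all columns sum to
# the model's prediction for that observation
contrib <- predict(bst, X[1, , drop = FALSE], predcontrib = TRUE)
```

The fact that the contributions sum exactly to the prediction is what makes this decomposition a sanity check for any explainer output.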

The benefit of predicting the returns of a single stock market is mostly limited to adjusting your expectations for the future. Predicting the returns of multiple stock markets and investing in the ones with the highest return predictions, however, is most likely a very profitable strategy. Klement (2012) has shown that the CAPE ratio alone does quite a good job of predicting the returns of different stock markets. Adding more sensible variables is likely to make the model more stable and perhaps better at predicting the outcome.


Be sure to follow me on Twitter for updates about new blog posts like this!

The R code used in the analysis can be found here.


15 comments

  1. The predictions for the next ten years using the full data (training + test) in the models, as should be done when a model is put into production, are as follows:

    Ensemble 5.65%
    GLMNET 3.81%
    KNN 6.98%
    MARS 6.15%
    SVM 1.23%
    XGBoost 10.0%

    ReplyDelete
  2. Thanks for the study. Could you change the target from 10-year returns to 1-year returns and redo the analysis? People who use models to predict returns and set their asset allocation will run such models at least yearly.

    ReplyDelete
    Replies
    1. Sure that could be done, but the accuracies would be much lower. In fact, I ran it using one-year return as the target and the R-squareds ranged from 7%-16% and MAEs from 0.21-0.94, both on the test set. So, it is quite hard to predict one-year returns, at least with these simple models. Of course I don't know about the potential returns of switching the allocation to the countries with highest one-year predictions each year.

      Delete
  3. Could you provide or explain how to get the file bls_data.xlsx that is read by the script StockMarketLongTermForecast.R ? Thanks.

    ReplyDelete
    Replies
    1. The data is available here, you just have to change the "from" year to 1948 and click "go" and "download" just above the table. Also the file name has to be changed to "bls_data".

      Delete
  4. What about Market Capitalization/GDP? Isn't that a much better individual indicator?

    ReplyDelete
    Replies
    1. It's a good indicator and I could have included it. It might however be troublesome to use it between countries since some have "naturally" small stock markets as compared to GDPs, so you would have to use it relative to history for each market.

      Delete
  5. Hi, thank you for sharing this analysis with us. I have one question: how can you explain the high importance of the intercept in the xgboost model above? Is this some kind of average CAGR over the train sample?

    ReplyDelete
    Replies
    1. I had accidentally typed that the ten-year return does not include reinvested dividends, which it does. The CAGR in the training sample is on average 10.6%, which is most likely the reason the intercept is so high.

      Delete
    2. Right, I see. What is the scale of the xgboost explainer? We have 1.12 for the intercept and 1.09 for the whole prediction. Thus the prediction is more or less: intercept (average) + a bit of CAPE and a bit of PB. MARS also uses CAPE and arrives at a totally different prediction. Although importance indicates that CAPE dominates the prediction. Maybe MARS missed the intercept?

      Delete
    3. The 1.12 stands for 12% return, and 1.09 for 9% return. You're correct that the prediction comes mostly from the intercept, however because the scale is a bit weird the contribution is actually smaller than it seems from the graph, at 12 percentage points. I should have scaled the graph better so that the intercept would be at 0.12.

      MARS has a strongly negative forecast most likely because the only time we've had CAPE ratios this high has been during the tech bubble (the 1930s are not in the training data).

      Delete
    4. Thank you for this clarification.

      Delete
  6. This comment has been removed by the author.

    ReplyDelete
  7. good stuff

    for economic indicators I would use the 10y-3mo yield spread and either leading indicators or industrial production or maybe unemployment claims. unemployment rate is a lagging indicator.

    I would use walkforward crossvalidation, train for first 20 years and predict 1 month, walk forward month by month.

    test set results are irrelevant in this context.

    ReplyDelete
    Replies
    1. also would separately predict nominal returns and real returns , and probably pick only 1 of CAPE and TR_CAPE and 1 of P/E, P/D because they are so highly correlated

      Delete