Tuesday, December 31, 2019

Predicting the next decade in the stock market


Making accurate predictions from the vast amount of data produced by the stock markets and the economy itself is difficult. In this post we will examine the performance of five different machine learning models that predict the future ten-year return of the S&P 500, using state-of-the-art libraries such as caret, xgboostExplainer and patchwork. We will use data from Shiller, Goyal and the BLS. The training data spans the years 1948 to 1991, and the test set runs from 1991 only until 2009, because the target variable is the return over the following ten years.

Different investing strategies tend to work at different times, and you should expect the accuracy of any model to move in cycles; sometimes the connection with returns is very strong, and sometimes very weak. Value investing is a great example of a strategy that has not really worked for the past twelve years (source, pdf). Spurious correlations are another source of trouble: two stocks might, for example, move in tandem by pure chance. This highlights the need for some manual selection of intuitive features.

We will use eight different predictors: P/E, P/D, P/B, the CAPE ratio, total return CAPE, inflation, the unemployment rate and the 10-year US government bond rate. All five valuation measures are calculated for the entire S&P 500. Let's start by inspecting the correlation clusters of the different predictors and the future ten-year return (with dividends), which is used as the target.

As expected, the different valuation measures are strongly correlated with each other. All except P/B have a very strong negative correlation with the future ten-year return. CAPE and total return CAPE, a newer measure that also accounts for reinvested dividends, are very strongly correlated with each other. Total return CAPE is, however, slightly less correlated with the future ten-year return than the regular CAPE.

The machine learning models

First, we will create a naïve model which predicts the future return to be the same as the average return in the training set. After training the five models we will also combine them into one ensemble model to see if it can reach a higher accuracy than any single model, which is usually the case.
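A baseline like this is a one-liner in base R. A minimal sketch, where `train_y` is a hypothetical vector of ten-year returns, not the actual data:

```r
# Naive baseline: predict the training-set mean return for every observation
train_y <- c(0.08, 0.11, 0.05, 0.14, 0.09)  # hypothetical annualized ten-year returns
naive_pred <- mean(train_y)                  # the single constant prediction
```

Every more complex model below has to beat this constant prediction to be worth using.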

The models we are going to use are quite different from each other. The glmnet model is just like the linear model, except that it shrinks the coefficients according to a penalty to avoid overfitting. It therefore has very low flexibility and also performs automated feature selection (unless the alpha hyperparameter is exactly zero, as in ridge regression). K-nearest-neighbors makes its predictions by averaging the outcomes of the most similar observations. MARS, on the other hand, captures nonlinearities in the data and also considers interactions between the features. XGBoost is a tree model, which likewise captures both nonlinearities and interactions. It improves on a plain tree ensemble by building each tree on the residuals of the previous trees (boosting), which may lead to better accuracy. Both MARS and SVM (support vector machines) are very flexible and may therefore overfit quite easily, especially when the data set is small. The XGBoost model is also quite flexible but does not overfit as easily, since it performs regularization and pruning.

Finally, we have the ensemble model which simply gives the mean of the predictions of all the models. Ensemble models are a quite popular strategy in machine learning competitions to reach accuracies beyond the accuracy of any single model.
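Since the ensemble is just the row-wise mean of the individual predictions, it is trivial to compute. A sketch with made-up predictions (the numbers are placeholders, not the models' actual output):

```r
# Hypothetical predictions from the five models for three test observations
preds <- data.frame(
  glmnet = c(0.04, 0.05, 0.06),
  knn    = c(0.05, 0.06, 0.05),
  mars   = c(0.02, 0.08, 0.10),
  svm    = c(0.07, 0.03, 0.04),
  xgb    = c(0.06, 0.05, 0.06)
)

# Unweighted average across models, one ensemble prediction per observation
ensemble_pred <- rowMeans(preds)
```

A weighted average (e.g. weighting by validation accuracy) is a common refinement, but the simple mean is what is used here.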

The models will be built using the caret wrapper, and the optimal hyperparameters are chosen using time slicing, a cross-validation technique suitable for time series. We will use five time slices to capture as many periods as possible while still having enough observations in each of them. The cross-validation is done on the training data, which consists of 70 percent of the observations, while the remaining 30 percent is kept as a test set. The results are shown below:
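In caret, time slicing is configured through `trainControl`. A sketch of the setup; the window sizes below are illustrative assumptions, not the exact values used in the post:

```r
library(caret)

# Rolling-origin ("time slice") cross-validation: each resample trains on a
# window of consecutive periods and validates on the period right after it
ctrl <- trainControl(
  method        = "timeslice",
  initialWindow = 24,    # length of the first training window (assumed)
  horizon       = 6,     # length of each validation window (assumed)
  fixedWindow   = TRUE,  # slide a fixed-size window forward in time
  skip          = 5      # gap between slices, keeping the slice count small
)

# A model is then fitted with e.g.
# fit <- train(ret10y ~ ., data = train_df, method = "glmnet", trControl = ctrl)
```

Because validation data always lies after the training window, this avoids the look-ahead bias that ordinary k-fold cross-validation would introduce for time series.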

Results


The predictions are less accurate after the red line, which separates the training and test sets. The model has not seen the data on the right side of the line, so its accuracy there can be thought of as a proxy for how well the model would perform in the future.

We will examine the model accuracies on the test set using two measures: mean absolute error (MAE) and R-squared (R²). The results are shown in the table below:

Model        MAE       R²
Naive model  5,16 %    -
Ensemble     2,15 %    48,2 %
GLMNET       3,00 %    29,7 %
KNN          3,37 %    10,6 %
MARS         10,70 %   90,2 %
SVM          10,80 %   13,1 %
XGBoost      2,17 %    60,1 %
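For reference, both metrics are easy to compute by hand. A base-R sketch with made-up vectors, not the actual predictions; note that caret reports R² as the squared correlation between predictions and actual values:

```r
# Mean absolute error and R-squared as used in the table above
mae       <- function(actual, pred) mean(abs(actual - pred))
r_squared <- function(actual, pred) cor(actual, pred)^2

actual <- c(0.10, 0.06, 0.03, 0.12)  # made-up ten-year returns
pred   <- c(0.08, 0.04, 0.05, 0.10)  # made-up model predictions
mae(actual, pred)                    # 0.02
```

The squared-correlation form of R² explains how a model can combine a high R² with a large MAE: predictions that track the direction of the outcome but are badly scaled still correlate well.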

The two most flexible models, MARS and SVM, behave wildly on the test set and show signs of overfitting. Both have mean absolute errors roughly twice that of the naïve model. Even though MARS has a high R-squared, its mean absolute error is high, which is why you cannot trust R-squared alone. Glmnet produces quite plausible predictions until the year 2009, most likely because of the rapid growth of the P/E ratio. K-nearest-neighbors has not reacted to the data much but still achieves a quite low MAE. Of the single models, XGBoost has performed the best. The ensemble model, however, has performed slightly better as measured by the MAE. It also seems to be the most stable model, which is expected since it combines the predictions of the other models.

Let's then look at the feature importances. They are calculated in different ways for the different model types but should still be somewhat comparable. The plotting is done using the patchwork library, which allows plots to be combined simply by adding them together with a plus sign.
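patchwork composes ggplots with ordinary arithmetic operators. A minimal sketch; the data frames and plots here are placeholders, not the actual importance plots:

```r
library(ggplot2)
library(patchwork)

# Two hypothetical variable-importance tables
imp1 <- data.frame(feature = c("CAPE", "P/B"),  importance = c(0.9, 0.1))
imp2 <- data.frame(feature = c("P/E", "Rate"),  importance = c(0.6, 0.4))

p1 <- ggplot(imp1, aes(importance, feature)) + geom_col()
p2 <- ggplot(imp2, aes(importance, feature)) + geom_col()

p1 + p2            # plus sign places the plots side by side
# (p1 + p2) / p1   # the division operator stacks rows
```

This operator syntax is what makes it easy to tile one importance plot per model into a single figure.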

Upon closer inspection of the feature importances, we see that the MARS model uses just the CAPE ratio as a feature, while the rest of the models use the features more evenly. Most of the models perform some sort of feature selection, which can also be seen from the plot.

Future predictions

Lastly, we will predict the next ten years in the stock market and compare the predictions of the different models. We will also look closer at the best performing single model, XGBoost, by inspecting the composition of the prediction. The current values of the features are mostly obtained from the sources listed in the first chapter, but also from Trading Economics and multpl.

Model     10-year CAGR prediction
Ensemble  2,20 %
GLMNET    1,47 %
KNN       4,04 %
MARS      -9,85 %
SVM       6,46 %
XGBoost   8,86 %

The MARS model is the most pessimistic, with a return prediction that is quite strongly negative. The model should however not be trusted too much since it uses only one variable and does not behave well on the test data. The XGBoost model is surprisingly optimistic, with a prediction of almost nine percent per year. The prediction of the ensemble model is quite low but would be three percentage points higher without the MARS model.

Let's then look at the XGBoost model more closely using the xgboostExplainer library. The resulting plot is a waterfall chart showing the composition of a single prediction, in this case the predicted CAGR (plus one) for the next ten years. The high CAPE ratio reduces the predicted CAGR by seven percentage points, but the P/B ratio increases it by six percentage points; this is because the model contains interactions between the CAPE and P/B ratios. The effect of the interest rate level is slightly positive at two percentage points, but the currently high P/E ratio cancels it out. The rest of the features have only a very small effect on the prediction.
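A similar per-feature breakdown can also be obtained directly from the xgboost package with `predcontrib = TRUE`, which is useful for checking the waterfall chart. A self-contained sketch on toy data, not the post's actual model:

```r
library(xgboost)

# Toy data standing in for the real features; this only illustrates the mechanics
set.seed(1)
X <- matrix(rnorm(200), ncol = 2, dimnames = list(NULL, c("CAPE", "PB")))
y <- 0.15 - 0.01 * X[, "CAPE"] + rnorm(100, sd = 0.01)

bst <- xgboost(data = X, label = y, nrounds = 20, verbose = 0)

# Per-feature contributions to a single prediction; the "BIAS" column plays
# the role of the intercept in the waterfall chart, and all columns sum to
# the model's prediction for that observation
contrib <- predict(bst, X[1, , drop = FALSE], predcontrib = TRUE)
```

The fact that the contributions sum exactly to the prediction is what makes this decomposition a sanity check for any explainer output.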

The benefit of predicting the returns of a single stock market is mostly limited to adjusting your expectations for the future. Predicting the returns of multiple stock markets and investing in the ones with the highest return predictions, however, is most likely a very profitable strategy. Klement (2012) has shown that the CAPE ratio alone does quite a good job of predicting the returns of different stock markets. Adding more sensible variables is likely to make the model more stable and perhaps better at predicting the outcome.


Be sure to follow me on Twitter for updates about new blog posts like this!

The R code used in the analysis can be found here.


15 comments

  1. The predictions for the next ten years using the full data (training + test) in the models, as should be done when a model is put into production, are as follows:

    Ensemble 5.65%
    GLMNET 3.81%
    KNN 6.98%
    MARS 6.15%
    SVM 1.23%
    XGBoost 10.0%

    ReplyDelete
  2. Thanks for the study. Could you change the target from 10-year returns to 1-year returns and redo the analysis? People who use models to predict returns and set their asset allocation will run such models at least yearly.

    ReplyDelete
    Replies
    1. Sure that could be done, but the accuracies would be much lower. In fact, I ran it using one-year return as the target and the R-squareds ranged from 7%-16% and MAEs from 0.21-0.94, both on the test set. So, it is quite hard to predict one-year returns, at least with these simple models. Of course I don't know about the potential returns of switching the allocation to the countries with highest one-year predictions each year.

      Delete
  3. Could you provide or explain how to get the file bls_data.xlsx that is read by the script StockMarketLongTermForecast.R ? Thanks.

    ReplyDelete
    Replies
    1. The data is available here, you just have to change the "from" year to 1948 and click "go" and "download" just above the table. Also the file name has to be changed to "bls_data".

      Delete
  4. What about Market Capitalization/GDP? Isn't that a much better individual indicator?

    ReplyDelete
    Replies
    1. It's a good indicator and I could have included it. It might however be troublesome to use it between countries since some have "naturally" small stock markets as compared to GDPs, so you would have to use it relative to history for each market.

      Delete
  5. Hi, thank you for sharing this analysis with us. I have one question: how can you explain the high importance of the intercept in the xgboost model above? Is this some kind of average CAGR over the train sample?

    ReplyDelete
    Replies
    1. I had accidentally typed that the ten-year return does not include reinvested dividends, which it does. The CAGR in the training sample is on average 10.6%, which is most likely the reason the intercept is so high.

      Delete
    2. Right, I see. What is the scale of the xgboost explainer? We have 1.12 for the intercept and 1.09 for the whole prediction. Thus the prediction is more or less: intercept (average) + a bit of CAPE and a bit of PB. MARS also uses CAPE and arrives at a totally different prediction. Although importance indicates that CAPE dominates the prediction. Maybe MARS missed the intercept?

      Delete
    3. The 1.12 stands for 12% return, and 1.09 for 9% return. You're correct that the prediction comes mostly from the intercept, however because the scale is a bit weird the contribution is actually smaller than it seems from the graph, at 12 percentage points. I should have scaled the graph better so that the intercept would be at 0.12.

      MARS has a strongly negative forecast most likely because the only time we've had CAPE ratios this high has been during the tech bubble (the 1930s are not in the training data).

      Delete
    4. Thank you for this clarification.

      Delete
  6. This comment has been removed by the author.

    ReplyDelete
  7. good stuff

    for economic indicators I would use the 10y-3mo yield spread and either leading indicators or industrial production or maybe unemployment claims. unemployment rate is a lagging indicator.

    I would use walkforward crossvalidation, train for first 20 years and predict 1 month, walk forward month by month.

    test set results are irrelevant in this context.

    ReplyDelete
    Replies
    1. also would separately predict nominal returns and real returns , and probably pick only 1 of CAPE and TR_CAPE and 1 of P/E, P/D because they are so highly correlated

      Delete