This post introduces one way of forecasting stock index returns in the US market. Typically, single measures such as CAPE have been used for this, but they are less accurate than models that use many variables, and their relationship with returns can differ between markets. Furthermore, it is possible to train different types of models and combine them to increase the accuracy even more, as is done in this post.

We'll use a variety of time series models, with the goal of forecasting future returns for the S&P 500. The variable to be forecasted is the annualized future ten-year return, and all of the models except ETS are dynamic, i.e. they also use regressors such as valuation multiples, which are mostly the same ones as in the previous post. We'll also use the same data sources as in that post, which I highly recommend reading before this one.
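To make the target variable concrete, here is a minimal sketch of how an annualized future ten-year return could be computed. It is written in Python for illustration only (the post's own analysis is in R), and it assumes monthly observations of a total return index; the function name `forward_annualized_returns` and the toy series are hypothetical.

```python
def forward_annualized_returns(index_levels, horizon_months=120):
    """Annualized return over the next `horizon_months` for each month.

    `index_levels` is assumed to be a chronological list of month-end
    index levels. Months without a full future window get None,
    because their ten-year return is not known yet.
    """
    years = horizon_months / 12
    out = []
    for t, start in enumerate(index_levels):
        end_pos = t + horizon_months
        if end_pos < len(index_levels):
            growth = index_levels[end_pos] / start
            out.append(growth ** (1 / years) - 1)
        else:
            out.append(None)
    return out


# Toy series growing exactly 0.5% per month, so every complete window
# annualizes to 1.005**12 - 1.
levels = [100 * 1.005 ** m for m in range(130)]
targets = forward_annualized_returns(levels)
print(targets[0], targets[-1])
```

Note that the last 120 months of any sample have no target value, which is why the most recent observations cannot be used for training or testing.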

The sample, which consists of observations between the years 1948 and 2010, will be split into training and test sets using a 65/35 split. These proportions were chosen to ensure that the test set has both low and high values of the future ten-year returns so that the models can be properly assessed. Time series cross-validation could have been used to get more reliable accuracy metrics, but for our purpose a simple train/test split is good enough.

Because the value to be forecasted represents the *future* ten-year returns, we have to further split the test set, separating the first ten years from the rest. This is done to avoid data leakage, or more accurately, look-ahead bias: we cannot forecast future ten-year returns using last month's future ten-year returns, since they are not yet known at the moment of forecasting, but we can forecast them using the future ten-year returns from ten years ago, which correspond to the returns of the past ten years. In the plots, the gray vertical line separates the test set from the training set, and the red vertical line marks the start of the look-ahead bias-free test set, which is also the set the accuracy measures are calculated on.

We'll use five different models plus a combination model, which is the average of these models. In addition, we'll also include an ETS model that is not part of the combination model, just to see how using the return of the last ten years as a reference point, which is something that people do, can be dangerous. It is also included to show that there are no clear patterns in the data, such as seasonality or trends, that could be used to make the forecasts. Some of the models are quite complicated, so we won't go into too much detail in this post, but they all operate on the same idea: information about the future values of the variable to be forecasted exists in the regressors, but also in the past values of that variable.
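The chronological split and the ten-year separation inside the test set can be sketched roughly as follows. This is a Python illustration rather than the R code behind the post, and it assumes monthly observations (so ten years is 120 rows) and a sample of exactly 63 full years; `split_with_embargo` is a hypothetical helper name.

```python
def split_with_embargo(n_obs, train_share=0.65, embargo=120):
    """Return index ranges for train, embargoed test, and bias-free test.

    Observations are assumed to be monthly and in chronological order.
    The first `embargo` test observations are set aside because their
    lagged targets would not have been known at forecast time.
    """
    n_train = int(n_obs * train_share)
    gap_end = min(n_train + embargo, n_obs)
    train = range(0, n_train)
    embargoed = range(n_train, gap_end)
    bias_free = range(gap_end, n_obs)
    return train, embargoed, bias_free


# 63 years of monthly data (1948-2010) would be 756 observations.
train, embargoed, bias_free = split_with_embargo(63 * 12)
print(len(train), len(embargoed), len(bias_free))
```

Under these assumptions, only the last block (`bias_free`) would be used for the accuracy measures, which corresponds to the region to the right of the red line in the plots.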

Let's fit the models and take a look at the results:

The Prophet model by Facebook, the neural network-based nnetar, and the TSLM, which takes into account just the regressors, seasonality and trend, seem to be the least accurate models if we exclude the ETS model. The nnetar model has the narrowest prediction intervals, which are however quite inaccurate, perhaps because neural network models usually require much more data. In contrast, the Vector Autoregression and ARIMA models both seem to be quite accurate and have realistic prediction intervals. Of these two, the Vector Autoregression model seems to be more confident about its predictions, most likely because the model is more complicated than ARIMA. Notice that the combination forecast has not been plotted alongside the other models just yet.

The ETS model shows that forecasting based on the past ten-year return would have been horribly wrong. The forecast was made when the ten-year return was at nearly an all-time high, yet the first actual ten-year return was very close to its all-time low. Using the average ten-year return instead of the latest one would have resulted in more accurate forecasts due to mean reversion in the ten-year return.

Let's then take a look at the accuracy measures of all the models on the bias-free test set, sorted from the most to the least accurate based on Mean Absolute Error:

Model | ME | RMSE | MAE | MPE | MAPE | MASE | R-squared |
---|---|---|---|---|---|---|---|
Combination | -0.0033158 | 0.0183 | 0.0129 | -0.3281969 | 1.24 | 0.679 | 0.896 |
ARIMA | 0.0114286 | 0.0363 | 0.0176 | 0.9895734 | 1.6 | 0.926 | 0.739 |
VAR | 0.0009312 | 0.0282 | 0.0234 | 0.0740454 | 2.23 | 1.23 | 0.768 |
nnetar | 0.0241653 | 0.0417 | 0.0389 | 2.1628762 | 3.64 | 2.05 | 0.744 |
Prophet | -0.0380064 | 0.0474 | 0.0417 | -3.5044745 | 3.87 | 2.19 | 0.661 |
TSLM | -0.0150980 | 0.0609 | 0.0443 | -1.3630049 | 4.17 | 2.33 | 0.458 |
ETS | -0.1346834 | 0.142 | 0.135 | -12.9110549 | 12.9 | 7.09 | NA |
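For readers unfamiliar with the column names, the following is a rough Python sketch of how such measures are commonly defined. Several details are assumptions rather than confirmed features of the post's R code: errors are taken as actual minus forecast, MASE is scaled by the in-sample error of a one-step naive forecast, and R-squared is computed as the squared Pearson correlation. The input numbers are made up.

```python
import math


def accuracy(actual, forecast, train):
    """Point-forecast accuracy measures for one model.

    Assumed conventions: error = actual - forecast; MASE divides the
    MAE by the MAE of a one-step naive forecast on `train`; R-squared
    is the squared correlation between actual and forecast.
    """
    n = len(actual)
    errors = [a - f for a, f in zip(actual, forecast)]
    me = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mpe = 100 * sum(e / a for e, a in zip(errors, actual)) / n
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n

    # In-sample naive benchmark: predict each value with the previous one.
    naive_errors = [abs(b - a) for a, b in zip(train, train[1:])]
    mase = mae / (sum(naive_errors) / len(naive_errors))

    mean_a = sum(actual) / n
    mean_f = sum(forecast) / n
    cov = sum((a - mean_a) * (f - mean_f) for a, f in zip(actual, forecast))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_f = sum((f - mean_f) ** 2 for f in forecast)
    r_squared = cov * cov / (var_a * var_f)

    return {"ME": me, "RMSE": rmse, "MAE": mae, "MPE": mpe,
            "MAPE": mape, "MASE": mase, "R2": r_squared}


# Made-up numbers purely to exercise the function.
train = [0.10, 0.12, 0.11, 0.13]
actual = [0.08, 0.10, 0.12]
forecast = [0.09, 0.10, 0.10]
metrics = accuracy(actual, forecast, train)
print(metrics)
```

With these definitions, a MASE below one means the model beat the naive benchmark's in-sample error, and RMSE penalizes large misses more heavily than MAE does.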

The combination forecast was the most accurate, which is not surprising considering that some of the models were biased upwards and some downwards from the actual values. The magnitude and direction of the bias can be seen from the Mean Error. The MAE directly tells us how many percentage points the forecast was off on average. So if the actual ten-year return had been ten percent, the forecast would on average have been just 1.29 percentage points off, which is almost twice as good as the accuracy reached by the machine learning ensemble model introduced in the previous post. The combination forecast also has fewer large errors, as shown by its considerably lower RMSE compared to the other models. It also has a much higher correlation with the actual values, as shown by the R-squared.
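The cancellation of opposite biases can be checked directly from the Mean Error column of the accuracy table. Because an equal-weight combination is a plain average of the forecasts, its Mean Error is simply the average of the individual Mean Errors. The short Python sketch below uses the five values from the table (ETS is excluded, as it is not part of the combination).

```python
# Mean Error of each individual model, copied from the accuracy table.
mean_errors = {
    "ARIMA": 0.0114286,
    "VAR": 0.0009312,
    "nnetar": 0.0241653,
    "Prophet": -0.0380064,
    "TSLM": -0.0150980,
}

# Averaging forecasts is a linear operation, so the Mean Error of an
# equal-weight combination equals the average of the Mean Errors.
combined_me = sum(mean_errors.values()) / len(mean_errors)

# Largest individual bias in absolute terms, for comparison.
worst_bias = max(abs(v) for v in mean_errors.values())
print(combined_me, worst_bias)
```

The result agrees with the Combination row's Mean Error of -0.0033158 up to rounding, and it is far smaller in magnitude than the bias of Prophet or nnetar on their own.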

It would be possible to further increase the accuracy of the combination model by combining it with pure machine learning models such as XGBoost, which were tested in the previous post. Another way would be to include only a subset of the models, such as the ones with the highest accuracies and opposite biases. This would however require further splitting the data and is beyond the scope of this post.

Lastly, we'll forecast the future returns using all the models, but we'll plot just the results of the combination model. The models will now be trained with the full data (training + test sets) so that we can get as accurate results as possible. Since the data is not complete for all of the predictors for the past few months, we'll also have to fill the latest values in using data from multpl and FRED, and impute the missing values in between using spline imputation. The results are as follows:

The zoomed part includes the bias-free test set forecasts from models trained on just the training set, and the future forecasts in blue. During the past few years, the expected return according to the model has increased from less than five percent to over eight percent. The part we are especially interested in is the latest forecast, which is the forecast for the next ten years starting from this moment. These forecasts for the different models are shown in the table below, sorted from lowest to highest:

Model | 10-year CAGR forecast |
---|---|
VAR | 2.41% |
Prophet | 7.08% |
nnetar | 7.68% |
Combination | 8.35% |
TSLM | 10.75% |
ETS | 13.50% |
ARIMA | 13.82% |

The Vector Autoregression model makes the lowest forecast, while the ARIMA model makes the highest. This is especially interesting since these were the two most accurate individual models on the test set, yet they make such different forecasts. The combination forecast is about three percentage points below the historical return of the S&P 500, which is not that bad all things considered.
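As a quick sanity check on the forecast table, the combination value can be reproduced as the plain average of the five individual forecasts (ETS excluded), and a CAGR can be turned into cumulative ten-year growth by compounding. The Python sketch below uses only the numbers from the table; the variable names are illustrative.

```python
# 10-year CAGR forecasts in percent, copied from the table (ETS excluded
# because it is not part of the combination model).
forecasts_pct = {
    "VAR": 2.41,
    "Prophet": 7.08,
    "nnetar": 7.68,
    "TSLM": 10.75,
    "ARIMA": 13.82,
}

# Equal-weight combination of the five forecasts.
combination_pct = sum(forecasts_pct.values()) / len(forecasts_pct)

# Spread between the most and least optimistic model, in percentage points.
spread_pct = max(forecasts_pct.values()) - min(forecasts_pct.values())

# Cumulative growth of one unit invested for ten years at the combined CAGR.
growth_multiple = (1 + combination_pct / 100) ** 10
print(combination_pct, spread_pct, growth_multiple)
```

The average rounds to the 8.35% shown in the table, while the spread of more than eleven percentage points between VAR and ARIMA shows how much the individual models disagree.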

It is important to note that the future forecasts will most likely be less accurate, because some of the predictors, such as the unemployment rate and interest rates, fall outside their historical ranges. This uncertainty is also visible in the individual models' prediction intervals, which are not plotted here. However, compared to pure machine learning models, which suffer from the same problem, the time series models are likely more accurate and more robust to these types of structural changes.

If you liked this post, be sure to follow me on Twitter for updates about new blog posts like this!

The R code used in the analysis can be found here, together with the code for the machine learning models from the previous post.

*The predictors PE and TR_CAPE have been excluded from all models except ARIMA, since ARIMA seemed to handle the multicollinearity caused by them better than the other models. All the other predictors from the last post were used.*

*Since nnetar and Prophet make different distributional forecasts than the rest of the models, we cannot compute the prediction intervals as easily for the combination model, so they have been left out of the plot with the combination forecast.*