How to Win Friends and Influence Economists: Thoughts on Nowcasting with Random Forests
Welcome to another edition of “In the Minds of Our Analysts.”
At System2, we foster a culture of encouraging our team to express their thoughts, investigate, write them down, and share their perspectives on various topics. This series provides a space for our analysts to showcase their insights.
All opinions expressed by System2 employees and their guests are solely their own and do not reflect the opinions of System2. This post is for informational purposes only and should not be relied upon as a basis for investment decisions. Clients of System2 may maintain positions in the securities discussed in this post.
Today’s post was written by Seth Leonard.
The team at System2 has a range of skills from computer science to accounting to, believe it or not, weightlifting. My own niche happens to be economics, and though I’ve worked in the private sector since the completion of my PhD, I try to stay loosely in touch with the academic and policy side of the profession, which is exactly what I’ve been doing for the past two weeks. I was fortunate enough to be invited to the 2024 ECONDAT conferences hosted by King’s College and the Bank of England in London. Then, in Vienna, I presented at the 6th Workshop on High-Dimensional Time Series in Macroeconomics and Finance. And there would have been a third conference to report on this week in Washington, DC, had Amtrak been running.
The King's College event in London saw a host of interesting papers, with ML and LLMs featuring heavily, including work on the relationship between monetary policy surprises and markets, using LLMs to model economic agents, and the risks of trusting too much in the new AI gods. Vienna also saw its share of compelling topics, including some very forward-looking work using neural networks in macroeconomics and, of course, my own piece (with Philipp Wegmuller of SECO) on nowcasting with random forests (including a few tweaks to the standard random forest model). The presentation made four main points:
1. For models with smaller data (i.e., not LLMs trained on the whole of the web, but standard monthly or quarterly time series), data quality (including processing) still matters. The latest ML isn’t alchemy; you still need good inputs.
2. When dealing with mixed frequencies and incomplete data, period-to-date aggregation is simple, lets us use any of the standard uniform-frequency models, and works pretty well.
3. When using random forests, we can accommodate missing data by making the observations used at each terminal node, or leaf, conditional on the data that are available.
4. Allowing for a linear relationship between splitting variables and the left-hand side (LHS) variable seems like a good idea, but the results are underwhelming.
Below I’ve added a little more color on each of these topics.
Data Quality
“Garbage in, garbage out” is a popular phrase in statistics and econometrics for good reason. Even the cleverest ML model won’t be able to generate meaningful insights from randomness, though it might do some overfitting in-sample and trick us into thinking it’s on to something. Data quality goes beyond the obvious issue of having meaningful inputs. How we process the data is crucial. This ties back to the idea of stationarity in time series econometrics.
Skipping formal definitions, what we mean by stationarity is that one subset of the data will be representative of another subset. This is important because all models are trained on historical observations. However, if historical observations do not indicate how variables behave in the future, the model is useless. As an example, only one of the processes below is stationary. For that series, a model fit on sample A could produce reasonable forecasts of the data in B. The same is not true of any of the other series; before fitting a model, we’d have to transform the data to make it stationary. Finding the right transformation is central to building an effective forecast.
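To make this concrete, here’s a minimal sketch in Python (simulated data only, using the ADF test from statsmodels; this is not from our paper): a random walk looks non-stationary in levels but passes a stationarity test once we take first differences.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Simulate a random walk: non-stationary in levels, stationary in differences.
rng = np.random.default_rng(0)
random_walk = pd.Series(np.cumsum(rng.normal(size=500)))

candidates = {
    "level": random_walk,
    "first difference": random_walk.diff().dropna(),
}
for name, series in candidates.items():
    p_value = adfuller(series)[1]  # low p-value -> reject a unit root (stationary)
    print(f"{name}: ADF p-value = {p_value:.3f}")
```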
Mixed Frequency
There are many methods for handling mixed-frequency data. One of the papers, from Oxford mathematician Sam Cohen, presented a simple (conceptually, at least) approach using signature methods. Our approach (which is common in the alternative data space) is far less clever but simple and flexible. We use period-to-date aggregations. For example, suppose we want to predict inflation in May and we have oil spot prices through the 20th. We’d simply take the average price over the first 20 days of the month and do the same in the historical data (i.e., April 1 through 20). An alternative would be to use complete data for previous months. However, in that case your forecast accuracy will be misleading; you’ll be using more information in historical periods than you have in the forecast period. It might not sound that sophisticated (it’s not), but it is effective. However, be sure to do any transformation (differences, log differences/percent change, seasonal adjustment) after aggregation, not before!
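To show what period-to-date aggregation looks like in practice, here’s a rough pandas sketch (the daily price series and the cutoff are hypothetical, purely for illustration):

```python
import numpy as np
import pandas as pd

def period_to_date_mean(daily: pd.Series, cutoff_day: int) -> pd.Series:
    # Average the first `cutoff_day` days of every month, so each historical
    # month uses exactly as much information as the incomplete current month.
    subset = daily[daily.index.day <= cutoff_day]
    return subset.groupby(subset.index.to_period("M")).mean()

# Hypothetical daily oil spot prices, observed only through May 20
idx = pd.date_range("2023-01-01", "2024-05-20", freq="D")
prices = pd.Series(80 + np.random.default_rng(1).normal(size=len(idx)).cumsum(), index=idx)

cutoff_day = prices.index[-1].day        # 20: days observed in the current month
monthly = period_to_date_mean(prices, cutoff_day)
monthly_growth = np.log(monthly).diff()  # transform AFTER aggregation, not before
print(monthly_growth.tail())
```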
Missing Observations
The standard Python and R random forest packages don’t accept missing observations. One can impute missing values, but typically this isn’t a great idea; we’re again telling the model we have more information than we, in fact, do. Even if point estimates aren’t that misleading, forecast accuracy will be. Our solution is twofold. First, we re-estimate the model conditional on the data that are available to construct our forecast. This rules out computationally heavy models (since we re-estimate each period), but with a few hundred series, random forests aren’t too burdensome. This means our data are “square” at the end, or tail: there are no missing observations in the latest period because we simply drop any series that hasn’t been reported yet.
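In code, making the tail “square” is nothing more than dropping the columns that haven’t reported yet before re-estimating. A hypothetical pandas sketch (not our production code):

```python
import pandas as pd

def square_at_tail(X: pd.DataFrame) -> pd.DataFrame:
    # Keep only the predictors observed in the latest period, so the row we
    # forecast from has no missing values; the model is then re-estimated on
    # this reduced panel each period.
    observed_now = X.columns[X.iloc[-1].notna()]
    return X[observed_now]
```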
However, different series start at different times, which means we may have missing observations at the beginning of the data. Our solution is to make estimates at each node conditional on the data available to reach that node. Recall that a random forest estimates our LHS (target) variable by splitting on features (right-hand side variables). Our final estimate averages all the historical LHS observations whose features land in the same node, following the model’s splitting rules, as the features observed in the forecast period.
In the above illustration, green nodes use every row of data. However, if our model splits on the Ascential Price Index, we only use a subset of rows. Thus, our average LHS value at the yellow node will only use rows in which these data are observed. And if we use Linkup's new job postings, we have even fewer rows over which to take average LHS values. This flexible approach allows us to incorporate missing data without imputing anything.
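As a stylized sketch of what a single tree path does under this rule (feature names and thresholds are hypothetical, and this simplifies the paper’s actual implementation):

```python
import pandas as pd

def conditional_node_mean(y: pd.Series, X: pd.DataFrame, path: list) -> float:
    # Average the target over training rows that (a) observe every splitting
    # variable on the path to this node and (b) satisfy each split, so rows
    # with missing values simply drop out instead of being imputed.
    mask = pd.Series(True, index=X.index)
    for feature, threshold, go_left in path:  # e.g. ("oil_ptd", 82.3, True)
        observed = X[feature].notna()
        side = X[feature] <= threshold if go_left else X[feature] > threshold
        mask &= observed & side
    return y[mask].mean()
```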
Regression Leaves
Our second tweak to the standard random forest model was to allow for a linear relationship between splitting variables and the LHS variable. As mentioned above, our random forest estimate is just the mean of LHS observations at a given node. An implication is that the model can’t make a prediction outside of what it’s already seen (because we’re just averaging past observations). Our thought was that allowing for a simple OLS regression at each node might allow the model to extrapolate a bit more. Here’s what it looks like:
As illustrated, at each split we regress the LHS variable on the splitting variable. Since our simple linear model is just y = a + bx, the standard model is the special case b = 0. Unfortunately for our paper, while the generalization is not worse in any way, we didn’t get much mileage out of linear nodes when it comes to predicting inflation.
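For concreteness, a regression leaf can be sketched like this in numpy (a hypothetical simplification, not our actual estimator; the fallback rule is just one sensible choice):

```python
import numpy as np

def regression_leaf_predict(y_leaf: np.ndarray, x_leaf: np.ndarray, x_new: float) -> float:
    # Fit y = a + b * x on the observations in the leaf, where x is the node's
    # splitting variable; b = 0 recovers the standard mean-of-the-leaf forecast.
    if len(y_leaf) < 2 or np.isclose(x_leaf.std(), 0.0):
        return float(y_leaf.mean())           # fall back to the standard leaf mean
    b, a = np.polyfit(x_leaf, y_leaf, deg=1)  # OLS slope (b) and intercept (a)
    return a + b * x_new
```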
In Summary
What we do at System2 is different from academic publishing in many ways. The biggest is probably turnaround. One of the Vienna presenters commented, “We’ve only been working on this project for four months, so it’s brand new.” But System2’s typical turnaround is within the week, at least for an initial analysis. It’s constructive to keep an eye on what’s happening in the academic world and, hopefully, to share some insights from our own work.