Using Alternative Data for Systematic Investing
System2 participated in (and won) the Eagle Alpha Next Level Hackathon, blending quantitative data from six different alternative (and traditional) data vendors to evaluate potential gains from using alternative data for systematic investing. Quant strategies are a departure from our regular work of answering questions about company fundamentals, but part of doing data science on demand is being flexible, so here’s what we did.
Data vendors ranged from the new-to-market (NewMark Risk) to the very established (S&P Global). Each vendor provided us with a range of daily data at the ticker level. Categories included fund flows (EPFR, S&P Global), corporate bond spreads and bond-implied asset pricing (S&P Global), options-implied pricing and indicators (NewMark Risk), and ML-derived news, event, and report analysis (Brain, eventVestor, Helios). From these datasets we worked with the vendors to narrow the selection down to around 50 series with implications for excess returns at a one- to two-week horizon.
System2's Team PandaGon opted to focus on excess returns as we were tasked with ticker-specific analysis; returns against the S&P 500 strip out broader market trends that were not of interest for the data we worked with. Similarly, our one-to-two-week investment horizon was motivated by the fact that we had daily data. Intra-day strategies would not be feasible, and a reasonable investment horizon means we’re less worried about alpha decay due to timing or trading costs due to high turnover.
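As a rough illustration of the target variable, the sketch below computes forward excess returns from daily closing prices. The function names, the five-trading-day horizon, and the use of simple (not log) returns are assumptions made for the example; the post does not specify how the returns were constructed.

```python
def forward_return(prices, t, horizon):
    """Simple return from day t to day t + horizon."""
    return prices[t + horizon] / prices[t] - 1.0


def forward_excess_returns(stock_prices, benchmark_prices, horizon=5):
    """Stock return minus benchmark return over the same forward window.

    Both inputs are daily closing prices aligned on the same dates.
    horizon=5 approximates one trading week (an assumption).
    """
    if len(stock_prices) != len(benchmark_prices):
        raise ValueError("price series must be aligned and of equal length")
    last_start = len(stock_prices) - horizon
    return [
        forward_return(stock_prices, t, horizon)
        - forward_return(benchmark_prices, t, horizon)
        for t in range(last_start)
    ]


# Made-up prices: the stock gains 10% over the week, the benchmark 5%.
stock = [100, 101, 102, 103, 104, 110]
bench = [100, 100, 100, 100, 100, 105]
excess = forward_excess_returns(stock, bench, horizon=5)
```

Subtracting the benchmark return over the same window is what removes the broad market move, leaving only the ticker-specific component.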
The Model
The question was then how to derive quantitative insights from the data. One hears about machine learning ad nauseam, but that’s a generic term. Neural networks, the typical workhorse of ML models, require large amounts of training data and are prone to overfitting. For that reason, System2's team opted for Bayesian additive regression trees. These models are conceptually simple yet powerful, and they are more robust to the overfitting that comes with limited training data and many right-hand-side variables, an important consideration with only 1-2k training data points.
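To make the "additive trees" idea concrete, here is a minimal sketch of a model that predicts with a sum of many small trees. It is not BART: real BART places priors on the trees and samples them by MCMC, whereas this example swaps in a plain greedy fit on residuals with shrinkage (boosting-style) and uses one-split trees. All function names and parameters are hypothetical.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Find the one-feature threshold split that minimises squared error."""
    best = None
    for j in range(len(xs[0])):
        values = sorted({row[j] for row in xs})
        # Candidate thresholds sit midway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(xs, residuals) if row[j] <= thr]
            right = [r for row, r in zip(xs, residuals) if row[j] > thr]
            left_mean, right_mean = mean(left), mean(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, left_mean, right_mean)
    if best is None:  # every feature is constant: fall back to the mean
        m = mean(residuals)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value


def fit_sum_of_stumps(xs, ys, n_trees=50, shrinkage=0.1):
    """Each small tree explains a shrunken share of what is still unexplained."""
    base = mean(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + shrinkage * predict_stump(stump, row)
                 for p, row in zip(preds, xs)]
    return base, stumps, shrinkage


def predict(model, row):
    base, stumps, shrinkage = model
    return base + shrinkage * sum(predict_stump(s, row) for s in stumps)


# Toy data: the response flips sign when the single feature crosses 1.5.
model = fit_sum_of_stumps([[0.0], [1.0], [2.0], [3.0]], [-1.0, -1.0, 1.0, 1.0])
```

Keeping each tree weak and letting it contribute only a small amount is the shared intuition. BART reaches a similar effect through its priors, and it also yields a posterior distribution over forecasts, which this sketch does not.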
The Results
In true hackathon style, System2's team had two days to bring the data together from AWS S3, SFTP, and thousands of CSV/TXT files into a convincing backtested model. Our toolbox was whatever got the job done quickly: Python, R, SQL, and command line tools were all part of the solution. We trained the model on two subsets of the data using observations through 2021. The first subset was restricted to series with a history back to 2010 (a longer training set with fewer series is less prone to overfitting); the second subset used all available data. Unsurprisingly, using all series meant directional results tended to be better, but overfitting led to higher forecast volatility.
Due to reasonable vendor restrictions, we narrowed our selection to 50 (randomly chosen) US tickers, 30 of which had good overlap between vendors. For each of these 30 tickers, we forecasted excess returns at one- and two-week horizons. Given the nature of the data (and the project itself), our one-week horizon forecasts performed best out of sample, so we put together a simple long-only portfolio: invest evenly in any of the 30 candidate stocks with a positive expected (forecasted) return.
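The long-only rule described above is simple enough to sketch in a few lines. The tickers and numbers here are made up, and the function names are hypothetical; rebalancing frequency and trading costs are left out.

```python
def long_only_weights(forecasts):
    """Equal-weight every ticker whose forecasted excess return is positive.

    forecasts: dict mapping ticker -> forecasted excess return.
    Returns an empty dict (stay in cash) when no forecast is positive.
    """
    longs = [ticker for ticker, f in forecasts.items() if f > 0]
    if not longs:
        return {}
    weight = 1.0 / len(longs)
    return {ticker: weight for ticker in longs}


def portfolio_return(weights, realised):
    """Weighted sum of realised returns for the names actually held."""
    return sum(w * realised[ticker] for ticker, w in weights.items())


# Placeholder tickers and forecasts, for illustration only.
forecasts = {"AAA": 0.02, "BBB": -0.01, "CCC": 0.005}
weights = long_only_weights(forecasts)
realised = {"AAA": 0.04, "BBB": 0.10, "CCC": -0.02}
week_return = portfolio_return(weights, realised)
```

Because only the sign of the forecast matters, the rule depends on directional accuracy more than on getting the magnitude right.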
The results were interesting: a portfolio that not only outperforms the S&P 500, but is in fact negatively correlated with the broader market (this latter point is probably an artifact of the randomly selected tickers, but merits further investigation). Because System2's team holds only around 15 names at any one time, our portfolio is concentrated (and not by selection, since the 30 candidate stocks were random) and thus has high volatility. Still, with a Sharpe ratio of 1.5 and negative correlation with the market, it makes a powerful case for alternative data.
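For readers who want to reproduce the two summary statistics on their own backtest, here is a minimal sketch. It assumes weekly portfolio returns, a zero risk-free rate, and annualisation by the square root of 52; the post does not state which conventions were used, so treat these as illustrative defaults.

```python
from math import sqrt
from statistics import mean, stdev


def annualised_sharpe(returns, periods_per_year=52, risk_free_per_period=0.0):
    """Mean excess return over its sample standard deviation, annualised."""
    excess = [r - risk_free_per_period for r in returns]
    return mean(excess) / stdev(excess) * sqrt(periods_per_year)


def correlation(xs, ys):
    """Pearson correlation between two equally long return series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up weekly returns for a portfolio and the market.
portfolio = [0.01, 0.03, -0.01, 0.02]
market = [-0.01, -0.02, 0.01, 0.00]
sharpe = annualised_sharpe(portfolio)
rho = correlation(portfolio, market)
```

A negative `rho` means the portfolio tended to rise in weeks when the market fell, which is the behaviour the backtest showed.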