Fitting to Noise or Nothing At All
Machine Learning in Markets
Academic finance literature naively applying machine learning (ML) and artificial neural network (ANN) techniques to market price prediction is a dumb farce. While this probably won’t surprise anyone who has done a paper replication in the past 6+ years, despite all of the advancements in algorithms and hardware, and despite all of the new domains ANNs have conquered, financial academics still insist on throwing feces at the wall. In fact, their simian proclivities might be getting worse.
A typical situation goes something like this:
“We noticed [some machine learning technique] has had success in [something unrelated to finance]. So we take [a small/arbitrary set of securities] over [a small/arbitrary window of time] and apply a [random, obnoxiously large, or empirically unjustified feature space] to said technique to predict price movements of said securities. We show that under [our ad hoc and unreasonable assumptions] said technique can sometimes predict price movements. Publish me please.” 1
Those familiar with the replication crisis and The Garden of Forking Paths should immediately spot the numerous potential “researcher degrees of freedom” that inevitably prove these results not robust. Indeed, all it takes to break most of these papers is adding a few similarly behaved securities or applying the methodology just a few months before or after the paper’s sample period. But these types of failures have been covered in the literature, so for that see Noah’s review of the spurious and the fleeting.
Instead, for this post I’d like to focus on one example where I’ll fire off all the things that make papers like this hopelessly fucked before they even begin. 2
In “Classification-based Financial Markets Prediction using Deep Neural Networks“, the authors (Matthew Dixon, Diego Klabjan, and Jin Hoon Bang) attempt to use deep neural networks to predict short term price movement over a basket of securities traded on the Chicago Mercantile Exchange. However, rather than display prediction accuracy or strategy viability, this paper serves as a warning for what happens when deficiencies in domain knowledge coincide with poor measurement and unreasonable assumptions. 3
It starts with a familiar-sounding abstract:
Deep neural networks (DNNs) are powerful types of artificial neural networks (ANNs) that use several hidden layers. They have recently gained considerable attention in the speech transcription and image recognition community (Krizhevsky et al., 2012) for their superior predictive properties including robustness to overfitting. However their application to algorithmic trading has not been previously researched, partly because of their computational complexity. This paper describes the application of DNNs to predicting financial market movement directions. In particular we describe the configuration and training approach and then demonstrate their application to backtesting a simple trading strategy over 43 different Commodity and FX future mid-prices at 5-minute intervals.
Quick summary of the paper:
- Data period from March 31, 1991 to September 30, 2014 (but only using the most recent 15 years)
- Trains 9,895 features on 25,000 observations, predicts on the next 12,500 observations, then rolls forward in increments of 1,000
- Feature space uses lagged returns, moving averages, and correlations between instruments.
- Predicts -1, 0, or +1 for 5-minute periods, corresponding to whether the mid-price will go down, remain unchanged, or go up.
- Compares prediction accuracy to “white noise” with equal chances of each class.
- Creates ad hoc “trading strategy” with all executions taking place instantly at mid-point, no cost.
- Claims average prediction accuracy of 42% and annualized Sharpe Ratios of 3.29
The claims about accuracy and high Sharpe Ratios don’t withstand any amount of scrutiny. Further, there’s even reason to doubt that the data the authors use is appropriate for the intended application.
Of Noise And Nothing
One of the most immediately suspicious figures in the paper is Table 1:
The classification accuracy for Copper shown in this table is merely the highest value out of ten. However, Figure 2 in the paper shows that the median accuracy for Copper is ~33%, or no better than guessing if all three outcomes were equally likely.
Perhaps most importantly, the other four instruments don’t actually trade. Take the “Transco Zone 6 Natural Gas (Platts Gas Daily) Swing” for example. According to the CME’s website, not a single person is holding a contract for that symbol:
Congratulations on being able to predict something that doesn’t move. I’ll leave it as an exercise for the reader to decide which is better to be proud of: noise or nothing?
Throughout the paper, Dixon et al. compare their classification accuracies to what would be expected if each of the three conditions had an equal probability of occurring. The above example, while extreme, highlights a scenario that the authors overlook: if an instrument rarely trades, you can get far greater accuracy than 33% by picking “no change” for every period. And more generally, even if an instrument trades a lot, if it’s trending in one direction for the majority of the data set, you can still beat “random” by picking a constant model.
This is one reason why investment and trading strategies are often compared to “buy and hold” (constant model) rather than a random model. In order for the authors to argue that their DNN model adds value, they need to compare each instrument to more than just white noise.
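A minimal sketch of the point, using made-up label sequences rather than anything from the paper: a constant model that always predicts the most common class is a much harder benchmark than a uniform coin toss over three classes. The two instruments and their class frequencies below are hypothetical.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical 5-minute direction labels: -1 = down, 0 = unchanged, +1 = up.
illiquid = [0] * 80 + [1] * 10 + [-1] * 10   # rarely trades
trending = [1] * 50 + [0] * 20 + [-1] * 30   # drifts upward

# A uniform random guess over three classes is right 1/3 of the time
# in expectation, whatever the true class frequencies are.
uniform_accuracy = 1 / 3

for name, labels in [("illiquid", illiquid), ("trending", trending)]:
    label, accuracy = majority_baseline(labels)
    print(f"{name}: always predict {label:+d} -> {accuracy:.0%} "
          f"(uniform guess: {uniform_accuracy:.0%})")
```

Any reported accuracy should clear the majority-class number for that instrument over the same test window before it counts as evidence of anything.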
The authors then try to “demonstrate the application” of their DNN by backtesting a trading strategy based on its -1, 0, +1 predictions. Let’s ignore the fact that they only focus on the top 5 performing instruments and skip over the massive losses in other instruments.
Instead let’s focus on a few of the assumptions:
- Transaction costs are ignored
- No slippage: the order always gets filled immediately at the mid-price of the current 5-min bar
They justify using these assumptions by saying (emphasis mine):
These assumptions, especially those concerning trade execution and absence of live simulation in the backtesting environment are of course inadequate to demonstrate alpha generation capabilities of the DNN based strategy but serve as a starting point for commercial application of this research.
This is backwards. The correct starting point is that you’re either lifting the other side of the order book, or the price has traded through your passive order. A few pages later, the paper shows exactly why:
It’s not a coincidence that from top to bottom is also roughly the order of least liquid to most liquid. By allowing yourself to instantly trade at midpoints in illiquid instruments, you’re giving yourself free edge that doesn’t exist in reality. Without midpoint trading, none of those strategies would make money.
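Here is a toy illustration of how much free edge a midpoint fill hands you. The prices, the positions, and the half-spread are all invented, and the cost model (pay half the spread on every unit traded) is a simplifying assumption, not the paper's backtester.

```python
def backtest_pnl(mids, positions, half_spread=0.0):
    """P&L of holding positions[i] from mids[i] to mids[i + 1].

    Every unit of position change pays half_spread, which approximates
    crossing the spread instead of filling at the mid-price.
    The final position is marked at the last mid and never closed out.
    """
    pnl = 0.0
    previous = 0
    for i in range(len(mids) - 1):
        pnl -= abs(positions[i] - previous) * half_spread  # execution cost
        pnl += positions[i] * (mids[i + 1] - mids[i])      # mark-to-mid gain
        previous = positions[i]
    return pnl


# Hypothetical illiquid contract whose mid bounces inside a wide spread.
mids = [100.0, 100.5, 100.0, 100.5, 100.0]
positions = [1, -1, 1, -1]  # a "perfect" -1/+1 predictor

print("filled at mid:       ", backtest_pnl(mids, positions))
print("crossing the spread: ", backtest_pnl(mids, positions, half_spread=0.5))
```

Even with perfect foresight, the strategy in this toy example goes from a profit of 2.0 to a loss of 1.5 once it has to pay for its fills. The wider the spread relative to the 5-minute move, the bigger the gap.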
Sketchy Data and/or Formatting (updated)
Yet another potential source of error and concern comes from the data itself. The paper mentions that a window of 25,000 5-minute observation periods corresponds to approximately 260 days. This implies that they’re using 8 hour trading days. But which 8 hours and why? Regular hours on the NYSE have been from 9:30 AM to 4:00 PM EST since 1985. The CME’s electronic market (Globex) has been active 20+ hours a day since the early 90s, but it wasn’t made “open access” until 2000. Moreover, up until 2004 the majority of the volume was still done in the trading pits, which even then were not open for 8 hours a day.
Do the midpoints used in the data set include the trading pits? Or did it start that way and then switch?
Given the major market structure changes covered in this data, there’s a strong scent of arbitrariness here.
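The arithmetic behind the 8-hour figure is quick to check. The 25,000 bars and 260 days are the paper's numbers; the 20-hour session length is my round-number stand-in for "20+ hours a day".

```python
BAR_MINUTES = 5

bars_per_window = 25_000   # training window size from the paper
days_per_window = 260      # the paper's stated equivalent in days

# Hours of data per day implied by the paper's own numbers.
implied_hours_per_day = bars_per_window * BAR_MINUTES / 60 / days_per_window

# Assumption: a 20-hour electronic session, for comparison.
session_hours = 20
bars_per_full_session = session_hours * 60 // BAR_MINUTES
days_if_full_session = bars_per_window / bars_per_full_session

print(f"implied hours per day: {implied_hours_per_day:.2f}")
print(f"5-minute bars in a {session_hours}-hour session: {bars_per_full_session}")
print(f"days covered by 25,000 bars at that rate: {days_if_full_session:.0f}")
```

If the authors had used a full 20-hour session, 25,000 bars would cover roughly 104 days, not 260.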
Section Update 9/3/2017
After receiving some thoughtful emails and comments, I want to add just a few more words to this section. I understand that academic researchers often face considerable limitations with respect to market data, and I don’t criticize the authors’ use of old or low resolution data. The problem is that it’s neither clear what data they actually used nor why they chose to use it. 4
This is the paper’s entire description of the data selection and preparation methodology:
Our historical dataset contains 5 minute mid-prices for 43 CME listed commodity and FX futures from March 31st 1991 to September 30th, 2014. We use the most recent fifteen years of data because the previous period is less liquid for some of the symbols, resulting in long sections of 5 minute candles with no price movement. Each feature is normalized by subtracting the mean and dividing by the standard deviation. The training set consists of 25,000 consecutive observations and the test set consists of the next 12,500 observations. As described in Section 6, these sets are rolled forward ten times from the start of the liquid observation period, in 1000 observation period increments, until the final 37,500 observations from March 31st, 2005 until the end of the dataset.
Here “5 minute mid-prices” is ambiguous. At a given instant in time, the mid-price is typically understood to be half-way between the highest bid and lowest offer, although another popular measure weights the point by the relative volumes of the bid and offer. When we talk about mid-prices over a period of time this becomes less clear. 5 minute mid-prices could mean an instantaneous snapshot every 5 minutes. It could mean something like the average mid-price weighted by time or some sort of volume over those minutes. But then the authors mention “5 minute candles” which are based on traded prices, so it could also be the mid-price of the candles. With no definition given, and no reference to a standard or to a data vendor whose methodology can be looked up, nobody could reproduce the experiment from the methods described in this paper, which makes it useless to everyone, including other academic researchers.
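To show that these readings are not interchangeable, here is a small sketch that computes four plausible "5 minute mid-prices" for the same interval. The quotes, sizes, and candle values are invented, and the size-weighted formula is just one common convention. None of this is taken from the paper, which is precisely the problem.

```python
def quote_mid(bid, ask):
    """Plain mid-price: half-way between best bid and best offer."""
    return (bid + ask) / 2.0


def size_weighted_mid(bid, ask, bid_size, ask_size):
    """One common convention: weight each side's price by the opposite size."""
    return (bid * ask_size + ask * bid_size) / (bid_size + ask_size)


def time_weighted_mid(quotes):
    """Average mid over the interval, weighted by seconds each quote stood.

    quotes is a list of (seconds_in_effect, bid, ask) tuples.
    """
    total_seconds = sum(seconds for seconds, _, _ in quotes)
    weighted_sum = sum(seconds * quote_mid(bid, ask) for seconds, bid, ask in quotes)
    return weighted_sum / total_seconds


def candle_mid(high, low):
    """Mid-point of a traded-price candle."""
    return (high + low) / 2.0


# Hypothetical 5-minute interval: one quote for 240 s, then another for 60 s.
quotes = [(240, 99.0, 101.0), (60, 100.0, 102.0)]
_, last_bid, last_ask = quotes[-1]

snapshot = quote_mid(last_bid, last_ask)
weighted = size_weighted_mid(last_bid, last_ask, bid_size=30, ask_size=10)
time_avg = time_weighted_mid(quotes)
candle = candle_mid(high=101.0, low=99.5)  # hypothetical traded high/low

for name, value in [("snapshot at bar close", snapshot),
                    ("size-weighted snapshot", weighted),
                    ("time-weighted average", time_avg),
                    ("candle mid-point", candle)]:
    print(f"{name}: {value:.2f}")
```

Four definitions give four different prices for the same five minutes, and a -1/0/+1 label built on one of them need not agree with a label built on another.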
And for clarity: when I previously discussed the authors’ peculiar choice of trading session duration, I mentioned a bunch of stuff without stating its biggest implication (though it’s kinda sorta there). Because futures trade over 20 hours a day, and because the paper’s chosen time frame doesn’t explicitly or obviously match any particular trading pit hours, the authors must have had data for more than 8 hours a day and chosen not to use it. They had no problem mentioning that they threw out the first 8 years of data, so I’m not sure why they didn’t think to mention that they were also throwing out data every single day during the relevant analysis period. Regardless, this is very very bad. How did no one catch this?
Stop Flinging Poo
The willingness to feature-stuff NNs with any random unstructured caca is never too far behind the latest techniques that make it harder to overfit. The Dixon et al. model trains 9,895 features on just 25,000 samples. My personal opinion is that’s absurd. I don’t think a specified model is going to appear out of a septic tank of lagged returns and moving averages. Impose some structure.