Fitting to Noise or Nothing At All: Machine Learning in Markets

Derp Learning

Academic finance literature naively applying machine learning (ML) and artificial neural network (ANN) techniques to market price prediction is a dumb farce. This probably won’t surprise anyone who has done a paper replication in the past 6+ years. Yet despite all of the advancements in algorithms and hardware, and despite all of the new domains ANNs have conquered, financial academics still insist on throwing feces at the wall. In fact, their simian proclivities might be getting worse.

A typical situation goes something like this:

“We noticed [some machine learning technique] has had success in [something unrelated to finance]. So we take [a small/arbitrary set of securities] over [a small/arbitrary window of time] and apply a [random, obnoxiously large, or empirically unjustified feature space] to said technique to predict price movements of said securities. We show that under [our ad hoc and unreasonable assumptions] said technique can sometimes predict price movements. Publish me please.”[1]

Those familiar with the replication crisis and The Garden of Forking Paths should immediately spot the numerous potential “researcher degrees of freedom” that inevitably leave these results fragile. Indeed, all it takes to break most of these papers is adding a few similarly behaved securities or applying the methodology just a few months before or after the paper’s sample period. But these types of failures have been covered in the literature, so for that see Noah’s review of the spurious and the fleeting.

Instead, for this post I’d like to focus on one example where I’ll fire off all the things that make papers like this hopelessly fucked before they even begin.[2]

The Setup

In “Classification-based Financial Markets Prediction using Deep Neural Networks”, the authors (Matthew Dixon, Diego Klabjan, and Jin Hoon Bang) attempt to use deep neural networks to predict short-term price movement over a basket of securities traded on the Chicago Mercantile Exchange. However, rather than display prediction accuracy or strategy viability, this paper serves as a warning for what happens when deficiencies in domain knowledge coincide with poor measurement and unreasonable assumptions.[3]

It starts with a familiar-sounding abstract:

Deep neural networks (DNNs) are powerful types of artificial neural networks (ANNs) that use several hidden layers. They have recently gained considerable attention in the speech transcription and image recognition community (Krizhevsky et al., 2012) for their superior predictive properties including robustness to over fitting. However their application to algorithmic trading has not been previously researched, partly because of their computational complexity. This paper describes the application of DNNs to predicting financial market movement directions. In particular we describe the configuration and training approach and then demonstrate their application to backtesting a simple trading strategy over 43 different Commodity and FX future mid-prices at 5-minute intervals.

Quick summary of the paper:

  • Data period from March 31, 1991 to September 30, 2014 (but only using most recent 15 years)
  • Trains on 25,000 observations with 9,895 features, predicts on the next 12,500 observations, then rolls the window forward in increments of 1,000
  • Feature space uses lagged returns, moving averages, and correlations between instruments.
  • Predicts -1, 0, or +1 for 5-minute periods, corresponding to whether the mid-price will go down, remain unchanged, or go up.
  • Compares prediction accuracy to “white noise” with equal chances of each class.
  • Creates ad hoc “trading strategy” with all executions taking place instantly at mid-point, no cost.
  • Claims average prediction accuracy of 42% and annualized Sharpe Ratios of 3.29
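The train/predict/roll scheme in the summary can be sketched as a set of index windows. This is a minimal sketch of one reading of the numbers above, not the authors’ code: the function name `walk_forward_windows` and the 40,000-observation toy series are hypothetical.

```python
from typing import Iterator, Tuple


def walk_forward_windows(
    n_obs: int,
    train_len: int = 25_000,
    test_len: int = 12_500,
    step: int = 1_000,
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges for a rolling walk-forward scheme.

    Each test range starts exactly where its train range ends, so no
    test observation is ever used for fitting in the same window.
    """
    start = 0
    while start + train_len + test_len <= n_obs:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += step


# Toy series length (an assumption, chosen only to keep the example small).
windows = list(walk_forward_windows(40_000))

# How many observations do two consecutive test windows have in common?
shared_test_obs = len(set(windows[0][1]) & set(windows[1][1]))
```

Under this reading, a 12,500-observation test window stepped forward by 1,000 means consecutive test windows share 11,500 observations, so the per-window accuracies are far from independent measurements.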

The claims about accuracy and high Sharpe Ratios don’t withstand any amount of scrutiny. Further, there’s even reason to doubt that the data the authors use is appropriate for the intended application.

Of Noise And Nothing

One of the most immediately suspicious figures in the paper is Table 1:

Table 1

The classification accuracy for Copper shown in this table is merely the highest value out of ten. However, Figure 2 in the paper shows that the median accuracy for Copper is ~33%, or no better than guessing if all three outcomes were equally likely.

Perhaps most importantly, the other four instruments don’t actually trade. Take the “Transco Zone 6 Natural Gas (Platts Gas Daily) Swing” for example. According to the CME’s website, not a single person is holding a contract for that symbol:

Congratulations on being able to predict something that doesn’t move. I’ll leave it as an exercise for the reader to decide which is better to be proud of: noise or nothing?

Wrong Metrics

Throughout the paper, Dixon et al. compare their classification accuracies to what would be expected if each of the three conditions had an equal probability of occurring. The above example, while extreme, highlights a scenario that the authors overlook: if an instrument rarely trades, you can get far greater accuracy than 33% by picking “no change” for every period. And more generally, even if an instrument trades a lot, if it’s trending in one direction for the majority of the data set, you can still beat “random” by picking a constant model.

This is one reason why investment and trading strategies are often compared to “buy and hold” (constant model) rather than a random model. In order for the authors to argue that their DNN model adds value, they need to compare each instrument to more than just white noise.
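A minimal sketch of why “better than 33%” is the wrong bar. The label distribution below is made up for illustration (a hypothetical illiquid instrument whose mid-price is unchanged in 80% of bars), and the helper names are not from the paper.

```python
import random
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy of a constant model that always predicts the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def uniform_guess_accuracy(labels, seed=0):
    """Accuracy of guessing -1, 0, or +1 with equal probability ("white noise")."""
    rng = random.Random(seed)
    hits = sum(1 for y in labels if rng.choice((-1, 0, 1)) == y)
    return hits / len(labels)


# Hypothetical labels: 80% "no change", 10% up, 10% down.
labels = [0] * 8_000 + [1] * 1_000 + [-1] * 1_000

constant_acc = majority_class_accuracy(labels)  # always predict 0
random_acc = uniform_guess_accuracy(labels)     # roughly one third
```

Here the constant “no change” model scores 80% without learning anything, while the white-noise benchmark sits near 33%. Any model accuracy between those two numbers beats the paper’s benchmark and still loses to doing nothing.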

Strategy Backtesting

The authors then try to “demonstrate the application” of their DNN by backtesting a trading strategy based on its -1, 0, +1 predictions. Let’s ignore the fact that they only focus on the top 5 performing instruments and skip over the massive losses in other instruments.

Instead let’s focus on a few of the assumptions:

  • Transaction costs are ignored
  • No slippage: the order always gets filled immediately at the mid-price of the current 5-min bar

They justify using these assumptions by saying (emphasis mine):

These assumptions, especially those concerning trade execution and absence of live simulation in the backtesting environment are of course inadequate to demonstrate alpha generation capabilities of the DNN based strategy but serve as a starting point for commercial application of this research.

This is backwards. The correct starting point is to assume that you’re either lifting the other side of the order book, or that the price has traded through your passive order. Table 2, a few pages later, shows exactly why:

Table 2

It’s not a coincidence that the ordering from top to bottom is also roughly least liquid to most liquid. By allowing yourself to instantly trade at midpoints in illiquid instruments, you’re giving yourself free edge that doesn’t exist in reality. Without midpoint trading, none of those strategies would make money.
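To see how much edge midpoint fills hand out, here is a toy backtest sketch. Everything in it is an assumption for illustration: prices are in integer ticks, the mid bounces one tick per bar, the spread is two ticks wide, and the “strategy” has perfect foresight. `backtest_pnl` is a hypothetical helper, not the paper’s backtester.

```python
def backtest_pnl(mids, positions, half_spread=0):
    """PnL (in ticks) of holding positions[t] from bar t to bar t + 1.

    Every unit of position change pays `half_spread`, i.e. the order
    crosses from the mid to the far side of the book. With
    half_spread=0 this reduces to free fills at the mid-price.
    """
    pnl = 0
    prev = 0
    for t in range(len(mids) - 1):
        pos = positions[t]
        pnl -= abs(pos - prev) * half_spread   # cost of changing position
        pnl += pos * (mids[t + 1] - mids[t])   # mark-to-mid gain over the bar
        prev = pos
    pnl -= abs(prev) * half_spread             # flatten at the end
    return pnl


# Mid-price bouncing by one tick, traded with perfect foresight.
mids = [1000, 1001, 1000, 1001, 1000]
positions = [1, -1, 1, -1]

pnl_at_mid = backtest_pnl(mids, positions, half_spread=0)
pnl_crossing = backtest_pnl(mids, positions, half_spread=1)
```

Even with perfect predictions, the same trades go from +4 ticks at the mid to -4 ticks once each fill pays half the spread. The wider the spread relative to the 5-minute move, which is what illiquid instruments look like, the more of the backtested profit is just the free fill.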

Sketchy Data and/or Formatting

Yet another potential source of error and concern comes from the data itself. The paper mentions that a window of 25,000 5-minute observation periods corresponds to approximately 260 days. This implies that they’re using 8-hour trading days. But which 8 hours and why? Regular hours on the NYSE have been from 9:30 AM to 4:00 PM EST since 1985. The CME’s electronic market (Globex) has been active 20+ hours a day since the early 90s, but it wasn’t made “open access” until 2000. Moreover, up until 2004 the majority of the volume was still done in the trading pits, which even then were not open for 8 hours a day.
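The 8-hour figure is just arithmetic on the paper’s own numbers, and it is worth seeing how poorly it lines up with the sessions mentioned above. The 20-hour Globex day below is only the lower bound implied by “20+ hours”, used here as an assumption.

```python
OBS_PER_WINDOW = 25_000   # 5-minute observations per training window (from the paper)
DAYS_PER_WINDOW = 260     # "approximately 260 days" (from the paper)
BAR_MINUTES = 5

# Implied session length per trading day.
bars_per_day = OBS_PER_WINDOW / DAYS_PER_WINDOW   # about 96 bars
hours_per_day = bars_per_day * BAR_MINUTES / 60   # about 8 hours

# For comparison: 5-minute bars in other candidate sessions.
nyse_bars_per_day = int(6.5 * 60 / BAR_MINUTES)   # 9:30 AM to 4:00 PM
globex_bars_per_day = 20 * 60 // BAR_MINUTES      # assumed 20-hour day

# How many days 25,000 bars would span if the data were a 20-hour session.
days_if_globex = OBS_PER_WINDOW / globex_bars_per_day
```

The implied 96 bars per day matches neither a 78-bar NYSE session nor a 240-bar (or longer) electronic session, under which 25,000 bars would cover only about 104 days.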

Do the midpoints used in the data set include the trading pits? Or did it start that way and then switch?

Given the major market structure changes covered in this data, there’s a strong scent of arbitrariness here.

Stop Flinging Poo

The willingness to feature-stuff NNs with any random unstructured caca is never too far behind the latest techniques that make it harder to overfit. The Dixon et al. model trains 9,895 features on just 25,000 samples. My personal opinion is that’s absurd. That works out to roughly 2.5 observations per feature. I don’t think a well-specified model is going to appear out of a septic tank of lagged returns and moving averages. Impose some structure.


Footnotes

  1. I wish I had to embellish any of that.
  2. Lest my tone betray piety, please note that I too am a sinner in the hands of an angry market god. Years ago we were working on a complex and technically challenging trading model when we hit a wall and couldn’t figure out how to fit all of the puzzle pieces together. But we had a fuzzy idea of what features we wanted and a lot of computing power and high quality data. So we spent the next four embarrassing months coming up with new ways to fail. At one point we had written a custom genetic algorithm to perform high-dimensional clustering while trying to avoid overfitting by concocting the most elaborate and silly fitness function conceived by man. That model eventually went back on the shelf for 5 years.
  3. On Twitter, @carlcarrie’s feed is a gold mine for quant finance and fintech links. When he originally tweeted this paper over a year ago, I didn’t think much of it aside from the universe selection seeming a little strange; at the time I also received no response from the first author when I asked about it.

  • Simon Hughes

    Great article. This covers a lot of the pitfalls of this kind of work. The other common pitfall that isn’t mentioned (and perhaps the authors were smart enough to avoid) is including future data points when normalizing features. When done on prices, it is very easy to produce a model with very high accuracy on the training data that doesn’t predict anything useful on the test data.

    • ZHD

      Thanks Simon. That is an important thing to watch out for. (and why the term “warm-up period” is close to our hearts) Some of the language in the paper wasn’t clear to me even after a couple reads, so I let it slide.

  • PY

    there’s one give-away: “algorithmic trading” ~ if you see this term (and “program trading”) being used as a synonym for quantitative or systematic trading, then you may delete immediately on grounds of zero domain knowledge.

    • ZHD

      I’d say it depends on who the target audience is supposed to be. Now I just say “financial software”

  • Jortiz3

    Great piece you got here. People have screwed up statistics forever. Our brains just aren’t wired for it evolutionarily. Also…

    “The Dixon et al. model trains 9,895 features on just 25,000 samples. My personal opinion is that’s absurd. ”

    Aren’t we being a little too nice here? (Laughs)

  • wtpayne

    I think by now we can conclude that everybody is as good as everybody else when it comes to sophistication of analysis; and that — as usual — the competitive differentiator is all about the data that you have access to. (Which makes me wonder why google aren’t cleaning up by front-running based on company-name searches originating from financial centres ….)

    • ZHD

      haha, I bet it’s hard to recover the right direction from those queries alone.

  • Chris Mesterharm

    Do you have any examples of good ML papers on market prediction? I’ve always assumed that if anyone actually had good results they would not publish. (Or would publish after they made 5 gazillion dollars and mention this fact in the paper.)
