Accountability, Generalizability, and Rigor in Finance Research: Machine Learning in Markets (Part II)

lol peer review
Who exactly are these peers?

Intro: Accountability and Code

You’re looking to buy a car, and not just any car: you want a customized, durable, high-performance machine that no one else has seen before. A young couple enthusiastically responds to your want ad saying that they’ve constructed a masterpiece capable of besting every other car on the street. Their email includes diagrams of the customization and a detailed narrative of their building process, in which they combined parts from three other cars and forged a few new parts of their own. They offer graphs and tables of all sorts of outstanding performance measurements. You’re convinced; this car will be magnificent, and you agree to meet them in a nearby parking lot to make the purchase. But when you arrive, you don’t see the car. All the couple brought is a printed copy of the same files they had emailed you. But you don’t want that. You want to drive the car. Because the car is what matters. You can’t see its performance. You can’t even be sure it exists.

This is more or less the state of affairs I found in published academic finance research applying black box machine learning techniques to predict market price movements: confusing papers describing elaborate data transformations, stringing together increasingly complex learning architectures, and then reporting highly improbable performance metrics without providing any code to verify the results. That journals do not have code review or publishing standards at either the editorial or peer review stages for this line of research is kinda shambolic. Considering that the professional software industry produces “15 – 50 errors per 1000 lines of delivered code,” there’s no reason to believe any of these papers produce accurate results.

To keep things moving along, I’ll refrain from philosophizing over this until later.

(I realize this is a very niche area of finance. There aren’t a tremendous number of these papers getting published. However, one editor I spoke with said there’s been a massive uptick in Derp Learning submissions, and he desk rejects pretty much all of them. Knowing this, we have a couple of opportunities: document the mistakes in the papers that do slip through, and start thinking more systematically about the many levels and causes of failure. Eventually we might be able to help our fellow researchers stop wasting time on this mucho kaka.)

Two Papers, No Cars

During backtesting of trading algorithms, look-ahead bias (or peeking) occurs when an algorithm makes a decision or prediction at a point in time based on information or data that would only be available after that point in time. In other words: it’s easy to predict the future if you already know what’s going to happen. Guarding against this bias is one of the most fundamental tenets of quantitative trading research. But not for these two fetid swamp monsters. Every time I’ve gone back through these papers I’ve found new mistakes and more possibilities for errors. So numerous are the potential failure points that I can’t be exactly sure where the core problems live. But the general sloppiness of the manuscripts and the carelessness about the finer analytic details give me high confidence that the results are bogus.

As of 1/7/2018: Investigations

PLoS ONE has opened investigations into both of the following papers based on the issues I’ve raised. So far one set of authors has admitted to at least one mistake in their published manuscript; however, they deny making any mistakes in their code — which of course they don’t provide. Updates will follow.

Paper: Forecasting East Asian Indices Futures via a Novel Hybrid of Wavelet-PCA Denoising and Artificial Neural Network Models  — Chan Phooi M’ng J, Mehralizadeh M

  • Attempts to predict daily prices of various futures using a combination of filtering and a recurrent neural network
  • Applies a wavelet PCA with a dozen variants to denoise the Open-High-Low-Close series, then uses that as inputs into the recurrent neural network
  • Other inputs into the RNN include technical analysis indicators: “RSI, MACD, MACD Signal, Stochastic Fast %K, Stochastic Slow %K, Stochastic %D, and Ultimate Oscillator calculated by original OHLC signals”
  • To make predictions at each step, “the input series x(t) is denoised open-high-low signals together with technical indicators… while y(t) is the denoised close of the futures time series, which is considered as the target to be predicted.”
  • Creates a trading strategy: “The strategy buys when the next period predicted value (target) is larger than the current market close and sells when the next period predicted value is smaller than the current market close”
    • y*(t + 1) > y(t) : buy
    • y*(t + 1) < y(t) : sell
  • Compares trading strategy performance against a strategy which buys and holds the futures for the length of the evaluation period.
  • Claims annual returns averaging 48.2% for the Nikkei and 34.7% for the Hang Seng

The primary suspect for a potential look-ahead bias here is the input series x(t). It contains technical indicators that are calculated based on the closing price at time t. So even though the target y(t) is the denoised close, the predictor variables already contain information about the closing price. Since the technical indicators must witness the close before they can be used as inputs, this can hardly be considered a “prediction” about a future value. (Similarly, the authors assume the high and low necessarily come before the close, which is often not true.) Further, the prediction logic creates a conundrum for the trading strategy. If the network must witness x(t) in order to get a prediction for y(t), then it must witness x(t+1) in order to get a prediction for y(t+1). This means that the prediction for y(t+1) is based entirely on variables that are only observed after the close at time t. Simulating trades in this manner amounts to going back in time and deciding what to have traded once you know what the future is.
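To make the leak concrete, here is a minimal sketch on a simulated random walk, where by construction there is nothing to predict. The 2-bar average is a hypothetical stand-in for a technical indicator, not the authors' actual inputs (which we can't see, since there's no code). The only difference between the two runs is whether the indicator has already seen the close it is supposed to be forecasting.

```python
import numpy as np

rng = np.random.default_rng(0)
close = 100 + np.cumsum(rng.normal(0, 1, 1000))  # random walk: nothing to predict

def strategy_pnl(pred_next, close):
    """Buy at close[t] if pred_next[t] > close[t], sell if smaller; exit at close[t+1]."""
    position = np.sign(pred_next[:-1] - close[:-1])
    return float(np.sum(position * np.diff(close)))

# Leaky input: a 2-bar average that has already seen close[t+1].
leaky = close.copy()
leaky[:-1] = 0.5 * (close[:-1] + close[1:])

# Honest input: the same average, built only from closes up to t.
honest = close.copy()
honest[1:] = 0.5 * (close[1:] + close[:-1])

leaky_pnl = strategy_pnl(leaky, close)
honest_pnl = strategy_pnl(honest, close)
print(f"leaky: {leaky_pnl:.1f}  honest: {honest_pnl:.1f}")
```

The leaky version captures every single price move in the series, because its "forecast" is just the answer shifted by one index. The honest version is left trading noise.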

But even if there’s no peeking at that level, here’s the graphical hell beast the authors provide describing their research framework:

Fig 9. Research framework for models 1 and 2

Every single one of those steps introduces the possibility for an endemic coding error. Despite the workflow appearing sequential, the code was written in Matlab and did not use an event-based simulation framework. This means that all of that processing takes place first, then the trading strategies are evaluated on the output. In an event-based framework, the input data are read sequentially and each new data point triggers a subsequent series of transformations until a prediction is made, then the next data point is read in. This is the most reliable way of preventing look-ahead bias. It is inconceivable that the authors did not make a mistake here.
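For readers who haven't seen one, an event-based loop can be sketched in a few lines. This is a toy under stated assumptions, not a production framework: `run_event_backtest` and `last_change_forecast` are hypothetical names, the "model" is a placeholder, and costs are ignored. The point is structural: the forecasting function is only ever handed the bars that have already arrived.

```python
from typing import Callable, Sequence

def run_event_backtest(closes: Sequence[float],
                       predict: Callable[[Sequence[float]], float]) -> float:
    """Feed closes one at a time; predict() only ever sees bars up to now."""
    history = []
    position = 0      # +1 long, -1 short, 0 flat
    pnl = 0.0
    for price in closes:
        if history:
            # Mark the position opened at the previous close to this close.
            pnl += position * (price - history[-1])
        history.append(price)
        # Hand over an immutable snapshot: future bars do not exist yet.
        forecast = predict(tuple(history))
        position = (forecast > price) - (forecast < price)
    return pnl

def last_change_forecast(history):
    """Hypothetical toy model: expect the most recent change to repeat."""
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[-2])

print(run_event_backtest([100, 101, 102, 101, 103], last_change_forecast))
```

Any transformation (denoising, indicator calculation, scaling) would have to live inside `predict` and be recomputed on each snapshot, which is exactly what makes peeking hard to do by accident.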

Among the smaller problems, the authors appear not to understand the basics of how futures contracts work: specifically, that they expire. There’s a period known as “the roll”, typically near a contract’s expiration date, where traders unwind their positions on the expiring contract and begin to trade a contract with a further expiration date. While the closing price difference between the old and new contract can sometimes be pretty small, it can often be very large. This is problematic for both this paper’s buy-and-hold calculations and its primary buy-and-sell algorithm, which assumes the strategy buys (sells) at the close of one day and then sells (buys) at the close of the following day. Without specific procedures to adjust for the roll, the paper’s logic will be buying and selling completely different contracts.
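A toy example of what the roll does to a naive calculation, using made-up numbers and hypothetical contract labels. It assumes each contract is only quoted for the days it is actually held, with both quoted on the roll day.

```python
quotes = [  # (day, contract, close): hypothetical numbers
    (1, "MAR", 100.0),
    (2, "MAR", 101.0),
    (3, "MAR", 102.0),
    (3, "JUN", 107.0),   # roll day: both contracts quoted
    (4, "JUN", 108.0),
]

def naive_pnl(quotes):
    """Long one contract on a spliced series that ignores the roll."""
    last_per_day = {}
    for day, _, close in quotes:
        last_per_day[day] = close   # later rows overwrite: day 3 picks JUN
    closes = [last_per_day[d] for d in sorted(last_per_day)]
    return closes[-1] - closes[0]

def roll_aware_pnl(quotes):
    """Sum close-to-close changes only within the same contract."""
    pnl = 0.0
    prev = {}
    for _, contract, close in quotes:
        if contract in prev:
            pnl += close - prev[contract]
        prev[contract] = close
    return pnl

print(naive_pnl(quotes), roll_aware_pnl(quotes))
```

The naive series books the 5-point gap between the two contracts as profit that no one could have earned. Flip the sign of the gap and it becomes a phantom loss instead.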

Together all of these problems not only highlight the terrible state of this type of research, but also of a peer review system where inexperienced and unqualified people are chosen to make determinations about the efficacy and quality of research practices in a complex and difficult field.

Paper: A deep learning framework for financial time series using stacked autoencoders and long-short term memory  — Bao W, Yue J, Rao Y

  • Attempts to predict daily prices of various futures using a combination of wavelet transforms, stacked autoencoders, and a long-short term memory network
  • Applies a wavelet transform to the open, high, low, close series (but the description in the methodology section appears to be copied word for word from another paper) which is then fed into the first autoencoding layer
  • Other inputs include OHLCV, technical indicators, and macroeconomic variables
  • No output or target is specified beyond “one step ahead output”
  • The prediction procedure is said to follow the method in the above Chan et al. paper
  • Creates a trading strategy: “buy when the predicted value of the next period is higher than the current actual value… sell when the predicted value is smaller than the current actual value.”
    • Buy_signal : y*_t+1 > y_t
    • Sell_signal : y*_t+1 < y_t
    • It’s unclear what the predicted or actual values are.
  • Compares trading strategy performance to a strategy which buys and holds the underlying stock index.
  • Claims annual returns averaging 45.997% for the S&P 500 and 59.437% for the Nikkei

Clearly this paper is a convoluted mess. Like the previous paper, this paper’s trading logic for the primary algorithm has the same problem with buying one contract and selling a different one. And given its numerous small errors and heavy reliance on the logic from Chan et al., we can assume that this suffers from similar endemic problems. But there’s one quirk that makes this paper extra special. In the profit and loss formula:

The definition of strategy earnings is:

where R is the strategy returns. b and s denote the total number of days for buying and selling, respectively. B and S are the transaction costs for buying and selling, respectively.

Never mind that they’re summing across percent returns. Never mind that it’s preferable for short term strategies trading futures products to report returns in terms of raw dollars.[1] Did you catch the real beauty? Instead of subtracting transaction costs, the authors add the costs to their profits. We should all be so lucky.
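The size of that gift is easy to work out. The following is not the authors' formula, just a toy illustration with hypothetical returns and a hypothetical cost of 10 bps per side, showing what flipping the sign on costs does and why summing percent returns is not the same as compounding them.

```python
import math

gross = [0.010, -0.005, 0.008]   # hypothetical per-trade returns
cost_per_side = 0.001            # hypothetical: 10 bps to buy, 10 bps to sell
round_trip = 2 * cost_per_side

net_subtracted = sum(r - round_trip for r in gross)   # costs reduce profit
net_added = sum(r + round_trip for r in gross)        # sign flipped: costs become income

# The flipped sign overstates the result by two round trips per trade.
overstatement = net_added - net_subtracted

# Summing percent returns also ignores compounding.
summed = sum(gross)
compounded = math.prod(1 + r for r in gross) - 1

print(net_subtracted, net_added, overstatement)
print(summed, compounded)
```

With daily trading, an overstatement of two round trips per trade adds up fast over a year of simulated results.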

Even if all of the issues thus far didn’t exist, there appears to be a fatal flaw in the pre-processing step that renders all subsequent transformations useless. The authors first apply their denoising procedure over the entire OHLC data set. This means that every data point at time t has been adjusted according to a procedure that considers data after time t. Thus, there is a look-ahead bias embedded in the time series itself, and the neural networks are likely predicting the wavelet transformation procedure that was performed ahead of time.
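Here is one way to test any smoothing step for this kind of leak: change only a future observation and see whether the filtered value at time t moves. The sketch below swaps in a centered moving average as a simple stand-in for full-sample denoising (it is not a wavelet transform, and it is not the authors' procedure) and compares it with a trailing average that only looks backward.

```python
import numpy as np

def centered_smooth(x, half_width=2):
    """Non-causal stand-in for full-sample denoising: averages t-k .. t+k."""
    kernel = np.ones(2 * half_width + 1) / (2 * half_width + 1)
    return np.convolve(x, kernel, mode="same")

def trailing_smooth(x, width=5):
    """Causal alternative: the value at t uses only x[t-width+1 .. t]."""
    out = np.empty(len(x))
    for t in range(len(x)):
        out[t] = x[max(0, t - width + 1): t + 1].mean()
    return out

rng = np.random.default_rng(1)
prices = 100 + np.cumsum(rng.normal(0, 1, 50))
shocked = prices.copy()
shocked[31] += 10.0      # alter only a "future" observation

t = 30
leak_centered = centered_smooth(shocked)[t] - centered_smooth(prices)[t]
leak_trailing = trailing_smooth(shocked)[t] - trailing_smooth(prices)[t]
print(leak_centered, leak_trailing)
```

If the value at t responds to a change at t+1, the series handed to the network already contains the future. A causal filter passes this test; anything fit once over the whole sample will not.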

Spurious By Design

Some of my earliest hands-on work with machine learning algorithms involved conducting replications of trading papers which used boosting and bagging on decision trees to make short term price predictions. At the time, I was working at a firm with the means and infrastructure to quickly obtain and warehouse all varieties of data referenced in the papers. The goal of the exercises was to assess parameter sensitivities and try to pick out a useful trick or two. But we soon learned that even though many of these papers did actually replicate over their specific data sets, once we applied the algorithms to data just a few months before or after the sample period, they broke down spectacularly. The following paper has all the trappings of the same.

Paper: A Double-Layer Neural Network Framework for High-Frequency Forecasting — Hao Chen, Keli Xiao, Jinwen Sun, and Song Wu

  • Attempts to predict 5-minute price movements of a set of US equities via a double-layer neural network with a custom hierarchical structure
  • Has data for all S&P 500 stocks but chooses 100 for the model
  • Uses 5-minute OHLCV bars from January 1, 2013 through May 31, 2013 (8112 time points)
  • Other inputs: EMAs, intra-interval proportions, advance/decline (w/ vols and ratios)
  • Builds hierarchy by snooping through the inputs and ordering them according to how much variance they explain over the sample period
  • Trains 1000 models over 2000 periods, uses next 100 periods to select top 50% to average predictions and vote on direction
  • Reports absolute return, Sharpe ratio, and prediction accuracy (correct direction)
  • Compares to ARMA-GARCH, ARMAX-GARCH, Single-Layer NN, Regular Double-Layer NN

Lack of rigor aside, this paper isn’t an obvious disaster — the authors deserve credit for providing the pseudocode outlining most of their algorithms and procedures (actual code is still better). In the absence of coding errors, the calculations at least appear correct. However, there are significant problems arising from extrapolating from a tiny sample, cherry picking a subset of available securities, structuring their algorithm based on snooping, and inadequately investigating where the performance comes from.

Early in the paper, the authors claim “our results are very robust since they are based on a large dataset that includes 100 stocks with the largest capitals in the S&P 500.” This is very wrong. A five month data set of 5-minute bars is not considered large by any standard. Then there’s a one month burn-in period before their predictions begin, leaving an evaluation period of only four months. By only including the largest market cap constituents of an index that returned 16% over the short sample period, the paper’s results are highly susceptible to the market’s unidirectional drift — the effects of which the authors fail to investigate. Finally, the fact that the universe only includes the largest market cap companies means the results are precisely not robust. There’s no reason to believe that companies with smaller market caps follow the same behavior — and some reasons to believe they don’t.

But what really makes this paper unpublishable is that the instrument set, independent variables, and model construction appear to have all been chosen after looking through the sample data and seeing “what works.” The authors write:

Initially, we tried to build a regular double NN (RDNN) without the hierarchical structure, i.e., all inputs at the bottom layer and two hidden layers in the middle, however, the performance of RDNN is not satisfactory. This motivates us to differentiate the inputs in the NN design. We built linear regression models on 100 stocks with largest capitals in S&P 500 and found that on average, AD indicators could explain more variances than EMA, which in turn explain more variances than the Q variables. Therefore we adopted the current DNN design to reflect the “closeness” of each variable to the outcome, and the new DNN structure seems to work well.

Points for honesty, but this isn’t great data science. Exploratory data analysis is great and useful and necessary, but the paper offers profitability metrics as if this were a tradable strategy. Again, there’s no reason to believe this logic will generalize to data after the end of the sample period or on the other 400 instruments they chose to ignore. It shouldn’t be the responsibility of saps like me to have to find out. Thus, this is spurious until proven otherwise.

For all the comparisons this paper makes to other types of models, it doesn’t really investigate what effects are doing the heavy lifting. By treating the 5-minute buckets as a continuous series, the authors ignore the effects of when the market is closed during overnights and weekends. It is best practice to break down the predictions to those right before market close and see if the neural networks are learning a seasonality effect, as it is often the case that certain market regimes will make most of their gains from the market close to the following open.
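A first pass at that breakdown is to split each close-to-close move into its overnight and intraday legs and see which one carries the returns. A minimal sketch with hypothetical daily opens and closes (the helper name `split_returns` is mine):

```python
def split_returns(opens, closes):
    """Split each close-to-close move into overnight and intraday legs."""
    overnight = [o / c_prev - 1 for o, c_prev in zip(opens[1:], closes[:-1])]
    intraday = [c / o - 1 for o, c in zip(opens[1:], closes[1:])]
    return overnight, intraday

opens = [100.0, 102.0, 103.0]    # hypothetical prices
closes = [101.0, 102.0, 104.0]
overnight, intraday = split_returns(opens, closes)

for ov, intra in zip(overnight, intraday):
    print(f"overnight {ov:+.4f}  intraday {intra:+.4f}")
```

The same split applies to a model's predictions: score the bars that span the close separately from the rest and check whether the accuracy survives.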

Another clue that the authors should have looked at the effects of drift is the high correlations between the accuracy rates of the models:

Fig 5 (Chen et al.)

Over the sample period, the S&P closed higher on 60% of days. While this likely doesn’t extend down to the 5-minute level, the authors should have looked into that. Two more comparisons would have shed some light on potentially interesting effects: a constant model that always bets the same direction; and a naive momentum model that bets the return of time t+1 will be the same as time t. This is what we mean when we ask: where does the P/L come from? Any model, black box or otherwise, needs to be thoroughly probed to uncover what effects it is actually capturing. It’s insufficient to wave your hands and say “it’s better than this other one that no one actually uses.”
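Both baselines take a few lines to build. The sketch below uses simulated returns with a hypothetical upward drift, not the paper's data, so the specific numbers mean nothing. What it shows is that a model with zero information can post a directional accuracy comfortably above 50% when the market drifts.

```python
import numpy as np

def directional_accuracy(pred_sign, realized):
    """Share of periods where the predicted sign matches the realized sign."""
    return float(np.mean(pred_sign == np.sign(realized)))

def always_up(returns):
    """Constant model: bet on an up move every period."""
    return np.ones(len(returns) - 1)

def naive_momentum(returns):
    """Predict that the sign of the return at t+1 equals the sign at t."""
    return np.sign(returns[:-1])

rng = np.random.default_rng(2)
returns = rng.normal(0.3, 1.0, 5000)   # hypothetical returns with upward drift
realized = returns[1:]

acc_constant = directional_accuracy(always_up(returns), realized)
acc_momentum = directional_accuracy(naive_momentum(returns), realized)
print(f"constant: {acc_constant:.3f}  momentum: {acc_momentum:.3f}")
```

Any reported accuracy needs to be read against numbers like these, computed on the same sample, before anyone gets excited about a neural network.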

A Step In The Rigor Direction

Occasionally we’re asked to vet trading strategies that firms and funds are considering putting money behind. Discerning the quality of intraday algorithmic trading strategies that don’t have a significant track record of live performance from simulated results alone is challenging but not impossible. The first thing I always look at is the cost assumptions. Cost per share is a big one: if you buy and sell at the same price, the cost of that round-trip is non-zero. Slippage is another: are you placing marketable orders that cross the bid-ask spread or are you modeling passive orders? Every strategy has a ceiling as to how much money it can move before market impact overwhelms its alpha. It’s rare for academic research to clearly identify and handle the many dimensions of cost. So when I came across the following paper and saw that the authors were paying attention to this, I knew some careful work was going on.
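As a rough sketch of the first two items, here is the P&L of a long round trip with marketable orders. The cost model and all the numbers are hypothetical: a fixed fee per share and a constant spread, with no market impact at all.

```python
def round_trip_pnl(entry_mid, exit_mid, shares, spread, fee_per_share):
    """P&L for a long round trip using marketable orders (hypothetical cost model)."""
    buy_price = entry_mid + spread / 2    # pay the offer
    sell_price = exit_mid - spread / 2    # hit the bid
    gross = (sell_price - buy_price) * shares
    fees = 2 * fee_per_share * shares     # charged on the buy and on the sell
    return gross - fees

# The mid price does not move at all, yet the trade loses money.
flat = round_trip_pnl(entry_mid=50.00, exit_mid=50.00, shares=1000,
                      spread=0.02, fee_per_share=0.005)
print(round(flat, 2))
```

A strategy whose average edge per trade is smaller than that flat-market loss is not a strategy, however good the simulated accuracy looks.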

Paper: Improving Factor-Based Quantitative Investing by Forecasting Company Fundamentals — John Alberg and Zachary C. Lipton

  • Attempts to predict fundamentals as part of a factor model
  • Constructs a “clairvoyant model” to establish an upper bound of value for how much predicting these fundamentals would be worth
  • Uses factors motivated by previous literature and adds some momentum factors (by rank)
  • In sample period of 1970-1999 and out of sample 2000-2017
  • Considers several classes of deep neural networks
  • Adds hyperparameters to up-weight certain factors it wants to predict more
  • Each month, forecasts the factors a year out and then ranks the stocks according to the model, dividing capital evenly between the top 50
  • After a year holding period, stocks that fall off the top 50 are sold and then capital is reallocated to those that are now on the list
  • Has a fixed cost of trading at 1 cent per share, an additional slippage factor that increases as volume percentage increases, and a max volume participation per month
  • Shows a modest outperformance of 2.7% a year versus the standard factor model
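The rebalancing step in that list can be sketched as follows. This is my reading of the description, not the authors' code: `scores` is a hypothetical mapping from ticker to whatever ranking value the model produces, and capital is split evenly across the target list.

```python
def rebalance(scores, held, top_n=50):
    """Return (target, buys, sells, weight) for an equal-weight top-N portfolio."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    target = set(ranked[:top_n])
    buys = target - held      # newly ranked names to purchase
    sells = held - target     # names that fell off the list
    weight = 1.0 / len(target) if target else 0.0
    return target, buys, sells, weight

# Tiny hypothetical universe with top_n=2 to keep the output readable.
scores = {"A": 3.0, "B": 2.0, "C": 1.0, "D": 5.0}
target, buys, sells, weight = rebalance(scores, held={"A", "B"}, top_n=2)
print(sorted(target), sorted(buys), sorted(sells), weight)
```

The `buys` and `sells` sets are where the cost assumptions bite, since every name in them incurs commission and slippage.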

Oh S#!T, they’re not predicting price returns! The authors even say why:

In previous experiments, we tried predicting price movements directly with RNNs and while the RNN outperformed other approaches on the in-sample period, it failed to meaningfully out perform a linear model

That’s right. When you have no coding errors, no look-ahead bias from silly wavelet transforms, no cherry picking factors from small samples, you probably won’t be able to predict price returns for any meaningful length of time. This is what I ominously hinted at in the closing line of my last piece. By imposing the structure of a factor model, the authors were able to use ML techniques to enhance the performance rather than try to create the God Model of price return predictions.

Since this paper is only 4 pages long, I encourage you to read it to see what’s going on. I’m just going to focus on a few things I would like for the authors to have explored and reported.

Even though the out of sample period is quite long, the authors did look at a lot of different architectures and variables:

For both MLPs and RNNs, we consider architectures evaluated with 1, 2, and 4 layers with 64, 128, 256, 512 or 1024 nodes. We also evaluate the use of dropout both on the inputs and between hidden layers. For MLPs we use ReLU activations and apply batch normalization between layers. For RNNs we test both GRU and LSTM cells with layer normalization. We also searched over various optimizers (SGD, AdaGrad, AdaDelta), settling on AdaDelta. We also applied L2-norm clipping on RNNs to prevent exploding gradients.

To account for the fact that we care more about our prediction of EBIT over the other fundamental values, we up-weight it in the loss (introducing a hyperparameter 1). For RNNs, because we care primarily about the accuracy of the prediction at the final time step (of 5), we upweight the loss at the final time step by hyperparameter 2

Overall this is a very high-dimensional parameter space. Given the modest outperformance of the final models relative to the standard factor model, it would be best practice to report the performances of all of those different architectures and the variable sensitivities for everything they looked at. While it might seem like 17 years is sufficient to claim generalizability, the monthly time frame they’re using means there are only 204 prediction points in the out of sample period. And while this is certainly the best paper applying ML to financial research I’ve seen in a while, they still need to rigorously probe the parameter space before we can get an idea of generalizability.
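To put a rough number on it, here is a count of the configurations implied by the quoted passage. It assumes every combination was tried, which the paper does not state, so treat it as an illustration of scale rather than a description of the authors' search. Dropout settings and the two loss-weight hyperparameters are left out entirely, so the real space is larger.

```python
from itertools import product

layers = [1, 2, 4]
nodes = [64, 128, 256, 512, 1024]
optimizers = ["SGD", "AdaGrad", "AdaDelta"]
rnn_cells = ["GRU", "LSTM"]

# Assumption: a full grid over the options listed in the quoted passage.
mlp_configs = list(product(layers, nodes, optimizers))
rnn_configs = list(product(layers, nodes, optimizers, rnn_cells))
n_configs = len(mlp_configs) + len(rnn_configs)

oos_points = 17 * 12   # 17 years of monthly predictions

print(n_configs, oos_points)
```

Under that assumption, the number of candidate models is of the same order as the number of out of sample observations used to judge them.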

Also: submit the code, dudes.

Outro: It’s Cars All The Way Down

Should academic journals place a moratorium on finance papers which predict price movements using black box machine learning techniques until the journals establish and enforce both open data and code standards, as well as clearly define a set of minimal criteria which satisfy a contribution to financial research or knowledge? Maybe. It sounds like editors are already doing a decent job of desk rejecting the vast majority of it. But the fact that they have to is also a problem. It means researchers are wasting a lot of time on stuff that is almost guaranteed to go nowhere. I see two main areas of concern:

The first is what I’ve been returning to in every section: code. When just a few years ago one of the most famous results in empirical economics turned out to rest on a spreadsheet error, we need procedures in place to guard against this. The complexity of these models is not decreasing, and the amount of code required to produce results is only going to increase in the short term. It’s unreasonable to expect academic researchers to produce error free code; they simply have no experience with rigorous testing and development. A fantastic exposé about one of the most successful software development groups of all time, Lockheed Martin’s “onboard shuttle group”, describes how an error that makes it into production is not blamed on a person, but on the process that allowed it to happen. In financial research, and quantitative research in general, as code becomes increasingly inseparable from the underlying research, we should begin to think about that sort of process as well.

The second area of concern is specific to papers trying to build short term trading algorithms. With a few exceptions, researchers interested in applying ML techniques to market prices need a lot of data. So as researchers start using whatever higher frequency data they can find, they’re still operating under assumptions from the older monthly and quarterly factor model days. I’ll try to avoid any obvious analogies to physics, but as the data become increasingly granular, market microstructure effects begin to overwhelm those assumptions. These effects produce all sorts of challenges when it comes to being able to report profitability or merely accuracy metrics. Thus far in practice and in theory, the microstructure world has been almost entirely a practitioner’s game, but I’d like to see some work that offers guidance to future researchers.

And if anyone is interested in working on either of those things, don’t hesitate to get in touch.


I’d like to thank Ernie Chan for challenging me to comb through the first two papers. And special thanks to Zachary Winston, Joshua Loftus, Chris Mullins, Mike Trenfield, and Gabriel Mathy for proofreading various parts of this.

This post originally appeared on Zachary David’s Market Fails & Computational Gibberish.


  1. I had the unfortunate experience of working with a professional trader who calculated all of his simulated returns by: (SellPrice/BuyPrice - 1). While this is technically equivalent to ((SellPrice - BuyPrice)/BuyPrice), it’s not the right calculation when you go short, where your return should be denominated by the SellPrice. This had the effect of making his short side winners look bigger and his losers look smaller. Eventually we all figured out he was a fraud in several other ways.

    The lesson here is that you should really try to keep things in terms of dollars at first, then calculate portfolio returns in terms of percent at the end.
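A quick worked example of the distortion, with hypothetical prices. On a short, the entry is the sale, so the return is divided by the sell price:

```python
def dollar_pnl(buy_price, sell_price, shares=1):
    """Raw dollars: the same for longs and shorts."""
    return (sell_price - buy_price) * shares

def traders_return(buy_price, sell_price):
    """The trader's formula, applied to every trade regardless of side."""
    return sell_price / buy_price - 1

def short_return(buy_price, sell_price):
    """Short entry is the sale, so divide by the sell price."""
    return (sell_price - buy_price) / sell_price

# Short winner: sell at 100, buy back at 90.
print(dollar_pnl(90, 100), traders_return(90, 100), short_return(90, 100))
# Short loser: sell at 100, buy back at 110.
print(dollar_pnl(110, 100), traders_return(110, 100), short_return(110, 100))
```

Both trades are 10 dollars per share on a 100 dollar entry, but the trader's formula reports the winner as more than 10% and the loser as less than 10%.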

  • newt0311

    With futures I would imagine that percentage returns are even less useful: you never pay the “price” to go long or short. It’s always margin. I feel like an accurate assessment of a futures strategy would look at drawdown and try to maintain say something like 1.5x that much (and then report if/when the strategy runs out of money…).

    • ZHD

      Yeah you’re thinking along the right lines. That’s why we always kept things in terms of dollars and/or ticks (when at the higher-frequency level). At the institutional level, you tend to negotiate with your prime/clearing broker in terms of max order size and max position, but then you can have positions which offset other positions when you’re deploying strategies that are hedged. So it can get complex very quickly. This is why we tend to favor looking at things in terms of risk units per dollar.
