Know Thy Model: Specificity and the Importance of Using Fake Data


I reconstruct a model used in an academic paper on HFT “quote stuffing.” Then I apply several scenarios of mock data to show that the authors’ conclusion is invalid. There is still no evidence of quote stuffing.[1]

[Image: "Underpants Quotes" — HFT gnomes]

Introduction

A few weeks ago I wrote a post explaining how routine software bugs in high-frequency trading systems can lead to seemingly malicious scenarios. When reviewing a supposed practice known as “quote stuffing” — where HFT firms intentionally flood the market with large amounts of orders that are cancelled shortly after in an attempt to slow down other trading systems — I remarked that there isn’t any evidence to suggest this practice exists.[2] In the comments, Dave Lauer pointed me to a 2012 paper by Mao Ye, assistant professor of finance at the University of Illinois, in which he claims to have found the evidence. Mao’s model attempts to measure the “co-movement” of message flows along individual channels on the NASDAQ OMX. According to his faculty page, Mao has presented this paper 10 times. And Dave even refers to it in a CNN Money interview. Guys, it’s time to stop. I can apply a random number generator to Mao’s methodology and achieve the same results. This post will show that there is a significant amount of work that still needs to be done before you can make such a conclusion. Retract until then.[3]

(And as always there is a non-zero probability that I'm wrong in part or whole. My Matlab code can be downloaded for replication. I appreciate anyone who takes the time to engage ideas, and I issue corrections immediately as they develop.)

Specifically Specificity

I argue that the model Mao Ye presents lacks specificity and doesn't account for normal market conditions. A model's "specificity" refers to its propensity to avoid false positives. A false positive occurs when a model signals the presence of a condition that is not actually present. If I have a model that always predicts the market will go down, it will be right whenever the market does go down, but it will also produce a lot of false positives.

[Image: Sensitivity and Specificity (Wikipedia)]

The importance of model specification is ubiquitous throughout research fields. In empirical studies, having a misspecified or underspecified model can lead researchers to conclude that they have found evidence of something that doesn’t exist. In psychiatry, Neuroskeptic covered a recent statistical scandal in which researchers incorrectly reported the specificity of a model which tried to predict suicides:

They reported sensitivity, but not specificity. Instead they reported something they call ‘raw specificity’. What is this? Well… it doesn’t exist. Thorell et al appear to have made it up. The term is unknown in statistics: it does not appear on Google Scholar in any other paper (there are a few ‘hits’ but upon closer inspection they are all referring to the old-fashioned specificity of some ‘raw’ variable.)

So did the test work? Well, the actual specificity (maybe Thorell et al call this the ‘cooked’ specificity?) of the electrodermal test was 33% over all patients. The sensitivity was 74%. The sum of sensitivity and specificity was, then, 107%. Any entirely random ‘test’ will get a sum of sensitivity and specificity equal to 100%. A perfectly accurate test would get a sum of 200%. So the electrodermal test’s true performance is just 7% better than flipping a coin.

But Mao's case is certainly not as egregious as the misreporting of results. I believe the mistakes in the paper are an honest misunderstanding of normal expected market behavior. Still, the existence of cases in which the model signals a false positive merits retraction until further analysis.

The Paper’s Model

The model attempts to measure excess "message flows" in individual channels of NASDAQ-listed stocks using a factor regression. A "channel" is a feed that carries all of the messages for a specific set of stocks. Over the period Mao's data covers, the NASDAQ system had 6 channels for 2,377 stocks, where stocks are assigned to channels by the first letter of their ticker symbol:

Channel 1 handles ticker symbols from A to B;
Channel 2 handles ticker symbols from C to D;
Channel 3 handles ticker symbols from E to I;
Channel 4 handles ticker symbols from J to N;
Channel 5 handles ticker symbols from O to R;
Channel 6 handles ticker symbols from S to Z.

The intuition of the model is that an HFT firm will slow down other players by stuffing one channel full of quotes so that it can trade on the other channels faster than the competition. If the regression coefficient of a channel is positive for a stock in that channel, after controlling for the total number of messages in the market at that time, the claim is that this is evidence of quote stuffing. The test is set up as follows:

We divide each trading day into one-minute intervals and count the number of messages in each interval for all 2,377 stocks in the 55 trading days between March 19, 2010 and June 7, 2010. For each stock i, the channel message flow is the sum of all messages for stocks in Channel j minus the message flow of stock i, if stock i is in Channel j. We make this adjustment to avoid mechanical upward bias to find that a stock has higher correlations with message flows in its own channel. The market message flow is the sum of the messages for all stocks. For each stock i, we run the following two stage regressions following Bekaert, Hodrick, and Zhang (2009):

We first regress the total number of messages of Channel j on the market message flow:

$$f_{j,t} = \alpha_j + \beta_j f_{m,t} + \varepsilon_{j,t}$$

We save the residual of this regression as a new variable, $\varepsilon_{j,t}$. In the second step, we run the following six regressions for each stock i:

$$f_{i,t} = \alpha_i + \beta_i f_{m,t} + \gamma_{i,j} \varepsilon_{j,t} + u_{i,t}$$

Where $f_{i,t}$ stands for the number of messages in stock i at time t, and $\gamma_{i,j}$ measures the channel-level effect after controlling for the market-wide effect. We are particularly interested in $\gamma_{i,j}$ when stock i belongs to Channel j. However, we also run the regression of stock i on the other channels as a falsification test.
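To make the two-stage procedure concrete, here is a minimal Matlab sketch for a single stock i and channel j. The variable names (f_i, channel_j, market) are mine, and the paper's estimation details (error structure, significance tests) are omitted:

```matlab
% A minimal sketch of the two-stage regression for one stock i in channel j.
% Assumed inputs (my names): f_i (T x 1 messages for stock i), channel_j
% (channel j totals, excluding stock i if i is in j), market (market totals).
T  = numel(f_i);
X1 = [ones(T,1) market];

% Stage 1: regress channel flow on market flow and keep the residual
beta1   = X1 \ channel_j;
resid_j = channel_j - X1 * beta1;

% Stage 2: regress the stock's flow on market flow plus the channel residual
X2    = [ones(T,1) market resid_j];
beta2 = X2 \ f_i;

gamma_ij = beta2(3);   % channel-level effect after the market-wide control
```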

Then they create a 6 x 6 matrix of the average gamma coefficient for the stocks in one channel regressed against each of the channels. The main diagonal (top left to bottom right) holds the coefficients of interest. This is shown as Table 6:

[Table 6: ***, **, * denote significance at the 1%, 5%, and 10% levels]

While significance testing attempts to limit the occurrence of false positives, its value is local to the model — i.e. limited to the variables used to construct the model. Improper specification will not insulate you from false positives. This is exactly what happened here.

The first problem is with the variables themselves. They use the total number of messages in a channel for a given period, including executions and the orders which caused those executions. But as the name implies, "quote stuffing" must consist of quotes that don't generate trades. If trades are generated, that violates the theoretical assumptions used to motivate the study; the measure is not logically consistent with the claim.

Building on that, there's an implicit assumption that the expected messaging behavior of the stocks in each channel is drawn from the same distribution. But this neglects the Pareto Principle (also called the 80-20 rule), a heuristic that says 80% of the effects come from 20% of the causes. Activity in securities markets follows this idea closely. Order activity is strongly correlated with a stock's total volume: when someone sends an aggressive order, they put information into the market, and other players react by adjusting the prices at which they are willing to buy and sell.

We can go to the NASDAQ Most Active Stocks page and find the historical volumes for the top 100 stocks on a given day. Taking a sample from the first 10 days of the paper's data range (March 19 onward), then averaging the 10 volumes at each rank and plotting, the volume curve looks like this:

[Figure: average daily volume by rank, top 100 most active NASDAQ stocks]

I excluded SIRI because it was number 1 every day over that period and made the scaling look goofy (but I have included it in the downloadable data set in the next section). The volume, and thus messaging, continues to drop according to an exponential function until you get to stocks that trade only a few thousand shares per day. Remember: there are 2,377 stocks covered in the paper.
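To eyeball that decay, one can fit a line to the log of the averaged volumes. A sketch, assuming the 100 rank-averaged volumes sit in a vector avgVol (my name, built from the downloadable data set):

```matlab
% Sketch: check the roughly exponential decay of volume by rank.
% avgVol (100 x 1, average daily volume at each rank) is assumed to exist.
r = (1:numel(avgVol))';
p = polyfit(r, log(avgVol), 1);     % log(V) ~ p(1)*r + p(2)
semilogy(r, avgVol, 'o', r, exp(polyval(p, r)), '-');
xlabel('rank'); ylabel('average daily volume');
```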

To further complicate things, the number of high-volume stocks cannot be assumed to be uniformly distributed across the channels on a given day. Indeed, assigning each of a day's top 100 stocks to its channel gives between 10 and 25 per channel. Further, some stocks are consistently high volume (like QQQQ), while others come and go with economic events, corporate actions, and announcements (like SIRI, which drops to number 6 by the end of the paper's sample period).

Mock Data in Matlab

 Download The Files

To test how some more realistic assumptions might produce the same results as the paper, I created a series of tests using Matlab. With a 55-day sample period, 1-minute buckets, and 6.5-hour trading days (9:30 AM to 4:00 PM), we have a total of 21,450 data points per stock. I evenly distributed the stocks across six channels, for a total of 396 stocks in each channel. Then I filled each stock with a number of messages for each period using various random distributions according to several scenarios. Finally, I generate the channel_sum and market_sum variables according to the paper, perform the regressions, and generate the 6 x 6 table.
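As a rough sketch of that frame (variable names are mine, not the paper's):

```matlab
% Simulation frame: dimensions as described above.
nDays     = 55;                 % trading days
nMinutes  = 390;                % one-minute buckets per 6.5-hour day
T         = nDays * nMinutes;   % 21,450 observations
nChannels = 6;
nStocks   = 396;                % stocks per channel

% messages(t, s, c) = message count for stock s of channel c in minute t
messages = zeros(T, nStocks, nChannels);

% ...fill messages according to the scenario under test (Tests 1-3 below)...

channel_sum = squeeze(sum(messages, 2));   % T x 6 channel totals
market_sum  = sum(channel_sum, 2);         % T x 1 market total
% (per the paper, stock i's own messages are subtracted from its channel
% total before running its regressions)
```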

Test 1: All Stocks From Same Distribution

To make sure I correctly replicated the model, I did an initial test that filled each stock for each channel from the same distribution ~N(5000,1000) (normal distribution with 5,000 mean and 1,000 standard deviation). We would expect to see no channel relationship.
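A sketch of the Test 1 fill, reusing the frame above; rounding and flooring at zero are my additions to keep the counts valid:

```matlab
% Test 1: every stock-minute drawn from the same N(5000, 1000^2)
messages = max(round(5000 + 1000 * randn(T, nStocks, nChannels)), 0);
```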

[Table: Test 1 results]

This is consistent with what we expect. No positive excess channel coefficients.

Test 2: Quick and Dirty “High Volume” Stocks

Going off of my cursory look at how many stocks per channel land in the top 100 each day (between 10 and 25), I first populated each stock according to Test 1. Then I assigned the first 10 stocks of each channel messaging drawn from a distribution an order of magnitude greater, ~N(50000, 10000). Then every 390 periods (equivalent to one day) I randomly assigned between 1 and 30 stocks from the next 200 to be drawn from the same larger distribution, as sketched below.
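A sketch of that procedure; whether the rotating stocks sit at random positions within the next 200 is my assumption:

```matlab
% Test 2: Test 1 baseline, a fixed high-volume block, and a daily rotation
messages = max(round(5000 + 1000 * randn(T, nStocks, nChannels)), 0);
for c = 1:nChannels
    % stocks 1-10 of each channel are always an order of magnitude busier
    messages(:, 1:10, c) = max(round(50000 + 10000 * randn(T, 10)), 0);
    for d = 1:nDays
        rows = (d-1)*nMinutes + (1:nMinutes);   % the 390 minutes of day d
        k    = randi(30);                       % 1 to 30 stocks promoted today
        pick = 10 + randperm(200, k);           % drawn from the next 200
        messages(rows, pick, c) = ...
            max(round(50000 + 10000 * randn(nMinutes, k)), 0);
    end
end
```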

[Table: Test 2 results]

Interesting. We’re seeing positive channel coefficients on the same order as the paper. But this is certainly not quote stuffing. They’re just random numbers.

Test 3: A Rough Exponential Distribution of Volume Regimes

Finally, to test something more representative of stocks smoothly transitioning between higher and lower messaging periods, I use several distributions with means {50k, 40k, 30k, 20k, 10k, 5k} and standard deviations {10k, 8k, 6k, 4k, 2k, 1k}. Then each day I generate five uniformly random integers from 1 to 25, {n1, n2, n3, n4, n5}. Stocks 1 through n1 are assigned to the highest distribution, stocks (n1+1) through n2 are assigned the next distribution, and so on, until after n5 all stocks are drawn from the smallest distribution.
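A sketch of the regime assignment; reading the five draws as sorted cutoffs is my interpretation of the text above:

```matlab
% Test 3: six message-rate regimes with cutoffs redrawn each day
mu = [50 40 30 20 10 5] * 1e3;      % regime means
sd = [10  8  6  4  2 1] * 1e3;      % regime standard deviations
messages = zeros(T, nStocks, nChannels);
for c = 1:nChannels
    for d = 1:nDays
        rows  = (d-1)*nMinutes + (1:nMinutes);
        n     = sort(randi(25, 1, 5));      % n1 <= ... <= n5, each in 1..25
        edges = [0 n nStocks];
        for r = 1:6
            cols = (edges(r)+1):edges(r+1); % stocks in regime r today
            if isempty(cols), continue; end
            messages(rows, cols, c) = ...
                max(round(mu(r) + sd(r) * randn(nMinutes, numel(cols))), 0);
        end
    end
end
```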

[Table: Test 3 results]

Et Voilà. False positives.

Remanded

“In theory, there is no difference between theory and practice.”

Clearly this model is not specific to the detection of quote stuffing. By including executions and the orders which generated them, the paper doesn't even measure what it claims to measure. And by not controlling for high-activity stocks, it's simply finding evidence of volume regimes consistent with the Pareto Principle. Further, this type of regression model is ill-suited to the goal. If Mao wants to detect co-movement, he should actually look at movement: instances of quote stuffing might not be persistent enough to show up in the coefficients of a large-n regression. Finally, the explicit assumption that one channel at a time will be targeted does not make sense from a game-theoretic perspective. An HFT firm would want to stuff every channel which doesn't contain the stock it wishes to trade at that time. This further complicates the idea of a channel factor regression that controls for total market messaging.

The conclusion is thus invalid. A retraction must be issued.


Footnotes

  1. I know I said I wouldn’t write about HFT for a while. But this post isn’t really about trading. It’s about model misspecification. Yeah. That’s it.
2. It's important to remember that the idea of quote stuffing comes from the perpetually non-peer-reviewed research firm Nanex. Several legitimate academic researchers have since attempted to lay theoretical foundations under which a firm might be able to use the tactic profitably. However, these foundations rely on a significant amount of voodoo (underpants-gnome logic), which suggests a lack of understanding of market microstructure by the authors. As if that weren't bad enough, there hasn't been any research implementing these ideas to confirm that the tactic is even possible, much less evidence of any firm actually using it profitably.