Pretesting the Effectiveness of Genetic Programming
by Michael R. Bryant
One of the questions that arises when using genetic programming (GP) to develop trading strategies is whether or not the GP process is inherently capable of building strategies that are not over-fit to the market. When thousands of trading strategies are built and tested by the GP algorithm, it's sensible to wonder if the out-of-sample results are meaningful or just a consequence of random chance. It's the "infinite monkey theorem": if you let a monkey randomly hit the keys on a typewriter, given long enough, he'll compose the works of Shakespeare. So how do we know that the trading results from GP are not monkey fodder?
Fortunately, statistics provides an answer; in particular, statistical null tests. In the context of GP, these have been called pretests [1]. Basically, we want to compare the results of the GP process to an equivalent set of randomly generated trading strategies. If the strategies generated by GP are significantly better than the randomly generated strategies, it suggests that the GP process may have merit.
Asking the Right Questions
When building a trading strategy for a given market, there's always a possibility that the data on which the strategies are built, known as in-sample or training data, may not contain information relevant to the unseen data, known as out-of-sample or testing data. In other words, there may be no exploitable pattern or other information in the in-sample period that makes it possible to generalize the results to the out-of-sample period. For example, the market may be too efficient to exploit, or the in-sample and out-of-sample periods may be fundamentally distinct due to an unfortunate choice in dividing the data. If that's the case, then any successful out-of-sample test will be merely a consequence of random chance.
Even if there is potential in a given market, we need to know if the GP process can effectively exploit it. It may be that the chosen inputs to the GP process are ineffective or that the necessary indicators or other elements required to exploit the market data are unavailable.
With this in mind, the goal of the pretesting is to answer these two key questions regarding the use of GP in developing trading strategies: (1) Does the in-sample data segment contain information that allows trading strategies to be generalized to the out-of-sample data, and (2) Can the GP process effectively utilize that information? Both of these questions must be answered in the affirmative in order for the GP process to be useful as a trading strategy development approach.
An Effective Pretest
The semantic rules of strategy construction ensure that the strategies make basic sense. For example, they prevent the time-of-day from being compared to a moving average of the closing prices and ensure that stochastic oscillators are compared to values in the range 0 to 100 or to other oscillators in the same range. They do not, however, attempt to create strategies that make intuitive trading sense, such as buying breakouts following a short-term down-trend within a longer-term up-trend. That would defeat the purpose of the GP process, which is designed to discover rules of that sort if they provide any benefit to the in-sample performance.
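The kind of semantic rule described above can be illustrated with a small sketch. This is not how Adaptrade Builder implements its rules; the `Kind` categories and the `comparison_allowed` function are hypothetical names chosen for illustration, assuming each operand is tagged with the kind of quantity it represents.

```python
from enum import Enum


class Kind(Enum):
    """Hypothetical categories of values a strategy rule can compare."""
    PRICE = "price"            # e.g., close, moving average of the close
    TIME = "time"              # e.g., time of day
    OSCILLATOR = "oscillator"  # bounded 0..100, e.g., a stochastic


def comparison_allowed(left_kind, right_kind, right_constant=None):
    """Return True if comparing the two operands makes basic sense.

    Operands must be of the same kind. If the right side is a constant
    compared against an oscillator, it must lie in the range 0 to 100.
    """
    if left_kind != right_kind:
        return False
    if left_kind is Kind.OSCILLATOR and right_constant is not None:
        return 0 <= right_constant <= 100
    return True


# Time of day vs. a moving average of closing prices: rejected.
print(comparison_allowed(Kind.TIME, Kind.PRICE))
# Stochastic vs. the constant 80: accepted.
print(comparison_allowed(Kind.OSCILLATOR, Kind.OSCILLATOR, 80))
```

Note that a check of this sort only filters out nonsensical comparisons; it says nothing about whether the resulting rule is a good trading idea, which is left to the evolutionary process.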
Given that the initial population is randomly generated, its performance on the in-sample and out-of-sample segments will be similarly random. In particular, we can expect some strategies to be profitable and some unprofitable. In most cases, the average performance of the initial population will be unprofitable.
After the initial population evolves over several generations of the GP process, we would expect the out-of-sample results to improve if the GP process is working effectively. While some members of the population (i.e., strategies) may be profitable and some unprofitable, the average out-of-sample performance should improve overall. Our pretest, then, will be to compare the average out-of-sample performance of the evolved population to the average out-of-sample performance of the initial population.
Two metrics will be examined: average out-of-sample net profit and the out-of-sample net profit of the strategy ranked highest according to fitness. The fitness is calculated based on the in-sample performance and is the primary driver of the evolutionary process. No out-of-sample data is used in calculating the fitness or in any other aspect of the build process, except for the option of resetting the population based on out-of-sample performance, which will not be used in this study.
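As a minimal sketch of the two metrics, suppose each strategy is represented as a dictionary with a `fitness` value (computed in-sample) and an `oos_net_profit` value. Both field names and the sample numbers are made up for illustration and are not taken from the study.

```python
from statistics import mean


def pretest_metrics(population):
    """Return (average OOS net profit, OOS net profit of the fittest member).

    The fittest member is selected by in-sample fitness only; the
    out-of-sample figures play no part in the selection.
    """
    avg_oos = mean(s["oos_net_profit"] for s in population)
    fittest = max(population, key=lambda s: s["fitness"])
    return avg_oos, fittest["oos_net_profit"]


# Hypothetical three-member population.
population = [
    {"fitness": 0.9, "oos_net_profit": 1200.0},
    {"fitness": 0.4, "oos_net_profit": -300.0},
    {"fitness": 0.7, "oos_net_profit": 600.0},
]
avg_oos, fittest_oos = pretest_metrics(population)
print(avg_oos, fittest_oos)
```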
To calculate the average out-of-sample performance of the evolved population, only unique strategies will be counted. Some duplicate strategies are inevitable during the build process, and the more generations are employed, the more duplicates tend to arise. To avoid double-counting either positive or negative results and thereby biasing the results, duplicates will be detected based on performance. If all performance metrics (other than the number of inputs, listed as "complexity") are respectively the same (within the numerical precision of each other) for two strategies, the two strategies will be considered to be identical. This method defines duplicates semantically, rather than syntactically (based on the strategy code). For example, two strategies that have code differences that don't affect the results will be considered duplicates. The size of the initial, randomly generated population was set to match the number of unique strategies in the evolved population.
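The performance-based duplicate check can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the software: the metric names, the tolerance values, and the requirement that every strategy carries the same set of metrics are all assumptions.

```python
import math


def is_duplicate(a, b, ignore=("complexity",), tol=1e-9):
    """True if all metrics other than those in `ignore` match within tolerance.

    Assumes both strategies report the same set of numeric metrics.
    """
    keys = [k for k in a if k not in ignore]
    return all(math.isclose(a[k], b[k], rel_tol=tol, abs_tol=tol) for k in keys)


def unique_strategies(strategies):
    """Keep the first occurrence of each distinct performance profile."""
    unique = []
    for s in strategies:
        if not any(is_duplicate(s, u) for u in unique):
            unique.append(s)
    return unique


# Hypothetical metrics: the first two differ only in complexity.
strategies = [
    {"net_profit": 1500.0, "trades": 42, "profit_factor": 1.8, "complexity": 5},
    {"net_profit": 1500.0, "trades": 42, "profit_factor": 1.8, "complexity": 7},
    {"net_profit": -250.0, "trades": 17, "profit_factor": 0.9, "complexity": 5},
]
unique = unique_strategies(strategies)
print(len(unique))
```

The pairwise comparison is quadratic in the population size, which is acceptable for populations of around 100 members.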
The comparisons between the evolved populations and the randomly generated ones were made using a Student's t test, and all tests were made using Adaptrade Builder software.
Test Cases and Results
Four test cases were run, as shown in the table below.
[2] Order types restricted to limit entries and exits (filled only when price exceeded) and fixed size money management stops between $12.50 and $175 in size; entry only between 8:30 - 10:00 am (exchange time); exit after 3:00 pm; maximize net profit, number of trades and significance (equal weight), with a small weighting for complexity; and only simple moving average, highest, lowest, price patterns, crosses above/below, and absolute value in indicator build set.
In each case, the population size was 100, and five generations were run. The data was split 75/25 between in-sample and out-of-sample. The build was repeated 25 times for each market, with the average out-of-sample net profit over the final population and the out-of-sample net profit for the fittest population member recorded after each trial. The same results were then recorded for 25 initial (randomly generated) populations where the population size was set to match the number of unique strategies in the evolved population, as explained above. To determine if the net profit from the evolved populations was greater than the net profit from the initial populations, the averages were compared using a one-tailed, unpaired Student's t test, as calculated in Excel.
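For readers who want to reproduce the comparison outside of Excel, the following is a rough sketch of a one-tailed, unpaired t test using only the Python standard library. It assumes equal variances in the two groups (the pooled-variance form); the article does not state which variance option was used in Excel. The p-value is obtained by numerically integrating the t density with Simpson's rule, and the sample values are made up for illustration.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def upper_tail_p(t, df, steps=2000):
    """P(T > t), integrating the density over [0, |t|] by Simpson's rule."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 - area if t >= 0 else 0.5 + area


def one_tailed_unpaired_t(evolved, random_):
    """Test whether mean(evolved) > mean(random_), assuming equal variances."""
    n1, n2 = len(evolved), len(random_)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(evolved) + (n2 - 1) * variance(random_)) / df
    t = (mean(evolved) - mean(random_)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df, upper_tail_p(t, df)


# Hypothetical per-trial averages (the study used 25 trials per group).
evolved = [3.0, 4.0, 5.0]
random_ = [1.0, 2.0, 3.0]
t, df, p = one_tailed_unpaired_t(evolved, random_)
print(t, df, p)
```

A small p-value indicates that the evolved populations' average is unlikely to exceed the random populations' average by chance alone, given the test's assumptions.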
The population size was kept small to speed up the build process. A small number of generations was used in order to ensure adequate variation in the final population. The markets and settings were chosen to illustrate the idea that how well the GP process works depends on such factors. For example, daily bars of the E-mini S&P were chosen because experience has shown that it is fairly easy to find good strategies for this market. Strategies on 1 min bars of the E-mini S&P are much more difficult to generate. The second test with 1 min bars of the E-mini was designed to find strategies with more, shorter trades -- more of a "scalping" strategy -- with a restricted set of indicators in the build set. The forex market was included for variety.
The results are shown below for the average population net profit. In all cases, the average out-of-sample (OOS) net profit across the evolved populations was significantly higher than the average OOS net profit of the randomly generated populations, suggesting that the GP process was effective in evolving strategies that improved upon random results. The fact that the average population net profit was not positive in three of the four cases is discussed in the next section.
[3] Average value plus/minus one standard deviation; N = 25 in all cases.
The results for the fittest population member are shown below. In the first two cases, the OOS net profit of the fittest member is greater than the average OOS net profit of the population, as can be seen by comparing the results in the two tables. This is consistent with the prior results in that it suggests a positive relationship between the fitness, which is calculated on the in-sample data, and the OOS net profit. In all cases except for the third one, the OOS net profit of the fittest member is greater in the evolved population than in the randomly generated population, which is also consistent with the prior results, although due to the high standard deviation, the differences are not statistically significant.
The table also shows the percentage of runs where the fittest member had a positive net profit out-of-sample. In three of the four cases, there were more runs where this occurred for the evolved population than was the case for the random populations. For example, for the second test case, in 44% of the evolved populations, the fittest member was profitable OOS, whereas the fittest member was profitable OOS in only 28% of the randomly generated populations.
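With 25 runs per case, the percentages quoted above correspond to whole numbers of runs: 44% is 11 of 25 and 28% is 7 of 25. A minimal sketch of the calculation follows; the function name and the placeholder profit values are illustrative only, chosen to reproduce those counts rather than taken from the study.

```python
def pct_profitable(fittest_oos_profits):
    """Percentage of runs in which the fittest member was profitable OOS."""
    wins = sum(1 for p in fittest_oos_profits if p > 0)
    return 100.0 * wins / len(fittest_oos_profits)


# Placeholder values: only the sign of each entry matters here.
evolved_runs = [1.0] * 11 + [-1.0] * 14   # 11 profitable runs out of 25
random_runs = [1.0] * 7 + [-1.0] * 18     # 7 profitable runs out of 25
print(pct_profitable(evolved_runs), pct_profitable(random_runs))
```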
The results that indicate a potential problem with the GP process are highlighted below in red. The third test case has the most problems. Here, the average OOS net profit of the fittest member in the evolved population was less than the average OOS net profit of the evolved population. This suggests that the fitness was not a good indication of future performance, which seems at odds with the fact that the OOS net profit significantly improved during the build process. Moreover, the average OOS net profit of the fittest member in test case three was actually lower than the average OOS net profit of the fittest member among the random populations. Lastly, there were fewer builds where the fittest member was profitable OOS for the evolved populations than for the random populations. All of these results suggest that, although the build process improved upon the initial population, it would be difficult to use the results to choose a good strategy going forward.
[4] Average value plus/minus one standard deviation; N = 25 in all cases.
What it All Means
The pretest was designed to determine whether or not the GP process was capable of finding and exploiting information in the in-sample period that could be generalized to the out-of-sample (OOS) period. The good news is that in all four test cases, the pretest results suggest that this is in fact the case. The average OOS net profit of the evolved populations was significantly higher than the average OOS net profit of the randomly generated populations in each case.
The bad news is that just because the GP process improves upon the initial population, as measured by the OOS net profit, doesn't mean the resulting strategies are tradable or even profitable. It's just difficult to find good strategies for some markets, as evidenced by the average results for 1 minute bars of the E-mini S&P and 30 minute bars of the EURUSD. In these cases, not only was the average net profit negative, but so was the net profit of the fittest member. However, at least for the first two test cases, the trends were in the right direction. The fact that the average net profit for the fittest member was greater than the average net profit suggests a positive correlation between in-sample and OOS results, which, in turn, suggests it makes sense to choose the strategy based on how well it does historically. Even though the second case didn't produce profitable OOS results based on the fittest member, given that the trends are good, it might be possible to use a larger population with more generations to achieve better results.
The third test case, as discussed in the prior section, presents a counter-example. In this case, although the GP process significantly improved the average OOS net profit, the lack of correlation between the fitness and the OOS net profit suggests the historical results would not be a good guide in selecting a member of the evolved population for trading forward. However, the fact that the GP process was successful in improving the OOS results overall raises the question: why is the fitness not a good indication of OOS performance if the build process, which is based on fitness, was generally successful? The answer may be that the fittest member of the population is too fit. In other words, while the population overall improved during the generations of GP, the fittest member may have been over-fit to the in-sample data. By being over-fit, it no longer generalized well to the unseen (OOS) data.
This is supported by the fact that in the random populations, the fittest strategy was always better OOS than the average strategy, which suggests a positive correlation between fitness and OOS performance among the randomly generated strategies. Assuming the in-sample data contains information that can be exploited and generalized to the OOS period, which has already been demonstrated, this is what would be expected when the strategies are not over-fit, which is most likely the case for the randomly generated populations. Recall that the third test case differs from the second one, which was not over-fit, only in terms of the settings, not in the market data. This suggests that if the strategies are being over-fit, it may only be necessary to change the settings and try again.
The quality of the results for the fourth test case (EURUSD) lies somewhere between that of the first and third test cases. Because the average OOS net profit of the fittest member was not better than the average OOS net profit, it's possible that the fittest member is over-fit.
The value of performing a pretest prior to performing a full GP build is that it provides some indication of whether the intended effort is likely to pay off. While you could always forge ahead and verify everything with real-time tracking or a separate set of out-of-sample data, without knowing whether or not the GP process is inherently well suited to the market and settings, it may take much longer to infer that the process is not working correctly.
*This article appeared in the April 2012 issue of the Adaptrade Software newsletter.
HYPOTHETICAL OR SIMULATED PERFORMANCE RESULTS HAVE CERTAIN INHERENT LIMITATIONS. UNLIKE AN ACTUAL PERFORMANCE RECORD, SIMULATED RESULTS DO NOT REPRESENT ACTUAL TRADING. ALSO, SINCE THE TRADES HAVE NOT ACTUALLY BEEN EXECUTED, THE RESULTS MAY HAVE UNDER- OR OVER-COMPENSATED FOR THE IMPACT, IF ANY, OF CERTAIN MARKET FACTORS, SUCH AS LACK OF LIQUIDITY. SIMULATED TRADING PROGRAMS IN GENERAL ARE ALSO SUBJECT TO THE FACT THAT THEY ARE DESIGNED WITH THE BENEFIT OF HINDSIGHT. NO REPRESENTATION IS BEING MADE THAT ANY ACCOUNT WILL OR IS LIKELY TO ACHIEVE PROFITS OR LOSSES SIMILAR TO THOSE SHOWN.
Copyright (c) 2004-2019 Adaptrade Software. All rights reserved.