Strategy Quality Metrics Improve Out-of-Sample Results
by Michael R. Bryant
When optimizing a trading strategy, achieving good results on the so-called in-sample data on which the strategy is optimized is rarely a problem. You simply keep optimizing until you like the results. However, the ultimate goal is profitable trading, not results that look great in hindsight. The real goal in optimizing a trading strategy or in building a new strategy should be to select parameter values and/or elements of the strategy logic that generalize well to new data. In other words, we want the results we saw on the in-sample data to hold up on data that was not used in the optimization or strategy building process.
This so-called out-of-sample data is typically a recent segment of data that's large enough to provide a sufficiently large sample of trades for validation purposes. Real-time tracking, in which the strategy is tested in real-time using your trading platform's trading simulator, can also provide out-of-sample results. Of course, live trading results are also out-of-sample. Nonetheless, setting aside a segment of data for out-of-sample testing prior to optimization is often preferred as it can take months to test a strategy in real time, depending on how frequently it trades.
In previous articles, I've discussed different factors that can affect how well a strategy performs out-of-sample (training and selection bias, over-fitting), as well as a method to assess the likelihood of poor out-of-sample performance and various methods to prevent it (monitoring the optimization process, stress testing, multi-market techniques). This article falls into the last category, but instead of proposing changes to the optimization or build process itself, as in the prior articles, this article focuses on something simpler: selecting the metrics used to define the optimization's objective function.
Strategy Quality Metrics
When optimizing the parameter values for a trading strategy or when building a trading strategy using an automated tool like Adaptrade Builder, it's necessary to define what you mean by "optimal". A simple definition would be something like "find the parameter values that maximize net profit". In some cases, this may be sufficient, but it's also possible that metrics other than the most obvious ones, such as net profit, may provide better results. One alternative is what I call "strategy quality" metrics.
What do I mean by a strategy quality metric? Generally speaking, I define strategy quality metrics as measures of strategy performance that are non-dimensional and largely independent of the scale of trading.
Simple metrics such as net profit, average trade size, and drawdown don't fit these criteria. For the most part, these criteria imply that we want metrics based on ratios and other non-dimensional quantities. While a number of metrics meet these requirements, this article will focus on the metrics shown below in Table 1.
Table 1. Strategy Quality Metrics.
The table also includes a threshold value for each metric, intended to represent the value above which a strategy is considered to be high-quality. These values are entirely subjective. For example, 60% is suggested by Zamansky and Stendahl [1] as the threshold for good quality entries and exits.
The correlation coefficient effectively measures the consistency of the strategy results across different periods. The trade significance takes into account both the average trade size and its standard deviation. It's basically a measure of whether the average trade is large enough to be statistically different from zero given its standard deviation. While this is not a valid measure of the statistical significance of the trading strategy when the strategy has been optimized, it's nonetheless a useful measure of the quality of the trade distribution of the strategy.
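The two measures just described can be sketched in code. This is an illustrative computation, not Adaptrade's exact formulas: trade significance is approximated here as a one-sided normal test on the mean trade P&L, and the correlation coefficient as the Pearson correlation between the cumulative equity curve and the trade number.

```python
import math

def trade_significance(pnl):
    """Approximate one-sided confidence (0-1) that the mean trade P&L
    exceeds zero, using a normal approximation to the t-statistic.
    (A simplification; the exact formula used by Builder may differ.)"""
    n = len(pnl)
    mean = sum(pnl) / n
    var = sum((x - mean) ** 2 for x in pnl) / (n - 1)
    if var == 0:
        return 1.0 if mean > 0 else 0.0
    t = mean / math.sqrt(var / n)
    # Standard normal CDF expressed via the error function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def equity_correlation(pnl):
    """Pearson correlation between the cumulative equity curve and the
    trade number -- a proxy for consistency of results over time."""
    equity, total = [], 0.0
    for x in pnl:
        total += x
        equity.append(total)
    n = len(equity)
    xs = list(range(1, n + 1))
    mx, my = sum(xs) / n, sum(equity) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, equity))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in equity))
    return cov / (sx * sy) if sx and sy else 0.0
```

A steadily rising equity curve yields a correlation near 1.0, while an erratic one scores lower, which is what makes the metric a consistency measure.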
Average entry and exit efficiency measure how effectively the strategy captures the potential profit of trades. The Kelly fraction is typically used for position sizing. Higher values generally indicate that greater leverage can be used via more aggressive position sizing, which is an indication of a higher quality strategy. The DoF ratio depends on the number of trades, so this is more of a measure of whether there is sufficient test data than of the strategy itself. Nonetheless, it's a useful measure of the quality of the strategy results.
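These metrics can be sketched as follows, using common definitions (Adaptrade's exact formulas may differ): entry/exit efficiency measured against the trade's full price range, and the Kelly fraction from the standard formula f = W - (1 - W)/R, where W is the win rate and R the ratio of average win to average loss.

```python
def entry_efficiency(entry, high, low, is_long=True):
    """Fraction of the trade's price range captured by the entry.
    For a long trade, entering at the low of the range scores 100%."""
    rng = high - low
    if rng == 0:
        return 1.0
    return (high - entry) / rng if is_long else (entry - low) / rng

def exit_efficiency(exit_price, high, low, is_long=True):
    """Fraction of the trade's price range captured by the exit.
    For a long trade, exiting at the high of the range scores 100%."""
    rng = high - low
    if rng == 0:
        return 1.0
    return (exit_price - low) / rng if is_long else (high - exit_price) / rng

def kelly_fraction(pnl):
    """Kelly fraction f = W - (1 - W) / R from a list of trade P&Ls."""
    wins = [x for x in pnl if x > 0]
    losses = [-x for x in pnl if x < 0]
    if not wins or not losses:
        return 1.0 if wins else 0.0
    w = len(wins) / len(pnl)
    r = (sum(wins) / len(wins)) / (sum(losses) / len(losses))
    return w - (1 - w) / r
```

For example, a long entry at 102 when the trade's range spans 100 to 110 has an entry efficiency of 80%, and a strategy that wins half its trades with wins twice the size of losses has a Kelly fraction of 0.25.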
Some metrics that could have been included in the table above but were excluded are Tharp's System Quality Number (SQN) [4], the Sharpe ratio, the Sortino ratio, the MAR ratio, risk of ruin, and total efficiency, among others. For some of these, choosing a universal threshold value is difficult, whereas others evaluate differently under variable position sizing versus fixed size trading. The risk of ruin is almost always close to zero for most decent strategies, so it tends to be uninformative. Ultimately, the set of metrics listed in Table 1 is subjective.
To use the strategy quality metrics, it's helpful to be able to combine them into a composite measure of strategy quality. This can be done by simply adding up the number of metrics for which the value exceeds the threshold value. For example, if the profit factor exceeds (or equals) 1.7, a "1" is added to the sum. The same is done for each metric in the table. The result is an integer score between 0 and 7 representing the overall strategy quality, with 0 representing the lowest quality and 7 the highest quality.
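The composite scoring described above can be written as a short function. Note that only the profit factor (1.7), trade significance (95%), and entry/exit efficiency (60%) thresholds appear in this article; the remaining thresholds below are illustrative placeholders, not the values from Table 1.

```python
# Thresholds for the seven strategy quality metrics. Only the first four
# values are stated in the article; the last three are assumed here.
THRESHOLDS = {
    "profit_factor":      1.7,
    "trade_significance": 0.95,
    "entry_efficiency":   0.60,
    "exit_efficiency":    0.60,
    "correlation":        0.90,  # assumed
    "kelly_fraction":     0.10,  # assumed
    "dof_ratio":          5.0,   # assumed
}

def quality_score(metrics):
    """Count how many metrics meet or exceed their thresholds (0-7)."""
    return sum(1 for name, thr in THRESHOLDS.items()
               if metrics.get(name, float("-inf")) >= thr)
```

A strategy that clears every threshold scores 7; one that clears only, say, the profit factor threshold scores 1.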
Testing the Inclusion of Strategy Quality Metrics in a Build Process
To test whether or not including strategy quality metrics in the build process produces better out-of-sample results, two test cases were run: (1) building strategies with net profit as the only build objective, and (2) building strategies with net profit as the build objective plus a build condition for each of the seven strategy quality metrics.
To run the tests, Adaptrade Builder was used to create a population of 500 strategies, which were evolved over 40 generations. The strategies were built for the futures symbol @TFS#C (IQFeed, mini Russell 2000 futures), 60 minute bars, over the date range 2/25/2008 to 2/23/2018. Trading costs of $15 per round turn were deducted from each trade, and one contract per trade was specified. The training segment was set to 70% of the date range, with 30% assigned to the validation segment. No test segment was used.
For those unfamiliar with Adaptrade Builder, it constructs trading strategies by combining indicators, price patterns, and other trading logic options so that the resulting strategies meet the user's performance specifications and requirements. It's based on a kind of machine learning algorithm called genetic programming (GP); see, for example, this white paper on GP for trading. One of the key aspects of this approach is that only the training data is used in the GP optimization process. The validation data is not examined or used in any way by the optimization process. The results on the validation segment don't benefit from hindsight and are therefore unbiased and out-of-sample.
The GP algorithm is a type of optimization algorithm in which the objective function is based on the metrics selected by the user. For the two test cases shown above, the following metrics were used:
Test Case 1:
Maximize Net Profit, weight 1.000
Test Case 2:
Maximize Net Profit, weight 1.000
Trade Sig >= 95.0000% (Train)
Test case 2 is therefore the same as test case 1 except for the addition of the build conditions. Each build condition consists of one of the strategy quality metrics being above its threshold value. The objective function for the optimization combines the build objectives and the build conditions, if any. The build process, based on the GP optimization, then works to evolve a population of strategies that maximizes the objective function. For test case 2, this means the process worked to maximize the net profit while meeting the build conditions. It's important to understand that, in general, there is no guarantee that the build conditions will be met. It may be impossible, given the market, to meet all the conditions simultaneously. Nonetheless, the conditions steer the solution towards strategies that come as close as possible to meeting all the conditions.
The two test cases were designed to test the hypothesis that adding strategy quality metrics to the build process improves the out-of-sample results. For the results to support this hypothesis, we should find that the out-of-sample results are better in case 2 than in case 1. So what happened? The results are summarized below in Table 2.
Table 2. Test results show the average in-sample (IS) strategy profit, average out-of-sample (OOS) strategy profit, the number of profitable strategies in the out-of-sample segment, and the sum of the quality scores for the strategies in the out-of-sample segment for each test case. Despite lower average profits in-sample, the out-of-sample results for test case 2 are better in each column.
The population results from each test case were copied to a spreadsheet for analysis. The average strategy net profit over all 500 population members was computed for each test case for both the in-sample (IS, training) and out-of-sample (OOS, validation) segments. In addition, the number of profitable strategies in the population for the OOS segment was counted for each test case. Lastly, the quality scores were calculated for each strategy using the OOS results, and the scores were summed over all 500 strategies for each test case.
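The aggregation described above is also straightforward to script rather than perform in a spreadsheet. A minimal sketch, assuming each strategy's results are available as a dict with hypothetical keys for its in-sample profit, out-of-sample profit, and out-of-sample quality score:

```python
def summarize_population(population):
    """Compute the Table 2 summary statistics for a population of
    strategies. Each strategy is a dict with (assumed) keys
    'is_profit', 'oos_profit', and 'oos_quality' (score 0-7)."""
    n = len(population)
    return {
        "avg_is_profit":  sum(s["is_profit"] for s in population) / n,
        "avg_oos_profit": sum(s["oos_profit"] for s in population) / n,
        "num_oos_profitable": sum(1 for s in population
                                  if s["oos_profit"] > 0),
        "total_oos_quality":  sum(s["oos_quality"] for s in population),
    }
```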
As seen in Table 2, the average out-of-sample net profit for strategies in test case 2 was nearly double that of test case 1. In addition, the number of strategies with profitable OOS results was higher in test case 2. Lastly, the total quality score of strategies in the OOS segment of test case 2 was more than six times higher than in test case 1. To check if the better OOS results in test case 2 were due to a favorable population, the in-sample results were also examined. As shown in the table, the average in-sample net profit for strategies in test case 1 was actually higher than in test case 2, suggesting that, if anything, the strategies in test case 1 benefited from a favorable bias.
Intuitively, it seems reasonable that strategies that are designed to be of higher quality should generalize better to new data than strategies that are of lower quality. However, given the complexity of strategy design and testing and, in particular, the tendency of optimization to cause over-fitting, this was not entirely a foregone conclusion. Fortunately, the results shown above support this idea. Compared to a population of strategies built without considering strategy quality, as represented by seven so-called strategy quality metrics, the population that took strategy quality into account achieved notably better results in out-of-sample testing.
It's also interesting to note that not only were the strategies more profitable in the out-of-sample segment when the strategy quality metrics were included in the objective function, but the resulting strategies also displayed much better strategy quality metric values in that segment. In this regard, it appears that you get what you ask for: by asking for higher quality results in-sample, we obtained higher quality results out-of-sample as well.
It's possible that the higher in-sample profits for test case 1, combined with its lower out-of-sample profits, are due to over-fitting. Conceivably, it's easier to fit the noise of the in-sample segment when you're only trying to maximize net profit. The additional constraints imposed by the strategy quality metrics in test case 2 may have made it more difficult to fit the random elements (i.e., noise) of the price data.
The reader may be wondering why test case 2 didn't forgo net profit entirely and simply try to meet the seven build conditions by themselves. The potential problem with that approach is that a high strategy quality score can be achieved with a low net profit. Even with only a handful of trades, you could conceivably have a quality score of 4 or 5 (i.e., perhaps all conditions but Trade Significance and DoF Ratio would be satisfied). Since having a high net profit is fundamental to the success of any trading strategy, it's better to add conditions for strategy quality to net profit than to rely on strategy quality alone.
Good luck with your trading.
The results described in this article were generated using Adaptrade Builder version 2.4.1.
This article appeared in the February 2018 issue of the Adaptrade Software newsletter.
HYPOTHETICAL OR SIMULATED PERFORMANCE RESULTS HAVE CERTAIN INHERENT LIMITATIONS. UNLIKE AN ACTUAL PERFORMANCE RECORD, SIMULATED RESULTS DO NOT REPRESENT ACTUAL TRADING. ALSO, SINCE THE TRADES HAVE NOT ACTUALLY BEEN EXECUTED, THE RESULTS MAY HAVE UNDER- OR OVER-COMPENSATED FOR THE IMPACT, IF ANY, OF CERTAIN MARKET FACTORS, SUCH AS LACK OF LIQUIDITY. SIMULATED TRADING PROGRAMS IN GENERAL ARE ALSO SUBJECT TO THE FACT THAT THEY ARE DESIGNED WITH THE BENEFIT OF HINDSIGHT. NO REPRESENTATION IS BEING MADE THAT ANY ACCOUNT WILL OR IS LIKELY TO ACHIEVE PROFITS OR LOSSES SIMILAR TO THOSE SHOWN.
If you'd like to be informed of new developments, news, and special offers from Adaptrade Software, please join our email list. Thank you.
Copyright (c) 2004-2019 Adaptrade Software. All rights reserved.