Backtesting with strand

Jeff Enos and David Kane

Introduction

Note: this document assumes familiarity with the notion of investment strategy backtesting. For an introduction, see the article Backtests in R-News Volume 7/1.

Evaluating an investment strategy is a multi-step process. Often the first step is to scrutinize a strategy’s underlying signal, or alpha, by running a top-bottom quartile spread analysis using a tool like the R package backtest. A quartile analysis gives a good idea as to whether the signal is predictive of future returns. However, the analysis ignores many real-world aspects of strategy implementation and trading. For example, in a spread analysis it is assumed that we can trade immediately, regardless of actual liquidity. As a result, it is difficult to learn from a spread analysis how the performance of a strategy degrades as investment capital and portfolio size increase. In a sophisticated backtest more of an attempt is made to mimic how the strategy would be implemented in practice. This includes using optimization-based portfolio construction and trade selection, and making conservative assumptions about what trades could actually have been made in the market.

The strand package provides a framework for running this more realistic type of backtest. Once a strategy is defined in terms of its alpha, risk constraints, and position and turnover limits, the system simulates how the strategy would be operated day-by-day, including daily order generation and realistic trade filling.

The purpose of this vignette is to describe how to set up and run a strand simulation.

System overview

The strand system is meant to mimic a daily professional-level portfolio management process. The process involves the following steps:

Running a backtest in strand follows the same process. First, data is prepared and organized for input into the system. Second, the strategy is defined in a configuration file. Finally, we run the simulation and analyze results. In the next few sections we’ll cover each of these major steps in turn.

Configuration

All setup and configuration in strand is accomplished by filling in a yaml-format configuration file. The backtest in this vignette, for example, will use the file vignettes/sample.yaml. The configuration file contains two major sections. The strategies section contains entries for the strategy’s alpha, any portfolio construction constraints, position size constraints, etc. (Note that it is possible to operate multiple strategies in the same simulation, but this vignette will only cover a simulation with one strategy.) The simulator section contains the location of input data, input data column mappings, and simulator options. There are also several top-level settings that control aspects of the simulation.

Preparing input data

Three types of input data are needed to run a strand simulation:

  1. A security reference, or listing, of the securities in the backtesting universe.
  2. Daily alpha and factor inputs for each stock in the universe.
  3. Daily market data for each stock in the universe.

This data can be provided to the simulator directly using objects stored in memory, or the simulator can read the data from binary files stored in feather format on disk.

In this vignette we will use the package’s sample data sets as input and supply them as objects in memory. For a discussion of how to use files for input data, see Appendix: file-based inputs.

The package’s sample data set includes value and size factors, as well as pricing information, for most of the stocks in the S&P 500 for the period June-August 2020. The source of fundamental data is EDGAR, and all pricing data was downloaded using the Tiingo Stock API.

Loading required packages

Two packages are required to work through the code in this vignette: strand and dplyr:

library(strand)
library(dplyr)

Security reference

The security reference specifies static information about each security in the backtest. Security reference data must include at least the following columns:

Below is a listing of a few rows and columns from the sample_secref data set included with the package:

data(sample_secref)

sample_secref %>%
  select(id, name, sector) %>%
  head()
#>     id                name                 sector
#> 1  MMM          3M Company            Industrials
#> 2  ABT Abbott Laboratories            Health Care
#> 3 ABBV         AbbVie Inc.            Health Care
#> 4 ABMD         ABIOMED Inc            Health Care
#> 5  ACN       Accenture plc Information Technology
#> 6 ATVI Activision Blizzard Communication Services

Shown are three columns, all of class character: id, name, and one category column sector. In sample.yaml we indicate that this data will be passed to the simulator as an object by configuring /simulator/secref_data as follows:

simulator:
  secref_data:
    type: object

Alpha and factor inputs

Alpha and factor inputs are used in the daily portfolio construction process. Alpha and factor inputs for each day must contain at least the following columns:

The /simulator/input_data section in sample.yaml indicates that, like security reference data, alpha and factor inputs will be supplied to the simulation as an object:

simulator:
  input_data:
    type: object

We will be using the sample_inputs data set included with the package as the alpha and factor input for our simulation:

data(sample_inputs)

sample_inputs %>%
  filter(date %in% as.Date("2020-06-01")) %>%
  select("date", "id", "rc_vol", "size", "value") %>%
  head()
#>         date   id      rc_vol       size      value
#> 1 2020-06-01    A   195700553  0.3103556 -0.4425451
#> 2 2020-06-01  AAL   693431274 -1.7615180 -2.9023557
#> 3 2020-06-01  AAP   147177014 -0.8766498  0.1520585
#> 4 2020-06-01 AAPL 10830836279  2.6736936 -1.1883389
#> 5 2020-06-01 ABBV  1207185748  1.4392005 -1.8642855
#> 6 2020-06-01  ABC   119776226 -0.1054613 -0.4033034

The first column, date, contains the date for which the data should be used to generate orders in the simulation. The second and third columns are the id and rc_vol columns described above. The column value is a numeric column that serves as the signal for the backtest run in this vignette. The size column contains a numeric factor that is used in a portfolio construction constraint.

Note regarding date semantics for inputs data: The data for 2020-06-01 is used for constructing the portfolio and generating orders for trading on that day. As a result, it is assumed that this data is known before trading begins on 2020-06-01 (e.g., data as-of the close on 2020-05-29). How the strategy operates and any expectations around the timeliness of data delivery should dictate what is used as inputs for a given date.

Market data

Market data is used to value positions, calculate portfolio performance, and simulate trade fills in the backtest. In the current version of the package, all prices and market values are assumed to be in a single reference currency. Market data for each date must contain at least the following columns:

Below is the /simulator/pricing_data section of the sample.yaml configuration file:

simulator:
  pricing_data:
    type: object
    columns:
      close_price: price_unadj
      prior_close_price: prior_close_unadj
      adjustment_ratio: adjustment_ratio
      volume: volume
      dividend: dividend_unadj
      distribution: distribution_unadj

As with security and inputs data, the entry type: object indicates that pricing data will be supplied to the simulation as an object.

The /simulator/pricing_data/columns section allows us to map columns in our pricing data to columns the backtester expects to be present. For example, the entry

    columns:
      close_price: price_unadj

indicates that there is a column price_unadj in the data that should be treated by the system as the required close_price column.
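To make the mapping concrete, the following is a minimal base-R sketch of the renaming that such a columns section describes. It is only an illustration, not the package's internal code: the objects column_map, raw_pricing, and mapped_pricing are hypothetical names, and the two data rows are copied from the package's sample pricing data for 2020-06-01.

```r
# Hypothetical mapping that mirrors part of the columns section of sample.yaml:
# names are the columns the simulator expects, values are the columns in our data.
column_map <- c(close_price       = "price_unadj",
                prior_close_price = "prior_close_unadj",
                dividend          = "dividend_unadj",
                distribution      = "distribution_unadj")

# A two-row stand-in for the pricing data.
raw_pricing <- data.frame(id                 = c("A", "AAL"),
                          price_unadj        = c(89.91, 11.11),
                          prior_close_unadj  = c(88.14, 10.50),
                          dividend_unadj     = c(0, 0),
                          distribution_unadj = c(0, 0))

# Rename each mapped column to the name the simulator expects.
mapped_pricing <- raw_pricing
idx <- match(column_map, names(mapped_pricing))
names(mapped_pricing)[idx] <- names(column_map)

names(mapped_pricing)
```

Columns that already carry the expected name, such as volume, would map to themselves and need no renaming.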

We will be using the sample_pricing data set included with the package as the pricing data for our simulation. Below are a few rows of this dataset:

data(sample_pricing)

sample_pricing %>%
  filter(date %in% as.Date("2020-06-01")) %>%
  select("date", "id", "price_unadj", "prior_close_unadj", "adjustment_ratio",
         "volume", "dividend_unadj", "distribution_unadj") %>%
  head()
#>         date   id price_unadj prior_close_unadj adjustment_ratio   volume
#> 1 2020-06-01    A       89.91             88.14                1  2477600
#> 2 2020-06-01  AAL       11.11             10.50                1 50681600
#> 3 2020-06-01  AAP      139.79            139.32                1   840953
#> 4 2020-06-01 AAPL      321.85            317.94                1 20254653
#> 5 2020-06-01 ABBV       90.70             92.67                1  8483100
#> 6 2020-06-01  ABC       95.15             95.34                1  1012600
#>   dividend_unadj distribution_unadj
#> 1              0                  0
#> 2              0                  0
#> 3              0                  0
#> 4              0                  0
#> 5              0                  0
#> 6              0                  0

As per the configuration file, the price_unadj, prior_close_unadj, dividend_unadj, and distribution_unadj columns will be treated by the simulator as the close_price, prior_close_price, dividend, and distribution columns, respectively. There are no dividends or distributions for the securities shown, so these values are 0. There are no splits or other changes to the securities’ adjustment basis, so all adjustment_ratio values are 1.

Note regarding date semantics for pricing/market data: The pricing data for date 2020-06-01 in the simulator is data as-of 2020-06-01. Therefore, some of the data could not have been known until the end of that day. For example, the close_price and volume data items are measured at the end of the trading day on 2020-06-01. Note that these semantics are in contrast with the date semantics for alpha/factor input data, where all data is assumed to have been known before trading begins on the data date.

Strategy specification

All strategy settings, including input signal and exposure constraints, are specified in the yaml configuration file. Below are the key strategy settings from the sample.yaml file:

strategies:
  strategy_1:
    in_var: value
    strategy_capital: 1e6
    ideal_long_weight: 1
    ideal_short_weight: 1
    position_limit_pct_lmv: 1
    position_limit_pct_smv: 1
    position_limit_pct_adv: 30
    trading_limit_pct_adv: 5
    constraints:
      size:
        type: factor
        in_var: size
        upper_bound: 0.01
        lower_bound: -0.01
      sector:
        type: category
        in_var: sector
        upper_bound: 0.02
        lower_bound: -0.02
turnover_limit: 25000
target_weight_policy: half-way

In this section we’ll be discussing each of the entries listed above. Note that we are only working with a single strategy in this vignette (although it is possible to use strand to run multiple strategies in a single simulation). This strategy is called strategy_1. The settings specific to strategy_1 all fall under the /strategies/strategy_1 entry above.

Strategy alpha

We set the input alpha for strategy_1 by setting the in_var parameter:

strategies:
  strategy_1:
    in_var: value

The above indicates that in the objective function for our optimization we are maximizing the exposure of our portfolio to value. As we saw in the previous section, value is one of the columns in our inputs data.

Market value constraints

The market value of the portfolio is controlled by the following section of the configuration file:

strategies:
  strategy_1:
    strategy_capital: 1e6
    ideal_long_weight: 1
    ideal_short_weight: 1

The strategy_capital setting controls the amount of capital allocated to the strategy. The ideal_long_weight and ideal_short_weight parameters control the leverage for the long and short sides, respectively. The target long market value is the product of the strategy_capital and ideal_long_weight parameters, while the target short market value is -1 times the product of the strategy_capital and ideal_short_weight parameters. In our example the ideal long and short leverage values are both 1 with a strategy capital of $1mm. This means our target long market value is $1mm and target short market value is -$1mm.
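The arithmetic above can be written out as a short sketch. The variable names mirror the configuration keys, while target_lmv, target_smv, and target_gmv are hypothetical names used only for this illustration.

```r
# Values from the strategy_1 entry in sample.yaml.
strategy_capital   <- 1e6
ideal_long_weight  <- 1
ideal_short_weight <- 1

# Target long and short market values as described in the text.
target_lmv <- strategy_capital * ideal_long_weight        #  1,000,000
target_smv <- -1 * strategy_capital * ideal_short_weight  # -1,000,000

# Gross target: long market value plus the absolute short market value.
target_gmv <- target_lmv + abs(target_smv)                #  2,000,000
```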

Position size constraints

Position sizes are controlled by the parameters position_limit_pct_lmv and position_limit_pct_smv:

strategies:
  strategy_1:
    position_limit_pct_lmv: 1
    position_limit_pct_smv: 1

These parameters express limits on the size of positions as a percentage of the target long and short market values for the portfolio (calculated in the previous example). In our example both are set to 1, which means that the maximum size for a long position is 1% times $1mm = $10,000, and the maximum size for a short position is 1% times -$1mm = -$10,000.

Liquidity constraints

There are two ways in which the rc_vol measure in our inputs data is used to impose liquidity constraints on our portfolio construction process.

First, the position_limit_pct_adv parameter is used to impose a position size constraint in addition to the constraints discussed in the previous section. This parameter limits the position size, in absolute value, to a percentage of the rc_vol measure in the current day’s input data. In sample.yaml we have:

strategies:
  strategy_1:
    position_limit_pct_adv: 30

which means a position in our simulation can be no greater than 30% of our rc_vol value. For example, suppose the average volume measure for security ABC is $10mm. This means that the liquidity constraint imposes a limit on the size of long positions in ABC of $3mm and a limit on the size of short positions of -$3mm.

Second, the trading_limit_pct_adv parameter limits the size of the order that can be generated for a stock. The idea is to size orders to be in line with the amount of trading expected in the market and any limitation we are planning on imposing on participation. It’s difficult to control exposures effectively if we don’t do this! In the example configuration file, we have:

strategies:
  strategy_1:
    trading_limit_pct_adv: 5

which means that we can buy at most 5% or sell at most 5% of a security’s rc_vol measure on a given day. Continuing our example above, if our measurement on a given day is that ABC is trading on average $10mm per day, we can buy at most $500,000 or sell at most $500,000 on that day.
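As a rough sketch of how these limits interact for the ABC example, the snippet below computes each limit and takes the tighter of the two position limits. Taking the minimum is an assumption made here for illustration (both limits are described as applying together). All variable names other than the configuration keys are hypothetical.

```r
# Example inputs: ABC's average volume measure and the target long market value.
rc_vol     <- 10e6
target_lmv <- 1e6

# Settings from sample.yaml (percentages).
position_limit_pct_lmv <- 1
position_limit_pct_adv <- 30
trading_limit_pct_adv  <- 5

# Position limit implied by each constraint.
limit_from_lmv <- target_lmv * position_limit_pct_lmv / 100  # 10,000
limit_from_adv <- rc_vol * position_limit_pct_adv / 100      # 3,000,000

# Assumption: the effective long position limit is the tighter of the two.
max_long_position <- min(limit_from_lmv, limit_from_adv)     # 10,000

# Largest order (buy or sell) allowed for ABC on this day.
max_order <- rc_vol * trading_limit_pct_adv / 100            # 500,000
```

With these settings the percentage-of-market-value limit binds for a liquid name like ABC. The liquidity-based limit only becomes the tighter one for securities whose rc_vol is below roughly $33,000.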

Factor constraints

Factor constraints limit the amount of exposure we can have in our optimization to a given numeric value. That is, we impose an upper and/or lower bound on the sum, across all positions, of the product of each position’s weight and its value for the factor. In the context of exposure constraints, a position weight means the signed market value of a position divided by the strategy’s strategy_capital value.

In this vignette’s example we impose factor constraints on size:

strategies:
  strategy_1:
    constraints:
      size:
        type: factor
        in_var: size
        upper_bound: 0.01
        lower_bound: -0.01

Recall that size must be a column present in the daily inputs data. Each constraint is configured in a separate entry in the constraints section that contains the following key/value pairs:

In our example we are limiting exposure to size to be within +/-1%.
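The following sketch shows one way to compute a factor exposure and check it against the bounds, assuming exposure is the sum over positions of position weight times factor value. The four positions, their market values, and their size values are made up for illustration.

```r
# Hypothetical four-position portfolio (signed market values, made-up size values).
positions <- data.frame(id           = c("S1", "S2", "S3", "S4"),
                        market_value = c(10000, 8000, -9000, -9500),
                        size         = c(0.5, -0.2, 0.3, -0.4))
strategy_capital <- 1e6

# Position weight: signed market value divided by strategy capital.
weight <- positions$market_value / strategy_capital

# Exposure to size: sum of weight times factor value (about 0.0045 here).
size_exposure <- sum(weight * positions$size)

# Check against the bounds configured in sample.yaml.
within_bounds <- size_exposure >= -0.01 && size_exposure <= 0.01
```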

Category exposure constraints

Category exposure constraints are similar to factor exposure constraints. A category constraint imposes a limit on the exposure (i.e., the sum of the position weights) within each level of a category. In our example we have a single constraint on sector:

strategies:
  strategy_1:
    constraints:
      sector:
        type: category
        in_var: sector
        upper_bound: 0.02
        lower_bound: -0.02

Here, sector must be a column that appears in the security reference or the simulation’s input data. As with factor constraints, the category constraint is defined in its own entry in the constraints section of the configuration file for strategy_1 with the following key/value pairs:

There are 11 levels in sector in our security reference:

data(sample_secref)
sample_secref %>%
  group_by(sector) %>%
  summarise(count = n()) %>%
  print(n = Inf)
#> # A tibble: 11 x 2
#>    sector                 count
#>    <chr>                  <int>
#>  1 Communication Services    24
#>  2 Consumer Discretionary    58
#>  3 Consumer Staples          31
#>  4 Energy                    26
#>  5 Financials                62
#>  6 Health Care               62
#>  7 Industrials               71
#>  8 Information Technology    71
#>  9 Materials                 28
#> 10 Real Estate               31
#> 11 Utilities                 28

The constraint above indicates that in our optimization we may have no more than +/-2% of exposure in any one of these levels.
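A category exposure check can be sketched in the same spirit. The positions below are made up, and sector_exposure and violations are hypothetical names. The sketch sums position weights within each sector and flags any level outside +/-2%.

```r
# Hypothetical positions with signed market values.
positions <- data.frame(
  id           = c("S1", "S2", "S3", "S4", "S5"),
  sector       = c("Energy", "Energy", "Financials", "Financials", "Utilities"),
  market_value = c(10000, -4000, 9000, -10000, 8000))
strategy_capital <- 1e6

# Position weight: signed market value divided by strategy capital.
positions$weight <- positions$market_value / strategy_capital

# Exposure within each level of sector: the sum of the position weights.
sector_exposure <- tapply(positions$weight, positions$sector, sum)

# Levels whose exposure falls outside the +/-2% bounds (none in this example).
violations <- sector_exposure[abs(sector_exposure) > 0.02]
```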

Turnover limit

The top-level configuration entry turnover_limit: 25000 imposes a fixed turnover constraint on the optimization. This constraint means that, if the reference currency is USD, the most that can be traded on a single day is $25,000.

Note that the system will be allowed to trade more than the turnover limit specified above if the portfolio is significantly over- or under-invested. For example, if the target gross market value of the portfolio (\(T\)) is $1mm, the current gross market value (\(C\)) of the portfolio is $1.2mm, and the turnover limit (\(tl\)) is $25,000, the effective turnover limit will be \(\texttt{max}(tl, |T - C|)\) = $200,000.
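The effective turnover limit formula can be expressed as a one-line helper. The function name effective_turnover_limit is hypothetical and is not part of the strand API.

```r
# Effective turnover limit: the larger of the configured limit and the
# distance between target and current gross market value.
effective_turnover_limit <- function(turnover_limit, target_gmv, current_gmv) {
  max(turnover_limit, abs(target_gmv - current_gmv))
}

# The example from the text: the portfolio is over-invested by $200,000.
effective_turnover_limit(25000, 1e6, 1.2e6)   # 200000

# A portfolio only $10,000 away from target keeps the configured limit.
effective_turnover_limit(25000, 1e6, 1.01e6)  # 25000
```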

Target weight policy

The target_weight_policy top-level configuration item controls how aggressively the ideal long and short weights are targeted during portfolio optimization. Currently there are two allowable values: full and half-way. When set to full, the ideal weights are targeted during optimization. When set to half-way, the optimization uses the midpoint between the current and ideal weight as the target weight. For example, suppose the current portfolio is empty, the ideal long weight is 1, and target_weight_policy is set to half-way. In this case a target long weight of 0.5 will be used during optimization.
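The two policies can be sketched with a small helper. The function target_weight is a hypothetical name used for illustration, and the three-day path assumes, unrealistically, that the portfolio reaches its target weight exactly each day.

```r
# Target weight used in optimization under each policy.
target_weight <- function(current_weight, ideal_weight,
                          policy = c("full", "half-way")) {
  policy <- match.arg(policy)
  if (policy == "full") {
    ideal_weight
  } else {
    # Midpoint between the current and ideal weight.
    (current_weight + ideal_weight) / 2
  }
}

# Empty portfolio, ideal long weight of 1.
target_weight(0, 1, "half-way")  # 0.5
target_weight(0, 1, "full")      # 1

# Assumption: the target is hit exactly each day, so the weight
# approaches the ideal weight geometrically.
w <- 0
path <- numeric(3)
for (day in 1:3) {
  w <- target_weight(w, 1, "half-way")
  path[day] <- w
}
path  # 0.5 0.75 0.875
```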

Constraint loosening

There may be cases where, given the constraints imposed on the portfolio construction process, no solution can be found. In this scenario, the violating constraints are loosened in an attempt to find a solution. This loosening applies to factor and category exposure constraints.

For example, suppose the current exposure to level Industrials of sector is 3%, and we have set a 2% limit on exposure to Industrials. If no solution is found, the optimization is re-run with a limit that is 50% as strict. That is, we set the limit to the midpoint between the current exposure and the constraint limit. In our example, the midpoint between the current exposure and the limit is 2.5%. If no solution is found with the new limit of 2.5%, the constraint is loosened again by 50%, moving the limit to 2.75%. If no solution is found after this second round of loosening, the constraint limit is set to the current exposure, in this case 3%, to ensure that the constraint can be satisfied.
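The loosening schedule in this example can be reproduced with a short sketch. The function loosened_limits is hypothetical. It only lists the sequence of limits that would be tried and does not re-run any optimization.

```r
# Sequence of limits tried when a constraint must be loosened: each round
# moves the limit to the midpoint between the current exposure and the
# previous limit, and the last step sets the limit to the current exposure.
loosened_limits <- function(current_exposure, limit, rounds = 2) {
  limits <- numeric(0)
  for (i in seq_len(rounds)) {
    limit <- (current_exposure + limit) / 2
    limits <- c(limits, limit)
  }
  c(limits, current_exposure)
}

# Industrials example: 3% current exposure, 2% configured limit.
loosened_limits(0.03, 0.02)  # 0.025, 0.0275, 0.03 (up to floating-point rounding)
```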

Simulator settings

In the previous section we discussed how to define the strategy that we want to backtest, in terms of the input signal and portfolio construction constraints. In this section we will cover how to configure different aspects of the backtest that are not related to portfolio construction.

Date range

The top-level items from and to control the starting and ending dates for the backtest:

from: 2020-06-01
to: 2020-08-31

In this vignette we will run a simulation over 3 months of daily data, beginning on June 1, 2020 and ending on August 31, 2020.

Solver

The top-level parameter solver controls which linear optimization toolkit is used to solve the portfolio construction constrained optimization problem:

solver: glpk

Currently there are two options: glpk to use the GNU Linear Programming Kit, and symphony to use the COIN-OR SYMPHONY solver.

Participation rate limit

The simulator setting fill_rate_pct_vol controls what percentage of the observed volume for a security on a given day can be used to fill our order. We call this the fill rate or participation rate for the backtest. In sample.yaml we set this value to 4%:

simulator:
  fill_rate_pct_vol: 4

As an example, suppose on 2020-06-01 that the optimization process for strategy_1 generates an order to buy 5,000 shares of security ABC. But suppose ABC only trades 100,000 shares on that day. Because we have set a 4% limit on participation, strategy_1 will only be able to buy 4,000 shares, despite generating an order to buy 5,000.
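A minimal sketch of this fill logic is shown below. The helper simulate_fill is a hypothetical name, and the treatment of sell orders (the same cap applied to the absolute order size) is an assumption made for the illustration.

```r
# Shares filled for an order, capped at a percentage of the day's market volume.
# Positive order_shares is a buy; negative is a sell (assumed symmetric).
simulate_fill <- function(order_shares, market_volume, fill_rate_pct_vol) {
  max_fill <- market_volume * fill_rate_pct_vol / 100
  sign(order_shares) * min(abs(order_shares), max_fill)
}

# The example from the text: order for 5,000 shares, 100,000 shares traded.
simulate_fill(5000, 100000, 4)   # 4000

# A smaller order is filled completely.
simulate_fill(3000, 100000, 4)   # 3000
```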

Transaction costs

The value of the parameter transaction_cost_pct determines the fixed transaction costs, as a percentage of traded notional, for the strategy. In sample.yaml, we have:

simulator:
  transaction_cost_pct: 0.1

which means that we are charging a transaction cost penalty of 10bps of traded notional. For example, if we trade $10,000 worth of stock ABC on 2019-07-15 we incur a transaction cost of $10 for that day.
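The same calculation as a small helper, where transaction_cost is a hypothetical function name and the cost is assumed to apply to the absolute traded notional regardless of trade direction:

```r
# Fixed transaction cost as a percentage of traded notional.
transaction_cost <- function(traded_notional, transaction_cost_pct) {
  abs(traded_notional) * transaction_cost_pct / 100
}

# $10,000 traded at 0.1% (10bps) costs $10.
transaction_cost(10000, 0.1)
```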

Financing costs

Setting financing_cost_pct controls the financing rate used for the backtest. Financing for day \(t\) is applied to the starting notional of the position and ignores any trading on day \(t\). Financing is calculated using a standard 360-day-count methodology and is triple-charged on Mondays. The financing charge is set to 1% for the vignette’s backtest:

simulator:
  financing_cost_pct: 1

For example, if we have a position in ABC valued at $100,000 at the beginning of Monday, July 15, 2019, we would apply a financing charge of \(3 \times \frac{0.01 \times \$100,000}{360} = \$8.33\) for the position on that day.
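The financing calculation, including the triple charge on Mondays, can be sketched as follows. The function financing_charge is hypothetical, and applying the rate to the absolute starting notional is an assumption, since the text only gives a long-position example.

```r
# Daily financing charge on a position's starting notional, using a
# 360-day count and charging three days on Mondays.
financing_charge <- function(start_notional, financing_cost_pct, date) {
  # wday is 0 for Sunday, 1 for Monday, and so on.
  days <- if (as.POSIXlt(as.Date(date))$wday == 1) 3 else 1
  # Assumption: the charge is based on absolute notional.
  days * abs(start_notional) * (financing_cost_pct / 100) / 360
}

# Monday, July 15, 2019: three days of financing on $100,000 at 1%.
round(financing_charge(100000, 1, "2019-07-15"), 2)  # 8.33

# Tuesday, July 16, 2019: a single day of financing.
round(financing_charge(100000, 1, "2019-07-16"), 2)  # 2.78
```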

Running the simulation

At this point we have covered all of the setup required to run the backtest. We have prepared our data, including the security reference and daily inputs and market data. We have filled in the configuration file to specify the strategy and control different aspects of the simulator. We are ready to run the backtest.

The strand package is implemented using the R6 OOP system. To run the backtest, we create a Simulation object by passing the path to the yaml configuration file and the three data sets discussed above to the constructor. Then we call the method run():

data(sample_inputs)
data(sample_pricing)
data(sample_secref)

sim <- Simulation$new(config = "sample.yaml",
                      raw_input_data = sample_inputs,
                      raw_pricing_data = sample_pricing,
                      security_reference_data = sample_secref)
                      
sim$run()

Viewing summary statistics

When the backtest is finished, we can call methods to summarize and plot the results. For example, the overallStatsDf() method returns a data frame of key statistics:

sim$overallStatsDf()
#>                            Item   Gross       Net
#> 1                     Total P&L -32,355   -40,867
#> 2       Total Return on GMV (%)    -1.8      -2.3
#> 3  Annualized Return on GMV (%)    -6.9      -8.7
#> 4            Annualized Vol (%)     8.9       8.8
#> 5             Annualized Sharpe   -0.77     -0.98
#> 6              Max Drawdown (%)    -6.6      -6.8
#> 7                       Avg GMV         1,976,540
#> 8                       Avg NMV              -716
#> 9                     Avg Count               189
#> 10           Avg Daily Turnover            53,022
#> 11      Holding Period (months)               3.6

This display shows that, gross of transaction and financing costs, the strategy had a total return of -1.8% on gross market value (GMV) and an annualized Sharpe ratio of -0.77. Net of costs, the strategy’s total return was -2.3% with a Sharpe of -0.98. The average gross market value of the portfolio (GMV) was $1,976,540, while the average net market value (NMV) was -$716, close to zero as expected given that the strategy’s target market values are $1mm long, -$1mm short. On average there were 189 positions in the portfolio. The average daily turnover (gross market value of trading) was $53,022, implying a holding period of 3.6 months.

Plotting results

There are several methods that can be used to visualize backtest results.

Portfolio returns

The plotPerformance() method plots gross and net return on GMV over time:

sim$plotPerformance()

Market values

The plotMarketValue() method plots gross market value (GMV), net market value (NMV), long market value (LMV) and short market value (SMV) over time:

sim$plotMarketValue()

Category exposures

The plotCategoryExposure() method shows the exposure over time within each level of a given category. Below we plot the exposure within the levels of sector, which in our backtest has an exposure constraint of +/-2%:

sim$plotCategoryExposure("sector")

Note that there are some cases where the exposure to a level of sector falls outside of +/-2% despite the category exposure constraint we impose during portfolio construction. This can be due to the following:

  • Prices for securities in the level of the category are rising (in the case of an exposure that is too positive) or falling (in the case of an exposure that is too negative). The portfolio construction step uses prices as of the start of the day, while the plot above shows exposures at the end of the day. So even if constraints are within bounds using starting market values they could be out of bounds at the end of the day due to price movement.
  • Lack of liquidity. The trades that we need to make to bring an exposure back within bounds could be left unfilled due to a lack of liquidity. Recall that we configured our backtest to only allow fills up to 4% of the number of shares traded in the market.
  • Loosened constraints. It could be the case that no set of trades can be found to bring an exposure that has drifted back within bounds and that the constraint needed to be loosened.

Exploring these scenarios is possible by looking at lower-level backtest results but is outside the scope of this vignette.

Factor exposures

The plotFactorExposure() method shows the portfolio exposure over time to one or more factors. Below we plot the exposure to size, which in our backtest has an exposure constraint of +/-1%:

sim$plotFactorExposure(c("size"))

Here we can also see spikes of exposure outside the constrained range +/-1%. As discussed in the previous section, price movement, lack of liquidity and constraint loosening are possible explanations for these spikes in end-of-day exposure. In the case of factor constraints, another possible explanation is a significant day-over-day change in factor values. Again, exploring these scenarios is possible by diving more deeply into the backtest’s result data, but is outside the scope of this document.

Appendix: file-based inputs

The first part of the vignette showed how to run a simulation where all data is supplied using objects in memory. In this appendix we discuss setting up a simulation where all data comes from binary feather files stored on disk. This approach is useful for running simulations over many periods with a small memory footprint.

In this section we assume we have a directory called sample_data in the vignettes directory that contains security reference, pricing, and alpha/factor input data in feather format. An archive of sample data that matches the configuration below is available for download from the package’s GitHub repository for experimentation.

Security reference

The simulation’s configuration file should be set as follows for file-based security reference input:

simulator:
  secref_data:
    type: file
    filename: sample_data/secref.feather

The type field specifies that secref information should come from a file and not be passed to the simulator as a constructor parameter. The filename field gives the location of the file.

Alpha and factor inputs

When using file-based data, strand expects alpha and factor input data for each day to be stored in its own file. The /simulator/input_data/directory and /simulator/input_data/prefix configuration options specify where the system should find these files:

simulator:
  input_data:
    type: file
    directory: sample_data/inputs
    prefix: inputs

This entry indicates that the input data should be retrieved from files located in sample_data/inputs with filename prefix inputs. By convention the data for YYYYmmdd should have filename prefix_YYYYmmdd.feather. Therefore the file sample_data/inputs/inputs_20190104.feather contains the alpha and risk values that will be used for trading on 2019-01-04.
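The file naming convention can be illustrated with a small helper that builds the expected path for a given date. The function daily_file_path is a hypothetical name and is not part of the strand API.

```r
# Build the path of the feather file expected for a given date, following
# the prefix_YYYYmmdd.feather naming convention.
daily_file_path <- function(date, directory, prefix) {
  file.path(directory,
            paste0(prefix, "_", format(as.Date(date), "%Y%m%d"), ".feather"))
}

daily_file_path("2019-01-04", "sample_data/inputs", "inputs")
# "sample_data/inputs/inputs_20190104.feather"
```

The same convention applies to daily pricing files, with the pricing directory and prefix substituted.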

Market data

Like alpha and factor inputs, strand expects pricing data for each day to be stored in its own file. The /simulator/pricing_data/directory and /simulator/pricing_data/prefix configuration options specify where the system should find these files:

simulator:
  pricing_data:
    type: file
    directory: sample_data/pricing
    prefix: pricing

The /simulator/pricing_data/directory value specifies the file system location for the market data files. The value of /simulator/pricing_data/prefix indicates the prefix for each file name. So in our example, the market data for 2019-01-04 will be found in sample_data/pricing/pricing_20190104.feather.