Note: this document assumes familiarity with the notion of investment strategy backtesting. For an introduction, see the article Backtests in R-News Volume 7/1.
Evaluating an investment strategy is a multi-step process. Often the first step is to scrutinize a strategy’s underlying signal, or alpha, by running a top-bottom quartile spread analysis using a tool like the R package backtest
. A quartile analysis gives a good idea as to whether the signal is predictive of future returns. However, the analysis ignores many real-world aspects of strategy implementation and trading. For example, a spread analysis assumes that we can trade immediately, regardless of actual liquidity. As a result, it is difficult to learn from a spread analysis how the performance of a strategy degrades as investment capital and portfolio size increase. A more sophisticated backtest attempts to mimic how the strategy would be implemented in practice. This includes using optimization-based portfolio construction and trade selection, and making conservative assumptions about what trades could actually have been made in the market.
The strand
package provides a framework for running this more realistic type of backtest. Once a strategy is defined in terms of its alpha, risk constraints, and position and turnover limits, the system simulates how the strategy would be operated day-by-day, including daily order generation and realistic trade filling.
The purpose of this vignette is to describe how to set up and run a strand
simulation.
The strand
system is meant to mimic a daily professional-level portfolio management process. The process involves the following steps:
Running a backtest in strand
follows the same process. First, data is prepared and organized for input into the system. Second, the strategy is defined in a configuration file. Finally, we run the simulation and analyze results. In the next few sections we’ll cover each of these major steps in turn.
All setup and configuration in strand
is accomplished by filling in a yaml-format configuration file. The backtest in this vignette, for example, will use the file vignettes/sample.yaml
. The configuration file contains two major sections. The strategies
section contains entries for the strategy’s alpha, any portfolio construction constraints, position size constraints, etc. (Note that it is possible to operate multiple strategies in the same simulation, but this vignette will only cover a simulation with one strategy.) The simulator
section contains the location of input data, input data column mappings, and simulator options. There are also several top-level settings that control aspects of the simulation.
Three types of input data are needed to run a strand
simulation:
This data can be provided to the simulator directly using objects stored in memory, or the simulator can read the data from binary files stored in feather
format on disk.
In this vignette we will use the package’s sample data sets as input and supply them as objects in memory. For a discussion of how to use files for input data, see Appendix: file-based inputs.
The package’s sample data set includes value and size factors, as well as pricing information, for most of the stocks in the S&P 500 for the period June-August 2020. The source of fundamental data is EDGAR, and all pricing data was downloaded using the Tiingo Stock API.
Two packages are required to work through the code in this vignette: strand
and dplyr
:
library(strand)
library(dplyr)
The security reference specifies static information about each security in the backtest. Security reference data must include at least the following columns:
id containing a unique security identifier.
symbol (required for reporting).
Below is a listing of a few rows and columns from the sample_secref
data set included with the package:
data(sample_secref)
sample_secref %>%
  select(id, name, sector) %>%
  head()
#> id name sector
#> 1 MMM 3M Company Industrials
#> 2 ABT Abbott Laboratories Health Care
#> 3 ABBV AbbVie Inc. Health Care
#> 4 ABMD ABIOMED Inc Health Care
#> 5 ACN Accenture plc Information Technology
#> 6 ATVI Activision Blizzard Communication Services
Shown are three columns, all of class character
: id
, name
, and one category column sector
. In sample.yaml
we indicate that this data will be passed to the simulator as an object by configuring /simulator/secref_data
as follows:
simulator:
  secref_data:
    type: object
Alpha and factor inputs are used in the daily portfolio construction process. Alpha and factor inputs for each day must contain at least the following columns:
date: the date for which the information should be used to create orders.
id: the security identifier. The identifier must appear in the security reference discussed above.
rc_vol: an estimate of the amount of volume, in market value (in the simulation’s reference currency) that we expect to see for one day, on average. This measure of volume is used in the portfolio optimization step to ensure that our order sizes are in line with the amount we expect to be able to trade in the market. It is also used in the optimization’s liquidity constraint on position size.
The /simulator/input_data
section in sample.yaml
indicates that, like security reference data, alpha and factor inputs will be supplied to the simulation as an object:
simulator:
  input_data:
    type: object
We will be using the sample_inputs
data set included with the package as the alpha and factor input for our simulation:
data(sample_inputs)
sample_inputs %>%
  filter(date %in% as.Date("2020-06-01")) %>%
  select("date", "id", "rc_vol", "size", "value") %>%
  head()
#> date id rc_vol size value
#> 1 2020-06-01 A 195700553 0.3103556 -0.4425451
#> 2 2020-06-01 AAL 693431274 -1.7615180 -2.9023557
#> 3 2020-06-01 AAP 147177014 -0.8766498 0.1520585
#> 4 2020-06-01 AAPL 10830836279 2.6736936 -1.1883389
#> 5 2020-06-01 ABBV 1207185748 1.4392005 -1.8642855
#> 6 2020-06-01 ABC 119776226 -0.1054613 -0.4033034
The first column, date
, contains the date for which the data should be used to generate orders in the simulation. The second and third columns are the id
and rc_vol
columns described above. The column value
is a numeric column that serves as the signal for the backtest run in this vignette. The size
column contains a numeric factor that is used in a portfolio construction constraint.
Note regarding date semantics for inputs data: The data for 2020-06-01 is used for constructing the portfolio and generating orders for trading on that day. As a result, it is assumed that this data is known before trading begins on 2020-06-01 (e.g., data as-of the close on 2020-05-29). How the strategy operates and any expectations around the timeliness of data delivery should dictate what is used as inputs for a given date.
Market data is used to value positions, calculate portfolio performance, and simulate trade fills in the backtest. In the current version of the package, all prices and market values are assumed to be in a single reference currency. Market data for each date must contain at least the following columns:
date: as-of date for the pricing data.
id: the security identifier. The identifier must appear in the security reference discussed above.
close_price: the unadjusted closing price for the security.
prior_close_price: the unadjusted closing price for the security on the previous day.
adjustment_ratio: ratio indicating any change in the adjustment basis. The ratio is the number of old shares over the number of new shares. So in the case of a 2:1 split, the adjustment ratio is 0.5.
volume: the number of total shares of the security traded in the market.
dividend: any cash dividend issued for the security.
distribution: any other cash or equivalent per-share distribution, such as a spin-off, that does not affect the company’s shares outstanding.
Below is the /simulator/pricing_data
section of the sample.yaml
configuration file:
simulator:
  pricing_data:
    type: object
    columns:
      close_price: price_unadj
      prior_close_price: prior_close_unadj
      adjustment_ratio: adjustment_ratio
      volume: volume
      dividend: dividend_unadj
      distribution: distribution_unadj
As with security and inputs data, the entry type: object
indicates that pricing data will be supplied to the simulation as an object.
The /simulator/pricing_data/columns
section allows us to map columns in our pricing data to columns the backtester expects to be present. For example, the entry
columns:
  close_price: price_unadj
indicates that there is a column price_unadj
in the data that should be treated by the system as the required close_price
column.
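The mapping idea can be sketched in base R as a simple column rename. This is only an illustration of what the `columns` entries express, not the package's internal implementation; the one-row `pricing` data frame is a stand-in built from the AAPL values in the sample pricing data.

```r
# Mapping from the names the simulator expects (names of the vector)
# to the names present in our data (values), as in sample.yaml.
column_map <- c(close_price       = "price_unadj",
                prior_close_price = "prior_close_unadj",
                dividend          = "dividend_unadj",
                distribution      = "distribution_unadj")

# Stand-in pricing data (one row, AAPL on 2020-06-01).
pricing <- data.frame(id = "AAPL",
                      price_unadj = 321.85,
                      prior_close_unadj = 317.94,
                      dividend_unadj = 0,
                      distribution_unadj = 0)

# Rename each mapped column to the name the simulator expects.
idx <- match(column_map, names(pricing))
names(pricing)[idx] <- names(column_map)

names(pricing)
```

Columns whose names already match, such as `volume` and `adjustment_ratio` in the sample configuration, need no renaming.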
We will be using the sample_pricing
data set included with the package as the pricing data for our simulation. Below are a few rows of this dataset:
data(sample_pricing)
sample_pricing %>%
  filter(date %in% as.Date("2020-06-01")) %>%
  select("date", "id", "price_unadj", "prior_close_unadj", "adjustment_ratio",
         "volume", "dividend_unadj", "distribution_unadj") %>%
  head()
#> date id price_unadj prior_close_unadj adjustment_ratio volume
#> 1 2020-06-01 A 89.91 88.14 1 2477600
#> 2 2020-06-01 AAL 11.11 10.50 1 50681600
#> 3 2020-06-01 AAP 139.79 139.32 1 840953
#> 4 2020-06-01 AAPL 321.85 317.94 1 20254653
#> 5 2020-06-01 ABBV 90.70 92.67 1 8483100
#> 6 2020-06-01 ABC 95.15 95.34 1 1012600
#> dividend_unadj distribution_unadj
#> 1 0 0
#> 2 0 0
#> 3 0 0
#> 4 0 0
#> 5 0 0
#> 6 0 0
As per the configuration file, the price_unadj
, prior_close_unadj
, dividend_unadj
, and distribution_unadj
columns will be treated by the simulator as the close_price
, prior_close_price
, dividend
, and distribution
columns, respectively. There are no dividends or distributions for the securities shown, so these values are 0
. There are no splits or other changes to the securities’ adjustment basis, so all adjustment_ratio
values are 1.
Note regarding date semantics for pricing/market data: The pricing data for date 2020-06-01 in the simulator is data as-of 2020-06-01. Therefore, some of the data could not have been known until the end of that day. For example, the close_price and volume data items are measured at the end of the trading day on 2020-06-01. Note that these semantics are in contrast with the date semantics for alpha/factor input data, where all data is assumed to have been known before trading begins on the data date.
All strategy settings, including input signal and exposure constraints, are specified in the yaml configuration file. Below are the key strategy settings from the sample.yaml
file:
strategies:
  strategy_1:
    in_var: value
    strategy_capital: 1e6
    ideal_long_weight: 1
    ideal_short_weight: 1
    position_limit_pct_lmv: 1
    position_limit_pct_smv: 1
    position_limit_pct_adv: 30
    trading_limit_pct_adv: 5
    constraints:
      size:
        type: factor
        in_var: size
        upper_bound: 0.01
        lower_bound: -0.01
      sector:
        type: category
        in_var: sector
        upper_bound: 0.02
        lower_bound: -0.02
turnover_limit: 25000
target_weight_policy: half-way
In this section we’ll be discussing each of the entries listed above. Note that we are only working with a single strategy in this vignette (although it is possible to use strand
to run multiple strategies in a single simulation). This strategy is called strategy_1
. The settings specific to strategy_1
all fall under the /strategies/strategy_1
entry above.
We set the input alpha for strategy_1
by setting the in_var
parameter:
strategies:
  strategy_1:
    in_var: value
The above indicates that in the objective function for our optimization we are maximizing the exposure of our portfolio to value
. As we saw in the previous section, value
is one of the columns in our inputs data.
The market value of the portfolio is controlled by the following section of the configuration file:
strategies:
  strategy_1:
    strategy_capital: 1e6
    ideal_long_weight: 1
    ideal_short_weight: 1
The strategy_capital
setting controls the amount of capital allocated to the strategy. The ideal_long_weight
and ideal_short_weight
parameters control the leverage for the long and short sides, respectively. The target long market value is the product of the strategy_capital
and ideal_long_weight
parameters, while the target short market value is -1 times the product of the strategy_capital
and ideal_short_weight
parameters. In our example the ideal long and short leverage values are both 1 with a strategy capital of $1mm. This means our target long market value is $1mm and target short market value is -$1mm.
Position sizes are controlled by the parameters position_limit_pct_lmv
and position_limit_pct_smv
:
strategies:
  strategy_1:
    position_limit_pct_lmv: 1
    position_limit_pct_smv: 1
These parameters express limits on the size of positions as a percentage of the target long and short market values for the portfolio (calculated in the previous example). In our example both are set to 1, which means that the maximum size for a long position is 1% times $1mm = $10,000, and the maximum size for a short position is 1% times -$1mm = -$10,000.
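The arithmetic for target market values and position limits can be checked with a few lines of R. This is a minimal sketch that simply restates the formulas above using the values from sample.yaml; the variable names `target_lmv`, `target_smv`, `max_long_position` and `max_short_position` are illustrative and are not part of the package.

```r
# Settings copied from the strategy_1 entry in sample.yaml.
strategy_capital       <- 1e6
ideal_long_weight      <- 1
ideal_short_weight     <- 1
position_limit_pct_lmv <- 1
position_limit_pct_smv <- 1

# Target market values: capital times ideal weight, negated for the short side.
target_lmv <- strategy_capital * ideal_long_weight
target_smv <- -1 * strategy_capital * ideal_short_weight

# Position limits are percentages of the target long and short market values.
max_long_position  <- position_limit_pct_lmv / 100 * target_lmv
max_short_position <- position_limit_pct_smv / 100 * target_smv

c(target_lmv = target_lmv,
  target_smv = target_smv,
  max_long_position = max_long_position,
  max_short_position = max_short_position)
```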
There are two ways in which the rc_vol
measure in our inputs data is used to impose liquidity constraints on our portfolio construction process.
First, the position_limit_pct_adv
parameter is used to impose a position size constraint in addition to the constraints discussed in the previous section. This parameter limits the position size, in absolute value, to a percentage of the rc_vol
measure in the current day’s input data. In sample.yaml
we have:
strategies:
  strategy_1:
    position_limit_pct_adv: 30
which means a position in our simulation can be no greater than 30% of our rc_vol
value. For example, suppose the average volume measure for security ABC is $10mm. This means that the liquidity constraint imposes a limit on the size of long positions in ABC of $3mm and a limit on the size of short positions of -$3mm.
Second, the trading_limit_pct_adv
parameter limits the size of the order that can be generated for a stock. The idea is to size orders to be in line with the amount of trading expected in the market and any limitation we are planning on imposing on participation. It’s difficult to control exposures effectively if we don’t do this! In the example configuration file, we have:
strategies:
  strategy_1:
    trading_limit_pct_adv: 5
which means that we can buy at most 5% or sell at most 5% of a security’s rc_vol
measure on a given day. Continuing our example above, if our measurement on a given day is that ABC is trading on average $10mm per day, we can buy at most $500,000 or sell at most $500,000 on that day.
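The two liquidity limits for the hypothetical security ABC work out as follows. This is a sketch of the arithmetic only. The last step assumes that, because the volume-based limit applies in addition to the market-value-based limit, the tighter of the two is the one that binds; it uses the $10,000 long position limit derived earlier.

```r
rc_vol                 <- 10e6  # average daily volume measure for ABC
position_limit_pct_adv <- 30
trading_limit_pct_adv  <- 5

# Largest position (in absolute value) allowed by the liquidity constraint.
max_abs_position <- position_limit_pct_adv / 100 * rc_vol

# Largest order (buy or sell) that can be generated on a given day.
max_order_size <- trading_limit_pct_adv / 100 * rc_vol

# Assumption: when several position limits apply, the tightest one binds.
max_long_position_lmv <- 10000  # 1% of the $1mm target long market value
effective_long_limit  <- min(max_long_position_lmv, max_abs_position)

c(max_abs_position = max_abs_position,
  max_order_size = max_order_size,
  effective_long_limit = effective_long_limit)
```

For a liquid name like ABC the market-value-based limit is far tighter, so the volume-based limit matters mainly for thinly traded securities or larger amounts of capital.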
Factor constraints limit the exposure the optimized portfolio can have to a given numeric value. Exposure here is the sum, over all positions, of each position’s weight multiplied by its numeric value, and we impose an upper and/or lower bound on that sum. In the context of exposure constraints, a position weight means the signed market value of a position divided by the strategy’s strategy_capital
value.
In this vignette’s example we impose factor constraints on size
:
strategies:
  strategy_1:
    constraints:
      size:
        type: factor
        in_var: size
        upper_bound: 0.01
        lower_bound: -0.01
Recall that size
must be a column present in the daily inputs data. Each constraint is configured in a separate entry in the constraints
section that contains the following key/value pairs:
type: in this case factor because we are constraining exposure to a numeric value.
in_var: the name of the column in the input data that contains the numeric value.
upper_bound: the upper bound on exposure to in_var.
lower_bound: the lower bound on exposure to in_var.
In our example we are limiting exposure to size to be within +/-1%.
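To make the definition of factor exposure concrete, the sketch below computes exposure to size for a small made-up portfolio. The three positions and their size values are hypothetical; only the weight definition (signed market value divided by strategy capital) and the +/-1% bounds come from the text.

```r
strategy_capital <- 1e6

# Hypothetical positions: signed market values and size factor values.
positions <- data.frame(
  id           = c("AAA", "BBB", "CCC"),
  market_value = c(10000, -10000, 5000),
  size         = c(1.2, 0.4, -0.8)
)

# Position weight: signed market value over strategy capital.
positions$weight <- positions$market_value / strategy_capital

# Factor exposure: sum of weight times factor value.
size_exposure <- sum(positions$weight * positions$size)

# Check against the bounds configured for the size constraint.
within_bounds <- size_exposure >= -0.01 && size_exposure <= 0.01

size_exposure
within_bounds
```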
Category exposure constraints are similar to factor exposure constraints. A category constraint imposes a limit on the exposure (i.e., the sum of the position weights) within each level of a category. In our example we have a single constraint on sector
:
strategies:
  strategy_1:
    constraints:
      sector:
        type: category
        in_var: sector
        upper_bound: 0.02
        lower_bound: -0.02
Here, sector
must be a column that appears in the security reference or the simulation’s input data. As with factor constraints, the category constraint is defined in its own entry in the constraints
section of the configuration file for strategy_1
with the following key/value pairs:
type: in this case category because we are constraining exposure to the levels of a category.
in_var: the name of the column in the security reference that contains the category level for each security.
upper_bound: the upper bound on exposure for each level of in_var.
lower_bound: the lower bound on exposure for each level of in_var.
There are 11 levels in sector in our security reference:
data(sample_secref)
sample_secref %>%
  group_by(sector) %>%
  summarise(count = n()) %>%
  print(n = Inf)
#> # A tibble: 11 x 2
#> sector count
#> <chr> <int>
#> 1 Communication Services 24
#> 2 Consumer Discretionary 58
#> 3 Consumer Staples 31
#> 4 Energy 26
#> 5 Financials 62
#> 6 Health Care 62
#> 7 Industrials 71
#> 8 Information Technology 71
#> 9 Materials 28
#> 10 Real Estate 31
#> 11 Utilities 28
The constraint above indicates that in our optimization we may have no more than +/-2% of exposure in any one of these levels.
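Category exposure can be illustrated the same way as factor exposure: sum the position weights within each level. The positions below are hypothetical and the code is a sketch of the definition, not the optimizer's implementation.

```r
strategy_capital <- 1e6

# Hypothetical positions with their sector and signed market value.
positions <- data.frame(
  id           = c("AAA", "BBB", "CCC", "DDD"),
  sector       = c("Energy", "Energy", "Utilities", "Utilities"),
  market_value = c(10000, 8000, -10000, -5000)
)

positions$weight <- positions$market_value / strategy_capital

# Exposure per level: the sum of position weights within that level.
sector_exposure <- tapply(positions$weight, positions$sector, sum)

# Levels whose exposure falls outside the +/-2% bounds.
violations <- sector_exposure[abs(sector_exposure) > 0.02]

sector_exposure
length(violations)
```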
The top-level configuration entry turnover_limit: 25000
imposes a fixed turnover constraint on the optimization. This constraint means that, if the reference currency is USD, the most that can be traded on a single day is $25,000.
Note that the system will be allowed to trade more than the turnover limit specified above if the portfolio is significantly over- or under-invested. For example, if the target gross market value of the portfolio (\(T\)) is $1mm, the current gross market value (\(C\)) of the portfolio is $1.2mm, and the turnover limit (\(tl\)) is $25,000, the effective turnover limit will be \(\texttt{max}(tl, |T - C|)\) = $200,000.
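The effective turnover limit formula is easy to express as a small helper. The function name `effective_turnover_limit` is made up for this illustration.

```r
# Effective limit: the larger of the configured limit and the gap between
# target and current gross market value.
effective_turnover_limit <- function(turnover_limit, target_gmv, current_gmv) {
  max(turnover_limit, abs(target_gmv - current_gmv))
}

# Example from the text: portfolio is $200,000 over its target.
over_invested <- effective_turnover_limit(25000, 1e6, 1.2e6)

# Portfolio only $10,000 away from target: the configured limit applies.
near_target <- effective_turnover_limit(25000, 1e6, 1.01e6)

c(over_invested = over_invested, near_target = near_target)
```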
The target_weight_policy
top-level configuration item controls how aggressively the ideal long and short weights are targeted during portfolio optimization. Currently there are two allowable values: full
and half-way
. When set to full
, the ideal weights are targeted during optimization. When set to half-way
, the optimization uses the midpoint between the current and ideal weight as the target weight. For example, suppose the current portfolio is empty, the ideal long weight is 1, and target_weight_policy
is set to half-way
. In this case a target long weight of 0.5 will be used during optimization.
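The two policies can be sketched as a small function. The name `target_weight` and its arguments are hypothetical; the logic follows the description above.

```r
# Target weight used in optimization under each policy.
target_weight <- function(current_weight, ideal_weight,
                          policy = c("full", "half-way")) {
  policy <- match.arg(policy)
  if (policy == "full") {
    ideal_weight
  } else {
    # Midpoint between the current and the ideal weight.
    (current_weight + ideal_weight) / 2
  }
}

from_empty <- target_weight(0, 1, "half-way")    # empty portfolio
next_step  <- target_weight(0.5, 1, "half-way")  # portfolio already at 0.5
full       <- target_weight(0, 1, "full")

c(from_empty = from_empty, next_step = next_step, full = full)
```

Under half-way, a portfolio that keeps reaching its target closes half of the remaining gap each day, so it approaches the ideal weight gradually rather than in one step.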
There may be cases where, given the constraints imposed on the portfolio construction process, no solution can be found. In this scenario, the violating constraints are loosened in an attempt to find a solution. This loosening applies to factor and category exposure constraints.
For example, suppose the current exposure to level Industrials
of sector
is 3%, and we have set a 2% exposure limit on exposure to Industrials
. If no solution is found, the optimization is re-run with a limit that is 50% as strict. That is, we set the limit to the midpoint between the current exposure and the constraint limit. In our example, the midpoint between the current exposure and the limit is 2.5%. If no solution is found with the new limit of 2.5%, the constraint is loosened again by 50%, moving the limit to 2.75%. If no solution is found after this second round of loosening, the constraint limit is set to the current exposure, in this case 3%, to ensure that the constraint can be satisfied.
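The loosening sequence in this example can be reproduced with a short sketch. The function `loosened_limits` is illustrative only; it follows the description above (two rounds of moving the limit to the midpoint, then falling back to the current exposure) rather than the package's code.

```r
# Sequence of limits tried after the original limit fails.
loosened_limits <- function(current_exposure, limit, rounds = 2) {
  limits <- numeric(0)
  for (i in seq_len(rounds)) {
    # Move the limit to the midpoint between it and the current exposure.
    limit  <- (limit + current_exposure) / 2
    limits <- c(limits, limit)
  }
  # Final fallback: set the limit to the current exposure.
  c(limits, current_exposure)
}

limits_tried <- loosened_limits(current_exposure = 0.03, limit = 0.02)
limits_tried
```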
In the previous section we discussed how to define the strategy that we want to backtest, in terms of the input signal and portfolio construction constraints. In this section we will cover how to configure different aspects of the backtest that are not related to portfolio construction.
The top-level items from
and to
control the starting and ending dates for the backtest:
from: 2020-06-01
to: 2020-08-31
In this vignette we will run a simulation over 3 months of daily data, beginning on June 1, 2020 and ending on August 31, 2020.
The top-level parameter solver
controls which linear optimization toolkit is used to solve the portfolio construction constrained optimization problem:
solver: glpk
Currently there are two options: glpk
to use the GNU Linear Programming Kit, and symphony
to use the COIN-OR SYMPHONY solver.
The simulator setting fill_rate_pct_vol
controls what percentage of the observed volume for a security on a given day can be used to fill our order. We call this the fill rate or participation rate for the backtest. In sample.yaml
we set this value to 4%:
simulator:
  fill_rate_pct_vol: 4
As an example, suppose on 2020-06-01 that the optimization process for strategy_1
generates an order to buy 5,000 shares of security ABC. But suppose ABC only trades 100,000 shares on that day. Because we have set a 4% limit on participation, strategy_1
will only be able to buy 4,000 shares, despite generating an order to buy 5,000.
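The fill logic in this example amounts to capping the order at a fraction of observed volume. The helper below is a sketch with a hypothetical name, `simulate_fill`; it ignores details such as share rounding that the simulator may handle differently.

```r
# Shares filled for an order, given market volume and the fill rate.
# Positive order_shares is a buy, negative is a sell.
simulate_fill <- function(order_shares, market_volume, fill_rate_pct_vol) {
  max_fill <- fill_rate_pct_vol / 100 * market_volume
  sign(order_shares) * min(abs(order_shares), max_fill)
}

capped_fill <- simulate_fill(5000, 100000, 4)  # volume cap binds
full_fill   <- simulate_fill(3000, 100000, 4)  # order is small enough

c(capped_fill = capped_fill, full_fill = full_fill)
```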
The value of the parameter transaction_cost_pct
determines the fixed transaction costs, as a percentage of traded notional, for the strategy. In sample.yaml
, we have:
simulator:
  transaction_cost_pct: 0.1
which means that we are charging a transaction cost penalty of 10bps of traded notional. For example, if we trade $10,000 worth of stock ABC on 2019-07-15 we incur a transaction cost of $10 for that day.
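As a quick check of the arithmetic, the cost is the traded notional times the percentage setting. The function name `transaction_cost` is illustrative.

```r
# Transaction cost as a percentage of traded notional.
transaction_cost <- function(traded_notional, transaction_cost_pct) {
  abs(traded_notional) * transaction_cost_pct / 100
}

cost <- transaction_cost(10000, 0.1)
cost
```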
Setting financing_cost_pct
controls the financing rate used for the backtest. Financing for day \(t\) is applied to the starting notional of the position and ignores any trading on day \(t\). Financing is calculated using a standard 360-day-count methodology and is triple-charged on Mondays. The financing charge is set to 1% for the vignette’s backtest:
simulator:
  financing_cost_pct: 1
For example, if we have a position in ABC valued at $100,000 at the beginning of Monday, July 15, 2019, we would apply a financing charge of \(3 \times \frac{0.01 \times \$100,000}{360} = \$8.33\) for the position on that day.
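The financing calculation can be sketched as below. The function `financing_cost` is hypothetical and implements only what the text states: a 360-day count and a triple charge on Mondays. Holiday handling and other details are not covered.

```r
# Financing charge for one day on a position's starting notional.
financing_cost <- function(start_notional, financing_cost_pct, date) {
  # wday is 1 for Monday (0 is Sunday).
  is_monday    <- as.POSIXlt(as.Date(date))$wday == 1
  days_charged <- if (is_monday) 3 else 1
  days_charged * abs(start_notional) * financing_cost_pct / 100 / 360
}

monday_charge  <- financing_cost(100000, 1, "2019-07-15")
tuesday_charge <- financing_cost(100000, 1, "2019-07-16")

round(c(monday_charge = monday_charge, tuesday_charge = tuesday_charge), 2)
```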
At this point we have covered all of the setup required to run the backtest. We have prepared our data, including the security reference and daily inputs and market data. We have filled in the configuration file to specify the strategy and control different aspects of the simulator. We are ready to run the backtest.
The strand
package is implemented using the R6
OOP system. To run the backtest, we create a Simulation
object by passing the path to the yaml configuration file and the three data sets discussed above to the constructor. Then we call the method run()
:
data(sample_inputs)
data(sample_pricing)
data(sample_secref)
sim <- Simulation$new(config = "sample.yaml",
                      raw_input_data = sample_inputs,
                      raw_pricing_data = sample_pricing,
                      security_reference_data = sample_secref)
sim$run()
When the backtest is finished, we can call methods to summarize and plot the results. For example, the overallStatsDf()
method returns a data frame of key statistics:
sim$overallStatsDf()
#> Item Gross Net
#> 1 Total P&L -32,355 -40,867
#> 2 Total Return on GMV (%) -1.8 -2.3
#> 3 Annualized Return on GMV (%) -6.9 -8.7
#> 4 Annualized Vol (%) 8.9 8.8
#> 5 Annualized Sharpe -0.77 -0.98
#> 6 Max Drawdown (%) -6.6 -6.8
#> 7 Avg GMV 1,976,540
#> 8 Avg NMV -716
#> 9 Avg Count 189
#> 10 Avg Daily Turnover 53,022
#> 11 Holding Period (months) 3.6
This display shows that, gross of transaction and financing costs, the strategy had a total return of -1.8% on gross market value (GMV) and an annualized Sharpe ratio of -0.77. Net of costs, the strategy’s total return was -2.3% with a Sharpe of -0.98. The average gross market value of the portfolio (GMV) was 1,976,540, while the average net market value (NMV) was -716, as expected given that the strategy’s target market values are $1mm long, -$1mm short. On average there were 189 positions in the portfolio. The average daily turnover (gross market value of trading) was 53,022, implying a holding period of 3.6 months.
There are several methods that can be used to visualize backtest results.
The plotPerformance()
method plots gross and net return on GMV over time:
sim$plotPerformance()
The plotMarketValue()
method plots gross market value (GMV), net market value (NMV), long market value (LMV) and short market value (SMV) over time:
sim$plotMarketValue()
The plotCategoryExposure()
method shows the exposure over time within each level of a given category. Below we plot the exposure within the levels of sector
, which in our backtest has an exposure constraint of +/-2%:
sim$plotCategoryExposure("sector")
Note that there are some cases where the exposure to a level of sector
falls outside of +/-2% despite the category exposure constraint we impose during portfolio construction. This can be due to the following:
Exploring these scenarios is possible by looking at lower-level backtest results but is outside the scope of this vignette.
The plotFactorExposure()
method shows the portfolio exposure over time to one or more factors. Below we plot the exposure to size
, which in our backtest has an exposure constraint of +/-1%:
sim$plotFactorExposure(c("size"))
Here we can also see spikes of exposure outside the constrained range +/-1%. As discussed in the previous section, price movement, lack of liquidity and constraint loosening are possible explanations for these spikes in end-of-day exposure. In the case of factor constraints, another possible explanation is a significant day-over-day change in factor values. Again, exploring these scenarios is possible by diving more deeply into the backtest’s result data, but is outside the scope of this document.
The first part of the vignette showed how to run a simulation where all data is supplied using objects in memory. In this appendix we discuss setting up a simulation where all data comes from binary feather
files stored on disk. This approach is useful for running simulations over many periods with a small memory footprint.
In this section we assume we have a directory called sample_data
in the vignettes directory that contains security reference, pricing, and alpha/factor input data in feather
format. An archive of sample data that matches the configuration below is available for download from the package’s GitHub repository for experimentation.
The simulation’s configuration file should be set as follows for file-based security reference input:
simulator:
  secref_data:
    type: file
    filename: sample_data/secref.feather
The type
field specifies that secref information should come from a file and not be passed to the simulator as a constructor parameter. The filename
field gives the location of the file.
When using file-based data, strand
expects alpha and factor input data for each day to be stored in its own file. The /simulator/input_data/directory
and /simulator/input_data/prefix
configuration options specify where the system should find these files:
simulator:
  input_data:
    type: file
    directory: sample_data/inputs
    prefix: inputs
This entry indicates that the input data should be retrieved from files located in sample_data/inputs
with filename prefix inputs
. By convention the data for YYYYmmdd should have filename prefix_YYYYmmdd.feather
. Therefore the file sample_data/inputs/inputs_20190104.feather
contains the alpha and risk values that will be used for trading on 2019-01-04.
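The naming convention can be sketched as a small helper that builds the expected path for a given date. The function `daily_file_path` is hypothetical and only illustrates the prefix_YYYYmmdd.feather convention; it does not check that the file exists.

```r
# Build the path of the daily file for a date, following the
# prefix_YYYYmmdd.feather convention.
daily_file_path <- function(directory, prefix, date) {
  file.path(directory,
            paste0(prefix, "_", format(as.Date(date), "%Y%m%d"), ".feather"))
}

inputs_path <- daily_file_path("sample_data/inputs", "inputs", "2019-01-04")
inputs_path
```

The same convention applies to any directory and prefix pair set in the configuration file.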
Like alpha and factor inputs, strand
expects pricing data for each day to be stored in its own file. The /simulator/pricing_data/directory
and /simulator/pricing_data/prefix
configuration options specify where the system should find these files:
simulator:
  pricing_data:
    type: file
    directory: sample_data/pricing
    prefix: pricing
The /simulator/pricing_data/directory
value specifies the file system location for the market data files. The value of /simulator/pricing_data/prefix
indicates the prefix for each file name. So in our example, the market data for 2019-01-04 will be found in sample_data/pricing/pricing_20190104.feather
.