Pairs Trading using Data-Driven Techniques: Simple Trading Strategies Part 3
Pairs trading is a nice example of a strategy based on mathematical analysis. We'll demonstrate how to leverage data to create and automate a pairs trading strategy.
Download the IPython Notebook here.
Underlying Principle
Let's say you have a pair of securities X and Y that have some underlying economic link, for example two companies that manufacture the same product, like Pepsi and Coca-Cola. You expect the ratio or difference in prices (also called the spread) of these two to remain constant over time. However, from time to time there might be a divergence in the spread between these two, caused by temporary supply/demand changes, large buy/sell orders for one security, a reaction to important news about one of the companies, etc. In this scenario, one stock moves up while the other moves down relative to each other. If you expect this divergence to revert back to normal with time, you can make a pairs trade.
When there is a temporary divergence, the pairs trade would be to sell the outperforming stock (the stock that moved up) and to buy the underperforming stock (the stock that moved down). You are making a bet that the spread between the two stocks will eventually converge, by either the outperforming stock moving back down, the underperforming stock moving back up, or both; your trade will make money in all of these scenarios. If both stocks move up or down together without changing the spread between them, you don't make or lose any money.
Therefore, pairs trading is a market neutral trading strategy enabling traders to profit from virtually any market conditions: uptrend, downtrend, or sideways movement.
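To make the mechanics concrete, here is a tiny sketch with made-up prices (the numbers are hypothetical and only for illustration): we short the outperformer X and go long the underperformer Y, and the trade pays off whenever the spread converges, whatever direction the market takes.

# Hypothetical entry prices after a divergence: X has run up, Y has lagged
x_entry, y_entry = 105.0, 95.0          # spread = 10, so we short X and buy Y

def pairs_pnl(x_exit, y_exit):
    # PnL of short 1 share of X plus long 1 share of Y
    return (x_entry - x_exit) + (y_exit - y_entry)

print(pairs_pnl(107.0, 104.0))  # both rise, spread converges  -> +7.0
print(pairs_pnl(96.0, 93.0))    # both fall, spread converges  -> +7.0
print(pairs_pnl(110.0, 100.0))  # both rise, spread unchanged  ->  0.0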
Explaining the Concept: We start by generating two fake securities.
import numpy as np
import pandas as pd
import statsmodels
from statsmodels.tsa.stattools import coint
# just set the seed for the random number generator
np.random.seed(107)
import matplotlib.pyplot as plt
Let's generate a fake security X and model its daily returns by drawing from a normal distribution. Then we perform a cumulative sum to get the value of X on each day.
# Generate daily returns
Xreturns = np.random.normal(0, 1, 100)
# sum them and shift all the prices up
X = pd.Series(np.cumsum(Xreturns), name='X') + 50
X.plot(figsize=(15,7))
plt.show()
Now we generate Y, which has a deep economic link to X, so the price of Y should vary pretty similarly to X. We model this by taking X, shifting it up and adding some random noise drawn from a normal distribution.
noise = np.random.normal(0, 1, 100)
Y = X + 5 + noise
Y.name = 'Y'
pd.concat([X, Y], axis=1).plot(figsize=(15,7))
plt.show()
Cointegration
Cointegration, very similar to correlation, means that the ratio between two series will vary around a mean. The two series, Y and X, follow the following:
Y = ⍺ X + e
where ⍺ is the constant ratio and e is white noise. Read more here.
For pairs trading to work between two timeseries, the expected value of the ratio over time must converge to the mean, i.e. they should be cointegrated.
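For intuition, the cointegration check can be thought of in two steps: estimate ⍺ by regressing Y on X, then test whether the residual e = Y − ⍺X is stationary (this is essentially what the Engle–Granger style test used below does). A minimal sketch with the X and Y generated above, using an Augmented Dickey–Fuller test on the residual:

import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

# Estimate the constant ratio alpha by regressing Y on X (with an intercept)
ols_fit = sm.OLS(Y, sm.add_constant(X)).fit()
alpha = ols_fit.params['X']

# If X and Y are cointegrated, the residual should be stationary around its mean
residual = Y - alpha * X
adf_stat, adf_pvalue = adfuller(residual)[:2]
print('alpha:', alpha, 'ADF p-value:', adf_pvalue)  # expect a very small p-value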
The time series we constructed above are cointegrated. We'll plot the ratio between the two now so we can see how this looks.
(Y/X).plot(figsize=(15,7))
plt.axhline((Y/X).mean(), color='red', linestyle='--')
plt.xlabel('Time')
plt.legend(['Price Ratio', 'Mean'])
plt.show()
Testing for Cointegration
There is a convenient test for cointegration that lives in statsmodels.tsa.stattools. We should see a very low p-value, as we've artificially created two series that are as cointegrated as physically possible.
# compute the p-value of the cointegration test
# will inform us as to whether the ratio between the 2 timeseries is stationary
# around its mean
score, pvalue, _ = coint(X, Y)
print(pvalue)
1.81864477307e-17
Note: Correlation vs. Cointegration
Correlation and cointegration, while theoretically similar, are not the same. Let's look at examples of series that are correlated, but not cointegrated, and vice versa. First let's check the correlation of the series we just generated.
X.corr(Y)
0.951
That's very high, as we would expect. But how would two series that are correlated but not cointegrated look? A simple example is two series that just diverge.
X_returns = np.random.normal(1, 1, 100)
Y_returns = np.random.normal(2, 1, 100)
X_diverging = pd.Series(np.cumsum(X_returns), name='X')
Y_diverging = pd.Series(np.cumsum(Y_returns), name='Y')
pd.concat([X_diverging, Y_diverging], axis=1).plot(figsize=(15,7))
plt.show()
print('Correlation: ' + str(X_diverging.corr(Y_diverging)))
score, pvalue, _ = coint(X_diverging, Y_diverging)
print('Cointegration test p-value: ' + str(pvalue))
Correlation: 0.998
Cointegration test p-value: 0.258
A simple example of cointegration without correlation is a normally distributed series and a square wave.
Y2 = pd.Series(np.random.normal(0, 1, 800), name='Y2') + 20
Y3 = Y2.copy()
Y3[0:100] = 30
Y3[100:200] = 10
Y3[200:300] = 30
Y3[300:400] = 10
Y3[400:500] = 30
Y3[500:600] = 10
Y3[600:700] = 30
Y3[700:800] = 10
Y2.plot(figsize=(15,7))
Y3.plot()
plt.ylim([0, 40])
plt.show()

# correlation is nearly zero
print('Correlation: ' + str(Y2.corr(Y3)))
score, pvalue, _ = coint(Y2, Y3)
print('Cointegration test p-value: ' + str(pvalue))
Correlation: 0.007546
Cointegration test p-value: 0.0
The correlation is incredibly low, but the p-value shows perfect cointegration!
How to make a pairs trade?
Because two cointegrated time series (such as X and Y above) drift towards and apart from each other, there will be times when the spread is high and times when the spread is low. We make a pairs trade by buying one security and selling another. This way, if both securities go down together or go up together, we neither make nor lose money; we are market neutral.
Going back to X and Y above that follow Y = ⍺ X + e, such that the ratio (Y/X) moves around its mean ⍺, we make money on the ratio of the two reverting to the mean. In order to do this we'll watch for when X and Y are far apart, i.e. the ratio is too high or too low:
- Going Long the Ratio: This is when the ratio ⍺ is smaller than usual and we expect it to increase. In the above example, we place a bet on this by buying Y and selling X.
- Going Short the Ratio: This is when the ratio ⍺ is large and we expect it to become smaller. In the above example, we place a bet on this by selling Y and buying X.
Note that we always have a "hedged position": a short position makes money if the security sold loses value, and a long position makes money if the security gains value, so we're immune to overall market movement. We only make or lose money if securities X and Y move relative to each other.
Using Data to find securities that behave like this
The best way to do this is to start with securities you suspect may be cointegrated and perform a statistical test. If you just run statistical tests over all pairs, you'll fall prey to multiple comparison bias.
Multiple comparisons bias is simply the fact that there is an increased chance of incorrectly generating a significant p-value when many tests are run, because we are running a lot of tests. If 100 tests are run on random data, we should expect to see 5 p-values below 0.05. If you are comparing n securities for cointegration, you will perform n(n-1)/2 comparisons, and you should expect to see many incorrectly significant p-values, which will increase as n increases. To avoid this, pick a small number of pairs you have reason to suspect might be cointegrated and test each individually. This will result in less exposure to multiple comparisons bias.
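As a quick, self-contained illustration of this bias (using simulated random walks, not the stock data we load below), running the cointegration test across many unrelated pairs still yields a handful of "significant" p-values purely by chance:

# Simulate 20 independent random walks: no pair is genuinely cointegrated
np.random.seed(42)
walks = [pd.Series(np.cumsum(np.random.normal(0, 1, 500))) for _ in range(20)]

n_tests = 0
false_positives = 0
for i in range(len(walks)):
    for j in range(i + 1, len(walks)):
        _, pvalue, _ = coint(walks[i], walks[j])
        n_tests += 1
        false_positives += pvalue < 0.05

# Expect roughly 5% of the 190 tests to come out "significant" by chance alone
print(false_positives, 'of', n_tests, 'pairs look cointegrated at the 0.05 level')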
So let's try to find some securities that display cointegration. Let's work with a basket of US large cap tech stocks from the S&P 500. These stocks operate in a similar segment and could have cointegrated prices. We scan through a list of securities and test for cointegration between all pairs. The function below returns a cointegration test score matrix, a p-value matrix, and any pairs for which the p-value was less than 0.02. This method is prone to multiple comparison bias and in practice the securities should be subject to a second verification step. Let's ignore this for the sake of this example.
def find_cointegrated_pairs(data):
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
    for i in range(n):
        for j in range(i+1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            result = coint(S1, S2)
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < 0.02:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs
Note: We include the market benchmark (SPY) in our data. The market drives the movement of so many securities that you might often find two seemingly cointegrated securities that in reality are not cointegrated with each other, but are both cointegrated with the market. This is called a confounding variable, and it is important to check for market involvement in any relationship you find.
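One rough way to sanity-check a candidate pair for this (an illustrative extra step, not part of the pipeline below) is to strip out each stock's market exposure first, for example by regressing each price series on SPY and re-running the cointegration test on the residuals; a pair is more convincing if it still passes:

import statsmodels.api as sm

def coint_ex_market(s1, s2, market):
    """Cointegration test on the parts of s1 and s2 not explained by the market."""
    resid1 = sm.OLS(s1, sm.add_constant(market)).fit().resid
    resid2 = sm.OLS(s2, sm.add_constant(market)).fit().resid
    return coint(resid1, resid2)

# Example usage once `data` is loaded below:
# score, pvalue, _ = coint_ex_market(data['ADBE'], data['MSFT'], data['SPY'])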
from backtester.dataSource.yahoo_data_source import YahooStockDataSource
from datetime import datetime

startDateStr = '2007/12/01'
endDateStr = '2017/12/01'
cachedFolderName = 'yahooData/'
dataSetId = 'testPairsTrading'
instrumentIds = ['SPY','AAPL','ADBE','SYMC','EBAY','MSFT','QCOM',
                 'HPQ','JNPR','AMD','IBM']
ds = YahooStockDataSource(cachedFolderName=cachedFolderName,
                          dataSetId=dataSetId,
                          instrumentIds=instrumentIds,
                          startDateStr=startDateStr,
                          endDateStr=endDateStr,
                          event='history')
data = ds.getBookDataByFeature()['Adj Close']
data.head(3)
Now let's try to find cointegrated pairs using our method.
# Heatmap to show the p-values of the cointegration test
# between each pair of stocks
scores, pvalues, pairs = find_cointegrated_pairs(data)
import seaborn
m = [0,0.2,0.4,0.6,0.8,1]
seaborn.heatmap(pvalues, xticklabels=instrumentIds,
                yticklabels=instrumentIds, cmap='RdYlGn_r',
                mask=(pvalues >= 0.98))
plt.show()
print(pairs)
[('ADBE', 'MSFT')]
Looks like 'ADBE' and 'MSFT' are cointegrated. Let's take a look at the prices to make sure this actually makes sense.
S1 = data['ADBE']
S2 = data['MSFT']
score, pvalue, _ = coint(S1, S2)
print(pvalue)
ratios = S1 / S2
ratios.plot()
plt.axhline(ratios.mean())
plt.legend(['Ratio'])
plt.show()
The ratio does look like it moved around a stable mean. The absolute ratio isn't very useful in statistical terms. It is more helpful to normalize our signal by treating it as a z-score. The z-score is defined as:
Z Score(Value) = (Value - Mean) / Standard Deviation
WARNING
In practice this is usually done to try to give some scale to the data, but it assumes an underlying distribution, usually normal. However, much financial data is not normally distributed, and we must be very careful not to simply assume normality, or any specific distribution, when generating statistics. The true distribution of ratios could be very fat-tailed and prone to extreme values, messing up our model and resulting in large losses.
def zscore(series):
    return (series - series.mean()) / np.std(series)

zscore(ratios).plot()
plt.axhline(zscore(ratios).mean())
plt.axhline(1.0, color='red')
plt.axhline(-1.0, color='green')
plt.show()
It's now easier to observe that the ratio moves around the mean, but it is sometimes prone to large divergences from the mean, which we can take advantage of.
Now that we've talked about the basics of a pairs trading strategy, and identified cointegrated securities based on historical prices, let's try to develop a trading signal. First, let's recap the steps in developing a trading signal using data techniques:
- Collect reliable data and clean the data
- Create features from the data to identify a trading signal/logic
- Features can be moving averages or ratios of price data, correlations or more complex signals; combine these to create new features
- Generate a trading signal using these features, i.e. which instruments are a buy, a sell or neutral
Step 1: Set up your problem
Here we are trying to create a signal that tells us if the ratio is a buy or a sell at the next instant in time, i.e. our prediction variable Y:
Y = Ratio is buy (1) or sell (-1)
Y(t) = Sign(Ratio(t+1) - Ratio(t))
Note that we don't need to predict actual stock prices, or even the actual value of the ratio (though we could), just the direction of the next move in the ratio.
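As a one-line sketch (using the `ratios` series computed earlier; the variable name `Y_target` is just illustrative), this target is simply the sign of the next change in the ratio:

# +1 if the ratio rises at the next step, -1 if it falls
# (the final observation has no "next" value and is left as NaN)
Y_target = np.sign(ratios.shift(-1) - ratios)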
Step 2: Collect Reliable and Accurate Data
Auquan Toolbox is your friend here! You only have to specify the stocks you want to trade and the datasource to use, and it pulls the required data and cleans it for dividends and stock splits. So our data here is already clean.
We are using the following data from Yahoo at daily intervals for trading days over the last 10 years (~2500 data points): Open, Close, High, Low and Trading Volume.
Step 3: Split Data
Don't forget this super important step to test the accuracy of your models. We're using the following Training/Validation/Test split:
- Training: 7 years, ~70%
- Test: ~3 years, 30%
ratios = data['ADBE'] / data['MSFT']
print(len(ratios))
train = ratios[:1762]
test = ratios[1762:]
Ideally we should also make a validation set, but we will skip this for now.
Step 4: Feature Engineering
What could relevant features be? We want to predict the direction of the ratio's next move. We've seen that our two securities are cointegrated, so the ratio tends to move around and revert back to the mean. It seems our features should be certain measures of the mean of the ratio and of the divergence of the current value from the mean, to be able to generate our trading signal.
Let's use the following features:
- 60 day Moving Average of Ratio: measure of the rolling mean
- 5 day Moving Average of Ratio: measure of the current value of the mean
- 60 day Standard Deviation
- z score: (5d MA - 60d MA) / 60d SD
ratios_mavg5 = train.rolling(window=5, center=False).mean()
ratios_mavg60 = train.rolling(window=60, center=False).mean()
std_60 = train.rolling(window=60, center=False).std()
zscore_60_5 = (ratios_mavg5 - ratios_mavg60)/std_60

plt.figure(figsize=(15,7))
plt.plot(train.index, train.values)
plt.plot(ratios_mavg5.index, ratios_mavg5.values)
plt.plot(ratios_mavg60.index, ratios_mavg60.values)
plt.legend(['Ratio', '5d Ratio MA', '60d Ratio MA'])
plt.ylabel('Ratio')
plt.show()

plt.figure(figsize=(15,7))
zscore_60_5.plot()
plt.axhline(0, color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Rolling Ratio z-Score', 'Mean', '+1', '-1'])
plt.show()
The z-score of the rolling means really brings out the mean-reverting nature of the ratio!
Step 5: Model Selection
Let's start with a really simple model. Looking at the z-score graph, we can see that whenever the z-score feature gets too high or too low, it tends to revert back. Let's use +1/-1 as our thresholds for too high and too low, then we can use the following model to generate a trading signal (a short code sketch follows this list):
- Ratio is buy (1) whenever the z-score is below -1.0, because we expect the z-score to go back up to 0, hence the ratio to increase
- Ratio is sell (-1) whenever the z-score is above 1.0, because we expect the z-score to come back down to 0, hence the ratio to decrease
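A minimal vectorized sketch of this rule, using the `zscore_60_5` feature from the previous step (the variable name `signal` is illustrative):

# +1 = buy the ratio, -1 = sell the ratio, 0 = do nothing
signal = pd.Series(0, index=zscore_60_5.index)
signal[zscore_60_5 < -1] = 1
signal[zscore_60_5 > 1] = -1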
Step 6: Train, Validate and Optimize
Finally, let's see how our model actually does on real data. Let's see what this signal looks like on actual ratios:
# Plot the ratios and buy and sell signals from z score
plt.figure(figsize=(15,7))
train[60:].plot()
buy = train.copy()
sell = train.copy()
buy[zscore_60_5 > -1] = 0
sell[zscore_60_5 < 1] = 0
buy[60:].plot(color='g', linestyle='None', marker='^')
sell[60:].plot(color='r', linestyle='None', marker='^')
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, ratios.min(), ratios.max()))
plt.legend(['Ratio', 'Buy Signal', 'Sell Signal'])
plt.show()
The signal seems reasonable: we seem to sell the ratio (red dots) when it is high or increasing, and buy it back when it's low (green dots) and decreasing. What does that mean for the actual stocks that we are trading? Let's take a look.
# Plot the prices and buy and sell signals from z score
plt.figure(figsize=(18,9))
S1 = data['ADBE'].iloc[:1762]
S2 = data['MSFT'].iloc[:1762]
S1[60:].plot(color='b')
S2[60:].plot(color='c')
buyR = 0*S1.copy()
sellR = 0*S1.copy()

# When buying the ratio, buy S1 and sell S2
buyR[buy!=0] = S1[buy!=0]
sellR[buy!=0] = S2[buy!=0]
# When selling the ratio, sell S1 and buy S2
buyR[sell!=0] = S2[sell!=0]
sellR[sell!=0] = S1[sell!=0]

buyR[60:].plot(color='g', linestyle='None', marker='^')
sellR[60:].plot(color='r', linestyle='None', marker='^')
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, min(S1.min(), S2.min()), max(S1.max(), S2.max())))
plt.legend(['ADBE', 'MSFT', 'Buy Signal', 'Sell Signal'])
plt.show()
Notice how we sometimes make money on the short leg and sometimes on the long leg, and sometimes both.
We're happy with our signal on the training data. Let's see what kind of profits this signal can generate. We can make a simple backtester which buys 1 ratio (buy 1 ADBE stock and sell ratio x MSFT stock) when the ratio is low, sells 1 ratio (sell 1 ADBE stock and buy ratio x MSFT stock) when it's high, and calculates the PnL of these trades.
# Trade using a simple strategy
def trade(S1, S2, window1, window2):
    # If window length is 0, algorithm doesn't make sense, so exit
    if (window1 == 0) or (window2 == 0):
        return 0

    # Compute rolling mean and rolling standard deviation
    ratios = S1/S2
    ma1 = ratios.rolling(window=window1, center=False).mean()
    ma2 = ratios.rolling(window=window2, center=False).mean()
    std = ratios.rolling(window=window2, center=False).std()
    zscore = (ma1 - ma2)/std

    # Simulate trading
    # Start with no money and no positions
    money = 0
    countS1 = 0
    countS2 = 0
    for i in range(len(ratios)):
        # Sell short if the z-score is > 1
        if zscore[i] > 1:
            money += S1[i] - S2[i] * ratios[i]
            countS1 -= 1
            countS2 += ratios[i]
            print('Selling Ratio %s %s %s %s'%(money, ratios[i], countS1, countS2))
        # Buy long if the z-score is < -1
        elif zscore[i] < -1:
            money -= S1[i] - S2[i] * ratios[i]
            countS1 += 1
            countS2 -= ratios[i]
            print('Buying Ratio %s %s %s %s'%(money, ratios[i], countS1, countS2))
        # Clear positions if the z-score is between -0.75 and 0.75
        elif abs(zscore[i]) < 0.75:
            money += S1[i] * countS1 + S2[i] * countS2
            countS1 = 0
            countS2 = 0
            print('Exit pos %s %s %s %s'%(money, ratios[i], countS1, countS2))
    return money

trade(data['ADBE'].iloc[:1763], data['MSFT'].iloc[:1763], 5, 60)
629.71
So that strategy seems profitable! Now we can optimize further by changing our moving average windows, by changing the thresholds for buy/sell and exit positions, etc., and check for performance improvements on validation data.
We could also try more sophisticated models like Logistic Regression, SVM etc. to make our +1/-1 predictions.
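As a flavour of what that could look like, here is a rough, untuned sketch using scikit-learn's logistic regression on the same rolling features (illustrative only: the feature set, the target construction and the lack of any validation are all simplifying assumptions, not the post's method):

from sklearn.linear_model import LogisticRegression

# Features: rolling z-score and the two moving averages from Step 4
features = pd.concat([zscore_60_5, ratios_mavg5, ratios_mavg60], axis=1).dropna()
# Target: direction of the next move in the training ratio
target = np.sign(train.shift(-1) - train).reindex(features.index)
mask = target.notna() & (target != 0)   # drop flat days and the final NaN

clf = LogisticRegression()
clf.fit(features[mask], target[mask])
predictions = clf.predict(features[mask])   # +1 / -1 calls on the ratio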
For now, let's say we decide to go forward with this model. This brings us to:
Step 7: Backtest on Test Data
Backtesting is simple; we can just use our function from above to see the PnL on test data.
trade(data['ADBE'].iloc[1762:], data['MSFT'].iloc[1762:], 5, 60)
1017.61
The model does quite well! This makes our first simple pairs trading model.
Avoid Overfitting
Before closing the discussion, we'd like to give special attention to overfitting. Overfitting is the most dangerous pitfall of a trading strategy. An overfit algorithm may perform wonderfully on a backtest but fail miserably on new unseen data; this means it has not really uncovered any trend in the data and has no real predictive power. Let's take a simple example.
In our model, we used rolling parameter estimates and may wish to optimize the window length. We may decide to simply iterate over all possible, reasonable window lengths and pick the length based on which our model performs best. Below we write a simple loop to score window lengths based on the PnL of training data and find the best one.
# Find the window length 0-254
# that gives the highest returns using this strategy
length_scores = [trade(data['ADBE'].iloc[:1762],
                       data['MSFT'].iloc[:1762], l, 5)
                 for l in range(255)]
best_length = np.argmax(length_scores)
print('Best window length:', best_length)
Best window length: 246
Now we check the performance of our model on test data, and we find that this window length is far from optimal! This is because our original choice was clearly overfitted to the sample data.
# Find the returns for test data
# using what we think is the best window length
length_scores2 = [trade(data['ADBE'].iloc[1762:],
                        data['MSFT'].iloc[1762:], l, 5)
                  for l in range(255)]
print(best_length, 'day window:', length_scores2[best_length])

# Find the best window length based on this dataset,
# and the returns using this window length
best_length2 = np.argmax(length_scores2)
print(best_length2, 'day window:', length_scores2[best_length2])
1 day window: 10.06
218 day window: 527.92
Clearly fitting to our sample data doesn't always give good results in the future. Just for fun, let's plot the length scores computed from the two datasets.
plt.figure(figsize=(15,7))
plt.plot(length_scores)
plt.plot(length_scores2)
plt.xlabel('Window length')
plt.ylabel('Score')
plt.legend(['Training', 'Test'])
plt.show()
We can see that anything above about 90 would have been a good choice for our window.
To avoid overfitting, we can use economic reasoning or the nature of our algorithm to pick the window length. We could also use Kalman filters, which do not require us to specify a length; this method will be covered in another notebook later.
Next Steps
In this post, we presented some simple introductory approaches to demonstrate the process of developing a pairs trading strategy. In practice one should use more sophisticated statistics, some of which are listed here:
- Hurst exponent
- Half-life of mean reversion inferred from an Ornstein–Uhlenbeck process (a rough sketch of one such estimate follows this list)
- Kalman filters
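As a flavour of the half-life idea above, a common back-of-the-envelope sketch (one of several possible estimators, shown purely for illustration) regresses daily changes in the ratio on its lagged level and converts the slope into a half-life of mean reversion:

import numpy as np
import statsmodels.api as sm

def half_life(series):
    """Rough half-life of mean reversion from an AR(1)/Ornstein-Uhlenbeck style fit."""
    lagged = series.shift(1).dropna()
    delta = series.diff().dropna()
    beta = sm.OLS(delta, sm.add_constant(lagged)).fit().params.iloc[1]
    return -np.log(2) / beta   # in the same units as the index (trading days here)

# e.g. half_life(data['ADBE'] / data['MSFT'])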