“Data science” and the 2018 stock market crash & recovery

This is more of a “look how easy pandas & friends make it to investigate things” post than anything fully baked; closer to a tutorial than anything else. But it’s small, self-contained, and not an entirely contrived example, so I thought it was worth sharing.

As we all know, in late 2018 the stock market crashed. Well, almost: it didn’t quite meet the arbitrary criterion (a 20% drop based on closing prices) that now seems to be universally adopted as the definition of a bear market. But if you press people about why it “doesn’t count,” few will argue that a few tenths of a percentage point are really that important. Instead they’ll usually talk about how the recovery was “too fast,” and that’s why it doesn’t count as a real crash.
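As an aside, the closing-price drawdown that the 20% definition rests on is easy to compute with pandas. A minimal sketch on a made-up price series (not the real S&P data):

```python
import pandas as pd

# Toy closing prices: a run-up, a drop, and a partial recovery (made-up numbers).
close = pd.Series([100.0, 110.0, 95.0, 90.0, 100.0, 105.0])

# Drawdown on each day: how far the close sits below the running peak so far.
running_peak = close.cummax()
drawdown = 1 - close / running_peak

max_drawdown = drawdown.max()
print(max_drawdown)  # ~0.1818: an 18.2% drop, so not a "bear market" by the 20% rule
```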

So that made me wonder. Was it “too fast”? What would that even mean? How does the crash and recovery of 2018 actually compare to other stock market gyrations?

```python
import pandas
from scipy.signal import find_peaks
from matplotlib import pyplot as plt
import seaborn

seaborn.set(style='whitegrid')
seaborn.set_context('poster')
```

First let’s set up our imports. We change the default styling & context for seaborn to make the images a bit bigger & more readable.

^GSPC.csv is a file of daily S&P 500 price quotes downloaded from Yahoo Finance.

```python
gspc = pandas.read_csv('^GSPC.csv', parse_dates=True, index_col='Date')
```

All we really care about is the Adjusted Close, so let’s make a new variable that makes it easier to refer to that.

```python
close = gspc['Adj Close']
```

Now comes the magic part. scipy.signal has a function called find_peaks, which is how we’ll determine the peak before a crash.

```python
peak_indices = find_peaks(close, width=10)[0]
```

It “finds all local maxima by simple comparison of neighbouring values”.
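To get a feel for it, here’s a tiny self-contained example (toy numbers, not the S&P data) showing how the width argument used above filters out narrow spikes:

```python
import numpy as np
from scipy.signal import find_peaks

# A toy signal with one broad peak (index 4) and one narrow spike (index 9).
signal = np.array([0, 1, 2, 3, 4, 3, 2, 1, 0, 5, 0])

# Unconstrained, find_peaks flags every point higher than its neighbours.
indices, _ = find_peaks(signal)
print(indices)  # [4 9]

# Requiring a minimum width keeps only the broad peak.
broad, _ = find_peaks(signal, width=3)
print(broad)  # [4]
```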

```python
plt.figure(figsize=(20, 10))
ax = close.plot(logy=True)
for idx in peak_indices:
    x = close.index[idx]   # the date of the peak
    y = close.iloc[idx]    # the price at the peak (positional lookup)
    ax.plot(x, y, 'rd')
seaborn.despine(ax=ax, left=True, bottom=True, offset=20)
ax.set_title('S&P 500 Price Index')
```

Now we want to plot the price data and add a separate marker, a red diamond, at the (x, y) coordinate of each of the peaks we detected.

Here’s a zoomed-in view of what 2018 looks like:

```python
peak_pairs = [(close.index[idx], close.iloc[idx]) for idx in peak_indices]
```

The result of find_peaks only gives us the integer positions of the peaks; we need to convert those into (x, y) coordinates to be useful for the rest of what we want to do.

```python
def crash(date, price):
    after = close[date:]  # this includes the peak date
    after = after[1:]     # this excludes it
    recovered = after[after >= price].head(1)  # the first time we reach the original value
    if recovered.empty:
        return (date, None, None)
    else:
        r_len = recovered.index - date    # time delta between the two
        r_len = r_len.days.to_numpy()[0]  # strip the indexification
        r_date = recovered.index.to_numpy()[0]
        lowest = after[:r_date].min()     # low point between peak & recovery
        drop = 1 - lowest / price
        return (date, drop, r_len)
```

Now that we’ve detected the peaks, meaning we know when the crashes (or at least mini-crashes) happened, we want to calculate how long each took to recover and how deep it was (in percentage terms).

```python
crash_df = pandas.DataFrame.from_records([crash(*p) for p in peak_pairs],
                                         index='Date',
                                         columns=['Date', 'Percent Crash', 'Recovery Days'])
crash_df = crash_df[crash_df['Percent Crash'] >= 0.1]
crash_df = crash_df[crash_df['Recovery Days'] < 7_000]
```

Now we can create a new pandas DataFrame holding information about each crash we’ve detected: the date it started, the percentage drop, and how long it took for the price to recover.

I’ve also filtered out crashes of less than 10% and crashes that took more than 7,000 days to recover. In practice that means “filter out the big crashes of the Great Depression.” Those outliers don’t actually change the results dramatically, but they blow up the charts and make them hard to read. (I’ll show you what I mean in a minute.)

We know the 2018 crash started on September 20th. Let’s find that in our new DataFrame and store away the (x,y) coordinates. We’ll use them later to annotate our chart.

```python
i_crash = crash_df['2018-09-20':]
i_xy = (i_crash['Percent Crash'].iloc[0], i_crash['Recovery Days'].iloc[0])
```

Now — finally — we’ve got everything in place. We want to make a regression plot of “percentage crash” versus “length of recovery”. And we want to annotate the chart pointing out where the 2018 crash is.

```python
plt.figure(figsize=(20, 10))
ax = seaborn.regplot(data=crash_df, x='Percent Crash', y='Recovery Days')
seaborn.despine(ax=ax, left=True, bottom=True, offset=20)
ax.set_title('Crash & Recovery')
ax.annotate("2018 crash",
            xy=i_xy,
            xytext=(i_xy[0] + .1, i_xy[1] + 3_000),
            arrowprops=dict(facecolor='black', connectionstyle="arc3,rad=.2"),
            fontsize=20)
```

So the 2018 crash falls below the regression line…but it doesn’t really look like a notable outlier in how quickly it recovered. Check out all the dots to the right — those are even bigger crashes that recovered just as quickly.

That said, maybe people’s intuition about “real crashes” means something like “a crash that lies above the regression line”. Because that’s what it takes to wring irrational exuberance out of the system? The market is resilient enough to shake off all those intermediate gyrations?
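That intuition can be made concrete: fit the regression, then call a crash “real” when its recovery took longer than the fit predicts for its depth, i.e. when its residual is positive. A sketch with hypothetical numbers standing in for crash_df:

```python
import numpy as np

# Hypothetical (depth, recovery) pairs; the real values live in crash_df.
drop = np.array([0.10, 0.15, 0.20, 0.30, 0.35])
recovery_days = np.array([60.0, 150.0, 140.0, 400.0, 500.0])

# Ordinary least-squares line: recovery time as a function of crash depth.
slope, intercept = np.polyfit(drop, recovery_days, 1)

# Positive residual means the recovery was slower than the line predicts,
# i.e. the point lies above the regression line.
residuals = recovery_days - (slope * drop + intercept)
above_line = residuals > 0
```

seaborn’s regplot fits an ordinary least-squares line by default, so the residual sign here matches whether a dot sits above or below the plotted line.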

I don’t have any strong conclusions here. The claim about the recovery being “too fast” doesn’t quite ring true; there seem to have been lots of crashes just as deep that took much longer to recover. But there’s an additional twist: investor psychology matters. If people think the recovery was “too fast,” then data isn’t going to change how they invest.

Remember how I filtered out the really big crashes of the Great Depression because I said they made the chart hard to read? Here’s the same chart but with the three Great Depression peaks added back in. They’re the three dots in the top right.

You can see just how much of an outlier they are, and also how they make the chart hard to read by squishing everything else down. They don’t appear to really change the results of the regression, either.

Why three peaks? The find_peaks algorithm found local maxima at September 16, 1929; April 10, 1930; and September 10, 1930. The April 10 peak came after the market had been recovering for nearly five months. The September 10 peak came after a shorter four-month mini-recovery.

We can see them here and imagine what an investor back then would have felt after seeing months of recovery and having their hopes dashed as the market plunged lower:
