SARIMA Modeling
For the last couple of weeks I have been walking through time series modeling. In today's post I want to continue that topic and focus on the SARIMA model. This analysis assumes that you have read my previous articles on ARIMA modeling and time series analysis, so take the time to get familiar with those basics before diving into this one.
What is SARIMA:
As we recall from last week, an ARIMA model is an algorithm used to analyze and forecast time series data. The model includes three terms: an autoregressive term (p) indicating the correlation between each data point and the points before it, an integrated term (d) indicating how many times the series needed to be differenced to reach stationarity, and a moving average term (q) indicating the correlation between each data point's error and the errors of the points before it. These three terms are called the order of the model, and the model is written as ARIMA(p,d,q). ARIMA is a very good model for forecasting time series, but it does have some drawbacks. One of the major drawbacks is that ARIMA does not work well when the time series shows evidence of seasonality. Seasonality is the presence of repeating patterns within a time series that occur at a regular frequency. For example, depending on the data in question, you may have daily, weekly, monthly or yearly repeating patterns. To take this one step further, let's look at the daily average temperature in Austin, Texas from 12/21/2013 to 7/31/2017.
Here we see the temperature change over roughly three and a half years. Each year is marked by the same general pattern in the data: it starts cold, heats up through spring and summer, then cools down through fall and into winter. This data would not be ideal for ARIMA modeling because of the clear seasonal pattern.
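As an aside, if you want to confirm this seasonality programmatically rather than just by eye, a seasonal decomposition is a quick check. Here is a minimal sketch using statsmodels' seasonal_decompose, assuming a reasonably recent statsmodels (which takes the period argument) and the same austin_weather.csv file used later in this post:
# A quick check for seasonality: decompose the series into trend,
# seasonal and residual components
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

df = pd.read_csv('austin_weather.csv', parse_dates=['Date'], index_col='Date')
# period=365 because the daily data repeats on a yearly cycle
seasonal_decompose(df.TempAvgF, period=365).plot()
plt.show()
A strong, regular wave in the seasonal component confirms what we see in the plot above.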
In this instance we would use a SARIMA model. SARIMA stands for Seasonal AutoRegressive Integrated Moving Average, and it builds directly upon the ARIMA model. For SARIMA modeling, we use the same order that we would for an ARIMA model but also include a seasonal order. The seasonal order consists of four terms: P, D, Q and s. Three of these terms, P, D, and Q, are counterparts to the p, d, and q terms in the order, but they apply only to the seasonal component of the data. The s term represents the length of the seasonal cycle, that is, how many observations make up one full cycle. For our daily data the s value is 365 because the pattern repeats every 365 days. Daily data with a monthly cycle would have an s value close to 30, a weekly cycle would have an s value of 7, and monthly data with a yearly cycle (which we will use below) would have an s value of 12. Once we have our order and seasonal order configured, we write the model as SARIMA(p,d,q)(P,D,Q)s.
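To make the notation concrete, here is a minimal sketch of how the order and seasonal order map onto the SARIMAX class from statsmodels, which we will use to build our model below. The numbers here are placeholders, not fitted values, and data stands in for the monthly series we create later:
from statsmodels.tsa.statespace.sarimax import SARIMAX

# SARIMA(p,d,q)(P,D,Q)s maps directly onto the SARIMAX arguments:
# order=(p, d, q) and seasonal_order=(P, D, Q, s).
# For example, SARIMA(1,1,1)(1,1,1)12 on a monthly series called data:
model = SARIMAX(data, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))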
Building a SARIMA model:
Now that we know what a SARIMA model is, let's go over how we can build one in Python. For this example, I will be using the daily temperature data from Austin. This data was obtained from Kaggle, and you can check out the source here. For a closer look at this example, check out this repository on my GitHub.
To start we load some basic libraries and our data:
# Import Pandas and NumPy
import pandas as pd
import numpy as np

# Import plotting libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')

# Load data and make the date column into datetimes
df = pd.read_csv('austin_weather.csv')
df.Date = pd.to_datetime(df.Date)

# Plot average daily temperature
fig = plt.figure(figsize=(16,6))
sns.lineplot(x=df.Date, y=df.TempAvgF)
plt.title('Daily Average Temperature - Austin')
plt.show()
This code block gives us the graph we saw above. The daily data is noisy and takes a long time to model, so for the purposes of this demonstration we will resample it into monthly averages.
# Resample the daily data into monthly averages
df1 = df.resample('M', on='Date').mean()
data = df1.TempAvgF

fig = plt.figure(figsize=(16,6))
sns.lineplot(x=data.index, y=data)
plt.title('Monthly Average Temperature - Austin')
plt.show()
From previous explorations with time series analysis, we know that the next step is to check the stationarity of our time series. For this, we run an Augmented Dickey-Fuller test.
# Run an Augmented Dickey-Fuller test
from statsmodels.tsa.stattools import adfuller

results = adfuller(data)
output = pd.Series(results[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key, value in results[4].items():
    output['Critical Value (%s)' % key] = value
print(output)
We can see from the p-value that the data is not stationary (the null hypothesis of the test is that a unit root is present, and a p-value this large means we cannot reject it). This makes sense given the strong seasonality. At this point we will need to seasonally difference the data. We do that by subtracting from each data point the value at the same point in the previous cycle; in this case, we subtract from each value the value of the same month in the previous year. In Python we do this by calling .diff() and passing in the length of our cycle. Here we pass 12 since there are 12 months in each yearly cycle.
fig = plt.figure(figsize=(14,6))
seasonal_diff = data.diff(12).dropna()
seasonal_diff.plot()
plt.title('Seasonally Differenced Data')
plt.show()
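Note that data.diff(12) is just shorthand for data - data.shift(12), so each point has the value from twelve months earlier subtracted from it. The .dropna() call removes the first twelve months, which have nothing to difference against.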
We see that the data is much closer to stationary at this point. Let’s run an adfuller test again to check.
results = adfuller(seasonal_diff)
output = pd.Series(results[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key, value in results[4].items():
    output['Critical Value (%s)' % key] = value
print(output)
We can see that the time series is now stationary. At this point, let's take a look at the autocorrelation (ACF) and partial autocorrelation (PACF) plots to estimate our order and seasonal order. To estimate the order, we will look at the ACF and PACF of the series when differenced normally, with a single non-seasonal difference.
# Import the ACF and PACF plotting functions
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Create ACF plot of normally differenced series
plot_acf(data.diff().dropna(), lags=15)
plt.show()

# Create PACF plot of normally differenced series
plot_pacf(data.diff().dropna(), lags=15)
plt.show()
To estimate the order, we look at the lag where each plot's spikes drop inside the significance bounds. Our p parameter comes from the partial autocorrelation plot, and I would estimate it at 2 in this example. Our d is 1 because we differenced the time series once. The q parameter comes from the ACF plot, and in this case I would also estimate it to be 2. The estimated order for our SARIMA model is (2,1,2). Now we do the same thing with the ACF and PACF of the seasonally differenced data.
plot_acf(seasonal_diff, lags=15)
plt.show()

plot_pacf(seasonal_diff, lags=15)
plt.show()
From the ACF and PACF of the seasonally differenced data, I would estimate the seasonal order of our model to be (0, 1, 0, 12). Since there are so many possible combinations of parameters between the order and seasonal order, we will use a grid search to check ourselves and find the best combination. In the following cell, we try every combination of p, d, q, P, D, and Q from a range of possibilities, fit a model for each order and seasonal order, and record the AIC as a measure of model fit. We will use the model with the smallest AIC.
# Import itertools.product and the SARIMAX algorithm
from itertools import product
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Set parameter ranges
p = range(0,3)
q = range(0,3)
d = range(0,2)
s = [12]

# Find all parameter combinations
pdq = list(product(p, d, q))
seasonal_pdq = list(product(p, d, q, s))

# Create a SARIMA model for each order and seasonal order.
# Note that we fit on the original series: SARIMAX applies the
# differencing specified by d and D itself.
aics = []
for order in pdq:
    for seasonal_order in seasonal_pdq:
        try:
            model = SARIMAX(data, order=order, seasonal_order=seasonal_order)
            results = model.fit(disp=False)  # disp=False silences optimizer output
            aics.append((order, seasonal_order, results.aic))
        except Exception:
            print('SARIMA{},{} - Skipped'.format(order, seasonal_order))

# Check for the smallest AIC
aics.sort(key=lambda x: x[2])
print(aics[0])
From this block of code, we find the order and seasonal order that minimize our AIC. We end up using SARIMA(2,1,0)(0,1,0)12. This is close to our initial estimate but gives a better fit. At this point, all that's left is to fit the model and check the output.
model = SARIMAX(data, order=(2,1,0), seasonal_order=(0,1,0,12))
fit = model.fit()
y_hat = fit.predict(1, len(data))

fig = plt.figure(figsize=(14,6))
plt.plot(data, label='Observed')
plt.plot(y_hat, label='Predicted')
plt.title('Predicted Temperatures')
plt.legend()
plt.show()
It looks like our model does a really good job of predicting how the temperature changes month to month. Let’s look at the forecast of temperature changes for the next year.
forecast = fit.forecast(12)

fig = plt.figure(figsize=(14,6))
plt.plot(data, label='Observed')
plt.plot(forecast, label='Predicted')
plt.title('Predicted Temperatures')
plt.legend()
plt.show()
Again we get good results from our predictions! The forecast matches the shape of the earlier years very well and includes the slight upward trend present in the data. Finally, let's look at the actual values that were forecast.
print(forecast)
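If you also want a sense of the uncertainty around these numbers, the fitted results object can produce confidence intervals through get_forecast. A minimal sketch, using the same 12-month horizon as above:
# Forecast 12 months ahead with 95% confidence intervals
pred = fit.get_forecast(steps=12)
print(pred.predicted_mean)
print(pred.conf_int())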
And there we have it! A functioning SARIMA model used to predict average monthly temperatures in Austin, Texas.
Resources:
For a closer look at the code for this walkthrough, check out my GitHub.
Click here to find out more about the statistics involved in SARIMA models.
And of course, don’t forget to read up on the documentation from Statsmodels.
Thanks for reading!