ARIMA Modeling
Last week I posted about the basics involved in time series modeling. Today I wanted to expand on that topic by going over one modeling algorithm in detail. Let’s talk about the ARIMA model and how to build it.
What is ARIMA?
An ARIMA model is a very common way to analyze a time series and forecast future values. The name ARIMA stands for AutoRegressive Integrated Moving Average. Let’s dive into each of the three components of this model in more detail. The autoregressive term, represented as a lag term (p) in the model, accounts for the dependent relationship between one observation and following observations. For example, stock prices are heavily dependent on the price of the previous day. The lag term shows how many lag observations are present in the model. This is like taking a partial difference and the magnitude of the coefficient gives an idea of the proportion of the difference that the model is using. The integration term (d) incorporates differencing into the model. The magnitude of the d term indicates how many times the time series needs to be differenced in order to become stationary. Finally, the moving average (q) term accounts for the dependency of one observation the error of an earlier observation. The magnitude of this value indicates the size of the window of past error terms to look at.
Once identified, the p, d and q values are called the order of the ARIMA and the model is written ARIMA(p,d,q). If any of our parameters are set to 0, then we simply do not use that portion of the ARIMA model. For instance, if our p value is 0, then we don’t really need the autoregressive portion of the model, and instead could use an Integrated Moving Average model. In this way, the ARIMA model incorporates every combination of AR, I, and MA models. Once our order is set, the model constructs a linear regression and fits it to our data.
Building an ARIMA
Now that I have explained how an ARIMA model functions, let’s go over how we can build out the model. To help explain the concept, I will use an example. For this code-along, I will use a basic dataset involving airline passengers. The data is easily obtained from kaggle if you want to follow along. As always, the first thing we need to do is load in key libraries and the dataset.
# Load libraries for data manipulation and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns# Load dataset
data = pd.read_csv('AirPassengers.csv')# View data
data.head()
From a quick look at the data, there are a couple of things that we need to address. First, for ease of workability, the Months column should be converted to a datetime format. Second, we need to set the date as an index to create a functional time series. Once this is complete, we can plot the series to see what our data looks like.
# Convert Month column to datetime
data.Month = pd.to_datetime(data.Month)
# Set Month as index
data = data.set_index('Month')# Plot time series
fig = plt.figure(figsize=(14,8))
sns.lineplot(x = data.index, y = data['#Passengers'])
plt.title('Airline Passengers From 1949-1960')
plt.show()
Great, our time series is all set begin working with. From last week, we know that we need the series to be stationary to work with, so let’s check the stationarity.
# Import Augmented Dickey-Fuller test
from statsmodels.tsa.stattools import adfuller# Run Dickey-Fuller test and format results
dfresults = adfuller(data)
dfoutput = pd.Series(dfresults[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dfresults[4].items():
dfoutput['Critical Value (%s)'%key] = value
print(output)
With a p-value close to 1, the data is definitely not stationary. To continue, I logged transformed and differenced the data and re-ran the Dickey-Fuller test.
# Log transform data
log_transformed = np.log(data)# Run Dickey-Fuller test on differenced log-tranformed series
dfresults = adfuller(log_transformed.diff().dropna())
dfoutput = pd.Series(dfresults[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dfresults[4].items():
dfoutput['Critical Value (%s)'%key] = value
print(output)
Our p-value is still a little high, but it is good enough to continue with this example. Now we need to figure out what the order is going to be for our ARIMA. Luckily, the d term is very easy to figure out. We differenced our data 1 time to reach stationarity, so our d value is 1. Figuring out p and q is a bit trickier, and to accomplish this we will need the help of an autocorrelation function plot and a partial autocorrelation function plot.
# Import functions to quickly create these plots
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf# Plot autocorrelation function plot
plot_acf(log_transformed.diff().dropna())
plt.show()# Plot partial autocorrelation function plot
plot_pacf(log_transformed.diff().dropna())
plt.show()
To estimate an ideal value for p and q, we need to look at where these plots reach zero. The autocorrelation plot crosses zero between 1 and 2 lags. Because of this we can estimate our q value to be 1 or 2. For this example, we will use q=2. The p term is taken the same way from the partial autocorrelation plot. Again, the plot crosses zero between 1 and 2 lags. We will use p=2 for our ARIMA model. At this point we have identified our model to be ARIMA(2,1,2). Let’s look at how to build the model in Python.
# Import ARIMA model
from statsmodels.tsa.arima_model import ARIMA# Create ARIMA model by passing in the data and the order
model = ARIMA(log_transformed, order=(2,1,2))
# Fit model
fit_model = model.fit()
# Plot differenced time series with predictions
sns.lineplot(x = log_transformed.diff().dropna().index,
y = log_transformed.diff().dropna()['#Passengers'])
sns.lineplot(x = results.fittedvalues.index,
y = results.fittedvalues,
color='red')
plt.title('ARIMA Results')
plt.show()
We can see from the graph that the predicted values follow the shape of the observed values showing a pretty good fit. I calculated the root mean squared error and it was just about .085. Overall the model performs very well compared to the series itself, now let’s look at how we can quickly get predictions from the model.
# Use plot_predit to plot out predictions
# We start at our first value, 1 and predict for 204 months,
# ending 5 years after the end of the data.
results.plot_predict(start=1, end=204)
plt.title('Predicted Airline Passengers')
plt.show()
# To check the values of our predictions we use forecast
predictions = results.forecast(60)# We need to account for the log transformation
print(np.exp(predictions[0]))
And voila, a fully functional ARIMA model. Of course there is more detail that can be done here, but this is a good demonstration of how to quickly construct and get predictions from an ARIMA model. Thanks for reading!
Resources:
Make sure to read up on the documentation for the ARIMA model here.
Check out the code on my Github