# ARIMA Modeling

Last week I posted about the basics involved in time series modeling. Today I wanted to expand on that topic by going over one modeling algorithm in detail. Let’s talk about the ARIMA model and how to build it.

# What is ARIMA?

An ARIMA model is a very common way to analyze a time series and forecast future values. The name ARIMA stands for AutoRegressive Integrated Moving Average. Let’s dive into each of the three components of this model in more detail. The autoregressive term, represented as a lag term (p) in the model, accounts for the dependent relationship between one observation and following observations. For example, stock prices are heavily dependent on the price of the previous day. The lag term shows how many lag observations are present in the model. This is like taking a partial difference and the magnitude of the coefficient gives an idea of the proportion of the difference that the model is using. The integration term (d) incorporates differencing into the model. The magnitude of the d term indicates how many times the time series needs to be differenced in order to become stationary. Finally, the moving average (q) term accounts for the dependency of one observation the error of an earlier observation. The magnitude of this value indicates the size of the window of past error terms to look at.

Once identified, the p, d and q values are called the order of the ARIMA and the model is written ARIMA(p,d,q). If any of our parameters are set to 0, then we simply do not use that portion of the ARIMA model. For instance, if our p value is 0, then we don’t really need the autoregressive portion of the model, and instead could use an Integrated Moving Average model. In this way, the ARIMA model incorporates every combination of AR, I, and MA models. Once our order is set, the model constructs a linear regression and fits it to our data.

# Building an ARIMA

Now that I have explained how an ARIMA model functions, let’s go over how we can build out the model. To help explain the concept, I will use an example. For this code-along, I will use a basic dataset involving airline passengers. The data is easily obtained from kaggle if you want to follow along. As always, the first thing we need to do is load in key libraries and the dataset.

# Load libraries for data manipulation and visualization

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns# Load dataset

data = pd.read_csv('AirPassengers.csv')# View data

data.head()

From a quick look at the data, there are a couple of things that we need to address. First, for ease of workability, the Months column should be converted to a datetime format. Second, we need to set the date as an index to create a functional time series. Once this is complete, we can plot the series to see what our data looks like.

# Convert Month column to datetime

data.Month = pd.to_datetime(data.Month)

# Set Month as index

data = data.set_index('Month')# Plot time series

fig = plt.figure(figsize=(14,8))

sns.lineplot(x = data.index, y = data['#Passengers'])

plt.title('Airline Passengers From 1949-1960')

plt.show()

Great, our time series is all set begin working with. From last week, we know that we need the series to be stationary to work with, so let’s check the stationarity.

# Import Augmented Dickey-Fuller test

from statsmodels.tsa.stattools import adfuller# Run Dickey-Fuller test and format results

dfresults = adfuller(data)

dfoutput = pd.Series(dfresults[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])

for key,value in dfresults[4].items():

dfoutput['Critical Value (%s)'%key] = value

print(output)

With a p-value close to 1, the data is definitely not stationary. To continue, I logged transformed and differenced the data and re-ran the Dickey-Fuller test.

# Log transform data

log_transformed = np.log(data)# Run Dickey-Fuller test on differenced log-tranformed series

dfresults = adfuller(log_transformed.diff().dropna())

dfoutput = pd.Series(dfresults[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])

for key,value in dfresults[4].items():

dfoutput['Critical Value (%s)'%key] = value

print(output)

Our p-value is still a little high, but it is good enough to continue with this example. Now we need to figure out what the order is going to be for our ARIMA. Luckily, the d term is very easy to figure out. We differenced our data 1 time to reach stationarity, so our d value is 1. Figuring out p and q is a bit trickier, and to accomplish this we will need the help of an autocorrelation function plot and a partial autocorrelation function plot.

# Import functions to quickly create these plots

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf# Plot autocorrelation function plot

plot_acf(log_transformed.diff().dropna())

plt.show()# Plot partial autocorrelation function plot

plot_pacf(log_transformed.diff().dropna())

plt.show()

To estimate an ideal value for p and q, we need to look at where these plots reach zero. The autocorrelation plot crosses zero between 1 and 2 lags. Because of this we can estimate our q value to be 1 or 2. For this example, we will use q=2. The p term is taken the same way from the partial autocorrelation plot. Again, the plot crosses zero between 1 and 2 lags. We will use p=2 for our ARIMA model. At this point we have identified our model to be ARIMA(2,1,2). Let’s look at how to build the model in Python.

# Import ARIMA model

from statsmodels.tsa.arima_model import ARIMA# Create ARIMA model by passing in the data and the order

model = ARIMA(log_transformed, order=(2,1,2))

# Fit model

fit_model = model.fit()

# Plot differenced time series with predictions

sns.lineplot(x = log_transformed.diff().dropna().index,

y = log_transformed.diff().dropna()['#Passengers'])

sns.lineplot(x = results.fittedvalues.index,

y = results.fittedvalues,

color='red')

plt.title('ARIMA Results')

plt.show()

We can see from the graph that the predicted values follow the shape of the observed values showing a pretty good fit. I calculated the root mean squared error and it was just about .085. Overall the model performs very well compared to the series itself, now let’s look at how we can quickly get predictions from the model.

`# Use plot_predit to plot out predictions`

# We start at our first value, 1 and predict for 204 months,

# ending 5 years after the end of the data.

results.plot_predict(start=1, end=204)

plt.title('Predicted Airline Passengers')

plt.show()

# To check the values of our predictions we use forecast

predictions = results.forecast(60)# We need to account for the log transformation

print(np.exp(predictions[0]))

And voila, a fully functional ARIMA model. Of course there is more detail that can be done here, but this is a good demonstration of how to quickly construct and get predictions from an ARIMA model. Thanks for reading!

# Resources:

Make sure to read up on the documentation for the ARIMA model here.

Check out the code on my Github