Autoregressive Neural Network

A real-world project walking through the implementation of an autoregressive neural network for time series forecasting in R.

An autoregressive neural network (ARNN) is a neural network that makes predictions from previous values in a sequence: the current output is modeled as a function of past inputs and outputs. (The NNAR variant used in this project is a feed-forward network fed with lagged values, rather than a recurrent network with internal state.) ARNNs have gained significant popularity in ecology because they can capture complex patterns in time series data, handle a large number of input variables, and predict future trends from historical patterns.

In an ARNN, the prediction at each time step is based on previous inputs and outputs, and the model is trained to minimize the difference between predicted and actual outputs. Because each prediction depends on previous outputs, ARNNs have an inherently sequential nature.
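This idea can be sketched in base R with its simplest special case, a linear AR model: each value is regressed on its p predecessors, and a one-step forecast is built from the most recent lags. The toy series below is made up; the neural network generalizes this by passing the same lags through a hidden layer.

```r
# Minimal autoregressive sketch (base R, toy data): regress y[t] on its
# p previous values, then forecast one step ahead.
set.seed(1)
y <- sin(seq(0, 20, length.out = 200)) + rnorm(200, sd = 0.1)

p <- 3                              # number of lags used as inputs
n <- length(y)
X <- embed(y, p + 1)                # each row: y[t], y[t-1], ..., y[t-p]
fit <- lm(X[, 1] ~ X[, -1])         # current value ~ its p lagged values

newx <- c(1, y[n:(n - p + 1)])      # intercept + most recent p lags
pred <- sum(coef(fit) * newx)       # one-step-ahead forecast
```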

One of the most significant applications of ARNNs in ecology is predicting climate variables such as temperature, precipitation, and evapotranspiration: trained on historical climate data, they can forecast future trends in these variables. ARNNs can also be used to predict land-use change.

Here is a real-world project to walk through the implementation of an autoregressive neural network for time series forecasting in R.

Datasets

RH1.csv is a meteorological dataset containing daily relative air humidity (RH) and rain data from 2012-01-01 to 2016-08-31, extracted from GEE for the site “RSP_six_Forage / Senegal”.

A second dataset, RSP_Six_forages_rain.csv, contains rain data from 2012 to 2020. What is missing is the relative humidity (RH) for the period 2016-09-01 to 2018-12-31.

Objective

We want to predict the RH data that is missing (RH from 2016-09-01 to 2018-12-31).

# Import RH1.csv and subset variable of interest
data <- read.csv("RH1.csv", header = TRUE, dec = ",")
data <- data[, c("date", "RH", "rain")]
data$rain <- as.numeric(data$rain)
data$RH   <- as.numeric(data$RH)

# Convert RH into a time series object myts: frequency = 365 because we have one observation per day.
myts = ts(data$RH, start = c(2012,1), frequency = 365)

# Plot myts
plot(myts) # Clear seasonality

plot1

First assumptions after plotting myts

  • myts looks like a seasonal dataset
  • myts looks stationary (mean and variance stay constant throughout the time series)
  • there might be an absence of trend in myts
  • we notice spikes at regular time intervals (RH peaks occur in the wet season each year, at about the same period)

Check assumptions

plot(decompose(myts)) 

plot2

  • The plot shows clear seasonality and only a slight trend (a small decrease toward the end of the series)
  • The seasonality is already obvious from the decomposition, so we skip a formal autocorrelation check with acf()

Choice of the model to use

  • ARIMA? : very general, but not always the best option
  • Exponential smoothing? : seasonal datasets are easy to communicate with exponential smoothing, but it puts more weight on recent observations
  • Autoregression-based neural network : compresses the input variables over one or more layers into a single output and is well suited to seasonal datasets

We will go with the Autoregression based neural network for our prediction

Neural nets :

Neural networks (NNs) are among the more interesting machine learning models because of their brain-inspired structure and the ability to add much more complexity (think of deep learning with many hidden layers). A neural network can be thought of as a “network of neurons” (nodes) organized in layers, analogous to the way information passes through neurons in humans. Its advantage is its adaptive nature: it learns from the inputs provided and trains itself on the data.

In a NN :

  • An input vector is passed through one or more layers
  • Each layer consists of multiple neurons
  • A weight (importance) is assigned to each connection, and
  • The number of lags and hidden nodes can be selected automatically from the dataset : library “forecast” - nnetar()
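To make the layer/neuron/weight vocabulary concrete, here is a hypothetical hand-computed forward pass for a tiny one-hidden-layer feed-forward network (the same overall shape nnetar() builds; all weights below are made up for illustration):

```r
# Hypothetical forward pass: 3 inputs -> 2 hidden neurons -> 1 output.
set.seed(42)
x  <- c(0.5, -1.2, 0.3)                  # input vector (e.g. lagged RH values)
W1 <- matrix(runif(6, -1, 1), 2, 3)      # input -> hidden weights (2 x 3)
b1 <- c(0.1, -0.2)                       # hidden-layer biases
W2 <- c(0.8, -0.5)                       # hidden -> output weights
b2 <- 0.05                               # output bias

h   <- tanh(W1 %*% x + b1)               # hidden activations (squashed to [-1, 1])
out <- sum(W2 * h) + b2                  # linear output unit, as in nnetar()
```

Training a network means adjusting W1, b1, W2, b2 to minimize the gap between `out` and the observed value, which is exactly what nnetar() automates below.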

We have :

  • a primitive neural network model that works a bit like a linear regression, and
  • a multilayer feed-forward network, which is non-linear. It has a hidden (artificial) layer and works well on seasonal datasets like the one we have in myts

Multilayer feed-forward network model evaluation

Define training and testing sets

We split our RH time series (about 1705 daily relative humidity values starting from 2012-01-01) into a training set (80%) on which to train the neural net and a test set (20%) on which to evaluate the NN predictions.

#The original series (myts) is partitioned into training and validation periods.

nTest <- length(myts) * 0.2   # 20% of the dataset for validation
nTrain <- length(myts) - nTest  # 80% of the dataset for training

# Subset train data (80% of the data)
mytstrain <- window(myts, start = c(2012,1), end = c(2012,nTrain))

# Subset test set (20% of the data)
mytstest <- window(myts, start = c(2012,nTrain + 1),end = c(2012, nTrain + nTest))

Fitting the training RH data with the NNAR

require(forecast)
trainfit = nnetar(mytstrain)
trainfit

## Series: mytstrain 
## Model:  NNAR(21,1,12)[365] 
## Call:   nnetar(y = mytstrain)
## 
## Average of 20 networks, each of which is
## a 22-12-1 network with 289 weights
## options were - linear output units 
## 
## sigma^2 estimated as 13.46
  • the function nnetar of the forecast package uses the last 21 lags of the time series plus 1 seasonal lag as inputs, and 12 nodes in a single hidden layer, to generate 1 output: NNAR(21,1,12), following the NNAR(p, P, k) notation. [365] is the frequency (one observation per day)

  • The number of non-seasonal lags, p = 21, is selected automatically via an information criterion (e.g. the AIC on a linear AR model), and the seasonal lag P = 1 is the default. Because the neural network has a random component, the model is trained 20 times and the forecasts of the repetitions are averaged to give the final result; the inputs are scaled before fitting (scale.inputs = TRUE)
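If you prefer to control these choices rather than rely on the defaults, nnetar() exposes them as arguments. A sketch on a simulated seasonal series (the toy series is made up; the argument values mirror the fitted model above):

```r
library(forecast)

# Toy daily series with annual seasonality, standing in for mytstrain
set.seed(123)
y <- ts(60 + 20 * sin(2 * pi * (1:730) / 365) + rnorm(730, sd = 2),
        frequency = 365)

fit <- nnetar(y,
              p = 21,        # non-seasonal lags used as inputs
              P = 1,         # seasonal lags (the default)
              size = 12,     # neurons in the single hidden layer
              repeats = 20)  # networks trained and averaged
```

Fixing set.seed() before the call also makes the randomly initialized fit reproducible.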

Get the 5 last fitted RH values for the training period

tail(trainfit$fitted,5)

## Time Series:
## Start = c(2015, 265) 
## End = c(2015, 269) 
## Frequency = 365 
## [1] 53.81025 60.25025 62.54991 64.22241 54.13247

Forecast Relative Humidity values for validation period

nnetforecast.test <- forecast(trainfit, h = nTest, PI = F)
tail(nnetforecast.test$mean,5)

## Time Series:
## Start = c(2016, 241) 
## End = c(2016, 245) 
## Frequency = 365 
## [1] 18.16940 18.98637 18.45237 17.33455 16.81044

Plot

Let’s plot (1) the original RH values, (2) the fit in the training period, and (3) the neural network model predictions for the validation period.

#yrange=range(myts)
plot(c(2012,2017), c(2,150), type = "n", ylab = "Relative Humidity (%)",
     xlab = "Time", bty = "l", xaxt = "n",
     main = "Plots of the original RH & the neural network model")
axis(1, at=seq(2012,2017,1), labels=format(seq(2012,2017,1)))
lines(myts, lwd=2, lty=1)
lines(trainfit$fitted, lwd=1,col="blue", lty=1)
lines(nnetforecast.test$mean,lwd=1,col="red",lty=2)
legend(2012.5, 150, c("Original Series", "Neural network model fitted in training period", "Neural network model forecasted in validation period"), lty=c(1,1,2), 
       col=c("black", "blue", "red"), bty="n")

plot3

Accuracy of the model for Training and test period

accuracy(nnetforecast.test, mytstest)

##                        ME      RMSE       MAE       MPE      MAPE      MASE
## Training set  0.001999429  3.668254  2.744186 -2.835638  9.662814 0.2526669
## Test set     11.392350335 19.830478 14.757222 15.983163 45.509756 1.3587493
##                     ACF1 Theil's U
## Training set 0.009784273        NA
## Test set     0.953918751  4.060945

On the matrix of accuracy measures, the model was fitted to the training set with an RMSE of 3.67, which is not too high. Using the trained neural net to forecast RH for the test period, the RMSE increases, as expected. The MAPE (Mean Absolute Percentage Error) tells us to expect, on aggregate, about a 9.66% forecasting error on the training set and 45.51% on the test set (a rather high percentage).
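As a sanity check, the two headline metrics can be recomputed by hand. These are the standard definitions (generic helper functions, not part of the forecast package), illustrated on made-up values:

```r
# Standard error metrics, where the error is e = actual - predicted.
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mape <- function(actual, pred) 100 * mean(abs((actual - pred) / actual))

# Toy illustration with made-up values:
a <- c(50, 60, 70)
p <- c(48, 63, 66)
rmse(a, p)   # square root of the mean squared error
mape(a, p)   # mean of |error| / actual, expressed in percent
```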

Fit the entire full RH dataset with the NNAR

We now take the model parameters obtained from the training dataset (NNAR(21,1,12)) and apply them to the full dataset (myts = mytstrain + mytstest).

fit = nnetar(myts, p = 21, P = 1, size = 12)  # hidden-layer nodes are set with `size`
fit 

## Series: myts 
## Model:  NNAR(21,1,12)[365] 
## Call:   nnetar(y = myts, p = 21, P = 1, size = 12)
## 
## Average of 20 networks, each of which is
## a 22-12-1 network with 289 weights
## options were - linear output units 
## 
## sigma^2 estimated as 14.87

Accuracy of the model for the entire dataset

accuracy(fit)

##                        ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.004739791 3.856154 2.869723 -2.787154 10.22046 0.2399218
##                     ACF1
## Training set 0.006093659

With an RMSE = 3.86 and a MAPE = 10.22%, the model fits the full dataset relatively well.

Forecast Relative Humidity values for the next 852 days after 2016-08-31

We use the model fit for the entire dataset in order to make a forward forecast from the last RH value available (on 2016-08-31) to 2018-12-31.

nnetforecast <- forecast(fit, h = 852, PI = F) 

Plot forecasted values

require(ggplot2)
autoplot(nnetforecast) 

plot4

The model doesn’t reproduce the fluctuations observed at the peaks very well, and it was also unable to reproduce the seasonality. Luckily, we have another variable (rain) measured alongside the relative humidity. Precipitation can be used as an external regressor in our autoregressive neural network model.

Alternative solution using an external regressor (Rain) in the neural network autoregressive model

Fitting the new model using the precipitation (rain) as external regressor

fit2 = nnetar(myts, xreg = data$rain)
fit2 

## Series: myts 
## Model:  NNAR(28,1,16)[365] 
## Call:   nnetar(y = myts, xreg = data$rain)
## 
## Average of 20 networks, each of which is
## a 30-16-1 network with 513 weights
## options were - linear output units 
## 
## sigma^2 estimated as 8.31

Accuracy of the model for the entire dataset

accuracy(fit2)

##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.001920949 2.882668 2.109254 -1.964648 7.691006 0.1763431
##                      ACF1
## Training set -0.001815274

With an RMSE = 2.88 and a MAPE = 7.69%, the neural network autoregressive model performs better when the precipitation variable is used as an external regressor.

Defining the vector which we want to forecast

We will use rain from RSP_Six_forages_rain.csv for the period 2016-09-01 to 2018-12-31 (the period for which we want to forecast the RH)

# Import RSP_Six_forages_rain.csv and subset period of interest
y <- read.csv("RSP_Six_forages_rain.csv", header = TRUE, dec = ",")
y <- y[1706:2557, c("date", "rain")]
y$rain <- as.numeric(y$rain)
tail(y, 5)

##            date rain
## 2553 2018-12-27    0
## 2554 2018-12-28    0
## 2555 2018-12-29    0
## 2556 2018-12-30    0
## 2557 2018-12-31    0

Forecast Relative Humidity values for the rain vector defined

nnetforecast2 <- forecast(fit2, xreg = y$rain, PI = F)

autoplot(nnetforecast2) #Plot forecasted values

plot5

The model now captures the seasonality better.

Visualize model predictions

autoplot(nnetforecast2) +
        autolayer(nnetforecast, series = ("Forecast: NNAR model"), linetype = "dashed") +
        theme_minimal() +
        ylab("Relative Humidity (%)") 

plot6

It’s clear that the model performs better when we use an external regressor (prediction in blue). Unlike the first model’s prediction (red dashed line), it now better reproduces the seasonality of the RH observed in previous years. This works mainly because precipitation is correlated with RH: the RH peaks are generally observed during the wet season.

With the alternative model, we didn’t build a Ferrari but we have a car that we can take to the streets!

#Put forecasted values and corresponding dates in a dataframe

df2 <- data.frame(date = seq.Date(as.Date("2016-09-01"), length.out = length(nnetforecast2$mean), by = "day"),
                  RH = nnetforecast2$mean)

df3 <- data[, c("date", "RH")]
df3$date <- as.Date(df3$date)
df4 <- rbind(df3, df2)
tail(df4)

##            date       RH
## 2552 2018-12-26 76.64735
## 2553 2018-12-27 76.42674
## 2554 2018-12-28 76.23870
## 2555 2018-12-29 76.16043
## 2556 2018-12-30 75.95408
## 2557 2018-12-31 75.70750

Write new df containing RH from 2012 to 2018

#library(openxlsx)
#write new df containing RH from 2012 to 2020
#write.csv(df4, "RSP_Six_forage_RH_2012-2018.csv")
Yélognissè Frédi Agbohessou
Post-doctoral researcher

My research interests include Climate Change, Agroforestry, Food security in West Africa, Ecological modeling and Remote sensing.
