Loading in libraries
library(forecast)
## Warning: package 'forecast' was built under R version 4.0.4
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(ggplot2)
library(TTR)
## Warning: package 'TTR' was built under R version 4.0.4
library(stats)
library(readr)
library(fpp)
## Warning: package 'fpp' was built under R version 4.0.5
## Loading required package: fma
## Warning: package 'fma' was built under R version 4.0.5
## Loading required package: expsmooth
## Warning: package 'expsmooth' was built under R version 4.0.5
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: tseries
## Warning: package 'tseries' was built under R version 4.0.4
library(fpp2)
## Warning: package 'fpp2' was built under R version 4.0.5
##
## Attaching package: 'fpp2'
## The following objects are masked from 'package:fpp':
##
## ausair, ausbeer, austa, austourists, debitcards, departures,
## elecequip, euretail, guinearice, oil, sunspotarea, usmelec
library(seasonal)
## Warning: package 'seasonal' was built under R version 4.0.5
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(psych)
## Warning: package 'psych' was built under R version 4.0.5
##
## Attaching package: 'psych'
## The following object is masked from 'package:seasonal':
##
## outlier
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.0.5
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.0.5
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.0.5
## Registered S3 methods overwritten by 'ggfortify':
## method from
## autoplot.Arima forecast
## autoplot.acf forecast
## autoplot.ar forecast
## autoplot.bats forecast
## autoplot.decomposed.ts forecast
## autoplot.ets forecast
## autoplot.forecast forecast
## autoplot.stl forecast
## autoplot.ts forecast
## fitted.ar forecast
## fortify.ts forecast
## residuals.ar forecast
library(vars)
## Warning: package 'vars' was built under R version 4.0.5
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following objects are masked from 'package:fma':
##
## cement, housing, petrol
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.0.5
## Loading required package: sandwich
## Loading required package: urca
## Warning: package 'urca' was built under R version 4.0.4
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(AER)
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## Loading required package: survival
library(MASS)
Clearing everything to make it easier to work
rm(list = ls())
Excel spreadsheets and data to build my models from
dengue_features_test <- read.csv("C:\\Users\\wanga\\OneDrive\\Desktop\\PREDICTIVE ANALYTICS\\FINAL\\dengue_features_test.csv")
dengue_features_train <- read.csv("C:\\Users\\wanga\\OneDrive\\Desktop\\PREDICTIVE ANALYTICS\\FINAL\\dengue_features_train.csv")
dengue_labels_train <- read.csv("C:\\Users\\wanga\\OneDrive\\Desktop\\PREDICTIVE ANALYTICS\\FINAL\\dengue_labels_train.csv")
submission_format <- read.csv("C:\\Users\\wanga\\OneDrive\\Desktop\\PREDICTIVE ANALYTICS\\FINAL\\submission_format.csv")
So, we have four excel files worth of data. We first should merge by the training set and the labels which has the variable of interest we need.
The two excel files match in nrows so maybe we can just use cbind instead of merge.
dengue_merge_data.train <- cbind(dengue_labels_train, dengue_features_train)
summary(dengue_merge_data.train)
## city year weekofyear total_cases
## Length:1456 Min. :1990 Min. : 1.00 Min. : 0.00
## Class :character 1st Qu.:1997 1st Qu.:13.75 1st Qu.: 5.00
## Mode :character Median :2002 Median :26.50 Median : 12.00
## Mean :2001 Mean :26.50 Mean : 24.68
## 3rd Qu.:2005 3rd Qu.:39.25 3rd Qu.: 28.00
## Max. :2010 Max. :53.00 Max. :461.00
##
## city year weekofyear week_start_date
## Length:1456 Min. :1990 Min. : 1.00 Length:1456
## Class :character 1st Qu.:1997 1st Qu.:13.75 Class :character
## Mode :character Median :2002 Median :26.50 Mode :character
## Mean :2001 Mean :26.50
## 3rd Qu.:2005 3rd Qu.:39.25
## Max. :2010 Max. :53.00
##
## ndvi_ne ndvi_nw ndvi_se ndvi_sw
## Min. :-0.40625 Min. :-0.45610 Min. :-0.01553 Min. :-0.06346
## 1st Qu.: 0.04495 1st Qu.: 0.04922 1st Qu.: 0.15509 1st Qu.: 0.14421
## Median : 0.12882 Median : 0.12143 Median : 0.19605 Median : 0.18945
## Mean : 0.14229 Mean : 0.13055 Mean : 0.20378 Mean : 0.20231
## 3rd Qu.: 0.24848 3rd Qu.: 0.21660 3rd Qu.: 0.24885 3rd Qu.: 0.24698
## Max. : 0.50836 Max. : 0.45443 Max. : 0.53831 Max. : 0.54602
## NA's :194 NA's :52 NA's :22 NA's :22
## precipitation_amt_mm reanalysis_air_temp_k reanalysis_avg_temp_k
## Min. : 0.00 Min. :294.6 Min. :294.9
## 1st Qu.: 9.80 1st Qu.:297.7 1st Qu.:298.3
## Median : 38.34 Median :298.6 Median :299.3
## Mean : 45.76 Mean :298.7 Mean :299.2
## 3rd Qu.: 70.23 3rd Qu.:299.8 3rd Qu.:300.2
## Max. :390.60 Max. :302.2 Max. :302.9
## NA's :13 NA's :10 NA's :10
## reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## Min. :289.6 Min. :297.8
## 1st Qu.:294.1 1st Qu.:301.0
## Median :295.6 Median :302.4
## Mean :295.2 Mean :303.4
## 3rd Qu.:296.5 3rd Qu.:305.5
## Max. :298.4 Max. :314.0
## NA's :10 NA's :10
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## Min. :286.9 Min. : 0.00
## 1st Qu.:293.9 1st Qu.: 13.05
## Median :296.2 Median : 27.25
## Mean :295.7 Mean : 40.15
## 3rd Qu.:297.9 3rd Qu.: 52.20
## Max. :299.9 Max. :570.50
## NA's :10 NA's :10
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## Min. :57.79 Min. : 0.00
## 1st Qu.:77.18 1st Qu.: 9.80
## Median :80.30 Median : 38.34
## Mean :82.16 Mean : 45.76
## 3rd Qu.:86.36 3rd Qu.: 70.23
## Max. :98.61 Max. :390.60
## NA's :10 NA's :13
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## Min. :11.72 Min. : 1.357 Min. :21.40
## 1st Qu.:15.56 1st Qu.: 2.329 1st Qu.:26.30
## Median :17.09 Median : 2.857 Median :27.41
## Mean :16.75 Mean : 4.904 Mean :27.19
## 3rd Qu.:17.98 3rd Qu.: 7.625 3rd Qu.:28.16
## Max. :20.46 Max. :16.029 Max. :30.80
## NA's :10 NA's :10 NA's :43
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## Min. : 4.529 Min. :26.70 Min. :14.7
## 1st Qu.: 6.514 1st Qu.:31.10 1st Qu.:21.1
## Median : 7.300 Median :32.80 Median :22.2
## Mean : 8.059 Mean :32.45 Mean :22.1
## 3rd Qu.: 9.567 3rd Qu.:33.90 3rd Qu.:23.3
## Max. :15.800 Max. :42.20 Max. :25.6
## NA's :43 NA's :20 NA's :14
## station_precip_mm
## Min. : 0.00
## 1st Qu.: 8.70
## Median : 23.85
## Mean : 39.33
## 3rd Qu.: 53.90
## Max. :543.30
## NA's :22
dengue_merge_data.train$weekofyear = NULL
dengue_merge_data.train$city = NULL
dengue_merge_data.train$year = NULL
Setting week start date to actually be a date. We could also break the data frame into two for the two separate cities? No idea if necessary yet though.
dengue_merge_data.train$week_start_date <- as.Date(dengue_merge_data.train$week_start_date, format=c('%Y-%m-%d'))
Now we have essentially everything we need to build our models.
Running some exploratory data analysis. Boxplots always a nice place to start.
boxplot(dengue_merge_data.train$total_cases ~ dengue_merge_data.train$city)
boxplot(dengue_merge_data.train$total_cases ~ dengue_merge_data.train$city, outline = FALSE)
We can immediately see that sj is higher on average and in its outliers
boxplot(dengue_merge_data.train$total_cases ~ dengue_merge_data.train$weekofyear)
boxplot(dengue_merge_data.train$total_cases ~ dengue_merge_data.train$weekofyear, outline = FALSE)
We might notice that the late part of the year (around late fall/winter time) is when the number of cases ramp up. This also seems to hold true even when including the outlier points.
At row 936, that is the cutoff for city sj and the start of data for city iq. Similarly, there is over 10 years worth of data for each city so I think here that creating two separate data frames to analyze trend by city would be a decent option here
sj.data <- dengue_merge_data.train[1:936, ]
summary(sj.data)
## total_cases city year weekofyear
## Min. : 0.00 Length:936 Min. :1990 Min. : 1.00
## 1st Qu.: 9.00 Class :character 1st Qu.:1994 1st Qu.:13.75
## Median : 19.00 Mode :character Median :1999 Median :26.50
## Mean : 34.18 Mean :1999 Mean :26.50
## 3rd Qu.: 37.00 3rd Qu.:2003 3rd Qu.:39.25
## Max. :461.00 Max. :2008 Max. :53.00
##
## week_start_date ndvi_ne ndvi_nw ndvi_se
## Min. :1990-04-30 Min. :-0.40625 Min. :-0.45610 Min. :-0.01553
## 1st Qu.:1994-10-27 1st Qu.: 0.00450 1st Qu.: 0.01642 1st Qu.: 0.13928
## Median :1999-04-26 Median : 0.05770 Median : 0.06807 Median : 0.17719
## Mean :1999-04-26 Mean : 0.05792 Mean : 0.06747 Mean : 0.17766
## 3rd Qu.:2003-10-23 3rd Qu.: 0.11110 3rd Qu.: 0.11520 3rd Qu.: 0.21256
## Max. :2008-04-22 Max. : 0.49340 Max. : 0.43710 Max. : 0.39313
## NA's :191 NA's :49 NA's :19
## ndvi_sw precipitation_amt_mm reanalysis_air_temp_k
## Min. :-0.06346 Min. : 0.00 Min. :295.9
## 1st Qu.: 0.12916 1st Qu.: 0.00 1st Qu.:298.2
## Median : 0.16597 Median : 20.80 Median :299.3
## Mean : 0.16596 Mean : 35.47 Mean :299.2
## 3rd Qu.: 0.20277 3rd Qu.: 52.18 3rd Qu.:300.1
## Max. : 0.38142 Max. :390.60 Max. :302.2
## NA's :19 NA's :9 NA's :6
## reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## Min. :296.1 Min. :289.6 Min. :297.8
## 1st Qu.:298.3 1st Qu.:293.8 1st Qu.:300.4
## Median :299.4 Median :295.5 Median :301.5
## Mean :299.3 Mean :295.1 Mean :301.4
## 3rd Qu.:300.2 3rd Qu.:296.4 3rd Qu.:302.4
## Max. :302.2 Max. :297.8 Max. :304.3
## NA's :6 NA's :6 NA's :6
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## Min. :292.6 Min. : 0.00
## 1st Qu.:296.3 1st Qu.: 10.82
## Median :297.5 Median : 21.30
## Mean :297.3 Mean : 30.47
## 3rd Qu.:298.4 3rd Qu.: 37.00
## Max. :299.9 Max. :570.50
## NA's :6 NA's :6
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## Min. :66.74 Min. : 0.00
## 1st Qu.:76.25 1st Qu.: 0.00
## Median :78.67 Median : 20.80
## Mean :78.57 Mean : 35.47
## 3rd Qu.:80.96 3rd Qu.: 52.18
## Max. :87.58 Max. :390.60
## NA's :6 NA's :9
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## Min. :11.72 Min. :1.357 Min. :22.84
## 1st Qu.:15.24 1st Qu.:2.157 1st Qu.:25.84
## Median :16.85 Median :2.457 Median :27.23
## Mean :16.55 Mean :2.516 Mean :27.01
## 3rd Qu.:17.86 3rd Qu.:2.800 3rd Qu.:28.19
## Max. :19.44 Max. :4.429 Max. :30.07
## NA's :6 NA's :6 NA's :6
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## Min. :4.529 Min. :26.70 Min. :17.8
## 1st Qu.:6.200 1st Qu.:30.60 1st Qu.:21.7
## Median :6.757 Median :31.70 Median :22.8
## Mean :6.757 Mean :31.61 Mean :22.6
## 3rd Qu.:7.286 3rd Qu.:32.80 3rd Qu.:23.9
## Max. :9.914 Max. :35.60 Max. :25.6
## NA's :6 NA's :6 NA's :6
## station_precip_mm
## Min. : 0.000
## 1st Qu.: 6.825
## Median : 17.750
## Mean : 26.785
## 3rd Qu.: 35.450
## Max. :305.900
## NA's :6
iq.data <- dengue_merge_data.train[937: 1456,]
summary(iq.data)
## total_cases city year weekofyear
## Min. : 0.000 Length:520 Min. :2000 Min. : 1.00
## 1st Qu.: 1.000 Class :character 1st Qu.:2003 1st Qu.:13.75
## Median : 5.000 Mode :character Median :2005 Median :26.50
## Mean : 7.565 Mean :2005 Mean :26.50
## 3rd Qu.: 9.000 3rd Qu.:2007 3rd Qu.:39.25
## Max. :116.000 Max. :2010 Max. :53.00
##
## week_start_date ndvi_ne ndvi_nw ndvi_se
## Min. :2000-07-01 Min. :0.06173 Min. :0.03586 Min. :0.02988
## 1st Qu.:2002-12-30 1st Qu.:0.20000 1st Qu.:0.17954 1st Qu.:0.19474
## Median :2005-06-28 Median :0.26364 Median :0.23297 Median :0.24980
## Mean :2005-06-28 Mean :0.26387 Mean :0.23878 Mean :0.25013
## 3rd Qu.:2007-12-26 3rd Qu.:0.31997 3rd Qu.:0.29393 3rd Qu.:0.30230
## Max. :2010-06-25 Max. :0.50836 Max. :0.45443 Max. :0.53831
## NA's :3 NA's :3 NA's :3
## ndvi_sw precipitation_amt_mm reanalysis_air_temp_k
## Min. :0.06418 Min. : 0.00 Min. :294.6
## 1st Qu.:0.20413 1st Qu.: 39.10 1st Qu.:297.1
## Median :0.26214 Median : 60.47 Median :297.8
## Mean :0.26678 Mean : 64.25 Mean :297.9
## 3rd Qu.:0.32515 3rd Qu.: 85.76 3rd Qu.:298.6
## Max. :0.54602 Max. :210.83 Max. :301.6
## NA's :3 NA's :4 NA's :4
## reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## Min. :294.9 Min. :290.1 Min. :300.0
## 1st Qu.:298.2 1st Qu.:294.6 1st Qu.:305.2
## Median :299.1 Median :295.9 Median :307.1
## Mean :299.1 Mean :295.5 Mean :307.1
## 3rd Qu.:300.1 3rd Qu.:296.5 3rd Qu.:308.7
## Max. :302.9 Max. :298.4 Max. :314.0
## NA's :4 NA's :4 NA's :4
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## Min. :286.9 Min. : 0.00
## 1st Qu.:292.0 1st Qu.: 24.07
## Median :293.1 Median : 46.44
## Mean :292.9 Mean : 57.61
## 3rd Qu.:294.2 3rd Qu.: 71.07
## Max. :296.0 Max. :362.03
## NA's :4 NA's :4
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## Min. :57.79 Min. : 0.00
## 1st Qu.:84.30 1st Qu.: 39.10
## Median :90.92 Median : 60.47
## Mean :88.64 Mean : 64.25
## 3rd Qu.:94.56 3rd Qu.: 85.76
## Max. :98.61 Max. :210.83
## NA's :4 NA's :4
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## Min. :12.11 Min. : 3.714 Min. :21.40
## 1st Qu.:16.10 1st Qu.: 7.371 1st Qu.:27.00
## Median :17.43 Median : 8.964 Median :27.60
## Mean :17.10 Mean : 9.207 Mean :27.53
## 3rd Qu.:18.18 3rd Qu.:11.014 3rd Qu.:28.10
## Max. :20.46 Max. :16.029 Max. :30.80
## NA's :4 NA's :4 NA's :37
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## Min. : 5.20 Min. :30.1 Min. :14.7
## 1st Qu.: 9.50 1st Qu.:33.2 1st Qu.:20.6
## Median :10.62 Median :34.0 Median :21.3
## Mean :10.57 Mean :34.0 Mean :21.2
## 3rd Qu.:11.65 3rd Qu.:34.9 3rd Qu.:22.0
## Max. :15.80 Max. :42.2 Max. :24.2
## NA's :37 NA's :14 NA's :8
## station_precip_mm
## Min. : 0.00
## 1st Qu.: 17.20
## Median : 45.30
## Mean : 62.47
## 3rd Qu.: 85.95
## Max. :543.30
## NA's :16
more boxplots
boxplot(sj.data$total_cases ~ sj.data$weekofyear)
boxplot(sj.data$total_cases ~ sj.data$weekofyear, outline = FALSE)
boxplot(iq.data$total_cases ~ iq.data$weekofyear)
boxplot(iq.data$total_cases ~ iq.data$weekofyear, outline = FALSE)
Interesting how the trend changes by each week. Maybe in our analysis we will need some sort of dummy variable to distinguish between the two cities.
We should check how the total cases are distributed
hist(sj.data$total_cases)
hist(iq.data$total_cases)
That might need transforming later to make it more normal
Finally we can plot some stuff to get an idea for
I think we can start with an ETS/ARIMA model. Additionally, I think it would be wise to use the two separate data frames for sj and iq data to model since they follow clearly different patterns. If we were doing linear modeling, we could just create a dummy variable.
ts.sj.data <- ts(sj.data$total_cases, frequency = 52, start = c(1990, 18))
autoplot(ts.sj.data)
Maybe a neural net would be good here? It seems like there is some seasonality but peaks are not uniform. Lets check the iq time series data next
ts.iq.data <- ts(iq.data$total_cases, frequency = 52, start = c(2000, 26))
autoplot(ts.iq.data)
Checking if the decomposition command also shows seasonality
decompose(ts.sj.data, type = "additive") %>% autoplot()
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Removed 104 row(s) containing missing values (geom_path).
decompose(ts.iq.data, type = "additive") %>% autoplot()
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Removed 104 row(s) containing missing values (geom_path).
While the remainder is not white noise, there is still a seasonal and trend component to our data.
Same here for the iq data, there are some significant peaks and nothing seems uniform. I would like to see if our data is stationary since it actually looks to be that way.
ggtsdisplay(ts.sj.data)
pacf(ts.sj.data)
ggtsdisplay(ts.iq.data)
pacf(ts.iq.data)
Not stationary at all. Very periodical? So maybe an ARIMA model will still be good here!
ndiffs(ts.sj.data)
## [1] 1
nsdiffs(ts.sj.data)
## [1] 0
ts.sj.data %>% diff(1) %>% diff(1) %>% ggtsdisplay()
Differencing twice seemed to help it a bit more but there are still still plenty of significant lags in both the ACF and PACF plots.
ndiffs(ts.iq.data)
## [1] 1
nsdiffs(ts.iq.data)
## [1] 0
ts.iq.data %>% diff(1) %>% ggtsdisplay()
IQ data looks good after one difference but SJ is still a bit concerning. However, it is an improvement from the original ACF and PACF plots.
Lets just try ARIMA models and see what happens
aa.sj <- auto.arima(ts.sj.data)
summary(aa.sj)
## Series: ts.sj.data
## ARIMA(1,1,1)
##
## Coefficients:
## ar1 ma1
## 0.7116 -0.5929
## s.e. 0.0948 0.1078
##
## sigma^2 estimated as 180.9: log likelihood=-3755.85
## AIC=7517.71 AICc=7517.73 BIC=7532.23
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.001467535 13.42959 8.047587 NaN Inf 0.2202839 0.001092614
checkresiduals(aa.sj)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)
## Q* = 95.049, df = 102, p-value = 0.6741
##
## Model df: 2. Total lags used: 104
So the book tells us that the ARIMA model could follow an ARIMA(p,d,0) if ACF is exp decay or sinusoidal (wave like) and PACF has no significant spikes past lag p. Our ACF and PACF plots fit both criteria for that on the SJ data. PACF has significant spikes up to 3, 6 and 9. Ill try all 3 and see how they look
sj.310 <- arima(ts.sj.data, order = c(3,1,0))
sj.610 <- arima(ts.sj.data, order = c(6,1,0))
sj.910 <- arima(ts.sj.data, order = c(9,1,0))
summary(sj.310)
##
## Call:
## arima(x = ts.sj.data, order = c(3, 1, 0))
##
## Coefficients:
## ar1 ar2 ar3
## 0.1244 0.0878 -0.0043
## s.e. 0.0327 0.0328 0.0327
##
## sigma^2 estimated as 180.9: log likelihood = -3756.7, aic = 7521.4
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.001461025 13.44178 8.094611 NaN Inf 1.013317 0.0003701168
checkresiduals(sj.310)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,0)
## Q* = 101.61, df = 101, p-value = 0.4642
##
## Model df: 3. Total lags used: 104
summary(sj.610)
##
## Call:
## arima(x = ts.sj.data, order = c(6, 1, 0))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6
## 0.1197 0.0832 -0.0182 0.0517 0.0766 0.0024
## s.e. 0.0327 0.0328 0.0329 0.0328 0.0328 0.0326
##
## sigma^2 estimated as 179.1: log likelihood = -3752.13, aic = 7518.26
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.001486091 13.37593 8.050992 NaN Inf 1.007856 0.0001348531
checkresiduals(sj.610)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(6,1,0)
## Q* = 88.145, df = 98, p-value = 0.7521
##
## Model df: 6. Total lags used: 104
summary(sj.910)
##
## Call:
## arima(x = ts.sj.data, order = c(9, 1, 0))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ar8
## 0.1280 0.0819 -0.0239 0.0568 0.0835 -0.0058 -0.0023 0.0843
## s.e. 0.0325 0.0326 0.0327 0.0327 0.0326 0.0327 0.0326 0.0325
## ar9
## -0.1117
## s.e. 0.0324
##
## sigma^2 estimated as 175.9: log likelihood = -3743.83, aic = 7507.65
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.001885691 13.25658 8.062695 NaN Inf 1.009321 -0.004295591
checkresiduals(sj.910)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(9,1,0)
## Q* = 77.812, df = 95, p-value = 0.9001
##
## Model df: 9. Total lags used: 104
All 3 hand selected parameter models looked roughly the same compared with the auto.arima() model of order (1,1,1). However, the ARIMA(9,1,0) model had a slightly better AIC value than the others. Additionally, the resulting residual ACF plot had less significant lags. Based on that, I will use the ARIMA(9,1,0) model.
Lets move onto the IQ data.
aa.iq <- auto.arima(ts.iq.data)
summary(aa.iq)
## Series: ts.iq.data
## ARIMA(0,1,2)(0,0,1)[52]
##
## Coefficients:
## ma1 ma2 sma1
## -0.2508 -0.2322 0.0568
## s.e. 0.0430 0.0449 0.0375
##
## sigma^2 estimated as 52.12: log likelihood=-1761.09
## AIC=3530.18 AICc=3530.26 BIC=3547.19
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.01292348 7.191928 3.841351 NaN Inf 0.4069155 -0.002950598
checkresiduals(aa.iq)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2)(0,0,1)[52]
## Q* = 93.91, df = 101, p-value = 0.6786
##
## Model df: 3. Total lags used: 104
Interestingly enough, the auto.arima() function chose a seasonal version of the model even though our nsdiffs() value for the IQ data was 0. Once again, our iq model seems to meet the criteria necessary for an ARIMA(0,d,q) with an sinusoidal ACF plot and significant lags at 3 and 5 in the PACF plot and none afterwards. (ACF and PACF plots above)
I will leave the seasonal component the same as the auto.arima() as I am not confident on how to address those parameters but will try changing the non seasonal components of our iq model.
iq.013 <- arima(ts.iq.data, order = c(0,1,3), seasonal = list(order = c(0,0,1)))
iq.015 <- arima(ts.iq.data, order = c(0,1,5), seasonal = list(order = c(0,0,1)))
summary(iq.013)
##
## Call:
## arima(x = ts.iq.data, order = c(0, 1, 3), seasonal = list(order = c(0, 0, 1)))
##
## Coefficients:
## ma1 ma2 ma3 sma1
## -0.2539 -0.2307 0.0153 0.0572
## s.e. 0.0441 0.0450 0.0527 0.0375
##
## sigma^2 estimated as 51.82: log likelihood = -1761.05, aic = 3532.1
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.01229733 7.191378 3.83601 NaN Inf 0.98267 -0.0001826112
checkresiduals(iq.013)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,3)(0,0,1)[52]
## Q* = 92.949, df = 100, p-value = 0.6784
##
## Model df: 4. Total lags used: 104
summary(iq.015)
##
## Call:
## arima(x = ts.iq.data, order = c(0, 1, 5), seasonal = list(order = c(0, 0, 1)))
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 sma1
## -0.2948 -0.2269 0.0208 0.0390 -0.1683 0.0544
## s.e. 0.0452 0.0459 0.0528 0.0498 0.0495 0.0375
##
## sigma^2 estimated as 50.62: log likelihood = -1755.08, aic = 3524.16
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.02222482 7.107777 3.814002 NaN Inf 0.9770322 0.02084161
checkresiduals(iq.015)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,5)(0,0,1)[52]
## Q* = 84.997, df = 98, p-value = 0.8226
##
## Model df: 6. Total lags used: 104
Once again, the comparison parameters loos more or less the same on all 3 models. Nearly similar AIC values for all of them but the only exception is that the ARIMA(0,1,5)(0,0,1) model only had 3 significant lags in the residual ACF plot whereas the other two had 4. Therefore, I would use the ARIMA(0,1,5)(0,0,1) model.
I guess now we can try to do some forecasting with these models and see how they look?
fc.arima.sj <- forecast(sj.910)
autoplot(fc.arima.sj)
## Warning: `filter_()` is deprecated as of dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
It looks like a naive model. I guess it works but i dont think its particularly useful
Moving onto the IQ forecast
fc.arima.iq <- forecast(iq.015)
autoplot(fc.arima.iq)
This forecast seems to have some attempt at seasonality but ultimately it still looks like a naive forecast.
Lets try some linear modeling. I will first make a copy of the separate data sets so that way I dont mess anything up
sj.copy <- sj.data
summary(sj.copy)
## total_cases city year weekofyear
## Min. : 0.00 Length:936 Min. :1990 Min. : 1.00
## 1st Qu.: 9.00 Class :character 1st Qu.:1994 1st Qu.:13.75
## Median : 19.00 Mode :character Median :1999 Median :26.50
## Mean : 34.18 Mean :1999 Mean :26.50
## 3rd Qu.: 37.00 3rd Qu.:2003 3rd Qu.:39.25
## Max. :461.00 Max. :2008 Max. :53.00
##
## week_start_date ndvi_ne ndvi_nw ndvi_se
## Min. :1990-04-30 Min. :-0.40625 Min. :-0.45610 Min. :-0.01553
## 1st Qu.:1994-10-27 1st Qu.: 0.00450 1st Qu.: 0.01642 1st Qu.: 0.13928
## Median :1999-04-26 Median : 0.05770 Median : 0.06807 Median : 0.17719
## Mean :1999-04-26 Mean : 0.05792 Mean : 0.06747 Mean : 0.17766
## 3rd Qu.:2003-10-23 3rd Qu.: 0.11110 3rd Qu.: 0.11520 3rd Qu.: 0.21256
## Max. :2008-04-22 Max. : 0.49340 Max. : 0.43710 Max. : 0.39313
## NA's :191 NA's :49 NA's :19
## ndvi_sw precipitation_amt_mm reanalysis_air_temp_k
## Min. :-0.06346 Min. : 0.00 Min. :295.9
## 1st Qu.: 0.12916 1st Qu.: 0.00 1st Qu.:298.2
## Median : 0.16597 Median : 20.80 Median :299.3
## Mean : 0.16596 Mean : 35.47 Mean :299.2
## 3rd Qu.: 0.20277 3rd Qu.: 52.18 3rd Qu.:300.1
## Max. : 0.38142 Max. :390.60 Max. :302.2
## NA's :19 NA's :9 NA's :6
## reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## Min. :296.1 Min. :289.6 Min. :297.8
## 1st Qu.:298.3 1st Qu.:293.8 1st Qu.:300.4
## Median :299.4 Median :295.5 Median :301.5
## Mean :299.3 Mean :295.1 Mean :301.4
## 3rd Qu.:300.2 3rd Qu.:296.4 3rd Qu.:302.4
## Max. :302.2 Max. :297.8 Max. :304.3
## NA's :6 NA's :6 NA's :6
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## Min. :292.6 Min. : 0.00
## 1st Qu.:296.3 1st Qu.: 10.82
## Median :297.5 Median : 21.30
## Mean :297.3 Mean : 30.47
## 3rd Qu.:298.4 3rd Qu.: 37.00
## Max. :299.9 Max. :570.50
## NA's :6 NA's :6
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## Min. :66.74 Min. : 0.00
## 1st Qu.:76.25 1st Qu.: 0.00
## Median :78.67 Median : 20.80
## Mean :78.57 Mean : 35.47
## 3rd Qu.:80.96 3rd Qu.: 52.18
## Max. :87.58 Max. :390.60
## NA's :6 NA's :9
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## Min. :11.72 Min. :1.357 Min. :22.84
## 1st Qu.:15.24 1st Qu.:2.157 1st Qu.:25.84
## Median :16.85 Median :2.457 Median :27.23
## Mean :16.55 Mean :2.516 Mean :27.01
## 3rd Qu.:17.86 3rd Qu.:2.800 3rd Qu.:28.19
## Max. :19.44 Max. :4.429 Max. :30.07
## NA's :6 NA's :6 NA's :6
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## Min. :4.529 Min. :26.70 Min. :17.8
## 1st Qu.:6.200 1st Qu.:30.60 1st Qu.:21.7
## Median :6.757 Median :31.70 Median :22.8
## Mean :6.757 Mean :31.61 Mean :22.6
## 3rd Qu.:7.286 3rd Qu.:32.80 3rd Qu.:23.9
## Max. :9.914 Max. :35.60 Max. :25.6
## NA's :6 NA's :6 NA's :6
## station_precip_mm
## Min. : 0.000
## 1st Qu.: 6.825
## Median : 17.750
## Mean : 26.785
## 3rd Qu.: 35.450
## Max. :305.900
## NA's :6
iq.copy <- iq.data
summary(iq.copy)
## total_cases city year weekofyear
## Min. : 0.000 Length:520 Min. :2000 Min. : 1.00
## 1st Qu.: 1.000 Class :character 1st Qu.:2003 1st Qu.:13.75
## Median : 5.000 Mode :character Median :2005 Median :26.50
## Mean : 7.565 Mean :2005 Mean :26.50
## 3rd Qu.: 9.000 3rd Qu.:2007 3rd Qu.:39.25
## Max. :116.000 Max. :2010 Max. :53.00
##
## week_start_date ndvi_ne ndvi_nw ndvi_se
## Min. :2000-07-01 Min. :0.06173 Min. :0.03586 Min. :0.02988
## 1st Qu.:2002-12-30 1st Qu.:0.20000 1st Qu.:0.17954 1st Qu.:0.19474
## Median :2005-06-28 Median :0.26364 Median :0.23297 Median :0.24980
## Mean :2005-06-28 Mean :0.26387 Mean :0.23878 Mean :0.25013
## 3rd Qu.:2007-12-26 3rd Qu.:0.31997 3rd Qu.:0.29393 3rd Qu.:0.30230
## Max. :2010-06-25 Max. :0.50836 Max. :0.45443 Max. :0.53831
## NA's :3 NA's :3 NA's :3
## ndvi_sw precipitation_amt_mm reanalysis_air_temp_k
## Min. :0.06418 Min. : 0.00 Min. :294.6
## 1st Qu.:0.20413 1st Qu.: 39.10 1st Qu.:297.1
## Median :0.26214 Median : 60.47 Median :297.8
## Mean :0.26678 Mean : 64.25 Mean :297.9
## 3rd Qu.:0.32515 3rd Qu.: 85.76 3rd Qu.:298.6
## Max. :0.54602 Max. :210.83 Max. :301.6
## NA's :3 NA's :4 NA's :4
## reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## Min. :294.9 Min. :290.1 Min. :300.0
## 1st Qu.:298.2 1st Qu.:294.6 1st Qu.:305.2
## Median :299.1 Median :295.9 Median :307.1
## Mean :299.1 Mean :295.5 Mean :307.1
## 3rd Qu.:300.1 3rd Qu.:296.5 3rd Qu.:308.7
## Max. :302.9 Max. :298.4 Max. :314.0
## NA's :4 NA's :4 NA's :4
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## Min. :286.9 Min. : 0.00
## 1st Qu.:292.0 1st Qu.: 24.07
## Median :293.1 Median : 46.44
## Mean :292.9 Mean : 57.61
## 3rd Qu.:294.2 3rd Qu.: 71.07
## Max. :296.0 Max. :362.03
## NA's :4 NA's :4
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## Min. :57.79 Min. : 0.00
## 1st Qu.:84.30 1st Qu.: 39.10
## Median :90.92 Median : 60.47
## Mean :88.64 Mean : 64.25
## 3rd Qu.:94.56 3rd Qu.: 85.76
## Max. :98.61 Max. :210.83
## NA's :4 NA's :4
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## Min. :12.11 Min. : 3.714 Min. :21.40
## 1st Qu.:16.10 1st Qu.: 7.371 1st Qu.:27.00
## Median :17.43 Median : 8.964 Median :27.60
## Mean :17.10 Mean : 9.207 Mean :27.53
## 3rd Qu.:18.18 3rd Qu.:11.014 3rd Qu.:28.10
## Max. :20.46 Max. :16.029 Max. :30.80
## NA's :4 NA's :4 NA's :37
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## Min. : 5.20 Min. :30.1 Min. :14.7
## 1st Qu.: 9.50 1st Qu.:33.2 1st Qu.:20.6
## Median :10.62 Median :34.0 Median :21.3
## Mean :10.57 Mean :34.0 Mean :21.2
## 3rd Qu.:11.65 3rd Qu.:34.9 3rd Qu.:22.0
## Max. :15.80 Max. :42.2 Max. :24.2
## NA's :37 NA's :14 NA's :8
## station_precip_mm
## Min. : 0.00
## 1st Qu.: 17.20
## Median : 45.30
## Mean : 62.47
## 3rd Qu.: 85.95
## Max. :543.30
## NA's :16
I think that maybe we could impute the median for all of these missing values?
for (i in 1:ncol(sj.copy)){
for (j in 1:nrow(sj.copy)){
if (is.na(sj.copy[j,i] == TRUE)){
sj.copy[j,i] <- median(sj.copy[,i], na.rm = TRUE)
}
}
}
summary(sj.copy)
## total_cases city year weekofyear
## Min. : 0.00 Length:936 Min. :1990 Min. : 1.00
## 1st Qu.: 9.00 Class :character 1st Qu.:1994 1st Qu.:13.75
## Median : 19.00 Mode :character Median :1999 Median :26.50
## Mean : 34.18 Mean :1999 Mean :26.50
## 3rd Qu.: 37.00 3rd Qu.:2003 3rd Qu.:39.25
## Max. :461.00 Max. :2008 Max. :53.00
## week_start_date ndvi_ne ndvi_nw ndvi_se
## Min. :1990-04-30 Min. :-0.40625 Min. :-0.45610 Min. :-0.01553
## 1st Qu.:1994-10-27 1st Qu.: 0.02123 1st Qu.: 0.02047 1st Qu.: 0.14012
## Median :1999-04-26 Median : 0.05770 Median : 0.06807 Median : 0.17719
## Mean :1999-04-26 Mean : 0.05788 Mean : 0.06750 Mean : 0.17765
## 3rd Qu.:2003-10-23 3rd Qu.: 0.09885 3rd Qu.: 0.11270 3rd Qu.: 0.21160
## Max. :2008-04-22 Max. : 0.49340 Max. : 0.43710 Max. : 0.39313
## ndvi_sw precipitation_amt_mm reanalysis_air_temp_k
## Min. :-0.06346 Min. : 0.00 Min. :295.9
## 1st Qu.: 0.12993 1st Qu.: 0.00 1st Qu.:298.2
## Median : 0.16597 Median : 20.80 Median :299.3
## Mean : 0.16596 Mean : 35.33 Mean :299.2
## 3rd Qu.: 0.20226 3rd Qu.: 51.66 3rd Qu.:300.1
## Max. : 0.38142 Max. :390.60 Max. :302.2
## reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## Min. :296.1 Min. :289.6 Min. :297.8
## 1st Qu.:298.3 1st Qu.:293.9 1st Qu.:300.4
## Median :299.4 Median :295.5 Median :301.5
## Mean :299.3 Mean :295.1 Mean :301.4
## 3rd Qu.:300.2 3rd Qu.:296.4 3rd Qu.:302.4
## Max. :302.2 Max. :297.8 Max. :304.3
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## Min. :292.6 Min. : 0.00
## 1st Qu.:296.3 1st Qu.: 10.90
## Median :297.5 Median : 21.30
## Mean :297.3 Mean : 30.41
## 3rd Qu.:298.4 3rd Qu.: 36.92
## Max. :299.9 Max. :570.50
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## Min. :66.74 Min. : 0.00
## 1st Qu.:76.26 1st Qu.: 0.00
## Median :78.67 Median : 20.80
## Mean :78.57 Mean : 35.33
## 3rd Qu.:80.95 3rd Qu.: 51.66
## Max. :87.58 Max. :390.60
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## Min. :11.72 Min. :1.357 Min. :22.84
## 1st Qu.:15.25 1st Qu.:2.157 1st Qu.:25.84
## Median :16.85 Median :2.457 Median :27.23
## Mean :16.55 Mean :2.516 Mean :27.01
## 3rd Qu.:17.85 3rd Qu.:2.789 3rd Qu.:28.18
## Max. :19.44 Max. :4.429 Max. :30.07
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## Min. :4.529 Min. :26.70 Min. :17.8
## 1st Qu.:6.211 1st Qu.:30.60 1st Qu.:21.7
## Median :6.757 Median :31.70 Median :22.8
## Mean :6.757 Mean :31.61 Mean :22.6
## 3rd Qu.:7.286 3rd Qu.:32.80 3rd Qu.:23.9
## Max. :9.914 Max. :35.60 Max. :25.6
## station_precip_mm
## Min. : 0.00
## 1st Qu.: 6.90
## Median : 17.75
## Mean : 26.73
## 3rd Qu.: 35.30
## Max. :305.90
for (i in 1:ncol(iq.copy)){
for (j in 1:nrow(iq.copy)){
if (is.na(iq.copy[j,i] == TRUE)){
iq.copy[j,i] <- median(iq.copy[,i], na.rm = TRUE)
}
}
}
summary(iq.copy)
## total_cases city year weekofyear
## Min. : 0.000 Length:520 Min. :2000 Min. : 1.00
## 1st Qu.: 1.000 Class :character 1st Qu.:2003 1st Qu.:13.75
## Median : 5.000 Mode :character Median :2005 Median :26.50
## Mean : 7.565 Mean :2005 Mean :26.50
## 3rd Qu.: 9.000 3rd Qu.:2007 3rd Qu.:39.25
## Max. :116.000 Max. :2010 Max. :53.00
## week_start_date ndvi_ne ndvi_nw ndvi_se
## Min. :2000-07-01 Min. :0.06173 Min. :0.03586 Min. :0.02988
## 1st Qu.:2002-12-30 1st Qu.:0.20015 1st Qu.:0.18026 1st Qu.:0.19475
## Median :2005-06-28 Median :0.26364 Median :0.23297 Median :0.24980
## Mean :2005-06-28 Mean :0.26387 Mean :0.23875 Mean :0.25012
## 3rd Qu.:2007-12-26 3rd Qu.:0.31962 3rd Qu.:0.29383 3rd Qu.:0.30213
## Max. :2010-06-25 Max. :0.50836 Max. :0.45443 Max. :0.53831
## ndvi_sw precipitation_amt_mm reanalysis_air_temp_k
## Min. :0.06418 Min. : 0.00 Min. :294.6
## 1st Qu.:0.20430 1st Qu.: 39.15 1st Qu.:297.1
## Median :0.26214 Median : 60.47 Median :297.8
## Mean :0.26675 Mean : 64.22 Mean :297.9
## 3rd Qu.:0.32488 3rd Qu.: 85.64 3rd Qu.:298.6
## Max. :0.54602 Max. :210.83 Max. :301.6
## reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## Min. :294.9 Min. :290.1 Min. :300.0
## 1st Qu.:298.2 1st Qu.:294.6 1st Qu.:305.2
## Median :299.1 Median :295.9 Median :307.1
## Mean :299.1 Mean :295.5 Mean :307.1
## 3rd Qu.:300.1 3rd Qu.:296.5 3rd Qu.:308.7
## Max. :302.9 Max. :298.4 Max. :314.0
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## Min. :286.9 Min. : 0.00
## 1st Qu.:292.0 1st Qu.: 24.21
## Median :293.1 Median : 46.44
## Mean :292.9 Mean : 57.52
## 3rd Qu.:294.1 3rd Qu.: 70.43
## Max. :296.0 Max. :362.03
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## Min. :57.79 Min. : 0.00
## 1st Qu.:84.39 1st Qu.: 39.15
## Median :90.92 Median : 60.47
## Mean :88.66 Mean : 64.22
## 3rd Qu.:94.55 3rd Qu.: 85.64
## Max. :98.61 Max. :210.83
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## Min. :12.11 Min. : 3.714 Min. :21.40
## 1st Qu.:16.12 1st Qu.: 7.371 1st Qu.:27.05
## Median :17.43 Median : 8.964 Median :27.60
## Mean :17.10 Mean : 9.205 Mean :27.54
## 3rd Qu.:18.18 3rd Qu.:11.004 3rd Qu.:28.04
## Max. :20.46 Max. :16.029 Max. :30.80
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## Min. : 5.20 Min. :30.1 Min. :14.7
## 1st Qu.: 9.55 1st Qu.:33.2 1st Qu.:20.6
## Median :10.62 Median :34.0 Median :21.3
## Mean :10.57 Mean :34.0 Mean :21.2
## 3rd Qu.:11.60 3rd Qu.:34.8 3rd Qu.:22.0
## Max. :15.80 Max. :42.2 Max. :24.2
## station_precip_mm
## Min. : 0.00
## 1st Qu.: 18.00
## Median : 45.30
## Mean : 61.94
## 3rd Qu.: 83.35
## Max. :543.30
This is probably the simplest imputation we can do but lets just start here
sj.lm.1 <- lm(total_cases ~ ndvi_ne + ndvi_nw + ndvi_se + ndvi_sw, data = sj.copy)
summary(sj.lm.1)
##
## Call:
## lm(formula = total_cases ~ ndvi_ne + ndvi_nw + ndvi_se + ndvi_sw,
## data = sj.copy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.72 -23.89 -15.20 2.26 426.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.9364 5.6764 5.978 3.21e-09 ***
## ndvi_ne -5.8315 22.4915 -0.259 0.795
## ndvi_nw 33.5166 23.9061 1.402 0.161
## ndvi_se -0.4145 52.6665 -0.008 0.994
## ndvi_sw -9.6840 53.5738 -0.181 0.857
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.42 on 931 degrees of freedom
## Multiple R-squared: 0.002673, Adjusted R-squared: -0.001612
## F-statistic: 0.6238 on 4 and 931 DF, p-value: 0.6456
sj.lm.2 <- lm(total_cases ~ precipitation_amt_mm + reanalysis_air_temp_k +reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + reanalysis_min_air_temp_k + reanalysis_precip_amt_kg_per_m2 + reanalysis_relative_humidity_percent + reanalysis_sat_precip_amt_mm + reanalysis_specific_humidity_g_per_kg+ reanalysis_tdtr_k+ station_avg_temp_c +station_diur_temp_rng_c, data = sj.copy)
summary(sj.lm.2)
##
## Call:
## lm(formula = total_cases ~ precipitation_amt_mm + reanalysis_air_temp_k +
## reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k +
## reanalysis_min_air_temp_k + reanalysis_precip_amt_kg_per_m2 +
## reanalysis_relative_humidity_percent + reanalysis_sat_precip_amt_mm +
## reanalysis_specific_humidity_g_per_kg + reanalysis_tdtr_k +
## station_avg_temp_c + station_diur_temp_rng_c, data = sj.copy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.35 -23.60 -9.99 6.02 395.20
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7531.97796 6461.34651 1.166 0.244038
## precipitation_amt_mm -0.07123 0.04695 -1.517 0.129631
## reanalysis_air_temp_k 34.43433 63.31903 0.544 0.586695
## reanalysis_avg_temp_k -44.77643 21.70573 -2.063 0.039403
## reanalysis_dew_point_temp_k -39.68892 58.34138 -0.680 0.496493
## reanalysis_max_air_temp_k 18.73630 4.94775 3.787 0.000162
## reanalysis_min_air_temp_k 3.33871 5.35669 0.623 0.533256
## reanalysis_precip_amt_kg_per_m2 0.06011 0.06438 0.934 0.350705
## reanalysis_relative_humidity_percent 0.84440 12.75105 0.066 0.947216
## reanalysis_sat_precip_amt_mm NA NA NA NA
## reanalysis_specific_humidity_g_per_kg 36.85390 22.14988 1.664 0.096483
## reanalysis_tdtr_k -17.89419 5.70138 -3.139 0.001752
## station_avg_temp_c 0.35655 3.00517 0.119 0.905581
## station_diur_temp_rng_c 5.01060 2.33911 2.142 0.032446
##
## (Intercept)
## precipitation_amt_mm
## reanalysis_air_temp_k
## reanalysis_avg_temp_k *
## reanalysis_dew_point_temp_k
## reanalysis_max_air_temp_k ***
## reanalysis_min_air_temp_k
## reanalysis_precip_amt_kg_per_m2
## reanalysis_relative_humidity_percent
## reanalysis_sat_precip_amt_mm
## reanalysis_specific_humidity_g_per_kg .
## reanalysis_tdtr_k **
## station_avg_temp_c
## station_diur_temp_rng_c *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.59 on 923 degrees of freedom
## Multiple R-squared: 0.08038, Adjusted R-squared: 0.06842
## F-statistic: 6.723 on 12 and 923 DF, p-value: 1.102e-11
sj.lm.3 <- lm(total_cases ~ station_max_temp_c + station_min_temp_c + station_precip_mm, data = sj.copy)
summary(sj.lm.3)
##
## Call:
## lm(formula = total_cases ~ station_max_temp_c + station_min_temp_c +
## station_precip_mm, data = sj.copy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.18 -25.12 -11.78 6.53 408.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -158.95702 31.02149 -5.124 3.63e-07 ***
## station_max_temp_c 3.99980 1.30404 3.067 0.00222 **
## station_min_temp_c 2.86117 1.49202 1.918 0.05546 .
## station_precip_mm 0.07639 0.05667 1.348 0.17798
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.36 on 932 degrees of freedom
## Multiple R-squared: 0.04232, Adjusted R-squared: 0.03924
## F-statistic: 13.73 on 3 and 932 DF, p-value: 9.105e-09
No apparent trend and using season would be bad in a tslm() since my frequency is 52. But I will create factor variables for months since it seemed like there was seasonality present in the boxplots
sj.copy$month <- as.factor(month(sj.copy$week_start_date))
summary(sj.copy$month)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 90 72 72 90 72 72 90 72 76 86 72 72
sj.lm.4 <- lm(total_cases ~ month, data = sj.copy)
summary(sj.lm.4)
##
## Call:
## lm(formula = total_cases ~ month, data = sj.copy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.00 -19.36 -7.10 4.59 390.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.422 5.007 6.275 5.36e-10 ***
## month2 -10.325 7.511 -1.375 0.16957
## month3 -16.825 7.511 -2.240 0.02532 *
## month4 -21.400 7.081 -3.022 0.00258 **
## month5 -20.283 7.511 -2.701 0.00705 **
## month6 -15.103 7.511 -2.011 0.04464 *
## month7 -4.067 7.081 -0.574 0.56592
## month8 16.619 7.511 2.213 0.02716 *
## month9 24.328 7.400 3.287 0.00105 **
## month10 39.578 7.163 5.525 4.28e-08 ***
## month11 30.397 7.511 4.047 5.62e-05 ***
## month12 10.258 7.511 1.366 0.17234
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.5 on 924 degrees of freedom
## Multiple R-squared: 0.1553, Adjusted R-squared: 0.1453
## F-statistic: 15.44 on 11 and 924 DF, p-value: < 2.2e-16
Not surprisingly, the months variable have the most explanatory power for the total cases. Maybe we can try an autoregressive term (like from week 1’s discussion).
Following what was written there
sj.copy$seq <- seq(1,936)
sj.copy$total_cases_t1 <- c(sj.copy$total_cases[1], sj.copy$total_cases[1:935])
sj.lm.ar <- lm(total_cases ~ total_cases_t1 + seq + month, data = sj.copy[1:800,])
summary(sj.lm.ar)
##
## Call:
## lm(formula = total_cases ~ total_cases_t1 + seq + month, data = sj.copy[1:800,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.274 -5.554 -0.454 4.871 98.091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5297220 1.8496088 0.286 0.7746
## total_cases_t1 0.9630156 0.0098321 97.946 <2e-16 ***
## seq -0.0009883 0.0021186 -0.467 0.6410
## month2 -1.5225277 2.3458012 -0.649 0.5165
## month3 -1.1132167 2.3504160 -0.474 0.6359
## month4 -0.3087953 2.2152379 -0.139 0.8892
## month5 0.7440402 2.3166753 0.321 0.7482
## month6 2.2634187 2.3126499 0.979 0.3280
## month7 4.7167953 2.1781654 2.165 0.0307 *
## month8 3.7758713 2.3045609 1.638 0.1017
## month9 4.8694858 2.2880341 2.128 0.0336 *
## month10 5.7773149 2.2643995 2.551 0.0109 *
## month11 1.0882387 2.3659468 0.460 0.6457
## month12 -5.8403595 2.3504852 -2.485 0.0132 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.53 on 786 degrees of freedom
## Multiple R-squared: 0.9383, Adjusted R-squared: 0.9373
## F-statistic: 920.1 on 13 and 786 DF, p-value: < 2.2e-16
By creating an autoregressive term with an offset of 1 and using that and the monthly variable we are able to explain nearly everything with the total cases.
Checking residuals on this
plot(sj.lm.ar)
The residuals plot tell us that there is heteroskedasticity present in our data. This was also apparent from the earlier boxplots as there were a few total_cases values magnitudes above any of the weekly means.
I have built the model on this whole data set but I want to try and split the sj data into a training and validation set and see where we can go from there
sj.ar.pred2 <- predict(sj.lm.ar, sj.copy[1:800,],level = 0.95, interval = "prediction")
sj.ar.pred <- predict(sj.lm.ar, sj.copy[801:936,], level = 0.95, interval = "prediction")
head(sj.ar.pred)
## fit lwr upr
## 801 130.76258 103.88540 157.63977
## 802 112.46430 85.61015 139.31845
## 803 84.48067 57.67375 111.28759
## 804 75.81254 49.00858 102.61650
## 805 46.92109 20.11953 73.72264
## 806 58.47629 31.67490 85.27767
sj.ar.pred <- as.data.frame(sj.ar.pred)
sj.ar.pred$actual <- sj.copy$total_cases[801:936]
sj.ar.pred2 <- as.data.frame(sj.ar.pred2)
sj.ar.pred2$actual <- sj.copy$total_cases[1:800]
totalpred <- rbind(sj.ar.pred2, sj.ar.pred)
plot(sj.copy$seq, sj.copy$total_cases, col = 'blue', type = 'l')
lines(sj.copy$seq, totalpred$fit, col = 'red')
lines(sj.copy$seq,totalpred$lwr, lty=3, col='dark gray')
lines(sj.copy$seq,totalpred$upr, lty=3, col='dark gray')
We can see that the fitted values track the test set values relatively well compared with the ARIMA model from before. Lets try to do the same thing for our iq data.
First we begin with the other variables
iq.lm.1 <- lm(total_cases ~ ndvi_ne + ndvi_nw + ndvi_se + ndvi_sw, data = iq.copy)
summary(iq.lm.1)
##
## Call:
## lm(formula = total_cases ~ ndvi_ne + ndvi_nw + ndvi_se + ndvi_sw,
## data = iq.copy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.142 -6.136 -2.722 1.622 107.964
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.086 1.742 4.643 4.37e-06 ***
## ndvi_ne 11.776 12.539 0.939 0.3481
## ndvi_nw -10.526 10.219 -1.030 0.3035
## ndvi_se -20.534 9.756 -2.105 0.0358 *
## ndvi_sw 15.075 10.988 1.372 0.1707
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.74 on 515 degrees of freedom
## Multiple R-squared: 0.01252, Adjusted R-squared: 0.004853
## F-statistic: 1.633 on 4 and 515 DF, p-value: 0.1646
iq.lm.2 <- lm(total_cases ~ precipitation_amt_mm + reanalysis_air_temp_k +reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + reanalysis_min_air_temp_k + reanalysis_precip_amt_kg_per_m2 + reanalysis_relative_humidity_percent + reanalysis_sat_precip_amt_mm + reanalysis_specific_humidity_g_per_kg+ reanalysis_tdtr_k+ station_avg_temp_c +station_diur_temp_rng_c, data = iq.copy)
summary(iq.lm.2)
##
## Call:
## lm(formula = total_cases ~ precipitation_amt_mm + reanalysis_air_temp_k +
## reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k +
## reanalysis_min_air_temp_k + reanalysis_precip_amt_kg_per_m2 +
## reanalysis_relative_humidity_percent + reanalysis_sat_precip_amt_mm +
## reanalysis_specific_humidity_g_per_kg + reanalysis_tdtr_k +
## station_avg_temp_c + station_diur_temp_rng_c, data = iq.copy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.501 -5.278 -2.741 1.579 105.714
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2237.31781 1700.68915 1.316 0.189
## precipitation_amt_mm -0.00425 0.01507 -0.282 0.778
## reanalysis_air_temp_k 0.04476 4.97250 0.009 0.993
## reanalysis_avg_temp_k -0.46683 2.12672 -0.220 0.826
## reanalysis_dew_point_temp_k -7.25981 6.65697 -1.091 0.276
## reanalysis_max_air_temp_k -0.39068 0.43609 -0.896 0.371
## reanalysis_min_air_temp_k 0.12312 0.64784 0.190 0.849
## reanalysis_precip_amt_kg_per_m2 -0.01303 0.01251 -1.041 0.298
## reanalysis_relative_humidity_percent -0.42937 1.06935 -0.402 0.688
## reanalysis_sat_precip_amt_mm NA NA NA NA
## reanalysis_specific_humidity_g_per_kg 10.21622 5.74591 1.778 0.076 .
## reanalysis_tdtr_k -0.41877 0.73579 -0.569 0.570
## station_avg_temp_c -0.20457 0.75968 -0.269 0.788
## station_diur_temp_rng_c -0.03732 0.44636 -0.084 0.933
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.49 on 507 degrees of freedom
## Multiple R-squared: 0.07332, Adjusted R-squared: 0.05139
## F-statistic: 3.343 on 12 and 507 DF, p-value: 0.0001058
iq.lm.3 <- lm(total_cases ~ station_max_temp_c + station_min_temp_c + station_precip_mm, data = iq.copy)
summary(iq.lm.3)
##
## Call:
## lm(formula = total_cases ~ station_max_temp_c + station_min_temp_c +
## station_precip_mm, data = iq.copy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.479 -6.004 -3.018 1.955 108.432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.684051 13.804523 -2.875 0.00421 **
## station_max_temp_c 0.409500 0.363137 1.128 0.25998
## station_min_temp_c 1.563356 0.382145 4.091 4.98e-05 ***
## station_precip_mm 0.002974 0.007706 0.386 0.69975
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.59 on 516 degrees of freedom
## Multiple R-squared: 0.03866, Adjusted R-squared: 0.03307
## F-statistic: 6.917 on 3 and 516 DF, p-value: 0.0001426
Similarly, we once again create some month factors as that seemed to be the most important variable in the sj data
iq.copy$month <- as.factor(month(iq.copy$week_start_date))
summary(iq.copy$month)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 50 40 40 50 40 40 50 40 43 47 40 40
iq.lm.4 <- lm(total_cases ~ month, data = iq.copy)
summary(iq.lm.4)
##
## Call:
## lm(formula = total_cases ~ month, data = iq.copy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.900 -5.700 -1.700 2.331 103.100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.080 1.452 8.321 8.12e-16 ***
## month2 0.495 2.178 0.227 0.820280
## month3 -4.830 2.178 -2.218 0.027002 *
## month4 -6.380 2.053 -3.107 0.001993 **
## month5 -7.230 2.178 -3.320 0.000965 ***
## month6 -8.705 2.178 -3.997 7.36e-05 ***
## month7 -9.020 2.053 -4.393 1.36e-05 ***
## month8 -8.930 2.178 -4.101 4.80e-05 ***
## month9 -5.638 2.135 -2.641 0.008528 **
## month10 -2.335 2.086 -1.120 0.263372
## month11 -2.255 2.178 -1.035 0.300933
## month12 0.820 2.178 0.377 0.706671
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.27 on 508 degrees of freedom
## Multiple R-squared: 0.1099, Adjusted R-squared: 0.09067
## F-statistic: 5.705 on 11 and 508 DF, p-value: 1.061e-08
Month seems to not be as significant when compared with the sj model but it is still the most statistically significant factor nonetheless.
Moving onto AR model
iq.copy$seq <- seq(1,520)
iq.copy$total_cases_t1 <- c(iq.copy$total_cases[1], iq.copy$total_cases[1:519])
iq.lm.ar <- lm(total_cases ~ total_cases_t1 + seq + month, data = iq.copy[1:400,])
summary(iq.lm.ar)
##
## Call:
## lm(formula = total_cases ~ total_cases_t1 + seq + month, data = iq.copy[1:400,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.314 -2.609 -0.702 1.390 66.258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.281138 1.379281 2.379 0.0179 *
## total_cases_t1 0.658746 0.038366 17.170 <2e-16 ***
## seq 0.007662 0.003340 2.294 0.0223 *
## month2 -1.081819 1.767804 -0.612 0.5409
## month3 -3.207606 1.820379 -1.762 0.0789 .
## month4 -2.938256 1.730741 -1.698 0.0904 .
## month5 -2.587347 1.844905 -1.402 0.1616
## month6 -4.112597 1.848442 -2.225 0.0267 *
## month7 -3.382219 1.687260 -2.005 0.0457 *
## month8 -3.928849 1.788266 -2.197 0.0286 *
## month9 -2.757111 1.753844 -1.572 0.1168
## month10 -2.227225 1.694100 -1.315 0.1894
## month11 -1.791204 1.767840 -1.013 0.3116
## month12 -0.166943 1.779648 -0.094 0.9253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.452 on 386 degrees of freedom
## Multiple R-squared: 0.5302, Adjusted R-squared: 0.5144
## F-statistic: 33.51 on 13 and 386 DF, p-value: < 2.2e-16
Where can we go from here? We could probably implement backwards selection. I will use the base variables as building an offset for each of the variables would be time consuming and I used the base values to test for significance in the first place
iq.lm.working <- lm(total_cases ~ndvi_se + ndvi_sw + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg + month, data = iq.copy)
summary(iq.lm.working)
##
## Call:
## lm(formula = total_cases ~ ndvi_se + ndvi_sw + reanalysis_max_air_temp_k +
## reanalysis_specific_humidity_g_per_kg + month, data = iq.copy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.705 -5.366 -1.836 2.358 101.443
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 176.7247 78.1727 2.261 0.024204 *
## ndvi_se -17.3746 8.5499 -2.032 0.042664 *
## ndvi_sw 13.4404 7.5100 1.790 0.074108 .
## reanalysis_max_air_temp_k -0.5871 0.2481 -2.367 0.018324 *
## reanalysis_specific_humidity_g_per_kg 0.9231 0.4382 2.107 0.035651 *
## month2 -0.1511 2.1622 -0.070 0.944324
## month3 -6.0482 2.1736 -2.783 0.005596 **
## month4 -7.6395 2.0583 -3.712 0.000229 ***
## month5 -7.5887 2.1761 -3.487 0.000530 ***
## month6 -8.4462 2.2429 -3.766 0.000186 ***
## month7 -7.3101 2.1991 -3.324 0.000952 ***
## month8 -5.7133 2.3611 -2.420 0.015885 *
## month9 -2.6291 2.2816 -1.152 0.249730
## month10 -0.0535 2.1968 -0.024 0.980582
## month11 -1.7738 2.2432 -0.791 0.429455
## month12 0.8517 2.1820 0.390 0.696467
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.12 on 504 degrees of freedom
## Multiple R-squared: 0.1414, Adjusted R-squared: 0.1159
## F-statistic: 5.535 on 15 and 504 DF, p-value: 1.74e-10
I will not show the history of me checking backwards selection but we note that the final model with statistically significant variables and the highest adjusted R squared included the southeast and southwest measurements of vegetation index, max air temperature, mean specific humidity and the months.
We can now offset these and see if it helps our AR model analysis
iq.copy$ndvi_se <- c(iq.copy$ndvi_se[1], iq.copy$ndvi_se[1:519])
iq.copy$ndvi_sw <- c(iq.copy$ndvi_sw[1], iq.copy$ndvi_sw[1:519])
iq.copy$reanalysis_max_air_temp_k <- c(iq.copy$reanalysis_max_air_temp_k[1], iq.copy$reanalysis_max_air_temp_k[1:519])
iq.copy$reanalysis_specific_humidity_g_per_kg <- c(iq.copy$reanalysis_specific_humidity_g_per_kg[1], iq.copy$reanalysis_specific_humidity_g_per_kg[1:519])
head(iq.copy)
## total_cases city year weekofyear week_start_date ndvi_ne ndvi_nw
## 937 0 iq 2000 26 2000-07-01 0.1928857 0.1322571
## 938 0 iq 2000 27 2000-07-08 0.2168333 0.2761000
## 939 0 iq 2000 28 2000-07-15 0.1767571 0.1731286
## 940 0 iq 2000 29 2000-07-22 0.2277286 0.1454286
## 941 0 iq 2000 30 2000-07-29 0.3286429 0.3221286
## 942 0 iq 2000 31 2000-08-05 0.2055286 0.1907571
## ndvi_se ndvi_sw precipitation_amt_mm reanalysis_air_temp_k
## 937 0.3408857 0.2472000 25.41 296.7400
## 938 0.3408857 0.2472000 60.61 296.6343
## 939 0.2894571 0.2416571 55.52 296.4157
## 940 0.2041143 0.1280143 5.60 295.3571
## 941 0.2542000 0.2003143 62.76 296.4329
## 942 0.2543714 0.3610429 16.24 297.1914
## reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## 937 298.4500 295.1843 307.3
## 938 298.4286 295.3586 307.3
## 939 297.3929 295.6229 306.6
## 940 296.2286 292.7971 304.5
## 941 297.6357 293.9571 303.6
## 942 298.2857 291.7257 307.0
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## 937 293.1 43.19
## 938 291.1 46.00
## 939 292.6 64.77
## 940 288.6 23.96
## 941 291.5 31.80
## 942 288.5 1.00
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## 937 92.41857 25.41
## 938 93.58143 60.61
## 939 95.84857 55.52
## 940 87.23429 5.60
## 941 88.16143 62.76
## 942 74.72857 16.24
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## 937 16.65143 8.928571 26.40000
## 938 16.65143 10.314286 26.90000
## 939 16.86286 7.385714 26.80000
## 940 17.12000 9.114286 25.76667
## 941 14.43143 9.500000 26.60000
## 942 15.44429 13.771429 25.34000
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## 937 10.77500 32.5 20.7
## 938 11.56667 34.0 20.8
## 939 11.46667 33.0 20.7
## 940 10.53333 31.5 14.7
## 941 11.48000 33.3 19.1
## 942 10.94000 32.0 17.0
## station_precip_mm month seq total_cases_t1
## 937 3.0 7 1 0
## 938 55.6 7 2 0
## 939 38.1 7 3 0
## 940 30.0 7 4 0
## 941 4.0 7 5 0
## 942 11.5 8 6 0
Now that our variables of interest are offset, lets try again
iq.lm.ar.2 <- lm(total_cases ~ seq + total_cases_t1 + month + ndvi_se + ndvi_sw + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg, data = iq.copy[1:400,])
summary(iq.lm.ar.2)
##
## Call:
## lm(formula = total_cases ~ seq + total_cases_t1 + month + ndvi_se +
## ndvi_sw + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg,
## data = iq.copy[1:400, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.635 -2.739 -0.715 1.368 65.224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.119436 68.781085 -0.089 0.9292
## seq 0.006786 0.003468 1.956 0.0511 .
## total_cases_t1 0.653831 0.038725 16.884 <2e-16 ***
## month2 -1.447127 1.788448 -0.809 0.4189
## month3 -3.773361 1.831697 -2.060 0.0401 *
## month4 -3.575902 1.762184 -2.029 0.0431 *
## month5 -3.034994 1.892722 -1.604 0.1096
## month6 -4.155902 1.923150 -2.161 0.0313 *
## month7 -2.665408 1.897893 -1.404 0.1610
## month8 -2.662360 2.039225 -1.306 0.1925
## month9 -1.790333 1.924959 -0.930 0.3529
## month10 -1.547048 1.830078 -0.845 0.3984
## month11 -1.700881 1.828886 -0.930 0.3530
## month12 -0.232688 1.830675 -0.127 0.8989
## ndvi_se -9.418190 7.073434 -1.331 0.1838
## ndvi_sw -2.747338 6.141100 -0.447 0.6549
## reanalysis_max_air_temp_k 0.017543 0.216560 0.081 0.9355
## reanalysis_specific_humidity_g_per_kg 0.423859 0.397664 1.066 0.2872
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.433 on 382 degrees of freedom
## Multiple R-squared: 0.5374, Adjusted R-squared: 0.5168
## F-statistic: 26.1 on 17 and 382 DF, p-value: < 2.2e-16
Our adjusted R squared really didn’t change all to much. I wonder if this has anything to do with how auto.arima() chose a seasonal component for the iq data.
It would then make sense that maybe each observation is not so dependent on the last, but instead on seasonality (hence low adjusted R squared on model which regresses on its most recent past value)
plot(iq.lm.ar.2)
So we can see from the residual plot that there are a few very influential points. 1169 specifically drags the scale location line towards it and lies outside cooks distance on the residual vs leverage plot.
However, this is not econometric and I don’t imagine we want to remove data points to make our model fit better; doing so seems counter-intuitive to the entire idea of trying to predict the results.
I will move on and instead try my luck at predicting the remaining portion of IQ data that was withheld from the AR model.
iq.ar.pred2 <- predict(iq.lm.ar.2, iq.copy[1:400,], level = 0.95, interval = "prediction")
iq.ar.pred <- predict(iq.lm.ar.2, iq.copy[401:520,], level = 0.95, interval = "prediction")
head(iq.ar.pred)
## fit lwr upr
## 1337 18.347361 3.346055 33.34867
## 1338 9.898052 -5.081817 24.87792
## 1339 5.641400 -9.396074 20.67887
## 1340 5.639740 -9.343395 20.62287
## 1341 7.331007 -7.587179 22.24919
## 1342 4.291772 -10.642346 19.22589
iq.ar.pred <- as.data.frame(iq.ar.pred)
iq.ar.pred$actual <- iq.copy$total_cases[401:520]
iq.ar.pred2 <- as.data.frame(iq.ar.pred2)
iq.ar.pred2$actual <- iq.copy$total_cases[1:400]
iq.totalpred <- rbind(iq.ar.pred2, iq.ar.pred)
plot(iq.copy$seq, iq.copy$total_cases, col = 'blue', type = 'l')
lines(iq.copy$seq, iq.totalpred$fit, col = 'red')
lines(iq.copy$seq,iq.totalpred$lwr, lty=3, col='dark gray')
lines(iq.copy$seq,iq.totalpred$upr, lty=3, col='dark gray')
So, we can see here that our predictions did not do as well of a job as above. This could have been expected given the lower R squared value found in the model; there still are unexplained patterns which our model could not pick up on.
However, both of these autoregressive models have done a much better job than the ARIMA model which was essentially just a naive forecast.
We can see just how wide the confidence intervals are at the 95% level but it seems like the test submission only cares about the point forecasts so for our purposes it may still be okay.
Finally, we will try the neural net model. I am not sure what to expect from this. On one hand, the more complex relationships can be modeled by the neural net could be of benefit to us. But on the other hand, the last univariate time series model (ARIMA) did not work so well and the neural net is essentially an ARIMA model if there are no nodes.
ts.sj.copy <- ts(sj.copy$total_cases, frequency = 52, start = c(1990, 18))
sj.nnetar <- nnetar(ts.sj.copy)
autoplot(sj.nnetar)
## Warning: Removed 52 row(s) containing missing values (geom_path).
For whatever reason the autoplot function isnt working but the forecast values look relatively decent. They have done a much better job already than the ARIMA model
Lets move onto the iq data
ts.iq.copy <- ts(iq.copy$total_cases, frequency = 52, start = c(2000, 26))
iq.nnetar <- nnetar(ts.iq.copy)
autoplot(iq.nnetar)
## Warning: Removed 52 row(s) containing missing values (geom_path).
The point forecasts for the iq data look very similar to a naive model with nearly the exact same values across. It seems like the neural net model had trouble capturing any sort of patterns from our data
checkresiduals(sj.nnetar)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
checkresiduals(iq.nnetar)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
Residuals look like white noise so the neural net model was able to squeeze most of the information out of our two data sets. However, the point forecasts still left much to be desired.
Interesting, just to check. I removed the start date from the time series on our data and what occurred was a completely different result for the forecasts of the neural net models. Before (with start date), the point forecasts didnt seem to stray too far from the mean of the total_cases data. However, when the start date was removed the point forecasts reached nearly as high as the max value.
Obviously, since we have the information on the start date it would be wise to include it for the final model but I thought the interaction was interesting nonetheless. It makes sense I guess since when you add a specific start date you are capturing the location of your data in a cycle.
Only thing left to do is to test my forecasts. Need to setup test set.
FROM HERE ON OUT WE ARE WORKING ON TEST SET/PREDICTIONS
dengue.test.set <- cbind(submission_format, dengue_features_test)
summary(dengue.test.set)
## city year weekofyear total_cases
## Length:416 Min. :2008 Min. : 1.00 Min. :0
## Class :character 1st Qu.:2010 1st Qu.:13.75 1st Qu.:0
## Mode :character Median :2011 Median :26.00 Median :0
## Mean :2011 Mean :26.44 Mean :0
## 3rd Qu.:2012 3rd Qu.:39.00 3rd Qu.:0
## Max. :2013 Max. :53.00 Max. :0
##
## city year weekofyear week_start_date
## Length:416 Min. :2008 Min. : 1.00 Length:416
## Class :character 1st Qu.:2010 1st Qu.:13.75 Class :character
## Mode :character Median :2011 Median :26.00 Mode :character
## Mean :2011 Mean :26.44
## 3rd Qu.:2012 3rd Qu.:39.00
## Max. :2013 Max. :53.00
##
## ndvi_ne ndvi_nw ndvi_se ndvi_sw
## Min. :-0.4634 Min. :-0.21180 Min. :0.0062 Min. :-0.01467
## 1st Qu.:-0.0015 1st Qu.: 0.01597 1st Qu.:0.1487 1st Qu.: 0.13408
## Median : 0.1101 Median : 0.08870 Median :0.2042 Median : 0.18647
## Mean : 0.1260 Mean : 0.12680 Mean :0.2077 Mean : 0.20172
## 3rd Qu.: 0.2633 3rd Qu.: 0.24240 3rd Qu.:0.2549 3rd Qu.: 0.25324
## Max. : 0.5004 Max. : 0.64900 Max. :0.4530 Max. : 0.52904
## NA's :43 NA's :11 NA's :1 NA's :1
## precipitation_amt_mm reanalysis_air_temp_k reanalysis_avg_temp_k
## Min. : 0.000 Min. :294.6 Min. :295.2
## 1st Qu.: 8.175 1st Qu.:297.8 1st Qu.:298.3
## Median : 31.455 Median :298.5 Median :299.3
## Mean : 38.354 Mean :298.8 Mean :299.4
## 3rd Qu.: 57.773 3rd Qu.:300.2 3rd Qu.:300.5
## Max. :169.340 Max. :301.9 Max. :303.3
## NA's :2 NA's :2 NA's :2
## reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## Min. :290.8 Min. :298.2
## 1st Qu.:294.3 1st Qu.:301.4
## Median :295.8 Median :302.8
## Mean :295.4 Mean :303.6
## 3rd Qu.:296.6 3rd Qu.:305.8
## Max. :297.8 Max. :314.1
## NA's :2 NA's :2
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## Min. :286.2 Min. : 0.00
## 1st Qu.:293.5 1st Qu.: 9.43
## Median :296.3 Median : 25.85
## Mean :295.7 Mean : 42.17
## 3rd Qu.:298.3 3rd Qu.: 56.48
## Max. :299.7 Max. :301.40
## NA's :2 NA's :2
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## Min. :64.92 Min. : 0.000
## 1st Qu.:77.40 1st Qu.: 8.175
## Median :80.33 Median : 31.455
## Mean :82.50 Mean : 38.354
## 3rd Qu.:88.33 3rd Qu.: 57.773
## Max. :97.98 Max. :169.340
## NA's :2 NA's :2
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## Min. :12.54 Min. : 1.486 Min. :24.16
## 1st Qu.:15.79 1st Qu.: 2.446 1st Qu.:26.51
## Median :17.34 Median : 2.914 Median :27.48
## Mean :16.93 Mean : 5.125 Mean :27.37
## 3rd Qu.:18.17 3rd Qu.: 8.171 3rd Qu.:28.32
## Max. :19.60 Max. :14.486 Max. :30.27
## NA's :2 NA's :2 NA's :12
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## Min. : 4.043 Min. :27.20 Min. :14.20
## 1st Qu.: 5.929 1st Qu.:31.10 1st Qu.:21.20
## Median : 6.643 Median :32.80 Median :22.20
## Mean : 7.811 Mean :32.53 Mean :22.37
## 3rd Qu.: 9.812 3rd Qu.:33.90 3rd Qu.:23.30
## Max. :14.725 Max. :38.40 Max. :26.70
## NA's :12 NA's :3 NA's :9
## station_precip_mm
## Min. : 0.00
## 1st Qu.: 9.10
## Median : 23.60
## Mean : 34.28
## 3rd Qu.: 47.75
## Max. :212.00
## NA's :5
dengue.test.set$weekofyear = NULL
dengue.test.set$city = NULL
dengue.test.set$year = NULL
Test set and the other variables are merged. Lets split into sj and iq and then we can format from there
sj.test <- dengue.test.set[1:260,]
iq.test <- dengue.test.set[261:416, ]
imputing median data for linear model forecasts.
for (i in 1:ncol(sj.test)){
for (j in 1:nrow(sj.test)){
if (is.na(sj.test[j,i] == TRUE)){
sj.test[j,i] <- median(sj.test[,i], na.rm = TRUE)
}
}
}
summary(sj.test)
## total_cases city year weekofyear
## Min. :0 Length:260 Min. :2008 Min. : 1.00
## 1st Qu.:0 Class :character 1st Qu.:2009 1st Qu.:13.75
## Median :0 Mode :character Median :2010 Median :26.50
## Mean :0 Mean :2010 Mean :26.50
## 3rd Qu.:0 3rd Qu.:2012 3rd Qu.:39.25
## Max. :0 Max. :2013 Max. :53.00
## week_start_date ndvi_ne ndvi_nw ndvi_se
## Length:260 Min. :-0.46340 Min. :-0.211800 Min. :0.0062
## Class :character 1st Qu.:-0.03006 1st Qu.:-0.005313 1st Qu.:0.1319
## Mode :character Median : 0.01498 Median : 0.032140 Median :0.1694
## Mean : 0.02318 Mean : 0.036535 Mean :0.1770
## 3rd Qu.: 0.06828 3rd Qu.: 0.076275 3rd Qu.:0.2181
## Max. : 0.50040 Max. : 0.649000 Max. :0.3854
## ndvi_sw precipitation_amt_mm reanalysis_air_temp_k
## Min. :-0.01467 Min. : 0.000 Min. :296.7
## 1st Qu.: 0.11649 1st Qu.: 2.127 1st Qu.:298.3
## Median : 0.14802 Median : 13.975 Median :299.8
## Mean : 0.15321 Mean : 26.425 Mean :299.5
## 3rd Qu.: 0.19128 3rd Qu.: 41.810 3rd Qu.:300.6
## Max. : 0.31813 Max. :169.340 Max. :301.5
## reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## Min. :296.8 Min. :290.8 Min. :298.2
## 1st Qu.:298.4 1st Qu.:294.1 1st Qu.:300.4
## Median :299.8 Median :295.7 Median :301.9
## Mean :299.5 Mean :295.3 Mean :301.6
## 3rd Qu.:300.7 3rd Qu.:296.8 3rd Qu.:302.6
## Max. :301.5 Max. :297.8 Max. :304.1
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## Min. :293.8 Min. : 0.000
## 1st Qu.:296.6 1st Qu.: 6.763
## Median :297.7 Median : 15.245
## Mean :297.6 Mean : 23.701
## 3rd Qu.:298.8 3rd Qu.: 29.000
## Max. :299.7 Max. :301.400
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## Min. :64.92 Min. : 0.000
## 1st Qu.:76.06 1st Qu.: 2.127
## Median :78.38 Median : 13.975
## Mean :78.20 Mean : 26.425
## 3rd Qu.:80.42 3rd Qu.: 41.810
## Max. :86.78 Max. :169.340
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## Min. :12.54 Min. :1.486 Min. :24.16
## 1st Qu.:15.38 1st Qu.:2.229 1st Qu.:26.07
## Median :17.07 Median :2.543 Median :27.42
## Mean :16.76 Mean :2.587 Mean :27.27
## 3rd Qu.:18.18 3rd Qu.:2.857 3rd Qu.:28.51
## Max. :19.34 Max. :4.429 Max. :30.27
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## Min. :4.043 Min. :27.20 Min. :20.00
## 1st Qu.:5.700 1st Qu.:30.60 1st Qu.:21.70
## Median :6.179 Median :31.70 Median :23.30
## Mean :6.153 Mean :31.68 Mean :23.11
## 3rd Qu.:6.586 3rd Qu.:32.80 3rd Qu.:24.40
## Max. :8.400 Max. :35.00 Max. :26.70
## station_precip_mm
## Min. : 0.00
## 1st Qu.: 7.35
## Median : 22.35
## Mean : 34.12
## 3rd Qu.: 48.25
## Max. :207.70
for (i in 1:ncol(iq.test)){
for (j in 1:nrow(iq.test)){
if (is.na(iq.test[j,i] == TRUE)){
iq.test[j,i] <- median(iq.test[,i], na.rm = TRUE)
}
}
}
summary(iq.test)
## total_cases city year weekofyear
## Min. :0 Length:156 Min. :2010 Min. : 1.00
## 1st Qu.:0 Class :character 1st Qu.:2011 1st Qu.:13.75
## Median :0 Mode :character Median :2012 Median :26.00
## Mean :0 Mean :2012 Mean :26.33
## 3rd Qu.:0 3rd Qu.:2012 3rd Qu.:39.00
## Max. :0 Max. :2013 Max. :52.00
## week_start_date ndvi_ne ndvi_nw ndvi_se
## Length:156 Min. :0.08929 Min. :0.06321 Min. :0.09826
## Class :character 1st Qu.:0.21496 1st Qu.:0.22244 1st Qu.:0.21198
## Mode :character Median :0.26523 Median :0.26946 Median :0.25316
## Mean :0.26689 Mean :0.27057 Mean :0.25858
## 3rd Qu.:0.31924 3rd Qu.:0.32456 3rd Qu.:0.30164
## Max. :0.42999 Max. :0.46480 Max. :0.45304
## ndvi_sw precipitation_amt_mm reanalysis_air_temp_k
## Min. :0.08196 Min. : 2.28 Min. :294.6
## 1st Qu.:0.21750 1st Qu.: 36.95 1st Qu.:297.1
## Median :0.28153 Median : 51.29 Median :297.8
## Mean :0.28223 Mean : 57.92 Mean :297.8
## 3rd Qu.:0.34724 3rd Qu.: 72.81 3rd Qu.:298.3
## Max. :0.52904 Max. :152.32 Max. :301.9
## reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## Min. :295.2 Min. :292.0 Min. :302.8
## 1st Qu.:298.2 1st Qu.:294.8 1st Qu.:305.4
## Median :299.0 Median :295.9 Median :306.7
## Mean :299.0 Mean :295.6 Mean :307.0
## 3rd Qu.:299.8 3rd Qu.:296.5 3rd Qu.:308.5
## Max. :303.3 Max. :297.7 Max. :314.1
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## Min. :286.2 Min. : 2.60
## 1st Qu.:291.9 1st Qu.: 35.62
## Median :292.9 Median : 59.30
## Mean :292.7 Mean : 72.61
## 3rd Qu.:293.8 3rd Qu.: 98.97
## Max. :296.0 Max. :280.42
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## Min. :66.31 Min. : 2.28
## 1st Qu.:86.32 1st Qu.: 36.95
## Median :91.41 Median : 51.29
## Mean :89.61 Mean : 57.92
## 3rd Qu.:94.74 3rd Qu.: 72.81
## Max. :97.98 Max. :152.32
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## Min. :13.74 Min. : 4.800 Min. :24.84
## 1st Qu.:16.33 1st Qu.: 7.689 1st Qu.:27.05
## Median :17.55 Median : 9.371 Median :27.51
## Mean :17.21 Mean : 9.321 Mean :27.54
## 3rd Qu.:18.17 3rd Qu.:10.764 3rd Qu.:28.01
## Max. :19.60 Max. :14.486 Max. :29.13
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## Min. : 6.45 Min. :29.60 Min. :14.20
## 1st Qu.: 9.70 1st Qu.:33.20 1st Qu.:20.60
## Median :10.73 Median :34.00 Median :21.20
## Mean :10.74 Mean :33.96 Mean :21.09
## 3rd Qu.:11.63 3rd Qu.:34.83 3rd Qu.:21.85
## Max. :14.72 Max. :38.40 Max. :23.20
## station_precip_mm
## Min. : 0.00
## 1st Qu.: 11.72
## Median : 27.20
## Mean : 34.25
## 3rd Qu.: 43.62
## Max. :212.00
setting date and creating month factor in each
sj.test$week_start_date <- as.Date(sj.test$week_start_date, format=c('%Y-%m-%d'))
sj.test$month <- as.factor(month(sj.test$week_start_date))
summary(sj.test)
## total_cases city year weekofyear
## Min. :0 Length:260 Min. :2008 Min. : 1.00
## 1st Qu.:0 Class :character 1st Qu.:2009 1st Qu.:13.75
## Median :0 Mode :character Median :2010 Median :26.50
## Mean :0 Mean :2010 Mean :26.50
## 3rd Qu.:0 3rd Qu.:2012 3rd Qu.:39.25
## Max. :0 Max. :2013 Max. :53.00
##
## week_start_date ndvi_ne ndvi_nw ndvi_se
## Min. :2008-04-29 Min. :-0.46340 Min. :-0.211800 Min. :0.0062
## 1st Qu.:2009-07-28 1st Qu.:-0.03006 1st Qu.:-0.005313 1st Qu.:0.1319
## Median :2010-10-25 Median : 0.01498 Median : 0.032140 Median :0.1694
## Mean :2010-10-25 Mean : 0.02318 Mean : 0.036535 Mean :0.1770
## 3rd Qu.:2012-01-23 3rd Qu.: 0.06828 3rd Qu.: 0.076275 3rd Qu.:0.2181
## Max. :2013-04-23 Max. : 0.50040 Max. : 0.649000 Max. :0.3854
##
## ndvi_sw precipitation_amt_mm reanalysis_air_temp_k
## Min. :-0.01467 Min. : 0.000 Min. :296.7
## 1st Qu.: 0.11649 1st Qu.: 2.127 1st Qu.:298.3
## Median : 0.14802 Median : 13.975 Median :299.8
## Mean : 0.15321 Mean : 26.425 Mean :299.5
## 3rd Qu.: 0.19128 3rd Qu.: 41.810 3rd Qu.:300.6
## Max. : 0.31813 Max. :169.340 Max. :301.5
##
## reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## Min. :296.8 Min. :290.8 Min. :298.2
## 1st Qu.:298.4 1st Qu.:294.1 1st Qu.:300.4
## Median :299.8 Median :295.7 Median :301.9
## Mean :299.5 Mean :295.3 Mean :301.6
## 3rd Qu.:300.7 3rd Qu.:296.8 3rd Qu.:302.6
## Max. :301.5 Max. :297.8 Max. :304.1
##
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## Min. :293.8 Min. : 0.000
## 1st Qu.:296.6 1st Qu.: 6.763
## Median :297.7 Median : 15.245
## Mean :297.6 Mean : 23.701
## 3rd Qu.:298.8 3rd Qu.: 29.000
## Max. :299.7 Max. :301.400
##
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## Min. :64.92 Min. : 0.000
## 1st Qu.:76.06 1st Qu.: 2.127
## Median :78.38 Median : 13.975
## Mean :78.20 Mean : 26.425
## 3rd Qu.:80.42 3rd Qu.: 41.810
## Max. :86.78 Max. :169.340
##
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## Min. :12.54 Min. :1.486 Min. :24.16
## 1st Qu.:15.38 1st Qu.:2.229 1st Qu.:26.07
## Median :17.07 Median :2.543 Median :27.42
## Mean :16.76 Mean :2.587 Mean :27.27
## 3rd Qu.:18.18 3rd Qu.:2.857 3rd Qu.:28.51
## Max. :19.34 Max. :4.429 Max. :30.27
##
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## Min. :4.043 Min. :27.20 Min. :20.00
## 1st Qu.:5.700 1st Qu.:30.60 1st Qu.:21.70
## Median :6.179 Median :31.70 Median :23.30
## Mean :6.153 Mean :31.68 Mean :23.11
## 3rd Qu.:6.586 3rd Qu.:32.80 3rd Qu.:24.40
## Max. :8.400 Max. :35.00 Max. :26.70
##
## station_precip_mm month
## Min. : 0.00 1 : 25
## 1st Qu.: 7.35 4 : 25
## Median : 22.35 7 : 25
## Mean : 34.12 10 : 23
## 3rd Qu.: 48.25 9 : 22
## Max. :207.70 2 : 20
## (Other):120
iq.test$week_start_date <- as.Date(iq.test$week_start_date, format=c('%Y-%m-%d'))
iq.test$month <- as.factor(month(iq.test$week_start_date))
summary(iq.test)
## total_cases city year weekofyear
## Min. :0 Length:156 Min. :2010 Min. : 1.00
## 1st Qu.:0 Class :character 1st Qu.:2011 1st Qu.:13.75
## Median :0 Mode :character Median :2012 Median :26.00
## Mean :0 Mean :2012 Mean :26.33
## 3rd Qu.:0 3rd Qu.:2012 3rd Qu.:39.00
## Max. :0 Max. :2013 Max. :52.00
##
## week_start_date ndvi_ne ndvi_nw ndvi_se
## Min. :2010-07-02 Min. :0.08929 Min. :0.06321 Min. :0.09826
## 1st Qu.:2011-03-31 1st Qu.:0.21496 1st Qu.:0.22244 1st Qu.:0.21198
## Median :2011-12-28 Median :0.26523 Median :0.26946 Median :0.25316
## Mean :2011-12-28 Mean :0.26689 Mean :0.27057 Mean :0.25858
## 3rd Qu.:2012-09-24 3rd Qu.:0.31924 3rd Qu.:0.32456 3rd Qu.:0.30164
## Max. :2013-06-25 Max. :0.42999 Max. :0.46480 Max. :0.45304
##
## ndvi_sw precipitation_amt_mm reanalysis_air_temp_k
## Min. :0.08196 Min. : 2.28 Min. :294.6
## 1st Qu.:0.21750 1st Qu.: 36.95 1st Qu.:297.1
## Median :0.28153 Median : 51.29 Median :297.8
## Mean :0.28223 Mean : 57.92 Mean :297.8
## 3rd Qu.:0.34724 3rd Qu.: 72.81 3rd Qu.:298.3
## Max. :0.52904 Max. :152.32 Max. :301.9
##
## reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## Min. :295.2 Min. :292.0 Min. :302.8
## 1st Qu.:298.2 1st Qu.:294.8 1st Qu.:305.4
## Median :299.0 Median :295.9 Median :306.7
## Mean :299.0 Mean :295.6 Mean :307.0
## 3rd Qu.:299.8 3rd Qu.:296.5 3rd Qu.:308.5
## Max. :303.3 Max. :297.7 Max. :314.1
##
## reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## Min. :286.2 Min. : 2.60
## 1st Qu.:291.9 1st Qu.: 35.62
## Median :292.9 Median : 59.30
## Mean :292.7 Mean : 72.61
## 3rd Qu.:293.8 3rd Qu.: 98.97
## Max. :296.0 Max. :280.42
##
## reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## Min. :66.31 Min. : 2.28
## 1st Qu.:86.32 1st Qu.: 36.95
## Median :91.41 Median : 51.29
## Mean :89.61 Mean : 57.92
## 3rd Qu.:94.74 3rd Qu.: 72.81
## Max. :97.98 Max. :152.32
##
## reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## Min. :13.74 Min. : 4.800 Min. :24.84
## 1st Qu.:16.33 1st Qu.: 7.689 1st Qu.:27.05
## Median :17.55 Median : 9.371 Median :27.51
## Mean :17.21 Mean : 9.321 Mean :27.54
## 3rd Qu.:18.17 3rd Qu.:10.764 3rd Qu.:28.01
## Max. :19.60 Max. :14.486 Max. :29.13
##
## station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## Min. : 6.45 Min. :29.60 Min. :14.20
## 1st Qu.: 9.70 1st Qu.:33.20 1st Qu.:20.60
## Median :10.73 Median :34.00 Median :21.20
## Mean :10.74 Mean :33.96 Mean :21.09
## 3rd Qu.:11.63 3rd Qu.:34.83 3rd Qu.:21.85
## Max. :14.72 Max. :38.40 Max. :23.20
##
## station_precip_mm month
## Min. : 0.00 1 :15
## 1st Qu.: 11.72 4 :15
## Median : 27.20 7 :15
## Mean : 34.25 10 :14
## 3rd Qu.: 43.62 9 :13
## Max. :212.00 2 :12
## (Other):72
Making univariate time series for arima or neural net models
ts.sj.test <- ts(sj.test$total_cases, frequency = 52, start = c(2008, 18))
ts.iq.test <- ts(iq.test$total_cases, frequency = 52, start = c(2010, 26))
try out ARIMA forecasts
sub.aa.sj <- forecast(aa.sj, newdata = ts.sj.test, h = 260)
## Warning in forecast.forecast_ARIMA(aa.sj, newdata = ts.sj.test, h = 260): The
## non-existent newdata arguments will be ignored.
sub.aa.iq <- forecast(aa.iq, newdata = ts.iq.test, h = 156)
## Warning in forecast.forecast_ARIMA(aa.iq, newdata = ts.iq.test, h = 156): The
## non-existent newdata arguments will be ignored.
As expected, our ARIMA models look like naive models
Moving onto our AR model.
sj.lambda <- BoxCox.lambda(sj.copy$total_cases)
## Warning in guerrero(x, lower, upper): Guerrero's method for selecting a Box-Cox
## parameter (lambda) is given for strictly positive data.
sj.test$seq <- seq(1, 260)
sj.test$total_cases_t1 <- c(sj.test$total_cases[1], sj.test$total_cases[1:259])
predict.sj.ar <- forecast(sj.lm.ar, lambda = sj.lambda, newdata = sj.test)
## Warning in InvBoxCox(fcast$mean, lambda, biasadj, fcast): biasadj information
## not found, defaulting to FALSE.
iq.lambda <- BoxCox.lambda(iq.copy$total_cases)
## Warning in guerrero(x, lower, upper): Guerrero's method for selecting a Box-Cox
## parameter (lambda) is given for strictly positive data.
iq.test$seq <- seq(1,156)
iq.test$total_cases_t1 <- iq.test$total_cases
iq.test$ndvi_se <- c(iq.test$ndvi_se[1], iq.test$ndvi_se[1:155])
iq.test$ndvi_sw <- c(iq.test$ndvi_sw[1], iq.test$ndvi_sw[1:155])
iq.test$reanalysis_max_air_temp_k <- c(iq.test$reanalysis_max_air_temp_k[1], iq.test$reanalysis_max_air_temp_k[1:155])
iq.test$reanalysis_specific_humidity_g_per_kg <- c(iq.test$reanalysis_specific_humidity_g_per_kg[1], iq.test$reanalysis_specific_humidity_g_per_kg[1:155])
predict.iq.ar <- forecast(iq.lm.ar.2, lambda = iq.lambda, newdata = iq.test)
## Warning in InvBoxCox(fcast$mean, lambda, biasadj, fcast): biasadj information
## not found, defaulting to FALSE.
The forecasts seem decent after a boxcox transformation. They are within a possibly expected value (around mean values of total cases) and seem to fluctuate with months/seasonality.
Lets try the neural net.
sub.sj.nn <- forecast(sj.nnetar, newdata = ts.sj.test, h = 260)
sub.iq.nn <- forecast(iq.nnetar, newdata = ts.iq.test, h = 156)
The forecasts for the neural net seem alright. As we notice the default forecast is set for two cycles out (104 observations) so I have specified the number of obervations using h = nrows(sj data, iq data)
setting up data frame and moving predicted values to submission page
df.predict.sj.ar <- as.data.frame(predict.sj.ar)
df.predict.iq.ar <- as.data.frame(predict.iq.ar)
df.sub.sj.nn <- as.data.frame(sub.sj.nn)
df.sub.iq.nn <- as.data.frame(sub.iq.nn)
df.sub.aa.sj <- as.data.frame(sub.aa.sj)
df.sub.aa.iq <- as.data.frame(sub.aa.iq)
Submission requires rounding so we will do that
df.predict.sj.ar <- round(df.predict.sj.ar)
df.predict.iq.ar <- round(df.predict.iq.ar)
df.sub.sj.nn <- round(df.sub.sj.nn)
df.sub.iq.nn <- round(df.sub.iq.nn)
df.sub.aa.sj <- round(df.sub.aa.sj)
df.sub.aa.iq <- round(df.sub.aa.iq)
Finally, putting our results back into the submission file
ar.submission <- submission_format
ar.submission$total_cases[1:260] <- df.predict.sj.ar$`Point Forecast`
ar.submission$total_cases[261: 416] <- df.predict.iq.ar$`Point Forecast`
write.csv(ar.submission, "autoregressive forecasts.csv")
nn.submission <- submission_format
nn.submission$total_cases[1:260] <- df.sub.sj.nn$`Point Forecast`
nn.submission$total_cases[261: 416] <- df.sub.iq.nn$`Point Forecast`
write.csv(nn.submission, "neural net forecasts.csv")
aa.submission <- submission_format
aa.submission$total_cases[1:260] <- df.sub.aa.sj$`Point Forecast`
aa.submission$total_cases[261: 416] <- df.sub.aa.iq$`Point Forecast`
write.csv(aa.submission, "auto arima forecasts.csv")