Loading in libraries

library(forecast)
## Warning: package 'forecast' was built under R version 4.0.4
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(ggplot2)
library(TTR)
## Warning: package 'TTR' was built under R version 4.0.4
library(stats)
library(readr)
library(fpp)
## Warning: package 'fpp' was built under R version 4.0.5
## Loading required package: fma
## Warning: package 'fma' was built under R version 4.0.5
## Loading required package: expsmooth
## Warning: package 'expsmooth' was built under R version 4.0.5
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: tseries
## Warning: package 'tseries' was built under R version 4.0.4
library(fpp2)
## Warning: package 'fpp2' was built under R version 4.0.5
## 
## Attaching package: 'fpp2'
## The following objects are masked from 'package:fpp':
## 
##     ausair, ausbeer, austa, austourists, debitcards, departures,
##     elecequip, euretail, guinearice, oil, sunspotarea, usmelec
library(seasonal)
## Warning: package 'seasonal' was built under R version 4.0.5
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(psych)
## Warning: package 'psych' was built under R version 4.0.5
## 
## Attaching package: 'psych'
## The following object is masked from 'package:seasonal':
## 
##     outlier
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.0.5
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.0.5
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.0.5
## Registered S3 methods overwritten by 'ggfortify':
##   method                 from    
##   autoplot.Arima         forecast
##   autoplot.acf           forecast
##   autoplot.ar            forecast
##   autoplot.bats          forecast
##   autoplot.decomposed.ts forecast
##   autoplot.ets           forecast
##   autoplot.forecast      forecast
##   autoplot.stl           forecast
##   autoplot.ts            forecast
##   fitted.ar              forecast
##   fortify.ts             forecast
##   residuals.ar           forecast
library(vars)
## Warning: package 'vars' was built under R version 4.0.5
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following objects are masked from 'package:fma':
## 
##     cement, housing, petrol
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.0.5
## Loading required package: sandwich
## Loading required package: urca
## Warning: package 'urca' was built under R version 4.0.4
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(AER)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## Loading required package: survival
library(MASS)

Clearing everything to make it easier to work

rm(list = ls())

Excel spreadsheets and data to build my models from

dengue_features_test <- read.csv("C:\\Users\\wanga\\OneDrive\\Desktop\\PREDICTIVE ANALYTICS\\FINAL\\dengue_features_test.csv")
dengue_features_train <- read.csv("C:\\Users\\wanga\\OneDrive\\Desktop\\PREDICTIVE ANALYTICS\\FINAL\\dengue_features_train.csv")
dengue_labels_train <- read.csv("C:\\Users\\wanga\\OneDrive\\Desktop\\PREDICTIVE ANALYTICS\\FINAL\\dengue_labels_train.csv")
submission_format <- read.csv("C:\\Users\\wanga\\OneDrive\\Desktop\\PREDICTIVE ANALYTICS\\FINAL\\submission_format.csv")

So, we have four excel files worth of data. We first should merge by the training set and the labels which has the variable of interest we need.

The two excel files match in nrows so maybe we can just use cbind instead of merge.

dengue_merge_data.train <- cbind(dengue_labels_train, dengue_features_train)
summary(dengue_merge_data.train)
##      city                year        weekofyear     total_cases    
##  Length:1456        Min.   :1990   Min.   : 1.00   Min.   :  0.00  
##  Class :character   1st Qu.:1997   1st Qu.:13.75   1st Qu.:  5.00  
##  Mode  :character   Median :2002   Median :26.50   Median : 12.00  
##                     Mean   :2001   Mean   :26.50   Mean   : 24.68  
##                     3rd Qu.:2005   3rd Qu.:39.25   3rd Qu.: 28.00  
##                     Max.   :2010   Max.   :53.00   Max.   :461.00  
##                                                                    
##      city                year        weekofyear    week_start_date   
##  Length:1456        Min.   :1990   Min.   : 1.00   Length:1456       
##  Class :character   1st Qu.:1997   1st Qu.:13.75   Class :character  
##  Mode  :character   Median :2002   Median :26.50   Mode  :character  
##                     Mean   :2001   Mean   :26.50                     
##                     3rd Qu.:2005   3rd Qu.:39.25                     
##                     Max.   :2010   Max.   :53.00                     
##                                                                      
##     ndvi_ne            ndvi_nw            ndvi_se            ndvi_sw        
##  Min.   :-0.40625   Min.   :-0.45610   Min.   :-0.01553   Min.   :-0.06346  
##  1st Qu.: 0.04495   1st Qu.: 0.04922   1st Qu.: 0.15509   1st Qu.: 0.14421  
##  Median : 0.12882   Median : 0.12143   Median : 0.19605   Median : 0.18945  
##  Mean   : 0.14229   Mean   : 0.13055   Mean   : 0.20378   Mean   : 0.20231  
##  3rd Qu.: 0.24848   3rd Qu.: 0.21660   3rd Qu.: 0.24885   3rd Qu.: 0.24698  
##  Max.   : 0.50836   Max.   : 0.45443   Max.   : 0.53831   Max.   : 0.54602  
##  NA's   :194        NA's   :52         NA's   :22         NA's   :22        
##  precipitation_amt_mm reanalysis_air_temp_k reanalysis_avg_temp_k
##  Min.   :  0.00       Min.   :294.6         Min.   :294.9        
##  1st Qu.:  9.80       1st Qu.:297.7         1st Qu.:298.3        
##  Median : 38.34       Median :298.6         Median :299.3        
##  Mean   : 45.76       Mean   :298.7         Mean   :299.2        
##  3rd Qu.: 70.23       3rd Qu.:299.8         3rd Qu.:300.2        
##  Max.   :390.60       Max.   :302.2         Max.   :302.9        
##  NA's   :13           NA's   :10            NA's   :10           
##  reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
##  Min.   :289.6               Min.   :297.8            
##  1st Qu.:294.1               1st Qu.:301.0            
##  Median :295.6               Median :302.4            
##  Mean   :295.2               Mean   :303.4            
##  3rd Qu.:296.5               3rd Qu.:305.5            
##  Max.   :298.4               Max.   :314.0            
##  NA's   :10                  NA's   :10               
##  reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
##  Min.   :286.9             Min.   :  0.00                 
##  1st Qu.:293.9             1st Qu.: 13.05                 
##  Median :296.2             Median : 27.25                 
##  Mean   :295.7             Mean   : 40.15                 
##  3rd Qu.:297.9             3rd Qu.: 52.20                 
##  Max.   :299.9             Max.   :570.50                 
##  NA's   :10                NA's   :10                     
##  reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
##  Min.   :57.79                        Min.   :  0.00              
##  1st Qu.:77.18                        1st Qu.:  9.80              
##  Median :80.30                        Median : 38.34              
##  Mean   :82.16                        Mean   : 45.76              
##  3rd Qu.:86.36                        3rd Qu.: 70.23              
##  Max.   :98.61                        Max.   :390.60              
##  NA's   :10                           NA's   :13                  
##  reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
##  Min.   :11.72                         Min.   : 1.357    Min.   :21.40     
##  1st Qu.:15.56                         1st Qu.: 2.329    1st Qu.:26.30     
##  Median :17.09                         Median : 2.857    Median :27.41     
##  Mean   :16.75                         Mean   : 4.904    Mean   :27.19     
##  3rd Qu.:17.98                         3rd Qu.: 7.625    3rd Qu.:28.16     
##  Max.   :20.46                         Max.   :16.029    Max.   :30.80     
##  NA's   :10                            NA's   :10        NA's   :43        
##  station_diur_temp_rng_c station_max_temp_c station_min_temp_c
##  Min.   : 4.529          Min.   :26.70      Min.   :14.7      
##  1st Qu.: 6.514          1st Qu.:31.10      1st Qu.:21.1      
##  Median : 7.300          Median :32.80      Median :22.2      
##  Mean   : 8.059          Mean   :32.45      Mean   :22.1      
##  3rd Qu.: 9.567          3rd Qu.:33.90      3rd Qu.:23.3      
##  Max.   :15.800          Max.   :42.20      Max.   :25.6      
##  NA's   :43              NA's   :20         NA's   :14        
##  station_precip_mm
##  Min.   :  0.00   
##  1st Qu.:  8.70   
##  Median : 23.85   
##  Mean   : 39.33   
##  3rd Qu.: 53.90   
##  Max.   :543.30   
##  NA's   :22
dengue_merge_data.train$weekofyear = NULL
dengue_merge_data.train$city = NULL
dengue_merge_data.train$year = NULL

Setting week start date to actually be a date. We could also break the data frame into two for the two separate cities? No idea if necessary yet though.

dengue_merge_data.train$week_start_date <- as.Date(dengue_merge_data.train$week_start_date, format=c('%Y-%m-%d'))

Now we have essentially everything we need to build our models.

Running some exploratory data analysis. Boxplots always a nice place to start.

boxplot(dengue_merge_data.train$total_cases ~ dengue_merge_data.train$city)

boxplot(dengue_merge_data.train$total_cases ~ dengue_merge_data.train$city, outline = FALSE)

We can immediately see that sj is higher on average and in its outliers

boxplot(dengue_merge_data.train$total_cases ~ dengue_merge_data.train$weekofyear)

boxplot(dengue_merge_data.train$total_cases ~ dengue_merge_data.train$weekofyear, outline = FALSE)

We might notice that the late part of the year (around late fall/winter time) is when the number of cases ramp up. This also seems to hold true even when including the outlier points.

At row 936, that is the cutoff for city sj and the start of data for city iq. Similarly, there is over 10 years worth of data for each city so I think here that creating two separate data frames to analyze trend by city would be a decent option here

sj.data <- dengue_merge_data.train[1:936, ]
summary(sj.data)
##   total_cases         city                year        weekofyear   
##  Min.   :  0.00   Length:936         Min.   :1990   Min.   : 1.00  
##  1st Qu.:  9.00   Class :character   1st Qu.:1994   1st Qu.:13.75  
##  Median : 19.00   Mode  :character   Median :1999   Median :26.50  
##  Mean   : 34.18                      Mean   :1999   Mean   :26.50  
##  3rd Qu.: 37.00                      3rd Qu.:2003   3rd Qu.:39.25  
##  Max.   :461.00                      Max.   :2008   Max.   :53.00  
##                                                                    
##  week_start_date         ndvi_ne            ndvi_nw            ndvi_se        
##  Min.   :1990-04-30   Min.   :-0.40625   Min.   :-0.45610   Min.   :-0.01553  
##  1st Qu.:1994-10-27   1st Qu.: 0.00450   1st Qu.: 0.01642   1st Qu.: 0.13928  
##  Median :1999-04-26   Median : 0.05770   Median : 0.06807   Median : 0.17719  
##  Mean   :1999-04-26   Mean   : 0.05792   Mean   : 0.06747   Mean   : 0.17766  
##  3rd Qu.:2003-10-23   3rd Qu.: 0.11110   3rd Qu.: 0.11520   3rd Qu.: 0.21256  
##  Max.   :2008-04-22   Max.   : 0.49340   Max.   : 0.43710   Max.   : 0.39313  
##                       NA's   :191        NA's   :49         NA's   :19        
##     ndvi_sw         precipitation_amt_mm reanalysis_air_temp_k
##  Min.   :-0.06346   Min.   :  0.00       Min.   :295.9        
##  1st Qu.: 0.12916   1st Qu.:  0.00       1st Qu.:298.2        
##  Median : 0.16597   Median : 20.80       Median :299.3        
##  Mean   : 0.16596   Mean   : 35.47       Mean   :299.2        
##  3rd Qu.: 0.20277   3rd Qu.: 52.18       3rd Qu.:300.1        
##  Max.   : 0.38142   Max.   :390.60       Max.   :302.2        
##  NA's   :19         NA's   :9            NA's   :6            
##  reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
##  Min.   :296.1         Min.   :289.6               Min.   :297.8            
##  1st Qu.:298.3         1st Qu.:293.8               1st Qu.:300.4            
##  Median :299.4         Median :295.5               Median :301.5            
##  Mean   :299.3         Mean   :295.1               Mean   :301.4            
##  3rd Qu.:300.2         3rd Qu.:296.4               3rd Qu.:302.4            
##  Max.   :302.2         Max.   :297.8               Max.   :304.3            
##  NA's   :6             NA's   :6                   NA's   :6                
##  reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
##  Min.   :292.6             Min.   :  0.00                 
##  1st Qu.:296.3             1st Qu.: 10.82                 
##  Median :297.5             Median : 21.30                 
##  Mean   :297.3             Mean   : 30.47                 
##  3rd Qu.:298.4             3rd Qu.: 37.00                 
##  Max.   :299.9             Max.   :570.50                 
##  NA's   :6                 NA's   :6                      
##  reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
##  Min.   :66.74                        Min.   :  0.00              
##  1st Qu.:76.25                        1st Qu.:  0.00              
##  Median :78.67                        Median : 20.80              
##  Mean   :78.57                        Mean   : 35.47              
##  3rd Qu.:80.96                        3rd Qu.: 52.18              
##  Max.   :87.58                        Max.   :390.60              
##  NA's   :6                            NA's   :9                   
##  reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
##  Min.   :11.72                         Min.   :1.357     Min.   :22.84     
##  1st Qu.:15.24                         1st Qu.:2.157     1st Qu.:25.84     
##  Median :16.85                         Median :2.457     Median :27.23     
##  Mean   :16.55                         Mean   :2.516     Mean   :27.01     
##  3rd Qu.:17.86                         3rd Qu.:2.800     3rd Qu.:28.19     
##  Max.   :19.44                         Max.   :4.429     Max.   :30.07     
##  NA's   :6                             NA's   :6         NA's   :6         
##  station_diur_temp_rng_c station_max_temp_c station_min_temp_c
##  Min.   :4.529           Min.   :26.70      Min.   :17.8      
##  1st Qu.:6.200           1st Qu.:30.60      1st Qu.:21.7      
##  Median :6.757           Median :31.70      Median :22.8      
##  Mean   :6.757           Mean   :31.61      Mean   :22.6      
##  3rd Qu.:7.286           3rd Qu.:32.80      3rd Qu.:23.9      
##  Max.   :9.914           Max.   :35.60      Max.   :25.6      
##  NA's   :6               NA's   :6          NA's   :6         
##  station_precip_mm
##  Min.   :  0.000  
##  1st Qu.:  6.825  
##  Median : 17.750  
##  Mean   : 26.785  
##  3rd Qu.: 35.450  
##  Max.   :305.900  
##  NA's   :6
iq.data <- dengue_merge_data.train[937: 1456,]
summary(iq.data)
##   total_cases          city                year        weekofyear   
##  Min.   :  0.000   Length:520         Min.   :2000   Min.   : 1.00  
##  1st Qu.:  1.000   Class :character   1st Qu.:2003   1st Qu.:13.75  
##  Median :  5.000   Mode  :character   Median :2005   Median :26.50  
##  Mean   :  7.565                      Mean   :2005   Mean   :26.50  
##  3rd Qu.:  9.000                      3rd Qu.:2007   3rd Qu.:39.25  
##  Max.   :116.000                      Max.   :2010   Max.   :53.00  
##                                                                     
##  week_start_date         ndvi_ne           ndvi_nw           ndvi_se       
##  Min.   :2000-07-01   Min.   :0.06173   Min.   :0.03586   Min.   :0.02988  
##  1st Qu.:2002-12-30   1st Qu.:0.20000   1st Qu.:0.17954   1st Qu.:0.19474  
##  Median :2005-06-28   Median :0.26364   Median :0.23297   Median :0.24980  
##  Mean   :2005-06-28   Mean   :0.26387   Mean   :0.23878   Mean   :0.25013  
##  3rd Qu.:2007-12-26   3rd Qu.:0.31997   3rd Qu.:0.29393   3rd Qu.:0.30230  
##  Max.   :2010-06-25   Max.   :0.50836   Max.   :0.45443   Max.   :0.53831  
##                       NA's   :3         NA's   :3         NA's   :3        
##     ndvi_sw        precipitation_amt_mm reanalysis_air_temp_k
##  Min.   :0.06418   Min.   :  0.00       Min.   :294.6        
##  1st Qu.:0.20413   1st Qu.: 39.10       1st Qu.:297.1        
##  Median :0.26214   Median : 60.47       Median :297.8        
##  Mean   :0.26678   Mean   : 64.25       Mean   :297.9        
##  3rd Qu.:0.32515   3rd Qu.: 85.76       3rd Qu.:298.6        
##  Max.   :0.54602   Max.   :210.83       Max.   :301.6        
##  NA's   :3         NA's   :4            NA's   :4            
##  reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
##  Min.   :294.9         Min.   :290.1               Min.   :300.0            
##  1st Qu.:298.2         1st Qu.:294.6               1st Qu.:305.2            
##  Median :299.1         Median :295.9               Median :307.1            
##  Mean   :299.1         Mean   :295.5               Mean   :307.1            
##  3rd Qu.:300.1         3rd Qu.:296.5               3rd Qu.:308.7            
##  Max.   :302.9         Max.   :298.4               Max.   :314.0            
##  NA's   :4             NA's   :4                   NA's   :4                
##  reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
##  Min.   :286.9             Min.   :  0.00                 
##  1st Qu.:292.0             1st Qu.: 24.07                 
##  Median :293.1             Median : 46.44                 
##  Mean   :292.9             Mean   : 57.61                 
##  3rd Qu.:294.2             3rd Qu.: 71.07                 
##  Max.   :296.0             Max.   :362.03                 
##  NA's   :4                 NA's   :4                      
##  reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
##  Min.   :57.79                        Min.   :  0.00              
##  1st Qu.:84.30                        1st Qu.: 39.10              
##  Median :90.92                        Median : 60.47              
##  Mean   :88.64                        Mean   : 64.25              
##  3rd Qu.:94.56                        3rd Qu.: 85.76              
##  Max.   :98.61                        Max.   :210.83              
##  NA's   :4                            NA's   :4                   
##  reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
##  Min.   :12.11                         Min.   : 3.714    Min.   :21.40     
##  1st Qu.:16.10                         1st Qu.: 7.371    1st Qu.:27.00     
##  Median :17.43                         Median : 8.964    Median :27.60     
##  Mean   :17.10                         Mean   : 9.207    Mean   :27.53     
##  3rd Qu.:18.18                         3rd Qu.:11.014    3rd Qu.:28.10     
##  Max.   :20.46                         Max.   :16.029    Max.   :30.80     
##  NA's   :4                             NA's   :4         NA's   :37        
##  station_diur_temp_rng_c station_max_temp_c station_min_temp_c
##  Min.   : 5.20           Min.   :30.1       Min.   :14.7      
##  1st Qu.: 9.50           1st Qu.:33.2       1st Qu.:20.6      
##  Median :10.62           Median :34.0       Median :21.3      
##  Mean   :10.57           Mean   :34.0       Mean   :21.2      
##  3rd Qu.:11.65           3rd Qu.:34.9       3rd Qu.:22.0      
##  Max.   :15.80           Max.   :42.2       Max.   :24.2      
##  NA's   :37              NA's   :14         NA's   :8         
##  station_precip_mm
##  Min.   :  0.00   
##  1st Qu.: 17.20   
##  Median : 45.30   
##  Mean   : 62.47   
##  3rd Qu.: 85.95   
##  Max.   :543.30   
##  NA's   :16

more boxplots

boxplot(sj.data$total_cases ~ sj.data$weekofyear)

boxplot(sj.data$total_cases ~ sj.data$weekofyear, outline = FALSE)

boxplot(iq.data$total_cases ~ iq.data$weekofyear)

boxplot(iq.data$total_cases ~ iq.data$weekofyear, outline = FALSE)

Interesting how the trend changes by each week. Maybe in our analysis we will need some sort of dummy variable to distinguish between the two cities.

We should check how the total cases are distributed

hist(sj.data$total_cases)

hist(iq.data$total_cases)

That might need transforming later to make it more normal

Finally we can plot some stuff to get an idea for

I think we can start with an ETS/ARIMA model. Additionally, I think it would be wise to use the two separate data frames for sj and iq data to model since they follow clearly different patterns. If we were doing linear modeling, we could just create a dummy variable.

ts.sj.data <- ts(sj.data$total_cases, frequency = 52, start = c(1990, 18))
autoplot(ts.sj.data)

Maybe a neural net would be good here? It seems like there is some seasonality but peaks are not uniform. Lets check the iq time series data next

ts.iq.data <- ts(iq.data$total_cases, frequency = 52, start = c(2000, 26))
autoplot(ts.iq.data)

Checking if the decomposition command also shows seasonality

decompose(ts.sj.data, type = "additive") %>% autoplot()
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Removed 104 row(s) containing missing values (geom_path).

decompose(ts.iq.data, type = "additive") %>% autoplot()
## Warning: attributes are not identical across measure variables;
## they will be dropped

## Warning: Removed 104 row(s) containing missing values (geom_path).

While the remainder is not white noise, there is still a seasonal and trend component to our data.

Same here for the iq data, there are some significant peaks and nothing seems uniform. I would like to see if our data is stationary since it actually looks to be that way.

ggtsdisplay(ts.sj.data)

pacf(ts.sj.data)

ggtsdisplay(ts.iq.data)

pacf(ts.iq.data)

Not stationary at all. Very periodical? So maybe an ARIMA model will still be good here!

ndiffs(ts.sj.data)
## [1] 1
nsdiffs(ts.sj.data)
## [1] 0
ts.sj.data %>% diff(1) %>% diff(1) %>% ggtsdisplay()

Differencing twice seemed to help it a bit more but there are still still plenty of significant lags in both the ACF and PACF plots.

ndiffs(ts.iq.data)
## [1] 1
nsdiffs(ts.iq.data)
## [1] 0
ts.iq.data %>% diff(1) %>% ggtsdisplay()

IQ data looks good after one difference but SJ is still a bit concerning. However, it is an improvement from the original ACF and PACF plots.

Lets just try ARIMA models and see what happens

aa.sj <- auto.arima(ts.sj.data)
summary(aa.sj)
## Series: ts.sj.data 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1      ma1
##       0.7116  -0.5929
## s.e.  0.0948   0.1078
## 
## sigma^2 estimated as 180.9:  log likelihood=-3755.85
## AIC=7517.71   AICc=7517.73   BIC=7532.23
## 
## Training set error measures:
##                       ME     RMSE      MAE MPE MAPE      MASE        ACF1
## Training set 0.001467535 13.42959 8.047587 NaN  Inf 0.2202839 0.001092614
checkresiduals(aa.sj)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)
## Q* = 95.049, df = 102, p-value = 0.6741
## 
## Model df: 2.   Total lags used: 104

So the book tells us that the ARIMA model could follow an ARIMA(p,d,0) if ACF is exp decay or sinusoidal (wave like) and PACF has no significant spikes past lag p. Our ACF and PACF plots fit both criteria for that on the SJ data. PACF has significant spikes up to 3, 6 and 9. Ill try all 3 and see how they look

sj.310 <- arima(ts.sj.data, order = c(3,1,0))
sj.610 <- arima(ts.sj.data, order = c(6,1,0))
sj.910 <- arima(ts.sj.data, order = c(9,1,0))

summary(sj.310)
## 
## Call:
## arima(x = ts.sj.data, order = c(3, 1, 0))
## 
## Coefficients:
##          ar1     ar2      ar3
##       0.1244  0.0878  -0.0043
## s.e.  0.0327  0.0328   0.0327
## 
## sigma^2 estimated as 180.9:  log likelihood = -3756.7,  aic = 7521.4
## 
## Training set error measures:
##                       ME     RMSE      MAE MPE MAPE     MASE         ACF1
## Training set 0.001461025 13.44178 8.094611 NaN  Inf 1.013317 0.0003701168
checkresiduals(sj.310)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,0)
## Q* = 101.61, df = 101, p-value = 0.4642
## 
## Model df: 3.   Total lags used: 104
summary(sj.610)
## 
## Call:
## arima(x = ts.sj.data, order = c(6, 1, 0))
## 
## Coefficients:
##          ar1     ar2      ar3     ar4     ar5     ar6
##       0.1197  0.0832  -0.0182  0.0517  0.0766  0.0024
## s.e.  0.0327  0.0328   0.0329  0.0328  0.0328  0.0326
## 
## sigma^2 estimated as 179.1:  log likelihood = -3752.13,  aic = 7518.26
## 
## Training set error measures:
##                       ME     RMSE      MAE MPE MAPE     MASE         ACF1
## Training set 0.001486091 13.37593 8.050992 NaN  Inf 1.007856 0.0001348531
checkresiduals(sj.610)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(6,1,0)
## Q* = 88.145, df = 98, p-value = 0.7521
## 
## Model df: 6.   Total lags used: 104
summary(sj.910)
## 
## Call:
## arima(x = ts.sj.data, order = c(9, 1, 0))
## 
## Coefficients:
##          ar1     ar2      ar3     ar4     ar5      ar6      ar7     ar8
##       0.1280  0.0819  -0.0239  0.0568  0.0835  -0.0058  -0.0023  0.0843
## s.e.  0.0325  0.0326   0.0327  0.0327  0.0326   0.0327   0.0326  0.0325
##           ar9
##       -0.1117
## s.e.   0.0324
## 
## sigma^2 estimated as 175.9:  log likelihood = -3743.83,  aic = 7507.65
## 
## Training set error measures:
##                       ME     RMSE      MAE MPE MAPE     MASE         ACF1
## Training set 0.001885691 13.25658 8.062695 NaN  Inf 1.009321 -0.004295591
checkresiduals(sj.910)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(9,1,0)
## Q* = 77.812, df = 95, p-value = 0.9001
## 
## Model df: 9.   Total lags used: 104

All 3 hand selected parameter models looked roughly the same compared with the auto.arima() model of order (1,1,1). However, the ARIMA(9,1,0) model had a slightly better AIC value than the others. Additionally, the resulting residual ACF plot had less significant lags. Based on that, I will use the ARIMA(9,1,0) model.

Lets move onto the IQ data.

aa.iq <- auto.arima(ts.iq.data)
summary(aa.iq)
## Series: ts.iq.data 
## ARIMA(0,1,2)(0,0,1)[52] 
## 
## Coefficients:
##           ma1      ma2    sma1
##       -0.2508  -0.2322  0.0568
## s.e.   0.0430   0.0449  0.0375
## 
## sigma^2 estimated as 52.12:  log likelihood=-1761.09
## AIC=3530.18   AICc=3530.26   BIC=3547.19
## 
## Training set error measures:
##                      ME     RMSE      MAE MPE MAPE      MASE         ACF1
## Training set 0.01292348 7.191928 3.841351 NaN  Inf 0.4069155 -0.002950598
checkresiduals(aa.iq)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,2)(0,0,1)[52]
## Q* = 93.91, df = 101, p-value = 0.6786
## 
## Model df: 3.   Total lags used: 104

Interestingly enough, the auto.arima() function chose a seasonal version of the model even though our nsdiffs() value for the IQ data was 0. Once again, our iq model seems to meet the criteria necessary for an ARIMA(0,d,q) with an sinusoidal ACF plot and significant lags at 3 and 5 in the PACF plot and none afterwards. (ACF and PACF plots above)

I will leave the seasonal component the same as the auto.arima() as I am not confident on how to address those parameters but will try changing the non seasonal components of our iq model.

iq.013  <- arima(ts.iq.data, order = c(0,1,3), seasonal = list(order = c(0,0,1)))
iq.015  <- arima(ts.iq.data, order = c(0,1,5), seasonal = list(order = c(0,0,1)))

summary(iq.013)
## 
## Call:
## arima(x = ts.iq.data, order = c(0, 1, 3), seasonal = list(order = c(0, 0, 1)))
## 
## Coefficients:
##           ma1      ma2     ma3    sma1
##       -0.2539  -0.2307  0.0153  0.0572
## s.e.   0.0441   0.0450  0.0527  0.0375
## 
## sigma^2 estimated as 51.82:  log likelihood = -1761.05,  aic = 3532.1
## 
## Training set error measures:
##                      ME     RMSE     MAE MPE MAPE    MASE          ACF1
## Training set 0.01229733 7.191378 3.83601 NaN  Inf 0.98267 -0.0001826112
checkresiduals(iq.013)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,3)(0,0,1)[52]
## Q* = 92.949, df = 100, p-value = 0.6784
## 
## Model df: 4.   Total lags used: 104
summary(iq.015)
## 
## Call:
## arima(x = ts.iq.data, order = c(0, 1, 5), seasonal = list(order = c(0, 0, 1)))
## 
## Coefficients:
##           ma1      ma2     ma3     ma4      ma5    sma1
##       -0.2948  -0.2269  0.0208  0.0390  -0.1683  0.0544
## s.e.   0.0452   0.0459  0.0528  0.0498   0.0495  0.0375
## 
## sigma^2 estimated as 50.62:  log likelihood = -1755.08,  aic = 3524.16
## 
## Training set error measures:
##                      ME     RMSE      MAE MPE MAPE      MASE       ACF1
## Training set 0.02222482 7.107777 3.814002 NaN  Inf 0.9770322 0.02084161
checkresiduals(iq.015)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,5)(0,0,1)[52]
## Q* = 84.997, df = 98, p-value = 0.8226
## 
## Model df: 6.   Total lags used: 104

Once again, the comparison parameters loos more or less the same on all 3 models. Nearly similar AIC values for all of them but the only exception is that the ARIMA(0,1,5)(0,0,1) model only had 3 significant lags in the residual ACF plot whereas the other two had 4. Therefore, I would use the ARIMA(0,1,5)(0,0,1) model.

I guess now we can try to do some forecasting with these models and see how they look?

fc.arima.sj <- forecast(sj.910)
autoplot(fc.arima.sj)
## Warning: `filter_()` is deprecated as of dplyr 0.7.0.
## Please use `filter()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

It looks like a naive model. I guess it works but i dont think its particularly useful

Moving onto the IQ forecast

fc.arima.iq <- forecast(iq.015)
autoplot(fc.arima.iq)

This forecast seems to have some attempt at seasonality but ultimately it still looks like a naive forecast.

Lets try some linear modeling. I will first make a copy of the separate data sets so that way I dont mess anything up

sj.copy <- sj.data
summary(sj.copy)
##   total_cases         city                year        weekofyear   
##  Min.   :  0.00   Length:936         Min.   :1990   Min.   : 1.00  
##  1st Qu.:  9.00   Class :character   1st Qu.:1994   1st Qu.:13.75  
##  Median : 19.00   Mode  :character   Median :1999   Median :26.50  
##  Mean   : 34.18                      Mean   :1999   Mean   :26.50  
##  3rd Qu.: 37.00                      3rd Qu.:2003   3rd Qu.:39.25  
##  Max.   :461.00                      Max.   :2008   Max.   :53.00  
##                                                                    
##  week_start_date         ndvi_ne            ndvi_nw            ndvi_se        
##  Min.   :1990-04-30   Min.   :-0.40625   Min.   :-0.45610   Min.   :-0.01553  
##  1st Qu.:1994-10-27   1st Qu.: 0.00450   1st Qu.: 0.01642   1st Qu.: 0.13928  
##  Median :1999-04-26   Median : 0.05770   Median : 0.06807   Median : 0.17719  
##  Mean   :1999-04-26   Mean   : 0.05792   Mean   : 0.06747   Mean   : 0.17766  
##  3rd Qu.:2003-10-23   3rd Qu.: 0.11110   3rd Qu.: 0.11520   3rd Qu.: 0.21256  
##  Max.   :2008-04-22   Max.   : 0.49340   Max.   : 0.43710   Max.   : 0.39313  
##                       NA's   :191        NA's   :49         NA's   :19        
##     ndvi_sw         precipitation_amt_mm reanalysis_air_temp_k
##  Min.   :-0.06346   Min.   :  0.00       Min.   :295.9        
##  1st Qu.: 0.12916   1st Qu.:  0.00       1st Qu.:298.2        
##  Median : 0.16597   Median : 20.80       Median :299.3        
##  Mean   : 0.16596   Mean   : 35.47       Mean   :299.2        
##  3rd Qu.: 0.20277   3rd Qu.: 52.18       3rd Qu.:300.1        
##  Max.   : 0.38142   Max.   :390.60       Max.   :302.2        
##  NA's   :19         NA's   :9            NA's   :6            
##  reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
##  Min.   :296.1         Min.   :289.6               Min.   :297.8            
##  1st Qu.:298.3         1st Qu.:293.8               1st Qu.:300.4            
##  Median :299.4         Median :295.5               Median :301.5            
##  Mean   :299.3         Mean   :295.1               Mean   :301.4            
##  3rd Qu.:300.2         3rd Qu.:296.4               3rd Qu.:302.4            
##  Max.   :302.2         Max.   :297.8               Max.   :304.3            
##  NA's   :6             NA's   :6                   NA's   :6                
##  reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
##  Min.   :292.6             Min.   :  0.00                 
##  1st Qu.:296.3             1st Qu.: 10.82                 
##  Median :297.5             Median : 21.30                 
##  Mean   :297.3             Mean   : 30.47                 
##  3rd Qu.:298.4             3rd Qu.: 37.00                 
##  Max.   :299.9             Max.   :570.50                 
##  NA's   :6                 NA's   :6                      
##  reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
##  Min.   :66.74                        Min.   :  0.00              
##  1st Qu.:76.25                        1st Qu.:  0.00              
##  Median :78.67                        Median : 20.80              
##  Mean   :78.57                        Mean   : 35.47              
##  3rd Qu.:80.96                        3rd Qu.: 52.18              
##  Max.   :87.58                        Max.   :390.60              
##  NA's   :6                            NA's   :9                   
##  reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
##  Min.   :11.72                         Min.   :1.357     Min.   :22.84     
##  1st Qu.:15.24                         1st Qu.:2.157     1st Qu.:25.84     
##  Median :16.85                         Median :2.457     Median :27.23     
##  Mean   :16.55                         Mean   :2.516     Mean   :27.01     
##  3rd Qu.:17.86                         3rd Qu.:2.800     3rd Qu.:28.19     
##  Max.   :19.44                         Max.   :4.429     Max.   :30.07     
##  NA's   :6                             NA's   :6         NA's   :6         
##  station_diur_temp_rng_c station_max_temp_c station_min_temp_c
##  Min.   :4.529           Min.   :26.70      Min.   :17.8      
##  1st Qu.:6.200           1st Qu.:30.60      1st Qu.:21.7      
##  Median :6.757           Median :31.70      Median :22.8      
##  Mean   :6.757           Mean   :31.61      Mean   :22.6      
##  3rd Qu.:7.286           3rd Qu.:32.80      3rd Qu.:23.9      
##  Max.   :9.914           Max.   :35.60      Max.   :25.6      
##  NA's   :6               NA's   :6          NA's   :6         
##  station_precip_mm
##  Min.   :  0.000  
##  1st Qu.:  6.825  
##  Median : 17.750  
##  Mean   : 26.785  
##  3rd Qu.: 35.450  
##  Max.   :305.900  
##  NA's   :6
iq.copy <- iq.data
summary(iq.copy)
##   total_cases          city                year        weekofyear   
##  Min.   :  0.000   Length:520         Min.   :2000   Min.   : 1.00  
##  1st Qu.:  1.000   Class :character   1st Qu.:2003   1st Qu.:13.75  
##  Median :  5.000   Mode  :character   Median :2005   Median :26.50  
##  Mean   :  7.565                      Mean   :2005   Mean   :26.50  
##  3rd Qu.:  9.000                      3rd Qu.:2007   3rd Qu.:39.25  
##  Max.   :116.000                      Max.   :2010   Max.   :53.00  
##                                                                     
##  week_start_date         ndvi_ne           ndvi_nw           ndvi_se       
##  Min.   :2000-07-01   Min.   :0.06173   Min.   :0.03586   Min.   :0.02988  
##  1st Qu.:2002-12-30   1st Qu.:0.20000   1st Qu.:0.17954   1st Qu.:0.19474  
##  Median :2005-06-28   Median :0.26364   Median :0.23297   Median :0.24980  
##  Mean   :2005-06-28   Mean   :0.26387   Mean   :0.23878   Mean   :0.25013  
##  3rd Qu.:2007-12-26   3rd Qu.:0.31997   3rd Qu.:0.29393   3rd Qu.:0.30230  
##  Max.   :2010-06-25   Max.   :0.50836   Max.   :0.45443   Max.   :0.53831  
##                       NA's   :3         NA's   :3         NA's   :3        
##     ndvi_sw        precipitation_amt_mm reanalysis_air_temp_k
##  Min.   :0.06418   Min.   :  0.00       Min.   :294.6        
##  1st Qu.:0.20413   1st Qu.: 39.10       1st Qu.:297.1        
##  Median :0.26214   Median : 60.47       Median :297.8        
##  Mean   :0.26678   Mean   : 64.25       Mean   :297.9        
##  3rd Qu.:0.32515   3rd Qu.: 85.76       3rd Qu.:298.6        
##  Max.   :0.54602   Max.   :210.83       Max.   :301.6        
##  NA's   :3         NA's   :4            NA's   :4            
##  reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
##  Min.   :294.9         Min.   :290.1               Min.   :300.0            
##  1st Qu.:298.2         1st Qu.:294.6               1st Qu.:305.2            
##  Median :299.1         Median :295.9               Median :307.1            
##  Mean   :299.1         Mean   :295.5               Mean   :307.1            
##  3rd Qu.:300.1         3rd Qu.:296.5               3rd Qu.:308.7            
##  Max.   :302.9         Max.   :298.4               Max.   :314.0            
##  NA's   :4             NA's   :4                   NA's   :4                
##  reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
##  Min.   :286.9             Min.   :  0.00                 
##  1st Qu.:292.0             1st Qu.: 24.07                 
##  Median :293.1             Median : 46.44                 
##  Mean   :292.9             Mean   : 57.61                 
##  3rd Qu.:294.2             3rd Qu.: 71.07                 
##  Max.   :296.0             Max.   :362.03                 
##  NA's   :4                 NA's   :4                      
##  reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
##  Min.   :57.79                        Min.   :  0.00              
##  1st Qu.:84.30                        1st Qu.: 39.10              
##  Median :90.92                        Median : 60.47              
##  Mean   :88.64                        Mean   : 64.25              
##  3rd Qu.:94.56                        3rd Qu.: 85.76              
##  Max.   :98.61                        Max.   :210.83              
##  NA's   :4                            NA's   :4                   
##  reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
##  Min.   :12.11                         Min.   : 3.714    Min.   :21.40     
##  1st Qu.:16.10                         1st Qu.: 7.371    1st Qu.:27.00     
##  Median :17.43                         Median : 8.964    Median :27.60     
##  Mean   :17.10                         Mean   : 9.207    Mean   :27.53     
##  3rd Qu.:18.18                         3rd Qu.:11.014    3rd Qu.:28.10     
##  Max.   :20.46                         Max.   :16.029    Max.   :30.80     
##  NA's   :4                             NA's   :4         NA's   :37        
##  station_diur_temp_rng_c station_max_temp_c station_min_temp_c
##  Min.   : 5.20           Min.   :30.1       Min.   :14.7      
##  1st Qu.: 9.50           1st Qu.:33.2       1st Qu.:20.6      
##  Median :10.62           Median :34.0       Median :21.3      
##  Mean   :10.57           Mean   :34.0       Mean   :21.2      
##  3rd Qu.:11.65           3rd Qu.:34.9       3rd Qu.:22.0      
##  Max.   :15.80           Max.   :42.2       Max.   :24.2      
##  NA's   :37              NA's   :14         NA's   :8         
##  station_precip_mm
##  Min.   :  0.00   
##  1st Qu.: 17.20   
##  Median : 45.30   
##  Mean   : 62.47   
##  3rd Qu.: 85.95   
##  Max.   :543.30   
##  NA's   :16

I think that maybe we could impute the median for all of these missing values?

for (i in 1:ncol(sj.copy)){
  for (j in 1:nrow(sj.copy)){
    if (is.na(sj.copy[j,i] == TRUE)){
      sj.copy[j,i] <- median(sj.copy[,i], na.rm = TRUE)
    } 
  }
}
summary(sj.copy)
##   total_cases         city                year        weekofyear   
##  Min.   :  0.00   Length:936         Min.   :1990   Min.   : 1.00  
##  1st Qu.:  9.00   Class :character   1st Qu.:1994   1st Qu.:13.75  
##  Median : 19.00   Mode  :character   Median :1999   Median :26.50  
##  Mean   : 34.18                      Mean   :1999   Mean   :26.50  
##  3rd Qu.: 37.00                      3rd Qu.:2003   3rd Qu.:39.25  
##  Max.   :461.00                      Max.   :2008   Max.   :53.00  
##  week_start_date         ndvi_ne            ndvi_nw            ndvi_se        
##  Min.   :1990-04-30   Min.   :-0.40625   Min.   :-0.45610   Min.   :-0.01553  
##  1st Qu.:1994-10-27   1st Qu.: 0.02123   1st Qu.: 0.02047   1st Qu.: 0.14012  
##  Median :1999-04-26   Median : 0.05770   Median : 0.06807   Median : 0.17719  
##  Mean   :1999-04-26   Mean   : 0.05788   Mean   : 0.06750   Mean   : 0.17765  
##  3rd Qu.:2003-10-23   3rd Qu.: 0.09885   3rd Qu.: 0.11270   3rd Qu.: 0.21160  
##  Max.   :2008-04-22   Max.   : 0.49340   Max.   : 0.43710   Max.   : 0.39313  
##     ndvi_sw         precipitation_amt_mm reanalysis_air_temp_k
##  Min.   :-0.06346   Min.   :  0.00       Min.   :295.9        
##  1st Qu.: 0.12993   1st Qu.:  0.00       1st Qu.:298.2        
##  Median : 0.16597   Median : 20.80       Median :299.3        
##  Mean   : 0.16596   Mean   : 35.33       Mean   :299.2        
##  3rd Qu.: 0.20226   3rd Qu.: 51.66       3rd Qu.:300.1        
##  Max.   : 0.38142   Max.   :390.60       Max.   :302.2        
##  reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
##  Min.   :296.1         Min.   :289.6               Min.   :297.8            
##  1st Qu.:298.3         1st Qu.:293.9               1st Qu.:300.4            
##  Median :299.4         Median :295.5               Median :301.5            
##  Mean   :299.3         Mean   :295.1               Mean   :301.4            
##  3rd Qu.:300.2         3rd Qu.:296.4               3rd Qu.:302.4            
##  Max.   :302.2         Max.   :297.8               Max.   :304.3            
##  reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
##  Min.   :292.6             Min.   :  0.00                 
##  1st Qu.:296.3             1st Qu.: 10.90                 
##  Median :297.5             Median : 21.30                 
##  Mean   :297.3             Mean   : 30.41                 
##  3rd Qu.:298.4             3rd Qu.: 36.92                 
##  Max.   :299.9             Max.   :570.50                 
##  reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
##  Min.   :66.74                        Min.   :  0.00              
##  1st Qu.:76.26                        1st Qu.:  0.00              
##  Median :78.67                        Median : 20.80              
##  Mean   :78.57                        Mean   : 35.33              
##  3rd Qu.:80.95                        3rd Qu.: 51.66              
##  Max.   :87.58                        Max.   :390.60              
##  reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
##  Min.   :11.72                         Min.   :1.357     Min.   :22.84     
##  1st Qu.:15.25                         1st Qu.:2.157     1st Qu.:25.84     
##  Median :16.85                         Median :2.457     Median :27.23     
##  Mean   :16.55                         Mean   :2.516     Mean   :27.01     
##  3rd Qu.:17.85                         3rd Qu.:2.789     3rd Qu.:28.18     
##  Max.   :19.44                         Max.   :4.429     Max.   :30.07     
##  station_diur_temp_rng_c station_max_temp_c station_min_temp_c
##  Min.   :4.529           Min.   :26.70      Min.   :17.8      
##  1st Qu.:6.211           1st Qu.:30.60      1st Qu.:21.7      
##  Median :6.757           Median :31.70      Median :22.8      
##  Mean   :6.757           Mean   :31.61      Mean   :22.6      
##  3rd Qu.:7.286           3rd Qu.:32.80      3rd Qu.:23.9      
##  Max.   :9.914           Max.   :35.60      Max.   :25.6      
##  station_precip_mm
##  Min.   :  0.00   
##  1st Qu.:  6.90   
##  Median : 17.75   
##  Mean   : 26.73   
##  3rd Qu.: 35.30   
##  Max.   :305.90
for (i in 1:ncol(iq.copy)){
  for (j in 1:nrow(iq.copy)){
    if (is.na(iq.copy[j,i] == TRUE)){
      iq.copy[j,i] <- median(iq.copy[,i], na.rm = TRUE)
    } 
  }
}
summary(iq.copy)
##   total_cases          city                year        weekofyear   
##  Min.   :  0.000   Length:520         Min.   :2000   Min.   : 1.00  
##  1st Qu.:  1.000   Class :character   1st Qu.:2003   1st Qu.:13.75  
##  Median :  5.000   Mode  :character   Median :2005   Median :26.50  
##  Mean   :  7.565                      Mean   :2005   Mean   :26.50  
##  3rd Qu.:  9.000                      3rd Qu.:2007   3rd Qu.:39.25  
##  Max.   :116.000                      Max.   :2010   Max.   :53.00  
##  week_start_date         ndvi_ne           ndvi_nw           ndvi_se       
##  Min.   :2000-07-01   Min.   :0.06173   Min.   :0.03586   Min.   :0.02988  
##  1st Qu.:2002-12-30   1st Qu.:0.20015   1st Qu.:0.18026   1st Qu.:0.19475  
##  Median :2005-06-28   Median :0.26364   Median :0.23297   Median :0.24980  
##  Mean   :2005-06-28   Mean   :0.26387   Mean   :0.23875   Mean   :0.25012  
##  3rd Qu.:2007-12-26   3rd Qu.:0.31962   3rd Qu.:0.29383   3rd Qu.:0.30213  
##  Max.   :2010-06-25   Max.   :0.50836   Max.   :0.45443   Max.   :0.53831  
##     ndvi_sw        precipitation_amt_mm reanalysis_air_temp_k
##  Min.   :0.06418   Min.   :  0.00       Min.   :294.6        
##  1st Qu.:0.20430   1st Qu.: 39.15       1st Qu.:297.1        
##  Median :0.26214   Median : 60.47       Median :297.8        
##  Mean   :0.26675   Mean   : 64.22       Mean   :297.9        
##  3rd Qu.:0.32488   3rd Qu.: 85.64       3rd Qu.:298.6        
##  Max.   :0.54602   Max.   :210.83       Max.   :301.6        
##  reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
##  Min.   :294.9         Min.   :290.1               Min.   :300.0            
##  1st Qu.:298.2         1st Qu.:294.6               1st Qu.:305.2            
##  Median :299.1         Median :295.9               Median :307.1            
##  Mean   :299.1         Mean   :295.5               Mean   :307.1            
##  3rd Qu.:300.1         3rd Qu.:296.5               3rd Qu.:308.7            
##  Max.   :302.9         Max.   :298.4               Max.   :314.0            
##  reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
##  Min.   :286.9             Min.   :  0.00                 
##  1st Qu.:292.0             1st Qu.: 24.21                 
##  Median :293.1             Median : 46.44                 
##  Mean   :292.9             Mean   : 57.52                 
##  3rd Qu.:294.1             3rd Qu.: 70.43                 
##  Max.   :296.0             Max.   :362.03                 
##  reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
##  Min.   :57.79                        Min.   :  0.00              
##  1st Qu.:84.39                        1st Qu.: 39.15              
##  Median :90.92                        Median : 60.47              
##  Mean   :88.66                        Mean   : 64.22              
##  3rd Qu.:94.55                        3rd Qu.: 85.64              
##  Max.   :98.61                        Max.   :210.83              
##  reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
##  Min.   :12.11                         Min.   : 3.714    Min.   :21.40     
##  1st Qu.:16.12                         1st Qu.: 7.371    1st Qu.:27.05     
##  Median :17.43                         Median : 8.964    Median :27.60     
##  Mean   :17.10                         Mean   : 9.205    Mean   :27.54     
##  3rd Qu.:18.18                         3rd Qu.:11.004    3rd Qu.:28.04     
##  Max.   :20.46                         Max.   :16.029    Max.   :30.80     
##  station_diur_temp_rng_c station_max_temp_c station_min_temp_c
##  Min.   : 5.20           Min.   :30.1       Min.   :14.7      
##  1st Qu.: 9.55           1st Qu.:33.2       1st Qu.:20.6      
##  Median :10.62           Median :34.0       Median :21.3      
##  Mean   :10.57           Mean   :34.0       Mean   :21.2      
##  3rd Qu.:11.60           3rd Qu.:34.8       3rd Qu.:22.0      
##  Max.   :15.80           Max.   :42.2       Max.   :24.2      
##  station_precip_mm
##  Min.   :  0.00   
##  1st Qu.: 18.00   
##  Median : 45.30   
##  Mean   : 61.94   
##  3rd Qu.: 83.35   
##  Max.   :543.30

This is probably the simplest imputation we can do but lets just start here

sj.lm.1 <- lm(total_cases ~ ndvi_ne + ndvi_nw + ndvi_se + ndvi_sw, data = sj.copy)
summary(sj.lm.1)
## 
## Call:
## lm(formula = total_cases ~ ndvi_ne + ndvi_nw + ndvi_se + ndvi_sw, 
##     data = sj.copy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37.72 -23.89 -15.20   2.26 426.80 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.9364     5.6764   5.978 3.21e-09 ***
## ndvi_ne      -5.8315    22.4915  -0.259    0.795    
## ndvi_nw      33.5166    23.9061   1.402    0.161    
## ndvi_se      -0.4145    52.6665  -0.008    0.994    
## ndvi_sw      -9.6840    53.5738  -0.181    0.857    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.42 on 931 degrees of freedom
## Multiple R-squared:  0.002673,   Adjusted R-squared:  -0.001612 
## F-statistic: 0.6238 on 4 and 931 DF,  p-value: 0.6456
sj.lm.2 <- lm(total_cases ~ precipitation_amt_mm + reanalysis_air_temp_k +reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + reanalysis_min_air_temp_k + reanalysis_precip_amt_kg_per_m2 + reanalysis_relative_humidity_percent + reanalysis_sat_precip_amt_mm +  reanalysis_specific_humidity_g_per_kg+ reanalysis_tdtr_k+ station_avg_temp_c +station_diur_temp_rng_c, data = sj.copy)
summary(sj.lm.2)
## 
## Call:
## lm(formula = total_cases ~ precipitation_amt_mm + reanalysis_air_temp_k + 
##     reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + 
##     reanalysis_min_air_temp_k + reanalysis_precip_amt_kg_per_m2 + 
##     reanalysis_relative_humidity_percent + reanalysis_sat_precip_amt_mm + 
##     reanalysis_specific_humidity_g_per_kg + reanalysis_tdtr_k + 
##     station_avg_temp_c + station_diur_temp_rng_c, data = sj.copy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -68.35 -23.60  -9.99   6.02 395.20 
## 
## Coefficients: (1 not defined because of singularities)
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                           7531.97796 6461.34651   1.166 0.244038
## precipitation_amt_mm                    -0.07123    0.04695  -1.517 0.129631
## reanalysis_air_temp_k                   34.43433   63.31903   0.544 0.586695
## reanalysis_avg_temp_k                  -44.77643   21.70573  -2.063 0.039403
## reanalysis_dew_point_temp_k            -39.68892   58.34138  -0.680 0.496493
## reanalysis_max_air_temp_k               18.73630    4.94775   3.787 0.000162
## reanalysis_min_air_temp_k                3.33871    5.35669   0.623 0.533256
## reanalysis_precip_amt_kg_per_m2          0.06011    0.06438   0.934 0.350705
## reanalysis_relative_humidity_percent     0.84440   12.75105   0.066 0.947216
## reanalysis_sat_precip_amt_mm                  NA         NA      NA       NA
## reanalysis_specific_humidity_g_per_kg   36.85390   22.14988   1.664 0.096483
## reanalysis_tdtr_k                      -17.89419    5.70138  -3.139 0.001752
## station_avg_temp_c                       0.35655    3.00517   0.119 0.905581
## station_diur_temp_rng_c                  5.01060    2.33911   2.142 0.032446
##                                          
## (Intercept)                              
## precipitation_amt_mm                     
## reanalysis_air_temp_k                    
## reanalysis_avg_temp_k                 *  
## reanalysis_dew_point_temp_k              
## reanalysis_max_air_temp_k             ***
## reanalysis_min_air_temp_k                
## reanalysis_precip_amt_kg_per_m2          
## reanalysis_relative_humidity_percent     
## reanalysis_sat_precip_amt_mm             
## reanalysis_specific_humidity_g_per_kg .  
## reanalysis_tdtr_k                     ** 
## station_avg_temp_c                       
## station_diur_temp_rng_c               *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.59 on 923 degrees of freedom
## Multiple R-squared:  0.08038,    Adjusted R-squared:  0.06842 
## F-statistic: 6.723 on 12 and 923 DF,  p-value: 1.102e-11
sj.lm.3 <- lm(total_cases ~ station_max_temp_c + station_min_temp_c + station_precip_mm, data = sj.copy)
summary(sj.lm.3)
## 
## Call:
## lm(formula = total_cases ~ station_max_temp_c + station_min_temp_c + 
##     station_precip_mm, data = sj.copy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52.18 -25.12 -11.78   6.53 408.59 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -158.95702   31.02149  -5.124 3.63e-07 ***
## station_max_temp_c    3.99980    1.30404   3.067  0.00222 ** 
## station_min_temp_c    2.86117    1.49202   1.918  0.05546 .  
## station_precip_mm     0.07639    0.05667   1.348  0.17798    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.36 on 932 degrees of freedom
## Multiple R-squared:  0.04232,    Adjusted R-squared:  0.03924 
## F-statistic: 13.73 on 3 and 932 DF,  p-value: 9.105e-09

No apparent trend and using season would be bad in a tslm() since my frequency is 52. But I will create factor variables for months since it seemed like there was seasonality present in the boxplots

sj.copy$month <- as.factor(month(sj.copy$week_start_date))
summary(sj.copy$month)
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 90 72 72 90 72 72 90 72 76 86 72 72
sj.lm.4 <- lm(total_cases ~ month, data = sj.copy)
summary(sj.lm.4)
## 
## Call:
## lm(formula = total_cases ~ month, data = sj.copy)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66.00 -19.36  -7.10   4.59 390.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   31.422      5.007   6.275 5.36e-10 ***
## month2       -10.325      7.511  -1.375  0.16957    
## month3       -16.825      7.511  -2.240  0.02532 *  
## month4       -21.400      7.081  -3.022  0.00258 ** 
## month5       -20.283      7.511  -2.701  0.00705 ** 
## month6       -15.103      7.511  -2.011  0.04464 *  
## month7        -4.067      7.081  -0.574  0.56592    
## month8        16.619      7.511   2.213  0.02716 *  
## month9        24.328      7.400   3.287  0.00105 ** 
## month10       39.578      7.163   5.525 4.28e-08 ***
## month11       30.397      7.511   4.047 5.62e-05 ***
## month12       10.258      7.511   1.366  0.17234    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.5 on 924 degrees of freedom
## Multiple R-squared:  0.1553, Adjusted R-squared:  0.1453 
## F-statistic: 15.44 on 11 and 924 DF,  p-value: < 2.2e-16

Not surprisingly, the months variable have the most explanatory power for the total cases. Maybe we can try an autoregressive term (like from week 1’s discussion).

Following what was written there

sj.copy$seq <- seq(1,936)
sj.copy$total_cases_t1 <- c(sj.copy$total_cases[1], sj.copy$total_cases[1:935])

sj.lm.ar <- lm(total_cases ~ total_cases_t1 + seq + month, data = sj.copy[1:800,])
summary(sj.lm.ar)
## 
## Call:
## lm(formula = total_cases ~ total_cases_t1 + seq + month, data = sj.copy[1:800, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.274  -5.554  -0.454   4.871  98.091 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.5297220  1.8496088   0.286   0.7746    
## total_cases_t1  0.9630156  0.0098321  97.946   <2e-16 ***
## seq            -0.0009883  0.0021186  -0.467   0.6410    
## month2         -1.5225277  2.3458012  -0.649   0.5165    
## month3         -1.1132167  2.3504160  -0.474   0.6359    
## month4         -0.3087953  2.2152379  -0.139   0.8892    
## month5          0.7440402  2.3166753   0.321   0.7482    
## month6          2.2634187  2.3126499   0.979   0.3280    
## month7          4.7167953  2.1781654   2.165   0.0307 *  
## month8          3.7758713  2.3045609   1.638   0.1017    
## month9          4.8694858  2.2880341   2.128   0.0336 *  
## month10         5.7773149  2.2643995   2.551   0.0109 *  
## month11         1.0882387  2.3659468   0.460   0.6457    
## month12        -5.8403595  2.3504852  -2.485   0.0132 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.53 on 786 degrees of freedom
## Multiple R-squared:  0.9383, Adjusted R-squared:  0.9373 
## F-statistic: 920.1 on 13 and 786 DF,  p-value: < 2.2e-16

By creating an autoregressive term with an offset of 1 and using that and the monthly variable we are able to explain nearly everything with the total cases.

Checking residuals on this

plot(sj.lm.ar)

The residuals plot tell us that there is heteroskedasticity present in our data. This was also apparent from the earlier boxplots as there were a few total_cases values magnitudes above any of the weekly means.

I have built the model on this whole data set but I want to try and split the sj data into a training and validation set and see where we can go from there

sj.ar.pred2 <- predict(sj.lm.ar, sj.copy[1:800,],level = 0.95, interval = "prediction")

sj.ar.pred <- predict(sj.lm.ar, sj.copy[801:936,], level = 0.95, interval = "prediction")
head(sj.ar.pred)
##           fit       lwr       upr
## 801 130.76258 103.88540 157.63977
## 802 112.46430  85.61015 139.31845
## 803  84.48067  57.67375 111.28759
## 804  75.81254  49.00858 102.61650
## 805  46.92109  20.11953  73.72264
## 806  58.47629  31.67490  85.27767
sj.ar.pred <- as.data.frame(sj.ar.pred)
sj.ar.pred$actual <- sj.copy$total_cases[801:936]
sj.ar.pred2 <- as.data.frame(sj.ar.pred2)
sj.ar.pred2$actual <- sj.copy$total_cases[1:800]
totalpred <- rbind(sj.ar.pred2, sj.ar.pred)

plot(sj.copy$seq, sj.copy$total_cases, col = 'blue', type = 'l')
lines(sj.copy$seq, totalpred$fit, col = 'red')
lines(sj.copy$seq,totalpred$lwr, lty=3, col='dark gray')
lines(sj.copy$seq,totalpred$upr, lty=3, col='dark gray')

We can see that the fitted values track the test set values relatively well compared with the ARIMA model from before. Lets try to do the same thing for our iq data.

First we begin with the other variables

iq.lm.1 <- lm(total_cases ~ ndvi_ne + ndvi_nw + ndvi_se + ndvi_sw, data = iq.copy)
summary(iq.lm.1)
## 
## Call:
## lm(formula = total_cases ~ ndvi_ne + ndvi_nw + ndvi_se + ndvi_sw, 
##     data = iq.copy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.142  -6.136  -2.722   1.622 107.964 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.086      1.742   4.643 4.37e-06 ***
## ndvi_ne       11.776     12.539   0.939   0.3481    
## ndvi_nw      -10.526     10.219  -1.030   0.3035    
## ndvi_se      -20.534      9.756  -2.105   0.0358 *  
## ndvi_sw       15.075     10.988   1.372   0.1707    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.74 on 515 degrees of freedom
## Multiple R-squared:  0.01252,    Adjusted R-squared:  0.004853 
## F-statistic: 1.633 on 4 and 515 DF,  p-value: 0.1646
iq.lm.2 <- lm(total_cases ~ precipitation_amt_mm + reanalysis_air_temp_k +reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + reanalysis_min_air_temp_k + reanalysis_precip_amt_kg_per_m2 + reanalysis_relative_humidity_percent + reanalysis_sat_precip_amt_mm +  reanalysis_specific_humidity_g_per_kg+ reanalysis_tdtr_k+ station_avg_temp_c +station_diur_temp_rng_c, data = iq.copy)
summary(iq.lm.2)
## 
## Call:
## lm(formula = total_cases ~ precipitation_amt_mm + reanalysis_air_temp_k + 
##     reanalysis_avg_temp_k + reanalysis_dew_point_temp_k + reanalysis_max_air_temp_k + 
##     reanalysis_min_air_temp_k + reanalysis_precip_amt_kg_per_m2 + 
##     reanalysis_relative_humidity_percent + reanalysis_sat_precip_amt_mm + 
##     reanalysis_specific_humidity_g_per_kg + reanalysis_tdtr_k + 
##     station_avg_temp_c + station_diur_temp_rng_c, data = iq.copy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.501  -5.278  -2.741   1.579 105.714 
## 
## Coefficients: (1 not defined because of singularities)
##                                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                           2237.31781 1700.68915   1.316    0.189  
## precipitation_amt_mm                    -0.00425    0.01507  -0.282    0.778  
## reanalysis_air_temp_k                    0.04476    4.97250   0.009    0.993  
## reanalysis_avg_temp_k                   -0.46683    2.12672  -0.220    0.826  
## reanalysis_dew_point_temp_k             -7.25981    6.65697  -1.091    0.276  
## reanalysis_max_air_temp_k               -0.39068    0.43609  -0.896    0.371  
## reanalysis_min_air_temp_k                0.12312    0.64784   0.190    0.849  
## reanalysis_precip_amt_kg_per_m2         -0.01303    0.01251  -1.041    0.298  
## reanalysis_relative_humidity_percent    -0.42937    1.06935  -0.402    0.688  
## reanalysis_sat_precip_amt_mm                  NA         NA      NA       NA  
## reanalysis_specific_humidity_g_per_kg   10.21622    5.74591   1.778    0.076 .
## reanalysis_tdtr_k                       -0.41877    0.73579  -0.569    0.570  
## station_avg_temp_c                      -0.20457    0.75968  -0.269    0.788  
## station_diur_temp_rng_c                 -0.03732    0.44636  -0.084    0.933  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.49 on 507 degrees of freedom
## Multiple R-squared:  0.07332,    Adjusted R-squared:  0.05139 
## F-statistic: 3.343 on 12 and 507 DF,  p-value: 0.0001058
iq.lm.3 <- lm(total_cases ~ station_max_temp_c + station_min_temp_c + station_precip_mm, data = iq.copy)
summary(iq.lm.3)
## 
## Call:
## lm(formula = total_cases ~ station_max_temp_c + station_min_temp_c + 
##     station_precip_mm, data = iq.copy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.479  -6.004  -3.018   1.955 108.432 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -39.684051  13.804523  -2.875  0.00421 ** 
## station_max_temp_c   0.409500   0.363137   1.128  0.25998    
## station_min_temp_c   1.563356   0.382145   4.091 4.98e-05 ***
## station_precip_mm    0.002974   0.007706   0.386  0.69975    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.59 on 516 degrees of freedom
## Multiple R-squared:  0.03866,    Adjusted R-squared:  0.03307 
## F-statistic: 6.917 on 3 and 516 DF,  p-value: 0.0001426

Similarly, we once again create some month factors as that seemed to be the most important variable in the sj data

iq.copy$month <- as.factor(month(iq.copy$week_start_date))
summary(iq.copy$month)
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 50 40 40 50 40 40 50 40 43 47 40 40
iq.lm.4 <- lm(total_cases ~ month, data = iq.copy)
summary(iq.lm.4)
## 
## Call:
## lm(formula = total_cases ~ month, data = iq.copy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.900  -5.700  -1.700   2.331 103.100 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.080      1.452   8.321 8.12e-16 ***
## month2         0.495      2.178   0.227 0.820280    
## month3        -4.830      2.178  -2.218 0.027002 *  
## month4        -6.380      2.053  -3.107 0.001993 ** 
## month5        -7.230      2.178  -3.320 0.000965 ***
## month6        -8.705      2.178  -3.997 7.36e-05 ***
## month7        -9.020      2.053  -4.393 1.36e-05 ***
## month8        -8.930      2.178  -4.101 4.80e-05 ***
## month9        -5.638      2.135  -2.641 0.008528 ** 
## month10       -2.335      2.086  -1.120 0.263372    
## month11       -2.255      2.178  -1.035 0.300933    
## month12        0.820      2.178   0.377 0.706671    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.27 on 508 degrees of freedom
## Multiple R-squared:  0.1099, Adjusted R-squared:  0.09067 
## F-statistic: 5.705 on 11 and 508 DF,  p-value: 1.061e-08

Month seems to not be as significant when compared with the sj model but it is still the most statistically significant factor nonetheless.

Moving onto AR model

iq.copy$seq <- seq(1,520)
iq.copy$total_cases_t1 <- c(iq.copy$total_cases[1], iq.copy$total_cases[1:519])

iq.lm.ar <- lm(total_cases ~ total_cases_t1 + seq + month, data = iq.copy[1:400,])
summary(iq.lm.ar)
## 
## Call:
## lm(formula = total_cases ~ total_cases_t1 + seq + month, data = iq.copy[1:400, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.314  -2.609  -0.702   1.390  66.258 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.281138   1.379281   2.379   0.0179 *  
## total_cases_t1  0.658746   0.038366  17.170   <2e-16 ***
## seq             0.007662   0.003340   2.294   0.0223 *  
## month2         -1.081819   1.767804  -0.612   0.5409    
## month3         -3.207606   1.820379  -1.762   0.0789 .  
## month4         -2.938256   1.730741  -1.698   0.0904 .  
## month5         -2.587347   1.844905  -1.402   0.1616    
## month6         -4.112597   1.848442  -2.225   0.0267 *  
## month7         -3.382219   1.687260  -2.005   0.0457 *  
## month8         -3.928849   1.788266  -2.197   0.0286 *  
## month9         -2.757111   1.753844  -1.572   0.1168    
## month10        -2.227225   1.694100  -1.315   0.1894    
## month11        -1.791204   1.767840  -1.013   0.3116    
## month12        -0.166943   1.779648  -0.094   0.9253    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.452 on 386 degrees of freedom
## Multiple R-squared:  0.5302, Adjusted R-squared:  0.5144 
## F-statistic: 33.51 on 13 and 386 DF,  p-value: < 2.2e-16

Where can we go from here? We could probably implement backwards selection. I will use the base variables as building an offset for each of the variables would be time consuming and I used the base values to test for significance in the first place

iq.lm.working <- lm(total_cases ~ndvi_se + ndvi_sw + reanalysis_max_air_temp_k +  reanalysis_specific_humidity_g_per_kg + month, data = iq.copy)
summary(iq.lm.working)
## 
## Call:
## lm(formula = total_cases ~ ndvi_se + ndvi_sw + reanalysis_max_air_temp_k + 
##     reanalysis_specific_humidity_g_per_kg + month, data = iq.copy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.705  -5.366  -1.836   2.358 101.443 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           176.7247    78.1727   2.261 0.024204 *  
## ndvi_se                               -17.3746     8.5499  -2.032 0.042664 *  
## ndvi_sw                                13.4404     7.5100   1.790 0.074108 .  
## reanalysis_max_air_temp_k              -0.5871     0.2481  -2.367 0.018324 *  
## reanalysis_specific_humidity_g_per_kg   0.9231     0.4382   2.107 0.035651 *  
## month2                                 -0.1511     2.1622  -0.070 0.944324    
## month3                                 -6.0482     2.1736  -2.783 0.005596 ** 
## month4                                 -7.6395     2.0583  -3.712 0.000229 ***
## month5                                 -7.5887     2.1761  -3.487 0.000530 ***
## month6                                 -8.4462     2.2429  -3.766 0.000186 ***
## month7                                 -7.3101     2.1991  -3.324 0.000952 ***
## month8                                 -5.7133     2.3611  -2.420 0.015885 *  
## month9                                 -2.6291     2.2816  -1.152 0.249730    
## month10                                -0.0535     2.1968  -0.024 0.980582    
## month11                                -1.7738     2.2432  -0.791 0.429455    
## month12                                 0.8517     2.1820   0.390 0.696467    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.12 on 504 degrees of freedom
## Multiple R-squared:  0.1414, Adjusted R-squared:  0.1159 
## F-statistic: 5.535 on 15 and 504 DF,  p-value: 1.74e-10

I will not show the history of me checking backwards selection but we note that the final model with statistically significant variables and the highest adjusted R squared included the southeast and southwest measurements of vegetation index, max air temperature, mean specific humidity and the months.

We can now offset these and see if it helps our AR model analysis

iq.copy$ndvi_se <- c(iq.copy$ndvi_se[1], iq.copy$ndvi_se[1:519])
iq.copy$ndvi_sw <- c(iq.copy$ndvi_sw[1], iq.copy$ndvi_sw[1:519])
iq.copy$reanalysis_max_air_temp_k <- c(iq.copy$reanalysis_max_air_temp_k[1], iq.copy$reanalysis_max_air_temp_k[1:519])
iq.copy$reanalysis_specific_humidity_g_per_kg <- c(iq.copy$reanalysis_specific_humidity_g_per_kg[1], iq.copy$reanalysis_specific_humidity_g_per_kg[1:519])

head(iq.copy)
##     total_cases city year weekofyear week_start_date   ndvi_ne   ndvi_nw
## 937           0   iq 2000         26      2000-07-01 0.1928857 0.1322571
## 938           0   iq 2000         27      2000-07-08 0.2168333 0.2761000
## 939           0   iq 2000         28      2000-07-15 0.1767571 0.1731286
## 940           0   iq 2000         29      2000-07-22 0.2277286 0.1454286
## 941           0   iq 2000         30      2000-07-29 0.3286429 0.3221286
## 942           0   iq 2000         31      2000-08-05 0.2055286 0.1907571
##       ndvi_se   ndvi_sw precipitation_amt_mm reanalysis_air_temp_k
## 937 0.3408857 0.2472000                25.41              296.7400
## 938 0.3408857 0.2472000                60.61              296.6343
## 939 0.2894571 0.2416571                55.52              296.4157
## 940 0.2041143 0.1280143                 5.60              295.3571
## 941 0.2542000 0.2003143                62.76              296.4329
## 942 0.2543714 0.3610429                16.24              297.1914
##     reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
## 937              298.4500                    295.1843                     307.3
## 938              298.4286                    295.3586                     307.3
## 939              297.3929                    295.6229                     306.6
## 940              296.2286                    292.7971                     304.5
## 941              297.6357                    293.9571                     303.6
## 942              298.2857                    291.7257                     307.0
##     reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
## 937                     293.1                           43.19
## 938                     291.1                           46.00
## 939                     292.6                           64.77
## 940                     288.6                           23.96
## 941                     291.5                           31.80
## 942                     288.5                            1.00
##     reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
## 937                             92.41857                        25.41
## 938                             93.58143                        60.61
## 939                             95.84857                        55.52
## 940                             87.23429                         5.60
## 941                             88.16143                        62.76
## 942                             74.72857                        16.24
##     reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
## 937                              16.65143          8.928571           26.40000
## 938                              16.65143         10.314286           26.90000
## 939                              16.86286          7.385714           26.80000
## 940                              17.12000          9.114286           25.76667
## 941                              14.43143          9.500000           26.60000
## 942                              15.44429         13.771429           25.34000
##     station_diur_temp_rng_c station_max_temp_c station_min_temp_c
## 937                10.77500               32.5               20.7
## 938                11.56667               34.0               20.8
## 939                11.46667               33.0               20.7
## 940                10.53333               31.5               14.7
## 941                11.48000               33.3               19.1
## 942                10.94000               32.0               17.0
##     station_precip_mm month seq total_cases_t1
## 937               3.0     7   1              0
## 938              55.6     7   2              0
## 939              38.1     7   3              0
## 940              30.0     7   4              0
## 941               4.0     7   5              0
## 942              11.5     8   6              0

Now that our variables of interest are offset, lets try again

iq.lm.ar.2 <- lm(total_cases ~ seq + total_cases_t1 + month + ndvi_se + ndvi_sw + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg, data = iq.copy[1:400,])
summary(iq.lm.ar.2)
## 
## Call:
## lm(formula = total_cases ~ seq + total_cases_t1 + month + ndvi_se + 
##     ndvi_sw + reanalysis_max_air_temp_k + reanalysis_specific_humidity_g_per_kg, 
##     data = iq.copy[1:400, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.635  -2.739  -0.715   1.368  65.224 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           -6.119436  68.781085  -0.089   0.9292    
## seq                                    0.006786   0.003468   1.956   0.0511 .  
## total_cases_t1                         0.653831   0.038725  16.884   <2e-16 ***
## month2                                -1.447127   1.788448  -0.809   0.4189    
## month3                                -3.773361   1.831697  -2.060   0.0401 *  
## month4                                -3.575902   1.762184  -2.029   0.0431 *  
## month5                                -3.034994   1.892722  -1.604   0.1096    
## month6                                -4.155902   1.923150  -2.161   0.0313 *  
## month7                                -2.665408   1.897893  -1.404   0.1610    
## month8                                -2.662360   2.039225  -1.306   0.1925    
## month9                                -1.790333   1.924959  -0.930   0.3529    
## month10                               -1.547048   1.830078  -0.845   0.3984    
## month11                               -1.700881   1.828886  -0.930   0.3530    
## month12                               -0.232688   1.830675  -0.127   0.8989    
## ndvi_se                               -9.418190   7.073434  -1.331   0.1838    
## ndvi_sw                               -2.747338   6.141100  -0.447   0.6549    
## reanalysis_max_air_temp_k              0.017543   0.216560   0.081   0.9355    
## reanalysis_specific_humidity_g_per_kg  0.423859   0.397664   1.066   0.2872    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.433 on 382 degrees of freedom
## Multiple R-squared:  0.5374, Adjusted R-squared:  0.5168 
## F-statistic:  26.1 on 17 and 382 DF,  p-value: < 2.2e-16

Our adjusted R squared really didn’t change all to much. I wonder if this has anything to do with how auto.arima() chose a seasonal component for the iq data.

It would then make sense that maybe each observation is not so dependent on the last, but instead on seasonality (hence low adjusted R squared on model which regresses on its most recent past value)

plot(iq.lm.ar.2)

So we can see from the residual plot that there are a few very influential points. 1169 specifically drags the scale location line towards it and lies outside cooks distance on the residual vs leverage plot.

However, this is not econometric and I don’t imagine we want to remove data points to make our model fit better; doing so seems counter-intuitive to the entire idea of trying to predict the results.

I will move on and instead try my luck at predicting the remaining portion of IQ data that was withheld from the AR model.

iq.ar.pred2 <- predict(iq.lm.ar.2, iq.copy[1:400,], level = 0.95, interval = "prediction")

iq.ar.pred <- predict(iq.lm.ar.2, iq.copy[401:520,], level = 0.95, interval = "prediction")
head(iq.ar.pred)
##            fit        lwr      upr
## 1337 18.347361   3.346055 33.34867
## 1338  9.898052  -5.081817 24.87792
## 1339  5.641400  -9.396074 20.67887
## 1340  5.639740  -9.343395 20.62287
## 1341  7.331007  -7.587179 22.24919
## 1342  4.291772 -10.642346 19.22589
iq.ar.pred <- as.data.frame(iq.ar.pred)
iq.ar.pred$actual <- iq.copy$total_cases[401:520]
iq.ar.pred2 <- as.data.frame(iq.ar.pred2)
iq.ar.pred2$actual <- iq.copy$total_cases[1:400]
iq.totalpred <- rbind(iq.ar.pred2, iq.ar.pred)

plot(iq.copy$seq, iq.copy$total_cases, col = 'blue', type = 'l')
lines(iq.copy$seq, iq.totalpred$fit, col = 'red')
lines(iq.copy$seq,iq.totalpred$lwr, lty=3, col='dark gray')
lines(iq.copy$seq,iq.totalpred$upr, lty=3, col='dark gray')

So, we can see here that our predictions did not do as well of a job as above. This could have been expected given the lower R squared value found in the model; there still are unexplained patterns which our model could not pick up on.

However, both of these autoregressive models have done a much better job than the ARIMA model which was essentially just a naive forecast.

We can see just how wide the confidence intervals are at the 95% level but it seems like the test submission only cares about the point forecasts so for our purposes it may still be okay.

Finally, we will try the neural net model. I am not sure what to expect from this. On one hand, the more complex relationships can be modeled by the neural net could be of benefit to us. But on the other hand, the last univariate time series model (ARIMA) did not work so well and the neural net is essentially an ARIMA model if there are no nodes.

ts.sj.copy <- ts(sj.copy$total_cases, frequency = 52, start = c(1990, 18))
sj.nnetar <- nnetar(ts.sj.copy)
autoplot(sj.nnetar)
## Warning: Removed 52 row(s) containing missing values (geom_path).

For whatever reason the autoplot function isnt working but the forecast values look relatively decent. They have done a much better job already than the ARIMA model

Lets move onto the iq data

ts.iq.copy <- ts(iq.copy$total_cases, frequency = 52, start = c(2000, 26))
iq.nnetar <- nnetar(ts.iq.copy)
autoplot(iq.nnetar)
## Warning: Removed 52 row(s) containing missing values (geom_path).

The point forecasts for the iq data look very similar to a naive model with nearly the exact same values across. It seems like the neural net model had trouble capturing any sort of patterns from our data

checkresiduals(sj.nnetar)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

checkresiduals(iq.nnetar)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

Residuals look like white noise so the neural net model was able to squeeze most of the information out of our two data sets. However, the point forecasts still left much to be desired.

Interesting, just to check. I removed the start date from the time series on our data and what occurred was a completely different result for the forecasts of the neural net models. Before (with start date), the point forecasts didnt seem to stray too far from the mean of the total_cases data. However, when the start date was removed the point forecasts reached nearly as high as the max value.

Obviously, since we have the information on the start date it would be wise to include it for the final model but I thought the interaction was interesting nonetheless. It makes sense I guess since when you add a specific start date you are capturing the location of your data in a cycle.

Only thing left to do is to test my forecasts. Need to setup test set.

FROM HERE ON OUT WE ARE WORKING ON TEST SET/PREDICTIONS

dengue.test.set <- cbind(submission_format, dengue_features_test)
summary(dengue.test.set)
##      city                year        weekofyear     total_cases
##  Length:416         Min.   :2008   Min.   : 1.00   Min.   :0   
##  Class :character   1st Qu.:2010   1st Qu.:13.75   1st Qu.:0   
##  Mode  :character   Median :2011   Median :26.00   Median :0   
##                     Mean   :2011   Mean   :26.44   Mean   :0   
##                     3rd Qu.:2012   3rd Qu.:39.00   3rd Qu.:0   
##                     Max.   :2013   Max.   :53.00   Max.   :0   
##                                                                
##      city                year        weekofyear    week_start_date   
##  Length:416         Min.   :2008   Min.   : 1.00   Length:416        
##  Class :character   1st Qu.:2010   1st Qu.:13.75   Class :character  
##  Mode  :character   Median :2011   Median :26.00   Mode  :character  
##                     Mean   :2011   Mean   :26.44                     
##                     3rd Qu.:2012   3rd Qu.:39.00                     
##                     Max.   :2013   Max.   :53.00                     
##                                                                      
##     ndvi_ne           ndvi_nw            ndvi_se          ndvi_sw        
##  Min.   :-0.4634   Min.   :-0.21180   Min.   :0.0062   Min.   :-0.01467  
##  1st Qu.:-0.0015   1st Qu.: 0.01597   1st Qu.:0.1487   1st Qu.: 0.13408  
##  Median : 0.1101   Median : 0.08870   Median :0.2042   Median : 0.18647  
##  Mean   : 0.1260   Mean   : 0.12680   Mean   :0.2077   Mean   : 0.20172  
##  3rd Qu.: 0.2633   3rd Qu.: 0.24240   3rd Qu.:0.2549   3rd Qu.: 0.25324  
##  Max.   : 0.5004   Max.   : 0.64900   Max.   :0.4530   Max.   : 0.52904  
##  NA's   :43        NA's   :11         NA's   :1        NA's   :1         
##  precipitation_amt_mm reanalysis_air_temp_k reanalysis_avg_temp_k
##  Min.   :  0.000      Min.   :294.6         Min.   :295.2        
##  1st Qu.:  8.175      1st Qu.:297.8         1st Qu.:298.3        
##  Median : 31.455      Median :298.5         Median :299.3        
##  Mean   : 38.354      Mean   :298.8         Mean   :299.4        
##  3rd Qu.: 57.773      3rd Qu.:300.2         3rd Qu.:300.5        
##  Max.   :169.340      Max.   :301.9         Max.   :303.3        
##  NA's   :2            NA's   :2             NA's   :2            
##  reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
##  Min.   :290.8               Min.   :298.2            
##  1st Qu.:294.3               1st Qu.:301.4            
##  Median :295.8               Median :302.8            
##  Mean   :295.4               Mean   :303.6            
##  3rd Qu.:296.6               3rd Qu.:305.8            
##  Max.   :297.8               Max.   :314.1            
##  NA's   :2                   NA's   :2                
##  reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
##  Min.   :286.2             Min.   :  0.00                 
##  1st Qu.:293.5             1st Qu.:  9.43                 
##  Median :296.3             Median : 25.85                 
##  Mean   :295.7             Mean   : 42.17                 
##  3rd Qu.:298.3             3rd Qu.: 56.48                 
##  Max.   :299.7             Max.   :301.40                 
##  NA's   :2                 NA's   :2                      
##  reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
##  Min.   :64.92                        Min.   :  0.000             
##  1st Qu.:77.40                        1st Qu.:  8.175             
##  Median :80.33                        Median : 31.455             
##  Mean   :82.50                        Mean   : 38.354             
##  3rd Qu.:88.33                        3rd Qu.: 57.773             
##  Max.   :97.98                        Max.   :169.340             
##  NA's   :2                            NA's   :2                   
##  reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
##  Min.   :12.54                         Min.   : 1.486    Min.   :24.16     
##  1st Qu.:15.79                         1st Qu.: 2.446    1st Qu.:26.51     
##  Median :17.34                         Median : 2.914    Median :27.48     
##  Mean   :16.93                         Mean   : 5.125    Mean   :27.37     
##  3rd Qu.:18.17                         3rd Qu.: 8.171    3rd Qu.:28.32     
##  Max.   :19.60                         Max.   :14.486    Max.   :30.27     
##  NA's   :2                             NA's   :2         NA's   :12        
##  station_diur_temp_rng_c station_max_temp_c station_min_temp_c
##  Min.   : 4.043          Min.   :27.20      Min.   :14.20     
##  1st Qu.: 5.929          1st Qu.:31.10      1st Qu.:21.20     
##  Median : 6.643          Median :32.80      Median :22.20     
##  Mean   : 7.811          Mean   :32.53      Mean   :22.37     
##  3rd Qu.: 9.812          3rd Qu.:33.90      3rd Qu.:23.30     
##  Max.   :14.725          Max.   :38.40      Max.   :26.70     
##  NA's   :12              NA's   :3          NA's   :9         
##  station_precip_mm
##  Min.   :  0.00   
##  1st Qu.:  9.10   
##  Median : 23.60   
##  Mean   : 34.28   
##  3rd Qu.: 47.75   
##  Max.   :212.00   
##  NA's   :5
dengue.test.set$weekofyear = NULL
dengue.test.set$city = NULL
dengue.test.set$year = NULL

Test set and the other variables are merged. Lets split into sj and iq and then we can format from there

sj.test <- dengue.test.set[1:260,]
iq.test <- dengue.test.set[261:416, ]

imputing median data for linear model forecasts.

for (i in 1:ncol(sj.test)){
  for (j in 1:nrow(sj.test)){
    if (is.na(sj.test[j,i] == TRUE)){
      sj.test[j,i] <- median(sj.test[,i], na.rm = TRUE)
    } 
  }
}
summary(sj.test)
##   total_cases     city                year        weekofyear   
##  Min.   :0    Length:260         Min.   :2008   Min.   : 1.00  
##  1st Qu.:0    Class :character   1st Qu.:2009   1st Qu.:13.75  
##  Median :0    Mode  :character   Median :2010   Median :26.50  
##  Mean   :0                       Mean   :2010   Mean   :26.50  
##  3rd Qu.:0                       3rd Qu.:2012   3rd Qu.:39.25  
##  Max.   :0                       Max.   :2013   Max.   :53.00  
##  week_start_date       ndvi_ne            ndvi_nw             ndvi_se      
##  Length:260         Min.   :-0.46340   Min.   :-0.211800   Min.   :0.0062  
##  Class :character   1st Qu.:-0.03006   1st Qu.:-0.005313   1st Qu.:0.1319  
##  Mode  :character   Median : 0.01498   Median : 0.032140   Median :0.1694  
##                     Mean   : 0.02318   Mean   : 0.036535   Mean   :0.1770  
##                     3rd Qu.: 0.06828   3rd Qu.: 0.076275   3rd Qu.:0.2181  
##                     Max.   : 0.50040   Max.   : 0.649000   Max.   :0.3854  
##     ndvi_sw         precipitation_amt_mm reanalysis_air_temp_k
##  Min.   :-0.01467   Min.   :  0.000      Min.   :296.7        
##  1st Qu.: 0.11649   1st Qu.:  2.127      1st Qu.:298.3        
##  Median : 0.14802   Median : 13.975      Median :299.8        
##  Mean   : 0.15321   Mean   : 26.425      Mean   :299.5        
##  3rd Qu.: 0.19128   3rd Qu.: 41.810      3rd Qu.:300.6        
##  Max.   : 0.31813   Max.   :169.340      Max.   :301.5        
##  reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
##  Min.   :296.8         Min.   :290.8               Min.   :298.2            
##  1st Qu.:298.4         1st Qu.:294.1               1st Qu.:300.4            
##  Median :299.8         Median :295.7               Median :301.9            
##  Mean   :299.5         Mean   :295.3               Mean   :301.6            
##  3rd Qu.:300.7         3rd Qu.:296.8               3rd Qu.:302.6            
##  Max.   :301.5         Max.   :297.8               Max.   :304.1            
##  reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
##  Min.   :293.8             Min.   :  0.000                
##  1st Qu.:296.6             1st Qu.:  6.763                
##  Median :297.7             Median : 15.245                
##  Mean   :297.6             Mean   : 23.701                
##  3rd Qu.:298.8             3rd Qu.: 29.000                
##  Max.   :299.7             Max.   :301.400                
##  reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
##  Min.   :64.92                        Min.   :  0.000             
##  1st Qu.:76.06                        1st Qu.:  2.127             
##  Median :78.38                        Median : 13.975             
##  Mean   :78.20                        Mean   : 26.425             
##  3rd Qu.:80.42                        3rd Qu.: 41.810             
##  Max.   :86.78                        Max.   :169.340             
##  reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
##  Min.   :12.54                         Min.   :1.486     Min.   :24.16     
##  1st Qu.:15.38                         1st Qu.:2.229     1st Qu.:26.07     
##  Median :17.07                         Median :2.543     Median :27.42     
##  Mean   :16.76                         Mean   :2.587     Mean   :27.27     
##  3rd Qu.:18.18                         3rd Qu.:2.857     3rd Qu.:28.51     
##  Max.   :19.34                         Max.   :4.429     Max.   :30.27     
##  station_diur_temp_rng_c station_max_temp_c station_min_temp_c
##  Min.   :4.043           Min.   :27.20      Min.   :20.00     
##  1st Qu.:5.700           1st Qu.:30.60      1st Qu.:21.70     
##  Median :6.179           Median :31.70      Median :23.30     
##  Mean   :6.153           Mean   :31.68      Mean   :23.11     
##  3rd Qu.:6.586           3rd Qu.:32.80      3rd Qu.:24.40     
##  Max.   :8.400           Max.   :35.00      Max.   :26.70     
##  station_precip_mm
##  Min.   :  0.00   
##  1st Qu.:  7.35   
##  Median : 22.35   
##  Mean   : 34.12   
##  3rd Qu.: 48.25   
##  Max.   :207.70
for (i in 1:ncol(iq.test)){
  for (j in 1:nrow(iq.test)){
    if (is.na(iq.test[j,i] == TRUE)){
      iq.test[j,i] <- median(iq.test[,i], na.rm = TRUE)
    } 
  }
}
summary(iq.test)
##   total_cases     city                year        weekofyear   
##  Min.   :0    Length:156         Min.   :2010   Min.   : 1.00  
##  1st Qu.:0    Class :character   1st Qu.:2011   1st Qu.:13.75  
##  Median :0    Mode  :character   Median :2012   Median :26.00  
##  Mean   :0                       Mean   :2012   Mean   :26.33  
##  3rd Qu.:0                       3rd Qu.:2012   3rd Qu.:39.00  
##  Max.   :0                       Max.   :2013   Max.   :52.00  
##  week_start_date       ndvi_ne           ndvi_nw           ndvi_se       
##  Length:156         Min.   :0.08929   Min.   :0.06321   Min.   :0.09826  
##  Class :character   1st Qu.:0.21496   1st Qu.:0.22244   1st Qu.:0.21198  
##  Mode  :character   Median :0.26523   Median :0.26946   Median :0.25316  
##                     Mean   :0.26689   Mean   :0.27057   Mean   :0.25858  
##                     3rd Qu.:0.31924   3rd Qu.:0.32456   3rd Qu.:0.30164  
##                     Max.   :0.42999   Max.   :0.46480   Max.   :0.45304  
##     ndvi_sw        precipitation_amt_mm reanalysis_air_temp_k
##  Min.   :0.08196   Min.   :  2.28       Min.   :294.6        
##  1st Qu.:0.21750   1st Qu.: 36.95       1st Qu.:297.1        
##  Median :0.28153   Median : 51.29       Median :297.8        
##  Mean   :0.28223   Mean   : 57.92       Mean   :297.8        
##  3rd Qu.:0.34724   3rd Qu.: 72.81       3rd Qu.:298.3        
##  Max.   :0.52904   Max.   :152.32       Max.   :301.9        
##  reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
##  Min.   :295.2         Min.   :292.0               Min.   :302.8            
##  1st Qu.:298.2         1st Qu.:294.8               1st Qu.:305.4            
##  Median :299.0         Median :295.9               Median :306.7            
##  Mean   :299.0         Mean   :295.6               Mean   :307.0            
##  3rd Qu.:299.8         3rd Qu.:296.5               3rd Qu.:308.5            
##  Max.   :303.3         Max.   :297.7               Max.   :314.1            
##  reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
##  Min.   :286.2             Min.   :  2.60                 
##  1st Qu.:291.9             1st Qu.: 35.62                 
##  Median :292.9             Median : 59.30                 
##  Mean   :292.7             Mean   : 72.61                 
##  3rd Qu.:293.8             3rd Qu.: 98.97                 
##  Max.   :296.0             Max.   :280.42                 
##  reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
##  Min.   :66.31                        Min.   :  2.28              
##  1st Qu.:86.32                        1st Qu.: 36.95              
##  Median :91.41                        Median : 51.29              
##  Mean   :89.61                        Mean   : 57.92              
##  3rd Qu.:94.74                        3rd Qu.: 72.81              
##  Max.   :97.98                        Max.   :152.32              
##  reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
##  Min.   :13.74                         Min.   : 4.800    Min.   :24.84     
##  1st Qu.:16.33                         1st Qu.: 7.689    1st Qu.:27.05     
##  Median :17.55                         Median : 9.371    Median :27.51     
##  Mean   :17.21                         Mean   : 9.321    Mean   :27.54     
##  3rd Qu.:18.17                         3rd Qu.:10.764    3rd Qu.:28.01     
##  Max.   :19.60                         Max.   :14.486    Max.   :29.13     
##  station_diur_temp_rng_c station_max_temp_c station_min_temp_c
##  Min.   : 6.45           Min.   :29.60      Min.   :14.20     
##  1st Qu.: 9.70           1st Qu.:33.20      1st Qu.:20.60     
##  Median :10.73           Median :34.00      Median :21.20     
##  Mean   :10.74           Mean   :33.96      Mean   :21.09     
##  3rd Qu.:11.63           3rd Qu.:34.83      3rd Qu.:21.85     
##  Max.   :14.72           Max.   :38.40      Max.   :23.20     
##  station_precip_mm
##  Min.   :  0.00   
##  1st Qu.: 11.72   
##  Median : 27.20   
##  Mean   : 34.25   
##  3rd Qu.: 43.62   
##  Max.   :212.00

setting date and creating month factor in each

sj.test$week_start_date <- as.Date(sj.test$week_start_date, format=c('%Y-%m-%d'))
sj.test$month <- as.factor(month(sj.test$week_start_date))
summary(sj.test)
##   total_cases     city                year        weekofyear   
##  Min.   :0    Length:260         Min.   :2008   Min.   : 1.00  
##  1st Qu.:0    Class :character   1st Qu.:2009   1st Qu.:13.75  
##  Median :0    Mode  :character   Median :2010   Median :26.50  
##  Mean   :0                       Mean   :2010   Mean   :26.50  
##  3rd Qu.:0                       3rd Qu.:2012   3rd Qu.:39.25  
##  Max.   :0                       Max.   :2013   Max.   :53.00  
##                                                                
##  week_start_date         ndvi_ne            ndvi_nw             ndvi_se      
##  Min.   :2008-04-29   Min.   :-0.46340   Min.   :-0.211800   Min.   :0.0062  
##  1st Qu.:2009-07-28   1st Qu.:-0.03006   1st Qu.:-0.005313   1st Qu.:0.1319  
##  Median :2010-10-25   Median : 0.01498   Median : 0.032140   Median :0.1694  
##  Mean   :2010-10-25   Mean   : 0.02318   Mean   : 0.036535   Mean   :0.1770  
##  3rd Qu.:2012-01-23   3rd Qu.: 0.06828   3rd Qu.: 0.076275   3rd Qu.:0.2181  
##  Max.   :2013-04-23   Max.   : 0.50040   Max.   : 0.649000   Max.   :0.3854  
##                                                                              
##     ndvi_sw         precipitation_amt_mm reanalysis_air_temp_k
##  Min.   :-0.01467   Min.   :  0.000      Min.   :296.7        
##  1st Qu.: 0.11649   1st Qu.:  2.127      1st Qu.:298.3        
##  Median : 0.14802   Median : 13.975      Median :299.8        
##  Mean   : 0.15321   Mean   : 26.425      Mean   :299.5        
##  3rd Qu.: 0.19128   3rd Qu.: 41.810      3rd Qu.:300.6        
##  Max.   : 0.31813   Max.   :169.340      Max.   :301.5        
##                                                               
##  reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
##  Min.   :296.8         Min.   :290.8               Min.   :298.2            
##  1st Qu.:298.4         1st Qu.:294.1               1st Qu.:300.4            
##  Median :299.8         Median :295.7               Median :301.9            
##  Mean   :299.5         Mean   :295.3               Mean   :301.6            
##  3rd Qu.:300.7         3rd Qu.:296.8               3rd Qu.:302.6            
##  Max.   :301.5         Max.   :297.8               Max.   :304.1            
##                                                                             
##  reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
##  Min.   :293.8             Min.   :  0.000                
##  1st Qu.:296.6             1st Qu.:  6.763                
##  Median :297.7             Median : 15.245                
##  Mean   :297.6             Mean   : 23.701                
##  3rd Qu.:298.8             3rd Qu.: 29.000                
##  Max.   :299.7             Max.   :301.400                
##                                                           
##  reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
##  Min.   :64.92                        Min.   :  0.000             
##  1st Qu.:76.06                        1st Qu.:  2.127             
##  Median :78.38                        Median : 13.975             
##  Mean   :78.20                        Mean   : 26.425             
##  3rd Qu.:80.42                        3rd Qu.: 41.810             
##  Max.   :86.78                        Max.   :169.340             
##                                                                   
##  reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
##  Min.   :12.54                         Min.   :1.486     Min.   :24.16     
##  1st Qu.:15.38                         1st Qu.:2.229     1st Qu.:26.07     
##  Median :17.07                         Median :2.543     Median :27.42     
##  Mean   :16.76                         Mean   :2.587     Mean   :27.27     
##  3rd Qu.:18.18                         3rd Qu.:2.857     3rd Qu.:28.51     
##  Max.   :19.34                         Max.   :4.429     Max.   :30.27     
##                                                                            
##  station_diur_temp_rng_c station_max_temp_c station_min_temp_c
##  Min.   :4.043           Min.   :27.20      Min.   :20.00     
##  1st Qu.:5.700           1st Qu.:30.60      1st Qu.:21.70     
##  Median :6.179           Median :31.70      Median :23.30     
##  Mean   :6.153           Mean   :31.68      Mean   :23.11     
##  3rd Qu.:6.586           3rd Qu.:32.80      3rd Qu.:24.40     
##  Max.   :8.400           Max.   :35.00      Max.   :26.70     
##                                                               
##  station_precip_mm     month    
##  Min.   :  0.00    1      : 25  
##  1st Qu.:  7.35    4      : 25  
##  Median : 22.35    7      : 25  
##  Mean   : 34.12    10     : 23  
##  3rd Qu.: 48.25    9      : 22  
##  Max.   :207.70    2      : 20  
##                    (Other):120
iq.test$week_start_date <- as.Date(iq.test$week_start_date, format=c('%Y-%m-%d'))
iq.test$month <- as.factor(month(iq.test$week_start_date))
summary(iq.test)
##   total_cases     city                year        weekofyear   
##  Min.   :0    Length:156         Min.   :2010   Min.   : 1.00  
##  1st Qu.:0    Class :character   1st Qu.:2011   1st Qu.:13.75  
##  Median :0    Mode  :character   Median :2012   Median :26.00  
##  Mean   :0                       Mean   :2012   Mean   :26.33  
##  3rd Qu.:0                       3rd Qu.:2012   3rd Qu.:39.00  
##  Max.   :0                       Max.   :2013   Max.   :52.00  
##                                                                
##  week_start_date         ndvi_ne           ndvi_nw           ndvi_se       
##  Min.   :2010-07-02   Min.   :0.08929   Min.   :0.06321   Min.   :0.09826  
##  1st Qu.:2011-03-31   1st Qu.:0.21496   1st Qu.:0.22244   1st Qu.:0.21198  
##  Median :2011-12-28   Median :0.26523   Median :0.26946   Median :0.25316  
##  Mean   :2011-12-28   Mean   :0.26689   Mean   :0.27057   Mean   :0.25858  
##  3rd Qu.:2012-09-24   3rd Qu.:0.31924   3rd Qu.:0.32456   3rd Qu.:0.30164  
##  Max.   :2013-06-25   Max.   :0.42999   Max.   :0.46480   Max.   :0.45304  
##                                                                            
##     ndvi_sw        precipitation_amt_mm reanalysis_air_temp_k
##  Min.   :0.08196   Min.   :  2.28       Min.   :294.6        
##  1st Qu.:0.21750   1st Qu.: 36.95       1st Qu.:297.1        
##  Median :0.28153   Median : 51.29       Median :297.8        
##  Mean   :0.28223   Mean   : 57.92       Mean   :297.8        
##  3rd Qu.:0.34724   3rd Qu.: 72.81       3rd Qu.:298.3        
##  Max.   :0.52904   Max.   :152.32       Max.   :301.9        
##                                                              
##  reanalysis_avg_temp_k reanalysis_dew_point_temp_k reanalysis_max_air_temp_k
##  Min.   :295.2         Min.   :292.0               Min.   :302.8            
##  1st Qu.:298.2         1st Qu.:294.8               1st Qu.:305.4            
##  Median :299.0         Median :295.9               Median :306.7            
##  Mean   :299.0         Mean   :295.6               Mean   :307.0            
##  3rd Qu.:299.8         3rd Qu.:296.5               3rd Qu.:308.5            
##  Max.   :303.3         Max.   :297.7               Max.   :314.1            
##                                                                             
##  reanalysis_min_air_temp_k reanalysis_precip_amt_kg_per_m2
##  Min.   :286.2             Min.   :  2.60                 
##  1st Qu.:291.9             1st Qu.: 35.62                 
##  Median :292.9             Median : 59.30                 
##  Mean   :292.7             Mean   : 72.61                 
##  3rd Qu.:293.8             3rd Qu.: 98.97                 
##  Max.   :296.0             Max.   :280.42                 
##                                                           
##  reanalysis_relative_humidity_percent reanalysis_sat_precip_amt_mm
##  Min.   :66.31                        Min.   :  2.28              
##  1st Qu.:86.32                        1st Qu.: 36.95              
##  Median :91.41                        Median : 51.29              
##  Mean   :89.61                        Mean   : 57.92              
##  3rd Qu.:94.74                        3rd Qu.: 72.81              
##  Max.   :97.98                        Max.   :152.32              
##                                                                   
##  reanalysis_specific_humidity_g_per_kg reanalysis_tdtr_k station_avg_temp_c
##  Min.   :13.74                         Min.   : 4.800    Min.   :24.84     
##  1st Qu.:16.33                         1st Qu.: 7.689    1st Qu.:27.05     
##  Median :17.55                         Median : 9.371    Median :27.51     
##  Mean   :17.21                         Mean   : 9.321    Mean   :27.54     
##  3rd Qu.:18.17                         3rd Qu.:10.764    3rd Qu.:28.01     
##  Max.   :19.60                         Max.   :14.486    Max.   :29.13     
##                                                                            
##  station_diur_temp_rng_c station_max_temp_c station_min_temp_c
##  Min.   : 6.45           Min.   :29.60      Min.   :14.20     
##  1st Qu.: 9.70           1st Qu.:33.20      1st Qu.:20.60     
##  Median :10.73           Median :34.00      Median :21.20     
##  Mean   :10.74           Mean   :33.96      Mean   :21.09     
##  3rd Qu.:11.63           3rd Qu.:34.83      3rd Qu.:21.85     
##  Max.   :14.72           Max.   :38.40      Max.   :23.20     
##                                                               
##  station_precip_mm     month   
##  Min.   :  0.00    1      :15  
##  1st Qu.: 11.72    4      :15  
##  Median : 27.20    7      :15  
##  Mean   : 34.25    10     :14  
##  3rd Qu.: 43.62    9      :13  
##  Max.   :212.00    2      :12  
##                    (Other):72

Making univariate time series for arima or neural net models

ts.sj.test <- ts(sj.test$total_cases, frequency = 52, start = c(2008, 18))
ts.iq.test <- ts(iq.test$total_cases, frequency = 52, start = c(2010, 26))

try out ARIMA forecasts

sub.aa.sj <- forecast(aa.sj, newdata = ts.sj.test, h = 260)
## Warning in forecast.forecast_ARIMA(aa.sj, newdata = ts.sj.test, h = 260): The
## non-existent newdata arguments will be ignored.
sub.aa.iq <- forecast(aa.iq, newdata = ts.iq.test, h = 156)
## Warning in forecast.forecast_ARIMA(aa.iq, newdata = ts.iq.test, h = 156): The
## non-existent newdata arguments will be ignored.

As expected, our ARIMA models look like naive models

Moving onto our AR model.

sj.lambda <- BoxCox.lambda(sj.copy$total_cases)
## Warning in guerrero(x, lower, upper): Guerrero's method for selecting a Box-Cox
## parameter (lambda) is given for strictly positive data.
sj.test$seq <- seq(1, 260)
sj.test$total_cases_t1 <- c(sj.test$total_cases[1], sj.test$total_cases[1:259])
predict.sj.ar <- forecast(sj.lm.ar, lambda = sj.lambda, newdata = sj.test)
## Warning in InvBoxCox(fcast$mean, lambda, biasadj, fcast): biasadj information
## not found, defaulting to FALSE.
iq.lambda <- BoxCox.lambda(iq.copy$total_cases)
## Warning in guerrero(x, lower, upper): Guerrero's method for selecting a Box-Cox
## parameter (lambda) is given for strictly positive data.
iq.test$seq <- seq(1,156)
iq.test$total_cases_t1 <- iq.test$total_cases
iq.test$ndvi_se <- c(iq.test$ndvi_se[1], iq.test$ndvi_se[1:155])
iq.test$ndvi_sw <- c(iq.test$ndvi_sw[1], iq.test$ndvi_sw[1:155])
iq.test$reanalysis_max_air_temp_k <- c(iq.test$reanalysis_max_air_temp_k[1], iq.test$reanalysis_max_air_temp_k[1:155])
iq.test$reanalysis_specific_humidity_g_per_kg <- c(iq.test$reanalysis_specific_humidity_g_per_kg[1], iq.test$reanalysis_specific_humidity_g_per_kg[1:155])

predict.iq.ar <- forecast(iq.lm.ar.2, lambda = iq.lambda, newdata = iq.test)
## Warning in InvBoxCox(fcast$mean, lambda, biasadj, fcast): biasadj information
## not found, defaulting to FALSE.

The forecasts seem decent after a boxcox transformation. They are within a possibly expected value (around mean values of total cases) and seem to fluctuate with months/seasonality.

Lets try the neural net.

sub.sj.nn <- forecast(sj.nnetar, newdata = ts.sj.test, h = 260)
sub.iq.nn <- forecast(iq.nnetar, newdata = ts.iq.test, h = 156)

The forecasts for the neural net seem alright. As we notice the default forecast is set for two cycles out (104 observations) so I have specified the number of obervations using h = nrows(sj data, iq data)

setting up data frame and moving predicted values to submission page

df.predict.sj.ar <- as.data.frame(predict.sj.ar)
df.predict.iq.ar <- as.data.frame(predict.iq.ar)
df.sub.sj.nn <- as.data.frame(sub.sj.nn)
df.sub.iq.nn <- as.data.frame(sub.iq.nn)
df.sub.aa.sj <- as.data.frame(sub.aa.sj)
df.sub.aa.iq <- as.data.frame(sub.aa.iq)

Submission requires rounding so we will do that

df.predict.sj.ar <- round(df.predict.sj.ar)
df.predict.iq.ar <- round(df.predict.iq.ar)

df.sub.sj.nn <- round(df.sub.sj.nn)
df.sub.iq.nn <- round(df.sub.iq.nn)

df.sub.aa.sj <- round(df.sub.aa.sj)
df.sub.aa.iq <- round(df.sub.aa.iq)

Finally, putting our results back into the submission file

ar.submission <- submission_format
ar.submission$total_cases[1:260] <- df.predict.sj.ar$`Point Forecast`
ar.submission$total_cases[261: 416] <- df.predict.iq.ar$`Point Forecast`

write.csv(ar.submission, "autoregressive forecasts.csv")
nn.submission <- submission_format
nn.submission$total_cases[1:260] <- df.sub.sj.nn$`Point Forecast`
nn.submission$total_cases[261: 416] <- df.sub.iq.nn$`Point Forecast`

write.csv(nn.submission, "neural net forecasts.csv")
aa.submission <- submission_format
aa.submission$total_cases[1:260] <- df.sub.aa.sj$`Point Forecast`
aa.submission$total_cases[261: 416] <- df.sub.aa.iq$`Point Forecast`

write.csv(aa.submission, "auto arima forecasts.csv")