Supervised Learning

Summary

Dengue: mosquito-borne disease

Dengue is a mosquito-borne disease that occurs mainly in the tropical and subtropical parts of the world. Because it is transmitted by mosquitoes, its transmission is related to climatic conditions and environmental variables such as precipitation and temperature. The disease is prevalent in Southeast Asia and the Pacific Islands, and epidemics are expected based on differences in climatic conditions. Nearly half a million cases of dengue fever are reported every year in Latin America, according to “DrivenData.org”.

Data Source

The dataset was collected and publicly shared by “DrivenData.org”. The link to the original dataset can be found here. The data for this competition comes from multiple sources aimed at supporting the prediction of the next epidemics. The environmental data has been collected by U.S. Federal Government agencies, from the Centers for Disease Control and Prevention (CDC) to the National Oceanic and Atmospheric Administration (NOAA). Accurate dengue predictions can help public health workers and people around the world take steps to mitigate the impact of these epidemics. Predicting dengue is a hefty task that calls for the consolidation of different datasets on disease incidence, weather, and the environment.

Goal

The goal of this project is to build a supervised learning model to predict the number of dengue fever cases each week in the cities of San Juan, Puerto Rico and Iquitos, Peru.

Supervised machine learning algorithms are designed to learn by example. When training a supervised learning model, the training data consists of inputs paired with the correct outputs. During training, the algorithm searches for patterns in the data that correlate with the desired outputs. After training, the model receives new, unseen inputs and predicts their outputs based on what it learned from the training data. The objective of a supervised learning model is to predict the correct output for newly presented input data. Supervised learning can be split into two subcategories: classification and regression.

Classification - During training, a classification algorithm is given data points with an assigned category. The job of a classification model is then to take an input value and assign it the class, or category, that it fits into, based on the training data provided. An example of classification is determining whether an email is spam or not. Some classification algorithms are linear classifiers, support vector machines, decision trees, k-nearest neighbors, and random forests.

Regression - Regression is a predictive statistical process in which the model attempts to find the relationship between dependent and independent variables. The goal of a regression algorithm is to predict a continuous number such as sales, income, or test scores. There are many different types of regression algorithms. Three commonly listed methods are linear regression, logistic regression, and polynomial regression (logistic regression, despite its name, predicts class probabilities and is typically used for classification).
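To make the distinction concrete, here is a minimal sketch in base R. The data are simulated, and the variable names (temp, cases, outbreak) are made up for illustration; they are not part of the dengue dataset. A linear model predicts a continuous value, while a logistic model predicts the probability of a class.

```r
# Simulated data: a hypothetical temperature and a noisy case count
set.seed(1)
temp  <- runif(100, min = 20, max = 35)
cases <- 3 * temp - 40 + rnorm(100, sd = 5)

# Regression: predict a continuous number
reg_fit  <- lm(cases ~ temp)
reg_pred <- predict(reg_fit, newdata = data.frame(temp = 30))

# Classification: predict a category (here, whether cases exceed 40)
outbreak <- factor(ifelse(cases > 40, "yes", "no"))
clf_fit  <- glm(outbreak ~ temp, family = binomial)
prob     <- predict(clf_fit, newdata = data.frame(temp = 30), type = "response")

reg_pred  # a predicted case count
prob      # a predicted probability of the class "yes"
```

The dengue task in this project is of the first kind, because total_cases is a number rather than a category.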

Algorithms used -
To predict the number of dengue cases, we used several supervised learning models, including KNN, GLMNET, Partial Least Squares (PLS), Decision Trees, Random Forest, and Extreme Gradient Boosting, and compared their performance.

Libraries

# install.packages("RCurl")
# install.packages("e1071")
# install.packages("caret")
# install.packages("doSNOW")
# install.packages("ipred")
# install.packages("xgboost")
# install.packages("dplyr")
# install.packages("tidyr")
# install.packages("naniar")
# install.packages("corrplot")
# install.packages("gbm")
# install.packages("mda")
# install.packages("psych")
# install.packages("kknn")
# install.packages("pls")
# install.packages("pamr")
# install.packages("glmnet")
# install.packages("rattle")
# install.packages("vtreat")
library(RCurl)
library(e1071)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(doSNOW)

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: snow

library(ipred)
library(xgboost)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:xgboost':
## 
##     slice

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:RCurl':
## 
##     complete

library(naniar)
library(corrplot)

## corrplot 0.84 loaded

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(grid)
library(ggplot2)
library(kknn)

## 
## Attaching package: 'kknn'

## The following object is masked from 'package:caret':
## 
##     contr.dummy

library(pls)

## 
## Attaching package: 'pls'

## The following object is masked from 'package:corrplot':
## 
##     corrplot

## The following object is masked from 'package:caret':
## 
##     R2

## The following object is masked from 'package:stats':
## 
##     loadings

library(pamr)

## Loading required package: cluster

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:caret':
## 
##     cluster

library(mda)

## Loading required package: class

## Loaded mda 0.5-2

library(rattle)

## Loading required package: tibble

## Loading required package: bitops

## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

## 
## Attaching package: 'rattle'

## The following object is masked from 'package:xgboost':
## 
##     xgboost

library(vtreat)

## Loading required package: wrapr

## 
## Attaching package: 'wrapr'

## The following object is masked from 'package:tibble':
## 
##     view

## The following objects are masked from 'package:tidyr':
## 
##     pack, unpack

## The following object is masked from 'package:dplyr':
## 
##     coalesce

library(glmnet)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:wrapr':
## 
##     pack, unpack

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.0-2

Data Preparation

The dataset contains both training and test data. We split the training data and use the larger part to train each predictive model and the smaller part to test its performance; in the code below this splitting is done by repeated 10-fold cross-validation. The separate test dataset, which has no case counts, is used only to generate the final predictions.

Data Import

We import the dengue_features_train and dengue_labels_train datasets using the “getURL” function from the RCurl package. The features dataset contains variables that can affect the weekly incidence of dengue cases.

We import the training features and labels and then merge them on their composite key (the combination of ‘city’, ‘year’ and ‘weekofyear’).
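As a small illustration of merging on a composite key, the sketch below uses two toy data frames. The values are made up; only the column names mirror the dengue data. A row is kept only when all three key columns match in both tables, because merge() performs an inner join by default.

```r
# Toy features and labels; the numbers are invented for illustration
features <- data.frame(city = c("sj", "sj", "iq"),
                       year = c(1990, 1990, 2000),
                       weekofyear = c(18, 19, 26),
                       precipitation_amt_mm = c(12.4, 22.8, 25.4))

labels <- data.frame(city = c("sj", "sj", "iq", "iq"),
                     year = c(1990, 1990, 2000, 2000),
                     weekofyear = c(18, 19, 26, 27),
                     total_cases = c(4, 5, 0, 1))

# Join on the combination of city, year and weekofyear
merged <- merge(features, labels, by = c("city", "year", "weekofyear"))

# The label row for (iq, 2000, 27) has no matching features, so it is dropped
nrow(merged)   # 3
names(merged)
```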

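# Download the training features and parse the CSV text
trfeat <- getURL("https://s3.amazonaws.com/drivendata/data/44/public/dengue_features_train.csv")
trfeat <- read.csv(text = trfeat)
# Drop column 4 (week_start_date)
trfeat <- trfeat[, -c(4)]
# Download the training labels (total_cases per city, year and week)
trlabel <- getURL("https://s3.amazonaws.com/drivendata/data/44/public/dengue_labels_train.csv")
trlabel <- read.csv(text = trlabel)
# Merge features and labels on the composite key
trmerge <- merge(trfeat, trlabel, by=c("city", "year", "weekofyear"))
names(trmerge)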

##  [1] "city"                                 
##  [2] "year"                                 
##  [3] "weekofyear"                           
##  [4] "ndvi_ne"                              
##  [5] "ndvi_nw"                              
##  [6] "ndvi_se"                              
##  [7] "ndvi_sw"                              
##  [8] "precipitation_amt_mm"                 
##  [9] "reanalysis_air_temp_k"                
## [10] "reanalysis_avg_temp_k"                
## [11] "reanalysis_dew_point_temp_k"          
## [12] "reanalysis_max_air_temp_k"            
## [13] "reanalysis_min_air_temp_k"            
## [14] "reanalysis_precip_amt_kg_per_m2"      
## [15] "reanalysis_relative_humidity_percent" 
## [16] "reanalysis_sat_precip_amt_mm"         
## [17] "reanalysis_specific_humidity_g_per_kg"
## [18] "reanalysis_tdtr_k"                    
## [19] "station_avg_temp_c"                   
## [20] "station_diur_temp_rng_c"              
## [21] "station_max_temp_c"                   
## [22] "station_min_temp_c"                   
## [23] "station_precip_mm"                    
## [24] "total_cases"

# Drop column 2 (year)
dengue_train <- trmerge[, -2]
names(dengue_train)

##  [1] "city"                                 
##  [2] "weekofyear"                           
##  [3] "ndvi_ne"                              
##  [4] "ndvi_nw"                              
##  [5] "ndvi_se"                              
##  [6] "ndvi_sw"                              
##  [7] "precipitation_amt_mm"                 
##  [8] "reanalysis_air_temp_k"                
##  [9] "reanalysis_avg_temp_k"                
## [10] "reanalysis_dew_point_temp_k"          
## [11] "reanalysis_max_air_temp_k"            
## [12] "reanalysis_min_air_temp_k"            
## [13] "reanalysis_precip_amt_kg_per_m2"      
## [14] "reanalysis_relative_humidity_percent" 
## [15] "reanalysis_sat_precip_amt_mm"         
## [16] "reanalysis_specific_humidity_g_per_kg"
## [17] "reanalysis_tdtr_k"                    
## [18] "station_avg_temp_c"                   
## [19] "station_diur_temp_rng_c"              
## [20] "station_max_temp_c"                   
## [21] "station_min_temp_c"                   
## [22] "station_precip_mm"                    
## [23] "total_cases"

dim(dengue_train)

## [1] 1456   23

Missing Values

# Visualizing missing values for the training data
vis_miss(dengue_train)

gg_miss_var(dengue_train) + theme_minimal()

gg_miss_var(dengue_train, facet = city) + theme_minimal()

ggplot(dengue_train, aes(x=ndvi_ne, y = total_cases)) + geom_point()

## Warning: Removed 194 rows containing missing values (geom_point).

ggplot(dengue_train, aes(x=ndvi_ne, y = total_cases)) + geom_miss_point()

Imputation

Bagging (bootstrap aggregating) is an ensemble method. When used for missing-value imputation, it trains a bagged tree model for each variable, using the remaining variables as predictors, and then uses that model to predict the missing values. Here we use preProcess() to impute values using the bagImpute method.
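The idea behind bagged imputation can be sketched in base R. This is not caret's implementation: bagImpute uses bagged trees, whereas this sketch swaps in linear models as the base learner so that it needs no extra packages. The toy data frame and its columns (x1, x2, y) are invented for illustration.

```r
# Toy data with 20 missing values in y
set.seed(1)
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.1)
toy$y[sample(n, 20)] <- NA

complete <- toy[!is.na(toy$y), ]
missing  <- toy[is.na(toy$y), ]

# Fit one model per bootstrap resample of the complete rows,
# and predict the missing rows with each of them
B <- 25
preds <- sapply(seq_len(B), function(b) {
  boot <- complete[sample(nrow(complete), replace = TRUE), ]
  fit  <- lm(y ~ x1 + x2, data = boot)
  predict(fit, newdata = missing)
})

# Aggregate: the imputed value is the average over the B models
toy$y[is.na(toy$y)] <- rowMeans(preds)
anyNA(toy$y)
```

Averaging over bootstrap resamples reduces the variance of the imputed values compared with a single fitted model.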

# Impute missing values with bagged trees (method = "bagImpute").
# Note: total_cases is a column of dengue_train, so it is part of the data passed to preProcess().
pre.process <- preProcess(dengue_train, method = "bagImpute")
imputed.data <- predict(pre.process, dengue_train)
# Copy the imputed predictor columns (3 to 22, ndvi_ne to station_precip_mm) back into dengue_train
dengue_train[, 3:22] <- imputed.data[, 3:22]

Check the missing values after bagImpute

anyNA(dengue_train)

## [1] FALSE

vis_miss(dengue_train)

Randomize

Randomization helps to avoid selection bias (where some groups are underrepresented) and accidental bias (where nuisance variables add noise and increase variability) in the dataset. Here the rows of the training data are shuffled into a random order.
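A minimal sketch of row shuffling on a toy data frame follows. The seed value and the data are arbitrary choices for illustration; calling set.seed() first is one way to make the shuffle reproducible between runs.

```r
# Toy data frame with six rows in their original order
toy <- data.frame(id = 1:6, value = c(10, 20, 30, 40, 50, 60))

# Fix the random number generator state, then permute the row indices
set.seed(2020)
shuffled <- toy[sample(nrow(toy)), ]

# Every row is still present exactly once; only the order has changed
sort(shuffled$id)
```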

random_index <- sample(1:nrow(dengue_train), nrow(dengue_train))
random_train <- dengue_train[random_index, ]
names(random_train)

##  [1] "city"                                 
##  [2] "weekofyear"                           
##  [3] "ndvi_ne"                              
##  [4] "ndvi_nw"                              
##  [5] "ndvi_se"                              
##  [6] "ndvi_sw"                              
##  [7] "precipitation_amt_mm"                 
##  [8] "reanalysis_air_temp_k"                
##  [9] "reanalysis_avg_temp_k"                
## [10] "reanalysis_dew_point_temp_k"          
## [11] "reanalysis_max_air_temp_k"            
## [12] "reanalysis_min_air_temp_k"            
## [13] "reanalysis_precip_amt_kg_per_m2"      
## [14] "reanalysis_relative_humidity_percent" 
## [15] "reanalysis_sat_precip_amt_mm"         
## [16] "reanalysis_specific_humidity_g_per_kg"
## [17] "reanalysis_tdtr_k"                    
## [18] "station_avg_temp_c"                   
## [19] "station_diur_temp_rng_c"              
## [20] "station_max_temp_c"                   
## [21] "station_min_temp_c"                   
## [22] "station_precip_mm"                    
## [23] "total_cases"

dim(random_train)

## [1] 1456   23

anyNA(random_train)

## [1] FALSE

Tuning grid

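expand.grid() creates a data frame containing every combination of the vectors or factors passed to it as arguments. The grid below is the tuning grid for the Extreme Gradient Boosting model; each parameter is given a single value, so the grid has exactly one row.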
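To illustrate how the grid grows, the sketch below contrasts a hypothetical grid with several candidate values (the values in demo_grid are arbitrary) with a grid in which every parameter has one value.

```r
# Two values of eta and three of max_depth give 2 * 3 = 6 combinations
demo_grid <- expand.grid(eta = c(0.05, 0.1),
                         max_depth = c(3, 5, 7))
nrow(demo_grid)    # 6

# One value per parameter gives a single row, so nothing is actually searched
single_grid <- expand.grid(eta = 0.1,
                           nrounds = 70,
                           max_depth = 5,
                           min_child_weight = 5,
                           colsample_bytree = 0.5,
                           gamma = 0,
                           subsample = 1)
nrow(single_grid)  # 1
```

With a one-row grid, resampling only estimates the performance of that single configuration.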

grid <- expand.grid(eta = c( 0.1),
                         nrounds = c(70),
                         max_depth = 5,
                         min_child_weight = c(5),
                         colsample_bytree = c(0.5),
                         gamma = c(0),
                         subsample = 1)

Defining trainControl for the ML Algorithms

The trainControl function is used to modify the resampling method. Its method argument controls the type of resampling and defaults to “boot”. Another method, “repeatedcv”, specifies repeated K-fold cross-validation; the repeats argument controls the number of repetitions. K is controlled by the number argument and defaults to 10.
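What repeated K-fold cross-validation does to the row indices can be sketched in base R. This is a simplified illustration, not caret's code: the helper make_folds is hypothetical, and the folds here are drawn purely at random, so the fold sizes may differ slightly from those caret reports.

```r
set.seed(42)
n       <- 1456  # number of training rows
k       <- 10    # folds per repetition
repeats <- 5     # number of repetitions

# Hypothetical helper: randomly assign each row to one of k held-out folds
make_folds <- function(n, k) {
  fold_id <- sample(rep(seq_len(k), length.out = n))
  split(seq_len(n), fold_id)
}

# Each repetition reshuffles the rows and produces k new held-out sets
all_folds <- unlist(lapply(seq_len(repeats), function(r) make_folds(n, k)),
                    recursive = FALSE)

length(all_folds)          # 50 held-out sets in total
range(lengths(all_folds))  # 145 146
```

Each model is fitted 50 times, once per held-out set, on the remaining rows.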

train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 5,
                              search = "grid")

ML Algos

Applying ML Algorithms For Training the Prediction Model

Knn

set.seed(45220)
# Weighted k-nearest neighbours regression; caret tunes kmax over 10 candidate values
model_kknn <- caret::train(total_cases ~ .,
                           data = random_train,
                           method = "kknn",
                           tuneLength = 10,
                           preProcess = NULL,
                           trControl = train.control)
model_kknn

## k-Nearest Neighbors 
## 
## 1456 samples
##   22 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1311, 1311, 1312, 1309, 1310, 1309, ... 
## Resampling results across tuning parameters:
## 
##   kmax  RMSE      Rsquared   MAE     
##    5    38.29166  0.2534999  19.64970
##    7    37.41438  0.2645651  19.29339
##    9    36.98550  0.2700151  19.12037
##   11    36.70442  0.2744291  19.00166
##   13    36.48939  0.2783097  18.89848
##   15    36.31815  0.2818369  18.80673
##   17    36.18968  0.2845206  18.72077
##   19    36.08982  0.2867246  18.64944
##   21    36.02212  0.2883338  18.58902
##   23    35.99799  0.2889752  18.55507
## 
## Tuning parameter 'distance' was held constant at a value of 2
## Tuning
##  parameter 'kernel' was held constant at a value of optimal
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 23, distance = 2 and kernel
##  = optimal.

GLMNET

Glmnet fits a generalized linear model via penalized maximum likelihood; the regularization path is computed for the elastic-net penalty at a grid of values for the regularization parameter lambda.
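For reference, the elastic-net objective for a squared-error (Gaussian) response can be written as below, following the notation of the glmnet documentation. Here N is the number of observations, y_i the response, x_i the predictor vector, lambda the overall penalty strength, and alpha the mixing parameter.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```

Setting alpha = 1 gives the lasso penalty and alpha = 0 gives the ridge penalty; caret tunes both alpha and lambda by resampling.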

set.seed(45220)
model_glmnet <- caret::train(total_cases ~ .,
                             data = random_train,
                             method = "glmnet",
                             preProcess = NULL,
                             trControl = train.control)
model_glmnet

## glmnet 
## 
## 1456 samples
##   22 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1311, 1311, 1312, 1309, 1310, 1309, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      RMSE      Rsquared   MAE     
##   0.10   0.02843789  38.75176  0.1659920  21.02240
##   0.10   0.28437887  38.71732  0.1671348  20.94485
##   0.10   2.84378869  38.65638  0.1685529  20.50883
##   0.55   0.02843789  38.74965  0.1660109  21.00985
##   0.55   0.28437887  38.67153  0.1687204  20.81907
##   0.55   2.84378869  38.71279  0.1672631  20.15018
##   1.00   0.02843789  38.74227  0.1662844  20.99642
##   1.00   0.28437887  38.65085  0.1693779  20.72881
##   1.00   2.84378869  38.85472  0.1644852  20.05433
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.2843789.

Random Forest

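# Predictor columns 2 to 22 (all columns except city and total_cases)
x <- random_train[,2:22]

metric <- "MAE"
# Heuristic used here: try about sqrt(number of predictors) variables at each split.
# sqrt(21) is about 4.58, not a whole number; floor() or round() would make the intended value explicit.
mtry <- sqrt(ncol(x))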
model_rf <- caret::train(total_cases ~ ., 
                         data = random_train,
                         method = "rf",
                         preProcess = NULL,
                         metric = metric,
                         tuneGrid = expand.grid(.mtry = mtry),
                         trControl = train.control)
model_rf

## Random Forest 
## 
## 1456 samples
##   22 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1310, 1310, 1310, 1312, 1310, 1310, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   34.4189  0.3626462  17.98255
## 
## Tuning parameter 'mtry' was held constant at a value of 4.582576

Regression Tree

set.seed(123)
model_rpart <- caret::train(total_cases ~ ., data = random_train,
                               method = "rpart",
                               preProcess = NULL,
                               trControl = train.control)

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

model_rpart

## CART 
## 
## 1456 samples
##   22 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1308, 1310, 1312, 1311, 1312, 1311, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared    MAE     
##   0.02037613  37.50562  0.23905879  19.43720
##   0.07623702  39.81780  0.13994029  20.26915
##   0.10278796  41.89839  0.08470509  22.42253
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02037613.

fancyRpartPlot(model_rpart$finalModel)

Partial Least Squares (PLS)

set.seed(27)
model_pls <- caret::train(total_cases ~ .,
                          data = random_train,
                          method = "pls",
                          preProcess = NULL,
                          trControl = train.control)
model_pls

## Partial Least Squares 
## 
## 1456 samples
##   22 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1311, 1310, 1311, 1310, 1311, 1311, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared    MAE     
##   1      42.46910  0.01569133  22.77335
##   2      41.73123  0.05362288  22.31229
##   3      40.94488  0.08906215  22.00556
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

Extreme Gradient Boosting

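# Start a 3-worker socket cluster so caret can run the resamples in parallel
cl <- makeCluster(3, type = "SOCK")
registerDoSNOW(cl)

model_xgb <- caret::train(total_cases ~ .,
                          data = random_train,
                          method = "xgbTree",
                          tuneGrid = grid,
                          trControl = train.control)

# Shut the workers down and return to sequential execution
stopCluster(cl)
registerDoSEQ()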

## [16:13:02] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.

model_xgb

## eXtreme Gradient Boosting 
## 
## 1456 samples
##   22 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1311, 1311, 1309, 1311, 1310, 1311, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   30.68598  0.4694012  17.04265
## 
## Tuning parameter 'nrounds' was held constant at a value of 70
## Tuning
##  held constant at a value of 5
## Tuning parameter 'subsample' was held
##  constant at a value of 1

Comparing Prediction Models

models <- list( xgb = model_xgb,
                rf = model_rf, 
                glmnet = model_glmnet, 
                kknn = model_kknn, 
                pls = model_pls,
                tree = model_rpart
)
resample_results <- resamples(models)
resample_results

## 
## Call:
## resamples.default(x = models)
## 
## Models: xgb, rf, glmnet, kknn, pls, tree 
## Number of resamples: 50 
## Performance metrics: MAE, RMSE, Rsquared 
## Time estimates for: everything, final model fit

summary(resample_results)

## 
## Call:
## summary.resamples(object = resample_results)
## 
## Models: xgb, rf, glmnet, kknn, pls, tree 
## Number of resamples: 50 
## 
## MAE 
##            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## xgb    12.79761 15.80697 16.60513 17.04265 18.14508 22.35765    0
## rf     14.03544 16.30508 17.63306 17.98255 19.49935 22.30573    0
## glmnet 15.55843 18.51459 20.56696 20.72881 22.72348 27.41620    0
## kknn   13.60547 17.10898 18.26732 18.55507 20.65554 24.75693    0
## pls    17.78255 20.74412 21.77177 22.00556 22.99865 27.91764    0
## tree   15.88393 17.90236 18.85906 19.43720 21.03425 24.87198    0
## 
## RMSE 
##            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## xgb    20.98064 26.39972 29.56435 30.68598 33.67789 48.50493    0
## rf     21.27962 28.30013 33.90581 34.41890 37.74828 54.77143    0
## glmnet 19.75373 28.37603 38.50791 38.65085 46.63829 67.77671    0
## kknn   20.06282 28.26916 36.13062 35.99799 43.00113 58.09925    0
## pls    24.69686 34.34397 41.06235 40.94488 47.22071 60.42955    0
## tree   23.59381 30.24829 37.52072 37.50562 42.99712 55.32374    0
## 
## Rsquared 
##              Min.    1st Qu.     Median       Mean   3rd Qu.      Max. NA's
## xgb    0.13884196 0.29097845 0.50341281 0.46940117 0.6007764 0.7558065    0
## rf     0.14094760 0.29004008 0.35409888 0.36264615 0.4283880 0.6641370    0
## glmnet 0.06394065 0.13670332 0.16485529 0.16937787 0.1936709 0.2681710    0
## kknn   0.15587050 0.22639517 0.28190003 0.28897522 0.3375357 0.5307537    0
## pls    0.04516654 0.06885199 0.08529575 0.08906215 0.1017315 0.1829123    0
## tree   0.04714377 0.12365097 0.19103784 0.23905879 0.2874392 0.6744623    0

Test data

We import the test features, to which the predictive model will be applied to predict the total number of cases per week at future dates.

testset <- getURL("https://s3.amazonaws.com/drivendata/data/44/public/dengue_features_test.csv")
testset <- read.csv(text=testset)
names(testset)

##  [1] "city"                                 
##  [2] "year"                                 
##  [3] "weekofyear"                           
##  [4] "week_start_date"                      
##  [5] "ndvi_ne"                              
##  [6] "ndvi_nw"                              
##  [7] "ndvi_se"                              
##  [8] "ndvi_sw"                              
##  [9] "precipitation_amt_mm"                 
## [10] "reanalysis_air_temp_k"                
## [11] "reanalysis_avg_temp_k"                
## [12] "reanalysis_dew_point_temp_k"          
## [13] "reanalysis_max_air_temp_k"            
## [14] "reanalysis_min_air_temp_k"            
## [15] "reanalysis_precip_amt_kg_per_m2"      
## [16] "reanalysis_relative_humidity_percent" 
## [17] "reanalysis_sat_precip_amt_mm"         
## [18] "reanalysis_specific_humidity_g_per_kg"
## [19] "reanalysis_tdtr_k"                    
## [20] "station_avg_temp_c"                   
## [21] "station_diur_temp_rng_c"              
## [22] "station_max_temp_c"                   
## [23] "station_min_temp_c"                   
## [24] "station_precip_mm"

# Drop columns 2 (year) and 4 (week_start_date) to match the training predictors
dengue_test <- testset[, -c(2, 4)]
names(dengue_test)

##  [1] "city"                                 
##  [2] "weekofyear"                           
##  [3] "ndvi_ne"                              
##  [4] "ndvi_nw"                              
##  [5] "ndvi_se"                              
##  [6] "ndvi_sw"                              
##  [7] "precipitation_amt_mm"                 
##  [8] "reanalysis_air_temp_k"                
##  [9] "reanalysis_avg_temp_k"                
## [10] "reanalysis_dew_point_temp_k"          
## [11] "reanalysis_max_air_temp_k"            
## [12] "reanalysis_min_air_temp_k"            
## [13] "reanalysis_precip_amt_kg_per_m2"      
## [14] "reanalysis_relative_humidity_percent" 
## [15] "reanalysis_sat_precip_amt_mm"         
## [16] "reanalysis_specific_humidity_g_per_kg"
## [17] "reanalysis_tdtr_k"                    
## [18] "station_avg_temp_c"                   
## [19] "station_diur_temp_rng_c"              
## [20] "station_max_temp_c"                   
## [21] "station_min_temp_c"                   
## [22] "station_precip_mm"

# Visualizing missing values for the test data
vis_miss(dengue_test)

Imputation

Imputation of missing values in the test data.

names(dengue_test)

##  [1] "city"                                 
##  [2] "weekofyear"                           
##  [3] "ndvi_ne"                              
##  [4] "ndvi_nw"                              
##  [5] "ndvi_se"                              
##  [6] "ndvi_sw"                              
##  [7] "precipitation_amt_mm"                 
##  [8] "reanalysis_air_temp_k"                
##  [9] "reanalysis_avg_temp_k"                
## [10] "reanalysis_dew_point_temp_k"          
## [11] "reanalysis_max_air_temp_k"            
## [12] "reanalysis_min_air_temp_k"            
## [13] "reanalysis_precip_amt_kg_per_m2"      
## [14] "reanalysis_relative_humidity_percent" 
## [15] "reanalysis_sat_precip_amt_mm"         
## [16] "reanalysis_specific_humidity_g_per_kg"
## [17] "reanalysis_tdtr_k"                    
## [18] "station_avg_temp_c"                   
## [19] "station_diur_temp_rng_c"              
## [20] "station_max_temp_c"                   
## [21] "station_min_temp_c"                   
## [22] "station_precip_mm"

# Impute missing values in the test data with bagged trees
pre.process <- preProcess(dengue_test, method = "bagImpute")
imputed.data <- predict(pre.process, dengue_test)
# Copy the imputed columns (3 to 22, ndvi_ne to station_precip_mm) back into dengue_test
dengue_test[, 3:22] <- imputed.data[, 3:22]

dim(dengue_test)

## [1] 416  22

anyNA(dengue_test)

## [1] FALSE

vis_miss(dengue_test)

Prediction

Predicting the total cases on the test data

# predict values for test data
pred <- predict(model_xgb, dengue_test)
dengue_test$total_cases <- round(pred, digits = 0)

# Visualizing the time-series total cases on the test data
plot(dengue_test$total_cases)

# Summary of the predicted total cases
summary(dengue_test$total_cases)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -4.00    7.00   15.00   23.04   32.00  141.00
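The summary above shows a minimum prediction of -4, although a case count cannot be negative. One simple option, sketched here with made-up numbers rather than the actual model output, is to clamp the rounded predictions at zero. This step is not part of the original analysis, and whether it is appropriate for the submission is a judgment call.

```r
# Hypothetical raw predictions, including one negative value
pred_demo <- c(-4.2, 0.4, 15.0, 140.6)

# Round to whole cases, then replace anything below zero with zero
clamped <- pmax(round(pred_demo), 0)
clamped  # 0 0 15 141
```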

# Enter the predicted 'total_cases' for the test set into the submission format
submit_text <- getURL("https://s3.amazonaws.com/drivendata/data/44/public/submission_format.csv")
submitformat <- read.csv(text = submit_text)
submitformat$total_cases <- dengue_test$total_cases

# Export the output (total cases) to the local drive as a CSV file
write.csv(submitformat, "//Users//radhika//Documents//submit.csv", row.names = FALSE)

Supervised Learning - Predict the Next Pandemic

Singh | Sood
