Dengue is a mosquito-borne disease that occurs mainly in the tropical and subtropical parts of the world. Because it is transmitted by mosquitoes, its transmission is related to climatic conditions and environmental variables such as precipitation and temperature. The disease is prevalent in Southeast Asia and the Pacific Islands, and epidemics are expected to follow differences in climatic conditions. Nearly half a million cases of dengue fever are reported every year in Latin America, according to “DrivenData.org”.
Data Source
The dataset was collected and publicly shared by “DrivenData.org”. The link to the original dataset can be found here. The data for this competition come from multiple sources and are aimed at supporting the prediction of the next epidemics. The environmental data were collected by U.S. Federal Government agencies, from the Centers for Disease Control and Prevention (CDC) to the National Oceanic and Atmospheric Administration (NOAA). Accurate dengue predictions can help public health workers and people around the world take steps to mitigate the impact of epidemics. Predicting dengue is a hefty task that calls for the consolidation of different data sets on disease incidence, weather, and the environment.
Goal
The goal of this project is to build a supervised learning model that predicts the number of dengue fever cases each week in the cities of San Juan, Puerto Rico and Iquitos, Peru.
Supervised machine learning algorithms are designed to learn by example. When training a supervised learning model, the training data consist of inputs paired with the correct outputs. During training, the algorithm searches for patterns in the data that correlate with the desired outputs. After training, the model receives new, unseen inputs and predicts their outputs based on what it learned from the training data. The objective of a supervised learning model is to predict the correct output for newly presented input data. Supervised learning can be split into two subcategories: classification and regression.
Classification - During training, a classification algorithm is given data points with an assigned category. The job of a classification model is then to take an input value and assign it the class, or category, it fits into, based on the training data provided. An example of classification is determining whether an email is spam or not. Some classification algorithms are linear classifiers, Support Vector Machines, Decision Trees, K-Nearest Neighbors and Random Forest.
Regression - Regression is a predictive statistical process in which the model attempts to find the relationship between dependent and independent variables. The goal of a regression algorithm is to predict a continuous number such as sales, income, or test scores. There are many different types of regression algorithms. Commonly listed methods include Linear Regression and Polynomial Regression; Logistic Regression is often listed alongside them but, despite its name, it models a categorical outcome and is normally used for classification.
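The difference between the two subcategories can be illustrated with a minimal sketch in base R. The built-in `cars` and `mtcars` data sets, and the 0.5 probability cut-off, are assumptions chosen only for illustration; they are not part of the dengue analysis.

```r
# Regression: predict a continuous outcome (stopping distance) from speed
# using the built-in 'cars' data set.
reg_fit <- lm(dist ~ speed, data = cars)
reg_pred <- predict(reg_fit, newdata = data.frame(speed = c(10, 20)))

# Classification: predict a binary category (automatic vs. manual gearbox)
# from car weight using the built-in 'mtcars' data set (am: 0 = automatic, 1 = manual).
clf_fit <- glm(am ~ wt, data = mtcars, family = binomial)
clf_prob <- predict(clf_fit, newdata = data.frame(wt = c(2, 4)), type = "response")
clf_label <- ifelse(clf_prob > 0.5, "manual", "automatic")  # illustrative cut-off

reg_pred   # numeric predictions on a continuous scale
clf_label  # predicted categories
```

The regression model returns numbers on a continuous scale, while the classifier returns one of a fixed set of labels. Predicting weekly dengue case counts falls into the first group.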
Algorithms used -
To predict the number of dengue cases, we used several supervised learning models, including k-nearest neighbors (KNN), GLMNET, Partial Least Squares (PLS), Decision Trees, Random Forest and Extreme Gradient Boosting, and compared their performance.
# install.packages("RCurl")
# install.packages("e1071")
# install.packages("caret")
# install.packages("doSNOW")
# install.packages("ipred")
# install.packages("xgboost")
# install.packages("dplyr")
# install.packages("tidyr")
# install.packages("naniar")
# install.packages("corrplot")
# install.packages("gbm")
# install.packages("mda")
# install.packages("psych")
# install.packages("kknn")
# install.packages("pls")
# install.packages("pamr")
# install.packages("glmnet")
# install.packages("rattle")
# install.packages("vtreat")
library(RCurl)
library(e1071)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(doSNOW)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: snow
library(ipred)
library(xgboost)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:xgboost':
##
## slice
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
library(naniar)
library(corrplot)
## corrplot 0.84 loaded
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(grid)
library(ggplot2)
library(kknn)
##
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
##
## contr.dummy
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:corrplot':
##
## corrplot
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
library(pamr)
## Loading required package: cluster
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
library(mda)
## Loading required package: class
## Loaded mda 0.5-2
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:xgboost':
##
## xgboost
library(vtreat)
## Loading required package: wrapr
##
## Attaching package: 'wrapr'
## The following object is masked from 'package:tibble':
##
## view
## The following objects are masked from 'package:tidyr':
##
## pack, unpack
## The following object is masked from 'package:dplyr':
##
## coalesce
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:wrapr':
##
## pack, unpack
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.0-2
The dataset contains both train and test data. The train data are split so that the larger part trains the predictive model and the smaller, held-out part tests the performance of the predictive model/regressor; here this is done through cross-validation. The separate test dataset is then used to generate the final predictions.
The dengue_features_train and dengue_labels_train datasets are imported using the “getURL” function from the RCurl package. They contain information about the various features that can affect the weekly incidence of dengue (a mosquito-borne disease).
The training features and labels are imported and then merged on their composite key (i.e., the combination of ‘city’, ‘year’ and ‘weekofyear’).
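# Download the training features and parse the CSV text
trfeat <- getURL("https://s3.amazonaws.com/drivendata/data/44/public/dengue_features_train.csv")
trfeat <- read.csv(text = trfeat)
# Drop column 4 (week_start_date)
trfeat <- trfeat[, -c(4)]
# Download the training labels and parse the CSV text
trlabel <- getURL("https://s3.amazonaws.com/drivendata/data/44/public/dengue_labels_train.csv")
trlabel <- read.csv(text = trlabel)
# Merge features and labels on the composite key
trmerge <- merge(trfeat, trlabel, by = c("city", "year", "weekofyear"))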
names(trmerge)
## [1] "city"
## [2] "year"
## [3] "weekofyear"
## [4] "ndvi_ne"
## [5] "ndvi_nw"
## [6] "ndvi_se"
## [7] "ndvi_sw"
## [8] "precipitation_amt_mm"
## [9] "reanalysis_air_temp_k"
## [10] "reanalysis_avg_temp_k"
## [11] "reanalysis_dew_point_temp_k"
## [12] "reanalysis_max_air_temp_k"
## [13] "reanalysis_min_air_temp_k"
## [14] "reanalysis_precip_amt_kg_per_m2"
## [15] "reanalysis_relative_humidity_percent"
## [16] "reanalysis_sat_precip_amt_mm"
## [17] "reanalysis_specific_humidity_g_per_kg"
## [18] "reanalysis_tdtr_k"
## [19] "station_avg_temp_c"
## [20] "station_diur_temp_rng_c"
## [21] "station_max_temp_c"
## [22] "station_min_temp_c"
## [23] "station_precip_mm"
## [24] "total_cases"
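To see how merging on a composite key behaves, here is a small sketch with two toy tables. The rows and values are hypothetical and only mimic the column names of the dengue data.

```r
# Hypothetical toy tables that share the composite key (city, year, weekofyear)
features <- data.frame(
  city = c("sj", "sj", "iq"),
  year = c(1990, 1990, 2000),
  weekofyear = c(18, 19, 26),
  precipitation_amt_mm = c(12.4, 22.8, 25.4)
)
labels <- data.frame(
  city = c("iq", "sj", "sj"),
  year = c(2000, 1990, 1990),
  weekofyear = c(26, 19, 18),
  total_cases = c(0, 5, 4)
)

# Rows are matched only when all three key columns agree,
# so the row order of the two tables does not matter.
merged <- merge(features, labels, by = c("city", "year", "weekofyear"))
merged
```

Each row of `merged` carries both the feature and the label for one city-week, which is the same shape as `trmerge` above.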
# Drop column 2 (year)
dengue_train <- trmerge[,c(-2)]
names(dengue_train)
## [1] "city"
## [2] "weekofyear"
## [3] "ndvi_ne"
## [4] "ndvi_nw"
## [5] "ndvi_se"
## [6] "ndvi_sw"
## [7] "precipitation_amt_mm"
## [8] "reanalysis_air_temp_k"
## [9] "reanalysis_avg_temp_k"
## [10] "reanalysis_dew_point_temp_k"
## [11] "reanalysis_max_air_temp_k"
## [12] "reanalysis_min_air_temp_k"
## [13] "reanalysis_precip_amt_kg_per_m2"
## [14] "reanalysis_relative_humidity_percent"
## [15] "reanalysis_sat_precip_amt_mm"
## [16] "reanalysis_specific_humidity_g_per_kg"
## [17] "reanalysis_tdtr_k"
## [18] "station_avg_temp_c"
## [19] "station_diur_temp_rng_c"
## [20] "station_max_temp_c"
## [21] "station_min_temp_c"
## [22] "station_precip_mm"
## [23] "total_cases"
dim(dengue_train)
## [1] 1456 23
# Visualizing missing values for the training data
vis_miss(dengue_train)
gg_miss_var(dengue_train) + theme_minimal()
gg_miss_var(dengue_train, facet = city) + theme_minimal()
ggplot(dengue_train, aes(x=ndvi_ne, y = total_cases)) + geom_point()
## Warning: Removed 194 rows containing missing values (geom_point).
ggplot(dengue_train, aes(x=ndvi_ne, y = total_cases)) + geom_miss_point()
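The missingness plots can be complemented with a numeric summary. The sketch below uses base R only and a hypothetical toy data frame; the same two lines could be applied to `dengue_train`.

```r
# Hypothetical toy data frame with a few missing values
toy <- data.frame(
  ndvi_ne = c(0.12, NA, 0.03, NA),
  station_avg_temp_c = c(25.4, 26.7, NA, 27.5),
  total_cases = c(4, 5, 4, 3)
)

# Number and percentage of missing values per column
na_count <- colSums(is.na(toy))
na_percent <- 100 * colMeans(is.na(toy))
data.frame(na_count, na_percent)
```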
Imputation
Bagging (bootstrap aggregating) is an ensemble method. When used for missing-value imputation, it uses the remaining variables as predictors to train bagged trees and then uses those trees to predict the missing values. Here we use preProcess() to impute values with the bagImpute method.
# Impute missing values with the bagImpute method
pre.process <- preProcess(dengue_train, method = "bagImpute")
imputed.data <- predict(pre.process, dengue_train)
# Copy the imputed predictor columns (columns 3 to 22) back into dengue_train
impute_cols <- 3:22
dengue_train[, impute_cols] <- imputed.data[, impute_cols]
Check the missing values after bagImpute
anyNA(dengue_train)
## [1] FALSE
vis_miss(dengue_train)
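The idea behind bagged-tree imputation can be sketched by hand for a single variable. This is only an illustration under stated assumptions, not caret's internal implementation: the toy data, the number of trees (25) and the use of `rpart` (a recommended package shipped with standard R installations) are all choices made for this example.

```r
library(rpart)

set.seed(1)
# Hypothetical toy data: y depends on x1 and x2, and some y values are missing
toy <- data.frame(x1 = runif(200), x2 = runif(200))
toy$y <- 3 * toy$x1 - 2 * toy$x2 + rnorm(200, sd = 0.1)
missing_rows <- sample(nrow(toy), 20)
toy$y[missing_rows] <- NA

complete <- toy[!is.na(toy$y), ]
n_trees <- 25

# Fit one regression tree per bootstrap sample of the complete rows
# and predict the rows where y is missing
tree_preds <- sapply(seq_len(n_trees), function(i) {
  boot <- complete[sample(nrow(complete), replace = TRUE), ]
  fit <- rpart(y ~ x1 + x2, data = boot)
  predict(fit, newdata = toy[missing_rows, ])
})

# The bagged estimate is the average prediction over all trees
toy$y[missing_rows] <- rowMeans(tree_preds)
anyNA(toy)
```

Averaging over trees grown on different bootstrap samples reduces the variance of any single tree, which is the motivation for bagging.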
Randomizing the row order helps to avoid selection bias (where some groups are underrepresented) and accidental bias (where nuisance variables add noise and so increase variability) in the dataset.
random_index <- sample(1:nrow(dengue_train), nrow(dengue_train))
random_train <- dengue_train[random_index, ]
names(random_train)
## [1] "city"
## [2] "weekofyear"
## [3] "ndvi_ne"
## [4] "ndvi_nw"
## [5] "ndvi_se"
## [6] "ndvi_sw"
## [7] "precipitation_amt_mm"
## [8] "reanalysis_air_temp_k"
## [9] "reanalysis_avg_temp_k"
## [10] "reanalysis_dew_point_temp_k"
## [11] "reanalysis_max_air_temp_k"
## [12] "reanalysis_min_air_temp_k"
## [13] "reanalysis_precip_amt_kg_per_m2"
## [14] "reanalysis_relative_humidity_percent"
## [15] "reanalysis_sat_precip_amt_mm"
## [16] "reanalysis_specific_humidity_g_per_kg"
## [17] "reanalysis_tdtr_k"
## [18] "station_avg_temp_c"
## [19] "station_diur_temp_rng_c"
## [20] "station_max_temp_c"
## [21] "station_min_temp_c"
## [22] "station_precip_mm"
## [23] "total_cases"
dim(random_train)
## [1] 1456 23
anyNA(random_train)
## [1] FALSE
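A shuffle like the one above gives a different row order on every run unless the random seed is fixed first. The sketch below shows this on a hypothetical toy data frame; the seed value 42 is arbitrary.

```r
# Hypothetical toy data frame standing in for the training data
toy <- data.frame(id = 1:6, value = c(10, 20, 30, 40, 50, 60))

# Fixing the seed makes the shuffle repeatable
set.seed(42)
shuffled_a <- toy[sample(nrow(toy)), ]

set.seed(42)
shuffled_b <- toy[sample(nrow(toy)), ]

# Same seed, same permutation; the rows are reordered but none are lost
identical(shuffled_a, shuffled_b)
setequal(shuffled_a$id, toy$id)
```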
expand.grid() creates a data frame containing every combination of the vectors/factors passed to it as arguments. Here each argument holds a single value, so the grid has a single row.
grid <- expand.grid(eta = c( 0.1),
nrounds = c(70),
max_depth = 5,
min_child_weight = c(5),
colsample_bytree = c(0.5),
gamma = c(0),
subsample = 1)
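With more than one value per argument, the grid grows multiplicatively. A small sketch with hypothetical tuning values:

```r
# Hypothetical tuning values: two learning rates and three tree depths
toy_grid <- expand.grid(eta = c(0.05, 0.1),
                        max_depth = c(3, 5, 7))

toy_grid        # one row per combination
nrow(toy_grid)  # 2 * 3 = 6 combinations
```

Each row of such a grid is one candidate parameter setting to be evaluated by resampling, so larger grids cost proportionally more training time.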
Defining trainControl for the ML Algorithms
The trainControl function is used to modify the resampling method. Its method argument controls the type of resampling and defaults to “boot”. Another method, “repeatedcv”, specifies repeated K-fold cross-validation, where the repeats argument controls the number of repetitions. K is controlled by the number argument and defaults to 10.
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
search = "grid")
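What repeated K-fold cross-validation does can be sketched by hand in base R. This is an illustration of the resampling scheme, not caret's own code; the toy data and the simple linear model are assumptions made for the example.

```r
set.seed(1)

# Hypothetical toy regression data
n <- 100
toy <- data.frame(x = runif(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.2)

k <- 10       # number of folds (the 'number' argument)
repeats <- 5  # number of repetitions (the 'repeats' argument)

rmse_values <- c()
for (r in seq_len(repeats)) {
  # Randomly assign each row to one of k folds; a new assignment per repeat
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    train_rows <- toy[fold_id != f, ]
    test_rows  <- toy[fold_id == f, ]
    fit <- lm(y ~ x, data = train_rows)
    pred <- predict(fit, newdata = test_rows)
    rmse_values <- c(rmse_values, sqrt(mean((test_rows$y - pred)^2)))
  }
}

length(rmse_values)  # k * repeats held-out estimates
mean(rmse_values)    # averaged resampling estimate of RMSE
```

With 10 folds repeated 5 times, every candidate model is fitted and scored 50 times, and the reported metric is the average over those 50 held-out scores.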
Applying ML Algorithms For Training the Prediction Model
set.seed(45220)
model_kknn <- caret::train(total_cases ~ .,
data = random_train,
method = "kknn",
tuneLength = 10,
preProcess = NULL,
trControl = train.control)
model_kknn
## k-Nearest Neighbors
##
## 1456 samples
## 22 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1311, 1311, 1312, 1309, 1310, 1309, ...
## Resampling results across tuning parameters:
##
## kmax RMSE Rsquared MAE
## 5 38.29166 0.2534999 19.64970
## 7 37.41438 0.2645651 19.29339
## 9 36.98550 0.2700151 19.12037
## 11 36.70442 0.2744291 19.00166
## 13 36.48939 0.2783097 18.89848
## 15 36.31815 0.2818369 18.80673
## 17 36.18968 0.2845206 18.72077
## 19 36.08982 0.2867246 18.64944
## 21 36.02212 0.2883338 18.58902
## 23 35.99799 0.2889752 18.55507
##
## Tuning parameter 'distance' was held constant at a value of 2
## Tuning
## parameter 'kernel' was held constant at a value of optimal
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 23, distance = 2 and kernel
## = optimal.
Glmnet fits a generalized linear model via penalized maximum likelihood; the regularization path is computed for the elastic-net penalty at a grid of values for the regularization parameter lambda.
set.seed(45220)
model_glmnet <- caret::train(total_cases ~ .,
data = random_train,
method = "glmnet",
preProcess = NULL,
trControl = train.control)
model_glmnet
## glmnet
##
## 1456 samples
## 22 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1311, 1311, 1312, 1309, 1310, 1309, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.02843789 38.75176 0.1659920 21.02240
## 0.10 0.28437887 38.71732 0.1671348 20.94485
## 0.10 2.84378869 38.65638 0.1685529 20.50883
## 0.55 0.02843789 38.74965 0.1660109 21.00985
## 0.55 0.28437887 38.67153 0.1687204 20.81907
## 0.55 2.84378869 38.71279 0.1672631 20.15018
## 1.00 0.02843789 38.74227 0.1662844 20.99642
## 1.00 0.28437887 38.65085 0.1693779 20.72881
## 1.00 2.84378869 38.85472 0.1644852 20.05433
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.2843789.
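The roles of alpha and lambda in the table above can be made concrete by computing the elastic-net penalty by hand. The helper function and the coefficient vector below are hypothetical and only illustrate the penalty term; they are not part of the glmnet package.

```r
# Elastic-net penalty for a coefficient vector 'beta':
#   lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -1.0, 0.0, 2.0)  # hypothetical coefficients

elastic_net_penalty(beta, lambda = 0.1, alpha = 1)     # pure lasso (L1) penalty
elastic_net_penalty(beta, lambda = 0.1, alpha = 0)     # pure ridge (L2) penalty
elastic_net_penalty(beta, lambda = 0.1, alpha = 0.55)  # a mix of the two
```

Here alpha sets the mix between the lasso and ridge penalties, and lambda scales the overall strength of the penalty. The selected model above uses alpha = 1, which corresponds to the pure lasso end of this range.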
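# Predictor columns 2 to 22 (weekofyear through station_precip_mm)
x <- random_train[,2:22]
# Metric used by caret to select the model
metric <- "MAE"
# Heuristic: square root of the number of predictors (not rounded here)
mtry <- sqrt(ncol(x))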
model_rf <- caret::train(total_cases ~ .,
data = random_train,
method = "rf",
preProcess = NULL,
metric = metric,
tuneGrid = expand.grid(.mtry = mtry),
trControl = train.control)
model_rf
## Random Forest
##
## 1456 samples
## 22 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1310, 1310, 1310, 1312, 1310, 1310, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 34.4189 0.3626462 17.98255
##
## Tuning parameter 'mtry' was held constant at a value of 4.582576
set.seed(123)
model_rpart <- caret::train(total_cases ~ ., data = random_train,
method = "rpart",
preProcess = NULL,
trControl = train.control)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
model_rpart
## CART
##
## 1456 samples
## 22 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1308, 1310, 1312, 1311, 1312, 1311, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.02037613 37.50562 0.23905879 19.43720
## 0.07623702 39.81780 0.13994029 20.26915
## 0.10278796 41.89839 0.08470509 22.42253
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02037613.
fancyRpartPlot(model_rpart$finalModel)
set.seed(27)
model_pls <- caret::train(total_cases ~ .,
data = random_train,
method = "pls",
preProcess = NULL,
trControl = train.control)
model_pls
## Partial Least Squares
##
## 1456 samples
## 22 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1311, 1310, 1311, 1310, 1311, 1311, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 42.46910 0.01569133 22.77335
## 2 41.73123 0.05362288 22.31229
## 3 40.94488 0.08906215 22.00556
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
# Start a cluster of 3 socket workers and register it for parallel training
cl <- makeCluster(3, type = "SOCK")
registerDoSNOW(cl)
model_xgb <- caret::train(total_cases ~ .,
data = random_train,
method = "xgbTree",
tuneGrid = grid,
trControl = train.control)
## [16:13:02] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror.
model_xgb
## eXtreme Gradient Boosting
##
## 1456 samples
## 22 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1311, 1311, 1309, 1311, 1310, 1311, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 30.68598 0.4694012 17.04265
##
## Tuning parameter 'nrounds' was held constant at a value of 70
## Tuning
## held constant at a value of 5
## Tuning parameter 'subsample' was held
## constant at a value of 1
models <- list( xgb = model_xgb,
rf = model_rf,
glmnet = model_glmnet,
kknn = model_kknn,
pls = model_pls,
tree = model_rpart
)
resample_results <- resamples(models)
resample_results
##
## Call:
## resamples.default(x = models)
##
## Models: xgb, rf, glmnet, kknn, pls, tree
## Number of resamples: 50
## Performance metrics: MAE, RMSE, Rsquared
## Time estimates for: everything, final model fit
summary(resample_results)
##
## Call:
## summary.resamples(object = resample_results)
##
## Models: xgb, rf, glmnet, kknn, pls, tree
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## xgb 12.79761 15.80697 16.60513 17.04265 18.14508 22.35765 0
## rf 14.03544 16.30508 17.63306 17.98255 19.49935 22.30573 0
## glmnet 15.55843 18.51459 20.56696 20.72881 22.72348 27.41620 0
## kknn 13.60547 17.10898 18.26732 18.55507 20.65554 24.75693 0
## pls 17.78255 20.74412 21.77177 22.00556 22.99865 27.91764 0
## tree 15.88393 17.90236 18.85906 19.43720 21.03425 24.87198 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## xgb 20.98064 26.39972 29.56435 30.68598 33.67789 48.50493 0
## rf 21.27962 28.30013 33.90581 34.41890 37.74828 54.77143 0
## glmnet 19.75373 28.37603 38.50791 38.65085 46.63829 67.77671 0
## kknn 20.06282 28.26916 36.13062 35.99799 43.00113 58.09925 0
## pls 24.69686 34.34397 41.06235 40.94488 47.22071 60.42955 0
## tree 23.59381 30.24829 37.52072 37.50562 42.99712 55.32374 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## xgb 0.13884196 0.29097845 0.50341281 0.46940117 0.6007764 0.7558065 0
## rf 0.14094760 0.29004008 0.35409888 0.36264615 0.4283880 0.6641370 0
## glmnet 0.06394065 0.13670332 0.16485529 0.16937787 0.1936709 0.2681710 0
## kknn 0.15587050 0.22639517 0.28190003 0.28897522 0.3375357 0.5307537 0
## pls 0.04516654 0.06885199 0.08529575 0.08906215 0.1017315 0.1829123 0
## tree 0.04714377 0.12365097 0.19103784 0.23905879 0.2874392 0.6744623 0
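The three metrics in the summary can be computed by hand as in the sketch below. The observed and predicted values are hypothetical, and R-squared is taken here as the squared correlation between observed and predicted values, which is an assumption about how the reported figure is defined.

```r
# Hypothetical observed and predicted weekly case counts
obs  <- c(4, 5, 4, 3, 6, 2, 4, 5)
pred <- c(5, 4, 4, 5, 5, 3, 3, 6)

rmse <- sqrt(mean((obs - pred)^2))  # penalizes large errors more strongly
mae  <- mean(abs(obs - pred))       # average absolute error
r2   <- cor(obs, pred)^2            # squared correlation (assumed definition)

c(RMSE = rmse, MAE = mae, Rsquared = r2)
```

Lower RMSE and MAE and higher R-squared indicate a better fit. In the summary above, xgb has the lowest mean RMSE and MAE and the highest mean R-squared of the six models.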
Importing the test data features, to which the predictive model will be applied to predict the total number of cases per week at a future date.
testset <- getURL("https://s3.amazonaws.com/drivendata/data/44/public/dengue_features_test.csv")
testset <- read.csv(text=testset)
names(testset)
## [1] "city"
## [2] "year"
## [3] "weekofyear"
## [4] "week_start_date"
## [5] "ndvi_ne"
## [6] "ndvi_nw"
## [7] "ndvi_se"
## [8] "ndvi_sw"
## [9] "precipitation_amt_mm"
## [10] "reanalysis_air_temp_k"
## [11] "reanalysis_avg_temp_k"
## [12] "reanalysis_dew_point_temp_k"
## [13] "reanalysis_max_air_temp_k"
## [14] "reanalysis_min_air_temp_k"
## [15] "reanalysis_precip_amt_kg_per_m2"
## [16] "reanalysis_relative_humidity_percent"
## [17] "reanalysis_sat_precip_amt_mm"
## [18] "reanalysis_specific_humidity_g_per_kg"
## [19] "reanalysis_tdtr_k"
## [20] "station_avg_temp_c"
## [21] "station_diur_temp_rng_c"
## [22] "station_max_temp_c"
## [23] "station_min_temp_c"
## [24] "station_precip_mm"
# Drop columns 2 (year) and 4 (week_start_date) to match the training columns
dengue_test <- testset[, -c(2, 4)]
names(dengue_test)
## [1] "city"
## [2] "weekofyear"
## [3] "ndvi_ne"
## [4] "ndvi_nw"
## [5] "ndvi_se"
## [6] "ndvi_sw"
## [7] "precipitation_amt_mm"
## [8] "reanalysis_air_temp_k"
## [9] "reanalysis_avg_temp_k"
## [10] "reanalysis_dew_point_temp_k"
## [11] "reanalysis_max_air_temp_k"
## [12] "reanalysis_min_air_temp_k"
## [13] "reanalysis_precip_amt_kg_per_m2"
## [14] "reanalysis_relative_humidity_percent"
## [15] "reanalysis_sat_precip_amt_mm"
## [16] "reanalysis_specific_humidity_g_per_kg"
## [17] "reanalysis_tdtr_k"
## [18] "station_avg_temp_c"
## [19] "station_diur_temp_rng_c"
## [20] "station_max_temp_c"
## [21] "station_min_temp_c"
## [22] "station_precip_mm"
# Visualizing missing values for the test data
vis_miss(dengue_test)
Imputation of missing values in the test data.
pre.process <- preProcess(dengue_test, method = "bagImpute")
imputed.data <- predict(pre.process, dengue_test)
# Copy the imputed predictor columns (columns 3 to 22) back into dengue_test
impute_cols <- 3:22
dengue_test[, impute_cols] <- imputed.data[, impute_cols]
dim(dengue_test)
## [1] 416 22
anyNA(dengue_test)
## [1] FALSE
vis_miss(dengue_test)
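Predicting the total cases on the test data with the Extreme Gradient Boosting model, which had the lowest mean RMSE and MAE in the resampling summary.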
# predict values for test data
pred <- predict(model_xgb, dengue_test)
dengue_test$total_cases <- round(pred, digits = 0)
# Visualizing the time-series total cases on the test data
plot(dengue_test$total_cases)
# Summary of the predicted total cases
summary(dengue_test$total_cases)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.00 7.00 15.00 23.04 32.00 141.00
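The minimum of -4 in the summary shows that the model can return negative values, which cannot occur for case counts. One possible post-processing step, shown as a sketch on hypothetical raw predictions, is to round and then replace negative values with zero. This step is not part of the original workflow.

```r
# Hypothetical raw model outputs, including small negative values
raw_pred <- c(-3.6, -0.4, 7.2, 15.4, 140.8)

# Round to whole cases and replace negative values with zero
clean_pred <- pmax(round(raw_pred, digits = 0), 0)
clean_pred
```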
# Entering the predicted 'total_cases' from the test set into the submission format
submitformat <- getURL("https://s3.amazonaws.com/drivendata/data/44/public/submission_format.csv")
submitformat <- read.csv(text = submitformat)
submitformat$total_cases <- dengue_test$total_cases
# Exporting the output (total cases) to the local drive as a CSV file
write.csv(submitformat, "//Users//radhika//Documents//submit.csv", row.names = FALSE)