2-class problem: Accuracy can overstate how good a model is and hide the small number of examples it gets wrong (ex: a model that is 99% accurate reads as essentially 100% accurate)
Overfitting: A model that does not generalize well because it is fit too closely to a given dataset
Confusion matrix
a: True positive
b: False negative
c: False positive
d: True negative
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (a + d) / (a + b + c + d)
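As a quick check of the accuracy formula, here is a minimal sketch with made-up counts (990 negatives, 10 positives, and a model that predicts negative for everything). The counts and variable names are assumptions for illustration only. It shows how accuracy can look high on an imbalanced 2-class problem even when every positive is missed.

```r
# Hypothetical counts for illustration only (not from any dataset in these notes)
tp <- 0    # a: true positives
fn <- 10   # b: false negatives
fp <- 0    # c: false positives
tn <- 990  # d: true negatives

# Accuracy = (a + d) / (a + b + c + d)
accuracy <- (tp + tn) / (tp + fn + fp + tn)
print(accuracy)
## [1] 0.99
```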
Precision metric
Precision = Number of correctly identified positives / Number of positive predictions (correct or not)
Measures the fraction of positive predictions that are correct
With perfect precision, there are no false positives
Recall/sensitivity
Recall = Number of correctly identified positives / All actual positives (true positives and false negatives, which were supposed to be positives)
With perfect recall, there are no false negatives
Precision and recall trade off against each other: improving one usually lowers the other
We decide which metric to use based on the cost of each kind of error
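The tradeoff can be seen by moving the decision threshold of a classifier. The sketch below uses made-up scores and labels, and `pr_at` is a hypothetical helper: a strict threshold gives high precision and low recall, while a lenient one does the opposite.

```r
# Hypothetical model scores and true labels (1 = positive), for illustration only
scores <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)

# Precision and recall when everything scoring >= threshold is called positive
pr_at <- function(threshold) {
  pred <- as.integer(scores >= threshold)
  tp <- sum(pred == 1 & labels == 1)
  fp <- sum(pred == 1 & labels == 0)
  fn <- sum(pred == 0 & labels == 1)
  c(threshold = threshold, precision = tp / (tp + fp), recall = tp / (tp + fn))
}

strict  <- pr_at(0.85)  # few positive predictions
lenient <- pr_at(0.35)  # many positive predictions
print(rbind(strict, lenient))
```

With these numbers, the strict threshold gives precision 1 and recall 0.5, and the lenient threshold gives precision 2/3 and recall 1. If false negatives are the costlier error, the lenient threshold is the better choice. If false positives are costlier, the strict one is.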
F1 Score
Harmonic mean of precision and recall - balance of both, not the highest of each
F1 = 2 * (recall * precision) / (recall + precision)
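The three formulas can be put together in one small function. This is a sketch only: `classification_metrics` is a hypothetical helper and the counts passed to it are made up.

```r
# Hypothetical helper: precision, recall, and F1 from confusion-matrix counts
classification_metrics <- function(tp, fn, fp, tn) {
  precision <- tp / (tp + fp)   # correct positives / all positive predictions
  recall    <- tp / (tp + fn)   # correct positives / all actual positives
  f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean
  c(precision = precision, recall = recall, f1 = f1)
}

# Made-up counts for illustration
m <- classification_metrics(tp = 40, fn = 10, fp = 20, tn = 930)
print(round(m, 3))
```

Here precision is 40/60 (about 0.667), recall is 40/50 (0.8), and F1 is about 0.727, which sits between the two but closer to the lower value, as expected of a harmonic mean.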
Decision trees: Allowing maximum depth to be higher
When only using one feature, the prediction error is higher
If model is too complex, reduce depth
If model is too simple, increase depth
Linear regression models: Reducing the regularization penalty
Boosted ensemble: Increasing the number of trees
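For decision trees, the depth adjustment can be sketched with `rpart.control(maxdepth = ...)`. The example assumes the built-in `mtcars` data as a stand-in for a real training set, and `fit_tree`, `n_leaves`, and `train_sse` are hypothetical helpers. It only shows that a deeper tree has more leaves and lower training error. Whether that is an improvement or overfitting has to be judged on held-out data.

```r
library(rpart)

# Assumption: mtcars (built into R) stands in for a real training set
fit_tree <- function(depth) {
  rpart(mpg ~ ., data = mtcars, method = "anova",
        control = rpart.control(maxdepth = depth, minsplit = 2, cp = 0))
}

shallow <- fit_tree(1)  # simpler model: a single split
deep    <- fit_tree(5)  # more complex model: up to 5 levels of splits

# Count terminal nodes and training error for each tree
n_leaves  <- function(fit) sum(fit$frame$var == "<leaf>")
train_sse <- function(fit) sum((mtcars$mpg - predict(fit, mtcars))^2)

print(c(shallow = n_leaves(shallow), deep = n_leaves(deep)))
print(c(shallow = train_sse(shallow), deep = train_sse(deep)))
```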
# Load the COVID dataset into R
covid <- read.csv("C:/Users/Monte Richardson/Desktop/Data Science Dojo/covid_19_data.csv", header=TRUE , na.strings = c(""))
# Explore the data set
dim(covid)
## [1] 4247 8
str(covid)
## 'data.frame': 4247 obs. of 8 variables:
## $ SNo : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ObservationDate: Factor w/ 47 levels "01/22/2020","01/23/2020",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Province.State : Factor w/ 182 levels " Montreal, QC",..: 4 8 21 45 47 51 52 53 54 57 ...
## $ Country.Region : Factor w/ 111 levels " Azerbaijan",..: 59 59 59 59 59 59 59 59 59 59 ...
## $ Last.Update : Factor w/ 1190 levels "1/22/2020 17:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Confirmed : num 1 14 6 1 0 26 2 1 4 1 ...
## $ Deaths : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Recovered : num 0 0 0 0 0 0 0 0 0 0 ...
# View the summary statistics of the dataset
summary(covid)
##       SNo       ObservationDate   Province.State        Country.Region
##  Min.   :   1   03/08/2020: 255   Anhui    :  47   Mainland China:1451
##  1st Qu.:1062   03/07/2020: 225   Beijing  :  47   US            : 783
##  Median :2124   03/06/2020: 199   Chongqing:  47   Australia     : 201
##  Mean   :2124   03/05/2020: 173   Fujian   :  47   Canada        : 133
##  3rd Qu.:3186   03/04/2020: 160   Gansu    :  47   Hong Kong     :  47
##  Max.   :4247   03/03/2020: 151   (Other)  :2514   Japan         :  47
##                 (Other)   :3084   NA's     :1498   (Other)       :1585
##               Last.Update     Confirmed            Deaths
##  2020-02-01T19:43:03:  63   Min.   :    0.0   Min.   :   0.00
##  2020-02-01T19:53:03:  63   1st Qu.:    1.0   1st Qu.:   0.00
##  1/31/2020 23:59    :  62   Median :    9.0   Median :   0.00
##  2020-02-24T23:33:02:  60   Mean   :  586.9   Mean   :  17.53
##  1/30/20 16:00      :  58   3rd Qu.:   99.5   3rd Qu.:   1.00
##  1/29/20 19:30      :  54   Max.   :67707.0   Max.   :2986.00
##  (Other)            :3887
##    Recovered
##  Min.   :    0.0
##  1st Qu.:    0.0
##  Median :    1.0
##  Mean   :  187.9
##  3rd Qu.:   16.0
##  Max.   :45235.0
##
# Any missing values?
# Count the missing values
sum(is.na(covid$Deaths))
## [1] 0
# Remove SNo, Last.Update, Province.State, and ObservationDate
covid.data <- covid[ , -c(1, 2, 3, 5)]
# Filtering (filter() and %>% come from dplyr)
library(dplyr)
covid2 <- covid.data %>%
  filter(Country.Region %in% c('Japan', 'US', 'Mainland China', 'Italy', 'Iran', 'South Korea'))
# View the data structure
str(covid2)
## 'data.frame': 2385 obs. of 4 variables:
## $ Country.Region: Factor w/ 111 levels " Azerbaijan",..: 59 59 59 59 59 59 59 59 59 59 ...
## $ Confirmed : num 1 14 6 1 0 26 2 1 4 1 ...
## $ Deaths : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Recovered : num 0 0 0 0 0 0 0 0 0 0 ...
summary(covid2)
##        Country.Region   Confirmed          Deaths          Recovered
##  Mainland China:1451   Min.   :    0   Min.   :   0.00   Min.   :    0.0
##  US            : 783   1st Qu.:    2   1st Qu.:   0.00   1st Qu.:    0.0
##  Japan         :  47   Median :   58   Median :   0.00   Median :    2.0
##  South Korea   :  47   Mean   : 1024   Mean   :  31.06   Mean   :  332.3
##  Italy         :  38   3rd Qu.:  281   3rd Qu.:   2.00   3rd Qu.:   86.0
##  Iran          :  19   Max.   :67707   Max.   :2986.00   Max.   :45235.0
##  (Other)       :   0
# Cast the categorical attribute Country.Region to factor
covid2$Country.Region <- as.factor(covid2$Country.Region)
# Randomly select 70% of the data as training set
set.seed(27)
train.index <- sample(1:nrow(covid2), 0.7*nrow(covid2))
covid.train <- covid2[train.index, ]
dim(covid.train)
## [1] 1669 4
summary(covid.train$Deaths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00    0.00    0.00   35.22    2.00 2959.00
# Use the remaining 30% as the testing data
covid.test <- covid2[-train.index,]
dim(covid.test)
## [1] 716 4
summary(covid.test$Deaths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00    0.00    0.00   21.35    2.00 2986.00
# rpart: You give it a formula, with the variable of interest on the left and the predictor variables on the right
library(rpart)
covid.dt.model <- rpart(Deaths ~ Recovered + Confirmed + Country.Region, data = covid.train, method = "anova")
# The decision tree object
print(covid.dt.model)
## n= 1669
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1669 122342500.0   35.224690
##   2) Recovered< 3072.5 1650   6386908.0    7.438788
##     4) Confirmed< 12349.5 1641    592748.0    3.231566 *
##     5) Confirmed>=12349.5 9    468900.2  774.555600 *
##   3) Recovered>=3072.5 19   4054171.0 2448.211000 *
# Let's examine the tree (rpart.plot() comes from the rpart.plot package)
library(rpart.plot)
rpart.plot(covid.dt.model)
This model is too simplistic. Needs more features to be informative.
# Predict deaths as the anova outcome
covid.dt.predictions <- predict(covid.dt.model, covid.test, type="matrix")
# Build the confusion matrix (rows: predicted values, columns: actual deaths)
covid.dt.confusion <- table(covid.dt.predictions, covid.test$Deaths)
print(covid.dt.confusion)
##
## covid.dt.predictions   0   1   2   3   4   5   6   7   8   9  10  11  12
## 3.2315661182206      447  85  44  33  12  14  31   7   5   1   1   2   6
## 774.555555555556       0   0   0   0   0   0   0   0   0   0   0   0   0
## 2448.21052631579       0   0   0   0   0   0   0   0   0   0   0   0   0
##
## covid.dt.predictions  13  17  19  20  22  28  29  40 124 125 145 197 549
## 3.2315661182206        5   3   3   1   2   1   1   1   1   1   1   1   0
## 774.555555555556       0   0   0   0   0   0   0   0   0   0   0   0   1
## 2448.21052631579       0   0   0   0   0   0   0   0   0   0   0   0   0
##
## covid.dt.predictions 1457 1596 1921 2346 2727 2986
## 3.2315661182206         0    0    0    0    0    0
## 774.555555555556        0    0    0    0    0    0
## 2448.21052631579        1    1    1    1    1    1
# Accuracy (the table is 3 x 32, not square, so diag() only picks up cells [1,1], [2,2], and [3,3])
covid.dt.accuracy <- sum(diag(covid.dt.confusion)) / sum(covid.dt.confusion)
print(covid.dt.accuracy)
## [1] 0.6243017
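Since Deaths is numeric and the tree predicts one of three leaf averages, error-based measures such as RMSE and MAE may describe the fit more directly than a cell-matching accuracy. The sketch below is an assumption about how that could look, not part of the original analysis. It uses five (actual, predicted) pairs read off the table above, with predictions rounded, and not the full `covid.test` set.

```r
# Five (actual, predicted) pairs taken from the table above, predictions rounded;
# illustrative only, not the full test set
actual    <- c(0, 1, 2, 549, 2986)
predicted <- c(3.23, 3.23, 3.23, 774.56, 2448.21)

errors <- predicted - actual
rmse <- sqrt(mean(errors^2))  # root mean squared error
mae  <- mean(abs(errors))     # mean absolute error
print(c(rmse = rmse, mae = mae))
```

On the real data, the same two lines would be applied to `covid.dt.predictions` and `covid.test$Deaths`.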
# Precision
covid.dt.precision <- covid.dt.confusion[2,2] / sum(covid.dt.confusion[2,])
print(covid.dt.precision)
## [1] 0
# Recall
covid.dt.recall <- covid.dt.confusion[2,2] / sum(covid.dt.confusion[,2])
print(covid.dt.recall)
## [1] 0
# F1 score
covid.dt.F1 <- 2 * covid.dt.precision * covid.dt.recall / (covid.dt.precision + covid.dt.recall)
print(covid.dt.F1)
## [1] NaN
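The NaN comes from 0 / 0: precision and recall are both 0, so the denominator of the F1 formula is 0. A minimal sketch of a guard, assuming the common convention of reporting F1 as 0 in that case. `safe_f1` is a hypothetical helper, not part of the original script.

```r
# Hypothetical helper: F1 that returns 0 instead of NaN when it is undefined
safe_f1 <- function(precision, recall) {
  # is.na() is also TRUE for NaN, e.g. a precision of 0/0 when nothing is predicted positive
  if (is.na(precision) || is.na(recall) || (precision + recall) == 0) {
    return(0)
  }
  2 * precision * recall / (precision + recall)
}

print(safe_f1(0, 0))
## [1] 0
```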