library(readr)
adult_h_data <- read_csv("~/Documents/School files/MS Program Spring 2016/Classes 2017/Fall 2017/CSE 891/HW/HW 3/adult_h.data.txt")
## Parsed with column specification:
## cols(
## age = col_integer(),
## workclass = col_character(),
## fnlweight = col_integer(),
## education = col_character(),
## educationnum = col_integer(),
## maritalstatus = col_character(),
## occupation = col_character(),
## relationship = col_character(),
## race = col_character(),
## gender = col_character(),
## capitalgain = col_integer(),
## capitalloss = col_integer(),
## hoursperweek = col_integer(),
## country = col_character(),
## incomeabove50 = col_character()
## )
adult_h_data$incomeabove50 <- ifelse(adult_h_data$incomeabove50 == '>50K',"Yes", "No")
table(adult_h_data$incomeabove50)
##
## No Yes
## 24720 7841
In this case there are far more No's than Yes's (24,720 vs. 7,841), so the classes are clearly imbalanced.
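A quick way to quantify this imbalance is to look at the class proportions; a minimal sketch using the recoded column:

# Share of each class in the full data set
prop.table(table(adult_h_data$incomeabove50))
# roughly 76% No vs. 24% Yes (24720 and 7841 out of 32561 rows)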
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(123)
TrainingIndex = createDataPartition(adult_h_data$incomeabove50, p = .67, list = FALSE, times = 1)
Training = adult_h_data[TrainingIndex,]
Test = adult_h_data[-TrainingIndex,]
summary(Test)
## age workclass fnlweight education
## Min. :17.00 Length:10744 Min. : 12285 Length:10744
## 1st Qu.:28.00 Class :character 1st Qu.: 117963 Class :character
## Median :37.00 Mode :character Median : 178508 Mode :character
## Mean :38.64 Mean : 190093
## 3rd Qu.:48.00 3rd Qu.: 237729
## Max. :90.00 Max. :1366120
## educationnum maritalstatus occupation relationship
## Min. : 1.0 Length:10744 Length:10744 Length:10744
## 1st Qu.: 9.0 Class :character Class :character Class :character
## Median :10.0 Mode :character Mode :character Mode :character
## Mean :10.1
## 3rd Qu.:13.0
## Max. :16.0
## race gender capitalgain capitalloss
## Length:10744 Length:10744 Min. : 0 Min. : 0.00
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.00
## Mode :character Mode :character Median : 0 Median : 0.00
## Mean : 1140 Mean : 90.09
## 3rd Qu.: 0 3rd Qu.: 0.00
## Max. :99999 Max. :3683.00
## hoursperweek country incomeabove50
## Min. : 1.00 Length:10744 Length:10744
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :40.44
## 3rd Qu.:45.00
## Max. :99.00
summary(Training)
## age workclass fnlweight education
## Min. :17.00 Length:21817 Min. : 18827 Length:21817
## 1st Qu.:28.00 Class :character 1st Qu.: 117767 Class :character
## Median :37.00 Mode :character Median : 178319 Mode :character
## Mean :38.55 Mean : 189623
## 3rd Qu.:47.00 3rd Qu.: 236861
## Max. :90.00 Max. :1484705
## educationnum maritalstatus occupation relationship
## Min. : 1.00 Length:21817 Length:21817 Length:21817
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.07
## 3rd Qu.:12.00
## Max. :16.00
## race gender capitalgain capitalloss
## Length:21817 Length:21817 Min. : 0 Min. : 0.00
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.00
## Mode :character Mode :character Median : 0 Median : 0.00
## Mean : 1047 Mean : 85.93
## 3rd Qu.: 0 3rd Qu.: 0.00
## Max. :99999 Max. :4356.00
## hoursperweek country incomeabove50
## Min. : 1.00 Length:21817 Length:21817
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :40.44
## 3rd Qu.:45.00
## Max. :99.00
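Because createDataPartition samples within each level of incomeabove50, the class balance should be roughly the same in both splits. A quick check (a sketch, not part of the original output):

# Class proportions in each split; both should be close to the
# 76% / 24% split seen in the full data because the partition is stratified.
prop.table(table(Training$incomeabove50))
prop.table(table(Test$incomeabove50))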
library(e1071)
Training[sapply(Training, is.character)] <- lapply(Training[sapply(Training, is.character)], as.factor)
model = naiveBayes(incomeabove50 ~ ., data = Training)
Printing the fitted model shows the a-priori probabilities, i.e. the overall probability of a Yes or a No in the training data.
It also shows one conditional probability table per attribute. For a categorical attribute such as 'workclass', the table gives the conditional probability of each value of that attribute given the class.
Since 'age' is a continuous variable, its table instead reports the mean and standard deviation of age for the No and Yes classes.
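These quantities can also be pulled straight out of the fitted object; a minimal sketch using the apriori and tables components of e1071's naiveBayes object (apriori holds the class counts):

# A-priori class probabilities (class counts divided by the total)
model$apriori / sum(model$apriori)
# Conditional probability table for the categorical attribute 'workclass'
model$tables$workclass
# Per-class mean and standard deviation for the continuous attribute 'age'
model$tables$age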
Test[sapply(Test, is.character)] <- lapply(Test[sapply(Test, is.character)], as.factor)
sapply(Test, function(r) { class(r)})
## age workclass fnlweight education educationnum
## "integer" "factor" "integer" "factor" "integer"
## maritalstatus occupation relationship race gender
## "factor" "factor" "factor" "factor" "factor"
## capitalgain capitalloss hoursperweek country incomeabove50
## "integer" "integer" "integer" "factor" "factor"
PRED = predict(model, Test)
mean(Test$incomeabove50 == PRED)
## [1] 0.8252048
confusionMatrix(predict(model, Test), Test$incomeabove50)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 7673 1394
## Yes 484 1193
##
## Accuracy : 0.8252
## 95% CI : (0.8179, 0.8323)
## No Information Rate : 0.7592
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4567
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9407
## Specificity : 0.4612
## Pos Pred Value : 0.8463
## Neg Pred Value : 0.7114
## Prevalence : 0.7592
## Detection Rate : 0.7142
## Detection Prevalence : 0.8439
## Balanced Accuracy : 0.7009
##
## 'Positive' Class : No
##
The accuracy is 0.825. A high accuracy alone does not make this a good model: about 76% of the test cases are No, so always predicting No would already reach an accuracy of roughly 0.76 (the no-information rate). Because 0.825 is well above that baseline (P-value < 2.2e-16), this is still a reasonably good model.
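One way to see that 0.825 is meaningful is to compare it against the no-information rate, i.e. the accuracy of always predicting the majority class; a minimal sketch using the objects defined above:

# Accuracy of the trivial 'always predict No' baseline (the no-information rate)
nir <- mean(Test$incomeabove50 == "No")   # about 0.759 on this test set
# The naive Bayes model beats this baseline by about 6.6 percentage points
mean(Test$incomeabove50 == PRED) - nir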
library(rminer)
mmetric(Test$incomeabove50, PRED, metric = c('ACC', 'PRECISION','TPR', 'F1'))
## ACC PRECISION1 PRECISION2 TPR1 TPR2 F11
## 82.52048 84.62557 71.13894 94.06645 46.11519 89.09661
## F12
## 55.95685
ACC: the overall accuracy, already established above. PRECISION: reported once with each class as the positive class; it measures the fraction of instances predicted positive that are truly positive. TPR: also reported for both classes, which is why there are two values; it is the percentage of actual positive instances that were correctly predicted positive (i.e. recall). F1: the harmonic mean of precision and recall, summarizing both at once for the overall model; it reaches its best value at 100% (or 1).
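These values can be reproduced by hand from the confusion matrix above; a minimal sketch, with the classes ordered No then Yes as in mmetric's output:

# Rebuild the confusion matrix and compute precision, recall (TPR) and F1 per class
cm <- table(Predicted = PRED, Actual = Test$incomeabove50)
precision <- diag(cm) / rowSums(cm)   # truly positive among predicted positive
recall    <- diag(cm) / colSums(cm)   # correctly predicted among actual positives
f1        <- 2 * precision * recall / (precision + recall)
round(100 * rbind(precision, recall, f1), 5)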