a. Load the csv data set adult_h.data into R

library(readr)
adult_h_data <- read_csv("~/Documents/School files/MS Program Spring 2016/Classes 2017/Fall 2017/CSE 891/HW/HW 3/adult_h.data.txt")
## Parsed with column specification:
## cols(
##   age = col_integer(),
##   workclass = col_character(),
##   fnlweight = col_integer(),
##   education = col_character(),
##   educationnum = col_integer(),
##   maritalstatus = col_character(),
##   occupation = col_character(),
##   relationship = col_character(),
##   race = col_character(),
##   gender = col_character(),
##   capitalgain = col_integer(),
##   capitalloss = col_integer(),
##   hoursperweek = col_integer(),
##   country = col_character(),
##   incomeabove50 = col_character()
## )

b. Rename the factor levels for the target variable “incomeabove50” to Yes/No where Yes represents >50K and No represents <=50K

adult_h_data$incomeabove50 <- ifelse(adult_h_data$incomeabove50 == '>50K',"Yes", "No")

c. Check the distribution of target variable. Is there a class imbalance problem?

table(adult_h_data$incomeabove50)
## 
##    No   Yes 
## 24720  7841

In this case there are fare more No’s than Yes’s.

d. Load the library caret. Partition the data set into training and testing sets using a 67/33 split. Use a seed to be able to repeat the same random split

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## Warning in as.POSIXlt.POSIXct(Sys.time()): unknown timezone 'default/
## America/Detroit'
set.seed(123)
TrainingIndex = createDataPartition(adult_h_data$incomeabove50, p = .67, list = FALSE, times = 1)

Training = adult_h_data[TrainingIndex,]
Test = adult_h_data[-TrainingIndex,]

e. Verify the number of records in each set.

summary(Test)
##       age         workclass           fnlweight        education        
##  Min.   :17.00   Length:10744       Min.   :  12285   Length:10744      
##  1st Qu.:28.00   Class :character   1st Qu.: 117963   Class :character  
##  Median :37.00   Mode  :character   Median : 178508   Mode  :character  
##  Mean   :38.64                      Mean   : 190093                     
##  3rd Qu.:48.00                      3rd Qu.: 237729                     
##  Max.   :90.00                      Max.   :1366120                     
##   educationnum  maritalstatus       occupation        relationship      
##  Min.   : 1.0   Length:10744       Length:10744       Length:10744      
##  1st Qu.: 9.0   Class :character   Class :character   Class :character  
##  Median :10.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.1                                                           
##  3rd Qu.:13.0                                                           
##  Max.   :16.0                                                           
##      race              gender           capitalgain     capitalloss     
##  Length:10744       Length:10744       Min.   :    0   Min.   :   0.00  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.00  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.00  
##                                        Mean   : 1140   Mean   :  90.09  
##                                        3rd Qu.:    0   3rd Qu.:   0.00  
##                                        Max.   :99999   Max.   :3683.00  
##   hoursperweek     country          incomeabove50     
##  Min.   : 1.00   Length:10744       Length:10744      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.44                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00
summary(Training)
##       age         workclass           fnlweight        education        
##  Min.   :17.00   Length:21817       Min.   :  18827   Length:21817      
##  1st Qu.:28.00   Class :character   1st Qu.: 117767   Class :character  
##  Median :37.00   Mode  :character   Median : 178319   Mode  :character  
##  Mean   :38.55                      Mean   : 189623                     
##  3rd Qu.:47.00                      3rd Qu.: 236861                     
##  Max.   :90.00                      Max.   :1484705                     
##   educationnum   maritalstatus       occupation        relationship      
##  Min.   : 1.00   Length:21817       Length:21817       Length:21817      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.07                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race              gender           capitalgain     capitalloss     
##  Length:21817       Length:21817       Min.   :    0   Min.   :   0.00  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.00  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.00  
##                                        Mean   : 1047   Mean   :  85.93  
##                                        3rd Qu.:    0   3rd Qu.:   0.00  
##                                        Max.   :99999   Max.   :4356.00  
##   hoursperweek     country          incomeabove50     
##  Min.   : 1.00   Length:21817       Length:21817      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.44                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00

f. Install/load library e1071

library(e1071)

g. Run naiveBayes of e1071 package

Training[sapply(Training, is.character)] <- lapply(Training[sapply(Training, is.character)], as.factor)
model = naiveBayes(incomeabove50 ~ . ,Training)

h. Examine the model by simply executing its name. What does the first table represent?

This is the total probability of there being a yes or a no.

i. What does the table “workclass” represent?

This table represents the conditional probability’s of attributes ‘workclass’ in the data. It gives the conditional probability of each value for that atribute.

j. Consider the table “age” in the model. What do the values represent?

Since age is acontinuse value the tables represents the mean and standard deviation for a No and a Yes.

k. Use the model to predict the test set. What is the accuracy. Is this a good model?

Test[sapply(Test, is.character)] <- lapply(Test[sapply(Test, is.character)], as.factor)
sapply(Test, function(r) { class(r)})
##           age     workclass     fnlweight     education  educationnum 
##     "integer"      "factor"     "integer"      "factor"     "integer" 
## maritalstatus    occupation  relationship          race        gender 
##      "factor"      "factor"      "factor"      "factor"      "factor" 
##   capitalgain   capitalloss  hoursperweek       country incomeabove50 
##     "integer"     "integer"     "integer"      "factor"      "factor"
PRED = predict(model, Test)
mean(Test$incomeabove50 == PRED)
## [1] 0.8252048
confusionMatrix(predict(model, Test), Test$incomeabove50)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  7673 1394
##        Yes  484 1193
##                                           
##                Accuracy : 0.8252          
##                  95% CI : (0.8179, 0.8323)
##     No Information Rate : 0.7592          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4567          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9407          
##             Specificity : 0.4612          
##          Pos Pred Value : 0.8463          
##          Neg Pred Value : 0.7114          
##              Prevalence : 0.7592          
##          Detection Rate : 0.7142          
##    Detection Prevalence : 0.8439          
##       Balanced Accuracy : 0.7009          
##                                           
##        'Positive' Class : No              
## 

The accuracy is .825 and Just becuase the accuracy is high does not mean that it is a good model. However, in this case this is still a good model.

l. Install/Load library rminer.

library(rminer)

m. Call mmetric on the test set predictions and output the following metrics: Accuracy,precision, true positive rate, f1. Comment on the results.

mmetric(Test$incomeabove50, PRED, metric = c('ACC', 'PRECISION','TPR', 'F1'))
##        ACC PRECISION1 PRECISION2       TPR1       TPR2        F11 
##   82.52048   84.62557   71.13894   94.06645   46.11519   89.09661 
##        F12 
##   55.95685

ACC: we already established Precision: score is looking at both classes and measuring the truly posative in the set predicted positive. TPR: THis is also looking at if both classes were the reference which is why there are 2. It is saying the percentage of possitive instances truly predicted positive. F1: is the combination of the percision and recall looking at the overall model reaching its best value at 100% or 1.