Naveen Dandiprolu
Thu Apr 16 13:24:36 2015
Question:
Predict whether a person's income is above or below $50K/yr, given a set of features (both quantitative and qualitative).
Data:
1994 US Census data describing 15 attributes for 32,561 individuals. (Source: UCI Machine Learning Repository)
Challenge:
Fitting a single model to a mix of qualitative and quantitative variables is not always straightforward, and achieving high accuracy can require more complex methods.
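One common way to feed qualitative variables to a model is to expand each factor into 0/1 indicator columns. The following is a minimal sketch using base R's model.matrix(); the toy data frame and its column names are illustrative assumptions, not the census file, and the report does not say this exact encoding was used.

```r
# Toy data frame; column names and values are illustrative only,
# not taken from the census data.
toy <- data.frame(
  age       = c(25, 38, 52, 41),
  sex       = factor(c("Male", "Female", "Female", "Male")),
  workclass = factor(c("Private", "State-gov", "Private", "Self-emp"))
)

# model.matrix() keeps numeric columns as they are and expands each factor
# into 0/1 indicator columns (default treatment contrasts: the first level
# of each factor is the baseline and gets no column of its own).
design <- model.matrix(~ age + sex + workclass, data = toy)
print(design)
```

Many R modeling functions do this expansion internally when given a formula, so the explicit call is mainly useful for inspecting what the model actually sees.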
Inference: Older people tend to earn more than younger people.
Inference: The higher the education level, the higher the probability of earning more than $50K/year.
Inference: Most people in the higher income group work more than 40 hours per week.
Inference: Final Weight has no impact on income.
Inference: People with higher income usually have higher capital gains, with very few exceptions.
Inference: People with high capital losses mostly fall in the low income group.
Inference: Very few females earn more than $50K/year.
Inference: The proportions of the income groups vary across working classes.
Inference: Most single people earn less than $50K per year.
Inference: The ratio of income groups differs sharply across occupations.
Inference: The percentage of Black individuals in the higher income group is much lower than that of white individuals.
Inference: Most of the data was collected from Americans, so this attribute has no significant effect on modeling.
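Partition the data into training (75%) and test (25%) sets:
library(caret)  # provides createDataPartition()
set.seed(1234)  # make the random split reproducible
# Row indices for the training set, sampled within each level of Class
inTrain <- createDataPartition(rawdata$Class, p = 0.75, list = FALSE)
train <- rawdata[inTrain, ]
test <- rawdata[-inTrain, ]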
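For a factor outcome, createDataPartition() samples within each class, so both parts keep roughly the same class balance. The idea can be sketched in base R as follows; the toy data frame and the names toy, idx_by_class, in_train, toy_train and toy_test are hypothetical, and this is an illustration of stratified sampling rather than caret's actual implementation.

```r
set.seed(1234)

# Toy outcome with a 3:1 class imbalance; values are illustrative only.
toy <- data.frame(
  id    = 1:400,
  Class = factor(rep(c("<=50K", ">50K"), times = c(300, 100)))
)

# Group row indices by class, then draw 75% of the indices within each class.
idx_by_class <- split(seq_len(nrow(toy)), toy$Class)
in_train <- unlist(
  lapply(idx_by_class, function(idx) sample(idx, size = round(0.75 * length(idx)))),
  use.names = FALSE
)

toy_train <- toy[in_train, ]
toy_test  <- toy[-in_train, ]

# The class proportions should be the same in both parts.
print(prop.table(table(toy_train$Class)))
print(prop.table(table(toy_test$Class)))
```

A plain random split would give similar proportions on average, but stratifying guards against an unlucky draw that under-represents the smaller class in the test set.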
Confusion Matrix and Statistics

          Reference
Prediction <=50K >50K
     <=50K  5815  727
     >50K    365 1233

               Accuracy : 0.8658
                 95% CI : (0.8583, 0.8732)
    No Information Rate : 0.7592
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.6084
 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.9409
            Specificity : 0.6291
         Pos Pred Value : 0.8889
         Neg Pred Value : 0.7716
             Prevalence : 0.7592
         Detection Rate : 0.7144
   Detection Prevalence : 0.8037
      Balanced Accuracy : 0.7850

       'Positive' Class : <=50K
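The headline statistics can be recomputed by hand from the four cell counts, which makes it easier to see why specificity is so much lower than sensitivity. This is a minimal sketch in base R; the variable names (cm, tp, fn, fp, tn) are assumptions introduced here, and the counts are copied from the confusion matrix above with <=50K as the positive class.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(c(5815, 365, 727, 1233), nrow = 2,
             dimnames = list(Prediction = c("<=50K", ">50K"),
                             Reference  = c("<=50K", ">50K")))

# With "<=50K" as the positive class:
tp <- cm["<=50K", "<=50K"]  # predicted <=50K, actually <=50K
fn <- cm[">50K",  "<=50K"]  # predicted >50K,  actually <=50K
fp <- cm["<=50K", ">50K"]   # predicted <=50K, actually >50K
tn <- cm[">50K",  ">50K"]   # predicted >50K,  actually >50K

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of actual <=50K cases found
specificity <- tn / (tn + fp)   # share of actual >50K cases found
ppv         <- tp / (tp + fp)   # Pos Pred Value
npv         <- tn / (tn + fn)   # Neg Pred Value
balanced    <- (sensitivity + specificity) / 2

print(round(c(Accuracy = accuracy, Sensitivity = sensitivity,
              Specificity = specificity, PPV = ppv, NPV = npv,
              BalancedAccuracy = balanced), 4))
```

Because the positive class is <=50K, the specificity of about 0.63 means that roughly 37% of the people who actually earn more than $50K are misclassified into the lower income group, even though overall accuracy is well above the no-information rate.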