Publicly available datasets are very helpful for self-studying statistical models. The UCL machine learning repository has a wide list of datasets to be used for exercise. You can accede to the repository from here: https://archive.ics.uci.edu/ml/datasets.html
Most datasets are for ‘regression’ or ‘classification’. For each dataset, you’ll found a ‘Data Folder’ containing both the data and its description. The description report information on the source of the data and past usage of it. There is also a list of the variables (usually: Name, Data Type, Measure, and so on).
To be used, a dataset must be downloaded from the repository, uploaded in your R console, renamed and cleaned. The renaming of the dataset is necessary, since in general variables in the dataset have not their proper head.
So, what to do?
As an example, I will consider the “Adult Data Set” or “Census Income” dataset, which was suggested in this post in StackoverflowR:
Click on the dataset and save it into your folder.
The data will be saved as “adultData.txt” but it is, indeed, a csv file.
This is the most tricky part. Files in csv from outside Europe can be uploaded with the library(foreign), using the read.csv command. Files in csv from outside Europe can be uploaded with the library(foreign), using the read.csv2 command.
Let updload the file. I tend to use the command “file.choose()”, which opens a dialogue window (easier than set the working directory). You have to set “header” to FALSE since the dataset only has the values of the variables,
The head (name) of the variables have to be added, and when there are less than 30 variables, the simplest way is to do this by hand.
library(foreign)
adultData <- read.csv(file.choose(), header = FALSE)
Then you need to add the head to the variables.
names(adultData) <- c("age",
"workclass",
"fnlwgt",
"education",
"education.num",
"marital.status",
"occupation",
"relationship",
"race",
"sex",
"capital.gain",
"capital.loss",
"hours.per.week",
"native.country",
"salary")
Check the data.
dim(adultData)
head(adultData)
names(adultData)
Now you can save again your reworked dataset for future use.
I suggest use row.names=FALSE to avoid the creation of a list of rownames with numbers, quote=FALSE to avoid the saving add quote to some string. I use write.csv2 since I am in Europe. If you are not in Europe, use write.csv. By default, the file is saved in the current working directory.
write.csv2(dat, file = "adultDataRenamed.csv", row.names=FALSE, quote=FALSE)
When I downloaded and then updloaded the data, I’ve recognized a problem that is common with factors: the label had extra-space, as a consequence, when you call the labels (e.g., “Bachelors”), the system does not recognize it, since in the factor this levels has an extra-space:
" Bachelors“.
You can see this by calling the levels of the factor: levels(adultData$education).
You can remove whitespaces in a read call by setting the strip.white parameter to TRUE.
The dataset is too wide to be published here. My clean dataset can be dowloaded from here: http://www.insular.it/?wpdmact=process&did=OC5ob3RsaW5r
Since I am in an European country, you should use read.csv2 to use my dataset.
If you upload the dataset in the standard way, you can say that the factors’ labels have extra space.
adultData <- read.csv2(file.choose(), header = TRUE)
levels(adultData$education)
[1] " 10th" " 11th" " 12th" " 1st-4th"
[5] " 5th-6th" " 7th-8th" " 9th" " Assoc-acdm"
[9] " Assoc-voc" " Bachelors" " Doctorate" " HS-grad"
[13] " Masters" " Preschool" " Prof-school" " Some-college"
If you upload the dataset with strip.white = TRUE, you can see that factors’ labels have no extra space.
adultData <- read.csv2(file.choose(), header = TRUE, strip.white = TRUE)
levels(adultData$education)
[1] “10th” “11th” “12th” “1st-4th” “5th-6th”
[6] “7th-8th” “9th” “Assoc-acdm” “Assoc-voc” “Bachelors”
[11] “Doctorate” “HS-grad” “Masters” “Preschool” “Prof-school” [16] “Some-college”
adultData <- read.csv2("http://www.insular.it/?wpdmact=process&did=OC5ob3RsaW5r", header = TRUE, strip.white = TRUE)
dim(adultData)
## [1] 32561 15
head(adultData)
## age workclass fnlwgt education education.num marital.status
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital.gain capital.loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hours.per.week native.country salary
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
str(adultData)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education.num : int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital.gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week: int 40 13 40 40 40 40 16 45 50 40 ...
## $ native.country: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
## $ salary : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
summary(adultData)
## age workclass fnlwgt
## Min. :17.0 Private :22696 Min. : 12285
## 1st Qu.:28.0 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.0 Local-gov : 2093 Median : 178356
## Mean :38.6 ? : 1836 Mean : 189778
## 3rd Qu.:48.0 State-gov : 1298 3rd Qu.: 237051
## Max. :90.0 Self-emp-inc : 1116 Max. :1484705
## (Other) : 981
## education education.num marital.status
## HS-grad :10501 Min. : 1.0 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.0 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.0 Married-civ-spouse :14976
## Masters : 1723 Mean :10.1 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.0 Never-married :10683
## 11th : 1175 Max. :16.0 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship race
## Prof-specialty :4140 Husband :13193 Amer-Indian-Eskimo: 311
## Craft-repair :4099 Not-in-family : 8305 Asian-Pac-Islander: 1039
## Exec-managerial:4066 Other-relative: 981 Black : 3124
## Adm-clerical :3770 Own-child : 5068 Other : 271
## Sales :3650 Unmarried : 3446 White :27816
## Other-service :3295 Wife : 1568
## (Other) :9541
## sex capital.gain capital.loss hours.per.week
## Female:10771 Min. : 0 Min. : 0 Min. : 1.0
## Male :21790 1st Qu.: 0 1st Qu.: 0 1st Qu.:40.0
## Median : 0 Median : 0 Median :40.0
## Mean : 1078 Mean : 87 Mean :40.4
## 3rd Qu.: 0 3rd Qu.: 0 3rd Qu.:45.0
## Max. :99999 Max. :4356 Max. :99.0
##
## native.country salary
## United-States:29170 <=50K:24720
## Mexico : 643 >50K : 7841
## ? : 583
## Philippines : 198
## Germany : 137
## Canada : 121
## (Other) : 1709
Assign the dataset to an easy vector.
df <- adultData
Redefine the dataset (you do not want ‘fnlwgt’ in your dataset, believe me).
df <- df[, c(1:2,4:15)]
Apply a for cycle, using the number of columns as a counter. You can specify this by using dim(df)[2], because ‘dim’ return just two numbers: the number of rows (how may cases you have), the number of columns (how many variables you have in the dataset).
for (i in 1:dim(df)[2]) {
na <- names(df[i])
tab <- table(df[,i])
print(i) ## the number of the column where is the variable in the dataset
print(na) ## name of the variable
print(class(df[,i])) ## class of the variable
print(tab) ## frequency distribution of the values in the variable
cat("\n")
}
## [1] 1
## [1] "age"
## [1] "integer"
##
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 395 550 712 753 720 765 877 798 841 785 835 867 813 861 888 828 875 886
## 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
## 876 898 858 827 816 794 808 780 770 724 734 737 708 543 577 602 595 478
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
## 464 415 419 366 358 366 355 312 300 258 230 208 178 150 151 120 108 89
## 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
## 72 67 64 51 45 46 29 23 22 22 20 12 6 10 3 1 1 3
## 90
## 43
##
## [1] 2
## [1] "workclass"
## [1] "factor"
##
## ? Federal-gov Local-gov Never-worked
## 1836 960 2093 7
## Private Self-emp-inc Self-emp-not-inc State-gov
## 22696 1116 2541 1298
## Without-pay
## 14
##
## [1] 3
## [1] "education"
## [1] "factor"
##
## 10th 11th 12th 1st-4th 5th-6th
## 933 1175 433 168 333
## 7th-8th 9th Assoc-acdm Assoc-voc Bachelors
## 646 514 1067 1382 5355
## Doctorate HS-grad Masters Preschool Prof-school
## 413 10501 1723 51 576
## Some-college
## 7291
##
## [1] 4
## [1] "education.num"
## [1] "integer"
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 51 168 333 646 514 933 1175 433 10501 7291 1382 1067
## 13 14 15 16
## 5355 1723 576 413
##
## [1] 5
## [1] "marital.status"
## [1] "factor"
##
## Divorced Married-AF-spouse Married-civ-spouse
## 4443 23 14976
## Married-spouse-absent Never-married Separated
## 418 10683 1025
## Widowed
## 993
##
## [1] 6
## [1] "occupation"
## [1] "factor"
##
## ? Adm-clerical Armed-Forces Craft-repair
## 1843 3770 9 4099
## Exec-managerial Farming-fishing Handlers-cleaners Machine-op-inspct
## 4066 994 1370 2002
## Other-service Priv-house-serv Prof-specialty Protective-serv
## 3295 149 4140 649
## Sales Tech-support Transport-moving
## 3650 928 1597
##
## [1] 7
## [1] "relationship"
## [1] "factor"
##
## Husband Not-in-family Other-relative Own-child Unmarried
## 13193 8305 981 5068 3446
## Wife
## 1568
##
## [1] 8
## [1] "race"
## [1] "factor"
##
## Amer-Indian-Eskimo Asian-Pac-Islander Black
## 311 1039 3124
## Other White
## 271 27816
##
## [1] 9
## [1] "sex"
## [1] "factor"
##
## Female Male
## 10771 21790
##
## [1] 10
## [1] "capital.gain"
## [1] "integer"
##
## 0 114 401 594 914 991 1055 1086 1111 1151 1173 1409
## 29849 6 2 34 8 5 25 4 1 8 3 7
## 1424 1455 1471 1506 1639 1797 1831 1848 2009 2036 2050 2062
## 3 1 7 15 1 7 7 6 3 4 5 2
## 2105 2174 2176 2202 2228 2290 2329 2346 2354 2387 2407 2414
## 9 48 23 16 5 5 6 6 11 1 19 8
## 2463 2538 2580 2597 2635 2653 2829 2885 2907 2936 2961 2964
## 11 1 12 20 11 5 31 24 11 3 3 9
## 2977 2993 3103 3137 3273 3325 3411 3418 3432 3456 3464 3471
## 8 2 97 37 6 53 24 5 4 2 23 8
## 3674 3781 3818 3887 3908 3942 4064 4101 4386 4416 4508 4650
## 14 12 7 6 32 14 42 20 70 12 12 41
## 4687 4787 4865 4931 4934 5013 5060 5178 5455 5556 5721 6097
## 3 23 17 1 7 69 1 97 11 5 3 1
## 6360 6418 6497 6514 6723 6767 6849 7298 7430 7443 7688 7896
## 3 9 11 5 2 5 27 246 9 5 284 3
## 7978 8614 9386 9562 10520 10566 10605 11678 13550 14084 14344 15020
## 1 55 22 4 43 6 12 2 27 41 26 5
## 15024 15831 18481 20051 22040 25124 25236 27828 34095 41310 99999
## 347 6 2 37 1 4 11 34 5 2 159
##
## [1] 11
## [1] "capital.loss"
## [1] "integer"
##
## 0 155 213 323 419 625 653 810 880 974 1092 1138
## 31042 1 4 3 3 12 3 2 6 2 7 2
## 1258 1340 1380 1408 1411 1485 1504 1539 1564 1573 1579 1590
## 4 7 7 21 1 51 18 1 25 6 20 40
## 1594 1602 1617 1628 1648 1651 1668 1669 1672 1719 1721 1726
## 8 47 9 15 2 9 4 24 34 22 18 4
## 1735 1740 1741 1755 1762 1816 1825 1844 1848 1876 1887 1902
## 2 42 24 2 14 2 4 1 51 39 159 202
## 1944 1974 1977 1980 2001 2002 2042 2051 2057 2080 2129 2149
## 1 18 168 23 24 21 9 21 6 1 3 2
## 2163 2174 2179 2201 2205 2206 2231 2238 2246 2258 2267 2282
## 1 7 15 1 9 6 3 2 6 25 3 1
## 2339 2352 2377 2392 2415 2444 2457 2467 2472 2489 2547 2559
## 17 2 20 9 49 12 3 1 1 1 4 12
## 2603 2754 2824 3004 3683 3770 3900 4356
## 5 2 10 2 2 2 2 3
##
## [1] 12
## [1] "hours.per.week"
## [1] "integer"
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 20 32 39 54 60 64 26 145 18 278 11 173
## 13 14 15 16 17 18 19 20 21 22 23 24
## 23 34 404 205 29 75 14 1224 24 44 21 252
## 25 26 27 28 29 30 31 32 33 34 35 36
## 674 30 30 86 7 1149 5 266 39 28 1297 220
## 37 38 39 40 41 42 43 44 45 46 47 48
## 149 476 38 15217 36 219 151 212 1824 82 49 517
## 49 50 51 52 53 54 55 56 57 58 59 60
## 29 2819 13 138 25 41 694 97 17 28 5 1475
## 61 62 63 64 65 66 67 68 70 72 73 74
## 2 18 10 14 244 17 4 12 291 71 2 1
## 75 76 77 78 80 81 82 84 85 86 87 88
## 66 3 6 8 133 3 1 45 13 2 1 2
## 89 90 91 92 94 95 96 97 98 99
## 2 29 3 1 1 2 5 2 11 85
##
## [1] 13
## [1] "native.country"
## [1] "factor"
##
## ? Cambodia
## 583 19
## Canada China
## 121 75
## Columbia Cuba
## 59 95
## Dominican-Republic Ecuador
## 70 28
## El-Salvador England
## 106 90
## France Germany
## 29 137
## Greece Guatemala
## 29 64
## Haiti Holand-Netherlands
## 44 1
## Honduras Hong
## 13 20
## Hungary India
## 13 100
## Iran Ireland
## 43 24
## Italy Jamaica
## 73 81
## Japan Laos
## 62 18
## Mexico Nicaragua
## 643 34
## Outlying-US(Guam-USVI-etc) Peru
## 14 31
## Philippines Poland
## 198 60
## Portugal Puerto-Rico
## 37 114
## Scotland South
## 12 80
## Taiwan Thailand
## 51 18
## Trinadad&Tobago United-States
## 19 29170
## Vietnam Yugoslavia
## 67 16
##
## [1] 14
## [1] "salary"
## [1] "factor"
##
## <=50K >50K
## 24720 7841
It will always happen that some variable have to be explicitly called as factor, and some other variables have to be reorganized.
It is a good thing to plot the associations between variables.
In this dataset you may be interested in looking whether there is a link between marital status and salary.
sel <- c("marital.status", "salary") # criterion for selection
dat <- subset(adultData, select = sel) # subset based on the predefined selection
# Rename the labels of the factor 'marital.status' for plotting
# Added \n' to make room for the label when plotted
levels(dat$marital.status) <- c("Divorced", "Married \n AF-spouse","Married \n civ-spouse",
"Married \nspouse-absent", "Never-married", "Separated", "Widowed")
library(wesanderson) # Did I mention I'm a fan of Wes Anderson's movies?
## Warning: package 'wesanderson' was built under R version 3.1.3
pal <- wes_palette(name = "GrandBudapest", type = "discrete") # this does not work in knitr, unless you have installed the package in RStudio
par(las=2) # Rotated labels
par(mar = c(9, 5, 5, 3) + 0.1) # Increase bottom margin to make room for rotated labels
# In the label of y axis added '\n' to make room for the label
# Use of expression(italic("Your label")) to have the lables printed in italic
plot(dat, col = pal, ylab=expression(italic("Salary\n")),xlab="", main=expression(italic("Marital status")), cex=0.8)
## Warning: metrica carattere sconosciuta per carattere 0xa
## Warning: metrica carattere sconosciuta per carattere 0xa
Essentially, if you are ‘Married-civ-spouse’ you are more likely to get a salary with more than 50K US dollars.
Now, something more difficult: worked hours per week by country.
sel <- c("native.country", "hours.per.week") # criterion for selection
dat <- subset(adultData, select = sel) # subset based on the predefined selection
library(wesanderson) # I've already mentioned I'm a fan of Wes Anderson's movies
pal <- wes_palette(name = "Royal1", type = "continuous")
par(las=2) # Rotated labels
par(mar = c(9, 5, 5, 3) + 0.1) # Increased bottom margin to make room for rotated labels
# Use of expression(italic("Your label")) to have the lables printed in italic
plot(dat, col = pal, ylab=expression(italic("Worked hours per week")),xlab="", cex.axis=0.7,cex.lab=1.25, col.lab="#D67236",
main=expression(italic("Native country")), col.main="#D67236")
dev.off()
## null device
## 1
According to the native country, some people work more hours per week than other, and yes, there is a link between worked hours per week and salary. You work more, you get a better salary (there are a lot of outliers, though).
sel <- c("salary", "hours.per.week") # criterion for selection
dat <- subset(adultData, select = sel) # subset based on the predefined selection
library(wesanderson) # I've already mentioned I'm a fan of Wes Anderson's movies
pal <- wes_palette(name = "Cavalcanti", type = "continuous")
par(las=2) # Rotated labels
par(mar = c(9, 5, 5, 3) + 0.1) # Increased bottom margin to make room for rotated labels
# Use of expression(italic("Your label")) to have the lables printed in italic
plot(dat, col = pal, ylab=expression(italic("Worked hours per week")),xlab=expression(italic("Salary")),
cex.axis=0.7,cex.lab=1.25, col.lab="#81A88D")
In this dataset you may be also interested in looking whether there is a link between education and salary.
# reordering of the levels of the 'education' factor, to have the levels ordered from lowest to highest education
df$education<- ordered(df$education,levels = c("Preschool","1st-4th","5th-6th","7th-8th","9th","10th","11th","12th","HS-grad","Some-college","Assoc-voc","Assoc-acdm","Bachelors","Masters","Prof-school","Doctorate"))
par(mar = c(9, 8, 5, 3) + 0.1)
par(las=2)
# creation of a palette with colors between two intervals (in this example, blue to red)
colfunc <- colorRampPalette(c("blue", "red"))
# the colfunc is used to create 16 intervals of colors between the two main colors
plot(df$education~df$salary, col=colfunc(16), ylab="",xlab="Salary", cex.axis=0.7)
Apparently, as higher the education, as more likely to get a salary with more than 50K US dollars.
Any gender difference in salary distribution?
plot(df$sex~df$salary, col=c("green", "yellow"), ylab="Sex",xlab="Salary")
Here we go: female workers are less likely to get a salary with more than 50K US dollars.
Any link between gender and education?
par(mar = c(9, 8, 5, 3) + 0.1)
par(las=2)
# creation of a palette with colors between two intervals (in this example, white to black)
colfunc <- colorRampPalette(c("white", "black"))
# the colfunc is used to create 16 intervals of colors between the two main colors
plot(df$education~df$sex, col=colfunc(16), ylab="",xlab="Sex")
Apparently, males had higher education than females.
prop <- prop.table(table(df$sex,df$education))
round(prop*100,1)
##
## Preschool 1st-4th 5th-6th 7th-8th 9th 10th 11th 12th HS-grad
## Female 0.0 0.1 0.3 0.5 0.4 0.9 1.3 0.4 10.4
## Male 0.1 0.4 0.8 1.5 1.1 2.0 2.3 0.9 21.8
##
## Some-college Assoc-voc Assoc-acdm Bachelors Masters Prof-school
## Female 8.6 1.5 1.3 5.0 1.6 0.3
## Male 13.8 2.7 2.0 11.5 3.6 1.5
##
## Doctorate
## Female 0.3
## Male 1.0
options(scipen=999) # this will eliminate the scientific notation with exponentials
chisq.test(df$sex,df$education)
##
## Pearson's Chi-squared test
##
## data: df$sex and df$education
## X-squared = 297.7, df = 15, p-value < 0.00000000000000022
Is this difference a reasonable explanation for gender differences in salary?
A smart guy, Duy Bui, posted an interesting model of this dataset, which was based on the ‘caret’ package (a very interesting tool for predictive analytics and machine learning). You may found information on this package at: http://topepo.github.io/caret/index.html
Here the model of Duy Bui:
library(caret) # same as before, this does not work in knitr, unless you have installed the package in RStudio
## Warning: package 'caret' was built under R version 3.1.3
## Loading required package: lattice
## Loading required package: ggplot2
The variables that were used in the model were a selection of the attributes.
selected <- c("age", "education", "marital.status", "relationship", "sex", "hours.per.week", "salary")
adultData <- subset(adultData, select = selected)
The training set and the test set were created by using a useful feature of the ‘caret’ package, the command ‘createDataPartition’. A random index is created, which can be used to select the training set and the test set. I’ve reduced the data set to 60% only (which is accetable for training).
trainIndex = createDataPartition(adultData$salary, p=0.60, list=FALSE)
training = adultData[ trainIndex, ]
I also added a test-set.
test = adultData[ -trainIndex, ]
The model is based on the recursive partitioning for classification, regression and survival trees (rpart). See here for details: http://cran.r-project.org/web/packages/rpart/index.html
set.seed(33833) # for reproducibility
modFit <- train(salary ~ ., method = "rpart", data=training)
## Loading required package: rpart
This analysis will take some time, so I add an alarm sound to warn me when the analysis had completed.
Why I do not use alarm()? Because, for no clear reasons, it does not work in my PC.
You can skip this
library(audio)
beep <- function(n = 4){
for(i in seq(n)){
play(sin(1:10000/20))
Sys.sleep(.5)
}
}
beep()
### done
### done
Overall accuracy (tested on the test set, a NEW dataset than the one in which the model had been evaluated). This script is based on ‘Practical Data Science with R’, authored by Nina Zumel and John Mount, a very good reading.
prediction <- predict(modFit, newdata=test)
tab <- table(prediction, test$salary)
sum(diag(tab))/sum(tab)
## [1] 0.8142
Better testing with the caret package.
rpartPred<-predict(modFit,test)
confusionMatrix(rpartPred,test$salary)
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 9419 1951
## >50K 469 1185
##
## Accuracy : 0.814
## 95% CI : (0.807, 0.821)
## No Information Rate : 0.759
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.394
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.953
## Specificity : 0.378
## Pos Pred Value : 0.828
## Neg Pred Value : 0.716
## Prevalence : 0.759
## Detection Rate : 0.723
## Detection Prevalence : 0.873
## Balanced Accuracy : 0.665
##
## 'Positive' Class : <=50K
##
Plot the model (not really clear).
library(rattle)
## Warning: package 'rattle' was built under R version 3.1.3
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(modFit$finalModel)
Alternative.
library(partykit)
## Warning: package 'partykit' was built under R version 3.1.3
## Loading required package: grid
finalModel <-as.party(modFit$finalModel)
par(mar = c(5, 8, 5, 3) + 0.1)
plot(finalModel)
We are interested in the prediction with the new data value as specified by the investigator:
fem <- data.frame(age = 40, education = "Masters", marital.status = "Married-civ-spouse", relationship = "Own-child", sex = "Female", hours.per.week = 40)
predict(modFit, newdata = fem)
## [1] >50K
## Levels: <=50K >50K
mal <- data.frame(age = 40, education = "Masters", marital.status = "Married-civ-spouse", relationship = "Own-child", sex = "Male", hours.per.week = 40)
predict(modFit, newdata = mal)
## [1] >50K
## Levels: <=50K >50K
Same socio-demographic profile but different gender seems not related to differences in salary. Uhm!… Probably more analyses are required.
I hope you’ve enjoyed this!