This project is an exercise prepared within UDEMY’s course: “R for Data Science and Machine Learning”.
In this project we will be working with the UCI adult dataset. We will be attempting to predict if people in the data set belong in a certain class by salary, either making <=50k or >50k per year. Most of this project is focus on the different issues that may arise when cleaning data.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
adult <- read_csv("adult_sal.csv")
## New names:
## * `` -> ...1
## Rows: 32561 Columns: 16
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): type_employer, education, marital, occupation, relationship, race, ...
## dbl (7): ...1, age, fnlwgt, education_num, capital_gain, capital_loss, hr_pe...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(adult, 10)
## # A tibble: 10 x 16
## ...1 age type_employer fnlwgt education education_num marital occupation
## <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 39 State-gov 77516 Bachelors 13 Never-m~ Adm-cleri~
## 2 2 50 Self-emp-not-~ 83311 Bachelors 13 Married~ Exec-mana~
## 3 3 38 Private 215646 HS-grad 9 Divorced Handlers-~
## 4 4 53 Private 234721 11th 7 Married~ Handlers-~
## 5 5 28 Private 338409 Bachelors 13 Married~ Prof-spec~
## 6 6 37 Private 284582 Masters 14 Married~ Exec-mana~
## 7 7 49 Private 160187 9th 5 Married~ Other-ser~
## 8 8 52 Self-emp-not-~ 209642 HS-grad 9 Married~ Exec-mana~
## 9 9 31 Private 45781 Masters 14 Never-m~ Prof-spec~
## 10 10 42 Private 159449 Bachelors 13 Married~ Exec-mana~
## # ... with 8 more variables: relationship <chr>, race <chr>, sex <chr>,
## # capital_gain <dbl>, capital_loss <dbl>, hr_per_week <dbl>, country <chr>,
## # income <chr>
Dropping duplicated index column
adult <- select(adult, -1)
str(adult)
## tibble [32,561 x 15] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:32561] 39 50 38 53 28 37 49 52 31 42 ...
## $ type_employer: chr [1:32561] "State-gov" "Self-emp-not-inc" "Private" "Private" ...
## $ fnlwgt : num [1:32561] 77516 83311 215646 234721 338409 ...
## $ education : chr [1:32561] "Bachelors" "Bachelors" "HS-grad" "11th" ...
## $ education_num: num [1:32561] 13 13 9 7 13 14 5 9 14 13 ...
## $ marital : chr [1:32561] "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
## $ occupation : chr [1:32561] "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
## $ relationship : chr [1:32561] "Not-in-family" "Husband" "Not-in-family" "Husband" ...
## $ race : chr [1:32561] "White" "White" "White" "Black" ...
## $ sex : chr [1:32561] "Male" "Male" "Male" "Male" ...
## $ capital_gain : num [1:32561] 2174 0 0 0 0 ...
## $ capital_loss : num [1:32561] 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_per_week : num [1:32561] 40 13 40 40 40 40 16 45 50 40 ...
## $ country : chr [1:32561] "United-States" "United-States" "United-States" "United-States" ...
## $ income : chr [1:32561] "<=50K" "<=50K" "<=50K" "<=50K" ...
head(adult)
## # A tibble: 6 x 15
## age type_employer fnlwgt education education_num marital occupation
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 39 State-gov 77516 Bachelors 13 Never-marr~ Adm-clerical
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-ci~ Exec-manage~
## 3 38 Private 215646 HS-grad 9 Divorced Handlers-cl~
## 4 53 Private 234721 11th 7 Married-ci~ Handlers-cl~
## 5 28 Private 338409 Bachelors 13 Married-ci~ Prof-specia~
## 6 37 Private 284582 Masters 14 Married-ci~ Exec-manage~
## # ... with 8 more variables: relationship <chr>, race <chr>, sex <chr>,
## # capital_gain <dbl>, capital_loss <dbl>, hr_per_week <dbl>, country <chr>,
## # income <chr>
summary(adult)
## age type_employer fnlwgt education
## Min. :17.00 Length:32561 Min. : 12285 Length:32561
## 1st Qu.:28.00 Class :character 1st Qu.: 117827 Class :character
## Median :37.00 Mode :character Median : 178356 Mode :character
## Mean :38.58 Mean : 189778
## 3rd Qu.:48.00 3rd Qu.: 237051
## Max. :90.00 Max. :1484705
## education_num marital occupation relationship
## Min. : 1.00 Length:32561 Length:32561 Length:32561
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.08
## 3rd Qu.:12.00
## Max. :16.00
## race sex capital_gain capital_loss
## Length:32561 Length:32561 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.0
## Mode :character Mode :character Median : 0 Median : 0.0
## Mean : 1078 Mean : 87.3
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :99999 Max. :4356.0
## hr_per_week country income
## Min. : 1.00 Length:32561 Length:32561
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :40.44
## 3rd Qu.:45.00
## Max. :99.00
colnames(adult)
## [1] "age" "type_employer" "fnlwgt" "education"
## [5] "education_num" "marital" "occupation" "relationship"
## [9] "race" "sex" "capital_gain" "capital_loss"
## [13] "hr_per_week" "country" "income"
unique(adult$type_employer)
## [1] "State-gov" "Self-emp-not-inc" "Private" "Federal-gov"
## [5] "Local-gov" "?" "Self-emp-inc" "Without-pay"
## [9] "Never-worked"
unique(adult$marital)
## [1] "Never-married" "Married-civ-spouse" "Divorced"
## [4] "Married-spouse-absent" "Separated" "Married-AF-spouse"
## [7] "Widowed"
unique(adult$race)
## [1] "White" "Black" "Asian-Pac-Islander"
## [4] "Amer-Indian-Eskimo" "Other"
unique(adult$sex)
## [1] "Male" "Female"
unique(adult$country)
## [1] "United-States" "Cuba"
## [3] "Jamaica" "India"
## [5] "?" "Mexico"
## [7] "South" "Puerto-Rico"
## [9] "Honduras" "England"
## [11] "Canada" "Germany"
## [13] "Iran" "Philippines"
## [15] "Italy" "Poland"
## [17] "Columbia" "Cambodia"
## [19] "Thailand" "Ecuador"
## [21] "Laos" "Taiwan"
## [23] "Haiti" "Portugal"
## [25] "Dominican-Republic" "El-Salvador"
## [27] "France" "Guatemala"
## [29] "China" "Japan"
## [31] "Yugoslavia" "Peru"
## [33] "Outlying-US(Guam-USVI-etc)" "Scotland"
## [35] "Trinadad&Tobago" "Greece"
## [37] "Nicaragua" "Vietnam"
## [39] "Hong" "Ireland"
## [41] "Hungary" "Holand-Netherlands"
It seems read_csv function has avoided reading some columns as factors. It could be useful to have sex, race and income columns as factor variables in further analysis.
adult$sex <- as.factor(adult$sex)
adult$race <- as.factor(adult$race)
adult$income <- as.factor(adult$income)
str(adult)
## tibble [32,561 x 15] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:32561] 39 50 38 53 28 37 49 52 31 42 ...
## $ type_employer: chr [1:32561] "State-gov" "Self-emp-not-inc" "Private" "Private" ...
## $ fnlwgt : num [1:32561] 77516 83311 215646 234721 338409 ...
## $ education : chr [1:32561] "Bachelors" "Bachelors" "HS-grad" "11th" ...
## $ education_num: num [1:32561] 13 13 9 7 13 14 5 9 14 13 ...
## $ marital : chr [1:32561] "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
## $ occupation : chr [1:32561] "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
## $ relationship : chr [1:32561] "Not-in-family" "Husband" "Not-in-family" "Husband" ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital_gain : num [1:32561] 2174 0 0 0 0 ...
## $ capital_loss : num [1:32561] 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_per_week : num [1:32561] 40 13 40 40 40 40 16 45 50 40 ...
## $ country : chr [1:32561] "United-States" "United-States" "United-States" "United-States" ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
type_employer column
table(adult$type_employer)
##
## ? Federal-gov Local-gov Never-worked
## 1836 960 2093 7
## Private Self-emp-inc Self-emp-not-inc State-gov
## 22696 1116 2541 1298
## Without-pay
## 14
Table analysis of column “type_employer” shows that there is room for some cleaning. The aim here is to consolidate features to have a more robust model down the road.
We will start by assigning a correct NA value to the “?” text used in 1836 observations
adult <- adult %>%
mutate(type_employer = ifelse(type_employer=="?", NA, type_employer)) %>%
mutate(type_employer = ifelse(type_employer=="Never-worked" | type_employer == "Without-pay",
"Unemployed", type_employer)) %>%
mutate(type_employer = ifelse(type_employer=="Local-gov" | type_employer=="State-gov",
"SL-gov", type_employer)) %>%
mutate(type_employer = ifelse(type_employer == "Self-emp-inc" | type_employer == "Self-emp-not-inc", "self-empl", type_employer))
table(adult$type_employer)
##
## Federal-gov Private self-empl SL-gov Unemployed
## 960 22696 3657 3391 21
sum(is.na(adult$type_employer))
## [1] 1836
Marital Column
table(adult$marital)
##
## Divorced Married-AF-spouse Married-civ-spouse
## 4443 23 14976
## Married-spouse-absent Never-married Separated
## 418 10683 1025
## Widowed
## 993
Let’s reduce marital status groups into only 3 categories: Married, Not-Married, Never-Married
adult <- adult %>%
mutate(marital=ifelse(marital=="Married-AF-spouse"|marital=="Married-civ-spouse"|
marital=="Married-spouse-absent", "Married", marital))%>%
mutate(marital=ifelse(marital=="Divorced"|marital=="Separated"|marital=="Widowed",
"Not-Married", marital))
Country Column
table(adult$country)
##
## ? Cambodia
## 583 19
## Canada China
## 121 75
## Columbia Cuba
## 59 95
## Dominican-Republic Ecuador
## 70 28
## El-Salvador England
## 106 90
## France Germany
## 29 137
## Greece Guatemala
## 29 64
## Haiti Holand-Netherlands
## 44 1
## Honduras Hong
## 13 20
## Hungary India
## 13 100
## Iran Ireland
## 43 24
## Italy Jamaica
## 73 81
## Japan Laos
## 62 18
## Mexico Nicaragua
## 643 34
## Outlying-US(Guam-USVI-etc) Peru
## 14 31
## Philippines Poland
## 198 60
## Portugal Puerto-Rico
## 37 114
## Scotland South
## 12 80
## Taiwan Thailand
## 51 18
## Trinadad&Tobago United-States
## 19 29170
## Vietnam Yugoslavia
## 67 16
As expected, the number of countries observed is somewhat large. We will proceed to group them by subcontinents under a new variable called “region”. The parameter to group this items will reference from Wikipedia page “List of Countries by regional classification” (https://meta.wikimedia.org/wiki/List_of_countries_by_regional_classification).
countries <- read_csv("country_list.csv")
## Rows: 248 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): Country, Region, Global South
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(countries)
## [1] "Country" "Region" "Global South"
Country’s names in country classification list do no contain hyphens. We will strip hyphens from the adult data frame in order to facilitate join ahead.
adult$country <- gsub("-", " ", adult$country)
Finally, before proceeding with the join, let’s do a couple more corrections
adult <- adult %>%
mutate(country = ifelse(country=="?", NA, country)) %>%
mutate(country = ifelse(country=="Outlying US(Guam USVI etc)",
"United States Minor Outlying Islands", country))%>%
mutate(country = ifelse(country=="Trinadad&Tobago", "Trinidad and Tobago", country))%>%
mutate(country = ifelse(country=="Columbia", "Colombia", country))%>%
mutate(country = ifelse(country=="Holand Netherlands", "Netherlands", country))
adult <- left_join(adult, countries, by = c("country" = "Country"))
Let’s explore NAs after the join
paste("There are", sum(is.na(adult$country)), "NA countries in the adult DF")
## [1] "There are 583 NA countries in the adult DF"
paste("There are", sum(is.na(adult$Region)), "NA regions in the adult DF")
## [1] "There are 862 NA regions in the adult DF"
This means there are still 279 name records to clean in the adult DF.
adultnas <- adult %>%
filter(is.na(adult$Region) & !is.na(adult$country))
unique(adultnas$country)
## [1] "South" "England" "Iran" "Laos" "Yugoslavia"
## [6] "Scotland" "Hong"
There are still seven country names that don’t match the classification list. Our of this, there is one name “South” that we cannot interpret with the information that we have, so it will be ignored. The other six can be corrected:
adult <- adult %>%
mutate(country = ifelse(country=="South", NA, country))%>%
mutate(country = ifelse(country=="England", "United Kingdom", country)) %>%
mutate(country = ifelse(country=="Iran", "Iran, Islamic Republic of", country)) %>%
mutate(country = ifelse(country=="Laos", "Lao People's Democratic Republic", country)) %>%
mutate(country = ifelse(country=="Yugoslavia", "Bosnia and Herzegovina", country)) %>%
mutate(country = ifelse(country=="Scotland", "United Kingdom", country)) %>%
mutate(country = ifelse(country=="Hong", "Hong Kong", country))
Important to note: England and Scotland have been assigned to the United Kingdom. Yugoslavia has been changed to “Bosnia and Herzegovina” in an arbitrary fashion, since we don’t have any information to divide registries into the different nations that arised from the dissolution of the former Yugoslavia.
adult <- adult%>%
select(c(1:15))%>%
left_join(countries, by = c("country" = "Country"))
Let’s explore NAs after the join
paste("There are", sum(is.na(adult$country)), "NA countries in the adult DF")
## [1] "There are 663 NA countries in the adult DF"
paste("There are", sum(is.na(adult$Region)), "NA regions in the adult DF")
## [1] "There are 663 NA regions in the adult DF"
table(adult$Region)
##
## Asia & Pacific Europe Middle east North America
## 642 521 43 29291
## South/Latin America
## 1401
table(adult$`Global South`)
##
## Global North Global South
## 29925 1973
missmap(adult, legend = F, col = c("navy", "gray"))
## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `imputations`.
head(adult, 12)
## # A tibble: 12 x 17
## age type_employer fnlwgt education education_num marital occupation
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 39 SL-gov 77516 Bachelors 13 Never-mar~ Adm-clerical
## 2 50 self-empl 83311 Bachelors 13 Married Exec-manage~
## 3 38 Private 215646 HS-grad 9 Not-Marri~ Handlers-cl~
## 4 53 Private 234721 11th 7 Married Handlers-cl~
## 5 28 Private 338409 Bachelors 13 Married Prof-specia~
## 6 37 Private 284582 Masters 14 Married Exec-manage~
## 7 49 Private 160187 9th 5 Married Other-servi~
## 8 52 self-empl 209642 HS-grad 9 Married Exec-manage~
## 9 31 Private 45781 Masters 14 Never-mar~ Prof-specia~
## 10 42 Private 159449 Bachelors 13 Married Exec-manage~
## 11 37 Private 280464 Some-college 10 Married Exec-manage~
## 12 30 SL-gov 141297 Bachelors 13 Married Prof-specia~
## # ... with 10 more variables: relationship <chr>, race <fct>, sex <fct>,
## # capital_gain <dbl>, capital_loss <dbl>, hr_per_week <dbl>, country <chr>,
## # income <fct>, Region <chr>, Global South <chr>
check for “?” characters in the data frame
res <- NULL
for (i in 1:17){
val <- sum((adult[,i])=="?")
ad.var <- colnames(adult[,i])
res <- rbind(res, c(ad.var, val))
}
colnames(res) <- c("variable", "observed_?s")
print(res)
## variable observed_?s
## [1,] "age" "0"
## [2,] "type_employer" NA
## [3,] "fnlwgt" "0"
## [4,] "education" "0"
## [5,] "education_num" "0"
## [6,] "marital" "0"
## [7,] "occupation" "1843"
## [8,] "relationship" "0"
## [9,] "race" "0"
## [10,] "sex" "0"
## [11,] "capital_gain" "0"
## [12,] "capital_loss" "0"
## [13,] "hr_per_week" "0"
## [14,] "country" NA
## [15,] "income" "0"
## [16,] "Region" NA
## [17,] "Global South" NA
The only column that contains “?” characters is “occupation”. We can fix that in order facilitate analysis.
adult <- mutate(adult, occupation = ifelse(occupation=="?", NA, occupation))
res <- NULL
for (i in 1:17){
val <- sum((adult[,i])=="?")
ad.var <- colnames(adult[,i])
res <- rbind(res, c(ad.var, val))
}
colnames(res) <- c("variable", "observed_?s")
print(res)
## variable observed_?s
## [1,] "age" "0"
## [2,] "type_employer" NA
## [3,] "fnlwgt" "0"
## [4,] "education" "0"
## [5,] "education_num" "0"
## [6,] "marital" "0"
## [7,] "occupation" NA
## [8,] "relationship" "0"
## [9,] "race" "0"
## [10,] "sex" "0"
## [11,] "capital_gain" "0"
## [12,] "capital_loss" "0"
## [13,] "hr_per_week" "0"
## [14,] "country" NA
## [15,] "income" "0"
## [16,] "Region" NA
## [17,] "Global South" NA
Now every observation that contained a “?” character for any variable has been corrected to NA.
missmap(adult, legend = F, col = c("navy", "gray"))
## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `imputations`.
paste("df contains", sum(complete.cases(adult)), "complete cases")
## [1] "df contains 30091 complete cases"
paste("they account for", round(100*sum(complete.cases(adult))/nrow(adult), 3), "percentage of the df")
## [1] "they account for 92.414 percentage of the df"
As we can see in the previous graphic, NA values are present in five variables. Three of them are geographic location variables, they are completely co-related, and most likely a useful imputation is not possible. The other two variables with missing data are “occupation” and “type_employer”, NA frequency is also correlated, so meaningful imputation seems unlikely.
adult <- na.omit(adult)
missmap(adult, legend = F, col = c("navy", "gray"))
## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `imputations`.
Now that we have cleaned the data, let’s take a close look at it.
str(adult)
## tibble [30,091 x 17] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:30091] 39 50 38 53 28 37 49 52 31 42 ...
## $ type_employer: chr [1:30091] "SL-gov" "self-empl" "Private" "Private" ...
## $ fnlwgt : num [1:30091] 77516 83311 215646 234721 338409 ...
## $ education : chr [1:30091] "Bachelors" "Bachelors" "HS-grad" "11th" ...
## $ education_num: num [1:30091] 13 13 9 7 13 14 5 9 14 13 ...
## $ marital : chr [1:30091] "Never-married" "Married" "Not-Married" "Married" ...
## $ occupation : chr [1:30091] "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
## $ relationship : chr [1:30091] "Not-in-family" "Husband" "Not-in-family" "Husband" ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital_gain : num [1:30091] 2174 0 0 0 0 ...
## $ capital_loss : num [1:30091] 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_per_week : num [1:30091] 40 13 40 40 40 40 16 45 50 40 ...
## $ country : chr [1:30091] "United States" "United States" "United States" "United States" ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
## $ Region : chr [1:30091] "North America" "North America" "North America" "North America" ...
## $ Global South : chr [1:30091] "Global North" "Global North" "Global North" "Global North" ...
## - attr(*, "na.action")= 'omit' Named int [1:2470] 15 28 39 52 62 70 78 94 107 129 ...
## ..- attr(*, "names")= chr [1:2470] "15" "28" "39" "52" ...
Histogram of ages colored by income
ggplot(data=adult, aes(age)) + geom_histogram(aes(fill=income), binwidth = 1)
Histogram of hours worked per week
ggplot(data=adult, aes(hr_per_week)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Income level by region
ggplot(data=adult, aes(Region)) + geom_bar(aes(fill=income))
sum(adult$Region=="Middle east")
## [1] 42
ggplot(data=adult, aes(Region)) + geom_bar(aes(fill=income), position = "fill")
By far, most observations relate to the North America region. Second region with most observations is South/Latin America, while Asia & Pacific and Europe have fewer observations. Middle east accountants for the least number of observations.
Middle east region reports the higher ratio of income above the 50k threshold, while South/Latin America has the least proportion of people earning above 50 k. Asia & Pacific, Europe and North America show a similar general income distribution pattern.
This model will predict if people is likely to fall into above or below 50k salary groups.
Logistic Regression is a type of classification model. In classification models, we attempt to predict the outcome of categorical dependent variables, using one or more independent variables. The independent variables can be either categorical or numerical.
Logistic regression is based on the logistic function, which always takes values between 0 and 1. Replacing the dependent variable of the logistic function with a linear combination of dependent variables we intend to use for regression, we arrive at the formula for logistic regression.
head(adult, 10)
## # A tibble: 10 x 17
## age type_employer fnlwgt education education_num marital occupation
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 39 SL-gov 77516 Bachelors 13 Never-married Adm-clerical
## 2 50 self-empl 83311 Bachelors 13 Married Exec-manage~
## 3 38 Private 215646 HS-grad 9 Not-Married Handlers-cl~
## 4 53 Private 234721 11th 7 Married Handlers-cl~
## 5 28 Private 338409 Bachelors 13 Married Prof-specia~
## 6 37 Private 284582 Masters 14 Married Exec-manage~
## 7 49 Private 160187 9th 5 Married Other-servi~
## 8 52 self-empl 209642 HS-grad 9 Married Exec-manage~
## 9 31 Private 45781 Masters 14 Never-married Prof-specia~
## 10 42 Private 159449 Bachelors 13 Married Exec-manage~
## # ... with 10 more variables: relationship <chr>, race <fct>, sex <fct>,
## # capital_gain <dbl>, capital_loss <dbl>, hr_per_week <dbl>, country <chr>,
## # income <fct>, Region <chr>, Global South <chr>
adult <- adult[,c(1:13, 15,16)]
library(caTools)
set.seed(101)
adult.split <- sample.split(adult$income, SplitRatio = .65)
adult.train <- subset(adult, adult.split == TRUE)
adult.test <- subset(adult, adult.split == FALSE)
ad.mdtr.1 <- glm(income ~ ., family = binomial(logit), data = adult.train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(ad.mdtr.1)
##
## Call:
## glm(formula = income ~ ., family = binomial(logit), data = adult.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.0422 -0.5219 -0.1957 -0.0008 3.6049
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.115e+00 4.627e-01 -11.055 < 2e-16 ***
## age 2.537e-02 2.079e-03 12.203 < 2e-16 ***
## type_employerPrivate -5.145e-01 1.146e-01 -4.492 7.07e-06 ***
## type_employerself-empl -8.382e-01 1.277e-01 -6.562 5.30e-11 ***
## type_employerSL-gov -6.926e-01 1.291e-01 -5.363 8.19e-08 ***
## type_employerUnemployed -1.315e+01 2.703e+02 -0.049 0.961188
## fnlwgt 7.574e-07 2.148e-07 3.526 0.000422 ***
## education11th 3.334e-01 2.704e-01 1.233 0.217606
## education12th 6.009e-01 3.364e-01 1.787 0.074017 .
## education1st-4th 2.010e-03 5.142e-01 0.004 0.996881
## education5th-6th -2.495e-01 4.382e-01 -0.569 0.569191
## education7th-8th -4.729e-01 3.218e-01 -1.470 0.141618
## education9th -3.076e-01 3.495e-01 -0.880 0.378815
## educationAssoc-acdm 1.438e+00 2.268e-01 6.339 2.31e-10 ***
## educationAssoc-voc 1.403e+00 2.199e-01 6.380 1.77e-10 ***
## educationBachelors 2.023e+00 2.046e-01 9.887 < 2e-16 ***
## educationDoctorate 3.082e+00 2.824e-01 10.913 < 2e-16 ***
## educationHS-grad 9.128e-01 1.993e-01 4.579 4.67e-06 ***
## educationMasters 2.243e+00 2.174e-01 10.317 < 2e-16 ***
## educationPreschool -1.060e+01 1.478e+02 -0.072 0.942835
## educationProf-school 3.020e+00 2.578e-01 11.717 < 2e-16 ***
## educationSome-college 1.263e+00 2.019e-01 6.252 4.05e-10 ***
## education_num NA NA NA NA
## maritalNever-married -1.103e+00 2.208e-01 -4.997 5.81e-07 ***
## maritalNot-Married -5.374e-01 2.191e-01 -2.452 0.014201 *
## occupationArmed-Forces -1.033e+00 1.525e+00 -0.678 0.498037
## occupationCraft-repair 8.915e-02 9.872e-02 0.903 0.366515
## occupationExec-managerial 8.515e-01 9.582e-02 8.886 < 2e-16 ***
## occupationFarming-fishing -1.025e+00 1.737e-01 -5.900 3.64e-09 ***
## occupationHandlers-cleaners -6.679e-01 1.824e-01 -3.662 0.000250 ***
## occupationMachine-op-inspct -1.927e-01 1.238e-01 -1.556 0.119651
## occupationOther-service -8.442e-01 1.465e-01 -5.763 8.27e-09 ***
## occupationPriv-house-serv -3.645e+00 1.803e+00 -2.021 0.043231 *
## occupationProf-specialty 4.767e-01 1.015e-01 4.698 2.63e-06 ***
## occupationProtective-serv 4.087e-01 1.584e-01 2.581 0.009862 **
## occupationSales 3.603e-01 1.020e-01 3.531 0.000414 ***
## occupationTech-support 6.304e-01 1.363e-01 4.625 3.74e-06 ***
## occupationTransport-moving -1.224e-01 1.227e-01 -0.997 0.318551
## relationshipNot-in-family -1.072e+00 2.164e-01 -4.954 7.25e-07 ***
## relationshipOther-relative -1.376e+00 2.922e-01 -4.709 2.49e-06 ***
## relationshipOwn-child -2.334e+00 2.784e-01 -8.382 < 2e-16 ***
## relationshipUnmarried -1.170e+00 2.387e-01 -4.902 9.49e-07 ***
## relationshipWife 1.387e+00 1.299e-01 10.681 < 2e-16 ***
## raceAsian-Pac-Islander 2.404e-01 3.452e-01 0.696 0.486228
## raceBlack -1.049e-01 2.914e-01 -0.360 0.718973
## raceOther -5.499e-01 4.650e-01 -1.182 0.237010
## raceWhite 2.512e-02 2.777e-01 0.090 0.927923
## sexMale 8.906e-01 9.939e-02 8.961 < 2e-16 ***
## capital_gain 3.141e-04 1.300e-05 24.168 < 2e-16 ***
## capital_loss 6.175e-04 4.726e-05 13.066 < 2e-16 ***
## hr_per_week 2.862e-02 2.078e-03 13.774 < 2e-16 ***
## RegionEurope 4.365e-01 2.886e-01 1.513 0.130384
## RegionMiddle east -1.176e-01 5.875e-01 -0.200 0.841345
## RegionNorth America 3.060e-01 2.421e-01 1.264 0.206230
## RegionSouth/Latin America -2.619e-01 2.870e-01 -0.912 0.361551
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21956 on 19558 degrees of freedom
## Residual deviance: 12828 on 19505 degrees of freedom
## AIC: 12936
##
## Number of Fisher Scoring iterations: 13
Now let’s use the step() function to remove predictive variables that do not significantly add to the fit.
new.model <- step(ad.mdtr.1)
## Start: AIC=12936.4
## income ~ age + type_employer + fnlwgt + education + education_num +
## marital + occupation + relationship + race + sex + capital_gain +
## capital_loss + hr_per_week + Region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=12936.4
## income ~ age + type_employer + fnlwgt + education + marital +
## occupation + relationship + race + sex + capital_gain + capital_loss +
## hr_per_week + Region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - race 4 12834 12934
## <none> 12828 12936
## - Region 4 12844 12944
## - fnlwgt 1 12841 12947
## - marital 2 12871 12975
## - type_employer 4 12884 12984
## - sex 1 12912 13018
## - age 1 12978 13084
## - capital_loss 1 13004 13110
## - hr_per_week 1 13022 13128
## - relationship 5 13095 13193
## - occupation 13 13226 13308
## - education 15 13440 13518
## - capital_gain 1 13963 14069
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=12934.05
## income ~ age + type_employer + fnlwgt + education + marital +
## occupation + relationship + sex + capital_gain + capital_loss +
## hr_per_week + Region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## <none> 12834 12934
## - fnlwgt 1 12845 12943
## - Region 4 12852 12944
## - marital 2 12877 12973
## - type_employer 4 12888 12980
## - sex 1 12919 13017
## - age 1 12985 13083
## - capital_loss 1 13010 13108
## - hr_per_week 1 13028 13126
## - relationship 5 13102 13192
## - occupation 13 13236 13310
## - education 15 13450 13520
## - capital_gain 1 13968 14066
summary(new.model)
##
## Call:
## glm(formula = income ~ age + type_employer + fnlwgt + education +
## marital + occupation + relationship + sex + capital_gain +
## capital_loss + hr_per_week + Region, family = binomial(logit),
## data = adult.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.0407 -0.5228 -0.1959 -0.0008 3.6206
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.929e+00 3.304e-01 -14.919 < 2e-16 ***
## age 2.545e-02 2.078e-03 12.246 < 2e-16 ***
## type_employerPrivate -4.999e-01 1.140e-01 -4.387 1.15e-05 ***
## type_employerself-empl -8.225e-01 1.271e-01 -6.471 9.71e-11 ***
## type_employerSL-gov -6.814e-01 1.288e-01 -5.291 1.22e-07 ***
## type_employerUnemployed -1.313e+01 2.703e+02 -0.049 0.961264
## fnlwgt 7.196e-07 2.126e-07 3.384 0.000714 ***
## education11th 3.382e-01 2.705e-01 1.250 0.211136
## education12th 5.936e-01 3.362e-01 1.766 0.077457 .
## education1st-4th 1.215e-02 5.142e-01 0.024 0.981157
## education5th-6th -2.522e-01 4.385e-01 -0.575 0.565081
## education7th-8th -4.748e-01 3.218e-01 -1.476 0.140040
## education9th -3.078e-01 3.491e-01 -0.882 0.378020
## educationAssoc-acdm 1.443e+00 2.267e-01 6.365 1.96e-10 ***
## educationAssoc-voc 1.408e+00 2.198e-01 6.403 1.52e-10 ***
## educationBachelors 2.031e+00 2.045e-01 9.929 < 2e-16 ***
## educationDoctorate 3.093e+00 2.824e-01 10.951 < 2e-16 ***
## educationHS-grad 9.181e-01 1.993e-01 4.607 4.09e-06 ***
## educationMasters 2.252e+00 2.173e-01 10.363 < 2e-16 ***
## educationPreschool -1.066e+01 1.469e+02 -0.073 0.942169
## educationProf-school 3.029e+00 2.578e-01 11.749 < 2e-16 ***
## educationSome-college 1.268e+00 2.019e-01 6.279 3.41e-10 ***
## maritalNever-married -1.096e+00 2.210e-01 -4.960 7.04e-07 ***
## maritalNot-Married -5.311e-01 2.194e-01 -2.421 0.015497 *
## occupationArmed-Forces -1.019e+00 1.520e+00 -0.670 0.502635
## occupationCraft-repair 9.118e-02 9.866e-02 0.924 0.355404
## occupationExec-managerial 8.538e-01 9.578e-02 8.913 < 2e-16 ***
## occupationFarming-fishing -1.020e+00 1.736e-01 -5.879 4.12e-09 ***
## occupationHandlers-cleaners -6.750e-01 1.822e-01 -3.705 0.000212 ***
## occupationMachine-op-inspct -2.009e-01 1.237e-01 -1.624 0.104472
## occupationOther-service -8.499e-01 1.463e-01 -5.808 6.33e-09 ***
## occupationPriv-house-serv -3.639e+00 1.795e+00 -2.027 0.042655 *
## occupationProf-specialty 4.788e-01 1.014e-01 4.723 2.33e-06 ***
## occupationProtective-serv 4.015e-01 1.582e-01 2.537 0.011169 *
## occupationSales 3.616e-01 1.020e-01 3.546 0.000390 ***
## occupationTech-support 6.371e-01 1.361e-01 4.680 2.87e-06 ***
## occupationTransport-moving -1.244e-01 1.226e-01 -1.014 0.310385
## relationshipNot-in-family -1.081e+00 2.166e-01 -4.992 5.99e-07 ***
## relationshipOther-relative -1.392e+00 2.934e-01 -4.743 2.10e-06 ***
## relationshipOwn-child -2.336e+00 2.787e-01 -8.381 < 2e-16 ***
## relationshipUnmarried -1.183e+00 2.388e-01 -4.956 7.21e-07 ***
## relationshipWife 1.387e+00 1.299e-01 10.684 < 2e-16 ***
## sexMale 8.970e-01 9.937e-02 9.026 < 2e-16 ***
## capital_gain 3.139e-04 1.299e-05 24.170 < 2e-16 ***
## capital_loss 6.169e-04 4.727e-05 13.052 < 2e-16 ***
## hr_per_week 2.866e-02 2.078e-03 13.790 < 2e-16 ***
## RegionEurope 2.530e-01 2.205e-01 1.147 0.251365
## RegionMiddle east -2.702e-01 5.660e-01 -0.477 0.633019
## RegionNorth America 1.154e-01 1.548e-01 0.745 0.455992
## RegionSouth/Latin America -4.954e-01 2.194e-01 -2.258 0.023940 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21956 on 19558 degrees of freedom
## Residual deviance: 12834 on 19509 degrees of freedom
## AIC: 12934
##
## Number of Fisher Scoring iterations: 13
adult.test$predicted.income <- predict(new.model, newdata = adult.test, type="response")
table(adult.test$income, adult.test$predicted.income > 0.5)
##
## FALSE TRUE
## <=50K 7356 553
## >50K 1018 1605
Sum of FALSE / Sum of ALL
cm.income <- table(adult.test$income, adult.test$predicted.income > 0.5)
(cm.income[1,1] + cm.income[2,2])/sum(cm.income)
## [1] 0.8508355
Recall
cm.income[1,1]/(sum(cm.income[1,]))
## [1] 0.9300797
Precision
cm.income[1,1]/sum(cm.income[,1])
## [1] 0.8784332
The final model accuracy numbers show that the model has 85% accuracy. This is not particulary high, as from a good model we would expect at least 90% accuracy.