This project is an exercise prepared within UDEMY’s course: “R for Data Science and Machine Learning”.

Logistic Regression Project

In this project we will be working with the UCI adult dataset. We will be attempting to predict if people in the data set belong in a certain class by salary, either making <=50k or >50k per year. Most of this project is focus on the different issues that may arise when cleaning data.

Libraries and data loading

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

Loading and Preparing Data

adult <- read_csv("adult_sal.csv")
## New names:
## * `` -> ...1
## Rows: 32561 Columns: 16
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): type_employer, education, marital, occupation, relationship, race, ...
## dbl (7): ...1, age, fnlwgt, education_num, capital_gain, capital_loss, hr_pe...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(adult, 10)
## # A tibble: 10 x 16
##     ...1   age type_employer  fnlwgt education education_num marital  occupation
##    <dbl> <dbl> <chr>           <dbl> <chr>             <dbl> <chr>    <chr>     
##  1     1    39 State-gov       77516 Bachelors            13 Never-m~ Adm-cleri~
##  2     2    50 Self-emp-not-~  83311 Bachelors            13 Married~ Exec-mana~
##  3     3    38 Private        215646 HS-grad               9 Divorced Handlers-~
##  4     4    53 Private        234721 11th                  7 Married~ Handlers-~
##  5     5    28 Private        338409 Bachelors            13 Married~ Prof-spec~
##  6     6    37 Private        284582 Masters              14 Married~ Exec-mana~
##  7     7    49 Private        160187 9th                   5 Married~ Other-ser~
##  8     8    52 Self-emp-not-~ 209642 HS-grad               9 Married~ Exec-mana~
##  9     9    31 Private         45781 Masters              14 Never-m~ Prof-spec~
## 10    10    42 Private        159449 Bachelors            13 Married~ Exec-mana~
## # ... with 8 more variables: relationship <chr>, race <chr>, sex <chr>,
## #   capital_gain <dbl>, capital_loss <dbl>, hr_per_week <dbl>, country <chr>,
## #   income <chr>

Dropping duplicated index column

adult <- select(adult, -1)
str(adult)
## tibble [32,561 x 15] (S3: tbl_df/tbl/data.frame)
##  $ age          : num [1:32561] 39 50 38 53 28 37 49 52 31 42 ...
##  $ type_employer: chr [1:32561] "State-gov" "Self-emp-not-inc" "Private" "Private" ...
##  $ fnlwgt       : num [1:32561] 77516 83311 215646 234721 338409 ...
##  $ education    : chr [1:32561] "Bachelors" "Bachelors" "HS-grad" "11th" ...
##  $ education_num: num [1:32561] 13 13 9 7 13 14 5 9 14 13 ...
##  $ marital      : chr [1:32561] "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
##  $ occupation   : chr [1:32561] "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
##  $ relationship : chr [1:32561] "Not-in-family" "Husband" "Not-in-family" "Husband" ...
##  $ race         : chr [1:32561] "White" "White" "White" "Black" ...
##  $ sex          : chr [1:32561] "Male" "Male" "Male" "Male" ...
##  $ capital_gain : num [1:32561] 2174 0 0 0 0 ...
##  $ capital_loss : num [1:32561] 0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_per_week  : num [1:32561] 40 13 40 40 40 40 16 45 50 40 ...
##  $ country      : chr [1:32561] "United-States" "United-States" "United-States" "United-States" ...
##  $ income       : chr [1:32561] "<=50K" "<=50K" "<=50K" "<=50K" ...
head(adult)
## # A tibble: 6 x 15
##     age type_employer    fnlwgt education education_num marital     occupation  
##   <dbl> <chr>             <dbl> <chr>             <dbl> <chr>       <chr>       
## 1    39 State-gov         77516 Bachelors            13 Never-marr~ Adm-clerical
## 2    50 Self-emp-not-inc  83311 Bachelors            13 Married-ci~ Exec-manage~
## 3    38 Private          215646 HS-grad               9 Divorced    Handlers-cl~
## 4    53 Private          234721 11th                  7 Married-ci~ Handlers-cl~
## 5    28 Private          338409 Bachelors            13 Married-ci~ Prof-specia~
## 6    37 Private          284582 Masters              14 Married-ci~ Exec-manage~
## # ... with 8 more variables: relationship <chr>, race <chr>, sex <chr>,
## #   capital_gain <dbl>, capital_loss <dbl>, hr_per_week <dbl>, country <chr>,
## #   income <chr>
summary(adult)
##       age        type_employer          fnlwgt         education        
##  Min.   :17.00   Length:32561       Min.   :  12285   Length:32561      
##  1st Qu.:28.00   Class :character   1st Qu.: 117827   Class :character  
##  Median :37.00   Mode  :character   Median : 178356   Mode  :character  
##  Mean   :38.58                      Mean   : 189778                     
##  3rd Qu.:48.00                      3rd Qu.: 237051                     
##  Max.   :90.00                      Max.   :1484705                     
##  education_num     marital           occupation        relationship      
##  Min.   : 1.00   Length:32561       Length:32561       Length:32561      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.08                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital_gain    capital_loss   
##  Length:32561       Length:32561       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1078   Mean   :  87.3  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :4356.0  
##   hr_per_week      country             income         
##  Min.   : 1.00   Length:32561       Length:32561      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.44                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00

Feature Exploration

colnames(adult)
##  [1] "age"           "type_employer" "fnlwgt"        "education"    
##  [5] "education_num" "marital"       "occupation"    "relationship" 
##  [9] "race"          "sex"           "capital_gain"  "capital_loss" 
## [13] "hr_per_week"   "country"       "income"
unique(adult$type_employer)
## [1] "State-gov"        "Self-emp-not-inc" "Private"          "Federal-gov"     
## [5] "Local-gov"        "?"                "Self-emp-inc"     "Without-pay"     
## [9] "Never-worked"
unique(adult$marital)
## [1] "Never-married"         "Married-civ-spouse"    "Divorced"             
## [4] "Married-spouse-absent" "Separated"             "Married-AF-spouse"    
## [7] "Widowed"
unique(adult$race)
## [1] "White"              "Black"              "Asian-Pac-Islander"
## [4] "Amer-Indian-Eskimo" "Other"
unique(adult$sex)
## [1] "Male"   "Female"
unique(adult$country)
##  [1] "United-States"              "Cuba"                      
##  [3] "Jamaica"                    "India"                     
##  [5] "?"                          "Mexico"                    
##  [7] "South"                      "Puerto-Rico"               
##  [9] "Honduras"                   "England"                   
## [11] "Canada"                     "Germany"                   
## [13] "Iran"                       "Philippines"               
## [15] "Italy"                      "Poland"                    
## [17] "Columbia"                   "Cambodia"                  
## [19] "Thailand"                   "Ecuador"                   
## [21] "Laos"                       "Taiwan"                    
## [23] "Haiti"                      "Portugal"                  
## [25] "Dominican-Republic"         "El-Salvador"               
## [27] "France"                     "Guatemala"                 
## [29] "China"                      "Japan"                     
## [31] "Yugoslavia"                 "Peru"                      
## [33] "Outlying-US(Guam-USVI-etc)" "Scotland"                  
## [35] "Trinadad&Tobago"            "Greece"                    
## [37] "Nicaragua"                  "Vietnam"                   
## [39] "Hong"                       "Ireland"                   
## [41] "Hungary"                    "Holand-Netherlands"

Data Cleaning

It seems read_csv function has avoided reading some columns as factors. It could be useful to have sex, race and income columns as factor variables in further analysis.

adult$sex <- as.factor(adult$sex)
adult$race <- as.factor(adult$race)
adult$income <- as.factor(adult$income)
str(adult)
## tibble [32,561 x 15] (S3: tbl_df/tbl/data.frame)
##  $ age          : num [1:32561] 39 50 38 53 28 37 49 52 31 42 ...
##  $ type_employer: chr [1:32561] "State-gov" "Self-emp-not-inc" "Private" "Private" ...
##  $ fnlwgt       : num [1:32561] 77516 83311 215646 234721 338409 ...
##  $ education    : chr [1:32561] "Bachelors" "Bachelors" "HS-grad" "11th" ...
##  $ education_num: num [1:32561] 13 13 9 7 13 14 5 9 14 13 ...
##  $ marital      : chr [1:32561] "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
##  $ occupation   : chr [1:32561] "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
##  $ relationship : chr [1:32561] "Not-in-family" "Husband" "Not-in-family" "Husband" ...
##  $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital_gain : num [1:32561] 2174 0 0 0 0 ...
##  $ capital_loss : num [1:32561] 0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_per_week  : num [1:32561] 40 13 40 40 40 40 16 45 50 40 ...
##  $ country      : chr [1:32561] "United-States" "United-States" "United-States" "United-States" ...
##  $ income       : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...

type_employer column

table(adult$type_employer)
## 
##                ?      Federal-gov        Local-gov     Never-worked 
##             1836              960             2093                7 
##          Private     Self-emp-inc Self-emp-not-inc        State-gov 
##            22696             1116             2541             1298 
##      Without-pay 
##               14

Table analysis of column “type_employer” shows that there is room for some cleaning. The aim here is to consolidate features to have a more robust model down the road.

We will start by assigning a correct NA value to the “?” text used in 1836 observations

adult <- adult %>%
      mutate(type_employer = ifelse(type_employer=="?", NA, type_employer)) %>%
      mutate(type_employer = ifelse(type_employer=="Never-worked" | type_employer == "Without-pay", 
                                    "Unemployed", type_employer)) %>%
      mutate(type_employer = ifelse(type_employer=="Local-gov" | type_employer=="State-gov",
                                   "SL-gov", type_employer)) %>%
      mutate(type_employer = ifelse(type_employer == "Self-emp-inc" | type_employer == "Self-emp-not-inc",                                     "self-empl", type_employer))
table(adult$type_employer)
## 
## Federal-gov     Private   self-empl      SL-gov  Unemployed 
##         960       22696        3657        3391          21
sum(is.na(adult$type_employer))
## [1] 1836

Marital Column

table(adult$marital)
## 
##              Divorced     Married-AF-spouse    Married-civ-spouse 
##                  4443                    23                 14976 
## Married-spouse-absent         Never-married             Separated 
##                   418                 10683                  1025 
##               Widowed 
##                   993

Let’s reduce marital status groups into only 3 categories: Married, Not-Married, Never-Married

adult <- adult %>%
      mutate(marital=ifelse(marital=="Married-AF-spouse"|marital=="Married-civ-spouse"|
                                  marital=="Married-spouse-absent", "Married", marital))%>%
      mutate(marital=ifelse(marital=="Divorced"|marital=="Separated"|marital=="Widowed",
                            "Not-Married", marital))

Country Column

table(adult$country)
## 
##                          ?                   Cambodia 
##                        583                         19 
##                     Canada                      China 
##                        121                         75 
##                   Columbia                       Cuba 
##                         59                         95 
##         Dominican-Republic                    Ecuador 
##                         70                         28 
##                El-Salvador                    England 
##                        106                         90 
##                     France                    Germany 
##                         29                        137 
##                     Greece                  Guatemala 
##                         29                         64 
##                      Haiti         Holand-Netherlands 
##                         44                          1 
##                   Honduras                       Hong 
##                         13                         20 
##                    Hungary                      India 
##                         13                        100 
##                       Iran                    Ireland 
##                         43                         24 
##                      Italy                    Jamaica 
##                         73                         81 
##                      Japan                       Laos 
##                         62                         18 
##                     Mexico                  Nicaragua 
##                        643                         34 
## Outlying-US(Guam-USVI-etc)                       Peru 
##                         14                         31 
##                Philippines                     Poland 
##                        198                         60 
##                   Portugal                Puerto-Rico 
##                         37                        114 
##                   Scotland                      South 
##                         12                         80 
##                     Taiwan                   Thailand 
##                         51                         18 
##            Trinadad&Tobago              United-States 
##                         19                      29170 
##                    Vietnam                 Yugoslavia 
##                         67                         16

As expected, the number of countries observed is somewhat large. We will proceed to group them by subcontinents under a new variable called “region”. The parameter to group this items will reference from Wikipedia page “List of Countries by regional classification” (https://meta.wikimedia.org/wiki/List_of_countries_by_regional_classification).

countries <- read_csv("country_list.csv")
## Rows: 248 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): Country, Region, Global South
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(countries)
## [1] "Country"      "Region"       "Global South"

Country’s names in country classification list do no contain hyphens. We will strip hyphens from the adult data frame in order to facilitate join ahead.

adult$country <- gsub("-", " ", adult$country)

Finally, before proceeding with the join, let’s do a couple more corrections

adult <- adult %>%
      mutate(country = ifelse(country=="?", NA, country)) %>%
      mutate(country = ifelse(country=="Outlying US(Guam USVI etc)", 
                              "United States Minor Outlying Islands", country))%>%
      mutate(country = ifelse(country=="Trinadad&Tobago", "Trinidad and Tobago", country))%>%
      mutate(country = ifelse(country=="Columbia", "Colombia", country))%>%
      mutate(country = ifelse(country=="Holand Netherlands", "Netherlands", country))
adult <- left_join(adult, countries, by = c("country" = "Country"))

Let’s explore NAs after the join

paste("There are", sum(is.na(adult$country)), "NA countries in the adult DF")
## [1] "There are 583 NA countries in the adult DF"
paste("There are", sum(is.na(adult$Region)), "NA regions in the adult DF")
## [1] "There are 862 NA regions in the adult DF"

This means there are still 279 name records to clean in the adult DF.

adultnas <- adult %>%
      filter(is.na(adult$Region) & !is.na(adult$country))
unique(adultnas$country)
## [1] "South"      "England"    "Iran"       "Laos"       "Yugoslavia"
## [6] "Scotland"   "Hong"

There are still seven country names that don’t match the classification list. Our of this, there is one name “South” that we cannot interpret with the information that we have, so it will be ignored. The other six can be corrected:

adult <- adult %>%
      mutate(country = ifelse(country=="South", NA, country))%>%
      mutate(country = ifelse(country=="England", "United Kingdom", country)) %>%
      mutate(country = ifelse(country=="Iran", "Iran, Islamic Republic of", country)) %>%
      mutate(country = ifelse(country=="Laos", "Lao People's Democratic Republic", country)) %>%
      mutate(country = ifelse(country=="Yugoslavia", "Bosnia and Herzegovina", country)) %>%
      mutate(country = ifelse(country=="Scotland", "United Kingdom", country)) %>%
      mutate(country = ifelse(country=="Hong", "Hong Kong", country))

Important to note: England and Scotland have been assigned to the United Kingdom. Yugoslavia has been changed to “Bosnia and Herzegovina” in an arbitrary fashion, since we don’t have any information to divide registries into the different nations that arised from the dissolution of the former Yugoslavia.

adult <- adult%>%
      select(c(1:15))%>%
      left_join(countries, by = c("country" = "Country"))

Let’s explore NAs after the join

paste("There are", sum(is.na(adult$country)), "NA countries in the adult DF")
## [1] "There are 663 NA countries in the adult DF"
paste("There are", sum(is.na(adult$Region)), "NA regions in the adult DF")
## [1] "There are 663 NA regions in the adult DF"
table(adult$Region)
## 
##      Asia & Pacific              Europe         Middle east       North America 
##                 642                 521                  43               29291 
## South/Latin America 
##                1401
table(adult$`Global South`)
## 
## Global North Global South 
##        29925         1973

Missing data

missmap(adult, legend = F, col = c("navy", "gray"))
## Warning: Unknown or uninitialised column: `arguments`.

## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `imputations`.

head(adult, 12)
## # A tibble: 12 x 17
##      age type_employer fnlwgt education    education_num marital    occupation  
##    <dbl> <chr>          <dbl> <chr>                <dbl> <chr>      <chr>       
##  1    39 SL-gov         77516 Bachelors               13 Never-mar~ Adm-clerical
##  2    50 self-empl      83311 Bachelors               13 Married    Exec-manage~
##  3    38 Private       215646 HS-grad                  9 Not-Marri~ Handlers-cl~
##  4    53 Private       234721 11th                     7 Married    Handlers-cl~
##  5    28 Private       338409 Bachelors               13 Married    Prof-specia~
##  6    37 Private       284582 Masters                 14 Married    Exec-manage~
##  7    49 Private       160187 9th                      5 Married    Other-servi~
##  8    52 self-empl     209642 HS-grad                  9 Married    Exec-manage~
##  9    31 Private        45781 Masters                 14 Never-mar~ Prof-specia~
## 10    42 Private       159449 Bachelors               13 Married    Exec-manage~
## 11    37 Private       280464 Some-college            10 Married    Exec-manage~
## 12    30 SL-gov        141297 Bachelors               13 Married    Prof-specia~
## # ... with 10 more variables: relationship <chr>, race <fct>, sex <fct>,
## #   capital_gain <dbl>, capital_loss <dbl>, hr_per_week <dbl>, country <chr>,
## #   income <fct>, Region <chr>, Global South <chr>

check for “?” characters in the data frame

res <- NULL
for (i in 1:17){
      val <- sum((adult[,i])=="?") 
      ad.var <- colnames(adult[,i])
      res <- rbind(res, c(ad.var, val))
}
colnames(res) <- c("variable", "observed_?s")
print(res)
##       variable        observed_?s
##  [1,] "age"           "0"        
##  [2,] "type_employer" NA         
##  [3,] "fnlwgt"        "0"        
##  [4,] "education"     "0"        
##  [5,] "education_num" "0"        
##  [6,] "marital"       "0"        
##  [7,] "occupation"    "1843"     
##  [8,] "relationship"  "0"        
##  [9,] "race"          "0"        
## [10,] "sex"           "0"        
## [11,] "capital_gain"  "0"        
## [12,] "capital_loss"  "0"        
## [13,] "hr_per_week"   "0"        
## [14,] "country"       NA         
## [15,] "income"        "0"        
## [16,] "Region"        NA         
## [17,] "Global South"  NA

The only column that contains “?” characters is “occupation”. We can fix that in order facilitate analysis.

adult <- mutate(adult, occupation = ifelse(occupation=="?", NA, occupation))
res <- NULL
for (i in 1:17){
      val <- sum((adult[,i])=="?") 
      ad.var <- colnames(adult[,i])
      res <- rbind(res, c(ad.var, val))
}
colnames(res) <- c("variable", "observed_?s")
print(res)
##       variable        observed_?s
##  [1,] "age"           "0"        
##  [2,] "type_employer" NA         
##  [3,] "fnlwgt"        "0"        
##  [4,] "education"     "0"        
##  [5,] "education_num" "0"        
##  [6,] "marital"       "0"        
##  [7,] "occupation"    NA         
##  [8,] "relationship"  "0"        
##  [9,] "race"          "0"        
## [10,] "sex"           "0"        
## [11,] "capital_gain"  "0"        
## [12,] "capital_loss"  "0"        
## [13,] "hr_per_week"   "0"        
## [14,] "country"       NA         
## [15,] "income"        "0"        
## [16,] "Region"        NA         
## [17,] "Global South"  NA

Now every observation that contained a “?” character for any variable has been corrected to NA.

missmap(adult, legend = F, col = c("navy", "gray"))
## Warning: Unknown or uninitialised column: `arguments`.

## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `imputations`.

paste("df contains", sum(complete.cases(adult)), "complete cases")
## [1] "df contains 30091 complete cases"
paste("they account for", round(100*sum(complete.cases(adult))/nrow(adult), 3), "percentage of the df")
## [1] "they account for 92.414 percentage of the df"

As we can see in the previous graphic, NA values are present in five variables. Three of them are geographic location variables, they are completely co-related, and most likely a useful imputation is not possible. The other two variables with missing data are “occupation” and “type_employer”, NA frequency is also correlated, so meaningful imputation seems unlikely.

adult <- na.omit(adult)
missmap(adult, legend = F, col = c("navy", "gray"))
## Warning: Unknown or uninitialised column: `arguments`.

## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `imputations`.

Exploring and data visualization

Now that we have cleaned the data, let’s take a close look at it.

str(adult)
## tibble [30,091 x 17] (S3: tbl_df/tbl/data.frame)
##  $ age          : num [1:30091] 39 50 38 53 28 37 49 52 31 42 ...
##  $ type_employer: chr [1:30091] "SL-gov" "self-empl" "Private" "Private" ...
##  $ fnlwgt       : num [1:30091] 77516 83311 215646 234721 338409 ...
##  $ education    : chr [1:30091] "Bachelors" "Bachelors" "HS-grad" "11th" ...
##  $ education_num: num [1:30091] 13 13 9 7 13 14 5 9 14 13 ...
##  $ marital      : chr [1:30091] "Never-married" "Married" "Not-Married" "Married" ...
##  $ occupation   : chr [1:30091] "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
##  $ relationship : chr [1:30091] "Not-in-family" "Husband" "Not-in-family" "Husband" ...
##  $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital_gain : num [1:30091] 2174 0 0 0 0 ...
##  $ capital_loss : num [1:30091] 0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_per_week  : num [1:30091] 40 13 40 40 40 40 16 45 50 40 ...
##  $ country      : chr [1:30091] "United States" "United States" "United States" "United States" ...
##  $ income       : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
##  $ Region       : chr [1:30091] "North America" "North America" "North America" "North America" ...
##  $ Global South : chr [1:30091] "Global North" "Global North" "Global North" "Global North" ...
##  - attr(*, "na.action")= 'omit' Named int [1:2470] 15 28 39 52 62 70 78 94 107 129 ...
##   ..- attr(*, "names")= chr [1:2470] "15" "28" "39" "52" ...

Histogram of ages colored by income

ggplot(data=adult, aes(age)) + geom_histogram(aes(fill=income), binwidth = 1)

Histogram of hours worked per week

ggplot(data=adult, aes(hr_per_week)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Income level by region

ggplot(data=adult, aes(Region)) + geom_bar(aes(fill=income))

sum(adult$Region=="Middle east")
## [1] 42
ggplot(data=adult, aes(Region)) + geom_bar(aes(fill=income), position = "fill")

By far, most observations relate to the North America region. Second region with most observations is South/Latin America, while Asia & Pacific and Europe have fewer observations. Middle east accountants for the least number of observations.

Middle east region reports the higher ratio of income above the 50k threshold, while South/Latin America has the least proportion of people earning above 50 k. Asia & Pacific, Europe and North America show a similar general income distribution pattern.

Building a Model

This model will predict if people is likely to fall into above or below 50k salary groups.

Logistic regression

Logistic Regression is a type of classification model. In classification models, we attempt to predict the outcome of categorical dependent variables, using one or more independent variables. The independent variables can be either categorical or numerical.

Logistic regression is based on the logistic function, which always takes values between 0 and 1. Replacing the dependent variable of the logistic function with a linear combination of dependent variables we intend to use for regression, we arrive at the formula for logistic regression.

head(adult, 10)
## # A tibble: 10 x 17
##      age type_employer fnlwgt education education_num marital       occupation  
##    <dbl> <chr>          <dbl> <chr>             <dbl> <chr>         <chr>       
##  1    39 SL-gov         77516 Bachelors            13 Never-married Adm-clerical
##  2    50 self-empl      83311 Bachelors            13 Married       Exec-manage~
##  3    38 Private       215646 HS-grad               9 Not-Married   Handlers-cl~
##  4    53 Private       234721 11th                  7 Married       Handlers-cl~
##  5    28 Private       338409 Bachelors            13 Married       Prof-specia~
##  6    37 Private       284582 Masters              14 Married       Exec-manage~
##  7    49 Private       160187 9th                   5 Married       Other-servi~
##  8    52 self-empl     209642 HS-grad               9 Married       Exec-manage~
##  9    31 Private        45781 Masters              14 Never-married Prof-specia~
## 10    42 Private       159449 Bachelors            13 Married       Exec-manage~
## # ... with 10 more variables: relationship <chr>, race <fct>, sex <fct>,
## #   capital_gain <dbl>, capital_loss <dbl>, hr_per_week <dbl>, country <chr>,
## #   income <fct>, Region <chr>, Global South <chr>
adult <- adult[,c(1:13, 15,16)]
library(caTools)

Train/Test Split

set.seed(101)
adult.split <- sample.split(adult$income, SplitRatio = .65)
adult.train <- subset(adult, adult.split == TRUE)
adult.test <- subset(adult, adult.split == FALSE)

Training the model

ad.mdtr.1 <- glm(income ~ ., family = binomial(logit), data = adult.train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(ad.mdtr.1)
## 
## Call:
## glm(formula = income ~ ., family = binomial(logit), data = adult.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.0422  -0.5219  -0.1957  -0.0008   3.6049  
## 
## Coefficients: (1 not defined because of singularities)
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -5.115e+00  4.627e-01 -11.055  < 2e-16 ***
## age                          2.537e-02  2.079e-03  12.203  < 2e-16 ***
## type_employerPrivate        -5.145e-01  1.146e-01  -4.492 7.07e-06 ***
## type_employerself-empl      -8.382e-01  1.277e-01  -6.562 5.30e-11 ***
## type_employerSL-gov         -6.926e-01  1.291e-01  -5.363 8.19e-08 ***
## type_employerUnemployed     -1.315e+01  2.703e+02  -0.049 0.961188    
## fnlwgt                       7.574e-07  2.148e-07   3.526 0.000422 ***
## education11th                3.334e-01  2.704e-01   1.233 0.217606    
## education12th                6.009e-01  3.364e-01   1.787 0.074017 .  
## education1st-4th             2.010e-03  5.142e-01   0.004 0.996881    
## education5th-6th            -2.495e-01  4.382e-01  -0.569 0.569191    
## education7th-8th            -4.729e-01  3.218e-01  -1.470 0.141618    
## education9th                -3.076e-01  3.495e-01  -0.880 0.378815    
## educationAssoc-acdm          1.438e+00  2.268e-01   6.339 2.31e-10 ***
## educationAssoc-voc           1.403e+00  2.199e-01   6.380 1.77e-10 ***
## educationBachelors           2.023e+00  2.046e-01   9.887  < 2e-16 ***
## educationDoctorate           3.082e+00  2.824e-01  10.913  < 2e-16 ***
## educationHS-grad             9.128e-01  1.993e-01   4.579 4.67e-06 ***
## educationMasters             2.243e+00  2.174e-01  10.317  < 2e-16 ***
## educationPreschool          -1.060e+01  1.478e+02  -0.072 0.942835    
## educationProf-school         3.020e+00  2.578e-01  11.717  < 2e-16 ***
## educationSome-college        1.263e+00  2.019e-01   6.252 4.05e-10 ***
## education_num                       NA         NA      NA       NA    
## maritalNever-married        -1.103e+00  2.208e-01  -4.997 5.81e-07 ***
## maritalNot-Married          -5.374e-01  2.191e-01  -2.452 0.014201 *  
## occupationArmed-Forces      -1.033e+00  1.525e+00  -0.678 0.498037    
## occupationCraft-repair       8.915e-02  9.872e-02   0.903 0.366515    
## occupationExec-managerial    8.515e-01  9.582e-02   8.886  < 2e-16 ***
## occupationFarming-fishing   -1.025e+00  1.737e-01  -5.900 3.64e-09 ***
## occupationHandlers-cleaners -6.679e-01  1.824e-01  -3.662 0.000250 ***
## occupationMachine-op-inspct -1.927e-01  1.238e-01  -1.556 0.119651    
## occupationOther-service     -8.442e-01  1.465e-01  -5.763 8.27e-09 ***
## occupationPriv-house-serv   -3.645e+00  1.803e+00  -2.021 0.043231 *  
## occupationProf-specialty     4.767e-01  1.015e-01   4.698 2.63e-06 ***
## occupationProtective-serv    4.087e-01  1.584e-01   2.581 0.009862 ** 
## occupationSales              3.603e-01  1.020e-01   3.531 0.000414 ***
## occupationTech-support       6.304e-01  1.363e-01   4.625 3.74e-06 ***
## occupationTransport-moving  -1.224e-01  1.227e-01  -0.997 0.318551    
## relationshipNot-in-family   -1.072e+00  2.164e-01  -4.954 7.25e-07 ***
## relationshipOther-relative  -1.376e+00  2.922e-01  -4.709 2.49e-06 ***
## relationshipOwn-child       -2.334e+00  2.784e-01  -8.382  < 2e-16 ***
## relationshipUnmarried       -1.170e+00  2.387e-01  -4.902 9.49e-07 ***
## relationshipWife             1.387e+00  1.299e-01  10.681  < 2e-16 ***
## raceAsian-Pac-Islander       2.404e-01  3.452e-01   0.696 0.486228    
## raceBlack                   -1.049e-01  2.914e-01  -0.360 0.718973    
## raceOther                   -5.499e-01  4.650e-01  -1.182 0.237010    
## raceWhite                    2.512e-02  2.777e-01   0.090 0.927923    
## sexMale                      8.906e-01  9.939e-02   8.961  < 2e-16 ***
## capital_gain                 3.141e-04  1.300e-05  24.168  < 2e-16 ***
## capital_loss                 6.175e-04  4.726e-05  13.066  < 2e-16 ***
## hr_per_week                  2.862e-02  2.078e-03  13.774  < 2e-16 ***
## RegionEurope                 4.365e-01  2.886e-01   1.513 0.130384    
## RegionMiddle east           -1.176e-01  5.875e-01  -0.200 0.841345    
## RegionNorth America          3.060e-01  2.421e-01   1.264 0.206230    
## RegionSouth/Latin America   -2.619e-01  2.870e-01  -0.912 0.361551    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21956  on 19558  degrees of freedom
## Residual deviance: 12828  on 19505  degrees of freedom
## AIC: 12936
## 
## Number of Fisher Scoring iterations: 13

Now let’s use the step() function to remove predictive variables that do not significantly add to the fit.

new.model <- step(ad.mdtr.1)
## Start:  AIC=12936.4
## income ~ age + type_employer + fnlwgt + education + education_num + 
##     marital + occupation + relationship + race + sex + capital_gain + 
##     capital_loss + hr_per_week + Region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=12936.4
## income ~ age + type_employer + fnlwgt + education + marital + 
##     occupation + relationship + race + sex + capital_gain + capital_loss + 
##     hr_per_week + Region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                 Df Deviance   AIC
## - race           4    12834 12934
## <none>                12828 12936
## - Region         4    12844 12944
## - fnlwgt         1    12841 12947
## - marital        2    12871 12975
## - type_employer  4    12884 12984
## - sex            1    12912 13018
## - age            1    12978 13084
## - capital_loss   1    13004 13110
## - hr_per_week    1    13022 13128
## - relationship   5    13095 13193
## - occupation    13    13226 13308
## - education     15    13440 13518
## - capital_gain   1    13963 14069
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=12934.05
## income ~ age + type_employer + fnlwgt + education + marital + 
##     occupation + relationship + sex + capital_gain + capital_loss + 
##     hr_per_week + Region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                 Df Deviance   AIC
## <none>                12834 12934
## - fnlwgt         1    12845 12943
## - Region         4    12852 12944
## - marital        2    12877 12973
## - type_employer  4    12888 12980
## - sex            1    12919 13017
## - age            1    12985 13083
## - capital_loss   1    13010 13108
## - hr_per_week    1    13028 13126
## - relationship   5    13102 13192
## - occupation    13    13236 13310
## - education     15    13450 13520
## - capital_gain   1    13968 14066
summary(new.model)
## 
## Call:
## glm(formula = income ~ age + type_employer + fnlwgt + education + 
##     marital + occupation + relationship + sex + capital_gain + 
##     capital_loss + hr_per_week + Region, family = binomial(logit), 
##     data = adult.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.0407  -0.5228  -0.1959  -0.0008   3.6206  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -4.929e+00  3.304e-01 -14.919  < 2e-16 ***
## age                          2.545e-02  2.078e-03  12.246  < 2e-16 ***
## type_employerPrivate        -4.999e-01  1.140e-01  -4.387 1.15e-05 ***
## type_employerself-empl      -8.225e-01  1.271e-01  -6.471 9.71e-11 ***
## type_employerSL-gov         -6.814e-01  1.288e-01  -5.291 1.22e-07 ***
## type_employerUnemployed     -1.313e+01  2.703e+02  -0.049 0.961264    
## fnlwgt                       7.196e-07  2.126e-07   3.384 0.000714 ***
## education11th                3.382e-01  2.705e-01   1.250 0.211136    
## education12th                5.936e-01  3.362e-01   1.766 0.077457 .  
## education1st-4th             1.215e-02  5.142e-01   0.024 0.981157    
## education5th-6th            -2.522e-01  4.385e-01  -0.575 0.565081    
## education7th-8th            -4.748e-01  3.218e-01  -1.476 0.140040    
## education9th                -3.078e-01  3.491e-01  -0.882 0.378020    
## educationAssoc-acdm          1.443e+00  2.267e-01   6.365 1.96e-10 ***
## educationAssoc-voc           1.408e+00  2.198e-01   6.403 1.52e-10 ***
## educationBachelors           2.031e+00  2.045e-01   9.929  < 2e-16 ***
## educationDoctorate           3.093e+00  2.824e-01  10.951  < 2e-16 ***
## educationHS-grad             9.181e-01  1.993e-01   4.607 4.09e-06 ***
## educationMasters             2.252e+00  2.173e-01  10.363  < 2e-16 ***
## educationPreschool          -1.066e+01  1.469e+02  -0.073 0.942169    
## educationProf-school         3.029e+00  2.578e-01  11.749  < 2e-16 ***
## educationSome-college        1.268e+00  2.019e-01   6.279 3.41e-10 ***
## maritalNever-married        -1.096e+00  2.210e-01  -4.960 7.04e-07 ***
## maritalNot-Married          -5.311e-01  2.194e-01  -2.421 0.015497 *  
## occupationArmed-Forces      -1.019e+00  1.520e+00  -0.670 0.502635    
## occupationCraft-repair       9.118e-02  9.866e-02   0.924 0.355404    
## occupationExec-managerial    8.538e-01  9.578e-02   8.913  < 2e-16 ***
## occupationFarming-fishing   -1.020e+00  1.736e-01  -5.879 4.12e-09 ***
## occupationHandlers-cleaners -6.750e-01  1.822e-01  -3.705 0.000212 ***
## occupationMachine-op-inspct -2.009e-01  1.237e-01  -1.624 0.104472    
## occupationOther-service     -8.499e-01  1.463e-01  -5.808 6.33e-09 ***
## occupationPriv-house-serv   -3.639e+00  1.795e+00  -2.027 0.042655 *  
## occupationProf-specialty     4.788e-01  1.014e-01   4.723 2.33e-06 ***
## occupationProtective-serv    4.015e-01  1.582e-01   2.537 0.011169 *  
## occupationSales              3.616e-01  1.020e-01   3.546 0.000390 ***
## occupationTech-support       6.371e-01  1.361e-01   4.680 2.87e-06 ***
## occupationTransport-moving  -1.244e-01  1.226e-01  -1.014 0.310385    
## relationshipNot-in-family   -1.081e+00  2.166e-01  -4.992 5.99e-07 ***
## relationshipOther-relative  -1.392e+00  2.934e-01  -4.743 2.10e-06 ***
## relationshipOwn-child       -2.336e+00  2.787e-01  -8.381  < 2e-16 ***
## relationshipUnmarried       -1.183e+00  2.388e-01  -4.956 7.21e-07 ***
## relationshipWife             1.387e+00  1.299e-01  10.684  < 2e-16 ***
## sexMale                      8.970e-01  9.937e-02   9.026  < 2e-16 ***
## capital_gain                 3.139e-04  1.299e-05  24.170  < 2e-16 ***
## capital_loss                 6.169e-04  4.727e-05  13.052  < 2e-16 ***
## hr_per_week                  2.866e-02  2.078e-03  13.790  < 2e-16 ***
## RegionEurope                 2.530e-01  2.205e-01   1.147 0.251365    
## RegionMiddle east           -2.702e-01  5.660e-01  -0.477 0.633019    
## RegionNorth America          1.154e-01  1.548e-01   0.745 0.455992    
## RegionSouth/Latin America   -4.954e-01  2.194e-01  -2.258 0.023940 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21956  on 19558  degrees of freedom
## Residual deviance: 12834  on 19509  degrees of freedom
## AIC: 12934
## 
## Number of Fisher Scoring iterations: 13

Confusion matrix

adult.test$predicted.income <- predict(new.model, newdata = adult.test, type="response")
table(adult.test$income, adult.test$predicted.income > 0.5)
##        
##         FALSE TRUE
##   <=50K  7356  553
##   >50K   1018 1605

Model accuracy

Sum of FALSE / Sum of ALL

cm.income <- table(adult.test$income, adult.test$predicted.income > 0.5)
(cm.income[1,1] + cm.income[2,2])/sum(cm.income)
## [1] 0.8508355

Recall

cm.income[1,1]/(sum(cm.income[1,]))
## [1] 0.9300797

Precision

cm.income[1,1]/sum(cm.income[,1])
## [1] 0.8784332

The final model accuracy numbers show that the model has 85% accuracy. This is not particulary high, as from a good model we would expect at least 90% accuracy.