library("tidyverse")
library("lubridate")
library("knitr")
library("ggpubr")
library("kableExtra")

1. Introduction

1.1. Report objectives

The main objective is to analyze the characteristics and quality of Census data. Then, several pre-processing considerations are proposed to enhance the performance of income classification prediction.

1.2. Data set introduction

The chosen data set contains key characteristics of the United States’ population in 1994. It was extracted and donor by (1996). A basic description can be found at UCI Machine Learning Repository. However, because the original source link is not valid, it is challenging to gain a better understanding of data collection methodology and related features. However, based on my desk research, it seems that the data set comes from the Current Population Survey (CPS), which was conducted by the United States Bureau of the Census.

Data collection method: A nationwide sample of 57,000 housing units in 1,973 counties were personally interviewed (Census 1995).

Data extraction conditions: According to UCI, data was extracted using the following rules: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)), which means individuals with:

  • age > 16
  • adjusted gross income > $100
  • final weighting > 1
  • working hours per week > 0

The extracted data set has 48,842 records and was randomly split into training and test set at ratio 2/3 and 1/3 respectively.

In this report, only the training data set, which includes 32,561 records and 15 attributes, will be analysed. Each record contains data of one respondent and each column describes one feature of that person. 10 variables are categorical and 5 variables are continuous as shown below.

#import data
censusData <- read.csv("censusData.csv", stringsAsFactors = T, strip.white = TRUE) %>% tbl_df()
censusData_backup <- censusData

#rename to replace "." with "_"
colnames(censusData) <- str_replace_all(colnames(censusData), "\\.", "_")

Variable type and description summary

variables_class <- as.data.frame(lapply(censusData, class)) %>% 
  .[1,] %>% 
  gather(variable, R_class , 1:ncol(.)) %>% 
  mutate(data_type = ifelse(R_class == "factor", "categorical", 
                            ifelse(variable == "education_num","categorical","continuous")))
variables_class %>% 
  arrange(data_type, variable) %>% 
  kable_adj() %>% 
  row_spec(11:15, color = "white", background = "#696969") %>% 
  column_spec(4, width = "10cm")
variable R_class data_type description
education factor categorical highest level of education that individual achieved
education_num integer categorical highest level of education in a numerical form
income factor categorical (target) whether or not the person makes more than $50,000 per annum income
marital_status factor categorical marital status of the individual
native_country factor categorical country of origin
occupation factor categorical occupation of the individual
race factor categorical the individual’s ethnicity
relationship factor categorical primary family relationship (the relationship of the individual in his/her primary family)
sex factor categorical gender
workclass factor categorical type of employer the individual has
age integer continuous age of the individual
capital_gain integer continuous capital gains (from selling an asset such as a stock, bond for more than the original purchase price)
capital_loss integer continuous capital losses (from selling an asset such as a stock, bond for less than the original purchase price)
fnlwgt integer continuous number of people in the universe (U.S. population) that each individual in the survey can represent
hours_per_week integer continuous hours worked per week
  • Note: fnlwgt is weighting value that ensures sample from this survey can represent the ‘true status’ of U.S. population. In general, a respondent gets a high weight if the proportion of that type of people in the survey is small compared to its size in the whole population. When applying machine learning method, this variable might be used as a weighted field to give extra importance to underrepresented groups. However, there is one note in (Census 1995) mentioned there were many missing values in this field, and hence, this field should not be used. In addition, the applied extraction rule is fnlwgt >1, which might remove that missing part of the sample and those whose weights are equal to 1. This may make the usability of this field as well as the sample representative in this extracted data set questionable.

Data structure and example values:

#preview data structure and example values
glimpse(censusData)
Observations: 32,561
Variables: 15
$ age            <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30,...
$ workclass      <fct> State-gov, Self-emp-not-inc, Private, Private, ...
$ fnlwgt         <int> 77516, 83311, 215646, 234721, 338409, 284582, 1...
$ education      <fct> Bachelors, Bachelors, HS-grad, 11th, Bachelors,...
$ education_num  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13,...
$ marital_status <fct> Never-married, Married-civ-spouse, Divorced, Ma...
$ occupation     <fct> Adm-clerical, Exec-managerial, Handlers-cleaner...
$ relationship   <fct> Not-in-family, Husband, Not-in-family, Husband,...
$ race           <fct> White, White, White, Black, Black, White, Black...
$ sex            <fct> Male, Male, Male, Male, Female, Female, Female,...
$ capital_gain   <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0...
$ capital_loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ hours_per_week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40,...
$ native_country <fct> United-States, United-States, United-States, Un...
$ income         <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K...

There are some interesting features from these examples that can be used as general guidance for pre-processing or further exploration:

  • education & education_num are redundant as they contain the same information in different forms.
  • capital_gain and capital_loss might be considered to combine into one attribute as it can represent one feature (capital).
  • Some values seem to have very high frequency, such as “0” in capital_gain and capital_loss, “40” in hours_per_week, “United- States” in native_country and “<=50K” in target variable income.

2. Initial Data Quality Assessment

#pre-processing data without printing out - will be discussed in the pre-processing section
censusData %>% 
  select(education_num, education) %>% 
  distinct() %>% 
  arrange(education_num) -> edu_order

censusData$education <- ordered(censusData$education, levels = edu_order$education)

censusData <- censusData %>% select(-education_num) %>% mutate(id = row_number())

2.1. Duplication

There are 24 duplicated rows (identical in all attributes). 22 of them are repeated 2 times and one is repeated 3 times. A full list of duplicated values is included in Appendix (6.2).

censusData %>% 
  select(-id) %>% 
  summarize(no_of_observations = n(),
            no_of_distinct = n_distinct(.)) %>% 
  mutate(no_of_dup = no_of_observations - no_of_distinct) %>% 
  mutate(pct_dup = round(no_of_dup/no_of_observations*100,1))%>% kable_adj()
no_of_observations no_of_distinct no_of_dup pct_dup
32,561 32,537 24 0.1
censusData %>% 
  group_by_at(vars(-id)) %>% 
  summarize(dup_group = n()) %>% 
  group_by(dup_group) %>% 
  summarize(count = n()) %>% kable_adj()
dup_group count
1 32,514
2 22
3 1

2.2. Missing values

According to the description on UCI, “Unknown” values are labelled as “?”. In total, there are 2,399 incomplete records (~7.4%). Specifically, there are missing values in occupation, workclass and native_country, which accounts for 5.7%, 5.6% and 1.8% of the total records respectively.

#check 'normal' missing values
# map_df(censusData, ~sum(is.na(.)))

#'check missing values labeled as "?"
total_missing <- map_df(censusData, ~str_detect(., pattern = "\\?")) %>% 
  rowSums() %>% 
  tbl_df() %>% 
  filter(value > 0) %>% 
  summarize(no_of_missing = n()) %>% 
  mutate(pct_missing = round(no_of_missing/ nrow(censusData)*100,1),
         variable = "all variables") %>% 
  select(variable, no_of_missing, pct_missing)


map_df(censusData, ~sum(str_detect(., pattern = "\\?"))) %>% 
  gather(variable, no_of_missing, 1:15) %>% 
  mutate(pct_missing = round(no_of_missing/ nrow(censusData)*100,1)) %>% 
  bind_rows(total_missing) %>% 
  arrange(desc(no_of_missing)) %>% 
  kable_adj()
variable no_of_missing pct_missing
all variables 2,399 7.4
occupation 1,843 5.7
workclass 1,836 5.6
native_country 583 1.8
age 0 0.0
fnlwgt 0 0.0
education 0 0.0
marital_status 0 0.0
relationship 0 0.0
race 0 0.0
sex 0 0.0
capital_gain 0 0.0
capital_loss 0 0.0
hours_per_week 0 0.0
income 0 0.0
id 0 0.0

However, at the top-line level, the missing values seem to have a minimal impact on the income group contribution. The odd between <=50K & >50K income group with or without missing values are quite similar to each other as shown in the graph below. In addition, some algorithms (e.g. C5.0, CR&T) have their own techniques (e.g. fractional cases, surrogate splits) to handle missing values. Therefore, I kept using the original data set for the following analysis. However, we might need to justify the impact of missing values on the model performance later one. If the performance is poor, we might consider removing incomplete records (or attributes) or imputing missing values. As the missing values are categorical, they might be defined according to the most frequent values or using classification methods to predict the missing values based on other attributes (excluding target variable income). However, these methods can not guarantee the accuracy of imputed values and might introduce bias. Furthermore, the attributes with missing values have many levels (14 occupations and 41 native countries), which makes missing values imputation even more challenging.

income_s <- prop_summary(censusData,income)

income_c <- censusData %>% 
  filter_all(all_vars(.!= "?")) %>% 
  prop_summary(income) %>%
  mutate(dataset = "without missing values") %>% 
  bind_rows(income_s %>% mutate(dataset = "with missing values"))

ggplot(income_c, aes(dataset, prop))+
  geom_col(aes(fill=income))+
  labs(y="Percentage")+
  geom_text(aes(label=prop), hjust=2.5)+
  coord_flip()+
  theme(legend.position="bottom")

2.3. Other logic checks

Based on the relationship between different attributes, the majority of observations seem to be valid, except very few abnormal values as following:

  • 2 respondents achieved Masters degrees at a very young age (<=20), which is shown as red points in the graph below. It might be true, yet worth to double check if there are many observations like this. In this context, it is reasonable to keep it as it is.
censusData %>% 
  group_by(age, education) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  ggplot(., aes(age, education))+
  geom_point(aes(size = count, color = (age <= 20 & education =="Masters")), pch = 19, alpha = 0.7)+
  scale_x_continuous(breaks = seq(0,100,10))+
  scale_colour_manual(values = setNames(c('red','grey'),c(T, F)))+
  theme(legend.position="bottom")+
  labs(y="")

  • 3 cases that show the conflict between age and relationship (“Husband” having “Female” gender and "the other way round). The genders or relationships of these cases might be recorded incorrectly.
censusData %>% 
  group_by(sex, relationship) %>% 
  summarize(count = n()) %>% 
  ggplot(., aes(relationship, sex))+
  geom_text(aes(label=count, color = (sex == "Male" & relationship == "Wife") | (sex == "Female" & relationship == "Husband")))+
  scale_colour_manual(values = setNames(c('red','black'),c(T, F)))+
  theme(legend.position="bottom")

  • There are many people aged 90 work at least one hour per week. There are some even work more 40 hours and one works 99 hours per week, which seems to be abnormal.
censusData %>% 
  ggplot(., aes(age, hours_per_week))+
  geom_jitter(aes(col = (age == 90)),alpha = 0.2)+
  scale_x_continuous(breaks = seq(0,100,10))+
  scale_y_continuous(breaks = seq(0,100,10))+
  scale_colour_manual(values = setNames(c('red','black'),c(T, F)))+
  theme(legend.position="bottom")

3. Data exploration

3.1. Categorical variables

Target variable income has two levels “<=50K” and “>50K” with the odds ratio approximately 3:1. The labelled classes are not extremely imbalanced and seem to be equally important in this task. Therefore, this ratio might not have a severe impact on model performance.

ggplot(income_c %>% filter(dataset == "with missing values"), aes(dataset, prop))+
  geom_col(aes(fill=income))+
  labs(y="Percentage")+
  geom_text(aes(label=prop), hjust=2.5)+
  coord_flip()+
  labs(x ="")+
  scale_x_discrete(labels=c("with missing values" = ""))+
  theme_minimal()+
  theme(legend.position="right")

Note: From this on, for convenience, I am going to mention “<=50K” and “>50K” as low and high income group respectively.

However, there is a big gap between the proportion of high income in this sample (~24%) and the results publicized by U.S. Bureau of the Census (n.d.) (~10%). This re-enforces my concern about the data representativeness.

variables_class$unique_value <- sapply(censusData_backup, n_distinct)
variables_class$levels <- c(NA,
                            "?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay",
                            NA,
                            "Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-grad, Some-college, Assoc-voc, Assoc-acdm, Bachelors, Masters, Prof-school, Doctorate",
                            "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16",
                            "Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed",
                            "?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv,     Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving",
                            "Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife",
                            "Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White",
                            "Female, Male",
                            NA,
                            NA,
                            NA,
                            "?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia",
                            "<=50K, >50K")


variables_class %>% 
  select(-description, - R_class) %>% 
  filter(data_type == "categorical") %>% 
  arrange(unique_value, variable) %>% 
  kable_adj() %>% 
  column_spec(4, width = "10cm")
variable data_type unique_value levels
income categorical 2 <=50K, >50K
sex categorical 2 Female, Male
race categorical 5 Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White
relationship categorical 6 Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife
marital_status categorical 7 Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed
workclass categorical 9 ?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay
occupation categorical 15 ?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving
education categorical 16 Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-grad, Some-college, Assoc-voc, Assoc-acdm, Bachelors, Masters, Prof-school, Doctorate
education_num categorical 16 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
native_country categorical 42 ?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia

Some categorical variables have quite many levels, esp. native_country, education with 42 and 16 distinct values respectively.

censusData %>%
  gather(variable, value, - id) %>% 
  mutate(value = as.integer(value))-> censusData_long

lapply(seq(temp2), function(i) {
    y <- data.frame(temp2[[i]])
    y$variable = names(y)[1]
    names(y)[1] <- c("levels")
    return(y)
  }) %>% bind_rows() -> chart1

ggplot(chart1 %>% filter(variable %in% c("native_country", "race")), aes(reorder(levels,prop), prop))+
  geom_col(fill= "#56B4E9", alpha = 0.7)+
  facet_grid(variable~., space = "free", scales = "free")+
  coord_flip()+
  labs(title = "Categorical variables - Frequency bar chart - part 1",
       x = "",
       y = "Percentage" )+
  geom_text(aes(label = round(prop,0)), hjust = -1)+
  scale_y_continuous(breaks = seq(0,100,10))

The U.S. accounts for 90% in native_country attribute while each of other 40 countries account for less than 1% except Mexico (2%). Similarly, “White” is dominant in race feature with 85%. However, race has fewer levels than native_country and might be more effective in this task.

###{#gender}

censusData %>%
  gather(variable, value, - id) %>% 
  mutate(value = as.integer(value))-> censusData_long

lapply(seq(temp2), function(i) {
    y <- data.frame(temp2[[i]])
    y$variable = names(y)[1]
    names(y)[1] <- c("levels")
    return(y)
  }) %>% bind_rows() %>% mutate(levels = factor(levels))-> chart1

ggplot(chart1 %>% filter(!variable %in% c("native_country", "race")), aes(reorder(levels,prop), prop))+
  geom_col(fill= "#56B4E9", alpha = 0.7)+
  facet_wrap(~variable, ncol =2, scales = "free_y")+
  coord_flip()+
  labs(title = "Categorical variables - Frequency bar chart - part 2",
       x = "",
       y = "Percentage" )+
  geom_text(aes(label = round(prop,0)), hjust = -0.5)+
  scale_y_continuous(breaks = seq(0,90,10), limits = c(0,80))

  • High-school graduate, Some college and Bachelors are the most common education levels, which accounted for 70% of survey responses.

  • Sample seems to be equally distributed in terms of occupation, yet quite imbalance in terms of sex and workclass. The high proportion of Private workclass might reflect the real world while the gap between genders might result from sample design methodology or data extraction rules.

censusData %>%
  summarize(high = mean(income ==">50K")) -> benchmark

censusData %>%
  gather(variable, value, - id, - income) -> censusData_long1
censusData_long1 %>%
  filter(variable %in% categorical_list$variable) %>%
  group_by(variable, value) %>%
  summarize(count = n(),
            pct_gt50 = round(mean(income == ">50K")*100,2)) %>% 
  mutate(pct_count = round(count/ nrow(censusData)*100,0)) -> summary1


summary1 %>%
  filter(variable!="native_country", variable!="race") %>% 
  ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) + 
  geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
  scale_y_continuous(breaks = seq(-10,80,10))+
  geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
  geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
  facet_wrap(.~variable, scales = "free_y", ncol = 2)+
  coord_flip() + 
  labs(title = "Categorical variables - Proportion of high income (>50K) in each demographic group - part 2",
       x = "",
       y = "Percentage of high income group" )+
  theme(legend.position = "none")

Graph explaination: The numbers in the graph represent the proportion of specific group in a demographic type (e.g. In the bottom left graph, 67% people in this survey are male.). The bar graphs show the proportion of high income group (“>50K”) in each demographic group. (e.g. Around 30% of male respondents had income more than 50K per year.). The vertical line is the proportion of high income group at the total level (~24%).

As can be seen from the graphs, those who are male, achieving at least Bachelors’ degree, in married relationships, having Management or Professional occupation, self-employed or working in government are more likely to have high income. #### {#page10} However, the size of Managerial and Professional occupation, self-employed or working-in-government groups are relatively small. Therefore, they might be less influential than other mentioned attributes (sex, marital status and education).

Some small groups with similar income likelihood might be grouped together to reduce noise and enhance generalizability. For example, some reclassified grouped might be education below High school graduate, Married-civ-spouse and Married-AF-spouse, Without-pay and Never-worked workclass.

summary1 %>%
  filter(variable=="native_country" | variable=="race") %>% 
  ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) + 
  geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
  scale_y_continuous(breaks = seq(-10,80,10))+
  geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
  geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
  facet_grid(variable~., scales = "free_y", space = "free")+
  coord_flip() + 
  labs(title = "Categorical variables - Proportion of \n high income (>50K) in each demographic group - part 1",
       x = "",
       y = "Percentage of high income group" ) +
  theme(legend.position = "none")

As discussed, the percentage of all native countries other than the U.S. is too small (~10%). In addition, they are equally distributed above and below the high-income likelihood benchmark. Therefore, this attribute might not be important for this classification task. Black, African American, American Indian or Alaskan Native have a much lower proportion of high income than their counterparts.

3.2. Continuous variables

Descriptive statistics for all numerical variables and boxplot to detect outliers:

lapply(seq(temp3), function(i) {
    y <- data.frame(temp3[[i]])
    y$variable = names(y)[1]
    names(y)[1] <- c("stats")
    return(y)
  }) %>% bind_rows() -> chart2

chart2%>% 
  spread(stats, distribution) %>% select(variable, Min., `1st Qu.`, Median, Mean, `3rd Qu.`, Max.) %>% kable_adj()
variable Min. 1st Qu. Median Mean 3rd Qu. Max.
age 17 28 37 38.58 48 90
capital_gain 0 0 0 1,078.00 0 99,999
capital_loss 0 0 0 87.30 0 4,356
fnlwgt 12,285 117,827 178,356 189,778.00 237,051 1,484,705
hours_per_week 1 40 40 40.44 45 99
censusData_long1 %>% 
  filter(variable %in% integer_list$variable) %>% 
  tbl_df() %>% 
  mutate(value = as.integer(value))-> censusData_int

chart2$stats <- ordered(chart2$stats, levels = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max."))
chart2 %>% 
  filter(stats %in% c("Min.", "Max.")) %>% 
  spread(stats, distribution) %>% 
  mutate(diff = Max. - Min.)-> t1
censusData_int %>% 
  ggplot(., aes(1,value))+
  geom_boxplot()+
  facet_wrap(~variable, scales = "free")+
  coord_flip()

In capital_gain and capital_loss, minimum, 1st, 2nd and 3rd quantiles are all equal to “0”, which reflects the fact that at least 75% of people in this data set has no capital gain and capital loss. The distributions of these variables would be extremely right-skewed.

#check whether there is any people having both capital_gain and capital_loss positive values
censusData %>% 
  group_by(capital_gain, capital_loss) %>% 
  summarize(count = n()) %>% 
  filter(capital_gain != 0 & capital_loss != 0)

#none of them

For fnlwgt, the weighting factor has a very wide range. The maximum value is 120 times greater than the minimum.

3.2.1. Capital (gain and loss)

capital was created by taking capital_gain minus capital_loss. There are 87% of respondents do not have any capital gain and loss. Due to this feature, the distribution of any values other than “0” is hardly identified in the following histogram.

#create capital column
censusData <- censusData %>% 
  mutate(capital = capital_gain - capital_loss)
#count no. of "0" value
censusData %>% 
  prop_summary(capital) %>% 
  filter(capital == 0) %>% 
  kable_adj()
capital count prop
0 28,330 87
capital_summary <- summary(censusData$capital) %>% 
  as.list() %>% 
  tbl_df() %>% 
  gather(stats, distribution, 1:6)
censusData %>% 
  ggplot(. , aes(capital))+
  geom_histogram(fill= "grey", alpha = 0.7)+
  geom_vline(data = capital_summary, aes(xintercept = distribution, col = stats), size=1, alpha = 0.5)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Capital gain and loss distribution",
       x="Capital",
       y="No. of records")

Therefore, I tried to remove “0” to improve the visibility of the histogram.

censusData %>% 
  filter(capital != 0) %>% 
  ggplot(. , aes(capital))+
  geom_histogram(fill= "grey", alpha = 0.7)+
  geom_vline(data = capital_summary, aes(xintercept = distribution, col = stats), size=1)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Capital gain and loss distribution - After remove 0 values",
       x="Capital",
       y="No. of records")

“99999” values seem to be suspicious as they are separated from the rest of the distribution (so-called outliers). According to IPUMS-CPS (n.d.), it might represent “Not-In-Universe” values rather than actual capital gain. Let temporarily remove it for clearer visualisation.

censusData %>% 
  filter(capital != 0 & capital != 99999) %>% 
  ggplot(. , aes(capital))+
  geom_histogram(aes(fill = income), binwidth = 500, alpha = 0.7)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Capital gain and loss distribution - After remove 0 and 99999 values",
       x="Capital",
       y="No. of records")+
  geom_vline(xintercept = c(-1750, 7250), linetype = "dashed", alpha = 0.5, size =0.5)+
  scale_x_continuous(breaks = seq(-10000, 40000, 10000))+
  theme(legend.position = "bottom")

### {#page14} The boundaries of capital to classified income group are approximately detected from this graph. It seems that people have the capital gain at least 7,250 or capital loss at least -1,750 USD are more likely to have a high-income.

censusData %>% 
  prop_summary(capital) %>% 
  arrange(desc(capital)) %>% 
  head(n = 10) %>% 
  kable_adj()
capital count prop
99,999 159 0.5
41,310 2 0.0
34,095 5 0.0
27,828 34 0.1
25,236 11 0.0
25,124 4 0.0
22,040 1 0.0
20,051 37 0.1
18,481 2 0.0
15,831 6 0.0

Caution: Looking a bit closer, the capital recorded values are not rounded numbers (e.g. 27,828 but not 28,000). Based on common sense, it is unlikely to get this precise amount of money based on questionnaire data. According to O’Hara (2004), capital gain came from the Census Bureau’s tax model and was discontinued since 2010 (IPUMS-CPS (n.d.)). Even though this variable might powerful in classifying the income groups, the concern is whether this information is still available for new data. If not, this feature would be useless in this classification model.

3.2.2. Age

From the 1st graph on the next page, the age distribution is right-skewed due to data extraction rule (age > 16). There is a minor spike in the number of respondents aged exactly 90. There might be possible that all ages above 90 or “Unknown” values were encoded as “90”. As mentioned in the Data quality section, it is strange that all “90-year-old people” still work at least 1 hour per week!

censusData_int %>% 
  filter(variable == "age") %>% 
  ggplot(. , aes(value))+
  geom_histogram(binwidth = 2, fill= "grey", alpha = 0.7)+
  # facet_wrap(~variable,scales = "free", ncol = 1)+
  geom_vline(data = chart2 %>% filter(variable == "age"), aes(xintercept = distribution, col = stats), size=1)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Age distribution",
       x="Age",
       y="No. of records")+
  scale_x_continuous(breaks=seq(0,100,10))+
  theme(legend.position = "right")

censusData %>% 
  group_by(income, age) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  group_by(age) %>% 
  mutate(prop = count / sum(count)) %>% 
  filter(income == ">50K", prop >= benchmark$high) -> mark

censusData_int %>% 
  filter(variable == "age") %>% 
  ggplot(., aes(value))+
  geom_freqpoly(aes(col= income), binwidth = 1)+
  geom_point(data = mark, aes(age, count))+
  geom_vline(xintercept = c(35,60), linetype = "dashed")+
  geom_vline(xintercept = c(22), linetype = "dashed", col = "red")+
  scale_x_continuous(breaks = round(seq(min(censusData$age), max(censusData$age), by = 10),-1))+
  scale_colour_hue()+
  theme(legend.position = "right")+
  labs(title = "Age distribution by income groups - Observation count",
       x ="Age",
       y = "Number of Observations")

censusData %>% 
  group_by(income, age) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  group_by(age) %>% 
  mutate(prop = count / sum(count)) %>%
  ggplot(., aes(age, prop, fill = income))+
  geom_area()+
  geom_point(data = mark, aes(age, prop))+
  geom_hline(yintercept = benchmark$high, linetype = "dashed")+
  geom_vline(xintercept = c(22), linetype = "dashed", col = "red")+
  geom_vline(xintercept = c(35,60), linetype = "dashed")+
  # geom_vline(xintercept = seq(17, 90, by = 10), col = "black")+
  scale_x_continuous(breaks = round(seq(min(censusData$age), max(censusData$age), by = 10),-1))+
  theme(legend.position = "right") +
  labs(title = "Age distribution by income groups - Proportion",
       x ="Age",
       y = "Proportion")

Note: All the horizontal lines (at 24%) are the percentage of high income group at aggregation level.

Based on the 3rd graph, it can be seen that people who aged from 35 to 60 are more likely to have a higher income than the average. Besides, there are only 5 people aged under 22 have an annual income greater than 50K, which accounts for 0.2% of people in this age group. In other word, it is very unlikely that someone age under 22 could earn more than 50,000 USD in 1994.

3.2.3. Working hours per week

47% of respondents work exactly 40 hours per week.

censusData_int %>% 
  filter(variable == "hours_per_week") %>% 
  ggplot(. , aes(value))+
  geom_histogram(binwidth = 2, fill= "grey", alpha = 0.7)+
  geom_vline(data = chart2 %>% filter(variable == "hours_per_week"), aes(xintercept = distribution, col = stats), size=1)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Working hours per week distribution",
       x="Hours per week",
       y="No. of records")+
  scale_x_continuous(breaks=seq(0,100,10))

censusData_int %>% 
  filter(variable == "hours_per_week") %>% 
  ggplot(., aes(value))+
  geom_freqpoly(aes(col= income), binwidth = 1)+
  scale_x_continuous(breaks = round(seq(min(censusData$hours_per_week), max(censusData$hours_per_week), by = 10),-1))+
  geom_vline(xintercept = 34, linetype = "dashed")+
  scale_colour_hue()+
  theme(legend.position = "right")+
  labs(title = "Hours per week distribution by income groups - Observation count",
       x ="Hours per week",
       y = "Number of Observations")

censusData %>% 
  group_by(income, hours_per_week) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  group_by(hours_per_week) %>% 
  mutate(prop = count / sum(count)) %>%
  ggplot(., aes(hours_per_week, prop, fill = income))+
  geom_area()+
  geom_hline(yintercept = benchmark$high, linetype = "dashed")+
  scale_x_continuous(breaks = round(seq(min(censusData$hours_per_week), max(censusData$hours_per_week), by = 10),-1))+
  geom_vline(xintercept = 34, linetype = "dashed")+
  theme(legend.position = "right") +
  labs(title = "Hours per week distribution by income groups - Proportion",
       x ="Hours per week",
       y = "Proportion")

The percentage of people aged under 35 have high income is typically low comparing to the older group.

censusData %>% 
  filter(workclass == "Never-worked") %>% 
  select(-capital_gain, - capital_loss, - fnlwgt, - capital, - relationship, - race, - sex, - native_country) %>%
  kable_adj()
age workclass education marital_status occupation hours_per_week income id
18 Never-worked 10th Never-married ? 40 <=50K 5,362
23 Never-worked 7th-8th Divorced ? 35 <=50K 10,846
17 Never-worked 10th Never-married ? 30 <=50K 14,773
18 Never-worked 11th Never-married ? 10 <=50K 20,338
20 Never-worked Some-college Never-married ? 40 <=50K 23,233
30 Never-worked HS-grad Married-civ-spouse ? 40 <=50K 32,305
18 Never-worked Some-college Never-married ? 4 <=50K 32,315
  • Another quality concern: There are 7 cases that have working hours per week greater than 0 but classified as “Never-worked”.

3.2.4 Multivariate analysis

There are no correlation between three main continuous variables:

# pairs(censusData %>% select(age, hours_per_week, fnlwgt, capital), pch =21, cex=0.1)

panel.cor <- function(x, y, ...)
{
par(usr = c(0, 1, 0, 1))
txt <- as.character(format(cor(x, y), digits=2))
text(0.5, 0.5, txt, cex = 60* abs(cor(x, y)))
}
pairs(censusData %>% select(age, hours_per_week, capital), upper.panel=panel.cor)

#4. Data pre-processing

In this context, data pre-processing might include four main tasks: cleaning, reduction, discretization and transformation (Han and Pei (2012)). Most of them have been discussed in the previous sections. This part aims to give a concise summary of them:

4.1. Data cleaning

  1. Duplicated and “?” missing values rates are quite small (0.1% and 7.4%). Their impact and some suggestions were discussed in detailed in Section 2.1 & 2.2.
  2. A few abnormal values based on their consistency with other attributes were identified in Section 2.3 (e.g. between gender and relationship, age and education, age and working hours, working hours and workclass…).
  3. Extreme values including “90” (age), “99” (hours_per_week), “99999” (capital_gain) need further investigation to identify whether it is true or missing values.
  4. The most challenging concern is sample bias. In the data set, the genders are imbalanced and income proportion is significantly different from official public data.

In general, correcting point (1,2) errors might be not as important as point (3,4) as the error rates in (1,2) are quite small and identifiable. Meanwhile, concerns in point (3,4) require much more domain knowledge about government Census and their procedure. By clarifying these concerns, the weakness of the data set can be well defined and the feasible solutions might be formulated. For example, if we know that the ratio between male and female should be exactly 1:1, a new weighting or other statistical analysis can be applied to minimize the impact of sample imbalance.

4.2. Data reduction

This might include dimensionality and/or numerosity reduction or both.
Several simple ways that have been done to reduce data dimensions, including:

  1. Defining education as “ordered factor” (ordinal data type) and drop redundant attribute education_num.
  2. Replacing capital_gain and capital_loss by new attribute capital.
  3. Considering filter out some weak, unreliable, correlated or associated variables such as native_country.
  4. Grouping multiple related levels into one group as the example below. 16 levels of education might be reasonably classified into 3 groups by the high-income likelihood and the group sizes. (split by the red lines)
summary1 %>%
  filter(variable=="education") %>% 
  ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) + 
  geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
  scale_y_continuous(breaks = seq(-10,80,10))+
  geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
  geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
  geom_vline(xintercept = c(8.5, 12.5), alpha = 0.5, linetype = "dashed", col = "red")+
  coord_flip() + 
  labs(title = "Categorical variables - \n Proportion of high income (>50K) in education groups",
       x = "",
       y = "Percentage of high income group" )+
  theme(legend.position = "none")

Besides, there are other advanced data reduction techniques, that might be useful to try such as Principle Component Analysis for dimensionality reduction and clustering for numerosity reduction. Due to the limit of time, these techniques were not applied, yet would be useful to explore further.

4.3. Data discretization

The general idea is trying to map multiple values into intervals or labels.

In a similar approach, continuous variables (e.g. capital, age, hours_per_week) can also be discretized section 3.2.1. The purity within some sub-groups, which were constructed based on the mentioned rule, do increase significantly as shown below.

censusData <- censusData %>% 
  mutate(capital_group = case_when(capital< (-1750) ~ "[-inf, -1750)",
                                   capital>= (-1750) & capital<0 ~ "[-1750, 0)",
                                   capital== 0 ~ "0",
                                   capital>0 & capital<=7250 ~ "(0,7250]",
                                   capital>7250 ~ "(7250, +inf]"))
censusData$capital_group <- ordered(censusData$capital_group, levels = c("[-inf, -1750)","[-1750, 0)","0","(0,7250]","(7250, +inf]"))
censusData %>% 
  ggplot(., aes(capital_group))+
  geom_bar(aes(fill = income), position = "fill")+
  labs(y = "Percentage")+
  geom_hline(yintercept = 0.241, linetype = "dashed")

4.4. Data transformation

This step depends on what machine learning methods will be applied in the modelling phase. If the algorithms are distance-based (e.g. k-Nearest Neighbours), continuous variables (including age, hours_per_week and capital) should be normalized to the same scale (0,1) and categorical attributes need to transform into dummy binary variables to measure the distance. In this data set case, as categorical variables have many levels, it might be required to reduce the number of levels before transformation or using feature selection to reduce the number of dummy attributes.

5. Summary

Business and data understanding are strongly connected to each other and crucial for the success of data mining. Data exploratory analysis enables people to understand more about domain knowledge (Census methodology). As a result, gaining further knowledge about the area of expertise helps data analysis to discover interesting patterns with a reasonable explanation.

  1. Categorical variables: Those who are male, achieving at least Bachelors’ degree, in married relationships, having Management or Professional occupation, self-employed or working in government are more likely to have high income. Sex, marital status and education might be more important than occupation and workclass in this context.

  2. Continuous variables: To some extent, capital, age and working hours per week support the discrimination of different income groups. Specifically, people in the U.S who have capital gain at least 7,250 or capital loss at least -1,750, aged 35-60 and work at least 35 hours per week have a higher probability of gaining more than 50,000 USD per year in 1994.

6. Appendix

6.1. Frequency tables for all categorical variables with levels’ full description

temp2[[1]]$description <- c("Private",
                            "Self-employed-not incorporated",
                            "Local government",
                            "Unknown values",
                            "State government",
                            "Self-employed-incorporated",
                            "Federal government",
                            "Without pay",
                            "Never worked")

temp2[[2]]$description <- c("High school graduate - high school diploma or equivalent",
                            "Some college but no degree",
                            "Bachelor's degree (for example: BA,AB,BS)",
                            "Master's degree (for example: MA,MS,MENG,MED, MSW, MBA)",
                            "Associate degree in college - occupation/vocation program",
                            "11th grade",
                            "Associate degree in college - academic program",
                            "10th grade",
                            "7th and 8th grade",
                            "Professional school degree (for example: MD,DDS,DVM,LLB,JD)",
                            "9th grade",
                            "12th grade no diploma",
                            "Doctorate degree (for example: PHD,EDD)",
                            "5th or 6th grade",
                            "1st,2nd,3rd,or 4th grade",
                            "Preschool")

temp2[[3]]$description <- c("Married - civilian spouse present",
                           "Never married",
                           "Divorced",
                           "Separated",
                           "Widowed",
                           "Married - spouse absent (exc. separated)",
                           "Married - AF spouse present")
temp2[[6]]$description <- c("White",
                          "Black",
                          "Asian, Native Hawaiian or Other Pacific Islander",
                          "African American, American Indian or Alaskan Native",
                          "Other race")
map(temp2, kable_adj) %>% walk(print)
workclass count prop description
Private 22,696 69.7 Private
Self-emp-not-inc 2,541 7.8 Self-employed-not incorporated
Local-gov 2,093 6.4 Local government
? 1,836 5.6 Unknown values
State-gov 1,298 4.0 State government
Self-emp-inc 1,116 3.4 Self-employed-incorporated
Federal-gov 960 2.9 Federal government
Without-pay 14 0.0 Without pay
Never-worked 7 0.0 Never worked
education count prop description
HS-grad 10,501 32.3 High school graduate - high school diploma or equivalent
Some-college 7,291 22.4 Some college but no degree
Bachelors 5,355 16.4 Bachelor’s degree (for example: BA,AB,BS)
Masters 1,723 5.3 Master’s degree (for example: MA,MS,MENG,MED, MSW, MBA)
Assoc-voc 1,382 4.2 Associate degree in college - occupation/vocation program
11th 1,175 3.6 11th grade
Assoc-acdm 1,067 3.3 Associate degree in college - academic program
10th 933 2.9 10th grade
7th-8th 646 2.0 7th and 8th grade
Prof-school 576 1.8 Professional school degree (for example: MD,DDS,DVM,LLB,JD)
9th 514 1.6 9th grade
12th 433 1.3 12th grade no diploma
Doctorate 413 1.3 Doctorate degree (for example: PHD,EDD)
5th-6th 333 1.0 5th or 6th grade
1st-4th 168 0.5 1st,2nd,3rd,or 4th grade
Preschool 51 0.2 Preschool
marital_status count prop description
Married-civ-spouse 14,976 46.0 Married - civilian spouse present
Never-married 10,683 32.8 Never married
Divorced 4,443 13.6 Divorced
Separated 1,025 3.1 Separated
Widowed 993 3.0 Widowed
Married-spouse-absent 418 1.3 Married - spouse absent (exc. separated)
Married-AF-spouse 23 0.1 Married - AF spouse present
occupation count prop
Prof-specialty 4,140 12.7
Craft-repair 4,099 12.6
Exec-managerial 4,066 12.5
Adm-clerical 3,770 11.6
Sales 3,650 11.2
Other-service 3,295 10.1
Machine-op-inspct 2,002 6.1
? 1,843 5.7
Transport-moving 1,597 4.9
Handlers-cleaners 1,370 4.2
Farming-fishing 994 3.1
Tech-support 928 2.9
Protective-serv 649 2.0
Priv-house-serv 149 0.5
Armed-Forces 9 0.0
relationship count prop
Husband 13,193 40.5
Not-in-family 8,305 25.5
Own-child 5,068 15.6
Unmarried 3,446 10.6
Wife 1,568 4.8
Other-relative 981 3.0
race count prop description
White 27,816 85.4 White
Black 3,124 9.6 Black
Asian-Pac-Islander 1,039 3.2 Asian, Native Hawaiian or Other Pacific Islander
Amer-Indian-Eskimo 311 1.0 African American, American Indian or Alaskan Native
Other 271 0.8 Other race
sex count prop
Male 21,790 66.9
Female 10,771 33.1
native_country count prop
United-States 29,170 89.6
Mexico 643 2.0
? 583 1.8
Philippines 198 0.6
Germany 137 0.4
Canada 121 0.4
Puerto-Rico 114 0.4
El-Salvador 106 0.3
India 100 0.3
Cuba 95 0.3
England 90 0.3
Jamaica 81 0.2
South 80 0.2
China 75 0.2
Italy 73 0.2
Dominican-Republic 70 0.2
Vietnam 67 0.2
Guatemala 64 0.2
Japan 62 0.2
Poland 60 0.2
Columbia 59 0.2
Taiwan 51 0.2
Haiti 44 0.1
Iran 43 0.1
Portugal 37 0.1
Nicaragua 34 0.1
Peru 31 0.1
France 29 0.1
Greece 29 0.1
Ecuador 28 0.1
Ireland 24 0.1
Hong 20 0.1
Cambodia 19 0.1
Trinadad&Tobago 19 0.1
Laos 18 0.1
Thailand 18 0.1
Yugoslavia 16 0.0
Outlying-US(Guam-USVI-etc) 14 0.0
Honduras 13 0.0
Hungary 13 0.0
Scotland 12 0.0
Holand-Netherlands 1 0.0

6.2. Duplicated records

The only difference is the dummy id (based on row number) that I created.

censusData %>% 
  group_by_at(vars(-id)) %>% 
  summarize(dup_group = n()) %>% 
  filter(dup_group>1) %>% 
  select(-dup_group) %>% 
  inner_join(censusData, by = names(censusData)[-(length(censusData)-2)]) %>% 
  kable(format="latex", booktabs=TRUE) %>% 
  kable_styling(latex_options="scale_down", position = "float_left")

6.3. Summary statistics of continuous variables by target groups

colnames(censusData) <- str_replace_all(colnames(censusData), "_", ".")
censusData %>% 
  group_by(income) %>% 
  summarise_at(vars(age, fnlwgt, capital.gain, capital.loss, hours.per.week), funs(min, max, mean, median)) -> s1

s1 %>% 
  gather(variable_measure, value, 2:21) %>% 
  separate(variable_measure, c("variable", "measure"), sep = "_") %>% 
  spread(measure, value) %>% 
  arrange(variable) %>% 
  select(variable, income, min, median, mean, max) %>% 
  kable_adj()
variable income min median mean max
age <=50K 17 34 36.78 90
age >50K 19 44 44.25 90
capital.gain <=50K 0 0 148.75 41,310
capital.gain >50K 0 0 4,006.14 99,999
capital.loss <=50K 0 0 53.14 4,356
capital.loss >50K 0 0 195.00 3,683
fnlwgt <=50K 12,285 179,465 190,340.87 1,484,705
fnlwgt >50K 14,878 176,101 188,005.00 1,226,583
hours.per.week <=50K 1 40 38.84 99
hours.per.week >50K 1 40 45.47 99
colnames(censusData) <- str_replace_all(colnames(censusData), "\\.", "_")

R codes

knitr::opts_chunk$set(warning = FALSE, message = FALSE, comment = NA, fig.align='center')
# knitr::opts_chunk$set(fig.width = 15, fig.height = 14)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
options(dplyr.width = Inf)
options(knitr.table.format = "html")
options(knitr.kable.NA = '')

prop_summary <- function(df, ...){
df %>%
  group_by(...) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  mutate(prop = round(count/sum(count)*100,1))
}

kable_adj <- function(data) {
  knitr::kable(data, booktabs = TRUE, digits = 2, format.args = list(big.mark = ",")) %>% 
    kableExtra::kable_styling(full_width = F, position = "left") %>% 
    row_spec(0, bold = T, color = "white", background = "#696969")
}
library("tidyverse")
library("lubridate")
library("knitr")
library("ggpubr")
library("kableExtra")
#import data
censusData <- read.csv("censusData.csv", stringsAsFactors = T, strip.white = TRUE) %>% tbl_df()
censusData_backup <- censusData

#rename to replace "." with "_"
colnames(censusData) <- str_replace_all(colnames(censusData), "\\.", "_")
variables_class <- as.data.frame(lapply(censusData, class)) %>% 
  .[1,] %>% 
  gather(variable, R_class , 1:ncol(.)) %>% 
  mutate(data_type = ifelse(R_class == "factor", "categorical", 
                            ifelse(variable == "education_num","categorical","continuous")))
description <- c("age of the individual",
"type of employer the individual has",
"number of people in the universe (U.S. population) that each individual in the survey can represent",
"highest level of education that individual achieved",
"highest level of education in a numerical form",
"marital status of the individual",
"occupation of the individual",
"primary family relationship (the relationship of the individual in his/her primary family)",
"the individual's ethnicity",
"gender",
"capital gains (from selling an asset such as a stock, bond for more than the original purchase price)",
"capital losses (from selling an asset such as a stock, bond for less than the original purchase price)",
"hours worked per week",
"country of origin",
"(target) whether or not the person makes more than $50,000 per annum income")

variables_class$description <- description
variables_class %>% 
  arrange(data_type, variable) %>% 
  kable_adj() %>% 
  row_spec(11:15, color = "white", background = "#696969") %>% 
  column_spec(4, width = "10cm")
#preview data structure and example values
glimpse(censusData)
#pre-processing data without printing out - will be discussed in the pre-processing section
censusData %>% 
  select(education_num, education) %>% 
  distinct() %>% 
  arrange(education_num) -> edu_order

censusData$education <- ordered(censusData$education, levels = edu_order$education)

censusData <- censusData %>% select(-education_num) %>% mutate(id = row_number())
censusData %>% 
  select(-id) %>% 
  summarize(no_of_observations = n(),
            no_of_distinct = n_distinct(.)) %>% 
  mutate(no_of_dup = no_of_observations - no_of_distinct) %>% 
  mutate(pct_dup = round(no_of_dup/no_of_observations*100,1))%>% kable_adj()
censusData %>% 
  group_by_at(vars(-id)) %>% 
  summarize(dup_group = n()) %>% 
  group_by(dup_group) %>% 
  summarize(count = n()) %>% kable_adj()
#check 'normal' missing values
# map_df(censusData, ~sum(is.na(.)))

#'check missing values labeled as "?"
total_missing <- map_df(censusData, ~str_detect(., pattern = "\\?")) %>% 
  rowSums() %>% 
  tbl_df() %>% 
  filter(value > 0) %>% 
  summarize(no_of_missing = n()) %>% 
  mutate(pct_missing = round(no_of_missing/ nrow(censusData)*100,1),
         variable = "all variables") %>% 
  select(variable, no_of_missing, pct_missing)


map_df(censusData, ~sum(str_detect(., pattern = "\\?"))) %>% 
  gather(variable, no_of_missing, 1:15) %>% 
  mutate(pct_missing = round(no_of_missing/ nrow(censusData)*100,1)) %>% 
  bind_rows(total_missing) %>% 
  arrange(desc(no_of_missing)) %>% 
  kable_adj()
income_s <- prop_summary(censusData,income)

income_c <- censusData %>% 
  filter_all(all_vars(.!= "?")) %>% 
  prop_summary(income) %>%
  mutate(dataset = "without missing values") %>% 
  bind_rows(income_s %>% mutate(dataset = "with missing values"))

ggplot(income_c, aes(dataset, prop))+
  geom_col(aes(fill=income))+
  labs(y="Percentage")+
  geom_text(aes(label=prop), hjust=2.5)+
  coord_flip()+
  theme(legend.position="bottom")

censusData %>% 
  group_by(age, education) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  ggplot(., aes(age, education))+
  geom_point(aes(size = count, color = (age <= 20 & education =="Masters")), pch = 19, alpha = 0.7)+
  scale_x_continuous(breaks = seq(0,100,10))+
  scale_colour_manual(values = setNames(c('red','grey'),c(T, F)))+
  theme(legend.position="bottom")+
  labs(y="")
censusData %>% 
  group_by(sex, relationship) %>% 
  summarize(count = n()) %>% 
  ggplot(., aes(relationship, sex))+
  geom_text(aes(label=count, color = (sex == "Male" & relationship == "Wife") | (sex == "Female" & relationship == "Husband")))+
  scale_colour_manual(values = setNames(c('red','black'),c(T, F)))+
  theme(legend.position="bottom")
censusData %>% 
  ggplot(., aes(age, hours_per_week))+
  geom_jitter(aes(col = (age == 90)),alpha = 0.2)+
  scale_x_continuous(breaks = seq(0,100,10))+
  scale_y_continuous(breaks = seq(0,100,10))+
  scale_colour_manual(values = setNames(c('red','black'),c(T, F)))+
  theme(legend.position="bottom")
ggplot(income_c %>% filter(dataset == "with missing values"), aes(dataset, prop))+
  geom_col(aes(fill=income))+
  labs(y="Percentage")+
  geom_text(aes(label=prop), hjust=2.5)+
  coord_flip()+
  labs(x ="")+
  scale_x_discrete(labels=c("with missing values" = ""))+
  theme_minimal()+
  theme(legend.position="right")
variables_class$unique_value <- sapply(censusData_backup, n_distinct)
variables_class$levels <- c(NA,
                            "?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay",
                            NA,
                            "Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-grad, Some-college, Assoc-voc, Assoc-acdm, Bachelors, Masters, Prof-school, Doctorate",
                            "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16",
                            "Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed",
                            "?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv,     Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving",
                            "Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife",
                            "Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White",
                            "Female, Male",
                            NA,
                            NA,
                            NA,
                            "?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia",
                            "<=50K, >50K")


variables_class %>% 
  select(-description, - R_class) %>% 
  filter(data_type == "categorical") %>% 
  arrange(unique_value, variable) %>% 
  kable_adj() %>% 
  column_spec(4, width = "10cm")
#summary table
y <- censusData %>% summary(maxsum = 42) %>% as.data.frame.matrix(row.names = F) 


#generate counting numbers and empty list
a<-1
b<-1
temp2 <- list()
temp3 <- list()

categorical_list <- variables_class %>% filter(R_class == "factor", variable !="income")
integer_list <- variables_class %>% filter(R_class == "integer", !variable  %in% c("id", "education_num"))

#splitting into categorical and continuous groups
for (i in seq_along(y)){
  y %>% 
    select(i) %>% 
    na.omit() -> temp
  if (str_trim(colnames(temp)[1]) %in% categorical_list[,1] ){
    temp %>% 
      separate(1,c(str_trim(colnames(temp)[1]), "count"), sep = ":") %>% 
      mutate_all(~str_trim(.)) %>% 
      mutate(count = as.numeric(count), prop = round(count / sum(count) * 100,1)) %>% 
      arrange(desc(count))-> temp2[[a]]
    a <- a + 1
  }
  if (str_trim(colnames(temp)[1]) %in% integer_list[,1] ){
    temp %>% 
      separate(1,c(str_trim(colnames(temp)[1]), "distribution"), sep = ":") %>% 
      mutate_all(~str_trim(.)) %>% 
      mutate(distribution = as.numeric(distribution)) -> temp3[[b]]
    b <- b + 1
  }
}
censusData %>%
  gather(variable, value, - id) %>% 
  mutate(value = as.integer(value))-> censusData_long

lapply(seq(temp2), function(i) {
    y <- data.frame(temp2[[i]])
    y$variable = names(y)[1]
    names(y)[1] <- c("levels")
    return(y)
  }) %>% bind_rows() -> chart1

ggplot(chart1 %>% filter(variable %in% c("native_country", "race")), aes(reorder(levels,prop), prop))+
  geom_col(fill= "#56B4E9", alpha = 0.7)+
  facet_grid(variable~., space = "free", scales = "free")+
  coord_flip()+
  labs(title = "Categorical variables - Frequency bar chart - part 1",
       x = "",
       y = "Percentage" )+
  geom_text(aes(label = round(prop,0)), hjust = -1)+
  scale_y_continuous(breaks = seq(0,100,10))
censusData %>%
  gather(variable, value, - id) %>% 
  mutate(value = as.integer(value))-> censusData_long

lapply(seq(temp2), function(i) {
    y <- data.frame(temp2[[i]])
    y$variable = names(y)[1]
    names(y)[1] <- c("levels")
    return(y)
  }) %>% bind_rows() %>% mutate(levels = factor(levels))-> chart1

ggplot(chart1 %>% filter(!variable %in% c("native_country", "race")), aes(reorder(levels,prop), prop))+
  geom_col(fill= "#56B4E9", alpha = 0.7)+
  facet_wrap(~variable, ncol =2, scales = "free_y")+
  coord_flip()+
  labs(title = "Categorical variables - Frequency bar chart - part 2",
       x = "",
       y = "Percentage" )+
  geom_text(aes(label = round(prop,0)), hjust = -0.5)+
  scale_y_continuous(breaks = seq(0,90,10), limits = c(0,80))

censusData %>%
  summarize(high = mean(income ==">50K")) -> benchmark

censusData %>%
  gather(variable, value, - id, - income) -> censusData_long1
censusData_long1 %>%
  filter(variable %in% categorical_list$variable) %>%
  group_by(variable, value) %>%
  summarize(count = n(),
            pct_gt50 = round(mean(income == ">50K")*100,2)) %>% 
  mutate(pct_count = round(count/ nrow(censusData)*100,0)) -> summary1


summary1 %>%
  filter(variable!="native_country", variable!="race") %>% 
  ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) + 
  geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
  scale_y_continuous(breaks = seq(-10,80,10))+
  geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
  geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
  facet_wrap(.~variable, scales = "free_y", ncol = 2)+
  coord_flip() + 
  labs(title = "Categorical variables - Proportion of high income (>50K) in each demographic group - part 2",
       x = "",
       y = "Percentage of high income group" )+
  theme(legend.position = "none")
summary1 %>%
  filter(variable=="native_country" | variable=="race") %>% 
  ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) + 
  geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
  scale_y_continuous(breaks = seq(-10,80,10))+
  geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
  geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
  facet_grid(variable~., scales = "free_y", space = "free")+
  coord_flip() + 
  labs(title = "Categorical variables - Proportion of \n high income (>50K) in each demographic group - part 1",
       x = "",
       y = "Percentage of high income group" ) +
  theme(legend.position = "none")
lapply(seq(temp3), function(i) {
    y <- data.frame(temp3[[i]])
    y$variable = names(y)[1]
    names(y)[1] <- c("stats")
    return(y)
  }) %>% bind_rows() -> chart2

chart2%>% 
  spread(stats, distribution) %>% select(variable, Min., `1st Qu.`, Median, Mean, `3rd Qu.`, Max.) %>% kable_adj()
censusData_long1 %>% 
  filter(variable %in% integer_list$variable) %>% 
  tbl_df() %>% 
  mutate(value = as.integer(value))-> censusData_int

chart2$stats <- ordered(chart2$stats, levels = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max."))
chart2 %>% 
  filter(stats %in% c("Min.", "Max.")) %>% 
  spread(stats, distribution) %>% 
  mutate(diff = Max. - Min.)-> t1
censusData_int %>% 
  ggplot(., aes(1,value))+
  geom_boxplot()+
  facet_wrap(~variable, scales = "free")+
  coord_flip()
#check whether there is any people having both capital_gain and capital_loss positive values
censusData %>% 
  group_by(capital_gain, capital_loss) %>% 
  summarize(count = n()) %>% 
  filter(capital_gain != 0 & capital_loss != 0)

#none of them
#create capital column
censusData <- censusData %>% 
  mutate(capital = capital_gain - capital_loss)
#count no. of "0" value
censusData %>% 
  prop_summary(capital) %>% 
  filter(capital == 0) %>% 
  kable_adj()
capital_summary <- summary(censusData$capital) %>% 
  as.list() %>% 
  tbl_df() %>% 
  gather(stats, distribution, 1:6)
censusData %>% 
  ggplot(. , aes(capital))+
  geom_histogram(fill= "grey", alpha = 0.7)+
  geom_vline(data = capital_summary, aes(xintercept = distribution, col = stats), size=1, alpha = 0.5)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Capital gain and loss distribution",
       x="Capital",
       y="No. of records")
censusData %>% 
  filter(capital != 0) %>% 
  ggplot(. , aes(capital))+
  geom_histogram(fill= "grey", alpha = 0.7)+
  geom_vline(data = capital_summary, aes(xintercept = distribution, col = stats), size=1)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Capital gain and loss distribution - After remove 0 values",
       x="Capital",
       y="No. of records")
censusData %>% 
  filter(capital != 0 & capital != 99999) %>% 
  ggplot(. , aes(capital))+
  geom_histogram(aes(fill = income), binwidth = 500, alpha = 0.7)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Capital gain and loss distribution - After remove 0 and 99999 values",
       x="Capital",
       y="No. of records")+
  geom_vline(xintercept = c(-1750, 7250), linetype = "dashed", alpha = 0.5, size =0.5)+
  scale_x_continuous(breaks = seq(-10000, 40000, 10000))+
  theme(legend.position = "bottom")
censusData %>% 
  prop_summary(capital) %>% 
  arrange(desc(capital)) %>% 
  head(n = 10) %>% 
  kable_adj()
censusData_int %>% 
  filter(variable == "age") %>% 
  ggplot(. , aes(value))+
  geom_histogram(binwidth = 2, fill= "grey", alpha = 0.7)+
  # facet_wrap(~variable,scales = "free", ncol = 1)+
  geom_vline(data = chart2 %>% filter(variable == "age"), aes(xintercept = distribution, col = stats), size=1)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Age distribution",
       x="Age",
       y="No. of records")+
  scale_x_continuous(breaks=seq(0,100,10))+
  theme(legend.position = "right")

censusData %>% 
  group_by(income, age) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  group_by(age) %>% 
  mutate(prop = count / sum(count)) %>% 
  filter(income == ">50K", prop >= benchmark$high) -> mark

censusData_int %>% 
  filter(variable == "age") %>% 
  ggplot(., aes(value))+
  geom_freqpoly(aes(col= income), binwidth = 1)+
  geom_point(data = mark, aes(age, count))+
  geom_vline(xintercept = c(35,60), linetype = "dashed")+
  geom_vline(xintercept = c(22), linetype = "dashed", col = "red")+
  scale_x_continuous(breaks = round(seq(min(censusData$age), max(censusData$age), by = 10),-1))+
  scale_colour_hue()+
  theme(legend.position = "right")+
  labs(title = "Age distribution by income groups - Observation count",
       x ="Age",
       y = "Number of Observations")

censusData %>% 
  group_by(income, age) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  group_by(age) %>% 
  mutate(prop = count / sum(count)) %>%
  ggplot(., aes(age, prop, fill = income))+
  geom_area()+
  geom_point(data = mark, aes(age, prop))+
  geom_hline(yintercept = benchmark$high, linetype = "dashed")+
  geom_vline(xintercept = c(22), linetype = "dashed", col = "red")+
  geom_vline(xintercept = c(35,60), linetype = "dashed")+
  # geom_vline(xintercept = seq(17, 90, by = 10), col = "black")+
  scale_x_continuous(breaks = round(seq(min(censusData$age), max(censusData$age), by = 10),-1))+
  theme(legend.position = "right") +
  labs(title = "Age distribution by income groups - Proportion",
       x ="Age",
       y = "Proportion")
censusData_int %>% 
  filter(variable == "hours_per_week") %>% 
  ggplot(. , aes(value))+
  geom_histogram(binwidth = 2, fill= "grey", alpha = 0.7)+
  geom_vline(data = chart2 %>% filter(variable == "hours_per_week"), aes(xintercept = distribution, col = stats), size=1)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Working hours per week distribution",
       x="Hours per week",
       y="No. of records")+
  scale_x_continuous(breaks=seq(0,100,10))

censusData_int %>% 
  filter(variable == "hours_per_week") %>% 
  ggplot(., aes(value))+
  geom_freqpoly(aes(col= income), binwidth = 1)+
  scale_x_continuous(breaks = round(seq(min(censusData$hours_per_week), max(censusData$hours_per_week), by = 10),-1))+
  geom_vline(xintercept = 34, linetype = "dashed")+
  scale_colour_hue()+
  theme(legend.position = "right")+
  labs(title = "Hours per week distribution by income groups - Observation count",
       x ="Hours per week",
       y = "Number of Observations")

censusData %>% 
  group_by(income, hours_per_week) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  group_by(hours_per_week) %>% 
  mutate(prop = count / sum(count)) %>%
  ggplot(., aes(hours_per_week, prop, fill = income))+
  geom_area()+
  geom_hline(yintercept = benchmark$high, linetype = "dashed")+
  scale_x_continuous(breaks = round(seq(min(censusData$hours_per_week), max(censusData$hours_per_week), by = 10),-1))+
  geom_vline(xintercept = 34, linetype = "dashed")+
  theme(legend.position = "right") +
  labs(title = "Hours per week distribution by income groups - Proportion",
       x ="Hours per week",
       y = "Proportion")
censusData %>% 
  filter(workclass == "Never-worked") %>% 
  select(-capital_gain, - capital_loss, - fnlwgt, - capital, - relationship, - race, - sex, - native_country) %>%
  kable_adj()
# pairs(censusData %>% select(age, hours_per_week, fnlwgt, capital), pch =21, cex=0.1)

panel.cor <- function(x, y, ...)
{
par(usr = c(0, 1, 0, 1))
txt <- as.character(format(cor(x, y), digits=2))
text(0.5, 0.5, txt, cex = 60* abs(cor(x, y)))
}
pairs(censusData %>% select(age, hours_per_week, capital), upper.panel=panel.cor)
summary1 %>%
  filter(variable=="education") %>% 
  ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) + 
  geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
  scale_y_continuous(breaks = seq(-10,80,10))+
  geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
  geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
  geom_vline(xintercept = c(8.5, 12.5), alpha = 0.5, linetype = "dashed", col = "red")+
  coord_flip() + 
  labs(title = "Categorical variables - \n Proportion of high income (>50K) in education groups",
       x = "",
       y = "Percentage of high income group" )+
  theme(legend.position = "none")
censusData <- censusData %>% 
  mutate(capital_group = case_when(capital< (-1750) ~ "[-inf, -1750)",
                                   capital>= (-1750) & capital<0 ~ "[-1750, 0)",
                                   capital== 0 ~ "0",
                                   capital>0 & capital<=7250 ~ "(0,7250]",
                                   capital>7250 ~ "(7250, +inf]"))
censusData$capital_group <- ordered(censusData$capital_group, levels = c("[-inf, -1750)","[-1750, 0)","0","(0,7250]","(7250, +inf]"))
censusData %>% 
  ggplot(., aes(capital_group))+
  geom_bar(aes(fill = income), position = "fill")+
  labs(y = "Percentage")+
  geom_hline(yintercept = 0.241, linetype = "dashed")
temp2[[1]]$description <- c("Private",
                            "Self-employed-not incorporated",
                            "Local government",
                            "Unknown values",
                            "State government",
                            "Self-employed-incorporated",
                            "Federal government",
                            "Without pay",
                            "Never worked")

temp2[[2]]$description <- c("High school graduate - high school diploma or equivalent",
                            "Some college but no degree",
                            "Bachelor's degree (for example: BA,AB,BS)",
                            "Master's degree (for example: MA,MS,MENG,MED, MSW, MBA)",
                            "Associate degree in college - occupation/vocation program",
                            "11th grade",
                            "Associate degree in college - academic program",
                            "10th grade",
                            "7th and 8th grade",
                            "Professional school degree (for example: MD,DDS,DVM,LLB,JD)",
                            "9th grade",
                            "12th grade no diploma",
                            "Doctorate degree (for example: PHD,EDD)",
                            "5th or 6th grade",
                            "1st,2nd,3rd,or 4th grade",
                            "Preschool")

temp2[[3]]$description <- c("Married - civilian spouse present",
                           "Never married",
                           "Divorced",
                           "Separated",
                           "Widowed",
                           "Married - spouse absent (exc. separated)",
                           "Married - AF spouse present")
temp2[[6]]$description <- c("White",
                          "Black",
                          "Asian, Native Hawaiian or Other Pacific Islander",
                          "African American, American Indian or Alaskan Native",
                          "Other race")

map(temp2, kable_adj) %>% walk(print)
censusData %>% 
  group_by_at(vars(-id)) %>% 
  summarize(dup_group = n()) %>% 
  filter(dup_group>1) %>% 
  select(-dup_group) %>% 
  inner_join(censusData, by = names(censusData)[-(length(censusData)-2)]) %>% 
  kable(format="latex", booktabs=TRUE) %>% 
  kable_styling(latex_options="scale_down", position = "float_left")
colnames(censusData) <- str_replace_all(colnames(censusData), "_", ".")
censusData %>% 
  group_by(income) %>% 
  summarise_at(vars(age, fnlwgt, capital.gain, capital.loss, hours.per.week), funs(min, max, mean, median)) -> s1

s1 %>% 
  gather(variable_measure, value, 2:21) %>% 
  separate(variable_measure, c("variable", "measure"), sep = "_") %>% 
  spread(measure, value) %>% 
  arrange(variable) %>% 
  select(variable, income, min, median, mean, max) %>% 
  kable_adj()

colnames(censusData) <- str_replace_all(colnames(censusData), "\\.", "_")

References

Becker, Barry, and Ron Kohavi. 1996. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. https://archive.ics.uci.edu/ml/machine-learning-databases/adult/.

Census, United States. Bureau of the. 1995. “Current Population Survey: Annual Demographic File, 1994.” Inter-university Consortium for Political; Social Research [distributor]. https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/6461/version/1.

Ethnic, and Hispanic Statistics Branch. n.d. Population Division, U.S. Bureau of the Census Current Population Survey. https://www2.census.gov/programs-surveys/demo/tables/foreign-born/1995/cps1995/tab0108.pdf.

Han, Jiawei., and Jian. Pei. 2012. “Data Mining: Concepts and Techniques.” London;Amsterdam; : Morgan Kaufmann. http://capitadiscovery.co.uk/brighton-ac/items/1431592.

IPUMS-CPS. n.d. University of Minnesota. https://cps.ipums.org/cps-action/variables/CAPGAIN#codes_section.

O’Hara, Amy. 2004. “New Methods for Simulating Cps Taxes.” Mimeo, U.S. Census. https://www.census.gov/content/dam/Census/library/working-papers/2004/demo/oharataxmodel.pdf.