library("tidyverse")
library("lubridate")
library("knitr")
library("ggpubr")
library("kableExtra")

1. Introduction

1.1. Report objectives

The main objective is to analyze the characteristics and quality of Census data. Then, several pre-processing considerations are proposed to enhance the performance of income classification prediction.

1.2. Data set introduction

The chosen data set contains key characteristics of the United States’ population in 1994. It was extracted and donor by (1996). A basic description can be found at UCI Machine Learning Repository. However, because the original source link is not valid, it is challenging to gain a better understanding of data collection methodology and related features. However, based on my desk research, it seems that the data set comes from the Current Population Survey (CPS), which was conducted by the United States Bureau of the Census.

Data collection method: A nationwide sample of 57,000 housing units in 1,973 counties were personally interviewed (Census 1995).

Data extraction conditions: According to UCI, data was extracted using the following rules: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)), which means individuals with:

age > 16
adjusted gross income > $100
final weighting > 1
working hours per week > 0

The extracted data set has 48,842 records and was randomly split into training and test set at ratio 2/3 and 1/3 respectively.

In this report, only the training data set, which includes 32,561 records and 15 attributes, will be analysed. Each record contains data of one respondent and each column describes one feature of that person. 10 variables are categorical and 5 variables are continuous as shown below.

#import data
censusData <- read.csv("censusData.csv", stringsAsFactors = T, strip.white = TRUE) %>% tbl_df()
censusData_backup <- censusData

#rename to replace "." with "_"
colnames(censusData) <- str_replace_all(colnames(censusData), "\\.", "_")

Variable type and description summary

variables_class <- as.data.frame(lapply(censusData, class)) %>% 
  .[1,] %>% 
  gather(variable, R_class , 1:ncol(.)) %>% 
  mutate(data_type = ifelse(R_class == "factor", "categorical", 
                            ifelse(variable == "education_num","categorical","continuous")))

variables_class %>% 
  arrange(data_type, variable) %>% 
  kable_adj() %>% 
  row_spec(11:15, color = "white", background = "#696969") %>% 
  column_spec(4, width = "10cm")

variable	R_class	data_type	description
education	factor	categorical	highest level of education that individual achieved
education_num	integer	categorical	highest level of education in a numerical form
income	factor	categorical	(target) whether or not the person makes more than $50,000 per annum income
marital_status	factor	categorical	marital status of the individual
native_country	factor	categorical	country of origin
occupation	factor	categorical	occupation of the individual
race	factor	categorical	the individual’s ethnicity
relationship	factor	categorical	primary family relationship (the relationship of the individual in his/her primary family)
sex	factor	categorical	gender
workclass	factor	categorical	type of employer the individual has
age	integer	continuous	age of the individual
capital_gain	integer	continuous	capital gains (from selling an asset such as a stock, bond for more than the original purchase price)
capital_loss	integer	continuous	capital losses (from selling an asset such as a stock, bond for less than the original purchase price)
fnlwgt	integer	continuous	number of people in the universe (U.S. population) that each individual in the survey can represent
hours_per_week	integer	continuous	hours worked per week

Note: fnlwgt is weighting value that ensures sample from this survey can represent the ‘true status’ of U.S. population. In general, a respondent gets a high weight if the proportion of that type of people in the survey is small compared to its size in the whole population. When applying machine learning method, this variable might be used as a weighted field to give extra importance to underrepresented groups. However, there is one note in (Census 1995) mentioned there were many missing values in this field, and hence, this field should not be used. In addition, the applied extraction rule is fnlwgt >1, which might remove that missing part of the sample and those whose weights are equal to 1. This may make the usability of this field as well as the sample representative in this extracted data set questionable.

Data structure and example values:

#preview data structure and example values
glimpse(censusData)

Observations: 32,561
Variables: 15
$ age            <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30,...
$ workclass      <fct> State-gov, Self-emp-not-inc, Private, Private, ...
$ fnlwgt         <int> 77516, 83311, 215646, 234721, 338409, 284582, 1...
$ education      <fct> Bachelors, Bachelors, HS-grad, 11th, Bachelors,...
$ education_num  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13,...
$ marital_status <fct> Never-married, Married-civ-spouse, Divorced, Ma...
$ occupation     <fct> Adm-clerical, Exec-managerial, Handlers-cleaner...
$ relationship   <fct> Not-in-family, Husband, Not-in-family, Husband,...
$ race           <fct> White, White, White, Black, Black, White, Black...
$ sex            <fct> Male, Male, Male, Male, Female, Female, Female,...
$ capital_gain   <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0...
$ capital_loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ hours_per_week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40,...
$ native_country <fct> United-States, United-States, United-States, Un...
$ income         <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K...

There are some interesting features from these examples that can be used as general guidance for pre-processing or further exploration:

education & education_num are redundant as they contain the same information in different forms.
capital_gain and capital_loss might be considered to combine into one attribute as it can represent one feature (capital).
Some values seem to have very high frequency, such as “0” in capital_gain and capital_loss, “40” in hours_per_week, “United- States” in native_country and “<=50K” in target variable income.

2. Initial Data Quality Assessment

#pre-processing data without printing out - will be discussed in the pre-processing section
censusData %>% 
  select(education_num, education) %>% 
  distinct() %>% 
  arrange(education_num) -> edu_order

censusData$education <- ordered(censusData$education, levels = edu_order$education)

censusData <- censusData %>% select(-education_num) %>% mutate(id = row_number())

2.1. Duplication

There are 24 duplicated rows (identical in all attributes). 22 of them are repeated 2 times and one is repeated 3 times. A full list of duplicated values is included in Appendix (6.2).

censusData %>% 
  select(-id) %>% 
  summarize(no_of_observations = n(),
            no_of_distinct = n_distinct(.)) %>% 
  mutate(no_of_dup = no_of_observations - no_of_distinct) %>% 
  mutate(pct_dup = round(no_of_dup/no_of_observations*100,1))%>% kable_adj()

no_of_observations	no_of_distinct	no_of_dup	pct_dup
32,561	32,537	24	0.1

censusData %>% 
  group_by_at(vars(-id)) %>% 
  summarize(dup_group = n()) %>% 
  group_by(dup_group) %>% 
  summarize(count = n()) %>% kable_adj()

dup_group	count
1	32,514
2	22
3	1

2.2. Missing values

According to the description on UCI, “Unknown” values are labelled as “?”. In total, there are 2,399 incomplete records (~7.4%). Specifically, there are missing values in occupation, workclass and native_country, which accounts for 5.7%, 5.6% and 1.8% of the total records respectively.

#check 'normal' missing values
# map_df(censusData, ~sum(is.na(.)))

#'check missing values labeled as "?"
total_missing <- map_df(censusData, ~str_detect(., pattern = "\\?")) %>% 
  rowSums() %>% 
  tbl_df() %>% 
  filter(value > 0) %>% 
  summarize(no_of_missing = n()) %>% 
  mutate(pct_missing = round(no_of_missing/ nrow(censusData)*100,1),
         variable = "all variables") %>% 
  select(variable, no_of_missing, pct_missing)


map_df(censusData, ~sum(str_detect(., pattern = "\\?"))) %>% 
  gather(variable, no_of_missing, 1:15) %>% 
  mutate(pct_missing = round(no_of_missing/ nrow(censusData)*100,1)) %>% 
  bind_rows(total_missing) %>% 
  arrange(desc(no_of_missing)) %>% 
  kable_adj()

variable	no_of_missing	pct_missing
all variables	2,399	7.4
occupation	1,843	5.7
workclass	1,836	5.6
native_country	583	1.8
age	0	0.0
fnlwgt	0	0.0
education	0	0.0
marital_status	0	0.0
relationship	0	0.0
race	0	0.0
sex	0	0.0
capital_gain	0	0.0
capital_loss	0	0.0
hours_per_week	0	0.0
income	0	0.0
id	0	0.0

However, at the top-line level, the missing values seem to have a minimal impact on the income group contribution. The odd between <=50K & >50K income group with or without missing values are quite similar to each other as shown in the graph below. In addition, some algorithms (e.g. C5.0, CR&T) have their own techniques (e.g. fractional cases, surrogate splits) to handle missing values. Therefore, I kept using the original data set for the following analysis. However, we might need to justify the impact of missing values on the model performance later one. If the performance is poor, we might consider removing incomplete records (or attributes) or imputing missing values. As the missing values are categorical, they might be defined according to the most frequent values or using classification methods to predict the missing values based on other attributes (excluding target variable income). However, these methods can not guarantee the accuracy of imputed values and might introduce bias. Furthermore, the attributes with missing values have many levels (14 occupations and 41 native countries), which makes missing values imputation even more challenging.

income_s <- prop_summary(censusData,income)

income_c <- censusData %>% 
  filter_all(all_vars(.!= "?")) %>% 
  prop_summary(income) %>%
  mutate(dataset = "without missing values") %>% 
  bind_rows(income_s %>% mutate(dataset = "with missing values"))

ggplot(income_c, aes(dataset, prop))+
  geom_col(aes(fill=income))+
  labs(y="Percentage")+
  geom_text(aes(label=prop), hjust=2.5)+
  coord_flip()+
  theme(legend.position="bottom")

2.3. Other logic checks

Based on the relationship between different attributes, the majority of observations seem to be valid, except very few abnormal values as following:

2 respondents achieved Masters degrees at a very young age (<=20), which is shown as red points in the graph below. It might be true, yet worth to double check if there are many observations like this. In this context, it is reasonable to keep it as it is.

censusData %>% 
  group_by(age, education) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  ggplot(., aes(age, education))+
  geom_point(aes(size = count, color = (age <= 20 & education =="Masters")), pch = 19, alpha = 0.7)+
  scale_x_continuous(breaks = seq(0,100,10))+
  scale_colour_manual(values = setNames(c('red','grey'),c(T, F)))+
  theme(legend.position="bottom")+
  labs(y="")

3 cases that show the conflict between age and relationship (“Husband” having “Female” gender and "the other way round). The genders or relationships of these cases might be recorded incorrectly.

censusData %>% 
  group_by(sex, relationship) %>% 
  summarize(count = n()) %>% 
  ggplot(., aes(relationship, sex))+
  geom_text(aes(label=count, color = (sex == "Male" & relationship == "Wife") | (sex == "Female" & relationship == "Husband")))+
  scale_colour_manual(values = setNames(c('red','black'),c(T, F)))+
  theme(legend.position="bottom")

There are many people aged 90 work at least one hour per week. There are some even work more 40 hours and one works 99 hours per week, which seems to be abnormal.

censusData %>% 
  ggplot(., aes(age, hours_per_week))+
  geom_jitter(aes(col = (age == 90)),alpha = 0.2)+
  scale_x_continuous(breaks = seq(0,100,10))+
  scale_y_continuous(breaks = seq(0,100,10))+
  scale_colour_manual(values = setNames(c('red','black'),c(T, F)))+
  theme(legend.position="bottom")

3. Data exploration

3.1. Categorical variables

Target variable income has two levels “<=50K” and “>50K” with the odds ratio approximately 3:1. The labelled classes are not extremely imbalanced and seem to be equally important in this task. Therefore, this ratio might not have a severe impact on model performance.

ggplot(income_c %>% filter(dataset == "with missing values"), aes(dataset, prop))+
  geom_col(aes(fill=income))+
  labs(y="Percentage")+
  geom_text(aes(label=prop), hjust=2.5)+
  coord_flip()+
  labs(x ="")+
  scale_x_discrete(labels=c("with missing values" = ""))+
  theme_minimal()+
  theme(legend.position="right")

Note: From this on, for convenience, I am going to mention “<=50K” and “>50K” as low and high income group respectively.

However, there is a big gap between the proportion of high income in this sample (~24%) and the results publicized by U.S. Bureau of the Census (n.d.) (~10%). This re-enforces my concern about the data representativeness.

variables_class$unique_value <- sapply(censusData_backup, n_distinct)
variables_class$levels <- c(NA,
                            "?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay",
                            NA,
                            "Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-grad, Some-college, Assoc-voc, Assoc-acdm, Bachelors, Masters, Prof-school, Doctorate",
                            "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16",
                            "Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed",
                            "?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv,     Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving",
                            "Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife",
                            "Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White",
                            "Female, Male",
                            NA,
                            NA,
                            NA,
                            "?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia",
                            "<=50K, >50K")


variables_class %>% 
  select(-description, - R_class) %>% 
  filter(data_type == "categorical") %>% 
  arrange(unique_value, variable) %>% 
  kable_adj() %>% 
  column_spec(4, width = "10cm")

variable	data_type	unique_value	levels
income	categorical	2	<=50K, >50K
sex	categorical	2	Female, Male
race	categorical	5	Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White
relationship	categorical	6	Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife
marital_status	categorical	7	Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed
workclass	categorical	9	?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay
occupation	categorical	15	?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving
education	categorical	16	Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-grad, Some-college, Assoc-voc, Assoc-acdm, Bachelors, Masters, Prof-school, Doctorate
education_num	categorical	16	1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
native_country	categorical	42	?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia

Some categorical variables have quite many levels, esp. native_country, education with 42 and 16 distinct values respectively.

censusData %>%
  gather(variable, value, - id) %>% 
  mutate(value = as.integer(value))-> censusData_long

lapply(seq(temp2), function(i) {
    y <- data.frame(temp2[[i]])
    y$variable = names(y)[1]
    names(y)[1] <- c("levels")
    return(y)
  }) %>% bind_rows() -> chart1

ggplot(chart1 %>% filter(variable %in% c("native_country", "race")), aes(reorder(levels,prop), prop))+
  geom_col(fill= "#56B4E9", alpha = 0.7)+
  facet_grid(variable~., space = "free", scales = "free")+
  coord_flip()+
  labs(title = "Categorical variables - Frequency bar chart - part 1",
       x = "",
       y = "Percentage" )+
  geom_text(aes(label = round(prop,0)), hjust = -1)+
  scale_y_continuous(breaks = seq(0,100,10))

The U.S. accounts for 90% in native_country attribute while each of other 40 countries account for less than 1% except Mexico (2%). Similarly, “White” is dominant in race feature with 85%. However, race has fewer levels than native_country and might be more effective in this task.

###{#gender}

censusData %>%
  gather(variable, value, - id) %>% 
  mutate(value = as.integer(value))-> censusData_long

lapply(seq(temp2), function(i) {
    y <- data.frame(temp2[[i]])
    y$variable = names(y)[1]
    names(y)[1] <- c("levels")
    return(y)
  }) %>% bind_rows() %>% mutate(levels = factor(levels))-> chart1

ggplot(chart1 %>% filter(!variable %in% c("native_country", "race")), aes(reorder(levels,prop), prop))+
  geom_col(fill= "#56B4E9", alpha = 0.7)+
  facet_wrap(~variable, ncol =2, scales = "free_y")+
  coord_flip()+
  labs(title = "Categorical variables - Frequency bar chart - part 2",
       x = "",
       y = "Percentage" )+
  geom_text(aes(label = round(prop,0)), hjust = -0.5)+
  scale_y_continuous(breaks = seq(0,90,10), limits = c(0,80))

High-school graduate, Some college and Bachelors are the most common education levels, which accounted for 70% of survey responses.
Sample seems to be equally distributed in terms of occupation, yet quite imbalance in terms of sex and workclass. The high proportion of Private workclass might reflect the real world while the gap between genders might result from sample design methodology or data extraction rules.

censusData %>%
  summarize(high = mean(income ==">50K")) -> benchmark

censusData %>%
  gather(variable, value, - id, - income) -> censusData_long1

censusData_long1 %>%
  filter(variable %in% categorical_list$variable) %>%
  group_by(variable, value) %>%
  summarize(count = n(),
            pct_gt50 = round(mean(income == ">50K")*100,2)) %>% 
  mutate(pct_count = round(count/ nrow(censusData)*100,0)) -> summary1


summary1 %>%
  filter(variable!="native_country", variable!="race") %>% 
  ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) + 
  geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
  scale_y_continuous(breaks = seq(-10,80,10))+
  geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
  geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
  facet_wrap(.~variable, scales = "free_y", ncol = 2)+
  coord_flip() + 
  labs(title = "Categorical variables - Proportion of high income (>50K) in each demographic group - part 2",
       x = "",
       y = "Percentage of high income group" )+
  theme(legend.position = "none")

Graph explaination: The numbers in the graph represent the proportion of specific group in a demographic type (e.g. In the bottom left graph, 67% people in this survey are male.). The bar graphs show the proportion of high income group (“>50K”) in each demographic group. (e.g. Around 30% of male respondents had income more than 50K per year.). The vertical line is the proportion of high income group at the total level (~24%).

As can be seen from the graphs, those who are male, achieving at least Bachelors’ degree, in married relationships, having Management or Professional occupation, self-employed or working in government are more likely to have high income. #### {#page10} However, the size of Managerial and Professional occupation, self-employed or working-in-government groups are relatively small. Therefore, they might be less influential than other mentioned attributes (sex, marital status and education).

Some small groups with similar income likelihood might be grouped together to reduce noise and enhance generalizability. For example, some reclassified grouped might be education below High school graduate, Married-civ-spouse and Married-AF-spouse, Without-pay and Never-worked workclass.

summary1 %>%
  filter(variable=="native_country" | variable=="race") %>% 
  ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) + 
  geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
  scale_y_continuous(breaks = seq(-10,80,10))+
  geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
  geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
  facet_grid(variable~., scales = "free_y", space = "free")+
  coord_flip() + 
  labs(title = "Categorical variables - Proportion of \n high income (>50K) in each demographic group - part 1",
       x = "",
       y = "Percentage of high income group" ) +
  theme(legend.position = "none")

As discussed, the percentage of all native countries other than the U.S. is too small (~10%). In addition, they are equally distributed above and below the high-income likelihood benchmark. Therefore, this attribute might not be important for this classification task. Black, African American, American Indian or Alaskan Native have a much lower proportion of high income than their counterparts.

3.2. Continuous variables

Descriptive statistics for all numerical variables and boxplot to detect outliers:

lapply(seq(temp3), function(i) {
    y <- data.frame(temp3[[i]])
    y$variable = names(y)[1]
    names(y)[1] <- c("stats")
    return(y)
  }) %>% bind_rows() -> chart2

chart2%>% 
  spread(stats, distribution) %>% select(variable, Min., `1st Qu.`, Median, Mean, `3rd Qu.`, Max.) %>% kable_adj()

variable	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
age	17	28	37	38.58	48	90
capital_gain	0	0	0	1,078.00	0	99,999
capital_loss	0	0	0	87.30	0	4,356
fnlwgt	12,285	117,827	178,356	189,778.00	237,051	1,484,705
hours_per_week	1	40	40	40.44	45	99

censusData_long1 %>% 
  filter(variable %in% integer_list$variable) %>% 
  tbl_df() %>% 
  mutate(value = as.integer(value))-> censusData_int

chart2$stats <- ordered(chart2$stats, levels = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max."))
chart2 %>% 
  filter(stats %in% c("Min.", "Max.")) %>% 
  spread(stats, distribution) %>% 
  mutate(diff = Max. - Min.)-> t1

censusData_int %>% 
  ggplot(., aes(1,value))+
  geom_boxplot()+
  facet_wrap(~variable, scales = "free")+
  coord_flip()

In capital_gain and capital_loss, minimum, 1st, 2nd and 3rd quantiles are all equal to “0”, which reflects the fact that at least 75% of people in this data set has no capital gain and capital loss. The distributions of these variables would be extremely right-skewed.

#check whether there is any people having both capital_gain and capital_loss positive values
censusData %>% 
  group_by(capital_gain, capital_loss) %>% 
  summarize(count = n()) %>% 
  filter(capital_gain != 0 & capital_loss != 0)

#none of them

For fnlwgt, the weighting factor has a very wide range. The maximum value is 120 times greater than the minimum.

3.2.1. Capital (gain and loss)

capital was created by taking capital_gain minus capital_loss. There are 87% of respondents do not have any capital gain and loss. Due to this feature, the distribution of any values other than “0” is hardly identified in the following histogram.

#create capital column
censusData <- censusData %>% 
  mutate(capital = capital_gain - capital_loss)

#count no. of "0" value
censusData %>% 
  prop_summary(capital) %>% 
  filter(capital == 0) %>% 
  kable_adj()

capital	count	prop
0	28,330	87

capital_summary <- summary(censusData$capital) %>% 
  as.list() %>% 
  tbl_df() %>% 
  gather(stats, distribution, 1:6)

censusData %>% 
  ggplot(. , aes(capital))+
  geom_histogram(fill= "grey", alpha = 0.7)+
  geom_vline(data = capital_summary, aes(xintercept = distribution, col = stats), size=1, alpha = 0.5)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Capital gain and loss distribution",
       x="Capital",
       y="No. of records")

Therefore, I tried to remove “0” to improve the visibility of the histogram.

censusData %>% 
  filter(capital != 0) %>% 
  ggplot(. , aes(capital))+
  geom_histogram(fill= "grey", alpha = 0.7)+
  geom_vline(data = capital_summary, aes(xintercept = distribution, col = stats), size=1)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Capital gain and loss distribution - After remove 0 values",
       x="Capital",
       y="No. of records")

“99999” values seem to be suspicious as they are separated from the rest of the distribution (so-called outliers). According to IPUMS-CPS (n.d.), it might represent “Not-In-Universe” values rather than actual capital gain. Let temporarily remove it for clearer visualisation.

censusData %>% 
  filter(capital != 0 & capital != 99999) %>% 
  ggplot(. , aes(capital))+
  geom_histogram(aes(fill = income), binwidth = 500, alpha = 0.7)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Capital gain and loss distribution - After remove 0 and 99999 values",
       x="Capital",
       y="No. of records")+
  geom_vline(xintercept = c(-1750, 7250), linetype = "dashed", alpha = 0.5, size =0.5)+
  scale_x_continuous(breaks = seq(-10000, 40000, 10000))+
  theme(legend.position = "bottom")

### {#page14} The boundaries of capital to classified income group are approximately detected from this graph. It seems that people have the capital gain at least 7,250 or capital loss at least -1,750 USD are more likely to have a high-income.

censusData %>% 
  prop_summary(capital) %>% 
  arrange(desc(capital)) %>% 
  head(n = 10) %>% 
  kable_adj()

capital	count	prop
99,999	159	0.5
41,310	2	0.0
34,095	5	0.0
27,828	34	0.1
25,236	11	0.0
25,124	4	0.0
22,040	1	0.0
20,051	37	0.1
18,481	2	0.0
15,831	6	0.0

Caution: Looking a bit closer, the capital recorded values are not rounded numbers (e.g. 27,828 but not 28,000). Based on common sense, it is unlikely to get this precise amount of money based on questionnaire data. According to O’Hara (2004), capital gain came from the Census Bureau’s tax model and was discontinued since 2010 (IPUMS-CPS (n.d.)). Even though this variable might powerful in classifying the income groups, the concern is whether this information is still available for new data. If not, this feature would be useless in this classification model.

3.2.2. Age

From the 1st graph on the next page, the age distribution is right-skewed due to data extraction rule (age > 16). There is a minor spike in the number of respondents aged exactly 90. There might be possible that all ages above 90 or “Unknown” values were encoded as “90”. As mentioned in the Data quality section, it is strange that all “90-year-old people” still work at least 1 hour per week!

censusData_int %>% 
  filter(variable == "age") %>% 
  ggplot(. , aes(value))+
  geom_histogram(binwidth = 2, fill= "grey", alpha = 0.7)+
  # facet_wrap(~variable,scales = "free", ncol = 1)+
  geom_vline(data = chart2 %>% filter(variable == "age"), aes(xintercept = distribution, col = stats), size=1)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Age distribution",
       x="Age",
       y="No. of records")+
  scale_x_continuous(breaks=seq(0,100,10))+
  theme(legend.position = "right")

censusData %>% 
  group_by(income, age) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  group_by(age) %>% 
  mutate(prop = count / sum(count)) %>% 
  filter(income == ">50K", prop >= benchmark$high) -> mark

censusData_int %>% 
  filter(variable == "age") %>% 
  ggplot(., aes(value))+
  geom_freqpoly(aes(col= income), binwidth = 1)+
  geom_point(data = mark, aes(age, count))+
  geom_vline(xintercept = c(35,60), linetype = "dashed")+
  geom_vline(xintercept = c(22), linetype = "dashed", col = "red")+
  scale_x_continuous(breaks = round(seq(min(censusData$age), max(censusData$age), by = 10),-1))+
  scale_colour_hue()+
  theme(legend.position = "right")+
  labs(title = "Age distribution by income groups - Observation count",
       x ="Age",
       y = "Number of Observations")

censusData %>% 
  group_by(income, age) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  group_by(age) %>% 
  mutate(prop = count / sum(count)) %>%
  ggplot(., aes(age, prop, fill = income))+
  geom_area()+
  geom_point(data = mark, aes(age, prop))+
  geom_hline(yintercept = benchmark$high, linetype = "dashed")+
  geom_vline(xintercept = c(22), linetype = "dashed", col = "red")+
  geom_vline(xintercept = c(35,60), linetype = "dashed")+
  # geom_vline(xintercept = seq(17, 90, by = 10), col = "black")+
  scale_x_continuous(breaks = round(seq(min(censusData$age), max(censusData$age), by = 10),-1))+
  theme(legend.position = "right") +
  labs(title = "Age distribution by income groups - Proportion",
       x ="Age",
       y = "Proportion")

Note: All the horizontal lines (at 24%) are the percentage of high income group at aggregation level.

Based on the 3rd graph, it can be seen that people who aged from 35 to 60 are more likely to have a higher income than the average. Besides, there are only 5 people aged under 22 have an annual income greater than 50K, which accounts for 0.2% of people in this age group. In other word, it is very unlikely that someone age under 22 could earn more than 50,000 USD in 1994.

3.2.3. Working hours per week

47% of respondents work exactly 40 hours per week.

censusData_int %>% 
  filter(variable == "hours_per_week") %>% 
  ggplot(. , aes(value))+
  geom_histogram(binwidth = 2, fill= "grey", alpha = 0.7)+
  geom_vline(data = chart2 %>% filter(variable == "hours_per_week"), aes(xintercept = distribution, col = stats), size=1)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Working hours per week distribution",
       x="Hours per week",
       y="No. of records")+
  scale_x_continuous(breaks=seq(0,100,10))

censusData_int %>% 
  filter(variable == "hours_per_week") %>% 
  ggplot(., aes(value))+
  geom_freqpoly(aes(col= income), binwidth = 1)+
  scale_x_continuous(breaks = round(seq(min(censusData$hours_per_week), max(censusData$hours_per_week), by = 10),-1))+
  geom_vline(xintercept = 34, linetype = "dashed")+
  scale_colour_hue()+
  theme(legend.position = "right")+
  labs(title = "Hours per week distribution by income groups - Observation count",
       x ="Hours per week",
       y = "Number of Observations")

censusData %>% 
  group_by(income, hours_per_week) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  group_by(hours_per_week) %>% 
  mutate(prop = count / sum(count)) %>%
  ggplot(., aes(hours_per_week, prop, fill = income))+
  geom_area()+
  geom_hline(yintercept = benchmark$high, linetype = "dashed")+
  scale_x_continuous(breaks = round(seq(min(censusData$hours_per_week), max(censusData$hours_per_week), by = 10),-1))+
  geom_vline(xintercept = 34, linetype = "dashed")+
  theme(legend.position = "right") +
  labs(title = "Hours per week distribution by income groups - Proportion",
       x ="Hours per week",
       y = "Proportion")

The percentage of people aged under 35 have high income is typically low comparing to the older group.

censusData %>% 
  filter(workclass == "Never-worked") %>% 
  select(-capital_gain, - capital_loss, - fnlwgt, - capital, - relationship, - race, - sex, - native_country) %>%
  kable_adj()

age	workclass	education	marital_status	occupation	hours_per_week	income	id
18	Never-worked	10th	Never-married	?	40	<=50K	5,362
23	Never-worked	7th-8th	Divorced	?	35	<=50K	10,846
17	Never-worked	10th	Never-married	?	30	<=50K	14,773
18	Never-worked	11th	Never-married	?	10	<=50K	20,338
20	Never-worked	Some-college	Never-married	?	40	<=50K	23,233
30	Never-worked	HS-grad	Married-civ-spouse	?	40	<=50K	32,305
18	Never-worked	Some-college	Never-married	?	4	<=50K	32,315

Another quality concern: There are 7 cases that have working hours per week greater than 0 but classified as “Never-worked”.

3.2.4 Multivariate analysis

There are no correlation between three main continuous variables:

# pairs(censusData %>% select(age, hours_per_week, fnlwgt, capital), pch =21, cex=0.1)

panel.cor <- function(x, y, ...)
{
par(usr = c(0, 1, 0, 1))
txt <- as.character(format(cor(x, y), digits=2))
text(0.5, 0.5, txt, cex = 60* abs(cor(x, y)))
}
pairs(censusData %>% select(age, hours_per_week, capital), upper.panel=panel.cor)

#4. Data pre-processing

In this context, data pre-processing might include four main tasks: cleaning, reduction, discretization and transformation (Han and Pei (2012)). Most of them have been discussed in the previous sections. This part aims to give a concise summary of them:

4.1. Data cleaning

Duplicated and “?” missing values rates are quite small (0.1% and 7.4%). Their impact and some suggestions were discussed in detailed in Section 2.1 & 2.2.
A few abnormal values based on their consistency with other attributes were identified in Section 2.3 (e.g. between gender and relationship, age and education, age and working hours, working hours and workclass…).
Extreme values including “90” (age), “99” (hours_per_week), “99999” (capital_gain) need further investigation to identify whether it is true or missing values.
The most challenging concern is sample bias. In the data set, the genders are imbalanced and income proportion is significantly different from official public data.

In general, correcting point (1,2) errors might be not as important as point (3,4) as the error rates in (1,2) are quite small and identifiable. Meanwhile, concerns in point (3,4) require much more domain knowledge about government Census and their procedure. By clarifying these concerns, the weakness of the data set can be well defined and the feasible solutions might be formulated. For example, if we know that the ratio between male and female should be exactly 1:1, a new weighting or other statistical analysis can be applied to minimize the impact of sample imbalance.

4.2. Data reduction

This might include dimensionality and/or numerosity reduction or both.
Several simple ways that have been done to reduce data dimensions, including:

Defining education as “ordered factor” (ordinal data type) and drop redundant attribute education_num.
Replacing capital_gain and capital_loss by new attribute capital.
Considering filter out some weak, unreliable, correlated or associated variables such as native_country.
Grouping multiple related levels into one group as the example below. 16 levels of education might be reasonably classified into 3 groups by the high-income likelihood and the group sizes. (split by the red lines)

summary1 %>%
  filter(variable=="education") %>% 
  ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) + 
  geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
  scale_y_continuous(breaks = seq(-10,80,10))+
  geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
  geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
  geom_vline(xintercept = c(8.5, 12.5), alpha = 0.5, linetype = "dashed", col = "red")+
  coord_flip() + 
  labs(title = "Categorical variables - \n Proportion of high income (>50K) in education groups",
       x = "",
       y = "Percentage of high income group" )+
  theme(legend.position = "none")

Besides, there are other advanced data reduction techniques, that might be useful to try such as Principle Component Analysis for dimensionality reduction and clustering for numerosity reduction. Due to the limit of time, these techniques were not applied, yet would be useful to explore further.

4.3. Data discretization

The general idea is trying to map multiple values into intervals or labels.

In a similar approach, continuous variables (e.g. capital, age, hours_per_week) can also be discretized section 3.2.1. The purity within some sub-groups, which were constructed based on the mentioned rule, do increase significantly as shown below.

censusData <- censusData %>% 
  mutate(capital_group = case_when(capital< (-1750) ~ "[-inf, -1750)",
                                   capital>= (-1750) & capital<0 ~ "[-1750, 0)",
                                   capital== 0 ~ "0",
                                   capital>0 & capital<=7250 ~ "(0,7250]",
                                   capital>7250 ~ "(7250, +inf]"))
censusData$capital_group <- ordered(censusData$capital_group, levels = c("[-inf, -1750)","[-1750, 0)","0","(0,7250]","(7250, +inf]"))
censusData %>% 
  ggplot(., aes(capital_group))+
  geom_bar(aes(fill = income), position = "fill")+
  labs(y = "Percentage")+
  geom_hline(yintercept = 0.241, linetype = "dashed")

4.4. Data transformation

This step depends on what machine learning methods will be applied in the modelling phase. If the algorithms are distance-based (e.g. k-Nearest Neighbours), continuous variables (including age, hours_per_week and capital) should be normalized to the same scale (0,1) and categorical attributes need to transform into dummy binary variables to measure the distance. In this data set case, as categorical variables have many levels, it might be required to reduce the number of levels before transformation or using feature selection to reduce the number of dummy attributes.

5. Summary

Business and data understanding are strongly connected to each other and crucial for the success of data mining. Data exploratory analysis enables people to understand more about domain knowledge (Census methodology). As a result, gaining further knowledge about the area of expertise helps data analysis to discover interesting patterns with a reasonable explanation.

The key features of interest are:

Categorical variables: Those who are male, achieving at least Bachelors’ degree, in married relationships, having Management or Professional occupation, self-employed or working in government are more likely to have high income. Sex, marital status and education might be more important than occupation and workclass in this context.
Continuous variables: To some extent, capital, age and working hours per week support the discrimination of different income groups. Specifically, people in the U.S who have capital gain at least 7,250 or capital loss at least -1,750, aged 35-60 and work at least 35 hours per week have a higher probability of gaining more than 50,000 USD per year in 1994.

Besides minor data errors such as missing, duplicated or illogical values, the main on-going concern is about the representative of the data set and how one can improve it to achieve data mining goals.
Several data pre-processing are considered and hopefully help improve the prediction performance in terms of accuracy, stability, speed and interpretation.

6. Appendix

6.1. Frequency tables for all categorical variables with levels’ full description

temp2[[1]]$description <- c("Private",
                            "Self-employed-not incorporated",
                            "Local government",
                            "Unknown values",
                            "State government",
                            "Self-employed-incorporated",
                            "Federal government",
                            "Without pay",
                            "Never worked")

temp2[[2]]$description <- c("High school graduate - high school diploma or equivalent",
                            "Some college but no degree",
                            "Bachelor's degree (for example: BA,AB,BS)",
                            "Master's degree (for example: MA,MS,MENG,MED, MSW, MBA)",
                            "Associate degree in college - occupation/vocation program",
                            "11th grade",
                            "Associate degree in college - academic program",
                            "10th grade",
                            "7th and 8th grade",
                            "Professional school degree (for example: MD,DDS,DVM,LLB,JD)",
                            "9th grade",
                            "12th grade no diploma",
                            "Doctorate degree (for example: PHD,EDD)",
                            "5th or 6th grade",
                            "1st,2nd,3rd,or 4th grade",
                            "Preschool")

temp2[[3]]$description <- c("Married - civilian spouse present",
                           "Never married",
                           "Divorced",
                           "Separated",
                           "Widowed",
                           "Married - spouse absent (exc. separated)",
                           "Married - AF spouse present")
temp2[[6]]$description <- c("White",
                          "Black",
                          "Asian, Native Hawaiian or Other Pacific Islander",
                          "African American, American Indian or Alaskan Native",
                          "Other race")

map(temp2, kable_adj) %>% walk(print)

workclass	count	prop	description
Private	22,696	69.7	Private
Self-emp-not-inc	2,541	7.8	Self-employed-not incorporated
Local-gov	2,093	6.4	Local government
?	1,836	5.6	Unknown values
State-gov	1,298	4.0	State government
Self-emp-inc	1,116	3.4	Self-employed-incorporated
Federal-gov	960	2.9	Federal government
Without-pay	14	0.0	Without pay
Never-worked	7	0.0	Never worked

education	count	prop	description
HS-grad	10,501	32.3	High school graduate - high school diploma or equivalent
Some-college	7,291	22.4	Some college but no degree
Bachelors	5,355	16.4	Bachelor’s degree (for example: BA,AB,BS)
Masters	1,723	5.3	Master’s degree (for example: MA,MS,MENG,MED, MSW, MBA)
Assoc-voc	1,382	4.2	Associate degree in college - occupation/vocation program
11th	1,175	3.6	11th grade
Assoc-acdm	1,067	3.3	Associate degree in college - academic program
10th	933	2.9	10th grade
7th-8th	646	2.0	7th and 8th grade
Prof-school	576	1.8	Professional school degree (for example: MD,DDS,DVM,LLB,JD)
9th	514	1.6	9th grade
12th	433	1.3	12th grade no diploma
Doctorate	413	1.3	Doctorate degree (for example: PHD,EDD)
5th-6th	333	1.0	5th or 6th grade
1st-4th	168	0.5	1st,2nd,3rd,or 4th grade
Preschool	51	0.2	Preschool

marital_status	count	prop	description
Married-civ-spouse	14,976	46.0	Married - civilian spouse present
Never-married	10,683	32.8	Never married
Divorced	4,443	13.6	Divorced
Separated	1,025	3.1	Separated
Widowed	993	3.0	Widowed
Married-spouse-absent	418	1.3	Married - spouse absent (exc. separated)
Married-AF-spouse	23	0.1	Married - AF spouse present

occupation	count	prop
Prof-specialty	4,140	12.7
Craft-repair	4,099	12.6
Exec-managerial	4,066	12.5
Adm-clerical	3,770	11.6
Sales	3,650	11.2
Other-service	3,295	10.1
Machine-op-inspct	2,002	6.1
?	1,843	5.7
Transport-moving	1,597	4.9
Handlers-cleaners	1,370	4.2
Farming-fishing	994	3.1
Tech-support	928	2.9
Protective-serv	649	2.0
Priv-house-serv	149	0.5
Armed-Forces	9	0.0

relationship	count	prop
Husband	13,193	40.5
Not-in-family	8,305	25.5
Own-child	5,068	15.6
Unmarried	3,446	10.6
Wife	1,568	4.8
Other-relative	981	3.0

race	count	prop	description
White	27,816	85.4	White
Black	3,124	9.6	Black
Asian-Pac-Islander	1,039	3.2	Asian, Native Hawaiian or Other Pacific Islander
Amer-Indian-Eskimo	311	1.0	African American, American Indian or Alaskan Native
Other	271	0.8	Other race

sex	count	prop
Male	21,790	66.9
Female	10,771	33.1

native_country	count	prop
United-States	29,170	89.6
Mexico	643	2.0
?	583	1.8
Philippines	198	0.6
Germany	137	0.4
Canada	121	0.4
Puerto-Rico	114	0.4
El-Salvador	106	0.3
India	100	0.3
Cuba	95	0.3
England	90	0.3
Jamaica	81	0.2
South	80	0.2
China	75	0.2
Italy	73	0.2
Dominican-Republic	70	0.2
Vietnam	67	0.2
Guatemala	64	0.2
Japan	62	0.2
Poland	60	0.2
Columbia	59	0.2
Taiwan	51	0.2
Haiti	44	0.1
Iran	43	0.1
Portugal	37	0.1
Nicaragua	34	0.1
Peru	31	0.1
France	29	0.1
Greece	29	0.1
Ecuador	28	0.1
Ireland	24	0.1
Hong	20	0.1
Cambodia	19	0.1
Trinadad&Tobago	19	0.1
Laos	18	0.1
Thailand	18	0.1
Yugoslavia	16	0.0
Outlying-US(Guam-USVI-etc)	14	0.0
Honduras	13	0.0
Hungary	13	0.0
Scotland	12	0.0
Holand-Netherlands	1	0.0

6.2. Duplicated records

The only difference is the dummy id (based on row number) that I created.

censusData %>% 
  group_by_at(vars(-id)) %>% 
  summarize(dup_group = n()) %>% 
  filter(dup_group>1) %>% 
  select(-dup_group) %>% 
  inner_join(censusData, by = names(censusData)[-(length(censusData)-2)]) %>% 
  kable(format="latex", booktabs=TRUE) %>% 
  kable_styling(latex_options="scale_down", position = "float_left")

6.3. Summary statistics of continuous variables by target groups

colnames(censusData) <- str_replace_all(colnames(censusData), "_", ".")
censusData %>% 
  group_by(income) %>% 
  summarise_at(vars(age, fnlwgt, capital.gain, capital.loss, hours.per.week), funs(min, max, mean, median)) -> s1

s1 %>% 
  gather(variable_measure, value, 2:21) %>% 
  separate(variable_measure, c("variable", "measure"), sep = "_") %>% 
  spread(measure, value) %>% 
  arrange(variable) %>% 
  select(variable, income, min, median, mean, max) %>% 
  kable_adj()

variable	income	min	median	mean	max
age	<=50K	17	34	36.78	90
age	>50K	19	44	44.25	90
capital.gain	<=50K	0	0	148.75	41,310
capital.gain	>50K	0	0	4,006.14	99,999
capital.loss	<=50K	0	0	53.14	4,356
capital.loss	>50K	0	0	195.00	3,683
fnlwgt	<=50K	12,285	179,465	190,340.87	1,484,705
fnlwgt	>50K	14,878	176,101	188,005.00	1,226,583
hours.per.week	<=50K	1	40	38.84	99
hours.per.week	>50K	1	40	45.47	99

colnames(censusData) <- str_replace_all(colnames(censusData), "\\.", "_")

R codes

knitr::opts_chunk$set(warning = FALSE, message = FALSE, comment = NA, fig.align='center')
# knitr::opts_chunk$set(fig.width = 15, fig.height = 14)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
options(dplyr.width = Inf)
options(knitr.table.format = "html")
options(knitr.kable.NA = '')

prop_summary <- function(df, ...){
df %>%
  group_by(...) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  mutate(prop = round(count/sum(count)*100,1))
}

kable_adj <- function(data) {
  knitr::kable(data, booktabs = TRUE, digits = 2, format.args = list(big.mark = ",")) %>% 
    kableExtra::kable_styling(full_width = F, position = "left") %>% 
    row_spec(0, bold = T, color = "white", background = "#696969")
}
library("tidyverse")
library("lubridate")
library("knitr")
library("ggpubr")
library("kableExtra")
#import data
censusData <- read.csv("censusData.csv", stringsAsFactors = T, strip.white = TRUE) %>% tbl_df()
censusData_backup <- censusData

#rename to replace "." with "_"
colnames(censusData) <- str_replace_all(colnames(censusData), "\\.", "_")
variables_class <- as.data.frame(lapply(censusData, class)) %>% 
  .[1,] %>% 
  gather(variable, R_class , 1:ncol(.)) %>% 
  mutate(data_type = ifelse(R_class == "factor", "categorical", 
                            ifelse(variable == "education_num","categorical","continuous")))
description <- c("age of the individual",
"type of employer the individual has",
"number of people in the universe (U.S. population) that each individual in the survey can represent",
"highest level of education that individual achieved",
"highest level of education in a numerical form",
"marital status of the individual",
"occupation of the individual",
"primary family relationship (the relationship of the individual in his/her primary family)",
"the individual's ethnicity",
"gender",
"capital gains (from selling an asset such as a stock, bond for more than the original purchase price)",
"capital losses (from selling an asset such as a stock, bond for less than the original purchase price)",
"hours worked per week",
"country of origin",
"(target) whether or not the person makes more than $50,000 per annum income")

variables_class$description <- description
variables_class %>% 
  arrange(data_type, variable) %>% 
  kable_adj() %>% 
  row_spec(11:15, color = "white", background = "#696969") %>% 
  column_spec(4, width = "10cm")
#preview data structure and example values
glimpse(censusData)
#pre-processing data without printing out - will be discussed in the pre-processing section
censusData %>% 
  select(education_num, education) %>% 
  distinct() %>% 
  arrange(education_num) -> edu_order

censusData$education <- ordered(censusData$education, levels = edu_order$education)

censusData <- censusData %>% select(-education_num) %>% mutate(id = row_number())
censusData %>% 
  select(-id) %>% 
  summarize(no_of_observations = n(),
            no_of_distinct = n_distinct(.)) %>% 
  mutate(no_of_dup = no_of_observations - no_of_distinct) %>% 
  mutate(pct_dup = round(no_of_dup/no_of_observations*100,1))%>% kable_adj()
censusData %>% 
  group_by_at(vars(-id)) %>% 
  summarize(dup_group = n()) %>% 
  group_by(dup_group) %>% 
  summarize(count = n()) %>% kable_adj()
#check 'normal' missing values
# map_df(censusData, ~sum(is.na(.)))

#'check missing values labeled as "?"
total_missing <- map_df(censusData, ~str_detect(., pattern = "\\?")) %>% 
  rowSums() %>% 
  tbl_df() %>% 
  filter(value > 0) %>% 
  summarize(no_of_missing = n()) %>% 
  mutate(pct_missing = round(no_of_missing/ nrow(censusData)*100,1),
         variable = "all variables") %>% 
  select(variable, no_of_missing, pct_missing)


map_df(censusData, ~sum(str_detect(., pattern = "\\?"))) %>% 
  gather(variable, no_of_missing, 1:15) %>% 
  mutate(pct_missing = round(no_of_missing/ nrow(censusData)*100,1)) %>% 
  bind_rows(total_missing) %>% 
  arrange(desc(no_of_missing)) %>% 
  kable_adj()
income_s <- prop_summary(censusData,income)

income_c <- censusData %>% 
  filter_all(all_vars(.!= "?")) %>% 
  prop_summary(income) %>%
  mutate(dataset = "without missing values") %>% 
  bind_rows(income_s %>% mutate(dataset = "with missing values"))

ggplot(income_c, aes(dataset, prop))+
  geom_col(aes(fill=income))+
  labs(y="Percentage")+
  geom_text(aes(label=prop), hjust=2.5)+
  coord_flip()+
  theme(legend.position="bottom")

censusData %>% 
  group_by(age, education) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  ggplot(., aes(age, education))+
  geom_point(aes(size = count, color = (age <= 20 & education =="Masters")), pch = 19, alpha = 0.7)+
  scale_x_continuous(breaks = seq(0,100,10))+
  scale_colour_manual(values = setNames(c('red','grey'),c(T, F)))+
  theme(legend.position="bottom")+
  labs(y="")
censusData %>% 
  group_by(sex, relationship) %>% 
  summarize(count = n()) %>% 
  ggplot(., aes(relationship, sex))+
  geom_text(aes(label=count, color = (sex == "Male" & relationship == "Wife") | (sex == "Female" & relationship == "Husband")))+
  scale_colour_manual(values = setNames(c('red','black'),c(T, F)))+
  theme(legend.position="bottom")
censusData %>% 
  ggplot(., aes(age, hours_per_week))+
  geom_jitter(aes(col = (age == 90)),alpha = 0.2)+
  scale_x_continuous(breaks = seq(0,100,10))+
  scale_y_continuous(breaks = seq(0,100,10))+
  scale_colour_manual(values = setNames(c('red','black'),c(T, F)))+
  theme(legend.position="bottom")
ggplot(income_c %>% filter(dataset == "with missing values"), aes(dataset, prop))+
  geom_col(aes(fill=income))+
  labs(y="Percentage")+
  geom_text(aes(label=prop), hjust=2.5)+
  coord_flip()+
  labs(x ="")+
  scale_x_discrete(labels=c("with missing values" = ""))+
  theme_minimal()+
  theme(legend.position="right")
variables_class$unique_value <- sapply(censusData_backup, n_distinct)
variables_class$levels <- c(NA,
                            "?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay",
                            NA,
                            "Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-grad, Some-college, Assoc-voc, Assoc-acdm, Bachelors, Masters, Prof-school, Doctorate",
                            "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16",
                            "Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed",
                            "?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv,     Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving",
                            "Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife",
                            "Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White",
                            "Female, Male",
                            NA,
                            NA,
                            NA,
                            "?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia",
                            "<=50K, >50K")


variables_class %>% 
  select(-description, - R_class) %>% 
  filter(data_type == "categorical") %>% 
  arrange(unique_value, variable) %>% 
  kable_adj() %>% 
  column_spec(4, width = "10cm")
#summary table
y <- censusData %>% summary(maxsum = 42) %>% as.data.frame.matrix(row.names = F) 


#generate counting numbers and empty list
a<-1
b<-1
temp2 <- list()
temp3 <- list()

categorical_list <- variables_class %>% filter(R_class == "factor", variable !="income")
integer_list <- variables_class %>% filter(R_class == "integer", !variable  %in% c("id", "education_num"))

#splitting into categorical and continuous groups
for (i in seq_along(y)){
  y %>% 
    select(i) %>% 
    na.omit() -> temp
  if (str_trim(colnames(temp)[1]) %in% categorical_list[,1] ){
    temp %>% 
      separate(1,c(str_trim(colnames(temp)[1]), "count"), sep = ":") %>% 
      mutate_all(~str_trim(.)) %>% 
      mutate(count = as.numeric(count), prop = round(count / sum(count) * 100,1)) %>% 
      arrange(desc(count))-> temp2[[a]]
    a <- a + 1
  }
  if (str_trim(colnames(temp)[1]) %in% integer_list[,1] ){
    temp %>% 
      separate(1,c(str_trim(colnames(temp)[1]), "distribution"), sep = ":") %>% 
      mutate_all(~str_trim(.)) %>% 
      mutate(distribution = as.numeric(distribution)) -> temp3[[b]]
    b <- b + 1
  }
}
censusData %>%
  gather(variable, value, - id) %>% 
  mutate(value = as.integer(value))-> censusData_long

lapply(seq(temp2), function(i) {
    y <- data.frame(temp2[[i]])
    y$variable = names(y)[1]
    names(y)[1] <- c("levels")
    return(y)
  }) %>% bind_rows() -> chart1

ggplot(chart1 %>% filter(variable %in% c("native_country", "race")), aes(reorder(levels,prop), prop))+
  geom_col(fill= "#56B4E9", alpha = 0.7)+
  facet_grid(variable~., space = "free", scales = "free")+
  coord_flip()+
  labs(title = "Categorical variables - Frequency bar chart - part 1",
       x = "",
       y = "Percentage" )+
  geom_text(aes(label = round(prop,0)), hjust = -1)+
  scale_y_continuous(breaks = seq(0,100,10))
censusData %>%
  gather(variable, value, - id) %>% 
  mutate(value = as.integer(value))-> censusData_long

lapply(seq(temp2), function(i) {
    y <- data.frame(temp2[[i]])
    y$variable = names(y)[1]
    names(y)[1] <- c("levels")
    return(y)
  }) %>% bind_rows() %>% mutate(levels = factor(levels))-> chart1

ggplot(chart1 %>% filter(!variable %in% c("native_country", "race")), aes(reorder(levels,prop), prop))+
  geom_col(fill= "#56B4E9", alpha = 0.7)+
  facet_wrap(~variable, ncol =2, scales = "free_y")+
  coord_flip()+
  labs(title = "Categorical variables - Frequency bar chart - part 2",
       x = "",
       y = "Percentage" )+
  geom_text(aes(label = round(prop,0)), hjust = -0.5)+
  scale_y_continuous(breaks = seq(0,90,10), limits = c(0,80))

censusData %>%
  summarize(high = mean(income ==">50K")) -> benchmark

censusData %>%
  gather(variable, value, - id, - income) -> censusData_long1
censusData_long1 %>%
  filter(variable %in% categorical_list$variable) %>%
  group_by(variable, value) %>%
  summarize(count = n(),
            pct_gt50 = round(mean(income == ">50K")*100,2)) %>% 
  mutate(pct_count = round(count/ nrow(censusData)*100,0)) -> summary1


summary1 %>%
  filter(variable!="native_country", variable!="race") %>% 
  ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) + 
  geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
  scale_y_continuous(breaks = seq(-10,80,10))+
  geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
  geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
  facet_wrap(.~variable, scales = "free_y", ncol = 2)+
  coord_flip() + 
  labs(title = "Categorical variables - Proportion of high income (>50K) in each demographic group - part 2",
       x = "",
       y = "Percentage of high income group" )+
  theme(legend.position = "none")
summary1 %>%
  filter(variable=="native_country" | variable=="race") %>% 
  ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) + 
  geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
  scale_y_continuous(breaks = seq(-10,80,10))+
  geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
  geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
  facet_grid(variable~., scales = "free_y", space = "free")+
  coord_flip() + 
  labs(title = "Categorical variables - Proportion of \n high income (>50K) in each demographic group - part 1",
       x = "",
       y = "Percentage of high income group" ) +
  theme(legend.position = "none")
lapply(seq(temp3), function(i) {
    y <- data.frame(temp3[[i]])
    y$variable = names(y)[1]
    names(y)[1] <- c("stats")
    return(y)
  }) %>% bind_rows() -> chart2

chart2%>% 
  spread(stats, distribution) %>% select(variable, Min., `1st Qu.`, Median, Mean, `3rd Qu.`, Max.) %>% kable_adj()
censusData_long1 %>% 
  filter(variable %in% integer_list$variable) %>% 
  tbl_df() %>% 
  mutate(value = as.integer(value))-> censusData_int

chart2$stats <- ordered(chart2$stats, levels = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max."))
chart2 %>% 
  filter(stats %in% c("Min.", "Max.")) %>% 
  spread(stats, distribution) %>% 
  mutate(diff = Max. - Min.)-> t1
censusData_int %>% 
  ggplot(., aes(1,value))+
  geom_boxplot()+
  facet_wrap(~variable, scales = "free")+
  coord_flip()
#check whether there is any people having both capital_gain and capital_loss positive values
censusData %>% 
  group_by(capital_gain, capital_loss) %>% 
  summarize(count = n()) %>% 
  filter(capital_gain != 0 & capital_loss != 0)

#none of them
#create capital column
censusData <- censusData %>% 
  mutate(capital = capital_gain - capital_loss)
#count no. of "0" value
censusData %>% 
  prop_summary(capital) %>% 
  filter(capital == 0) %>% 
  kable_adj()
capital_summary <- summary(censusData$capital) %>% 
  as.list() %>% 
  tbl_df() %>% 
  gather(stats, distribution, 1:6)
censusData %>% 
  ggplot(. , aes(capital))+
  geom_histogram(fill= "grey", alpha = 0.7)+
  geom_vline(data = capital_summary, aes(xintercept = distribution, col = stats), size=1, alpha = 0.5)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Capital gain and loss distribution",
       x="Capital",
       y="No. of records")
censusData %>% 
  filter(capital != 0) %>% 
  ggplot(. , aes(capital))+
  geom_histogram(fill= "grey", alpha = 0.7)+
  geom_vline(data = capital_summary, aes(xintercept = distribution, col = stats), size=1)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Capital gain and loss distribution - After remove 0 values",
       x="Capital",
       y="No. of records")
censusData %>% 
  filter(capital != 0 & capital != 99999) %>% 
  ggplot(. , aes(capital))+
  geom_histogram(aes(fill = income), binwidth = 500, alpha = 0.7)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Capital gain and loss distribution - After remove 0 and 99999 values",
       x="Capital",
       y="No. of records")+
  geom_vline(xintercept = c(-1750, 7250), linetype = "dashed", alpha = 0.5, size =0.5)+
  scale_x_continuous(breaks = seq(-10000, 40000, 10000))+
  theme(legend.position = "bottom")
censusData %>% 
  prop_summary(capital) %>% 
  arrange(desc(capital)) %>% 
  head(n = 10) %>% 
  kable_adj()
censusData_int %>% 
  filter(variable == "age") %>% 
  ggplot(. , aes(value))+
  geom_histogram(binwidth = 2, fill= "grey", alpha = 0.7)+
  # facet_wrap(~variable,scales = "free", ncol = 1)+
  geom_vline(data = chart2 %>% filter(variable == "age"), aes(xintercept = distribution, col = stats), size=1)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Age distribution",
       x="Age",
       y="No. of records")+
  scale_x_continuous(breaks=seq(0,100,10))+
  theme(legend.position = "right")

censusData %>% 
  group_by(income, age) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  group_by(age) %>% 
  mutate(prop = count / sum(count)) %>% 
  filter(income == ">50K", prop >= benchmark$high) -> mark

censusData_int %>% 
  filter(variable == "age") %>% 
  ggplot(., aes(value))+
  geom_freqpoly(aes(col= income), binwidth = 1)+
  geom_point(data = mark, aes(age, count))+
  geom_vline(xintercept = c(35,60), linetype = "dashed")+
  geom_vline(xintercept = c(22), linetype = "dashed", col = "red")+
  scale_x_continuous(breaks = round(seq(min(censusData$age), max(censusData$age), by = 10),-1))+
  scale_colour_hue()+
  theme(legend.position = "right")+
  labs(title = "Age distribution by income groups - Observation count",
       x ="Age",
       y = "Number of Observations")

censusData %>% 
  group_by(income, age) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  group_by(age) %>% 
  mutate(prop = count / sum(count)) %>%
  ggplot(., aes(age, prop, fill = income))+
  geom_area()+
  geom_point(data = mark, aes(age, prop))+
  geom_hline(yintercept = benchmark$high, linetype = "dashed")+
  geom_vline(xintercept = c(22), linetype = "dashed", col = "red")+
  geom_vline(xintercept = c(35,60), linetype = "dashed")+
  # geom_vline(xintercept = seq(17, 90, by = 10), col = "black")+
  scale_x_continuous(breaks = round(seq(min(censusData$age), max(censusData$age), by = 10),-1))+
  theme(legend.position = "right") +
  labs(title = "Age distribution by income groups - Proportion",
       x ="Age",
       y = "Proportion")
censusData_int %>% 
  filter(variable == "hours_per_week") %>% 
  ggplot(. , aes(value))+
  geom_histogram(binwidth = 2, fill= "grey", alpha = 0.7)+
  geom_vline(data = chart2 %>% filter(variable == "hours_per_week"), aes(xintercept = distribution, col = stats), size=1)+
  scale_colour_hue()+
  theme(legend.position="right")+
  labs(title = "Working hours per week distribution",
       x="Hours per week",
       y="No. of records")+
  scale_x_continuous(breaks=seq(0,100,10))

censusData_int %>% 
  filter(variable == "hours_per_week") %>% 
  ggplot(., aes(value))+
  geom_freqpoly(aes(col= income), binwidth = 1)+
  scale_x_continuous(breaks = round(seq(min(censusData$hours_per_week), max(censusData$hours_per_week), by = 10),-1))+
  geom_vline(xintercept = 34, linetype = "dashed")+
  scale_colour_hue()+
  theme(legend.position = "right")+
  labs(title = "Hours per week distribution by income groups - Observation count",
       x ="Hours per week",
       y = "Number of Observations")

censusData %>% 
  group_by(income, hours_per_week) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  group_by(hours_per_week) %>% 
  mutate(prop = count / sum(count)) %>%
  ggplot(., aes(hours_per_week, prop, fill = income))+
  geom_area()+
  geom_hline(yintercept = benchmark$high, linetype = "dashed")+
  scale_x_continuous(breaks = round(seq(min(censusData$hours_per_week), max(censusData$hours_per_week), by = 10),-1))+
  geom_vline(xintercept = 34, linetype = "dashed")+
  theme(legend.position = "right") +
  labs(title = "Hours per week distribution by income groups - Proportion",
       x ="Hours per week",
       y = "Proportion")
censusData %>% 
  filter(workclass == "Never-worked") %>% 
  select(-capital_gain, - capital_loss, - fnlwgt, - capital, - relationship, - race, - sex, - native_country) %>%
  kable_adj()
# pairs(censusData %>% select(age, hours_per_week, fnlwgt, capital), pch =21, cex=0.1)

panel.cor <- function(x, y, ...)
{
par(usr = c(0, 1, 0, 1))
txt <- as.character(format(cor(x, y), digits=2))
text(0.5, 0.5, txt, cex = 60* abs(cor(x, y)))
}
pairs(censusData %>% select(age, hours_per_week, capital), upper.panel=panel.cor)
summary1 %>%
  filter(variable=="education") %>% 
  ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) + 
  geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
  scale_y_continuous(breaks = seq(-10,80,10))+
  geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
  geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
  geom_vline(xintercept = c(8.5, 12.5), alpha = 0.5, linetype = "dashed", col = "red")+
  coord_flip() + 
  labs(title = "Categorical variables - \n Proportion of high income (>50K) in education groups",
       x = "",
       y = "Percentage of high income group" )+
  theme(legend.position = "none")
censusData <- censusData %>% 
  mutate(capital_group = case_when(capital< (-1750) ~ "[-inf, -1750)",
                                   capital>= (-1750) & capital<0 ~ "[-1750, 0)",
                                   capital== 0 ~ "0",
                                   capital>0 & capital<=7250 ~ "(0,7250]",
                                   capital>7250 ~ "(7250, +inf]"))
censusData$capital_group <- ordered(censusData$capital_group, levels = c("[-inf, -1750)","[-1750, 0)","0","(0,7250]","(7250, +inf]"))
censusData %>% 
  ggplot(., aes(capital_group))+
  geom_bar(aes(fill = income), position = "fill")+
  labs(y = "Percentage")+
  geom_hline(yintercept = 0.241, linetype = "dashed")
temp2[[1]]$description <- c("Private",
                            "Self-employed-not incorporated",
                            "Local government",
                            "Unknown values",
                            "State government",
                            "Self-employed-incorporated",
                            "Federal government",
                            "Without pay",
                            "Never worked")

temp2[[2]]$description <- c("High school graduate - high school diploma or equivalent",
                            "Some college but no degree",
                            "Bachelor's degree (for example: BA,AB,BS)",
                            "Master's degree (for example: MA,MS,MENG,MED, MSW, MBA)",
                            "Associate degree in college - occupation/vocation program",
                            "11th grade",
                            "Associate degree in college - academic program",
                            "10th grade",
                            "7th and 8th grade",
                            "Professional school degree (for example: MD,DDS,DVM,LLB,JD)",
                            "9th grade",
                            "12th grade no diploma",
                            "Doctorate degree (for example: PHD,EDD)",
                            "5th or 6th grade",
                            "1st,2nd,3rd,or 4th grade",
                            "Preschool")

temp2[[3]]$description <- c("Married - civilian spouse present",
                           "Never married",
                           "Divorced",
                           "Separated",
                           "Widowed",
                           "Married - spouse absent (exc. separated)",
                           "Married - AF spouse present")
temp2[[6]]$description <- c("White",
                          "Black",
                          "Asian, Native Hawaiian or Other Pacific Islander",
                          "African American, American Indian or Alaskan Native",
                          "Other race")

map(temp2, kable_adj) %>% walk(print)
censusData %>% 
  group_by_at(vars(-id)) %>% 
  summarize(dup_group = n()) %>% 
  filter(dup_group>1) %>% 
  select(-dup_group) %>% 
  inner_join(censusData, by = names(censusData)[-(length(censusData)-2)]) %>% 
  kable(format="latex", booktabs=TRUE) %>% 
  kable_styling(latex_options="scale_down", position = "float_left")
colnames(censusData) <- str_replace_all(colnames(censusData), "_", ".")
censusData %>% 
  group_by(income) %>% 
  summarise_at(vars(age, fnlwgt, capital.gain, capital.loss, hours.per.week), funs(min, max, mean, median)) -> s1

s1 %>% 
  gather(variable_measure, value, 2:21) %>% 
  separate(variable_measure, c("variable", "measure"), sep = "_") %>% 
  spread(measure, value) %>% 
  arrange(variable) %>% 
  select(variable, income, min, median, mean, max) %>% 
  kable_adj()

colnames(censusData) <- str_replace_all(colnames(censusData), "\\.", "_")

References

Becker, Barry, and Ron Kohavi. 1996. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. https://archive.ics.uci.edu/ml/machine-learning-databases/adult/.

Census, United States. Bureau of the. 1995. “Current Population Survey: Annual Demographic File, 1994.” Inter-university Consortium for Political; Social Research [distributor]. https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/6461/version/1.

Ethnic, and Hispanic Statistics Branch. n.d. Population Division, U.S. Bureau of the Census Current Population Survey. https://www2.census.gov/programs-surveys/demo/tables/foreign-born/1995/cps1995/tab0108.pdf.

Han, Jiawei., and Jian. Pei. 2012. “Data Mining: Concepts and Techniques.” London;Amsterdam; : Morgan Kaufmann. http://capitadiscovery.co.uk/brighton-ac/items/1431592.

IPUMS-CPS. n.d. University of Minnesota. https://cps.ipums.org/cps-action/variables/CAPGAIN#codes_section.

O’Hara, Amy. 2004. “New Methods for Simulating Cps Taxes.” Mimeo, U.S. Census. https://www.census.gov/content/dam/Census/library/working-papers/2004/demo/oharataxmodel.pdf.

Data Mining and Knowledge Discovery in Data - Assignment One - Final Submission

Xuan Pham

3/27/2019