library("tidyverse")
library("lubridate")
library("knitr")
library("ggpubr")
library("kableExtra")
The main objective is to analyze the characteristics and quality of Census data. Then, several pre-processing considerations are proposed to enhance the performance of income classification prediction.
The chosen data set contains key characteristics of the United States’ population in 1994. It was extracted and donor by (1996). A basic description can be found at UCI Machine Learning Repository. However, because the original source link is not valid, it is challenging to gain a better understanding of data collection methodology and related features. However, based on my desk research, it seems that the data set comes from the Current Population Survey (CPS), which was conducted by the United States Bureau of the Census.
Data collection method: A nationwide sample of 57,000 housing units in 1,973 counties were personally interviewed (Census 1995).
Data extraction conditions: According to UCI, data was extracted using the following rules: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)), which means individuals with:
The extracted data set has 48,842 records and was randomly split into training and test set at ratio 2/3 and 1/3 respectively.
In this report, only the training data set, which includes 32,561 records and 15 attributes, will be analysed. Each record contains data of one respondent and each column describes one feature of that person. 10 variables are categorical and 5 variables are continuous as shown below.
#import data
censusData <- read.csv("censusData.csv", stringsAsFactors = T, strip.white = TRUE) %>% tbl_df()
censusData_backup <- censusData
#rename to replace "." with "_"
colnames(censusData) <- str_replace_all(colnames(censusData), "\\.", "_")
variables_class <- as.data.frame(lapply(censusData, class)) %>%
.[1,] %>%
gather(variable, R_class , 1:ncol(.)) %>%
mutate(data_type = ifelse(R_class == "factor", "categorical",
ifelse(variable == "education_num","categorical","continuous")))
variables_class %>%
arrange(data_type, variable) %>%
kable_adj() %>%
row_spec(11:15, color = "white", background = "#696969") %>%
column_spec(4, width = "10cm")
| variable | R_class | data_type | description |
|---|---|---|---|
| education | factor | categorical | highest level of education that individual achieved |
| education_num | integer | categorical | highest level of education in a numerical form |
| income | factor | categorical | (target) whether or not the person makes more than $50,000 per annum income |
| marital_status | factor | categorical | marital status of the individual |
| native_country | factor | categorical | country of origin |
| occupation | factor | categorical | occupation of the individual |
| race | factor | categorical | the individual’s ethnicity |
| relationship | factor | categorical | primary family relationship (the relationship of the individual in his/her primary family) |
| sex | factor | categorical | gender |
| workclass | factor | categorical | type of employer the individual has |
| age | integer | continuous | age of the individual |
| capital_gain | integer | continuous | capital gains (from selling an asset such as a stock, bond for more than the original purchase price) |
| capital_loss | integer | continuous | capital losses (from selling an asset such as a stock, bond for less than the original purchase price) |
| fnlwgt | integer | continuous | number of people in the universe (U.S. population) that each individual in the survey can represent |
| hours_per_week | integer | continuous | hours worked per week |
Data structure and example values:
#preview data structure and example values
glimpse(censusData)
Observations: 32,561
Variables: 15
$ age <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30,...
$ workclass <fct> State-gov, Self-emp-not-inc, Private, Private, ...
$ fnlwgt <int> 77516, 83311, 215646, 234721, 338409, 284582, 1...
$ education <fct> Bachelors, Bachelors, HS-grad, 11th, Bachelors,...
$ education_num <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13,...
$ marital_status <fct> Never-married, Married-civ-spouse, Divorced, Ma...
$ occupation <fct> Adm-clerical, Exec-managerial, Handlers-cleaner...
$ relationship <fct> Not-in-family, Husband, Not-in-family, Husband,...
$ race <fct> White, White, White, Black, Black, White, Black...
$ sex <fct> Male, Male, Male, Male, Female, Female, Female,...
$ capital_gain <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0...
$ capital_loss <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ hours_per_week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40,...
$ native_country <fct> United-States, United-States, United-States, Un...
$ income <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K...
There are some interesting features from these examples that can be used as general guidance for pre-processing or further exploration:
#pre-processing data without printing out - will be discussed in the pre-processing section
censusData %>%
select(education_num, education) %>%
distinct() %>%
arrange(education_num) -> edu_order
censusData$education <- ordered(censusData$education, levels = edu_order$education)
censusData <- censusData %>% select(-education_num) %>% mutate(id = row_number())
There are 24 duplicated rows (identical in all attributes). 22 of them are repeated 2 times and one is repeated 3 times. A full list of duplicated values is included in Appendix (6.2).
censusData %>%
select(-id) %>%
summarize(no_of_observations = n(),
no_of_distinct = n_distinct(.)) %>%
mutate(no_of_dup = no_of_observations - no_of_distinct) %>%
mutate(pct_dup = round(no_of_dup/no_of_observations*100,1))%>% kable_adj()
| no_of_observations | no_of_distinct | no_of_dup | pct_dup |
|---|---|---|---|
| 32,561 | 32,537 | 24 | 0.1 |
censusData %>%
group_by_at(vars(-id)) %>%
summarize(dup_group = n()) %>%
group_by(dup_group) %>%
summarize(count = n()) %>% kable_adj()
| dup_group | count |
|---|---|
| 1 | 32,514 |
| 2 | 22 |
| 3 | 1 |
According to the description on UCI, “Unknown” values are labelled as “?”. In total, there are 2,399 incomplete records (~7.4%). Specifically, there are missing values in occupation, workclass and native_country, which accounts for 5.7%, 5.6% and 1.8% of the total records respectively.
#check 'normal' missing values
# map_df(censusData, ~sum(is.na(.)))
#'check missing values labeled as "?"
total_missing <- map_df(censusData, ~str_detect(., pattern = "\\?")) %>%
rowSums() %>%
tbl_df() %>%
filter(value > 0) %>%
summarize(no_of_missing = n()) %>%
mutate(pct_missing = round(no_of_missing/ nrow(censusData)*100,1),
variable = "all variables") %>%
select(variable, no_of_missing, pct_missing)
map_df(censusData, ~sum(str_detect(., pattern = "\\?"))) %>%
gather(variable, no_of_missing, 1:15) %>%
mutate(pct_missing = round(no_of_missing/ nrow(censusData)*100,1)) %>%
bind_rows(total_missing) %>%
arrange(desc(no_of_missing)) %>%
kable_adj()
| variable | no_of_missing | pct_missing |
|---|---|---|
| all variables | 2,399 | 7.4 |
| occupation | 1,843 | 5.7 |
| workclass | 1,836 | 5.6 |
| native_country | 583 | 1.8 |
| age | 0 | 0.0 |
| fnlwgt | 0 | 0.0 |
| education | 0 | 0.0 |
| marital_status | 0 | 0.0 |
| relationship | 0 | 0.0 |
| race | 0 | 0.0 |
| sex | 0 | 0.0 |
| capital_gain | 0 | 0.0 |
| capital_loss | 0 | 0.0 |
| hours_per_week | 0 | 0.0 |
| income | 0 | 0.0 |
| id | 0 | 0.0 |
However, at the top-line level, the missing values seem to have a minimal impact on the income group contribution. The odd between <=50K & >50K income group with or without missing values are quite similar to each other as shown in the graph below. In addition, some algorithms (e.g. C5.0, CR&T) have their own techniques (e.g. fractional cases, surrogate splits) to handle missing values. Therefore, I kept using the original data set for the following analysis. However, we might need to justify the impact of missing values on the model performance later one. If the performance is poor, we might consider removing incomplete records (or attributes) or imputing missing values. As the missing values are categorical, they might be defined according to the most frequent values or using classification methods to predict the missing values based on other attributes (excluding target variable income). However, these methods can not guarantee the accuracy of imputed values and might introduce bias. Furthermore, the attributes with missing values have many levels (14 occupations and 41 native countries), which makes missing values imputation even more challenging.
income_s <- prop_summary(censusData,income)
income_c <- censusData %>%
filter_all(all_vars(.!= "?")) %>%
prop_summary(income) %>%
mutate(dataset = "without missing values") %>%
bind_rows(income_s %>% mutate(dataset = "with missing values"))
ggplot(income_c, aes(dataset, prop))+
geom_col(aes(fill=income))+
labs(y="Percentage")+
geom_text(aes(label=prop), hjust=2.5)+
coord_flip()+
theme(legend.position="bottom")
Based on the relationship between different attributes, the majority of observations seem to be valid, except very few abnormal values as following:
censusData %>%
group_by(age, education) %>%
summarize(count = n()) %>%
ungroup() %>%
ggplot(., aes(age, education))+
geom_point(aes(size = count, color = (age <= 20 & education =="Masters")), pch = 19, alpha = 0.7)+
scale_x_continuous(breaks = seq(0,100,10))+
scale_colour_manual(values = setNames(c('red','grey'),c(T, F)))+
theme(legend.position="bottom")+
labs(y="")
censusData %>%
group_by(sex, relationship) %>%
summarize(count = n()) %>%
ggplot(., aes(relationship, sex))+
geom_text(aes(label=count, color = (sex == "Male" & relationship == "Wife") | (sex == "Female" & relationship == "Husband")))+
scale_colour_manual(values = setNames(c('red','black'),c(T, F)))+
theme(legend.position="bottom")
censusData %>%
ggplot(., aes(age, hours_per_week))+
geom_jitter(aes(col = (age == 90)),alpha = 0.2)+
scale_x_continuous(breaks = seq(0,100,10))+
scale_y_continuous(breaks = seq(0,100,10))+
scale_colour_manual(values = setNames(c('red','black'),c(T, F)))+
theme(legend.position="bottom")
Target variable income has two levels “<=50K” and “>50K” with the odds ratio approximately 3:1. The labelled classes are not extremely imbalanced and seem to be equally important in this task. Therefore, this ratio might not have a severe impact on model performance.
ggplot(income_c %>% filter(dataset == "with missing values"), aes(dataset, prop))+
geom_col(aes(fill=income))+
labs(y="Percentage")+
geom_text(aes(label=prop), hjust=2.5)+
coord_flip()+
labs(x ="")+
scale_x_discrete(labels=c("with missing values" = ""))+
theme_minimal()+
theme(legend.position="right")
Note: From this on, for convenience, I am going to mention “<=50K” and “>50K” as low and high income group respectively.
However, there is a big gap between the proportion of high income in this sample (~24%) and the results publicized by U.S. Bureau of the Census (n.d.) (~10%). This re-enforces my concern about the data representativeness.
variables_class$unique_value <- sapply(censusData_backup, n_distinct)
variables_class$levels <- c(NA,
"?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay",
NA,
"Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-grad, Some-college, Assoc-voc, Assoc-acdm, Bachelors, Masters, Prof-school, Doctorate",
"1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16",
"Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed",
"?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving",
"Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife",
"Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White",
"Female, Male",
NA,
NA,
NA,
"?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia",
"<=50K, >50K")
variables_class %>%
select(-description, - R_class) %>%
filter(data_type == "categorical") %>%
arrange(unique_value, variable) %>%
kable_adj() %>%
column_spec(4, width = "10cm")
| variable | data_type | unique_value | levels |
|---|---|---|---|
| income | categorical | 2 | <=50K, >50K |
| sex | categorical | 2 | Female, Male |
| race | categorical | 5 | Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White |
| relationship | categorical | 6 | Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife |
| marital_status | categorical | 7 | Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed |
| workclass | categorical | 9 | ?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay |
| occupation | categorical | 15 | ?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving |
| education | categorical | 16 | Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-grad, Some-college, Assoc-voc, Assoc-acdm, Bachelors, Masters, Prof-school, Doctorate |
| education_num | categorical | 16 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 |
| native_country | categorical | 42 | ?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia |
Some categorical variables have quite many levels, esp. native_country, education with 42 and 16 distinct values respectively.
censusData %>%
gather(variable, value, - id) %>%
mutate(value = as.integer(value))-> censusData_long
lapply(seq(temp2), function(i) {
y <- data.frame(temp2[[i]])
y$variable = names(y)[1]
names(y)[1] <- c("levels")
return(y)
}) %>% bind_rows() -> chart1
ggplot(chart1 %>% filter(variable %in% c("native_country", "race")), aes(reorder(levels,prop), prop))+
geom_col(fill= "#56B4E9", alpha = 0.7)+
facet_grid(variable~., space = "free", scales = "free")+
coord_flip()+
labs(title = "Categorical variables - Frequency bar chart - part 1",
x = "",
y = "Percentage" )+
geom_text(aes(label = round(prop,0)), hjust = -1)+
scale_y_continuous(breaks = seq(0,100,10))
The U.S. accounts for 90% in native_country attribute while each of other 40 countries account for less than 1% except Mexico (2%). Similarly, “White” is dominant in race feature with 85%. However, race has fewer levels than native_country and might be more effective in this task.
###{#gender}
censusData %>%
gather(variable, value, - id) %>%
mutate(value = as.integer(value))-> censusData_long
lapply(seq(temp2), function(i) {
y <- data.frame(temp2[[i]])
y$variable = names(y)[1]
names(y)[1] <- c("levels")
return(y)
}) %>% bind_rows() %>% mutate(levels = factor(levels))-> chart1
ggplot(chart1 %>% filter(!variable %in% c("native_country", "race")), aes(reorder(levels,prop), prop))+
geom_col(fill= "#56B4E9", alpha = 0.7)+
facet_wrap(~variable, ncol =2, scales = "free_y")+
coord_flip()+
labs(title = "Categorical variables - Frequency bar chart - part 2",
x = "",
y = "Percentage" )+
geom_text(aes(label = round(prop,0)), hjust = -0.5)+
scale_y_continuous(breaks = seq(0,90,10), limits = c(0,80))
High-school graduate, Some college and Bachelors are the most common education levels, which accounted for 70% of survey responses.
Sample seems to be equally distributed in terms of occupation, yet quite imbalance in terms of sex and workclass. The high proportion of Private workclass might reflect the real world while the gap between genders might result from sample design methodology or data extraction rules.
censusData %>%
summarize(high = mean(income ==">50K")) -> benchmark
censusData %>%
gather(variable, value, - id, - income) -> censusData_long1
censusData_long1 %>%
filter(variable %in% categorical_list$variable) %>%
group_by(variable, value) %>%
summarize(count = n(),
pct_gt50 = round(mean(income == ">50K")*100,2)) %>%
mutate(pct_count = round(count/ nrow(censusData)*100,0)) -> summary1
summary1 %>%
filter(variable!="native_country", variable!="race") %>%
ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) +
geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
scale_y_continuous(breaks = seq(-10,80,10))+
geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
facet_wrap(.~variable, scales = "free_y", ncol = 2)+
coord_flip() +
labs(title = "Categorical variables - Proportion of high income (>50K) in each demographic group - part 2",
x = "",
y = "Percentage of high income group" )+
theme(legend.position = "none")
Graph explaination: The numbers in the graph represent the proportion of specific group in a demographic type (e.g. In the bottom left graph, 67% people in this survey are male.). The bar graphs show the proportion of high income group (“>50K”) in each demographic group. (e.g. Around 30% of male respondents had income more than 50K per year.). The vertical line is the proportion of high income group at the total level (~24%).
As can be seen from the graphs, those who are male, achieving at least Bachelors’ degree, in married relationships, having Management or Professional occupation, self-employed or working in government are more likely to have high income. #### {#page10} However, the size of Managerial and Professional occupation, self-employed or working-in-government groups are relatively small. Therefore, they might be less influential than other mentioned attributes (sex, marital status and education).
Some small groups with similar income likelihood might be grouped together to reduce noise and enhance generalizability. For example, some reclassified grouped might be education below High school graduate, Married-civ-spouse and Married-AF-spouse, Without-pay and Never-worked workclass.
summary1 %>%
filter(variable=="native_country" | variable=="race") %>%
ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) +
geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
scale_y_continuous(breaks = seq(-10,80,10))+
geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
facet_grid(variable~., scales = "free_y", space = "free")+
coord_flip() +
labs(title = "Categorical variables - Proportion of \n high income (>50K) in each demographic group - part 1",
x = "",
y = "Percentage of high income group" ) +
theme(legend.position = "none")
As discussed, the percentage of all native countries other than the U.S. is too small (~10%). In addition, they are equally distributed above and below the high-income likelihood benchmark. Therefore, this attribute might not be important for this classification task. Black, African American, American Indian or Alaskan Native have a much lower proportion of high income than their counterparts.
Descriptive statistics for all numerical variables and boxplot to detect outliers:
lapply(seq(temp3), function(i) {
y <- data.frame(temp3[[i]])
y$variable = names(y)[1]
names(y)[1] <- c("stats")
return(y)
}) %>% bind_rows() -> chart2
chart2%>%
spread(stats, distribution) %>% select(variable, Min., `1st Qu.`, Median, Mean, `3rd Qu.`, Max.) %>% kable_adj()
| variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| age | 17 | 28 | 37 | 38.58 | 48 | 90 |
| capital_gain | 0 | 0 | 0 | 1,078.00 | 0 | 99,999 |
| capital_loss | 0 | 0 | 0 | 87.30 | 0 | 4,356 |
| fnlwgt | 12,285 | 117,827 | 178,356 | 189,778.00 | 237,051 | 1,484,705 |
| hours_per_week | 1 | 40 | 40 | 40.44 | 45 | 99 |
censusData_long1 %>%
filter(variable %in% integer_list$variable) %>%
tbl_df() %>%
mutate(value = as.integer(value))-> censusData_int
chart2$stats <- ordered(chart2$stats, levels = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max."))
chart2 %>%
filter(stats %in% c("Min.", "Max.")) %>%
spread(stats, distribution) %>%
mutate(diff = Max. - Min.)-> t1
censusData_int %>%
ggplot(., aes(1,value))+
geom_boxplot()+
facet_wrap(~variable, scales = "free")+
coord_flip()
In capital_gain and capital_loss, minimum, 1st, 2nd and 3rd quantiles are all equal to “0”, which reflects the fact that at least 75% of people in this data set has no capital gain and capital loss. The distributions of these variables would be extremely right-skewed.
#check whether there is any people having both capital_gain and capital_loss positive values
censusData %>%
group_by(capital_gain, capital_loss) %>%
summarize(count = n()) %>%
filter(capital_gain != 0 & capital_loss != 0)
#none of them
For fnlwgt, the weighting factor has a very wide range. The maximum value is 120 times greater than the minimum.
capital was created by taking capital_gain minus capital_loss. There are 87% of respondents do not have any capital gain and loss. Due to this feature, the distribution of any values other than “0” is hardly identified in the following histogram.
#create capital column
censusData <- censusData %>%
mutate(capital = capital_gain - capital_loss)
#count no. of "0" value
censusData %>%
prop_summary(capital) %>%
filter(capital == 0) %>%
kable_adj()
| capital | count | prop |
|---|---|---|
| 0 | 28,330 | 87 |
capital_summary <- summary(censusData$capital) %>%
as.list() %>%
tbl_df() %>%
gather(stats, distribution, 1:6)
censusData %>%
ggplot(. , aes(capital))+
geom_histogram(fill= "grey", alpha = 0.7)+
geom_vline(data = capital_summary, aes(xintercept = distribution, col = stats), size=1, alpha = 0.5)+
scale_colour_hue()+
theme(legend.position="right")+
labs(title = "Capital gain and loss distribution",
x="Capital",
y="No. of records")
Therefore, I tried to remove “0” to improve the visibility of the histogram.
censusData %>%
filter(capital != 0) %>%
ggplot(. , aes(capital))+
geom_histogram(fill= "grey", alpha = 0.7)+
geom_vline(data = capital_summary, aes(xintercept = distribution, col = stats), size=1)+
scale_colour_hue()+
theme(legend.position="right")+
labs(title = "Capital gain and loss distribution - After remove 0 values",
x="Capital",
y="No. of records")
“99999” values seem to be suspicious as they are separated from the rest of the distribution (so-called outliers). According to IPUMS-CPS (n.d.), it might represent “Not-In-Universe” values rather than actual capital gain. Let temporarily remove it for clearer visualisation.
censusData %>%
filter(capital != 0 & capital != 99999) %>%
ggplot(. , aes(capital))+
geom_histogram(aes(fill = income), binwidth = 500, alpha = 0.7)+
scale_colour_hue()+
theme(legend.position="right")+
labs(title = "Capital gain and loss distribution - After remove 0 and 99999 values",
x="Capital",
y="No. of records")+
geom_vline(xintercept = c(-1750, 7250), linetype = "dashed", alpha = 0.5, size =0.5)+
scale_x_continuous(breaks = seq(-10000, 40000, 10000))+
theme(legend.position = "bottom")
### {#page14} The boundaries of capital to classified income group are approximately detected from this graph. It seems that people have the capital gain at least 7,250 or capital loss at least -1,750 USD are more likely to have a high-income.
censusData %>%
prop_summary(capital) %>%
arrange(desc(capital)) %>%
head(n = 10) %>%
kable_adj()
| capital | count | prop |
|---|---|---|
| 99,999 | 159 | 0.5 |
| 41,310 | 2 | 0.0 |
| 34,095 | 5 | 0.0 |
| 27,828 | 34 | 0.1 |
| 25,236 | 11 | 0.0 |
| 25,124 | 4 | 0.0 |
| 22,040 | 1 | 0.0 |
| 20,051 | 37 | 0.1 |
| 18,481 | 2 | 0.0 |
| 15,831 | 6 | 0.0 |
Caution: Looking a bit closer, the capital recorded values are not rounded numbers (e.g. 27,828 but not 28,000). Based on common sense, it is unlikely to get this precise amount of money based on questionnaire data. According to O’Hara (2004), capital gain came from the Census Bureau’s tax model and was discontinued since 2010 (IPUMS-CPS (n.d.)). Even though this variable might powerful in classifying the income groups, the concern is whether this information is still available for new data. If not, this feature would be useless in this classification model.
From the 1st graph on the next page, the age distribution is right-skewed due to data extraction rule (age > 16). There is a minor spike in the number of respondents aged exactly 90. There might be possible that all ages above 90 or “Unknown” values were encoded as “90”. As mentioned in the Data quality section, it is strange that all “90-year-old people” still work at least 1 hour per week!
censusData_int %>%
filter(variable == "age") %>%
ggplot(. , aes(value))+
geom_histogram(binwidth = 2, fill= "grey", alpha = 0.7)+
# facet_wrap(~variable,scales = "free", ncol = 1)+
geom_vline(data = chart2 %>% filter(variable == "age"), aes(xintercept = distribution, col = stats), size=1)+
scale_colour_hue()+
theme(legend.position="right")+
labs(title = "Age distribution",
x="Age",
y="No. of records")+
scale_x_continuous(breaks=seq(0,100,10))+
theme(legend.position = "right")
censusData %>%
group_by(income, age) %>%
summarize(count = n()) %>%
ungroup() %>%
group_by(age) %>%
mutate(prop = count / sum(count)) %>%
filter(income == ">50K", prop >= benchmark$high) -> mark
censusData_int %>%
filter(variable == "age") %>%
ggplot(., aes(value))+
geom_freqpoly(aes(col= income), binwidth = 1)+
geom_point(data = mark, aes(age, count))+
geom_vline(xintercept = c(35,60), linetype = "dashed")+
geom_vline(xintercept = c(22), linetype = "dashed", col = "red")+
scale_x_continuous(breaks = round(seq(min(censusData$age), max(censusData$age), by = 10),-1))+
scale_colour_hue()+
theme(legend.position = "right")+
labs(title = "Age distribution by income groups - Observation count",
x ="Age",
y = "Number of Observations")
censusData %>%
group_by(income, age) %>%
summarize(count = n()) %>%
ungroup() %>%
group_by(age) %>%
mutate(prop = count / sum(count)) %>%
ggplot(., aes(age, prop, fill = income))+
geom_area()+
geom_point(data = mark, aes(age, prop))+
geom_hline(yintercept = benchmark$high, linetype = "dashed")+
geom_vline(xintercept = c(22), linetype = "dashed", col = "red")+
geom_vline(xintercept = c(35,60), linetype = "dashed")+
# geom_vline(xintercept = seq(17, 90, by = 10), col = "black")+
scale_x_continuous(breaks = round(seq(min(censusData$age), max(censusData$age), by = 10),-1))+
theme(legend.position = "right") +
labs(title = "Age distribution by income groups - Proportion",
x ="Age",
y = "Proportion")
Note: All the horizontal lines (at 24%) are the percentage of high income group at aggregation level.
Based on the 3rd graph, it can be seen that people who aged from 35 to 60 are more likely to have a higher income than the average. Besides, there are only 5 people aged under 22 have an annual income greater than 50K, which accounts for 0.2% of people in this age group. In other word, it is very unlikely that someone age under 22 could earn more than 50,000 USD in 1994.
47% of respondents work exactly 40 hours per week.
censusData_int %>%
filter(variable == "hours_per_week") %>%
ggplot(. , aes(value))+
geom_histogram(binwidth = 2, fill= "grey", alpha = 0.7)+
geom_vline(data = chart2 %>% filter(variable == "hours_per_week"), aes(xintercept = distribution, col = stats), size=1)+
scale_colour_hue()+
theme(legend.position="right")+
labs(title = "Working hours per week distribution",
x="Hours per week",
y="No. of records")+
scale_x_continuous(breaks=seq(0,100,10))
censusData_int %>%
filter(variable == "hours_per_week") %>%
ggplot(., aes(value))+
geom_freqpoly(aes(col= income), binwidth = 1)+
scale_x_continuous(breaks = round(seq(min(censusData$hours_per_week), max(censusData$hours_per_week), by = 10),-1))+
geom_vline(xintercept = 34, linetype = "dashed")+
scale_colour_hue()+
theme(legend.position = "right")+
labs(title = "Hours per week distribution by income groups - Observation count",
x ="Hours per week",
y = "Number of Observations")
censusData %>%
group_by(income, hours_per_week) %>%
summarize(count = n()) %>%
ungroup() %>%
group_by(hours_per_week) %>%
mutate(prop = count / sum(count)) %>%
ggplot(., aes(hours_per_week, prop, fill = income))+
geom_area()+
geom_hline(yintercept = benchmark$high, linetype = "dashed")+
scale_x_continuous(breaks = round(seq(min(censusData$hours_per_week), max(censusData$hours_per_week), by = 10),-1))+
geom_vline(xintercept = 34, linetype = "dashed")+
theme(legend.position = "right") +
labs(title = "Hours per week distribution by income groups - Proportion",
x ="Hours per week",
y = "Proportion")
The percentage of people aged under 35 have high income is typically low comparing to the older group.
censusData %>%
filter(workclass == "Never-worked") %>%
select(-capital_gain, - capital_loss, - fnlwgt, - capital, - relationship, - race, - sex, - native_country) %>%
kable_adj()
| age | workclass | education | marital_status | occupation | hours_per_week | income | id |
|---|---|---|---|---|---|---|---|
| 18 | Never-worked | 10th | Never-married | ? | 40 | <=50K | 5,362 |
| 23 | Never-worked | 7th-8th | Divorced | ? | 35 | <=50K | 10,846 |
| 17 | Never-worked | 10th | Never-married | ? | 30 | <=50K | 14,773 |
| 18 | Never-worked | 11th | Never-married | ? | 10 | <=50K | 20,338 |
| 20 | Never-worked | Some-college | Never-married | ? | 40 | <=50K | 23,233 |
| 30 | Never-worked | HS-grad | Married-civ-spouse | ? | 40 | <=50K | 32,305 |
| 18 | Never-worked | Some-college | Never-married | ? | 4 | <=50K | 32,315 |
There are no correlation between three main continuous variables:
# pairs(censusData %>% select(age, hours_per_week, fnlwgt, capital), pch =21, cex=0.1)
panel.cor <- function(x, y, ...)
{
par(usr = c(0, 1, 0, 1))
txt <- as.character(format(cor(x, y), digits=2))
text(0.5, 0.5, txt, cex = 60* abs(cor(x, y)))
}
pairs(censusData %>% select(age, hours_per_week, capital), upper.panel=panel.cor)
#4. Data pre-processing
In this context, data pre-processing might include four main tasks: cleaning, reduction, discretization and transformation (Han and Pei (2012)). Most of them have been discussed in the previous sections. This part aims to give a concise summary of them:
In general, correcting point (1,2) errors might be not as important as point (3,4) as the error rates in (1,2) are quite small and identifiable. Meanwhile, concerns in point (3,4) require much more domain knowledge about government Census and their procedure. By clarifying these concerns, the weakness of the data set can be well defined and the feasible solutions might be formulated. For example, if we know that the ratio between male and female should be exactly 1:1, a new weighting or other statistical analysis can be applied to minimize the impact of sample imbalance.
This might include dimensionality and/or numerosity reduction or both.
Several simple ways that have been done to reduce data dimensions, including:
summary1 %>%
filter(variable=="education") %>%
ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) +
geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
scale_y_continuous(breaks = seq(-10,80,10))+
geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
geom_vline(xintercept = c(8.5, 12.5), alpha = 0.5, linetype = "dashed", col = "red")+
coord_flip() +
labs(title = "Categorical variables - \n Proportion of high income (>50K) in education groups",
x = "",
y = "Percentage of high income group" )+
theme(legend.position = "none")
Besides, there are other advanced data reduction techniques, that might be useful to try such as Principle Component Analysis for dimensionality reduction and clustering for numerosity reduction. Due to the limit of time, these techniques were not applied, yet would be useful to explore further.
The general idea is trying to map multiple values into intervals or labels.
In a similar approach, continuous variables (e.g. capital, age, hours_per_week) can also be discretized section 3.2.1. The purity within some sub-groups, which were constructed based on the mentioned rule, do increase significantly as shown below.
censusData <- censusData %>%
mutate(capital_group = case_when(capital< (-1750) ~ "[-inf, -1750)",
capital>= (-1750) & capital<0 ~ "[-1750, 0)",
capital== 0 ~ "0",
capital>0 & capital<=7250 ~ "(0,7250]",
capital>7250 ~ "(7250, +inf]"))
censusData$capital_group <- ordered(censusData$capital_group, levels = c("[-inf, -1750)","[-1750, 0)","0","(0,7250]","(7250, +inf]"))
censusData %>%
ggplot(., aes(capital_group))+
geom_bar(aes(fill = income), position = "fill")+
labs(y = "Percentage")+
geom_hline(yintercept = 0.241, linetype = "dashed")
This step depends on what machine learning methods will be applied in the modelling phase. If the algorithms are distance-based (e.g. k-Nearest Neighbours), continuous variables (including age, hours_per_week and capital) should be normalized to the same scale (0,1) and categorical attributes need to transform into dummy binary variables to measure the distance. In this data set case, as categorical variables have many levels, it might be required to reduce the number of levels before transformation or using feature selection to reduce the number of dummy attributes.
Business and data understanding are strongly connected to each other and crucial for the success of data mining. Data exploratory analysis enables people to understand more about domain knowledge (Census methodology). As a result, gaining further knowledge about the area of expertise helps data analysis to discover interesting patterns with a reasonable explanation.
Categorical variables: Those who are male, achieving at least Bachelors’ degree, in married relationships, having Management or Professional occupation, self-employed or working in government are more likely to have high income. Sex, marital status and education might be more important than occupation and workclass in this context.
Continuous variables: To some extent, capital, age and working hours per week support the discrimination of different income groups. Specifically, people in the U.S who have capital gain at least 7,250 or capital loss at least -1,750, aged 35-60 and work at least 35 hours per week have a higher probability of gaining more than 50,000 USD per year in 1994.
Besides minor data errors such as missing, duplicated or illogical values, the main on-going concern is about the representative of the data set and how one can improve it to achieve data mining goals.
Several data pre-processing are considered and hopefully help improve the prediction performance in terms of accuracy, stability, speed and interpretation.
temp2[[1]]$description <- c("Private",
"Self-employed-not incorporated",
"Local government",
"Unknown values",
"State government",
"Self-employed-incorporated",
"Federal government",
"Without pay",
"Never worked")
temp2[[2]]$description <- c("High school graduate - high school diploma or equivalent",
"Some college but no degree",
"Bachelor's degree (for example: BA,AB,BS)",
"Master's degree (for example: MA,MS,MENG,MED, MSW, MBA)",
"Associate degree in college - occupation/vocation program",
"11th grade",
"Associate degree in college - academic program",
"10th grade",
"7th and 8th grade",
"Professional school degree (for example: MD,DDS,DVM,LLB,JD)",
"9th grade",
"12th grade no diploma",
"Doctorate degree (for example: PHD,EDD)",
"5th or 6th grade",
"1st,2nd,3rd,or 4th grade",
"Preschool")
temp2[[3]]$description <- c("Married - civilian spouse present",
"Never married",
"Divorced",
"Separated",
"Widowed",
"Married - spouse absent (exc. separated)",
"Married - AF spouse present")
temp2[[6]]$description <- c("White",
"Black",
"Asian, Native Hawaiian or Other Pacific Islander",
"African American, American Indian or Alaskan Native",
"Other race")
map(temp2, kable_adj) %>% walk(print)
| workclass | count | prop | description |
|---|---|---|---|
| Private | 22,696 | 69.7 | Private |
| Self-emp-not-inc | 2,541 | 7.8 | Self-employed-not incorporated |
| Local-gov | 2,093 | 6.4 | Local government |
| ? | 1,836 | 5.6 | Unknown values |
| State-gov | 1,298 | 4.0 | State government |
| Self-emp-inc | 1,116 | 3.4 | Self-employed-incorporated |
| Federal-gov | 960 | 2.9 | Federal government |
| Without-pay | 14 | 0.0 | Without pay |
| Never-worked | 7 | 0.0 | Never worked |
| education | count | prop | description |
|---|---|---|---|
| HS-grad | 10,501 | 32.3 | High school graduate - high school diploma or equivalent |
| Some-college | 7,291 | 22.4 | Some college but no degree |
| Bachelors | 5,355 | 16.4 | Bachelor’s degree (for example: BA,AB,BS) |
| Masters | 1,723 | 5.3 | Master’s degree (for example: MA,MS,MENG,MED, MSW, MBA) |
| Assoc-voc | 1,382 | 4.2 | Associate degree in college - occupation/vocation program |
| 11th | 1,175 | 3.6 | 11th grade |
| Assoc-acdm | 1,067 | 3.3 | Associate degree in college - academic program |
| 10th | 933 | 2.9 | 10th grade |
| 7th-8th | 646 | 2.0 | 7th and 8th grade |
| Prof-school | 576 | 1.8 | Professional school degree (for example: MD,DDS,DVM,LLB,JD) |
| 9th | 514 | 1.6 | 9th grade |
| 12th | 433 | 1.3 | 12th grade no diploma |
| Doctorate | 413 | 1.3 | Doctorate degree (for example: PHD,EDD) |
| 5th-6th | 333 | 1.0 | 5th or 6th grade |
| 1st-4th | 168 | 0.5 | 1st,2nd,3rd,or 4th grade |
| Preschool | 51 | 0.2 | Preschool |
| marital_status | count | prop | description |
|---|---|---|---|
| Married-civ-spouse | 14,976 | 46.0 | Married - civilian spouse present |
| Never-married | 10,683 | 32.8 | Never married |
| Divorced | 4,443 | 13.6 | Divorced |
| Separated | 1,025 | 3.1 | Separated |
| Widowed | 993 | 3.0 | Widowed |
| Married-spouse-absent | 418 | 1.3 | Married - spouse absent (exc. separated) |
| Married-AF-spouse | 23 | 0.1 | Married - AF spouse present |
| occupation | count | prop |
|---|---|---|
| Prof-specialty | 4,140 | 12.7 |
| Craft-repair | 4,099 | 12.6 |
| Exec-managerial | 4,066 | 12.5 |
| Adm-clerical | 3,770 | 11.6 |
| Sales | 3,650 | 11.2 |
| Other-service | 3,295 | 10.1 |
| Machine-op-inspct | 2,002 | 6.1 |
| ? | 1,843 | 5.7 |
| Transport-moving | 1,597 | 4.9 |
| Handlers-cleaners | 1,370 | 4.2 |
| Farming-fishing | 994 | 3.1 |
| Tech-support | 928 | 2.9 |
| Protective-serv | 649 | 2.0 |
| Priv-house-serv | 149 | 0.5 |
| Armed-Forces | 9 | 0.0 |
| relationship | count | prop |
|---|---|---|
| Husband | 13,193 | 40.5 |
| Not-in-family | 8,305 | 25.5 |
| Own-child | 5,068 | 15.6 |
| Unmarried | 3,446 | 10.6 |
| Wife | 1,568 | 4.8 |
| Other-relative | 981 | 3.0 |
| race | count | prop | description |
|---|---|---|---|
| White | 27,816 | 85.4 | White |
| Black | 3,124 | 9.6 | Black |
| Asian-Pac-Islander | 1,039 | 3.2 | Asian, Native Hawaiian or Other Pacific Islander |
| Amer-Indian-Eskimo | 311 | 1.0 | African American, American Indian or Alaskan Native |
| Other | 271 | 0.8 | Other race |
| sex | count | prop |
|---|---|---|
| Male | 21,790 | 66.9 |
| Female | 10,771 | 33.1 |
| native_country | count | prop |
|---|---|---|
| United-States | 29,170 | 89.6 |
| Mexico | 643 | 2.0 |
| ? | 583 | 1.8 |
| Philippines | 198 | 0.6 |
| Germany | 137 | 0.4 |
| Canada | 121 | 0.4 |
| Puerto-Rico | 114 | 0.4 |
| El-Salvador | 106 | 0.3 |
| India | 100 | 0.3 |
| Cuba | 95 | 0.3 |
| England | 90 | 0.3 |
| Jamaica | 81 | 0.2 |
| South | 80 | 0.2 |
| China | 75 | 0.2 |
| Italy | 73 | 0.2 |
| Dominican-Republic | 70 | 0.2 |
| Vietnam | 67 | 0.2 |
| Guatemala | 64 | 0.2 |
| Japan | 62 | 0.2 |
| Poland | 60 | 0.2 |
| Columbia | 59 | 0.2 |
| Taiwan | 51 | 0.2 |
| Haiti | 44 | 0.1 |
| Iran | 43 | 0.1 |
| Portugal | 37 | 0.1 |
| Nicaragua | 34 | 0.1 |
| Peru | 31 | 0.1 |
| France | 29 | 0.1 |
| Greece | 29 | 0.1 |
| Ecuador | 28 | 0.1 |
| Ireland | 24 | 0.1 |
| Hong | 20 | 0.1 |
| Cambodia | 19 | 0.1 |
| Trinadad&Tobago | 19 | 0.1 |
| Laos | 18 | 0.1 |
| Thailand | 18 | 0.1 |
| Yugoslavia | 16 | 0.0 |
| Outlying-US(Guam-USVI-etc) | 14 | 0.0 |
| Honduras | 13 | 0.0 |
| Hungary | 13 | 0.0 |
| Scotland | 12 | 0.0 |
| Holand-Netherlands | 1 | 0.0 |
The only difference is the dummy id (based on row number) that I created.
censusData %>%
group_by_at(vars(-id)) %>%
summarize(dup_group = n()) %>%
filter(dup_group>1) %>%
select(-dup_group) %>%
inner_join(censusData, by = names(censusData)[-(length(censusData)-2)]) %>%
kable(format="latex", booktabs=TRUE) %>%
kable_styling(latex_options="scale_down", position = "float_left")
colnames(censusData) <- str_replace_all(colnames(censusData), "_", ".")
censusData %>%
group_by(income) %>%
summarise_at(vars(age, fnlwgt, capital.gain, capital.loss, hours.per.week), funs(min, max, mean, median)) -> s1
s1 %>%
gather(variable_measure, value, 2:21) %>%
separate(variable_measure, c("variable", "measure"), sep = "_") %>%
spread(measure, value) %>%
arrange(variable) %>%
select(variable, income, min, median, mean, max) %>%
kable_adj()
| variable | income | min | median | mean | max |
|---|---|---|---|---|---|
| age | <=50K | 17 | 34 | 36.78 | 90 |
| age | >50K | 19 | 44 | 44.25 | 90 |
| capital.gain | <=50K | 0 | 0 | 148.75 | 41,310 |
| capital.gain | >50K | 0 | 0 | 4,006.14 | 99,999 |
| capital.loss | <=50K | 0 | 0 | 53.14 | 4,356 |
| capital.loss | >50K | 0 | 0 | 195.00 | 3,683 |
| fnlwgt | <=50K | 12,285 | 179,465 | 190,340.87 | 1,484,705 |
| fnlwgt | >50K | 14,878 | 176,101 | 188,005.00 | 1,226,583 |
| hours.per.week | <=50K | 1 | 40 | 38.84 | 99 |
| hours.per.week | >50K | 1 | 40 | 45.47 | 99 |
colnames(censusData) <- str_replace_all(colnames(censusData), "\\.", "_")
knitr::opts_chunk$set(warning = FALSE, message = FALSE, comment = NA, fig.align='center')
# knitr::opts_chunk$set(fig.width = 15, fig.height = 14)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
options(dplyr.width = Inf)
options(knitr.table.format = "html")
options(knitr.kable.NA = '')
prop_summary <- function(df, ...){
df %>%
group_by(...) %>%
summarize(count = n()) %>%
ungroup() %>%
mutate(prop = round(count/sum(count)*100,1))
}
kable_adj <- function(data) {
knitr::kable(data, booktabs = TRUE, digits = 2, format.args = list(big.mark = ",")) %>%
kableExtra::kable_styling(full_width = F, position = "left") %>%
row_spec(0, bold = T, color = "white", background = "#696969")
}
library("tidyverse")
library("lubridate")
library("knitr")
library("ggpubr")
library("kableExtra")
#import data
censusData <- read.csv("censusData.csv", stringsAsFactors = T, strip.white = TRUE) %>% tbl_df()
censusData_backup <- censusData
#rename to replace "." with "_"
colnames(censusData) <- str_replace_all(colnames(censusData), "\\.", "_")
variables_class <- as.data.frame(lapply(censusData, class)) %>%
.[1,] %>%
gather(variable, R_class , 1:ncol(.)) %>%
mutate(data_type = ifelse(R_class == "factor", "categorical",
ifelse(variable == "education_num","categorical","continuous")))
description <- c("age of the individual",
"type of employer the individual has",
"number of people in the universe (U.S. population) that each individual in the survey can represent",
"highest level of education that individual achieved",
"highest level of education in a numerical form",
"marital status of the individual",
"occupation of the individual",
"primary family relationship (the relationship of the individual in his/her primary family)",
"the individual's ethnicity",
"gender",
"capital gains (from selling an asset such as a stock, bond for more than the original purchase price)",
"capital losses (from selling an asset such as a stock, bond for less than the original purchase price)",
"hours worked per week",
"country of origin",
"(target) whether or not the person makes more than $50,000 per annum income")
variables_class$description <- description
variables_class %>%
arrange(data_type, variable) %>%
kable_adj() %>%
row_spec(11:15, color = "white", background = "#696969") %>%
column_spec(4, width = "10cm")
#preview data structure and example values
glimpse(censusData)
#pre-processing data without printing out - will be discussed in the pre-processing section
censusData %>%
select(education_num, education) %>%
distinct() %>%
arrange(education_num) -> edu_order
censusData$education <- ordered(censusData$education, levels = edu_order$education)
censusData <- censusData %>% select(-education_num) %>% mutate(id = row_number())
censusData %>%
select(-id) %>%
summarize(no_of_observations = n(),
no_of_distinct = n_distinct(.)) %>%
mutate(no_of_dup = no_of_observations - no_of_distinct) %>%
mutate(pct_dup = round(no_of_dup/no_of_observations*100,1))%>% kable_adj()
censusData %>%
group_by_at(vars(-id)) %>%
summarize(dup_group = n()) %>%
group_by(dup_group) %>%
summarize(count = n()) %>% kable_adj()
#check 'normal' missing values
# map_df(censusData, ~sum(is.na(.)))
#'check missing values labeled as "?"
total_missing <- map_df(censusData, ~str_detect(., pattern = "\\?")) %>%
rowSums() %>%
tbl_df() %>%
filter(value > 0) %>%
summarize(no_of_missing = n()) %>%
mutate(pct_missing = round(no_of_missing/ nrow(censusData)*100,1),
variable = "all variables") %>%
select(variable, no_of_missing, pct_missing)
map_df(censusData, ~sum(str_detect(., pattern = "\\?"))) %>%
gather(variable, no_of_missing, 1:15) %>%
mutate(pct_missing = round(no_of_missing/ nrow(censusData)*100,1)) %>%
bind_rows(total_missing) %>%
arrange(desc(no_of_missing)) %>%
kable_adj()
income_s <- prop_summary(censusData,income)
income_c <- censusData %>%
filter_all(all_vars(.!= "?")) %>%
prop_summary(income) %>%
mutate(dataset = "without missing values") %>%
bind_rows(income_s %>% mutate(dataset = "with missing values"))
ggplot(income_c, aes(dataset, prop))+
geom_col(aes(fill=income))+
labs(y="Percentage")+
geom_text(aes(label=prop), hjust=2.5)+
coord_flip()+
theme(legend.position="bottom")
censusData %>%
group_by(age, education) %>%
summarize(count = n()) %>%
ungroup() %>%
ggplot(., aes(age, education))+
geom_point(aes(size = count, color = (age <= 20 & education =="Masters")), pch = 19, alpha = 0.7)+
scale_x_continuous(breaks = seq(0,100,10))+
scale_colour_manual(values = setNames(c('red','grey'),c(T, F)))+
theme(legend.position="bottom")+
labs(y="")
censusData %>%
group_by(sex, relationship) %>%
summarize(count = n()) %>%
ggplot(., aes(relationship, sex))+
geom_text(aes(label=count, color = (sex == "Male" & relationship == "Wife") | (sex == "Female" & relationship == "Husband")))+
scale_colour_manual(values = setNames(c('red','black'),c(T, F)))+
theme(legend.position="bottom")
censusData %>%
ggplot(., aes(age, hours_per_week))+
geom_jitter(aes(col = (age == 90)),alpha = 0.2)+
scale_x_continuous(breaks = seq(0,100,10))+
scale_y_continuous(breaks = seq(0,100,10))+
scale_colour_manual(values = setNames(c('red','black'),c(T, F)))+
theme(legend.position="bottom")
ggplot(income_c %>% filter(dataset == "with missing values"), aes(dataset, prop))+
geom_col(aes(fill=income))+
labs(y="Percentage")+
geom_text(aes(label=prop), hjust=2.5)+
coord_flip()+
labs(x ="")+
scale_x_discrete(labels=c("with missing values" = ""))+
theme_minimal()+
theme(legend.position="right")
variables_class$unique_value <- sapply(censusData_backup, n_distinct)
variables_class$levels <- c(NA,
"?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay",
NA,
"Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-grad, Some-college, Assoc-voc, Assoc-acdm, Bachelors, Masters, Prof-school, Doctorate",
"1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16",
"Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed",
"?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving",
"Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife",
"Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White",
"Female, Male",
NA,
NA,
NA,
"?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, Yugoslavia",
"<=50K, >50K")
variables_class %>%
select(-description, - R_class) %>%
filter(data_type == "categorical") %>%
arrange(unique_value, variable) %>%
kable_adj() %>%
column_spec(4, width = "10cm")
#summary table
y <- censusData %>% summary(maxsum = 42) %>% as.data.frame.matrix(row.names = F)
#generate counting numbers and empty list
a<-1
b<-1
temp2 <- list()
temp3 <- list()
categorical_list <- variables_class %>% filter(R_class == "factor", variable !="income")
integer_list <- variables_class %>% filter(R_class == "integer", !variable %in% c("id", "education_num"))
#splitting into categorical and continuous groups
for (i in seq_along(y)){
y %>%
select(i) %>%
na.omit() -> temp
if (str_trim(colnames(temp)[1]) %in% categorical_list[,1] ){
temp %>%
separate(1,c(str_trim(colnames(temp)[1]), "count"), sep = ":") %>%
mutate_all(~str_trim(.)) %>%
mutate(count = as.numeric(count), prop = round(count / sum(count) * 100,1)) %>%
arrange(desc(count))-> temp2[[a]]
a <- a + 1
}
if (str_trim(colnames(temp)[1]) %in% integer_list[,1] ){
temp %>%
separate(1,c(str_trim(colnames(temp)[1]), "distribution"), sep = ":") %>%
mutate_all(~str_trim(.)) %>%
mutate(distribution = as.numeric(distribution)) -> temp3[[b]]
b <- b + 1
}
}
censusData %>%
gather(variable, value, - id) %>%
mutate(value = as.integer(value))-> censusData_long
lapply(seq(temp2), function(i) {
y <- data.frame(temp2[[i]])
y$variable = names(y)[1]
names(y)[1] <- c("levels")
return(y)
}) %>% bind_rows() -> chart1
ggplot(chart1 %>% filter(variable %in% c("native_country", "race")), aes(reorder(levels,prop), prop))+
geom_col(fill= "#56B4E9", alpha = 0.7)+
facet_grid(variable~., space = "free", scales = "free")+
coord_flip()+
labs(title = "Categorical variables - Frequency bar chart - part 1",
x = "",
y = "Percentage" )+
geom_text(aes(label = round(prop,0)), hjust = -1)+
scale_y_continuous(breaks = seq(0,100,10))
censusData %>%
gather(variable, value, - id) %>%
mutate(value = as.integer(value))-> censusData_long
lapply(seq(temp2), function(i) {
y <- data.frame(temp2[[i]])
y$variable = names(y)[1]
names(y)[1] <- c("levels")
return(y)
}) %>% bind_rows() %>% mutate(levels = factor(levels))-> chart1
ggplot(chart1 %>% filter(!variable %in% c("native_country", "race")), aes(reorder(levels,prop), prop))+
geom_col(fill= "#56B4E9", alpha = 0.7)+
facet_wrap(~variable, ncol =2, scales = "free_y")+
coord_flip()+
labs(title = "Categorical variables - Frequency bar chart - part 2",
x = "",
y = "Percentage" )+
geom_text(aes(label = round(prop,0)), hjust = -0.5)+
scale_y_continuous(breaks = seq(0,90,10), limits = c(0,80))
censusData %>%
summarize(high = mean(income ==">50K")) -> benchmark
censusData %>%
gather(variable, value, - id, - income) -> censusData_long1
censusData_long1 %>%
filter(variable %in% categorical_list$variable) %>%
group_by(variable, value) %>%
summarize(count = n(),
pct_gt50 = round(mean(income == ">50K")*100,2)) %>%
mutate(pct_count = round(count/ nrow(censusData)*100,0)) -> summary1
summary1 %>%
filter(variable!="native_country", variable!="race") %>%
ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) +
geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
scale_y_continuous(breaks = seq(-10,80,10))+
geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
facet_wrap(.~variable, scales = "free_y", ncol = 2)+
coord_flip() +
labs(title = "Categorical variables - Proportion of high income (>50K) in each demographic group - part 2",
x = "",
y = "Percentage of high income group" )+
theme(legend.position = "none")
summary1 %>%
filter(variable=="native_country" | variable=="race") %>%
ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) +
geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
scale_y_continuous(breaks = seq(-10,80,10))+
geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
facet_grid(variable~., scales = "free_y", space = "free")+
coord_flip() +
labs(title = "Categorical variables - Proportion of \n high income (>50K) in each demographic group - part 1",
x = "",
y = "Percentage of high income group" ) +
theme(legend.position = "none")
lapply(seq(temp3), function(i) {
y <- data.frame(temp3[[i]])
y$variable = names(y)[1]
names(y)[1] <- c("stats")
return(y)
}) %>% bind_rows() -> chart2
chart2%>%
spread(stats, distribution) %>% select(variable, Min., `1st Qu.`, Median, Mean, `3rd Qu.`, Max.) %>% kable_adj()
censusData_long1 %>%
filter(variable %in% integer_list$variable) %>%
tbl_df() %>%
mutate(value = as.integer(value))-> censusData_int
chart2$stats <- ordered(chart2$stats, levels = c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max."))
chart2 %>%
filter(stats %in% c("Min.", "Max.")) %>%
spread(stats, distribution) %>%
mutate(diff = Max. - Min.)-> t1
censusData_int %>%
ggplot(., aes(1,value))+
geom_boxplot()+
facet_wrap(~variable, scales = "free")+
coord_flip()
#check whether there is any people having both capital_gain and capital_loss positive values
censusData %>%
group_by(capital_gain, capital_loss) %>%
summarize(count = n()) %>%
filter(capital_gain != 0 & capital_loss != 0)
#none of them
#create capital column
censusData <- censusData %>%
mutate(capital = capital_gain - capital_loss)
#count no. of "0" value
censusData %>%
prop_summary(capital) %>%
filter(capital == 0) %>%
kable_adj()
capital_summary <- summary(censusData$capital) %>%
as.list() %>%
tbl_df() %>%
gather(stats, distribution, 1:6)
censusData %>%
ggplot(. , aes(capital))+
geom_histogram(fill= "grey", alpha = 0.7)+
geom_vline(data = capital_summary, aes(xintercept = distribution, col = stats), size=1, alpha = 0.5)+
scale_colour_hue()+
theme(legend.position="right")+
labs(title = "Capital gain and loss distribution",
x="Capital",
y="No. of records")
censusData %>%
filter(capital != 0) %>%
ggplot(. , aes(capital))+
geom_histogram(fill= "grey", alpha = 0.7)+
geom_vline(data = capital_summary, aes(xintercept = distribution, col = stats), size=1)+
scale_colour_hue()+
theme(legend.position="right")+
labs(title = "Capital gain and loss distribution - After remove 0 values",
x="Capital",
y="No. of records")
censusData %>%
filter(capital != 0 & capital != 99999) %>%
ggplot(. , aes(capital))+
geom_histogram(aes(fill = income), binwidth = 500, alpha = 0.7)+
scale_colour_hue()+
theme(legend.position="right")+
labs(title = "Capital gain and loss distribution - After remove 0 and 99999 values",
x="Capital",
y="No. of records")+
geom_vline(xintercept = c(-1750, 7250), linetype = "dashed", alpha = 0.5, size =0.5)+
scale_x_continuous(breaks = seq(-10000, 40000, 10000))+
theme(legend.position = "bottom")
censusData %>%
prop_summary(capital) %>%
arrange(desc(capital)) %>%
head(n = 10) %>%
kable_adj()
censusData_int %>%
filter(variable == "age") %>%
ggplot(. , aes(value))+
geom_histogram(binwidth = 2, fill= "grey", alpha = 0.7)+
# facet_wrap(~variable,scales = "free", ncol = 1)+
geom_vline(data = chart2 %>% filter(variable == "age"), aes(xintercept = distribution, col = stats), size=1)+
scale_colour_hue()+
theme(legend.position="right")+
labs(title = "Age distribution",
x="Age",
y="No. of records")+
scale_x_continuous(breaks=seq(0,100,10))+
theme(legend.position = "right")
censusData %>%
group_by(income, age) %>%
summarize(count = n()) %>%
ungroup() %>%
group_by(age) %>%
mutate(prop = count / sum(count)) %>%
filter(income == ">50K", prop >= benchmark$high) -> mark
censusData_int %>%
filter(variable == "age") %>%
ggplot(., aes(value))+
geom_freqpoly(aes(col= income), binwidth = 1)+
geom_point(data = mark, aes(age, count))+
geom_vline(xintercept = c(35,60), linetype = "dashed")+
geom_vline(xintercept = c(22), linetype = "dashed", col = "red")+
scale_x_continuous(breaks = round(seq(min(censusData$age), max(censusData$age), by = 10),-1))+
scale_colour_hue()+
theme(legend.position = "right")+
labs(title = "Age distribution by income groups - Observation count",
x ="Age",
y = "Number of Observations")
censusData %>%
group_by(income, age) %>%
summarize(count = n()) %>%
ungroup() %>%
group_by(age) %>%
mutate(prop = count / sum(count)) %>%
ggplot(., aes(age, prop, fill = income))+
geom_area()+
geom_point(data = mark, aes(age, prop))+
geom_hline(yintercept = benchmark$high, linetype = "dashed")+
geom_vline(xintercept = c(22), linetype = "dashed", col = "red")+
geom_vline(xintercept = c(35,60), linetype = "dashed")+
# geom_vline(xintercept = seq(17, 90, by = 10), col = "black")+
scale_x_continuous(breaks = round(seq(min(censusData$age), max(censusData$age), by = 10),-1))+
theme(legend.position = "right") +
labs(title = "Age distribution by income groups - Proportion",
x ="Age",
y = "Proportion")
censusData_int %>%
filter(variable == "hours_per_week") %>%
ggplot(. , aes(value))+
geom_histogram(binwidth = 2, fill= "grey", alpha = 0.7)+
geom_vline(data = chart2 %>% filter(variable == "hours_per_week"), aes(xintercept = distribution, col = stats), size=1)+
scale_colour_hue()+
theme(legend.position="right")+
labs(title = "Working hours per week distribution",
x="Hours per week",
y="No. of records")+
scale_x_continuous(breaks=seq(0,100,10))
censusData_int %>%
filter(variable == "hours_per_week") %>%
ggplot(., aes(value))+
geom_freqpoly(aes(col= income), binwidth = 1)+
scale_x_continuous(breaks = round(seq(min(censusData$hours_per_week), max(censusData$hours_per_week), by = 10),-1))+
geom_vline(xintercept = 34, linetype = "dashed")+
scale_colour_hue()+
theme(legend.position = "right")+
labs(title = "Hours per week distribution by income groups - Observation count",
x ="Hours per week",
y = "Number of Observations")
censusData %>%
group_by(income, hours_per_week) %>%
summarize(count = n()) %>%
ungroup() %>%
group_by(hours_per_week) %>%
mutate(prop = count / sum(count)) %>%
ggplot(., aes(hours_per_week, prop, fill = income))+
geom_area()+
geom_hline(yintercept = benchmark$high, linetype = "dashed")+
scale_x_continuous(breaks = round(seq(min(censusData$hours_per_week), max(censusData$hours_per_week), by = 10),-1))+
geom_vline(xintercept = 34, linetype = "dashed")+
theme(legend.position = "right") +
labs(title = "Hours per week distribution by income groups - Proportion",
x ="Hours per week",
y = "Proportion")
censusData %>%
filter(workclass == "Never-worked") %>%
select(-capital_gain, - capital_loss, - fnlwgt, - capital, - relationship, - race, - sex, - native_country) %>%
kable_adj()
# pairs(censusData %>% select(age, hours_per_week, fnlwgt, capital), pch =21, cex=0.1)
panel.cor <- function(x, y, ...)
{
par(usr = c(0, 1, 0, 1))
txt <- as.character(format(cor(x, y), digits=2))
text(0.5, 0.5, txt, cex = 60* abs(cor(x, y)))
}
pairs(censusData %>% select(age, hours_per_week, capital), upper.panel=panel.cor)
summary1 %>%
filter(variable=="education") %>%
ggplot(., aes(reorder(value,pct_gt50), pct_gt50)) +
geom_bar(stat = "identity", fill= "#56B4E9", alpha = 0.7) +
scale_y_continuous(breaks = seq(-10,80,10))+
geom_text(aes(reorder(value,pct_gt50), -5, label = pct_count, size = log(pct_count+1)))+
geom_hline(yintercept = benchmark$high[1]*100, alpha = 0.5, linetype = "dashed")+
geom_vline(xintercept = c(8.5, 12.5), alpha = 0.5, linetype = "dashed", col = "red")+
coord_flip() +
labs(title = "Categorical variables - \n Proportion of high income (>50K) in education groups",
x = "",
y = "Percentage of high income group" )+
theme(legend.position = "none")
censusData <- censusData %>%
mutate(capital_group = case_when(capital< (-1750) ~ "[-inf, -1750)",
capital>= (-1750) & capital<0 ~ "[-1750, 0)",
capital== 0 ~ "0",
capital>0 & capital<=7250 ~ "(0,7250]",
capital>7250 ~ "(7250, +inf]"))
censusData$capital_group <- ordered(censusData$capital_group, levels = c("[-inf, -1750)","[-1750, 0)","0","(0,7250]","(7250, +inf]"))
censusData %>%
ggplot(., aes(capital_group))+
geom_bar(aes(fill = income), position = "fill")+
labs(y = "Percentage")+
geom_hline(yintercept = 0.241, linetype = "dashed")
temp2[[1]]$description <- c("Private",
"Self-employed-not incorporated",
"Local government",
"Unknown values",
"State government",
"Self-employed-incorporated",
"Federal government",
"Without pay",
"Never worked")
temp2[[2]]$description <- c("High school graduate - high school diploma or equivalent",
"Some college but no degree",
"Bachelor's degree (for example: BA,AB,BS)",
"Master's degree (for example: MA,MS,MENG,MED, MSW, MBA)",
"Associate degree in college - occupation/vocation program",
"11th grade",
"Associate degree in college - academic program",
"10th grade",
"7th and 8th grade",
"Professional school degree (for example: MD,DDS,DVM,LLB,JD)",
"9th grade",
"12th grade no diploma",
"Doctorate degree (for example: PHD,EDD)",
"5th or 6th grade",
"1st,2nd,3rd,or 4th grade",
"Preschool")
temp2[[3]]$description <- c("Married - civilian spouse present",
"Never married",
"Divorced",
"Separated",
"Widowed",
"Married - spouse absent (exc. separated)",
"Married - AF spouse present")
temp2[[6]]$description <- c("White",
"Black",
"Asian, Native Hawaiian or Other Pacific Islander",
"African American, American Indian or Alaskan Native",
"Other race")
map(temp2, kable_adj) %>% walk(print)
censusData %>%
group_by_at(vars(-id)) %>%
summarize(dup_group = n()) %>%
filter(dup_group>1) %>%
select(-dup_group) %>%
inner_join(censusData, by = names(censusData)[-(length(censusData)-2)]) %>%
kable(format="latex", booktabs=TRUE) %>%
kable_styling(latex_options="scale_down", position = "float_left")
colnames(censusData) <- str_replace_all(colnames(censusData), "_", ".")
censusData %>%
group_by(income) %>%
summarise_at(vars(age, fnlwgt, capital.gain, capital.loss, hours.per.week), funs(min, max, mean, median)) -> s1
s1 %>%
gather(variable_measure, value, 2:21) %>%
separate(variable_measure, c("variable", "measure"), sep = "_") %>%
spread(measure, value) %>%
arrange(variable) %>%
select(variable, income, min, median, mean, max) %>%
kable_adj()
colnames(censusData) <- str_replace_all(colnames(censusData), "\\.", "_")
Becker, Barry, and Ron Kohavi. 1996. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. https://archive.ics.uci.edu/ml/machine-learning-databases/adult/.
Census, United States. Bureau of the. 1995. “Current Population Survey: Annual Demographic File, 1994.” Inter-university Consortium for Political; Social Research [distributor]. https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/6461/version/1.
Ethnic, and Hispanic Statistics Branch. n.d. Population Division, U.S. Bureau of the Census Current Population Survey. https://www2.census.gov/programs-surveys/demo/tables/foreign-born/1995/cps1995/tab0108.pdf.
Han, Jiawei., and Jian. Pei. 2012. “Data Mining: Concepts and Techniques.” London;Amsterdam; : Morgan Kaufmann. http://capitadiscovery.co.uk/brighton-ac/items/1431592.
IPUMS-CPS. n.d. University of Minnesota. https://cps.ipums.org/cps-action/variables/CAPGAIN#codes_section.
O’Hara, Amy. 2004. “New Methods for Simulating Cps Taxes.” Mimeo, U.S. Census. https://www.census.gov/content/dam/Census/library/working-papers/2004/demo/oharataxmodel.pdf.