colHeaders <- c("age",
"workclass",
"fnlwgt",
"education",
"education_num",
"marital_status",
"occupation",
"relationship",
"race",
"sex",
"capital_gain",
"capital_loss",
"hours_per_week",
"native_country",
"salary")
colClass <- c("numeric", #age
"factor", #workclass
"numeric", #fnlwg
"factor", #education
"numeric", #education-num
"factor", #marital-status
"factor", #occupation
"factor", #relationship
"factor", #race
"factor", #sex
"numeric", #capital-gain
"numeric", #capital-loss
"numeric", #hours-per-week
"factor", #native-country
"factor" # Salary (> or <= 50k)
)
# All missing values in the dataset are denoted with " ?" in the original data. These will be replaced with NA.
data <- read.table("data/adult/adult.data", sep=",", col.names = colHeaders, na.strings=c(" ?"), colClasses = colClass)
head(data, 5)
## age workclass fnlwgt education education_num marital_status
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## hours_per_week native_country salary
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
dim(data)
## [1] 32561 15
# Check data types
str(data)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : num 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 8 levels " Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
## $ fnlwgt : num 77516 83311 215646 234721 338409 ...
## $ education : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education_num : num 13 13 9 7 13 14 5 9 14 13 ...
## $ marital_status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 14 levels " Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
## $ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital_gain : num 2174 0 0 0 0 ...
## $ capital_loss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ hours_per_week: num 40 13 40 40 40 40 16 45 50 40 ...
## $ native_country: Factor w/ 41 levels " Cambodia"," Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
## $ salary : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
for (col in names(data)){
cat("\nSummary of", col, ":\n")
print(summary(data[[col]]))
}
##
## Summary of age :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 28.00 37.00 38.58 48.00 90.00
##
## Summary of workclass :
## Federal-gov Local-gov Never-worked Private
## 960 2093 7 22696
## Self-emp-inc Self-emp-not-inc State-gov Without-pay
## 1116 2541 1298 14
## NA's
## 1836
##
## Summary of fnlwgt :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12285 117827 178356 189778 237051 1484705
##
## Summary of education :
## 10th 11th 12th 1st-4th 5th-6th
## 933 1175 433 168 333
## 7th-8th 9th Assoc-acdm Assoc-voc Bachelors
## 646 514 1067 1382 5355
## Doctorate HS-grad Masters Preschool Prof-school
## 413 10501 1723 51 576
## Some-college
## 7291
##
## Summary of education_num :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 10.00 10.08 12.00 16.00
##
## Summary of marital_status :
## Divorced Married-AF-spouse Married-civ-spouse
## 4443 23 14976
## Married-spouse-absent Never-married Separated
## 418 10683 1025
## Widowed
## 993
##
## Summary of occupation :
## Adm-clerical Armed-Forces Craft-repair Exec-managerial
## 3770 9 4099 4066
## Farming-fishing Handlers-cleaners Machine-op-inspct Other-service
## 994 1370 2002 3295
## Priv-house-serv Prof-specialty Protective-serv Sales
## 149 4140 649 3650
## Tech-support Transport-moving NA's
## 928 1597 1843
##
## Summary of relationship :
## Husband Not-in-family Other-relative Own-child Unmarried
## 13193 8305 981 5068 3446
## Wife
## 1568
##
## Summary of race :
## Amer-Indian-Eskimo Asian-Pac-Islander Black Other
## 311 1039 3124 271
## White
## 27816
##
## Summary of sex :
## Female Male
## 10771 21790
##
## Summary of capital_gain :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 1078 0 99999
##
## Summary of capital_loss :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 87.3 0.0 4356.0
##
## Summary of hours_per_week :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 40.00 40.00 40.44 45.00 99.00
##
## Summary of native_country :
## Cambodia Canada
## 19 121
## China Columbia
## 75 59
## Cuba Dominican-Republic
## 95 70
## Ecuador El-Salvador
## 28 106
## England France
## 90 29
## Germany Greece
## 137 29
## Guatemala Haiti
## 64 44
## Holand-Netherlands Honduras
## 1 13
## Hong Hungary
## 20 13
## India Iran
## 100 43
## Ireland Italy
## 24 73
## Jamaica Japan
## 81 62
## Laos Mexico
## 18 643
## Nicaragua Outlying-US(Guam-USVI-etc)
## 34 14
## Peru Philippines
## 31 198
## Poland Portugal
## 60 37
## Puerto-Rico Scotland
## 114 12
## South Taiwan
## 80 51
## Thailand Trinadad&Tobago
## 18 19
## United-States Vietnam
## 29170 67
## Yugoslavia NA's
## 16 583
##
## Summary of salary :
## <=50K >50K
## 24720 7841
Count Total Number of NAs = 4262
sum(is.na(data)==TRUE)
## [1] 4262
Number of rows that have complete data and number of rows that do not (e.g., contain NA values).
table(complete.cases(data))
##
## FALSE TRUE
## 2399 30162
Percentage of complete rows and % of incomplete rows (e.g., contain NA values)
prop.table(table(complete.cases(data))) * 100
##
## FALSE TRUE
## 7.36771 92.63229
Table of the number of missing values in a case:
miss_case_table(data)
## # A tibble: 4 × 3
## n_miss_in_case n_cases pct_cases
## <int> <int> <dbl>
## 1 0 30162 92.6
## 2 1 563 1.73
## 3 2 1809 5.56
## 4 3 27 0.0829
Counts and percentages of missing values in each variable:
miss_var_summary(data)
## # A tibble: 15 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 occupation 1843 5.66
## 2 workclass 1836 5.64
## 3 native_country 583 1.79
## 4 age 0 0
## 5 fnlwgt 0 0
## 6 education 0 0
## 7 education_num 0 0
## 8 marital_status 0 0
## 9 relationship 0 0
## 10 race 0 0
## 11 sex 0 0
## 12 capital_gain 0 0
## 13 capital_loss 0 0
## 14 hours_per_week 0 0
## 15 salary 0 0
Visualization of the whole dataset’s classes and missing data:
vis_dat(data)
More in-depth visualization of missing data:
vis_miss(data)
gg_miss_var(data) + labs(y="Count of Missing Values")
Summary of Missing Data
Out of the 32,651 total cases, there are 30,162 complete cases, about 93% of the data, and 2399 incomplete cases, around 7% of the data. Overall, there are a total of 4262 missing values in the data set, meaning many of the cases with missing information are missing more than one variable. As seen in the table created from the miss_case_table() function, most of the incomplete cases are missing 2 variables. Of the 15 variables, three account for all of the missing values in the data: native_country, workclass, and occupation. Given that workclass and occupation both account for the majority of missing values, these are the two variables most likely to be missing in the incomplete cases.
One important thing to take into account before tackling Data Visualization is to ensure that the classes of each variable are correct. In the initial data importation, if the variables were left as is, the variables would all imported as “integer” or “character”. By converting necessary groups to a factor class, we are able to work with some variables as categorical variables in the visualization step.
Salary: The Dependent Variable
The purpose of the Adult data set is to predict whether a person’s income will exceed $50K/year making salary is the dependent variable. The barplot below shows the count of those in the data set who make less than or equal to $50K and those that make more. As we can see, the number of those who make less than or equal to $50K far exceeds the number making more than $50K.
p <- ggplot(data, aes(x = salary)) +
geom_bar(fill="salmon") +
ggtitle("Barplot of Salary") +
xlab("Salary")
p
Age
A boxplot of the age data gives us a quick summary about the distribution of ages in the data set. The youngest person in the data set is just under 20 years old, while the oldest is over 80. The median age is 37. Half of the people in the data set, represented by the interquartile range, are between 28 and 48 years old. While there are no outliers representing young people (according to metadata, the data extracted from the census excluded those under the age of 16), there are several outliers who are over 80 years old.
summary(data$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 28.00 37.00 38.58 48.00 90.00
boxplot(data$age, col="salmon",
main = "Boxplot of Age")
The histogram below gives us more detail on the distribution of ages in the dataset. Here, we have grouped age by salary to give us even more information. We can surmise several things from the plot:
p <- ggplot(data) +
geom_histogram(aes(x=age, group =salary, fill=salary),
binwidth=1, color="black") + #(Grouped, stacked and percent stacked barplot in ggplot2, 2018)
ggtitle("Distribution of Age, Grouped by Salary") +
xlab("Age")
p
Looking at a side-by-side boxplot of Age grouped by salary, we can see that those who make over $50K a year tended to be older than those who did not. The median age for those making over $50K was older than the mean age of 39, while the median age for those making less than or equal to $50K, was younger. Both groups had outliers older than the Social Security Retirement Age.
boxplot(age ~ salary, data=data, col="salmon",
main = "Boxplot of Age by Salary",
xlab = "Salary",
ylab = "Age")
abline(h=mean(data$age), col="blue",lwd=3)
text(x=2.33, y = 32, 'Mean Age (38.5)',col="blue")
abline(h=67, col="blue", lwd=3)
text(x=2.08, y = 62, 'Social Security Retirement Age (67)',col="blue")
Final Weight
The Final Weight estimate refers to population totals derived from CPS by creating “weighted tallies” of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights.
The 3 sets of control are:
Below are a side-by-side boxplot and a histogram showing the distribution of Final Weight. There are several extreme outliers with very high final weight values. The distributions of final weight between those that make less than or equal to 50K and those that make more are comparable to each other, though outliers for those making less than or equal to $50K are more extreme than those making over.
boxplot(fnlwgt ~ salary, data=data, col="salmon",
main = "Boxplot of Final Weight by Salary",
xlab = "Salary",
ylab = "Final Weight")
p <- ggplot(data) +
geom_histogram(aes(x=fnlwgt, group =salary, fill=salary),
bins=5000, ) +
ggtitle("Distribution of Final Weight, Grouped by Salary") +
xlab("Final Weight")
p
Numeric Education Level
Numeric education level runs from 1 to 16, and gives a numeric value to how much education a respondent in the data set has received. In the U.S., grades 9-12 represent high school while 13-16 represent college and post-graduate studies.
The boxplot below shows a fairly symmetrical distribution of Numeric education level but there are very low outliers skewing the data a bit; this is due to the respondents who reported education of preschool and elementary school levels. The median education level is around the 10th grade and 50% of the dataset reported education levels between 9 and 12, most received at least some high school level education.
summary(data$education)
## 10th 11th 12th 1st-4th 5th-6th
## 933 1175 433 168 333
## 7th-8th 9th Assoc-acdm Assoc-voc Bachelors
## 646 514 1067 1382 5355
## Doctorate HS-grad Masters Preschool Prof-school
## 413 10501 1723 51 576
## Some-college
## 7291
boxplot(data$education_num , data=data, col="salmon",
main = "Boxplot of Numeric Education Level",
ylab = "Numeric Education Level")
The histogram below shows a more detailed distribution of the Numeric Education Levels in the data set. As shown by the turquoise portion of the bars, almost all of those that make $50K or over have at least some high school education, if not some amount of further education. The blue and green dashed lines show the mean and the median of the numeric educational levels respectively. They are extremely close together, indicating a fairly symmetrical distribution, though (as in the boxplot above) we can see that there are a number of very low outliers.
p <- ggplot(data) +
geom_histogram(aes(x=education_num, group =salary, fill=salary),
binwidth=1, color="black") +
ggtitle("Distribution of Numeric Education Level, Grouped by Salary") +
xlab("Numeric Education Level") +
geom_vline(aes(xintercept=mean(education_num)),
color="blue",
linewidth=1,
linetype="dashed") +
geom_vline(aes(xintercept=median(education_num)),
color="green",
linewidth=1,
linetype="dashed") +
annotate("text", x=12.8, y = 10400, #(How to add label to geom_vline in ggplot2, 2022)
label = "Mean Education Level",
color="blue") +
annotate("text", x=7, y = 10800, #(How to add label to geom_vline in ggplot2, 2022)
label = "Median Education Level",
color="green")
p
The side-by-side boxplots below show that in general, those that make more than $50K a year have higher educational levels than those who make less than or equal to $50K. The blue line shows that the mean numeric education level is around 10. The middle 50% (represented by the IQR) of those that make less than or equal to $50K lies below this mean while those making over $50K lies above. Also, while those who make less than or equal to $50K a year have outliers both low and high outliers, those that make over $50K only have outliers below the lower whiskers.
boxplot(education_num ~ salary, data=data, col="salmon",
main = "Boxplot of Numeric Education Level by Salary",
xlab = "Salary",
ylab = "Numeric Education Level")
abline(h=mean(data$education_num),
col="blue",
lwd=3,
lty=2
)
text(x=2.2, y = 9, 'Mean Education Level (10)',col="blue")
Education Level
Education Level translates the numeric education levels seen above to categorical data. In the barplot shown below, we can see that a good majority of the respondents in the data set have at least a high school diploma. Very few people reported a doctorate or prof-school education but these are the only two education levels where the majority of people reported earning over $50K a year. A small portion of respondents reported Preschool, 1st-4th, and 5th-6th grade level education, accounting for the low outliers we saw in the boxplots above. These groups mostly reported making less than or equal to $50K, though a very small number of those with a 5th-6th grade level education made more.
p <- ggplot(data,
aes(x = fct_infreq(education), fill=salary)) + #(Reordering geom_bar and geom_col by Count or Value, 2022)
geom_bar(position="stack") +
ggtitle("Barplot of Education, Grouped by Salary") +
xlab("Education Level") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
p
Marital Status
Marital Status was another categorical variable in the data set. The following 3 bar plots represent the marital statuses for:
In the plot representing the whole dataset, most respondents reported being married, and this segment was split fairly evenly between those who made over $50K a year and those who did not. Those who were never married or divorced were the next two most common segments made up primarily by those who made less than or equal to $50K.
The top segment for those who made up less than or equal to $50K was “Never-Married”, closely followed by “Married” while the clear top category for those who made over $50K a year was “Married”.
ggplot(data,
aes(x = fct_infreq(marital_status), fill=salary, )) +
geom_bar(position="stack") +
ggtitle("Barplot of Marital Status, Grouped by Salary") +
xlab("Marital Status") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
data %>%
filter(salary == " <=50K") %>%
ggplot(aes(x=fct_infreq(marital_status))) +
geom_bar(fill="salmon") +
xlab("Marital Status") +
ylim(0,15000)+
ggtitle("Barplot of Marital Status for those making <=50K") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
data %>%
filter(salary != " <=50K") %>%
ggplot(aes(x=fct_infreq(marital_status))) +
geom_bar(fill="turquoise") +
xlab("Marital Status") +
ylim(0,15000)+
ggtitle("Barplot of Marital Status for those making >50K") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
Occupation
Occupation was a categorical variable that, along with the workclass variable, had the most number of missing values in the data set. The following 3 bar plots represent the occupations for:
The top 3 occupations for the whole data set were:
The top 3 for those who made less than or equal to $50K a year:
The top 3 for those who made more than $50K a year:
ggplot(data,
aes(x = fct_infreq(occupation), fill=salary, )) +
geom_bar(position="stack") +
ggtitle("Barplot of Occupations, Grouped by Salary") +
xlab("Occupation") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
data %>%
filter(salary == " <=50K") %>%
ggplot(aes(x=fct_infreq(occupation))) +
geom_bar(fill="salmon") +
xlab("Occupation") +
ylim(0,4000)+
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
data %>%
filter(salary != " <=50K") %>%
ggplot(aes(x=fct_infreq(occupation))) +
geom_bar(fill="turquoise") +
xlab("Occupation") +
ylim(0,4000)+
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
Relationships
The Relationship categorical variable is tied in with the marital status variable to a certain extent. The marital status variable plots showed us that marriage was the most common relationship type in the data set. Here, those who reported being husbands were the most typical response. This is probably due to the fact there were many more men in the data set than women, and that in the U.S. in 1994, men were the primary breadwinners of families. What is interesting is that those who reported being wives were the only group where over half of the respondant reported making over $50K a year.
p <- ggplot(data,
aes(x = fct_infreq(relationship), fill=salary, )) +
geom_bar(position="stack") +
ggtitle("Barplot of Relationships, Grouped by Salary") +
xlab("Relationship") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
p
Race
As we can see in the Barplot of Race, Grouped by Salary, the vast majority of respondents in the data set reported being White, followed by Black, Asian-Pac-Islander, Amer-Indian-Eskimo, and lastly Other. In the Percent Stacked Bar Plot of Race, Grouped by Salary, we can see that racial groups White and Asian-Pac-Islander were most likely to earn over $50K a year, with about 25% of each group doing so. Out of those who reported being Black, Amer-Indian-Eskimo, or Other, about 12-13% reported making over $50K a year.
ggplot(data,
aes(x = fct_infreq(race), fill=salary, )) +
geom_bar(position="stack") +
ggtitle("Barplot of Race, Grouped by Salary") +
xlab("Race") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
#Percent Stacked Barplot (Grouped, stacked and percent stacked barplot in ggplot2, 2021)
ggplot(data, aes(fill=salary, x=fct_infreq(race))) +
geom_bar(position="fill") +
ggtitle("Percent Stacked Bar Plot of Race, Grouped by Salary") +
xlab("Race") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))
Sex
Another key demographic is the gender of those in the data set. As we can see in the bar plot below, there are about twice as many male respondents in the data set as women. There are a higher percentage of males that make over $50K than females.
ggplot(data,
aes(x = sex, fill=salary, )) +
geom_bar(position="stack") +
ggtitle("Barplot of Sex, Grouped by Salary") +
xlab("Sex")
When separated by sex, we can see in the side-by-side box plots below that female and male respondents in the data set share similar distributions of age, though men tended to be just a bit older than females. The median age for both groups seems to be in the late 30s with comparable minimum and maximum ages. There are older outliers in both groups as well.
boxplot(age ~ sex, data=data, col="coral1",
main = "Boxplot of Age by Sex",
xlab = "Sex",
ylab = "Age")
Capital Gains
It is difficult to evaluate the distribution of Capital Gains because the high number of Capital Gains equaling zero heavily skews the model.
ggplot(data) +
geom_histogram(aes(x=capital_gain, group =salary, fill=salary), bins=50, color="black") +
ggtitle("Distribution of Capital Gain, Grouped by Salary") +
xlab("Capital Gain")
The histogram below shows the Capital Gain excluding those with a Capital Gain of 0, which does not have any impact on income. The histogram shows that people with more capital gains, tended to make more than $50K annually. The histogram was still skewed to the right with high outliers with capital gains of 100,000. Clearly, having capital gain to report had positive impacts on income.
data %>%
filter(capital_gain != 0) %>%
ggplot(aes(x=capital_gain, group=salary, fill=salary)) +
geom_histogram(bins=50, color="black") +
xlab("Capital Gain")
Capital Losses
It is difficult to evaluate the distribution of Capital Losses because the high number of Capital Losses equaling zero heavily skews the model to the right. To deal with this, we will examine the data excluding the zeros (which have no impact on income).
ggplot(data) +
geom_histogram(aes(x=capital_loss, group =salary, fill=salary),
bins=50, color="black") +
ggtitle("Distribution of Capital Loss, Grouped by Salary") +
xlab("Capital Loss")
The histogram below shows the Capital Losses excluding those with a Loss of 0. The 0s were heavily skewing the the distribution to the right and made no impact on income. The histogram shows a fairly symmetrical distribution with both salary groups reporting capital losses. There is a sizeable spike of those reporting losses at around $2000, especially those who made more than $50K. This could be worth investigating further.
data %>%
filter(capital_loss != 0) %>%
ggplot(aes(x=capital_loss, group=salary, fill=salary)) +
geom_histogram(bins=50, color="black") +
xlab("Capital Loss")
Hours per Week
The hours per week variable shows how many hours a week people reported working. The distribution is symmetrical and there is a huge spike at around 40 hours a week, which makes sense as this is the standard work week for many Americans. Those who made over $50K in the year generally worked at least 40 hours a week, if not more. This makes sense as we would assume that the more one works, the more they would make.
ggplot(data) +
geom_histogram(aes(x=hours_per_week, group =salary, fill=salary),
bins = 30,
color="black") +
ggtitle("Distribution of Hours Worked per Week, Grouped by Salary") +
xlab("Hours Worked Per Week")
In the side-by-side scatter plot comparing age and hours worked per week for the two salary groups, we can see that younger people were more likely to make less than or equal to $50K a year than those who were older. Across the board, most people worked around the standard 40 hour work week. Additionally, those 80 and older were not only less likely to work, they were less likely to work more than the 40 hour standard.
## Lattice
library(lattice)
xyplot(hours_per_week ~ age | salary, data = data)
Native Country
As shown in the bar plot below, the primary respondants in the data set reported the United States as their native country, followed by Mexico as a distant second.
ggplot(data,
aes(x = fct_infreq(native_country), fill=salary, )) +
geom_bar(position="stack") +
ggtitle("Barplot of Native Country, Grouped by Salary") +
xlab("Country") +
theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))