This is part 2 of the Census Data Project. In this project we analyze a U.S. census data taken from the UCI (University of California at Irvine) Machine Learning Repository. The project is divided into four parts. The first part of the project is Cleaning and Preprocessing the Data. The third part is Predictive Analysis and the last part is Theoretical Background. Our final goal is to build a model, which can predict whether the income of a random adult American citizen is less or greater than 50000$ a year based on given features, such as age, education, occupation, gender, race, etc. In the first part of the project – Cleaning and Preprocessing the Data, we introduced the U.S. census dataset and we cleaned the data and transformed it into a form suitable for further analysis. In this part of the project we perform exploratory data analysis. We investigate the predictor variables as well as their relationship with the response variable “income”.
First we load the necessary packages:
library(ggplot2)
library(plyr)
library(gridExtra)
library(gmodels)
library(grid)
library(vcd)
library(scales)
library(ggthemes)In the first part of the project we cleaned and preprocessed the data, and we wrote the transformed datasets into csv files. Below we set the working directory and then we read the preprocessed training dataset into the “db.adult” data frame:
setwd("D:/Data_Science_Projects/Census_DataSet")
db.adult <- read.csv("adult_df.csv")We start with the response variable (called also dependent variable). In the next part of the project - Predictive Analysis, our goal will be to build a model that predicts if a person earns more than 50,000 dollars a year. We recall that “income” is a factor variable with two levels:
class(db.adult$income)[1] "factor"
levels(db.adult$income)[1] " <=50K" " >50K"
First we show a summary statisitc:
summary(db.adult$income) <=50K >50K
22654 7508
and then we visualize the above results with a bar plot:
ggplot(data = db.adult,
mapping = aes(x = db.adult$income, fill = db.adult$income)) +
geom_bar(mapping = aes(y = (..count..)/sum(..count..))) +
geom_text(mapping = aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1) +
labs(x = "Income",
y = "",
fill = "Income") +
scale_y_continuous(labels = percent)The graph above shows us the percentage of people earning less than 50K a year and more than 50K. We see that 75.1% of the participants in the study are paid less than 50K and 24.9% are paid more than 50K.
Below we show the box plots only of the nonzero capital gain and loss grouped by income. The mean value is depicted with a filled red dot and the black horizontal line inside the boxes is the median.
First we consider the variable “capital_gain”. We can see that for people earning more than 50K a year, the bulk of the values (50% of the data points) as well as the median, and the mean value of the capital gain are significantly greater than these of people earning less than 50K:
ggplot(mapping = aes(x = income, y = capital_gain),
data = subset(db.adult, db.adult$capital_gain > 0)) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
labs(x = "Income",
y = "Capital Gain") +
ggtitle("Box Plot of Nonzero Capital Gain by Income") # Zoomed version:
ggplot(mapping = aes(x = income, y = capital_gain),
data = subset(db.adult, db.adult$capital_gain > 0)) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
coord_cartesian(ylim = c(0, 30000)) +
scale_y_continuous(breaks = seq(0, 30000, 1500)) +
labs(x = "Income",
y = "Capital Gain") +
ggtitle("Box Plot of Nonzero Capital Gain by Income") Next we give the box plot of nonzero capital loss grouped by income. Here we observe the same tendency as for capital gain: the mean and median for people having an income of more than 50K a year is bigger than that of people earning less than 50K a year. This might be due to the fact that people with higher income are more prone to invest more money more often. This leads to higher chances of not only good but also bad investments resulting in bigger losses.
ggplot(mapping = aes(x = income, y = capital_loss),
data = subset(db.adult, db.adult$capital_loss > 0)) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
labs(x = "Income",
y = "Capital Loss") +
ggtitle("Box Plot of Nonzero Capital Loss by Income") # Zoomed version:
ggplot(mapping = aes(x = income, y = capital_loss),
data = subset(db.adult, db.adult$capital_loss > 0)) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
coord_cartesian(ylim = c(0, 3000))+
scale_y_continuous(breaks = seq(0, 3000, 200)) +
labs(x = "Income",
y = "Capital Loss") +
ggtitle("Box Plot of Nonzero Capital Loss by Income") As a conclusion, we can say that there is evidence for strong relationship between the nonzero values of “capital_gain” and “capital_loss”, and “income”. However, we will not include these variables in the predictive model because of the extremely high number of zeros among their values. Let us recall that, as we discussed in part 1 of the project – Cleaning and Preprocessing the Data, the number of zeros in these variables is more than 90%. This means that less than 10% of the participants in the survey make any investments. This gives rise to the question of whether to include these variables in the predictive model at the first place. For the predictive analysis in the second part of the project we will include the categorical variables “cap_gain” and “cap_loss” as covariates, but we will also fit a predictive model without them. If “cap_gain” and “cap_loss” do not contribute to the prediction accuracy of the model, we will drop them.
Next, we will examine the relationship between the factor variables “cap_gain” and “cap_loss” and the categorical variable “income”. We will show bar plots of the two variables grouped by income. We do this with the help of the functions “lapply()” and “grid.arrange()”. As the R documentation states: “lapply(X, FUN, …) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X, where X is a vector or list.” In our case we create a list called “lg_cap.gain”, which consists of 2 graphical objects (2 = number of levels of the factor variable “income”). After that we plot the two graphical objects (which are bar plots with vertical bars) in one plot with the function “grid.arrange()” from the “gridExtra” package.
First we show the bar plot of the variable “cap_gain”, grouped by the values of “income”:
# Bar plot of cap_gain by income:
lg_cap.gain <- lapply(X = levels(db.adult$income), FUN = function(v){
df <- subset(db.adult, db.adult$income == v)
df <- within(df, cap_gain <- factor(cap_gain,
levels = names(sort(table(cap_gain),
decreasing = TRUE))))
ggplot(data = df,
aes(x = cap_gain,
fill = cap_gain)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1) +
labs(x = "Capital Gain",
y = "",
fill = "Capital Gain") +
theme(legend.position = 'none') +
ggtitle(paste("Income", v, sep = "")) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lg_cap.gain, ncol = 2)From the bar plot we see that 0.0% of the people who earn less than 50K a year have a high capital gain. In order to check if this result is due to some round-off error, below we display the number of individuals with income of less than 50K and high capital gain:
nrow(subset(db.adult, db.adult$cap_gain == " High" &
db.adult$income == " <=50K"))[1] 6
As we can see from the output above, there are, indeed, no people with low annual income and high capital gain. The bar plot also shows that the proportion of people who have medium and high capital gain is much bigger within the group of people with income of more than 50K a year compared to the respective proportion within the group of people with income of less than 50K. Therefore, we can conclude that there is a realtionship between “cap_gain” and “income”.
Further, we consider the variable “cap_loss”:
# Bar plot of cap_loss by income:
lg_cap.loss <- lapply(levels(db.adult$income), function(v){
df <- subset(db.adult, db.adult$income == v)
df <- within(df, cap_loss <- factor(cap_loss,
levels = names(sort(table(cap_loss),
decreasing = TRUE))))
ggplot(data = df,
aes(x = cap_loss,
fill = cap_loss)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1) +
labs(x = "Capital Loss",
y = "",
fill = "Capital Loss") +
theme(legend.position = 'none') +
ggtitle(paste("Income", v, sep="")) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lg_cap.loss, ncol = 2)Here we observe the same trend as in the case of the variable “cap_gain”, which in this case might be due (as we already mentioned) to the more and bigger (i.e. riskier) investments that people with higher income tend to make.
We begin with displaying the summary of age as well as its IQR:
summary(db.adult$age) Min. 1st Qu. Median Mean 3rd Qu. Max.
17.00 28.00 37.00 38.44 47.00 90.00
IQR(db.adult$age)[1] 19
The median age is 37 years and the mean age is 38 years. The summary shows that at least 50% of the people in the study are between 28 and 47 years old, which makes sense since the participants in the survey should be of working age. Of course, there are some outliers, such as individuals being between 75 and 90 years old. To visualize the summary statistic we also show a box plot of the variable “age”:
ggplot(mapping = aes(x = factor(0), y = age),
data = db.adult) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
coord_cartesian(ylim = c(10, 100)) +
scale_y_continuous(breaks = seq(10, 100, 5)) +
ylab("Age") +
xlab("") +
ggtitle("Box Plot of Age") +
scale_x_discrete(breaks = NULL)From the histogram of “age” below we can see that the bulk of individuals are bewtween 20 and 50 years old:
qplot(x = db.adult$age,
data = db.adult,
binwidth = 5,
color = I('black'),
fill = I('#F29025'),
xlab = "Age",
ylab = "Count",
main = "Histogram of Age") +
scale_x_continuous(breaks = seq(0, 95, 5)) +
scale_y_continuous(breaks = seq(0, 4500, 500))Next we display the empirical density of age grouped by income. From the plot we see that the majority of people earning more than 50K a year are between 33 and 55 years old, whereas the the greater number of people who earn less than 50K a year are between 18 and 45:
ggplot(data = db.adult, aes(age, fill = income)) +
geom_density(alpha = 0.2) +
scale_x_continuous(breaks = seq(0, 95, 5))The density plot above clearly shows that age and income are correlated – people of greater age have higher income. This can be also seen from the following histograms of age by income:
ggplot(data = db.adult, mapping = aes(x = age)) +
geom_histogram(binwidth = 5,
color = "black",
fill = "lightblue",
alpha = 0.6) +
scale_x_continuous(breaks = seq(0, 95, 5)) +
facet_wrap(~income) +
ggtitle("Histogram of Age by Income") Next we give the box plot of age grouped by income:
ggplot(aes(x = income, y = age),
data = db.adult) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = "point",
shape = 16,
cex = 2,
col = "red") +
coord_cartesian(ylim = c(10, 90))+
scale_y_continuous(breaks = seq(10, 90, 5)) +
ylab("Age") +
xlab("Income") +
ggtitle("Box Plot of Age by Income") In the box plot above, the filled red dot represents the mean value and the black horizontal line is the median. Once again we see the relationship between age and income. People who earn more that 50K a year are on average about 43-44 years old, whereas people who earn less than 50K a year, are younger on average - around 34-37. This can be also be seen by the summary statistics below:
summary(subset(db.adult$age, db.adult$income == " <=50K")) Min. 1st Qu. Median Mean 3rd Qu. Max.
17.00 26.00 34.00 36.61 45.00 90.00
summary(subset(db.adult$age, db.adult$income == " >50K")) Min. 1st Qu. Median Mean 3rd Qu. Max.
19.00 36.00 43.00 43.96 51.00 90.00
From the above output and the box plots we also notice that the first quartiles (25-th percentiles) for both groups differ significantly. The first quartile for people who have an income of more than 50K is equal to 36, whereas the first quartile for people earning less than 50K equals 26. This means that the elder an individual is, the bigger their chances of having a higher income are. This can be explained by the fact that the older one gets, the more experienced professionally they become, which usually means a higher salary.
Below we show one more plot in order to demonstrate the correlation between age and income. We display a conditional density plot of income versus age:
cd_plot(x = db.adult$age,
y = db.adult$income,
xlab = "Age",
ylab = "Income",
main = "Conditionl Density Plot of Income versus Age")From the plot we see that the probability of having an income greater than 50K is biggest (about 60%) for individuals in their 50s, and is smallest for people in their 20s.
Finally, we show a plot of mean working hours per week versus age, grouped by gender. From this graph we see that on average men work more hours per week than women at almost all ages. There is an exception for people between 77 and 80 and between 85 and 90, where women have more working hours per week on average. However, the latter observations are outliers (data which are highly unlikely for this study, because it is very unusual for people to work when they are above 80 years old), we have to consider these extreme data points with caution. We also notice that the mean number of woring hours per week for people between 25 and 60 years old (active working years) is 40 hours for women and 45 hours for men. For people above 65 years old the average working hours per week gradually decrease and vary between about 15 and 30 hours.
ggplot(aes(x = age, y = hours_per_week),
data = db.adult) +
geom_line(mapping = aes(color = sex),
stat = 'summary',
fun.y = mean) +
geom_smooth(mapping = aes(color = sex)) +
scale_x_continuous(breaks = seq(10, 100, 5)) +
scale_y_continuous(breaks = seq(0, 55, 5)) +
labs(x = "Age", y = "Mean Hours per Week") +
ggtitle("Age vs. Mean Hours per Week by Gender")Below is a summary statistic of the variable “hours_per_week”:
summary(db.adult$hours_per_week) Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 40.00 40.00 40.93 45.00 99.00
IQR(db.adult$hours_per_week)[1] 5
The mean number of working hours per week is 41 and at least 50% of the participants in the study work between 40 and 45 hours a week. Next we show the box plot of “hours_per_week” which visualizes the summary statistic. We see that there are many outliers, i.e. there are many people who work less than 40 hours a week or more than 45 hours a week:
ggplot(aes(x = factor(0), y = hours_per_week),
data = db.adult) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = "point",
shape = 19,
color = "red",
cex = 2) +
coord_cartesian(ylim = c(10, 100)) +
scale_x_discrete(breaks = NULL) +
scale_y_continuous(breaks = seq(10, 100, 5)) +
ylab("Hours per Week") +
xlab("") +
ggtitle("Box plot of Hours per Week") # Zoomed box plot of hours per week:
ggplot(aes(x = factor(0), y = hours_per_week),
data = db.adult) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = "point",
color = "red",
shape = 19,
cex = 2) +
coord_cartesian(ylim = c(30, 50)) +
scale_x_discrete(breaks = NULL) +
scale_y_continuous(breaks = seq(30, 50, 1)) +
ylab("Hours per Week") +
xlab("") +
ggtitle("Box plot of Hours per Week") Since we are interested in the relationship between income and working hours per week, we display the box plot of hours per week grouped by income:
ggplot(aes(x = income, y = hours_per_week),
data = db.adult) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
coord_cartesian(ylim = c(10, 100))+
scale_y_continuous(breaks = seq(10, 100, 10)) +
ylab("Hours per Week") +
xlab("Income") +
ggtitle("Box plot of Hours per Week by Income") # Zoom of the box plot of hours per week by income:
ggplot(aes(x = income, y = hours_per_week),
data = db.adult) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
coord_cartesian(ylim = c(30, 60))+
scale_y_continuous(breaks = seq(30, 60, 1)) +
ylab("Hours per Week") +
xlab("Income") +
ggtitle("Box Plot of Hours per Week by Income") As we can see from the produced box plot, the mean number of working hours per week is significantly greater for people who earn more than 50K a year compared to people earning less than 50K a year. We also notice that the two medians are equal, but for income<=50K the median coincides with the third quartile, whereas for the data corresponding to income>50K, the respective median coincides with the first quartile. The exact numbers can be observed better with the help of the “summary()” function:
summary(subset(db.adult$hours_per_week, db.adult$income == " <=50K")) Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 38.00 40.00 39.35 40.00 99.00
summary(subset(db.adult$hours_per_week, db.adult$income == " >50K")) Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 40.00 40.00 45.71 50.00 99.00
Both medians (50-th percentiles) are equal to 40 hours per week, but the mean value for income<=50K is approximately 39 compared to 46 for income>50K. Also, at least 50% of the people with income of less than 50K work between 38 and 40 hours a week. In contrast, at least 50% of the individuals earning more than 50K a year work between 40 and 50 hours a week, which is considerably more than to the group of people with income of less than 50K. This shows us that there is a pronounced correlation between income and working hours per week.
Finally we show a graph of the mean working hours per week versus age, grouped by income:
ggplot(mapping = aes(x = age, y = hours_per_week),
data = db.adult) +
geom_line(mapping = aes(color = income),
stat = 'summary',
fun.y = mean) +
geom_smooth(mapping = aes(color = income)) +
scale_x_continuous(breaks = seq(10, 100, 5)) +
labs(x = "Age",
y = "Mean Hours per Week") +
ggtitle("Mean Hours per Week vs. Age")From the above plot we see that for all age groups the mean number of working hours per week is greater for people with income>50K than for people with income<=50K.
From the summary statistic below we see that the bulk of people work between 40 and 45 hours a week, which we already determined.
summary(db.adult$hours_w) between_40_and_45 between_45_and_60 between_60_and_80
16606 5790 857
less_than_40 more_than_80
6714 195
Next we give a bar plot of “hours_w”, which shows the percentage of people belonging to each category of the factor variable. In order to display the bars in decreasing order based on the number of people belonging to each category of “hours_w”, we first rearrange the levels of the factor variable (which are ordered in alphabetical order by default):
db.adult <- within(db.adult, hours_w <- factor(hours_w, levels =
names(sort(table(hours_w),
decreasing = TRUE))))
ggplot(db.adult,
aes(x = db.adult$hours_w, fill = db.adult$hours_w)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = 0.3) +
labs(x = "Hours per week",
y = "",
fill = "Hours per week") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Bar Plot of Hours per Week") +
scale_y_continuous(labels = percent)We observe that 55.1% of all individuals work between 40 and 45 hours a week; 22.3% work less than 40 hours a week; 19.2% work between 45 and 60 hours a week; 2.8% work between 60 nad 80 hours a week and 0.6% work more than 80 hours a week.
Finally we show a bar plot of “hours_w” grouped by “income”:
lg_hpw <- lapply(levels(db.adult$income), function(v){
df <- subset(db.adult, db.adult$income == v)
df <- within(df, hours_w <- factor(hours_w,
levels = names(sort(table(hours_w),
decreasing = TRUE))))
ggplot(data = df,
aes(x = hours_w,
fill = hours_w)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1,
size = 3) +
labs(x = "Hours per week",
y = "",
fill = "Hours per week") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle(paste("Income", v, sep="")) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lg_hpw, ncol = 2)We see that approximately the same percentage of people in both categories of “income” work between 40 and 45 hours a week, but the proportion of people with income>50K who work between 45 and 60 hours a week is 33.2% compared to 14.6% for that of people with income<=50K.
We start with a summary statistic:
summary(db.adult$native_region) Central-America Central-Asia East-Asia Europe-East
1226 142 304 85
Europe-West Outlying-US South-America United-States
408 380 113 27504
The majority of people come from the United States and Central America.
Next we show a bar plot with the percentage of people belonging to each native region:
db.adult$native_region <- factor(db.adult$native_region,
levels =
names(sort(table(db.adult$native_region), decreasing = TRUE)))
ggplot(db.adult,
aes(x = db.adult$native_region, fill = db.adult$native_region)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1) +
labs(x = "Region",
y = "",
fill = "Regions") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = percent)As we can see from the graphs above, 91.2% of the participants in the study come from the USA. This means that we have a small number of people from each of the other native regions (especially for East Europe - only 85 people, South America -113 people and Central Asia - 142 people) leading to random samples which might not be representative for the respective population. Therefore further analysis must be carried out with caution.
Lastly we display the percentage of people earning less than 50K and more than 50K among all individuals belonging to a given native region:
lp_region <- lapply(levels(db.adult$native_region), function(v){
df <- subset(db.adult, db.adult$native_region == v)
ggplot(data = df,
aes(x = income,
fill = income)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = c(2, -0.1),
size = 4) +
labs(x = "Income",
y = "",
fill = "Income") +
ggtitle(v) +
theme(legend.position = 'none',
plot.title = element_text(size = 11, face = "bold")) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lp_region[1:4], ncol = 2)grid.arrange(grobs = lp_region[5:8], ncol = 2)As we can see from the graphics above, 40.8% of the people with Central-Asian origin earn more than 50K a year, which is the highest percentage among all native regions. However, we cannot rely completely on this inference, because there are only 142 participants with Central-Asian origin compared to 27504 people with domestic (US) roots. If we ignore the small number of individuals which comprise the samples for the rest of the native regions, we can make similar conclusions. Namely, the second place is for individuals of West-European descent - 31.6%, followed by Outlying-US with 28.9% and East Asia - with 28%. The fifth place is for people from the United States - 25.4%, followed by East-European descendents with 23.5% and the last place is for individuals with South-American origin - 7.1%. If we disregard the small number of observations for the non-US native regions, the latter results indicate that “income” is dependent on “native_region”.
The summary statistic of “workclass” is given below:
table(db.adult$workclass)
Federal-gov Local-gov Private Self-emp-inc
943 2067 22286 1074
Self-emp-not-inc State-gov Without-pay
2499 1279 14
We notice that the majority of people are employed in the private sector. We also see that there are no people in the category “Never-worked”, and only 14 of the participants in the study, or 0.05%, are without pay.
Next we show a graph with the percentage of people belonging to each category of “workclass”:
db.adult$workclass <- factor(db.adult$workclass,
levels =
names(sort(table(db.adult$workclass), decreasing = TRUE)))
ggplot(db.adult,
aes(x = db.adult$workclass, fill = db.adult$workclass)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1,
size = 3.5) +
labs(x = "Employment type",
y = "",
fill = "Employment type") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = percent)Based on the summary statistic of “workclass” we observed that there are no people in the category " Never-worked“:
nrow(subset(db.adult , db.adult$workclass == " Never-worked"))[1] 0
We also notice that there are no people in the category “Without-pay” who earn more than 50K a year, which seems obvious and logical:
nrow(subset(db.adult , db.adult$workclass == " Without-pay" &
db.adult$income == " >50K"))[1] 0
Therefore, in order to make the plots more readable and meaningful, we exclude the factor levels " Never-worked" and " Without-pay“. We do this in the following way:
First we create the character vector “modified.work”, whose elements are the levels of the nominal variable “workclass”;
Then we transform the vector by excluding from it the elements " Never-worked" and " Without-pay“.
modified.work <- levels(db.adult$workclass)
modified.work[1] " Private" " Self-emp-not-inc" " Local-gov"
[4] " State-gov" " Self-emp-inc" " Federal-gov"
[7] " Without-pay"
modified.work <- modified.work[!is.element(modified.work,
c(" Never-worked",
" Without-pay"))]
modified.work[1] " Private" " Self-emp-not-inc" " Local-gov"
[4] " State-gov" " Self-emp-inc" " Federal-gov"
Finally we plot the percentage of people earning less than 50K and more than 50K based on their employment status:
lg.workclass.mod <- lapply(modified.work, function(v){
ggplot(data = subset(db.adult, db.adult$workclass == v),
aes(x = subset(db.adult, db.adult$workclass == v)$income,
fill = subset(db.adult, db.adult$workclass == v)$income)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = c(2, 1.5)) +
labs(x = "Income",
y = "",
fill = "Income") +
ggtitle(v) +
theme(legend.position = 'none',
plot.title = element_text(size = 11, face = "bold")) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lg.workclass.mod[1:3], ncol = 2)grid.arrange(grobs = lg.workclass.mod[4:6], ncol = 2)From the graphs above, we can see that the percentage of individuals having an income of more than 50K is biggest for the category “Self-emp-inc” (self-employed people with income) – 55.9%. The group with the second highest percentage of people earning more than 50K is that of federal government employees. There is no significant difference between the categories " Local-gov" (local government), " Self-emp-not-inc" (self-employed individuals without income) and " State-gov" (state government jobs) - about 27-28% of the workers in these branches have an annual pay of more than 50000$. In the private sector only about 22% of the people earn more than 50K a year. These results demonstrate that there is a relationship between the variables “income” and “workclass”.
We start with a summary of “education”:
summary(db.adult$education) 10th 11th 12th 1st-4th 5th-6th
820 1048 377 151 288
7th-8th 9th Assoc-acdm Assoc-voc Bachelors
557 455 1008 1307 5044
Doctorate HS-grad Masters Preschool Prof-school
375 9840 1627 45 542
Some-college
6678
The majority of people have a high school degree - 9840, college degreee - 6678 and bachelor degree - 5044. The bar plot below shows the percentage of people belonging to each category of “education”:
db.adult$education <- factor(db.adult$education,
levels =
names(sort(table(db.adult$education), decreasing = TRUE)))
ggplot(db.adult,
aes(x = db.adult$education, fill = db.adult$education)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1,
size = 3.5) +
labs(x = "Education",
y = "",
fill = "Education") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = percent)Since there are no people with education “Preschool” who earn more than 50K a year, as we can see below,
nrow(subset(db.adult, db.adult$education == " Preschool" &
db.adult$income == " >50K" ))[1] 0
we will remove the factor level “Preschool” before we continue further with the analysis. In order to do that we create a character vector “modified.edu” with elements equal to the factor levels of “education”, and then we alter the vector by removing the element " Preschool“:
modified.edu <- levels(db.adult$education)
modified.edu [1] " HS-grad" " Some-college" " Bachelors" " Masters"
[5] " Assoc-voc" " 11th" " Assoc-acdm" " 10th"
[9] " 7th-8th" " Prof-school" " 9th" " 12th"
[13] " Doctorate" " 5th-6th" " 1st-4th" " Preschool"
modified.edu <- modified.edu[!is.element(modified.edu, " Preschool")]
modified.edu [1] " HS-grad" " Some-college" " Bachelors" " Masters"
[5] " Assoc-voc" " 11th" " Assoc-acdm" " 10th"
[9] " 7th-8th" " Prof-school" " 9th" " 12th"
[13] " Doctorate" " 5th-6th" " 1st-4th"
After that, we display the bar plot of each education category grouped by income:
lg.mod.edu <- lapply(modified.edu, function(v){
ggplot(data = subset(db.adult, db.adult$education == v),
aes(x = subset(db.adult, db.adult$education == v)$income,
fill = subset(db.adult, db.adult$education == v)$income)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = c(2, 0.5),
size = 3) +
labs(x = "Income",
y = "",
fill = "Income") +
ggtitle(v) +
theme(legend.position = 'none',
plot.title = element_text(size = 11, face = "bold")) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lg.mod.edu[1:4], ncol = 2)grid.arrange(grobs = lg.mod.edu[5:8], ncol = 2)grid.arrange(grobs = lg.mod.edu[9:12], ncol = 2)grid.arrange(grobs = lg.mod.edu[13:15], ncol = 2)The categories " 1st-4th“,” 5th-6th“,” 7th-8th“,” 9th“,” 10th“,” 11th" and " 12th" have a very small percentage of people with income greater than 50K a year. The percentage of people with a high school degree who earn more than 50K is also relatively small - 16.4%. 20% of the individuals in the category " Some-college" earn more than 50K. The biggest percentage of employees (74.9%), who have an annual income higher than 50K, belongs to the category " Prof-school“. The” Doctorate" group is next with 74.7%, followed by the categories " Masters" - 56.4% and " Bachelors" - 42.1%. These results clearly indicate that there is a relationship between education and income. In order to demonstrate this correlation visually, we show one more graph - a conditional density plot of income versus education number (“education_num”):
cd_plot(x = db.adult$education_num,
y = db.adult$income,
xlab = "Education number",
ylab = "Income",
main = "Conditionl Density Plot of Income vs. Education Number")Each number (from 1 to 16) in this integer variable corresponds to an education level from the factor variable “education”, starting from the lowest level (“Preschool”) and reaching the highest education level - “Doctorate”. We will investigate further the variable “education_num” and its relation to the variable “education” in the third part of the project – Predictive Analysis. As we can see from the conditional density plot above, the higher the education level, the greater the chances are an individual to earn more than 50K a year.
From the summary below we see how many people belong to each category of the variable “marital_status”:
summary(db.adult$marital_status) Divorced Married-AF-spouse Married-civ-spouse
4214 21 14065
Married-spouse-absent Never-married Separated
370 9726 939
Widowed
827
The biggest number of people are married to a civilian spouse - 14065. A significant number of individuals belong to the group " Never-married" - 9726, followd by divorced people - 4214. A very small number of participants in the study are married to an army spouse - 21. Below we visualize the percentage of people belonging to each category:
db.adult$marital_status <- factor(db.adult$marital_status,
levels =
names(sort(table(db.adult$marital_status), decreasing = TRUE)))
ggplot(db.adult,
aes(x = db.adult$marital_status, fill = db.adult$marital_status)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1,
size = 3.5) +
labs(x = "Marital Status",
y = "",
fill = "Marital Status") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = percent)Below we give the bar plots of income grouped by marital status:
lp_marital <- lapply(levels(db.adult$marital_status), function(v){
ggplot(data = subset(db.adult, db.adult$marital_status == v),
aes(x = subset(db.adult, db.adult$marital_status == v)$income,
fill = subset(db.adult, db.adult$marital_status == v)$income)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = c(2, -0.1)) +
labs(x = "Income",
y = "",
fill = "Income") +
ggtitle(v) +
theme(legend.position = 'none',
plot.title = element_text(size = 11, face = "bold")) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lp_marital[1:3], ncol = 2)grid.arrange(grobs = lp_marital[4:7], ncol = 2)As we see from the graphs above, the biggest percentage of employees with income higher than 50K are those from the category " Married-AF-spouse“. However, since there are only 25 observations in this category, we cannot draw trustworthy conclusions regarding the income of the individuals belonging to this group. On the other hand, the random sample for the category” Married-civ-spouse" amounts to 14065 individuals and can be considered representative. For this category, the percentage of people with income of more than 50K is very high - 45.5%. The same cannot be said for the groups " Divorced“,” Never-married“,” Married-spouse-absent“,” Separated" and " Widowed“, where the percentage of people with income higher than 50K varies between 4.8% and 10.7%. One explanation as to why people who never got married earn less than married people is that the former group probably contains mostly young individuals who work part-time (for example, students saving for college), as well as younger people as a whole, who are in the beginning of their professional career. This conclusion is also in agreement with the results for the variable”age“, where we noticed that the greater the age of an individual, the higher their income. However, the same logic cannot be apllied to the other categories with low percentage of individuals with income greater than 50K –” Divorced“,” Married-spouse-absent“,” Separated" and " Widowed“. Therefore these results provide evidence that there is a correlation between income and marital status, which cannot be explained only with the confounding”age" variable.
First we show the summary statistic of “occupation”:
summary(db.adult$occupation) Adm-clerical Armed-Forces Craft-repair
3721 9 4030
Exec-managerial Farming-fishing Handlers-cleaners
3992 989 1350
Machine-op-inspct Other-service Priv-house-serv
1966 3212 143
Prof-specialty Protective-serv Sales
4038 644 3584
Tech-support Transport-moving
912 1572
With the exception of the category " Armed-Forces“, where there are 9 people, all other categories have a relatively large number of individuals and hence can be considered representative random samples. Next we visualize the percentage of people belonging to each category of the factor variable”occupation“:
db.adult$occupation <- factor(db.adult$occupation,
levels =
names(sort(table(db.adult$occupation), decreasing = TRUE)))
ggplot(db.adult,
aes(x = db.adult$occupation, fill = db.adult$occupation)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1,
size = 3.5) +
labs(x = "Occupation",
y = "Percentage",
fill = "Occupation") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = percent)We notice (see the code below) that there are no women working in the " Armed-Forces" and there are no men working in the " Priv-house-serv" sector who are earning more than 50K a year.
nrow(subset(db.adult, db.adult$sex == " Female" &
db.adult$occupation == " Armed-Forces"))[1] 0
nrow(subset(db.adult, db.adult$sex == " Male" &
db.adult$occupation == " Priv-house-serv" &
db.adult$income == " >50K"))[1] 0
Hence, for the following bar plots with percentages of women earning less than 50K and more than 50K for each type of occupation, we exclude " Armed-Forces“. Below we create a character vector containig all levels of”occupation" except for " Armed-Forces“.
modified.occup.f <- levels(db.adult$occupation)
modified.occup.f [1] " Prof-specialty" " Craft-repair" " Exec-managerial"
[4] " Adm-clerical" " Sales" " Other-service"
[7] " Machine-op-inspct" " Transport-moving" " Handlers-cleaners"
[10] " Farming-fishing" " Tech-support" " Protective-serv"
[13] " Priv-house-serv" " Armed-Forces"
modified.occup.f <- modified.occup.f[!is.element(modified.occup.f,
c(" Armed-Forces"))]
modified.occup.f [1] " Prof-specialty" " Craft-repair" " Exec-managerial"
[4] " Adm-clerical" " Sales" " Other-service"
[7] " Machine-op-inspct" " Transport-moving" " Handlers-cleaners"
[10] " Farming-fishing" " Tech-support" " Protective-serv"
[13] " Priv-house-serv"
Below we display bar plots of women’s income grouped by occupation:
lp.occupation.f <- lapply(modified.occup.f, function(v){
ggplot(data = subset(db.adult, db.adult$occupation == v
& db.adult$sex == " Female"),
aes(x = subset(db.adult, db.adult$occupation == v
& db.adult$sex == " Female")$income,
fill = subset(db.adult, db.adult$occupation == v
& db.adult$sex == " Female")$income)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = c(2, -0.1)) +
labs(x = "Income",
y = "",
fill = "Income") +
ggtitle(paste("Women in \n", v, sep="")) +
theme(legend.position = 'none',
plot.title = element_text(size = 10, face = "bold")) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lp.occupation.f[1:4], ncol = 2)grid.arrange(grobs = lp.occupation.f[5:9], ncol = 3)grid.arrange(grobs = lp.occupation.f[10:13], ncol = 3)As we can see from the output above, the biggest percentage of women with income greater than 50K is in the category " Prof-specialty" - 25.5%, followed by 24.2% in the category " Exec-managerial“. In the rest of the categories this percentage is less than 10%, with the exception of” Protective-serv" - 13.2%, and " Tech-support" - 12.9%. Below we give a summary statistic for the number of women belonging to each category of “occupation”:
summary(db.adult[db.adult$sex == " Female" , ]$occupation) Prof-specialty Craft-repair Exec-managerial
1491 216 1143
Adm-clerical Sales Other-service
2512 1248 1758
Machine-op-inspct Transport-moving Handlers-cleaners
543 90 164
Farming-fishing Tech-support Protective-serv
65 341 76
Priv-house-serv Armed-Forces
135 0
Judging by the numbers above, the groups " Adm-clerical“,” Exec-managerial“,” Machine-op-inspct“,” Other-service“,” Prof-specialty“,” Sales" and " Tech-support" can be conisdered as representative random samples, whereas the statistical inferences for the rest of the categories should be viewed with caution.
Since no men who are working in the " Priv-house-serv" sector are earning more than 50K a year, we leave out " Armed-Forces" and " Priv-house-serv" when displaying the bar plots of income for men grouped by occupation:
modified.occup.m <- levels(db.adult$occupation)
modified.occup.m [1] " Prof-specialty" " Craft-repair" " Exec-managerial"
[4] " Adm-clerical" " Sales" " Other-service"
[7] " Machine-op-inspct" " Transport-moving" " Handlers-cleaners"
[10] " Farming-fishing" " Tech-support" " Protective-serv"
[13] " Priv-house-serv" " Armed-Forces"
modified.occup.m <- modified.occup.m[!is.element(modified.occup.m,
" Priv-house-serv")]
modified.occup.m [1] " Prof-specialty" " Craft-repair" " Exec-managerial"
[4] " Adm-clerical" " Sales" " Other-service"
[7] " Machine-op-inspct" " Transport-moving" " Handlers-cleaners"
[10] " Farming-fishing" " Tech-support" " Protective-serv"
[13] " Armed-Forces"
lp.occupation.m <- lapply(modified.occup.m, function(v){
ggplot(data = subset(db.adult, db.adult$occupation == v
& db.adult$sex == " Male"),
aes(x = subset(db.adult, db.adult$occupation == v
& db.adult$sex == " Male")$income,
fill = subset(db.adult, db.adult$occupation == v
& db.adult$sex == " Male")$income)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = c(2, 1),
size = 3) +
labs(x = "Income",
y = "",
fill = "Income") +
ggtitle(paste("Men in", v, sep="")) +
theme(legend.position = 'none',
plot.title = element_text(size = 11, face = "bold")) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lp.occupation.m[1:4], ncol = 2)grid.arrange(grobs = lp.occupation.m[5:8], ncol = 2)grid.arrange(grobs = lp.occupation.m[9:10], ncol = 2)grid.arrange(grobs = lp.occupation.m[11:13], ncol = 2)58.3% of the male employees with executive-managerial position have an income of more than 50K a year, which is the highest percentage among the categories of the variable “occupation”. The second place is for the category " Prof-specialty" - 56.2%, followed by " Tech-support" - 41%. Next come the categories " Sales" and " Protective-serv" with 37.8% and 35.2%, respectively. From the summary statistic below we see the number of men in each category of the variable “occupation”. All groups, except " Armed-Forces" and " Priv-house-serv“, are representative random samples with enough number of observations.
summary(db.adult[db.adult$sex == " Male" , ]$occupation) Prof-specialty Craft-repair Exec-managerial
2547 3814 2849
Adm-clerical Sales Other-service
1209 2336 1454
Machine-op-inspct Transport-moving Handlers-cleaners
1423 1482 1186
Farming-fishing Tech-support Protective-serv
924 571 568
Priv-house-serv Armed-Forces
8 9
We observe the tendency that jobs requiring highly qualified specialists with college or university degree are paid better than jobs that are not. This conclusion seems reasonable and, indeed, reflects the real world job market.
This variable is closely realted to the variable " marital_status" and they should be considered together. From the summary below we notice that the majority of people are married because they identified themselves as " Husband" and " Wife“. This is in accordance with the summary statistic of” marital_status“, which we saw earlier.
summary(db.adult$relationship) Husband Not-in-family Other-relative Own-child
12463 7726 889 4466
Unmarried Wife
3212 1406
Next we show the percentage of people belonging to each category of “relationship”:
db.adult$relationship <- factor(db.adult$relationship,
levels =
names(sort(table(db.adult$relationship), decreasing = TRUE)))
ggplot(db.adult,
aes(x = db.adult$relationship, fill = db.adult$relationship)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1) +
labs(x = "Relationship",
y = "",
fill = "Relationship") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = percent)Naturally, the distribution of people in each category of " relationship" is connected to that of " marital_status" given below:
ggplot(db.adult,
aes(x = db.adult$marital_status, fill = db.adult$marital_status)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1,
size = 3.5) +
labs(x = "Marital Status",
y = "",
fill = "Marital Status") +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = percent)Below we give a summary statistic of the marital status of each level of the factor variable “relationship”:
summary(db.adult[db.adult$relationship == " Not-in-family", ]$marital_status) Married-civ-spouse Never-married Divorced
14 4448 2268
Separated Widowed Married-spouse-absent
383 432 181
Married-AF-spouse
0
summary(db.adult[db.adult$relationship == " Husband", ]$marital_status) Married-civ-spouse Never-married Divorced
12454 0 0
Separated Widowed Married-spouse-absent
0 0 0
Married-AF-spouse
9
summary(db.adult[db.adult$relationship == " Other-relative", ]$marital_status) Married-civ-spouse Never-married Divorced
118 548 103
Separated Widowed Married-spouse-absent
53 40 26
Married-AF-spouse
1
summary(db.adult[db.adult$relationship == " Own-child", ]$marital_status) Married-civ-spouse Never-married Divorced
83 3929 308
Separated Widowed Married-spouse-absent
90 12 43
Married-AF-spouse
1
summary(db.adult[db.adult$relationship == " Unmarried", ]$marital_status) Married-civ-spouse Never-married Divorced
0 801 1535
Separated Widowed Married-spouse-absent
413 343 120
Married-AF-spouse
0
summary(db.adult[db.adult$relationship == " Wife", ]$marital_status) Married-civ-spouse Never-married Divorced
1396 0 0
Separated Widowed Married-spouse-absent
0 0 0
Married-AF-spouse
10
As should be expected, most of the results are in agreement with the variable “marital_status”. For example, in the category " Not-in-family“, there are only 14 people who did not identified themselves correctly and said they were married and not in family at the same time. All the other individuals of the category” Not-in-family" have a proper marital status, i.e. " Divorced“,” Married-spouse-absent“,” Never-married“,” Separated" or " Widowed“. The situation is similar with the categories” Husband" and " Wife“, where there are no discrepancies between their marital and relationship status. Another interesting observation that can be made is that people with marital status” Never-married" recognized themselves as belonging to one of the four relationship categories - " Not-in-family“,” Unmarried“,” Other-relative" and " Own-child“.
Next we show bar plots of income by relationship status. First we give bar plots with counts and then with percentages. Below we give multiple bar plots side by side with the counts of people earning less than or more than 50K for each category of the variable “relationship”:
ggplot(db.adult, aes(x=db.adult$relationship, fill=db.adult$income)) +
geom_bar(position=position_dodge()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Income",
y = "Number of people",
fill = "Income") +
ggtitle("Income grouped by relationship") +
scale_y_continuous(breaks = seq(0,7000,500))Then we give bar plots with percentage of individuals having an income lower and higher than 50K for each realtionship level:
lg.relationship <- lapply(levels(db.adult$relationship), function(v){
ggplot(data = subset(db.adult, db.adult$relationship == v),
aes(x = subset(db.adult, db.adult$relationship == v)$income,
fill = subset(db.adult, db.adult$relationship == v)$income)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = c(2, -0.1),
size = 3) +
labs(x = "Income",
y = "",
fill = "Income") +
ggtitle(paste(v)) +
theme(legend.position = 'none',
plot.title = element_text(size = 11, face = "bold")) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lg.relationship[1:3], ncol = 2)grid.arrange(grobs = lg.relationship[4:6], ncol = 2)The results above are in agreement with the results for income grouped by marital status. Namely, the categories " Wife" and " Husband" are with the highest percentage of people with income greater than 50K, as was the category " Married-civ-spouse" of the variable “marital_status”, to which most of the individuals of " Wife" and " Husband" should belong. And, as was the case with “marital_status”, we observe a relationship between “income” and “relationship”.
Again we start with a summary statistic:
summary(db.adult$race) Amer-Indian-Eskimo Asian-Pac-Islander Black
286 895 2817
Other White
231 25933
The major part of individuals belong to the category “White”, followed by the category “Black”. As we can see from the graph below, 86% of the participants in the study are “White”, 9.3% – “Black”, 3% – “Asian-Pac-Islander”, 0.9% – “Amer-Indian-Eskimo” and 0.8% – " Other“:
db.adult$race <- factor(db.adult$race,
levels =
names(sort(table(db.adult$race), decreasing = TRUE)))
ggplot(db.adult,
aes(x = db.adult$race, fill = db.adult$race)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = c(-0.2, -0.2, -0.2, -0.2, 3)) +
labs(x = "Race",
y = "",
fill = "Race") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = percent)Next we show bar plots of income by race:
lg.race <- lapply(levels(db.adult$race), function(v){
ggplot(data = subset(db.adult, db.adult$race == v),
aes(x = subset(db.adult, db.adult$race == v)$income,
fill = subset(db.adult, db.adult$race == v)$income)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = c(2, -0.1)) +
labs(x = "Income",
y = "",
fill = "Income") +
ggtitle(paste(v)) +
theme(legend.position = 'none',
plot.title = element_text(size = 11, face = "bold")) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lg.race, ncol = 3)The highest percentage of people earning more than 50K a year is within the category " Asian-Pac-Islander" - 27.7%. followed by the category " White" - 26.4%.
From the output of the code below we see that there are 20380 men and 9782 women who took part in the survey:
summary(db.adult$sex) Female Male
9782 20380
In percentages this makes 67.6% male participants and more than two times less female participants - 32.4%:
ggplot(db.adult,
aes(x = db.adult$sex, fill = db.adult$sex)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1) +
labs(x = "Gender",
y = "Percentage",
fill = "Gender") +
scale_y_continuous(labels = percent)We continue with bar plots of income grouped by gender. First we show in one graph two bar plots with the number of men and women who earn less than 50K and more than 50K a year:
ggplot(db.adult, aes(x = db.adult$sex, fill = db.adult$income)) +
geom_bar(position = position_dodge()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Income",
y = "Number of people",
fill = "Income") +
ggtitle("Income grouped by gender") +
scale_y_continuous(breaks = seq(0,14500,1000))However, we find it easier to interpret bar plots with percentages rather than raw counts as shown below:
gender.income <- lapply(levels(db.adult$sex), function(v){
ggplot(data = subset(db.adult, db.adult$sex == v),
aes(x = subset(db.adult, db.adult$sex == v)$income,
fill = subset(db.adult, db.adult$sex == v)$income))+
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -0.1,
size = 3) +
labs(x = "Income",
y = "",
fill = "Income") +
ggtitle(paste(v)) +
theme(legend.position = 'none',
plot.title = element_text(size = 11, face = "bold"),
axis.text.x = element_text(hjust = 1)) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = gender.income, ncol = 2)From the latter graph we observe that the proportion of women earning more than 50K a year is only 11.4% compared to that of men, which is 31.4%. Below we show several ways to obtain the same information as that shown above without plotting graphs:
# Number of women earning less than 50K and more than 50K:
summary(db.adult$income[db.adult$sex == " Female"]) <=50K >50K
8670 1112
# Number of men earning less than 50K and more than 50K:
summary(db.adult$income[db.adult$sex == " Male"]) <=50K >50K
13984 6396
# Number of men and women earning less than 50K and more than 50K:
table(db.adult$sex, db.adult$income)
<=50K >50K
Female 8670 1112
Male 13984 6396
# Number of men and women earning less than 50K and more than 50K:
by(data = db.adult$income,
INDICES = db.adult$sex,
FUN = summary)db.adult$sex: Female
<=50K >50K
8670 1112
--------------------------------------------------------
db.adult$sex: Male
<=50K >50K
13984 6396
# Proportion of men and women with income <50K and >=50K:
prop.table(table(db.adult$sex, db.adult$income), margin = 1)
<=50K >50K
Female 0.8863218 0.1136782
Male 0.6861629 0.3138371
Next we give the bar plots of workclass grouped by gender:
lg.gender.workclass <- lapply(levels(db.adult$sex), function(v){
df <- subset(db.adult, db.adult$sex == v)
df <- within(df, workclass <- factor(workclass,
levels = names(sort(table(workclass),
decreasing = TRUE))))
ggplot(data = df,
aes(x = df$workclass,
fill = df$workclass))+
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -0.1,
size = 3) +
labs(x = "Workclass",
y = "",
fill = "Workclass") +
ggtitle(paste(v)) +
theme(legend.position = 'none',
plot.title = element_text(size = 11, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lg.gender.workclass, ncol = 2)We see that the biggest percentage of both female and male individuals are employed in the private sector - 78.1% of the women and 71.9% of the men. The bar plots of occupation by gender are shown next:
lg.gender.occupation <- lapply(levels(db.adult$sex), function(v){
df <- subset(db.adult, db.adult$sex == v)
df <- within(df, occupation <- factor(occupation,
levels = names(sort(table(occupation),
decreasing = TRUE))))
ggplot(data = df,
aes(x = df$occupation,
fill = subset(db.adult, db.adult$sex == v)$occupation))+
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -0.1,
hjust = 0.3,
size = 3) +
labs(x = "Occupation",
y = "",
fill = "Occupation") +
ggtitle(paste(v)) +
theme(legend.position = 'none',
plot.title = element_text(size = 11, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lg.gender.occupation, ncol = 2)The biggest percentage of women work in " Adm-clerical" - 25.7%, followed by " Other-service" - 18%, " Prof-specialty" - 15.2%, " Sales" - 12.8% and " Exec-managerial" - 11.7%. The remaining 16.6% are distributed among the rest of the ctaegories of “occupation”. The occupation which is most widespread among men is " Craft-repair" with 18.7%. It is followed by the categories " Exec-managerial" - with 14%, " Prof-specialty" - 12.5%, " Sales" - 11.5% and " Transport-moving" - 7.3%. The rest 36% of the male employees belong to some of the remaining 8 occupation categories. We notice that there is an overlap between the first five most popular occupation categories among men and women - these are the categories " Prof-specialty“,” Exec-managerial" and " Sales“. Below we continue with bar plots of education grouped by gender:
lg.gender.education <- lapply(levels(db.adult$sex), function(v){
df <- subset(db.adult, db.adult$sex == v)
df <- within(df, education <- factor(education,
levels = names(sort(table(education),
decreasing = TRUE))))
ggplot(data = df,
aes(x = df$education,
fill = subset(db.adult, db.adult$sex == v)$education))+
geom_bar(aes(y = (..count..)/sum(..count..))) +
geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -0.1,
size = 2.5) +
labs(x = "Education",
y = "",
fill = "Education") +
ggtitle(paste(v)) +
theme(legend.position = 'none',
plot.title = element_text(size = 11, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = percent) })
grid.arrange(grobs = lg.gender.education, ncol = 2)From the plots above we can see that the percentages of women and men belonging to each level of “education” are very similar. An exception makes the category " Doctorate“, in which there are almost two times more men than women.
We will test the independence of the categorical variables two by two with the Pearson’s Chi Square Test of Independence. The test checks the following null hypothesis
\(H_0\): The two categorical variables are independent in the considered population
against the alternative hypothesis
\(H_A\): The two categorical variables are dependent (related) in the considered population.
In order to calculate the Pearson’s chi square test statistic, we need to build a two-way contingency table with the observed counts of the categorical variables based on the sample data. Then we have to compute the expected counts under the null hypothesis. We do this using the marginal totals and the overall total: \[ E=\dfrac{\textrm{row total}\times\textrm{column total}}{\textrm{sample size}} \] The two-way contingency table for the categorical variables \(Var_1\) with \(n_1\) levels and \(Var_2\) with \(n_2\) levels is given by:
| \(level_1\) | \(level_2\) | \(\dots\) | \(level_{n_1}\) | \(\textbf{Row Total}\) | |
|---|---|---|---|---|---|
| \(level_1\) | \(O_{11}\) | \(O_{12}\) | \(\dots\) | \(O_{1,n_1}\) | \(\sum\limits_{i=1}^{n_1} O_{1i}\) |
| \(level_2\) | \(O_{21}\) | \(O_{22}\) | \(\dots\) | \(O_{2,n_1}\) | \(\sum\limits_{i=1}^{n_1} O_{2i}\) |
| \(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) |
| \(level_{n_2}\) | \(O_{n_2,1}\) | \(O_{n_2,2}\) | \(\dots\) | \(O_{n_2 n_1}\) | \(\sum\limits_{i=1}^{n_1} O_{n_2,i}\) |
| \(\textbf{Column Total}\) | \(\sum\limits_{j=1}^{n_2} O_{j1}\) | \(\sum\limits_{j=1}^{n_2} O_{j2}\) | \(\dots\) | \(\sum\limits_{j=1}^{n_2} O_{j n_1}\) | \(\textbf{Total}=\sum\limits_{i=1}^{n_1}\sum\limits_{j=1}^{n_2} O_{ij}\) |
where with \(O_{ij}\) we denote the observed count in the \((i,j)\)-th cell, and \(i=1,2,\dots, n_2\) and \(j=1,2,\dots, n_1\). Then for the expected counts we have: \[ E_{ij}=\dfrac{\left( \sum\limits_{l=1}^{n_1} O_{il} \right)\left( \sum\limits_{l=1}^{n_2} O_{lj} \right)}{\sum\limits_{i=1}^{n_1}\sum\limits_{j=1}^{n_2} O_{ij}}, \] where \(i=1,2,\dots, n_2\) and \(j=1,2,\dots, n_1\).
Our goal is to assess if the observed counts differ significantly from the expected counts. If the discrepancy between the observed and expected counts is big, this means that the two categorical variables are related. In order to evaluate this difference, we compute the following chi-square test statistic: \[ X^2 = \sum\limits_{i=1}^{n} \dfrac{\left( O_i - E_i \right)^2}{E_i}, \] where \(n=n_1\times n_2\) is the number of cells in the two-way contingency table, \(O_i\) is the observed count of the \(i\)-th cell, and \(E_i\) is the expected count of the \(i\)-th cell. The Pearson chi-square test statistic asymptotically approaches (for a very large number of observations) a chi-squared distribution \(\chi^2_{df}\), where the number of degrees of freedom is equal to: \[ df=(n_1-1)(n_2-1). \] In the above formula \(n_1\) is the number of levels of one of the categorical variables and \(n_2\) is the number of levels of the other categorical variable. In order for the \(\chi^2_{df}\) approximation to be valid, a common rule of thumb is that the number of expected cell counts is at least 5 in all cells of the two-way contingency table. However, a lot of researchers consider this rule very restrictive.
The test statistic can be written also as: \[ X^2 = \sum\limits_{i=1}^{n_1}\sum\limits_{j=1}^{n_2} \dfrac{\left( O_{ij} - E_{ij} \right)^2}{E_{ij}} \] The Pearson chi-square test statistic then is used to calculate a p-value by comparing the value of the statistic to a chi-square distribution. If \(\chi^2_{\alpha}\) is the critical value for the chi-square test at significance level \(\alpha=0.05\), then we reject the null hypothesis if \(X^2 > \chi^2_{\alpha}\). This is equivalent to the p-value being smaller than \(\alpha=0.05\), where the p-value is the probability \[ \textrm{p-value} = P\left( \chi^2_{df} > X^2 \right) \] due to chance. In other words, the p-value is the probability, under the assumption that the null hypothesis is true, of obtaining a result equal to or more extreme, due to chance alone, than the observed. The smaller the p-value, the larger the significance because it means that we do not have enough evidence to accept the null hypothesis. Hence, if \(\textrm{p-value}<\alpha\), it is very unlikely that the difference between observed and expected cell counts is due to random variation, indicating that some association between the variables is present.
Next we use the Pearson’s chi-square test of independence to check whether the categorical variable income is related to some of the other categorical variables. We will use the R functions “CrossTable()” from the package “gmodels” and the function “chisq.test()” from the base installation. The first function gives the two-way contingency table together with the Pearson’s test statistic and corresponding p-value, whereas the second function only calculates the test statistic and the p-value. In fact, “CrossTable()” calls “chisq.test()” to calculate the Pearson’s chi-square test. We will show the contingency tables only for the first three variables to illustrate how they look like. Output of “CrossTable()”:
1) The variables “sex” and “income”:
# Two-way contingency table with Pearson's chi-square
# test of independence for the variables "sex" and "income":
CrossTable(db.adult$sex, db.adult$income,
prop.chisq = TRUE,
chisq = TRUE)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 30162
| db.adult$income
db.adult$sex | <=50K | >50K | Row Total |
-------------|-----------|-----------|-----------|
Female | 8670 | 1112 | 9782 |
| 238.221 | 718.789 | |
| 0.886 | 0.114 | 0.324 |
| 0.383 | 0.148 | |
| 0.287 | 0.037 | |
-------------|-----------|-----------|-----------|
Male | 13984 | 6396 | 20380 |
| 114.342 | 345.005 | |
| 0.686 | 0.314 | 0.676 |
| 0.617 | 0.852 | |
| 0.464 | 0.212 | |
-------------|-----------|-----------|-----------|
Column Total | 22654 | 7508 | 30162 |
| 0.751 | 0.249 | |
-------------|-----------|-----------|-----------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 1416.357 d.f. = 1 p = 5.862415e-310
Pearson's Chi-squared test with Yates' continuity correction
------------------------------------------------------------
Chi^2 = 1415.286 d.f. = 1 p = 1.001553e-309
# Pearson's chi-square test of independence for the variables
# "sex" and "income"
chisq.test(db.adult$sex, db.adult$income)
Pearson's Chi-squared test with Yates' continuity correction
data: db.adult$sex and db.adult$income
X-squared = 1415.3, df = 1, p-value < 2.2e-16
The p-value is less than 0.05, which means that at the 0.05 significance level we fail to accept the null hypothesis that the two categorical variables are independent.
2) The variables “race” and “income”:
chisq.test(db.adult$race, db.adult$income)
Pearson's Chi-squared test
data: db.adult$race and db.adult$income
X-squared = 304.24, df = 4, p-value < 2.2e-16
CrossTable(db.adult$race, db.adult$income,
prop.chisq = TRUE,
chisq = TRUE)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 30162
| db.adult$income
db.adult$race | <=50K | >50K | Row Total |
--------------------|-----------|-----------|-----------|
White | 19094 | 6839 | 25933 |
| 7.558 | 22.806 | |
| 0.736 | 0.264 | 0.860 |
| 0.843 | 0.911 | |
| 0.633 | 0.227 | |
--------------------|-----------|-----------|-----------|
Black | 2451 | 366 | 2817 |
| 53.110 | 160.249 | |
| 0.870 | 0.130 | 0.093 |
| 0.108 | 0.049 | |
| 0.081 | 0.012 | |
--------------------|-----------|-----------|-----------|
Asian-Pac-Islander | 647 | 248 | 895 |
| 0.946 | 2.854 | |
| 0.723 | 0.277 | 0.030 |
| 0.029 | 0.033 | |
| 0.021 | 0.008 | |
--------------------|-----------|-----------|-----------|
Amer-Indian-Eskimo | 252 | 34 | 286 |
| 6.439 | 19.430 | |
| 0.881 | 0.119 | 0.009 |
| 0.011 | 0.005 | |
| 0.008 | 0.001 | |
--------------------|-----------|-----------|-----------|
Other | 210 | 21 | 231 |
| 7.679 | 23.171 | |
| 0.909 | 0.091 | 0.008 |
| 0.009 | 0.003 | |
| 0.007 | 0.001 | |
--------------------|-----------|-----------|-----------|
Column Total | 22654 | 7508 | 30162 |
| 0.751 | 0.249 | |
--------------------|-----------|-----------|-----------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 304.2414 d.f. = 4 p = 1.317829e-64
We reject the null hypothesis at the 0.05 significance level, because the p-values is less than 0.05. This means there is strong indication that “race” and “income” are correlated.
3) The variables “workclass” and “income”:
chisq.test(table(db.adult$workclass, db.adult$income)) Warning in chisq.test(table(db.adult$workclass, db.adult$income)): Chi-
squared approximation may be incorrect
Pearson's Chi-squared test
data: table(db.adult$workclass, db.adult$income)
X-squared = 804.16, df = 6, p-value < 2.2e-16
We get a warning message, because the contingency table probably has cells with expected cell counts less than 5:
chisq.test(table(db.adult$workclass, db.adult$income))$expectedWarning in chisq.test(table(db.adult$workclass, db.adult$income)): Chi-
squared approximation may be incorrect
<=50K >50K
Private 16738.51349 5547.486506
Self-emp-not-inc 1876.94271 622.057291
Local-gov 1552.47722 514.522777
State-gov 960.62814 318.371859
Self-emp-inc 806.65725 267.342749
Federal-gov 708.26610 234.733904
Without-pay 10.51509 3.484915
We notice that, indeed, there are two cells ([“Never-worked”, “<=50K”] and [“Never-worked”, “>50K”]) with expected cell counts equal to 0 and one cell (the cell [“Without-pay”, “>50K”]) with expected cell count equal to 3.5<5. If we look at the observed counts of the levels of “workclass”
table(db.adult$workclass)
Private Self-emp-not-inc Local-gov State-gov
22286 2499 2067 1279
Self-emp-inc Federal-gov Without-pay
1074 943 14
we see that there are no participants in the study who identify themselves as belonging to the category " Never-worked“. Therefore we remove this unused factor level from the categorical variable”workclass" with the help of the function “droplevels()”:
db.adult$workclass <- droplevels(db.adult$workclass)
levels(db.adult$workclass)[1] " Private" " Self-emp-not-inc" " Local-gov"
[4] " State-gov" " Self-emp-inc" " Federal-gov"
[7] " Without-pay"
summary(db.adult$workclass) Private Self-emp-not-inc Local-gov State-gov
22286 2499 2067 1279
Self-emp-inc Federal-gov Without-pay
1074 943 14
After we removed the cells with zero expected counts, we perform again the Pearson’s chi-square test:
CrossTable(db.adult$workclass, db.adult$income,
prop.chisq = TRUE,
chisq = TRUE)Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
may be incorrect
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 30162
| db.adult$income
db.adult$workclass | <=50K | >50K | Row Total |
-------------------|-----------|-----------|-----------|
Private | 17410 | 4876 | 22286 |
| 26.938 | 81.279 | |
| 0.781 | 0.219 | 0.739 |
| 0.769 | 0.649 | |
| 0.577 | 0.162 | |
-------------------|-----------|-----------|-----------|
Self-emp-not-inc | 1785 | 714 | 2499 |
| 4.504 | 13.590 | |
| 0.714 | 0.286 | 0.083 |
| 0.079 | 0.095 | |
| 0.059 | 0.024 | |
-------------------|-----------|-----------|-----------|
Local-gov | 1458 | 609 | 2067 |
| 5.749 | 17.348 | |
| 0.705 | 0.295 | 0.069 |
| 0.064 | 0.081 | |
| 0.048 | 0.020 | |
-------------------|-----------|-----------|-----------|
State-gov | 935 | 344 | 1279 |
| 0.684 | 2.063 | |
| 0.731 | 0.269 | 0.042 |
| 0.041 | 0.046 | |
| 0.031 | 0.011 | |
-------------------|-----------|-----------|-----------|
Self-emp-inc | 474 | 600 | 1074 |
| 137.184 | 413.929 | |
| 0.441 | 0.559 | 0.036 |
| 0.021 | 0.080 | |
| 0.016 | 0.020 | |
-------------------|-----------|-----------|-----------|
Federal-gov | 578 | 365 | 943 |
| 23.959 | 72.291 | |
| 0.613 | 0.387 | 0.031 |
| 0.026 | 0.049 | |
| 0.019 | 0.012 | |
-------------------|-----------|-----------|-----------|
Without-pay | 14 | 0 | 14 |
| 1.155 | 3.485 | |
| 1.000 | 0.000 | 0.000 |
| 0.001 | 0.000 | |
| 0.000 | 0.000 | |
-------------------|-----------|-----------|-----------|
Column Total | 22654 | 7508 | 30162 |
| 0.751 | 0.249 | |
-------------------|-----------|-----------|-----------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 804.1575 d.f. = 6 p = 1.946096e-170
We get a warning message once more because there is still one cell with expected cell count less than 5. However, we ignore the warning message and interpret with caution the results. The p-value is very small (it is approximately equal to 0), which means that we reject the null hypothesis at the 0.05 significance level.
4) The variables “occupation” and “income”:
chisq.test(db.adult$occupation, db.adult$income)Warning in chisq.test(db.adult$occupation, db.adult$income): Chi-squared
approximation may be incorrect
Pearson's Chi-squared test
data: db.adult$occupation and db.adult$income
X-squared = 3687.6, df = 13, p-value < 2.2e-16
We get a warning message again. Therefore we check if there are any cells with expected count less than 5:
chisq.test(db.adult$occupation, db.adult$income)$expected Warning in chisq.test(db.adult$occupation, db.adult$income): Chi-squared
approximation may be incorrect
db.adult$income
db.adult$occupation <=50K >50K
Prof-specialty 3032.851005 1005.148995
Craft-repair 3026.842384 1003.157616
Exec-managerial 2998.301439 993.698561
Adm-clerical 2794.759432 926.240568
Sales 2691.861813 892.138187
Other-service 2412.460977 799.539023
Machine-op-inspct 1476.618394 489.381606
Transport-moving 1180.693853 391.306147
Handlers-cleaners 1013.954645 336.045355
Farming-fishing 742.815662 246.184338
Tech-support 684.982693 227.017307
Protective-serv 483.693920 160.306080
Priv-house-serv 107.404085 35.595915
Armed-Forces 6.759698 2.240302
It turns out that there is one cell with expected count of 2.2 and that is the cell [“Armed-Forces”, “>50K”]. Since there is only one problematic cell, we consider the Pearson’s test trustworthy. The calculated p-value is very small and hence we fail to accept the null hypothesis, which means that there is strong indication that the categorical variables “occupation” and “income” are dependent.
5) The variables “education” and “income”:
chisq.test(db.adult$education, db.adult$income)
Pearson's Chi-squared test
data: db.adult$education and db.adult$income
X-squared = 4070.4, df = 15, p-value < 2.2e-16
6) The variables “marital_status” and “income”:
chisq.test(db.adult$marital_status, db.adult$income)
Pearson's Chi-squared test
data: db.adult$marital_status and db.adult$income
X-squared = 6061.7, df = 6, p-value < 2.2e-16
7) The variables “relationship” and “income”:
chisq.test(db.adult$relationship, db.adult$income)
Pearson's Chi-squared test
data: db.adult$relationship and db.adult$income
X-squared = 6233.8, df = 5, p-value < 2.2e-16
8) The variables “native_region” and “income”:
chisq.test(db.adult$native_region, db.adult$income)
Pearson's Chi-squared test
data: db.adult$native_region and db.adult$income
X-squared = 235.96, df = 7, p-value < 2.2e-16
9) The variables “hours_w” and “income”:
chisq.test(db.adult$hours_w, db.adult$income)
Pearson's Chi-squared test
data: db.adult$hours_w and db.adult$income
X-squared = 1940.4, df = 4, p-value < 2.2e-16
All of the rest Pearson’s chi-square tests give very small p-values, which means that it is very unlikely for the considered categorical variables - “education”, “marital_status”, “relationship”, “native_region” and “hours_w” not to be related with “income”.