1 Introduction

This is part 2 of the Census Data Project. In this project we analyze a U.S. census data taken from the UCI (University of California at Irvine) Machine Learning Repository. The project is divided into four parts. The first part of the project is Cleaning and Preprocessing the Data. The third part is Predictive Analysis and the last part is Theoretical Background. Our final goal is to build a model, which can predict whether the income of a random adult American citizen is less or greater than 50000$ a year based on given features, such as age, education, occupation, gender, race, etc. In the first part of the project – Cleaning and Preprocessing the Data, we introduced the U.S. census dataset and we cleaned the data and transformed it into a form suitable for further analysis. In this part of the project we perform exploratory data analysis. We investigate the predictor variables as well as their relationship with the response variable “income”.

First we load the necessary packages:

library(ggplot2)
library(plyr)
library(gridExtra)
library(gmodels)
library(grid)
library(vcd)
library(scales)
library(ggthemes)

2 Reading the Preprocessed Dataset

In the first part of the project we cleaned and preprocessed the data, and we wrote the transformed datasets into csv files. Below we set the working directory and then we read the preprocessed training dataset into the “db.adult” data frame:

setwd("D:/Data_Science_Projects/Census_DataSet")

db.adult <- read.csv("adult_df.csv")

3 Analysis of the Variables and Their Correlation with Income

3.1 The variable “income”

We start with the response variable (called also dependent variable). In the next part of the project - Predictive Analysis, our goal will be to build a model that predicts if a person earns more than 50,000 dollars a year. We recall that “income” is a factor variable with two levels:

class(db.adult$income)
[1] "factor"
levels(db.adult$income)
[1] " <=50K" " >50K" 

First we show a summary statisitc:

summary(db.adult$income)
 <=50K   >50K 
 22654   7508 

and then we visualize the above results with a bar plot:

ggplot(data = db.adult, 
       mapping = aes(x = db.adult$income, fill = db.adult$income)) + 
  geom_bar(mapping = aes(y = (..count..)/sum(..count..))) +
  geom_text(mapping = aes(label = scales::percent((..count..)/sum(..count..)),
                      y = (..count..)/sum(..count..) ), 
            stat = "count",
            vjust = -.1) +
  labs(x = "Income", 
       y = "",
       fill = "Income") +
  scale_y_continuous(labels = percent)

The graph above shows us the percentage of people earning less than 50K a year and more than 50K. We see that 75.1% of the participants in the study are paid less than 50K and 24.9% are paid more than 50K.

3.2 The variables “capital_gain”/“cap_gain” and “capital_loss”/“cap_loss”

3.2.1 Nonzero “capital_gain” and “capital_loss”

Below we show the box plots only of the nonzero capital gain and loss grouped by income. The mean value is depicted with a filled red dot and the black horizontal line inside the boxes is the median.

First we consider the variable “capital_gain”. We can see that for people earning more than 50K a year, the bulk of the values (50% of the data points) as well as the median, and the mean value of the capital gain are significantly greater than these of people earning less than 50K:

ggplot(mapping = aes(x = income, y = capital_gain),
       data = subset(db.adult, db.adult$capital_gain > 0)) + 
  geom_boxplot() +
  stat_summary(fun.y = mean,
               geom = 'point', 
               shape = 19,
               color = "red",
               cex = 2) +
  labs(x = "Income", 
       y = "Capital Gain") +
  ggtitle("Box Plot of Nonzero Capital Gain by Income") 

# Zoomed version:
ggplot(mapping = aes(x = income, y = capital_gain),
       data = subset(db.adult, db.adult$capital_gain > 0)) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, 
               geom = 'point', 
               shape = 19,
               color = "red",
               cex = 2) +
  coord_cartesian(ylim = c(0, 30000)) +
  scale_y_continuous(breaks = seq(0, 30000, 1500)) +
  labs(x = "Income", 
       y = "Capital Gain") +
  ggtitle("Box Plot of Nonzero Capital Gain by Income")  

Next we give the box plot of nonzero capital loss grouped by income. Here we observe the same tendency as for capital gain: the mean and median for people having an income of more than 50K a year is bigger than that of people earning less than 50K a year. This might be due to the fact that people with higher income are more prone to invest more money more often. This leads to higher chances of not only good but also bad investments resulting in bigger losses.

ggplot(mapping = aes(x = income, y = capital_loss),
       data = subset(db.adult, db.adult$capital_loss > 0)) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, 
               geom = 'point', 
               shape = 19,
               color = "red",
               cex = 2) +
  labs(x = "Income",
       y = "Capital Loss") +
  ggtitle("Box Plot of Nonzero Capital Loss by Income") 

# Zoomed version:
ggplot(mapping = aes(x = income, y = capital_loss),
       data = subset(db.adult, db.adult$capital_loss > 0)) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, 
               geom = 'point', 
               shape = 19,
               color = "red",
               cex = 2) +
  coord_cartesian(ylim = c(0, 3000))+
  scale_y_continuous(breaks = seq(0, 3000, 200)) +
  labs(x = "Income", 
       y = "Capital Loss") +
  ggtitle("Box Plot of Nonzero Capital Loss by Income")  

As a conclusion, we can say that there is evidence for strong relationship between the nonzero values of “capital_gain” and “capital_loss”, and “income”. However, we will not include these variables in the predictive model because of the extremely high number of zeros among their values. Let us recall that, as we discussed in part 1 of the project – Cleaning and Preprocessing the Data, the number of zeros in these variables is more than 90%. This means that less than 10% of the participants in the survey make any investments. This gives rise to the question of whether to include these variables in the predictive model at the first place. For the predictive analysis in the second part of the project we will include the categorical variables “cap_gain” and “cap_loss” as covariates, but we will also fit a predictive model without them. If “cap_gain” and “cap_loss” do not contribute to the prediction accuracy of the model, we will drop them.

3.2.2 “cap_gain” and “cap_loss”

Next, we will examine the relationship between the factor variables “cap_gain” and “cap_loss” and the categorical variable “income”. We will show bar plots of the two variables grouped by income. We do this with the help of the functions “lapply()” and “grid.arrange()”. As the R documentation states: “lapply(X, FUN, …) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X, where X is a vector or list.” In our case we create a list called “lg_cap.gain”, which consists of 2 graphical objects (2 = number of levels of the factor variable “income”). After that we plot the two graphical objects (which are bar plots with vertical bars) in one plot with the function “grid.arrange()” from the “gridExtra” package.

First we show the bar plot of the variable “cap_gain”, grouped by the values of “income”:

# Bar plot of cap_gain by income:

lg_cap.gain <- lapply(X = levels(db.adult$income), FUN = function(v){
  
  df <- subset(db.adult, db.adult$income == v)    
  
  df <- within(df, cap_gain <- factor(cap_gain, 
                                      levels = names(sort(table(cap_gain), 
                                                          decreasing = TRUE))))
    
  ggplot(data = df, 
         aes(x = cap_gain, 
             fill = cap_gain)) + 
    geom_bar(aes(y = (..count..)/sum(..count..))) +
    geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
              stat = "count",
              vjust = -.1) +
    labs(x = "Capital Gain", 
         y = "",
         fill = "Capital Gain") +
    theme(legend.position = 'none') +
    ggtitle(paste("Income", v, sep = "")) +  
    scale_y_continuous(labels = percent) })


grid.arrange(grobs = lg_cap.gain, ncol = 2)

From the bar plot we see that 0.0% of the people who earn less than 50K a year have a high capital gain. In order to check if this result is due to some round-off error, below we display the number of individuals with income of less than 50K and high capital gain:

nrow(subset(db.adult, db.adult$cap_gain == " High" &
                      db.adult$income == " <=50K"))
[1] 6

As we can see from the output above, there are, indeed, no people with low annual income and high capital gain. The bar plot also shows that the proportion of people who have medium and high capital gain is much bigger within the group of people with income of more than 50K a year compared to the respective proportion within the group of people with income of less than 50K. Therefore, we can conclude that there is a realtionship between “cap_gain” and “income”.

Further, we consider the variable “cap_loss”:

# Bar plot of cap_loss by income:

lg_cap.loss <- lapply(levels(db.adult$income), function(v){
    
  df <- subset(db.adult, db.adult$income == v)    
  
  df <- within(df, cap_loss <- factor(cap_loss, 
                                      levels = names(sort(table(cap_loss), 
                                                          decreasing = TRUE))))    
  
  ggplot(data = df, 
         aes(x = cap_loss, 
             fill = cap_loss)) + 
    geom_bar(aes(y = (..count..)/sum(..count..))) +
    geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
              stat = "count",
              vjust = -.1) +
    labs(x = "Capital Loss", 
         y = "",
         fill = "Capital Loss") +
    theme(legend.position = 'none') +
    ggtitle(paste("Income", v, sep="")) +  
    scale_y_continuous(labels = percent) })


grid.arrange(grobs = lg_cap.loss, ncol = 2)

Here we observe the same trend as in the case of the variable “cap_gain”, which in this case might be due (as we already mentioned) to the more and bigger (i.e. riskier) investments that people with higher income tend to make.

3.3 The variable “age”

We begin with displaying the summary of age as well as its IQR:

summary(db.adult$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.00   28.00   37.00   38.44   47.00   90.00 
IQR(db.adult$age)
[1] 19

The median age is 37 years and the mean age is 38 years. The summary shows that at least 50% of the people in the study are between 28 and 47 years old, which makes sense since the participants in the survey should be of working age. Of course, there are some outliers, such as individuals being between 75 and 90 years old. To visualize the summary statistic we also show a box plot of the variable “age”:

ggplot(mapping = aes(x = factor(0), y = age),
       data = db.adult) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, 
               geom = 'point', 
               shape = 19,
               color = "red",
               cex = 2) +
  coord_cartesian(ylim = c(10, 100)) +
  scale_y_continuous(breaks = seq(10, 100, 5)) +
  ylab("Age") +
  xlab("") +  
  ggtitle("Box Plot of Age") +
  scale_x_discrete(breaks = NULL)

From the histogram of “age” below we can see that the bulk of individuals are bewtween 20 and 50 years old:

qplot(x = db.adult$age, 
      data = db.adult, 
      binwidth = 5, 
      color = I('black'), 
      fill = I('#F29025'),
      xlab = "Age",
      ylab = "Count",
      main = "Histogram of Age") +
  scale_x_continuous(breaks = seq(0, 95, 5)) +   
  scale_y_continuous(breaks = seq(0, 4500, 500))

Next we display the empirical density of age grouped by income. From the plot we see that the majority of people earning more than 50K a year are between 33 and 55 years old, whereas the the greater number of people who earn less than 50K a year are between 18 and 45:

ggplot(data = db.adult, aes(age, fill = income)) + 
  geom_density(alpha = 0.2) +
  scale_x_continuous(breaks = seq(0, 95, 5))

The density plot above clearly shows that age and income are correlated – people of greater age have higher income. This can be also seen from the following histograms of age by income:

ggplot(data = db.adult, mapping = aes(x = age)) + 
  geom_histogram(binwidth = 5,
                 color = "black",
                 fill = "lightblue",
                 alpha = 0.6) +
  scale_x_continuous(breaks = seq(0, 95, 5)) + 
  facet_wrap(~income) +
  ggtitle("Histogram of Age by Income") 

Next we give the box plot of age grouped by income:

ggplot(aes(x = income, y = age),
       data = db.adult) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, 
               geom = "point", 
               shape = 16, 
               cex = 2, 
               col = "red") +
  coord_cartesian(ylim = c(10, 90))+
  scale_y_continuous(breaks = seq(10, 90, 5)) +
  ylab("Age") +
  xlab("Income") +  
  ggtitle("Box Plot of Age by Income") 

In the box plot above, the filled red dot represents the mean value and the black horizontal line is the median. Once again we see the relationship between age and income. People who earn more that 50K a year are on average about 43-44 years old, whereas people who earn less than 50K a year, are younger on average - around 34-37. This can be also be seen by the summary statistics below:

summary(subset(db.adult$age, db.adult$income == " <=50K"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.00   26.00   34.00   36.61   45.00   90.00 
summary(subset(db.adult$age, db.adult$income == " >50K"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  19.00   36.00   43.00   43.96   51.00   90.00 

From the above output and the box plots we also notice that the first quartiles (25-th percentiles) for both groups differ significantly. The first quartile for people who have an income of more than 50K is equal to 36, whereas the first quartile for people earning less than 50K equals 26. This means that the elder an individual is, the bigger their chances of having a higher income are. This can be explained by the fact that the older one gets, the more experienced professionally they become, which usually means a higher salary.

Below we show one more plot in order to demonstrate the correlation between age and income. We display a conditional density plot of income versus age:

cd_plot(x = db.adult$age, 
        y = db.adult$income,
        xlab = "Age",
        ylab = "Income",
        main = "Conditionl Density Plot of Income versus Age")

From the plot we see that the probability of having an income greater than 50K is biggest (about 60%) for individuals in their 50s, and is smallest for people in their 20s.

Finally, we show a plot of mean working hours per week versus age, grouped by gender. From this graph we see that on average men work more hours per week than women at almost all ages. There is an exception for people between 77 and 80 and between 85 and 90, where women have more working hours per week on average. However, the latter observations are outliers (data which are highly unlikely for this study, because it is very unusual for people to work when they are above 80 years old), we have to consider these extreme data points with caution. We also notice that the mean number of woring hours per week for people between 25 and 60 years old (active working years) is 40 hours for women and 45 hours for men. For people above 65 years old the average working hours per week gradually decrease and vary between about 15 and 30 hours.

ggplot(aes(x = age, y = hours_per_week),
       data = db.adult) + 
  geom_line(mapping = aes(color = sex), 
            stat = 'summary', 
            fun.y = mean) +
  geom_smooth(mapping = aes(color = sex)) + 
  scale_x_continuous(breaks = seq(10, 100, 5)) +
  scale_y_continuous(breaks = seq(0, 55, 5)) +  
  labs(x = "Age", y = "Mean Hours per Week") +
  ggtitle("Age vs. Mean Hours per Week by Gender")

3.4 The variables “hours_per_week” and “hours_w”

3.4.1 “hours_per_week”

Below is a summary statistic of the variable “hours_per_week”:

summary(db.adult$hours_per_week)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   40.00   40.00   40.93   45.00   99.00 
IQR(db.adult$hours_per_week)
[1] 5

The mean number of working hours per week is 41 and at least 50% of the participants in the study work between 40 and 45 hours a week. Next we show the box plot of “hours_per_week” which visualizes the summary statistic. We see that there are many outliers, i.e. there are many people who work less than 40 hours a week or more than 45 hours a week:

ggplot(aes(x = factor(0), y = hours_per_week),
       data = db.adult) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, 
               geom = "point", 
               shape = 19,
               color = "red",
               cex = 2) +
  coord_cartesian(ylim = c(10, 100)) +
  scale_x_discrete(breaks = NULL) +
  scale_y_continuous(breaks = seq(10, 100, 5)) +
  ylab("Hours per Week") +
  xlab("") +  
  ggtitle("Box plot of Hours per Week") 

# Zoomed box plot of hours per week:
ggplot(aes(x = factor(0), y = hours_per_week),
       data = db.adult) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, 
               geom = "point", 
               color = "red",
               shape = 19,
               cex = 2) +
  coord_cartesian(ylim = c(30, 50)) +
  scale_x_discrete(breaks = NULL) +
  scale_y_continuous(breaks = seq(30, 50, 1)) +
  ylab("Hours per Week") +
  xlab("") +  
  ggtitle("Box plot of Hours per Week") 

Since we are interested in the relationship between income and working hours per week, we display the box plot of hours per week grouped by income:

ggplot(aes(x = income, y = hours_per_week),
       data = db.adult) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, 
               geom = 'point', 
               shape = 19, 
               color = "red", 
               cex = 2) +
  coord_cartesian(ylim = c(10, 100))+
  scale_y_continuous(breaks = seq(10, 100, 10)) +
  ylab("Hours per Week") +
  xlab("Income") +  
  ggtitle("Box plot of Hours per Week by Income") 

# Zoom of the box plot of hours per week by income:
ggplot(aes(x = income, y = hours_per_week),
       data = db.adult) + 
  geom_boxplot() +
  stat_summary(fun.y = mean, 
               geom = 'point', 
               shape = 19, 
               color = "red", 
               cex = 2) +
  coord_cartesian(ylim = c(30, 60))+
  scale_y_continuous(breaks = seq(30, 60, 1)) +
  ylab("Hours per Week") +
  xlab("Income") +  
  ggtitle("Box Plot of Hours per Week by Income") 

As we can see from the produced box plot, the mean number of working hours per week is significantly greater for people who earn more than 50K a year compared to people earning less than 50K a year. We also notice that the two medians are equal, but for income<=50K the median coincides with the third quartile, whereas for the data corresponding to income>50K, the respective median coincides with the first quartile. The exact numbers can be observed better with the help of the “summary()” function:

summary(subset(db.adult$hours_per_week, db.adult$income == " <=50K"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   38.00   40.00   39.35   40.00   99.00 
summary(subset(db.adult$hours_per_week, db.adult$income == " >50K"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   40.00   40.00   45.71   50.00   99.00 

Both medians (50-th percentiles) are equal to 40 hours per week, but the mean value for income<=50K is approximately 39 compared to 46 for income>50K. Also, at least 50% of the people with income of less than 50K work between 38 and 40 hours a week. In contrast, at least 50% of the individuals earning more than 50K a year work between 40 and 50 hours a week, which is considerably more than to the group of people with income of less than 50K. This shows us that there is a pronounced correlation between income and working hours per week.

Finally we show a graph of the mean working hours per week versus age, grouped by income:

ggplot(mapping = aes(x = age, y = hours_per_week),
       data = db.adult) + 
  geom_line(mapping = aes(color = income), 
            stat = 'summary', 
            fun.y = mean) +
  geom_smooth(mapping = aes(color = income)) +
  scale_x_continuous(breaks = seq(10, 100, 5)) +
  labs(x = "Age", 
       y = "Mean Hours per Week") +
  ggtitle("Mean Hours per Week vs. Age")

From the above plot we see that for all age groups the mean number of working hours per week is greater for people with income>50K than for people with income<=50K.

3.4.2 “hours_w”

From the summary statistic below we see that the bulk of people work between 40 and 45 hours a week, which we already determined.

summary(db.adult$hours_w)
 between_40_and_45  between_45_and_60  between_60_and_80 
             16606               5790                857 
      less_than_40       more_than_80 
              6714                195 

Next we give a bar plot of “hours_w”, which shows the percentage of people belonging to each category of the factor variable. In order to display the bars in decreasing order based on the number of people belonging to each category of “hours_w”, we first rearrange the levels of the factor variable (which are ordered in alphabetical order by default):

db.adult <- within(db.adult, hours_w <- factor(hours_w, levels = 
                                                   names(sort(table(hours_w), 
                                                         decreasing = TRUE))))

ggplot(db.adult, 
       aes(x = db.adult$hours_w, fill = db.adult$hours_w)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
            stat = "count",
            vjust = 0.3) +
  labs(x = "Hours per week", 
       y = "",
       fill = "Hours per week") +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Bar Plot of Hours per Week") +  
  scale_y_continuous(labels = percent)

We observe that 55.1% of all individuals work between 40 and 45 hours a week; 22.3% work less than 40 hours a week; 19.2% work between 45 and 60 hours a week; 2.8% work between 60 nad 80 hours a week and 0.6% work more than 80 hours a week.

Finally we show a bar plot of “hours_w” grouped by “income”:

lg_hpw <- lapply(levels(db.adult$income), function(v){
    
    df <- subset(db.adult, db.adult$income == v)  
    
    df <- within(df, hours_w <- factor(hours_w, 
                                       levels = names(sort(table(hours_w), 
                                                           decreasing = TRUE))))
  
    ggplot(data = df, 
         aes(x = hours_w, 
             fill = hours_w)) + 
      geom_bar(aes(y = (..count..)/sum(..count..))) +
      geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                    y = (..count..)/sum(..count..) ), 
                stat = "count",
                vjust = -.1,
                size = 3) +
      labs(x = "Hours per week", 
           y = "",
           fill = "Hours per week") +
      theme(legend.position = 'none',
            axis.text.x = element_text(angle = 45, hjust = 1)) +
      ggtitle(paste("Income", v, sep="")) + 
      scale_y_continuous(labels = percent) })


grid.arrange(grobs = lg_hpw, ncol = 2)

We see that approximately the same percentage of people in both categories of “income” work between 40 and 45 hours a week, but the proportion of people with income>50K who work between 45 and 60 hours a week is 33.2% compared to 14.6% for that of people with income<=50K.

3.5 The variable “native_region”

We start with a summary statistic:

summary(db.adult$native_region)
 Central-America     Central-Asia        East-Asia      Europe-East 
            1226              142              304               85 
     Europe-West      Outlying-US    South-America    United-States 
             408              380              113            27504 

The majority of people come from the United States and Central America.

Next we show a bar plot with the percentage of people belonging to each native region:

db.adult$native_region <- factor(db.adult$native_region, 
                                 levels = 
                                     names(sort(table(db.adult$native_region),                                                 decreasing = TRUE)))

ggplot(db.adult, 
       aes(x = db.adult$native_region, fill = db.adult$native_region)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
            stat = "count",
            vjust = -.1) +
  labs(x = "Region", 
       y = "",
       fill = "Regions") +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = 45, hjust = 1)) +  
  scale_y_continuous(labels = percent)

As we can see from the graphs above, 91.2% of the participants in the study come from the USA. This means that we have a small number of people from each of the other native regions (especially for East Europe - only 85 people, South America -113 people and Central Asia - 142 people) leading to random samples which might not be representative for the respective population. Therefore further analysis must be carried out with caution.

Lastly we display the percentage of people earning less than 50K and more than 50K among all individuals belonging to a given native region:

lp_region <- lapply(levels(db.adult$native_region), function(v){
    
    df <- subset(db.adult, db.adult$native_region == v) 
  
    ggplot(data = df, 
           aes(x = income, 
               fill = income)) + 
      geom_bar(aes(y = (..count..)/sum(..count..))) +
      geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                    y = (..count..)/sum(..count..) ), 
                stat = "count",
                vjust = c(2, -0.1),
                size = 4) +
      labs(x = "Income", 
           y = "",
           fill = "Income") +
      ggtitle(v) +    
      theme(legend.position = 'none',
            plot.title = element_text(size = 11, face = "bold")) +     
      scale_y_continuous(labels = percent) })


grid.arrange(grobs = lp_region[1:4], ncol = 2)

grid.arrange(grobs = lp_region[5:8], ncol = 2)

As we can see from the graphics above, 40.8% of the people with Central-Asian origin earn more than 50K a year, which is the highest percentage among all native regions. However, we cannot rely completely on this inference, because there are only 142 participants with Central-Asian origin compared to 27504 people with domestic (US) roots. If we ignore the small number of individuals which comprise the samples for the rest of the native regions, we can make similar conclusions. Namely, the second place is for individuals of West-European descent - 31.6%, followed by Outlying-US with 28.9% and East Asia - with 28%. The fifth place is for people from the United States - 25.4%, followed by East-European descendents with 23.5% and the last place is for individuals with South-American origin - 7.1%. If we disregard the small number of observations for the non-US native regions, the latter results indicate that “income” is dependent on “native_region”.

3.6 The variable “workclass”

The summary statistic of “workclass” is given below:

table(db.adult$workclass)

      Federal-gov         Local-gov           Private      Self-emp-inc 
              943              2067             22286              1074 
 Self-emp-not-inc         State-gov       Without-pay 
             2499              1279                14 

We notice that the majority of people are employed in the private sector. We also see that there are no people in the category “Never-worked”, and only 14 of the participants in the study, or 0.05%, are without pay.

Next we show a graph with the percentage of people belonging to each category of “workclass”:

db.adult$workclass <- factor(db.adult$workclass, 
                             levels = 
                                     names(sort(table(db.adult$workclass),                                                    decreasing = TRUE)))

ggplot(db.adult, 
       aes(x = db.adult$workclass, fill = db.adult$workclass)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
            stat = "count",
            vjust = -.1,
            size = 3.5) +
  labs(x = "Employment type", 
       y = "",
       fill = "Employment type") +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = 45, hjust = 1)) +   
  scale_y_continuous(labels = percent)

Based on the summary statistic of “workclass” we observed that there are no people in the category " Never-worked“:

nrow(subset(db.adult , db.adult$workclass == " Never-worked"))
[1] 0

We also notice that there are no people in the category “Without-pay” who earn more than 50K a year, which seems obvious and logical:

nrow(subset(db.adult , db.adult$workclass == " Without-pay" & 
                       db.adult$income == " >50K"))
[1] 0

Therefore, in order to make the plots more readable and meaningful, we exclude the factor levels " Never-worked" and " Without-pay“. We do this in the following way:

  1. First we create the character vector “modified.work”, whose elements are the levels of the nominal variable “workclass”;

  2. Then we transform the vector by excluding from it the elements " Never-worked" and " Without-pay“.

modified.work <- levels(db.adult$workclass)

modified.work
[1] " Private"          " Self-emp-not-inc" " Local-gov"       
[4] " State-gov"        " Self-emp-inc"     " Federal-gov"     
[7] " Without-pay"     
modified.work <- modified.work[!is.element(modified.work, 
                                           c(" Never-worked",
                                             " Without-pay"))]

modified.work
[1] " Private"          " Self-emp-not-inc" " Local-gov"       
[4] " State-gov"        " Self-emp-inc"     " Federal-gov"     

Finally we plot the percentage of people earning less than 50K and more than 50K based on their employment status:

lg.workclass.mod <- lapply(modified.work, function(v){
  
  ggplot(data = subset(db.adult, db.adult$workclass == v), 
         aes(x = subset(db.adult, db.adult$workclass == v)$income, 
             fill = subset(db.adult, db.adult$workclass == v)$income)) +
    geom_bar(aes(y = (..count..)/sum(..count..))) +
    geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
              stat = "count",
              vjust = c(2, 1.5)) +
    labs(x = "Income", 
         y = "",
         fill = "Income") +
    ggtitle(v) +  
    theme(legend.position = 'none',
          plot.title = element_text(size = 11, face = "bold")) +
    scale_y_continuous(labels = percent) })


grid.arrange(grobs = lg.workclass.mod[1:3], ncol = 2)

grid.arrange(grobs = lg.workclass.mod[4:6], ncol = 2)

From the graphs above, we can see that the percentage of individuals having an income of more than 50K is biggest for the category “Self-emp-inc” (self-employed people with income) – 55.9%. The group with the second highest percentage of people earning more than 50K is that of federal government employees. There is no significant difference between the categories " Local-gov" (local government), " Self-emp-not-inc" (self-employed individuals without income) and " State-gov" (state government jobs) - about 27-28% of the workers in these branches have an annual pay of more than 50000$. In the private sector only about 22% of the people earn more than 50K a year. These results demonstrate that there is a relationship between the variables “income” and “workclass”.

3.7 The variable “education”

We start with a summary of “education”:

summary(db.adult$education)
         10th          11th          12th       1st-4th       5th-6th 
          820          1048           377           151           288 
      7th-8th           9th    Assoc-acdm     Assoc-voc     Bachelors 
          557           455          1008          1307          5044 
    Doctorate       HS-grad       Masters     Preschool   Prof-school 
          375          9840          1627            45           542 
 Some-college 
         6678 

The majority of people have a high school degree - 9840, college degreee - 6678 and bachelor degree - 5044. The bar plot below shows the percentage of people belonging to each category of “education”:

db.adult$education <- factor(db.adult$education, 
                                 levels = 
                                     names(sort(table(db.adult$education),                                                    decreasing = TRUE)))

ggplot(db.adult, 
       aes(x = db.adult$education, fill = db.adult$education)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
            stat = "count",
            vjust = -.1,
            size = 3.5) +
  labs(x = "Education", 
       y = "",
       fill = "Education") +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = percent)

Since there are no people with education “Preschool” who earn more than 50K a year, as we can see below,

nrow(subset(db.adult, db.adult$education == " Preschool" &
                      db.adult$income == " >50K" ))
[1] 0

we will remove the factor level “Preschool” before we continue further with the analysis. In order to do that we create a character vector “modified.edu” with elements equal to the factor levels of “education”, and then we alter the vector by removing the element " Preschool“:

modified.edu <- levels(db.adult$education)

modified.edu
 [1] " HS-grad"      " Some-college" " Bachelors"    " Masters"     
 [5] " Assoc-voc"    " 11th"         " Assoc-acdm"   " 10th"        
 [9] " 7th-8th"      " Prof-school"  " 9th"          " 12th"        
[13] " Doctorate"    " 5th-6th"      " 1st-4th"      " Preschool"   
modified.edu <- modified.edu[!is.element(modified.edu, " Preschool")]

modified.edu
 [1] " HS-grad"      " Some-college" " Bachelors"    " Masters"     
 [5] " Assoc-voc"    " 11th"         " Assoc-acdm"   " 10th"        
 [9] " 7th-8th"      " Prof-school"  " 9th"          " 12th"        
[13] " Doctorate"    " 5th-6th"      " 1st-4th"     

After that, we display the bar plot of each education category grouped by income:

lg.mod.edu <- lapply(modified.edu, function(v){
  
  ggplot(data = subset(db.adult, db.adult$education == v), 
         aes(x = subset(db.adult, db.adult$education == v)$income, 
             fill = subset(db.adult, db.adult$education == v)$income)) +
    geom_bar(aes(y = (..count..)/sum(..count..))) +
    geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                  y = (..count..)/sum(..count..) ), 
             stat = "count",
             vjust =  c(2, 0.5),
             size = 3) +
    labs(x = "Income", 
         y = "",
         fill = "Income") +
    ggtitle(v) +  
    theme(legend.position = 'none',
          plot.title = element_text(size = 11, face = "bold")) +    
    scale_y_continuous(labels = percent) })


grid.arrange(grobs = lg.mod.edu[1:4], ncol = 2)

grid.arrange(grobs = lg.mod.edu[5:8], ncol = 2)

grid.arrange(grobs = lg.mod.edu[9:12], ncol = 2)

grid.arrange(grobs = lg.mod.edu[13:15], ncol = 2)

The categories " 1st-4th“,” 5th-6th“,” 7th-8th“,” 9th“,” 10th“,” 11th" and " 12th" have a very small percentage of people with income greater than 50K a year. The percentage of people with a high school degree who earn more than 50K is also relatively small - 16.4%. 20% of the individuals in the category " Some-college" earn more than 50K. The biggest percentage of employees (74.9%), who have an annual income higher than 50K, belongs to the category " Prof-school“. The” Doctorate" group is next with 74.7%, followed by the categories " Masters" - 56.4% and " Bachelors" - 42.1%. These results clearly indicate that there is a relationship between education and income. In order to demonstrate this correlation visually, we show one more graph - a conditional density plot of income versus education number (“education_num”):

cd_plot(x = db.adult$education_num, 
        y = db.adult$income,
        xlab = "Education number",
        ylab = "Income",
        main = "Conditionl Density Plot of Income vs. Education Number")

Each number (from 1 to 16) in this integer variable corresponds to an education level from the factor variable “education”, starting from the lowest level (“Preschool”) and reaching the highest education level - “Doctorate”. We will investigate further the variable “education_num” and its relation to the variable “education” in the third part of the project – Predictive Analysis. As we can see from the conditional density plot above, the higher the education level, the greater the chances are an individual to earn more than 50K a year.

3.8 The variable “marital_status”

From the summary below we see how many people belong to each category of the variable “marital_status”:

summary(db.adult$marital_status)
              Divorced      Married-AF-spouse     Married-civ-spouse 
                  4214                     21                  14065 
 Married-spouse-absent          Never-married              Separated 
                   370                   9726                    939 
               Widowed 
                   827 

The biggest number of people are married to a civilian spouse - 14065. A significant number of individuals belong to the group " Never-married" - 9726, followd by divorced people - 4214. A very small number of participants in the study are married to an army spouse - 21. Below we visualize the percentage of people belonging to each category:

db.adult$marital_status <- factor(db.adult$marital_status, 
                                 levels = 
                                     names(sort(table(db.adult$marital_status),                                               decreasing = TRUE)))

ggplot(db.adult, 
       aes(x = db.adult$marital_status, fill = db.adult$marital_status)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
            stat = "count",
            vjust = -.1,
            size = 3.5) +
  labs(x = "Marital Status", 
       y = "",
       fill = "Marital Status") +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = percent)

Below we give the bar plots of income grouped by marital status:

lp_marital <- lapply(levels(db.adult$marital_status), function(v){

  ggplot(data = subset(db.adult, db.adult$marital_status == v),
         aes(x = subset(db.adult, db.adult$marital_status == v)$income,
             fill = subset(db.adult, db.adult$marital_status == v)$income)) +   
    geom_bar(aes(y = (..count..)/sum(..count..))) +
    geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                  y = (..count..)/sum(..count..) ),
              stat = "count",
              vjust = c(2, -0.1)) +
    labs(x = "Income",
         y = "",
         fill = "Income") +
    ggtitle(v) +
    theme(legend.position = 'none',
          plot.title = element_text(size = 11, face = "bold")) +
    scale_y_continuous(labels = percent) })


grid.arrange(grobs = lp_marital[1:3], ncol = 2)

grid.arrange(grobs = lp_marital[4:7], ncol = 2)

As we see from the graphs above, the biggest percentage of employees with income higher than 50K are those from the category " Married-AF-spouse“. However, since there are only 25 observations in this category, we cannot draw trustworthy conclusions regarding the income of the individuals belonging to this group. On the other hand, the random sample for the category” Married-civ-spouse" amounts to 14065 individuals and can be considered representative. For this category, the percentage of people with income of more than 50K is very high - 45.5%. The same cannot be said for the groups " Divorced“,” Never-married“,” Married-spouse-absent“,” Separated" and " Widowed“, where the percentage of people with income higher than 50K varies between 4.8% and 10.7%. One explanation as to why people who never got married earn less than married people is that the former group probably contains mostly young individuals who work part-time (for example, students saving for college), as well as younger people as a whole, who are in the beginning of their professional career. This conclusion is also in agreement with the results for the variable”age“, where we noticed that the greater the age of an individual, the higher their income. However, the same logic cannot be apllied to the other categories with low percentage of individuals with income greater than 50K –” Divorced“,” Married-spouse-absent“,” Separated" and " Widowed“. Therefore these results provide evidence that there is a correlation between income and marital status, which cannot be explained only with the confounding”age" variable.

3.9 The variable “occupation”

First we show the summary statistic of “occupation”:

summary(db.adult$occupation)
      Adm-clerical       Armed-Forces       Craft-repair 
              3721                  9               4030 
   Exec-managerial    Farming-fishing  Handlers-cleaners 
              3992                989               1350 
 Machine-op-inspct      Other-service    Priv-house-serv 
              1966               3212                143 
    Prof-specialty    Protective-serv              Sales 
              4038                644               3584 
      Tech-support   Transport-moving 
               912               1572 

With the exception of the category " Armed-Forces“, where there are 9 people, all other categories have a relatively large number of individuals and hence can be considered representative random samples. Next we visualize the percentage of people belonging to each category of the factor variable”occupation“:

db.adult$occupation <- factor(db.adult$occupation, 
                                 levels = 
                                     names(sort(table(db.adult$occupation),                                                   decreasing = TRUE)))

ggplot(db.adult,
       aes(x = db.adult$occupation, fill = db.adult$occupation)) +
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ),
            stat = "count",
            vjust = -.1,
            size = 3.5) +
  labs(x = "Occupation",
       y = "Percentage",
       fill = "Occupation") +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = percent)

We notice (see the code below) that there are no women working in the " Armed-Forces" and there are no men working in the " Priv-house-serv" sector who are earning more than 50K a year.

nrow(subset(db.adult, db.adult$sex == " Female" &
                      db.adult$occupation == " Armed-Forces"))
[1] 0
nrow(subset(db.adult, db.adult$sex == " Male" &
                      db.adult$occupation == " Priv-house-serv" &
                      db.adult$income == " >50K"))
[1] 0

Hence, for the following bar plots with percentages of women earning less than 50K and more than 50K for each type of occupation, we exclude " Armed-Forces“. Below we create a character vector containig all levels of”occupation" except for " Armed-Forces“.

modified.occup.f <- levels(db.adult$occupation)
modified.occup.f
 [1] " Prof-specialty"    " Craft-repair"      " Exec-managerial"  
 [4] " Adm-clerical"      " Sales"             " Other-service"    
 [7] " Machine-op-inspct" " Transport-moving"  " Handlers-cleaners"
[10] " Farming-fishing"   " Tech-support"      " Protective-serv"  
[13] " Priv-house-serv"   " Armed-Forces"     
modified.occup.f <- modified.occup.f[!is.element(modified.occup.f, 
                                             c(" Armed-Forces"))]
modified.occup.f
 [1] " Prof-specialty"    " Craft-repair"      " Exec-managerial"  
 [4] " Adm-clerical"      " Sales"             " Other-service"    
 [7] " Machine-op-inspct" " Transport-moving"  " Handlers-cleaners"
[10] " Farming-fishing"   " Tech-support"      " Protective-serv"  
[13] " Priv-house-serv"  

Below we display bar plots of women’s income grouped by occupation:

lp.occupation.f <- lapply(modified.occup.f, function(v){
  
  ggplot(data = subset(db.adult, db.adult$occupation == v 
                       & db.adult$sex == " Female"), 
         aes(x = subset(db.adult, db.adult$occupation == v 
                        & db.adult$sex == " Female")$income, 
             fill = subset(db.adult, db.adult$occupation == v 
                           & db.adult$sex == " Female")$income)) + 
    geom_bar(aes(y = (..count..)/sum(..count..))) +
    geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                  y = (..count..)/sum(..count..) ), 
              stat = "count",
              vjust = c(2, -0.1)) +
    labs(x = "Income", 
         y = "",
         fill = "Income") +
    ggtitle(paste("Women in \n", v, sep="")) +
    theme(legend.position = 'none',
          plot.title = element_text(size = 10, face = "bold")) + 
    scale_y_continuous(labels = percent) })


grid.arrange(grobs = lp.occupation.f[1:4], ncol = 2)

grid.arrange(grobs = lp.occupation.f[5:9], ncol = 3)

grid.arrange(grobs = lp.occupation.f[10:13], ncol = 3)

As we can see from the output above, the biggest percentage of women with income greater than 50K is in the category " Prof-specialty" - 25.5%, followed by 24.2% in the category " Exec-managerial“. In the rest of the categories this percentage is less than 10%, with the exception of” Protective-serv" - 13.2%, and " Tech-support" - 12.9%. Below we give a summary statistic for the number of women belonging to each category of “occupation”:

summary(db.adult[db.adult$sex == " Female" , ]$occupation)
    Prof-specialty       Craft-repair    Exec-managerial 
              1491                216               1143 
      Adm-clerical              Sales      Other-service 
              2512               1248               1758 
 Machine-op-inspct   Transport-moving  Handlers-cleaners 
               543                 90                164 
   Farming-fishing       Tech-support    Protective-serv 
                65                341                 76 
   Priv-house-serv       Armed-Forces 
               135                  0 

Judging by the numbers above, the groups " Adm-clerical“,” Exec-managerial“,” Machine-op-inspct“,” Other-service“,” Prof-specialty“,” Sales" and " Tech-support" can be conisdered as representative random samples, whereas the statistical inferences for the rest of the categories should be viewed with caution.

Since no men who are working in the " Priv-house-serv" sector are earning more than 50K a year, we leave out " Armed-Forces" and " Priv-house-serv" when displaying the bar plots of income for men grouped by occupation:

modified.occup.m <- levels(db.adult$occupation)
modified.occup.m
 [1] " Prof-specialty"    " Craft-repair"      " Exec-managerial"  
 [4] " Adm-clerical"      " Sales"             " Other-service"    
 [7] " Machine-op-inspct" " Transport-moving"  " Handlers-cleaners"
[10] " Farming-fishing"   " Tech-support"      " Protective-serv"  
[13] " Priv-house-serv"   " Armed-Forces"     
modified.occup.m <- modified.occup.m[!is.element(modified.occup.m, 
                                                 " Priv-house-serv")]
modified.occup.m
 [1] " Prof-specialty"    " Craft-repair"      " Exec-managerial"  
 [4] " Adm-clerical"      " Sales"             " Other-service"    
 [7] " Machine-op-inspct" " Transport-moving"  " Handlers-cleaners"
[10] " Farming-fishing"   " Tech-support"      " Protective-serv"  
[13] " Armed-Forces"     
lp.occupation.m <- lapply(modified.occup.m, function(v){
  
  ggplot(data = subset(db.adult, db.adult$occupation == v 
                       & db.adult$sex == " Male"), 
         aes(x = subset(db.adult, db.adult$occupation == v 
                        & db.adult$sex == " Male")$income, 
             fill = subset(db.adult, db.adult$occupation == v 
                           & db.adult$sex == " Male")$income)) + 
    geom_bar(aes(y = (..count..)/sum(..count..))) +
    geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                  y = (..count..)/sum(..count..) ), 
              stat = "count",
              vjust = c(2, 1),
              size = 3) +
    labs(x = "Income", 
         y = "",
         fill = "Income") +
    ggtitle(paste("Men in", v, sep="")) +  
    theme(legend.position = 'none',
          plot.title = element_text(size = 11, face = "bold")) +     
    scale_y_continuous(labels = percent) })


grid.arrange(grobs = lp.occupation.m[1:4], ncol = 2)

grid.arrange(grobs = lp.occupation.m[5:8], ncol = 2)

grid.arrange(grobs = lp.occupation.m[9:10], ncol = 2)

grid.arrange(grobs = lp.occupation.m[11:13], ncol = 2)

58.3% of the male employees with executive-managerial position have an income of more than 50K a year, which is the highest percentage among the categories of the variable “occupation”. The second place is for the category " Prof-specialty" - 56.2%, followed by " Tech-support" - 41%. Next come the categories " Sales" and " Protective-serv" with 37.8% and 35.2%, respectively. From the summary statistic below we see the number of men in each category of the variable “occupation”. All groups, except " Armed-Forces" and " Priv-house-serv“, are representative random samples with enough number of observations.

summary(db.adult[db.adult$sex == " Male" , ]$occupation)
    Prof-specialty       Craft-repair    Exec-managerial 
              2547               3814               2849 
      Adm-clerical              Sales      Other-service 
              1209               2336               1454 
 Machine-op-inspct   Transport-moving  Handlers-cleaners 
              1423               1482               1186 
   Farming-fishing       Tech-support    Protective-serv 
               924                571                568 
   Priv-house-serv       Armed-Forces 
                 8                  9 

We observe the tendency that jobs requiring highly qualified specialists with college or university degree are paid better than jobs that are not. This conclusion seems reasonable and, indeed, reflects the real world job market.

3.10 The variable “relationship”

This variable is closely realted to the variable " marital_status" and they should be considered together. From the summary below we notice that the majority of people are married because they identified themselves as " Husband" and " Wife“. This is in accordance with the summary statistic of” marital_status“, which we saw earlier.

summary(db.adult$relationship)
        Husband   Not-in-family  Other-relative       Own-child 
          12463            7726             889            4466 
      Unmarried            Wife 
           3212            1406 

Next we show the percentage of people belonging to each category of “relationship”:

db.adult$relationship <- factor(db.adult$relationship, 
                                 levels = 
                                     names(sort(table(db.adult$relationship),                                                 decreasing = TRUE)))

ggplot(db.adult, 
       aes(x = db.adult$relationship, fill = db.adult$relationship)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
            stat = "count",
            vjust = -.1) +
  labs(x = "Relationship", 
       y = "",
       fill = "Relationship") +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = 45, hjust = 1)) +  
  scale_y_continuous(labels = percent)

Naturally, the distribution of people in each category of " relationship" is connected to that of " marital_status" given below:

ggplot(db.adult, 
       aes(x = db.adult$marital_status, fill = db.adult$marital_status)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
            stat = "count",
            vjust = -.1,
            size = 3.5) +
  labs(x = "Marital Status", 
       y = "",
       fill = "Marital Status") +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = percent)

Below we give a summary statistic of the marital status of each level of the factor variable “relationship”:

summary(db.adult[db.adult$relationship == " Not-in-family", ]$marital_status)
    Married-civ-spouse          Never-married               Divorced 
                    14                   4448                   2268 
             Separated                Widowed  Married-spouse-absent 
                   383                    432                    181 
     Married-AF-spouse 
                     0 
summary(db.adult[db.adult$relationship == " Husband", ]$marital_status)
    Married-civ-spouse          Never-married               Divorced 
                 12454                      0                      0 
             Separated                Widowed  Married-spouse-absent 
                     0                      0                      0 
     Married-AF-spouse 
                     9 
summary(db.adult[db.adult$relationship == " Other-relative", ]$marital_status)
    Married-civ-spouse          Never-married               Divorced 
                   118                    548                    103 
             Separated                Widowed  Married-spouse-absent 
                    53                     40                     26 
     Married-AF-spouse 
                     1 
summary(db.adult[db.adult$relationship == " Own-child", ]$marital_status)
    Married-civ-spouse          Never-married               Divorced 
                    83                   3929                    308 
             Separated                Widowed  Married-spouse-absent 
                    90                     12                     43 
     Married-AF-spouse 
                     1 
summary(db.adult[db.adult$relationship == " Unmarried", ]$marital_status)
    Married-civ-spouse          Never-married               Divorced 
                     0                    801                   1535 
             Separated                Widowed  Married-spouse-absent 
                   413                    343                    120 
     Married-AF-spouse 
                     0 
summary(db.adult[db.adult$relationship == " Wife", ]$marital_status)
    Married-civ-spouse          Never-married               Divorced 
                  1396                      0                      0 
             Separated                Widowed  Married-spouse-absent 
                     0                      0                      0 
     Married-AF-spouse 
                    10 

As should be expected, most of the results are in agreement with the variable “marital_status”. For example, in the category " Not-in-family“, there are only 14 people who did not identified themselves correctly and said they were married and not in family at the same time. All the other individuals of the category” Not-in-family" have a proper marital status, i.e. " Divorced“,” Married-spouse-absent“,” Never-married“,” Separated" or " Widowed“. The situation is similar with the categories” Husband" and " Wife“, where there are no discrepancies between their marital and relationship status. Another interesting observation that can be made is that people with marital status” Never-married" recognized themselves as belonging to one of the four relationship categories - " Not-in-family“,” Unmarried“,” Other-relative" and " Own-child“.

Next we show bar plots of income by relationship status. First we give bar plots with counts and then with percentages. Below we give multiple bar plots side by side with the counts of people earning less than or more than 50K for each category of the variable “relationship”:

ggplot(db.adult, aes(x=db.adult$relationship, fill=db.adult$income)) +
  geom_bar(position=position_dodge()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Income", 
       y = "Number of people",
       fill = "Income") +
  ggtitle("Income grouped by relationship") +   
  scale_y_continuous(breaks = seq(0,7000,500))

Then we give bar plots with percentage of individuals having an income lower and higher than 50K for each realtionship level:

lg.relationship <- lapply(levels(db.adult$relationship), function(v){
  
ggplot(data = subset(db.adult, db.adult$relationship == v), 
         aes(x = subset(db.adult, db.adult$relationship == v)$income, 
             fill = subset(db.adult, db.adult$relationship == v)$income)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
           stat = "count",
           vjust = c(2, -0.1),
           size = 3) +
  labs(x = "Income", 
       y = "",
       fill = "Income") +
  ggtitle(paste(v)) +  
  theme(legend.position = 'none',
        plot.title = element_text(size = 11, face = "bold")) +     
  scale_y_continuous(labels = percent) })


grid.arrange(grobs = lg.relationship[1:3], ncol = 2)

grid.arrange(grobs = lg.relationship[4:6], ncol = 2)

The results above are in agreement with the results for income grouped by marital status. Namely, the categories " Wife" and " Husband" are with the highest percentage of people with income greater than 50K, as was the category " Married-civ-spouse" of the variable “marital_status”, to which most of the individuals of " Wife" and " Husband" should belong. And, as was the case with “marital_status”, we observe a relationship between “income” and “relationship”.

3.11 The variable “race”

Again we start with a summary statistic:

summary(db.adult$race) 
 Amer-Indian-Eskimo  Asian-Pac-Islander               Black 
                286                 895                2817 
              Other               White 
                231               25933 

The major part of individuals belong to the category “White”, followed by the category “Black”. As we can see from the graph below, 86% of the participants in the study are “White”, 9.3% – “Black”, 3% – “Asian-Pac-Islander”, 0.9% – “Amer-Indian-Eskimo” and 0.8% – " Other“:

db.adult$race <- factor(db.adult$race, 
                                 levels = 
                                     names(sort(table(db.adult$race),                                                         decreasing = TRUE)))

ggplot(db.adult, 
       aes(x = db.adult$race, fill = db.adult$race)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
            stat = "count",
            vjust = c(-0.2, -0.2, -0.2, -0.2, 3)) +
  labs(x = "Race", 
       y = "",
       fill = "Race") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  
  scale_y_continuous(labels = percent)

Next we show bar plots of income by race:

lg.race <- lapply(levels(db.adult$race), function(v){
  
  ggplot(data = subset(db.adult, db.adult$race == v), 
         aes(x = subset(db.adult, db.adult$race == v)$income, 
             fill = subset(db.adult, db.adult$race == v)$income)) + 
    geom_bar(aes(y = (..count..)/sum(..count..))) +
    geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                  y = (..count..)/sum(..count..) ), 
              stat = "count",
              vjust = c(2, -0.1)) +
    labs(x = "Income", 
         y = "",
         fill = "Income") +
    ggtitle(paste(v)) +  
    theme(legend.position = 'none',
          plot.title = element_text(size = 11, face = "bold")) +     
    scale_y_continuous(labels = percent) })


grid.arrange(grobs = lg.race, ncol = 3)

The highest percentage of people earning more than 50K a year is within the category " Asian-Pac-Islander" - 27.7%. followed by the category " White" - 26.4%.

3.12 The variable “sex”

From the output of the code below we see that there are 20380 men and 9782 women who took part in the survey:

summary(db.adult$sex)
 Female    Male 
   9782   20380 

In percentages this makes 67.6% male participants and more than two times less female participants - 32.4%:

ggplot(db.adult, 
       aes(x = db.adult$sex, fill = db.adult$sex)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
            stat = "count",
            vjust = -.1) +
  labs(x = "Gender", 
       y = "Percentage",
       fill = "Gender") +
  scale_y_continuous(labels = percent)

We continue with bar plots of income grouped by gender. First we show in one graph two bar plots with the number of men and women who earn less than 50K and more than 50K a year:

ggplot(db.adult, aes(x = db.adult$sex, fill = db.adult$income)) +
  geom_bar(position = position_dodge()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Income", 
       y = "Number of people",
       fill = "Income") +
  ggtitle("Income grouped by gender") +   
  scale_y_continuous(breaks = seq(0,14500,1000))

However, we find it easier to interpret bar plots with percentages rather than raw counts as shown below:

gender.income <- lapply(levels(db.adult$sex), function(v){
  
ggplot(data = subset(db.adult, db.adult$sex == v), 
         aes(x = subset(db.adult, db.adult$sex == v)$income, 
             fill = subset(db.adult, db.adult$sex == v)$income))+
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
            stat = "count",
            vjust = -0.1,
            size = 3) +
  labs(x = "Income", 
       y = "",
       fill = "Income") +
  ggtitle(paste(v)) +  
  theme(legend.position = 'none',
        plot.title = element_text(size = 11, face = "bold"),
        axis.text.x = element_text(hjust = 1)) +     
  scale_y_continuous(labels = percent) })

grid.arrange(grobs = gender.income, ncol = 2)

From the latter graph we observe that the proportion of women earning more than 50K a year is only 11.4% compared to that of men, which is 31.4%. Below we show several ways to obtain the same information as that shown above without plotting graphs:

# Number of women earning less than 50K and more than 50K:
summary(db.adult$income[db.adult$sex == " Female"])
 <=50K   >50K 
  8670   1112 
# Number of men earning less than 50K and more than 50K:
summary(db.adult$income[db.adult$sex == " Male"])
 <=50K   >50K 
 13984   6396 
# Number of men and women earning less than 50K and more than 50K:
table(db.adult$sex, db.adult$income)
         
           <=50K  >50K
   Female   8670  1112
   Male    13984  6396
# Number of men and women earning less than 50K and more than 50K:
by(data = db.adult$income, 
   INDICES = db.adult$sex, 
   FUN = summary)
db.adult$sex:  Female
 <=50K   >50K 
  8670   1112 
-------------------------------------------------------- 
db.adult$sex:  Male
 <=50K   >50K 
 13984   6396 
# Proportion of men and women with income <50K and >=50K:
prop.table(table(db.adult$sex, db.adult$income), margin = 1)
         
              <=50K      >50K
   Female 0.8863218 0.1136782
   Male   0.6861629 0.3138371

Next we give the bar plots of workclass grouped by gender:

lg.gender.workclass <- lapply(levels(db.adult$sex), function(v){
    
    df <- subset(db.adult, db.adult$sex == v)  
    
    df <- within(df, workclass <- factor(workclass, 
                                       levels = names(sort(table(workclass), 
                                                           decreasing = TRUE))))
  
  
ggplot(data = df, 
       aes(x = df$workclass, 
           fill = df$workclass))+
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                y = (..count..)/sum(..count..) ), 
            stat = "count",
            vjust = -0.1,
            size = 3) +
  labs(x = "Workclass", 
       y = "",
       fill = "Workclass") +
  ggtitle(paste(v)) +  
  theme(legend.position = 'none',
        plot.title = element_text(size = 11, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1)) +     
  scale_y_continuous(labels = percent) })

grid.arrange(grobs = lg.gender.workclass, ncol = 2)

We see that the biggest percentage of both female and male individuals are employed in the private sector - 78.1% of the women and 71.9% of the men. The bar plots of occupation by gender are shown next:

lg.gender.occupation <- lapply(levels(db.adult$sex), function(v){
    
    df <- subset(db.adult, db.adult$sex == v)  
    
    df <- within(df, occupation <- factor(occupation, 
                                       levels = names(sort(table(occupation), 
                                                           decreasing = TRUE))))
  
   ggplot(data = df, 
          aes(x = df$occupation, 
              fill = subset(db.adult, db.adult$sex == v)$occupation))+
    geom_bar(aes(y = (..count..)/sum(..count..))) +
    geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                  y = (..count..)/sum(..count..) ), 
              stat = "count",
              vjust = -0.1,
              hjust = 0.3,
              size = 3) +
    labs(x = "Occupation", 
         y = "",
         fill = "Occupation") +
    ggtitle(paste(v)) +  
    theme(legend.position = 'none',
          plot.title = element_text(size = 11, face = "bold"),
          axis.text.x = element_text(angle = 45, hjust = 1)) +     
    scale_y_continuous(labels = percent) })


grid.arrange(grobs = lg.gender.occupation, ncol = 2)

The biggest percentage of women work in " Adm-clerical" - 25.7%, followed by " Other-service" - 18%, " Prof-specialty" - 15.2%, " Sales" - 12.8% and " Exec-managerial" - 11.7%. The remaining 16.6% are distributed among the rest of the ctaegories of “occupation”. The occupation which is most widespread among men is " Craft-repair" with 18.7%. It is followed by the categories " Exec-managerial" - with 14%, " Prof-specialty" - 12.5%, " Sales" - 11.5% and " Transport-moving" - 7.3%. The rest 36% of the male employees belong to some of the remaining 8 occupation categories. We notice that there is an overlap between the first five most popular occupation categories among men and women - these are the categories " Prof-specialty“,” Exec-managerial" and " Sales“. Below we continue with bar plots of education grouped by gender:

lg.gender.education <- lapply(levels(db.adult$sex), function(v){
    
    df <- subset(db.adult, db.adult$sex == v)  
    
    df <- within(df, education <- factor(education, 
                                       levels = names(sort(table(education), 
                                                           decreasing = TRUE))))
      
  
   ggplot(data = df, 
         aes(x = df$education, 
             fill = subset(db.adult, db.adult$sex == v)$education))+
     geom_bar(aes(y = (..count..)/sum(..count..))) +
     geom_text(aes(label = scales::percent((..count..)/sum(..count..)),
                   y = (..count..)/sum(..count..) ), 
               stat = "count",
               vjust = -0.1,
               size = 2.5) +
     labs(x = "Education", 
          y = "",
          fill = "Education") +
     ggtitle(paste(v)) +  
     theme(legend.position = 'none',
           plot.title = element_text(size = 11, face = "bold"),
           axis.text.x = element_text(angle = 45, hjust = 1)) +     
     scale_y_continuous(labels = percent) })


grid.arrange(grobs = lg.gender.education, ncol = 2)

From the plots above we can see that the percentages of women and men belonging to each level of “education” are very similar. An exception makes the category " Doctorate“, in which there are almost two times more men than women.

4 Tests for Independence of the Variables

We will test the independence of the categorical variables two by two with the Pearson’s Chi Square Test of Independence. The test checks the following null hypothesis

\(H_0\): The two categorical variables are independent in the considered population

against the alternative hypothesis

\(H_A\): The two categorical variables are dependent (related) in the considered population.

In order to calculate the Pearson’s chi square test statistic, we need to build a two-way contingency table with the observed counts of the categorical variables based on the sample data. Then we have to compute the expected counts under the null hypothesis. We do this using the marginal totals and the overall total: \[ E=\dfrac{\textrm{row total}\times\textrm{column total}}{\textrm{sample size}} \] The two-way contingency table for the categorical variables \(Var_1\) with \(n_1\) levels and \(Var_2\) with \(n_2\) levels is given by:

\(level_1\) \(level_2\) \(\dots\) \(level_{n_1}\) \(\textbf{Row Total}\)
\(level_1\) \(O_{11}\) \(O_{12}\) \(\dots\) \(O_{1,n_1}\) \(\sum\limits_{i=1}^{n_1} O_{1i}\)
\(level_2\) \(O_{21}\) \(O_{22}\) \(\dots\) \(O_{2,n_1}\) \(\sum\limits_{i=1}^{n_1} O_{2i}\)
\(\dots\) \(\dots\) \(\dots\) \(\dots\) \(\dots\) \(\dots\)
\(level_{n_2}\) \(O_{n_2,1}\) \(O_{n_2,2}\) \(\dots\) \(O_{n_2 n_1}\) \(\sum\limits_{i=1}^{n_1} O_{n_2,i}\)
\(\textbf{Column Total}\) \(\sum\limits_{j=1}^{n_2} O_{j1}\) \(\sum\limits_{j=1}^{n_2} O_{j2}\) \(\dots\) \(\sum\limits_{j=1}^{n_2} O_{j n_1}\) \(\textbf{Total}=\sum\limits_{i=1}^{n_1}\sum\limits_{j=1}^{n_2} O_{ij}\)

where with \(O_{ij}\) we denote the observed count in the \((i,j)\)-th cell, and \(i=1,2,\dots, n_2\) and \(j=1,2,\dots, n_1\). Then for the expected counts we have: \[ E_{ij}=\dfrac{\left( \sum\limits_{l=1}^{n_1} O_{il} \right)\left( \sum\limits_{l=1}^{n_2} O_{lj} \right)}{\sum\limits_{i=1}^{n_1}\sum\limits_{j=1}^{n_2} O_{ij}}, \] where \(i=1,2,\dots, n_2\) and \(j=1,2,\dots, n_1\).

Our goal is to assess if the observed counts differ significantly from the expected counts. If the discrepancy between the observed and expected counts is big, this means that the two categorical variables are related. In order to evaluate this difference, we compute the following chi-square test statistic: \[ X^2 = \sum\limits_{i=1}^{n} \dfrac{\left( O_i - E_i \right)^2}{E_i}, \] where \(n=n_1\times n_2\) is the number of cells in the two-way contingency table, \(O_i\) is the observed count of the \(i\)-th cell, and \(E_i\) is the expected count of the \(i\)-th cell. The Pearson chi-square test statistic asymptotically approaches (for a very large number of observations) a chi-squared distribution \(\chi^2_{df}\), where the number of degrees of freedom is equal to: \[ df=(n_1-1)(n_2-1). \] In the above formula \(n_1\) is the number of levels of one of the categorical variables and \(n_2\) is the number of levels of the other categorical variable. In order for the \(\chi^2_{df}\) approximation to be valid, a common rule of thumb is that the number of expected cell counts is at least 5 in all cells of the two-way contingency table. However, a lot of researchers consider this rule very restrictive.

The test statistic can be written also as: \[ X^2 = \sum\limits_{i=1}^{n_1}\sum\limits_{j=1}^{n_2} \dfrac{\left( O_{ij} - E_{ij} \right)^2}{E_{ij}} \] The Pearson chi-square test statistic then is used to calculate a p-value by comparing the value of the statistic to a chi-square distribution. If \(\chi^2_{\alpha}\) is the critical value for the chi-square test at significance level \(\alpha=0.05\), then we reject the null hypothesis if \(X^2 > \chi^2_{\alpha}\). This is equivalent to the p-value being smaller than \(\alpha=0.05\), where the p-value is the probability \[ \textrm{p-value} = P\left( \chi^2_{df} > X^2 \right) \] due to chance. In other words, the p-value is the probability, under the assumption that the null hypothesis is true, of obtaining a result equal to or more extreme, due to chance alone, than the observed. The smaller the p-value, the larger the significance because it means that we do not have enough evidence to accept the null hypothesis. Hence, if \(\textrm{p-value}<\alpha\), it is very unlikely that the difference between observed and expected cell counts is due to random variation, indicating that some association between the variables is present.

Next we use the Pearson’s chi-square test of independence to check whether the categorical variable income is related to some of the other categorical variables. We will use the R functions “CrossTable()” from the package “gmodels” and the function “chisq.test()” from the base installation. The first function gives the two-way contingency table together with the Pearson’s test statistic and corresponding p-value, whereas the second function only calculates the test statistic and the p-value. In fact, “CrossTable()” calls “chisq.test()” to calculate the Pearson’s chi-square test. We will show the contingency tables only for the first three variables to illustrate how they look like. Output of “CrossTable()”:

1) The variables “sex” and “income”:

# Two-way contingency table with Pearson's chi-square 
# test of independence for the variables "sex" and "income":
CrossTable(db.adult$sex, db.adult$income, 
           prop.chisq = TRUE,
           chisq = TRUE)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  30162 

 
             | db.adult$income 
db.adult$sex |     <=50K |      >50K | Row Total | 
-------------|-----------|-----------|-----------|
      Female |      8670 |      1112 |      9782 | 
             |   238.221 |   718.789 |           | 
             |     0.886 |     0.114 |     0.324 | 
             |     0.383 |     0.148 |           | 
             |     0.287 |     0.037 |           | 
-------------|-----------|-----------|-----------|
        Male |     13984 |      6396 |     20380 | 
             |   114.342 |   345.005 |           | 
             |     0.686 |     0.314 |     0.676 | 
             |     0.617 |     0.852 |           | 
             |     0.464 |     0.212 |           | 
-------------|-----------|-----------|-----------|
Column Total |     22654 |      7508 |     30162 | 
             |     0.751 |     0.249 |           | 
-------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  1416.357     d.f. =  1     p =  5.862415e-310 

Pearson's Chi-squared test with Yates' continuity correction 
------------------------------------------------------------
Chi^2 =  1415.286     d.f. =  1     p =  1.001553e-309 

 
# Pearson's chi-square test of independence for the variables
# "sex" and "income"
chisq.test(db.adult$sex, db.adult$income)

    Pearson's Chi-squared test with Yates' continuity correction

data:  db.adult$sex and db.adult$income
X-squared = 1415.3, df = 1, p-value < 2.2e-16

The p-value is less than 0.05, which means that at the 0.05 significance level we fail to accept the null hypothesis that the two categorical variables are independent.

2) The variables “race” and “income”:

chisq.test(db.adult$race, db.adult$income)

    Pearson's Chi-squared test

data:  db.adult$race and db.adult$income
X-squared = 304.24, df = 4, p-value < 2.2e-16
CrossTable(db.adult$race, db.adult$income, 
           prop.chisq = TRUE,
           chisq = TRUE)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  30162 

 
                    | db.adult$income 
      db.adult$race |     <=50K |      >50K | Row Total | 
--------------------|-----------|-----------|-----------|
              White |     19094 |      6839 |     25933 | 
                    |     7.558 |    22.806 |           | 
                    |     0.736 |     0.264 |     0.860 | 
                    |     0.843 |     0.911 |           | 
                    |     0.633 |     0.227 |           | 
--------------------|-----------|-----------|-----------|
              Black |      2451 |       366 |      2817 | 
                    |    53.110 |   160.249 |           | 
                    |     0.870 |     0.130 |     0.093 | 
                    |     0.108 |     0.049 |           | 
                    |     0.081 |     0.012 |           | 
--------------------|-----------|-----------|-----------|
 Asian-Pac-Islander |       647 |       248 |       895 | 
                    |     0.946 |     2.854 |           | 
                    |     0.723 |     0.277 |     0.030 | 
                    |     0.029 |     0.033 |           | 
                    |     0.021 |     0.008 |           | 
--------------------|-----------|-----------|-----------|
 Amer-Indian-Eskimo |       252 |        34 |       286 | 
                    |     6.439 |    19.430 |           | 
                    |     0.881 |     0.119 |     0.009 | 
                    |     0.011 |     0.005 |           | 
                    |     0.008 |     0.001 |           | 
--------------------|-----------|-----------|-----------|
              Other |       210 |        21 |       231 | 
                    |     7.679 |    23.171 |           | 
                    |     0.909 |     0.091 |     0.008 | 
                    |     0.009 |     0.003 |           | 
                    |     0.007 |     0.001 |           | 
--------------------|-----------|-----------|-----------|
       Column Total |     22654 |      7508 |     30162 | 
                    |     0.751 |     0.249 |           | 
--------------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  304.2414     d.f. =  4     p =  1.317829e-64 


 

We reject the null hypothesis at the 0.05 significance level, because the p-values is less than 0.05. This means there is strong indication that “race” and “income” are correlated.

3) The variables “workclass” and “income”:

chisq.test(table(db.adult$workclass, db.adult$income)) 
Warning in chisq.test(table(db.adult$workclass, db.adult$income)): Chi-
squared approximation may be incorrect

    Pearson's Chi-squared test

data:  table(db.adult$workclass, db.adult$income)
X-squared = 804.16, df = 6, p-value < 2.2e-16

We get a warning message, because the contingency table probably has cells with expected cell counts less than 5:

chisq.test(table(db.adult$workclass, db.adult$income))$expected
Warning in chisq.test(table(db.adult$workclass, db.adult$income)): Chi-
squared approximation may be incorrect
                   
                          <=50K        >50K
   Private          16738.51349 5547.486506
   Self-emp-not-inc  1876.94271  622.057291
   Local-gov         1552.47722  514.522777
   State-gov          960.62814  318.371859
   Self-emp-inc       806.65725  267.342749
   Federal-gov        708.26610  234.733904
   Without-pay         10.51509    3.484915

We notice that, indeed, there are two cells ([“Never-worked”, “<=50K”] and [“Never-worked”, “>50K”]) with expected cell counts equal to 0 and one cell (the cell [“Without-pay”, “>50K”]) with expected cell count equal to 3.5<5. If we look at the observed counts of the levels of “workclass”

table(db.adult$workclass)

          Private  Self-emp-not-inc         Local-gov         State-gov 
            22286              2499              2067              1279 
     Self-emp-inc       Federal-gov       Without-pay 
             1074               943                14 

we see that there are no participants in the study who identify themselves as belonging to the category " Never-worked“. Therefore we remove this unused factor level from the categorical variable”workclass" with the help of the function “droplevels()”:

db.adult$workclass <- droplevels(db.adult$workclass)

levels(db.adult$workclass)
[1] " Private"          " Self-emp-not-inc" " Local-gov"       
[4] " State-gov"        " Self-emp-inc"     " Federal-gov"     
[7] " Without-pay"     
summary(db.adult$workclass)
          Private  Self-emp-not-inc         Local-gov         State-gov 
            22286              2499              2067              1279 
     Self-emp-inc       Federal-gov       Without-pay 
             1074               943                14 

After we removed the cells with zero expected counts, we perform again the Pearson’s chi-square test:

CrossTable(db.adult$workclass, db.adult$income, 
           prop.chisq = TRUE,
           chisq = TRUE)
Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
may be incorrect

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  30162 

 
                   | db.adult$income 
db.adult$workclass |     <=50K |      >50K | Row Total | 
-------------------|-----------|-----------|-----------|
           Private |     17410 |      4876 |     22286 | 
                   |    26.938 |    81.279 |           | 
                   |     0.781 |     0.219 |     0.739 | 
                   |     0.769 |     0.649 |           | 
                   |     0.577 |     0.162 |           | 
-------------------|-----------|-----------|-----------|
  Self-emp-not-inc |      1785 |       714 |      2499 | 
                   |     4.504 |    13.590 |           | 
                   |     0.714 |     0.286 |     0.083 | 
                   |     0.079 |     0.095 |           | 
                   |     0.059 |     0.024 |           | 
-------------------|-----------|-----------|-----------|
         Local-gov |      1458 |       609 |      2067 | 
                   |     5.749 |    17.348 |           | 
                   |     0.705 |     0.295 |     0.069 | 
                   |     0.064 |     0.081 |           | 
                   |     0.048 |     0.020 |           | 
-------------------|-----------|-----------|-----------|
         State-gov |       935 |       344 |      1279 | 
                   |     0.684 |     2.063 |           | 
                   |     0.731 |     0.269 |     0.042 | 
                   |     0.041 |     0.046 |           | 
                   |     0.031 |     0.011 |           | 
-------------------|-----------|-----------|-----------|
      Self-emp-inc |       474 |       600 |      1074 | 
                   |   137.184 |   413.929 |           | 
                   |     0.441 |     0.559 |     0.036 | 
                   |     0.021 |     0.080 |           | 
                   |     0.016 |     0.020 |           | 
-------------------|-----------|-----------|-----------|
       Federal-gov |       578 |       365 |       943 | 
                   |    23.959 |    72.291 |           | 
                   |     0.613 |     0.387 |     0.031 | 
                   |     0.026 |     0.049 |           | 
                   |     0.019 |     0.012 |           | 
-------------------|-----------|-----------|-----------|
       Without-pay |        14 |         0 |        14 | 
                   |     1.155 |     3.485 |           | 
                   |     1.000 |     0.000 |     0.000 | 
                   |     0.001 |     0.000 |           | 
                   |     0.000 |     0.000 |           | 
-------------------|-----------|-----------|-----------|
      Column Total |     22654 |      7508 |     30162 | 
                   |     0.751 |     0.249 |           | 
-------------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  804.1575     d.f. =  6     p =  1.946096e-170 


 

We get a warning message once more because there is still one cell with expected cell count less than 5. However, we ignore the warning message and interpret with caution the results. The p-value is very small (it is approximately equal to 0), which means that we reject the null hypothesis at the 0.05 significance level.

4) The variables “occupation” and “income”:

chisq.test(db.adult$occupation, db.adult$income)
Warning in chisq.test(db.adult$occupation, db.adult$income): Chi-squared
approximation may be incorrect

    Pearson's Chi-squared test

data:  db.adult$occupation and db.adult$income
X-squared = 3687.6, df = 13, p-value < 2.2e-16

We get a warning message again. Therefore we check if there are any cells with expected count less than 5:

chisq.test(db.adult$occupation, db.adult$income)$expected 
Warning in chisq.test(db.adult$occupation, db.adult$income): Chi-squared
approximation may be incorrect
                    db.adult$income
db.adult$occupation        <=50K        >50K
   Prof-specialty    3032.851005 1005.148995
   Craft-repair      3026.842384 1003.157616
   Exec-managerial   2998.301439  993.698561
   Adm-clerical      2794.759432  926.240568
   Sales             2691.861813  892.138187
   Other-service     2412.460977  799.539023
   Machine-op-inspct 1476.618394  489.381606
   Transport-moving  1180.693853  391.306147
   Handlers-cleaners 1013.954645  336.045355
   Farming-fishing    742.815662  246.184338
   Tech-support       684.982693  227.017307
   Protective-serv    483.693920  160.306080
   Priv-house-serv    107.404085   35.595915
   Armed-Forces         6.759698    2.240302

It turns out that there is one cell with expected count of 2.2 and that is the cell [“Armed-Forces”, “>50K”]. Since there is only one problematic cell, we consider the Pearson’s test trustworthy. The calculated p-value is very small and hence we fail to accept the null hypothesis, which means that there is strong indication that the categorical variables “occupation” and “income” are dependent.

5) The variables “education” and “income”:

chisq.test(db.adult$education, db.adult$income)

    Pearson's Chi-squared test

data:  db.adult$education and db.adult$income
X-squared = 4070.4, df = 15, p-value < 2.2e-16

6) The variables “marital_status” and “income”:

chisq.test(db.adult$marital_status, db.adult$income) 

    Pearson's Chi-squared test

data:  db.adult$marital_status and db.adult$income
X-squared = 6061.7, df = 6, p-value < 2.2e-16

7) The variables “relationship” and “income”:

chisq.test(db.adult$relationship, db.adult$income) 

    Pearson's Chi-squared test

data:  db.adult$relationship and db.adult$income
X-squared = 6233.8, df = 5, p-value < 2.2e-16

8) The variables “native_region” and “income”:

chisq.test(db.adult$native_region, db.adult$income)

    Pearson's Chi-squared test

data:  db.adult$native_region and db.adult$income
X-squared = 235.96, df = 7, p-value < 2.2e-16

9) The variables “hours_w” and “income”:

chisq.test(db.adult$hours_w, db.adult$income)

    Pearson's Chi-squared test

data:  db.adult$hours_w and db.adult$income
X-squared = 1940.4, df = 4, p-value < 2.2e-16

All of the rest Pearson’s chi-square tests give very small p-values, which means that it is very unlikely for the considered categorical variables - “education”, “marital_status”, “relationship”, “native_region” and “hours_w” not to be related with “income”.