In this post, we are going to creat a consistent US Census Data

Data Set Information:

Extraction was done by Barry Becker from the 1994 Census database.

The dataset was downloaded from: “https://www.kaggle.com/uciml/incomecensus-census-income

*** Response variable A. income: Categorical variable that contains yearly income of the respondent (“<=$50K” or “>50K”).

*** Independent variables B. age: Numerical variable that contains the age of the respondent.

C. workclass : Categorical variable that contains the type of employer of the respondent.

D. fnlwgt: Numerical variable that contains the number of respondents that each row of the data set represents.

E. education: Categorical variable that represents the level of education of the respondent (Doctorate, Prof-school, Masters, Bachelors, Assoc-acdm, Assoc-voc, Some-college, HS-grad, 12th, 11th, 10th, 9th, 7th-8th, 5th-6th, 1st-4th, Preschool).

F. education.num: Numerical variable that represents the education variable.

G. marital.status: The marital status of the respondent.

H. occupation: Categorical variable that represents the type of employment of the respondent (?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving).

I. relationship: Categorical variable that represents the position in the family of the respondent (Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife).

J. race: Categorical variable that represents the race of the respondent (Amer-Indian-Eskimo, Asian-Pac-Islander, Black,Other, White).

K. sex: Categorical variable that represent the sex of the respondent (Female, Male).

L. capital.gain: Numerical variable that represents the income gained by the respondent from sources other than salary/wages.

M. capital.loss: Numerical variable that represents the income lost by the respondent from sources other than salary/wages.

N. hours.per.week: Numerical variable that represents the hours worked per week by the respondent.

O. native.country: Categorical variable that represents the native country of the respondent.

  1. Importing data

We will read in our data using the read_csv() function, from the tidyverse package readr, instead of read.csv().

library(readr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0     v dplyr   0.8.4
## v tibble  2.1.3     v stringr 1.4.0
## v tidyr   1.0.2     v forcats 0.4.0
## v purrr   0.3.3
## Warning: package 'ggplot2' was built under R version 3.6.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(dplyr)
library(DataExplorer)
library(pander)
incomecensus <- read_csv("adultincome.csv")
## Parsed with column specification:
## cols(
##   age = col_double(),
##   workclass = col_character(),
##   fnlwgt = col_double(),
##   education = col_character(),
##   education.num = col_double(),
##   marital.status = col_character(),
##   occupation = col_character(),
##   relationship = col_character(),
##   race = col_character(),
##   sex = col_character(),
##   capital.gain = col_double(),
##   capital.loss = col_double(),
##   hours.per.week = col_double(),
##   native.country = col_character(),
##   income = col_character()
## )
  1. Inspecting the data frame

First look at the data

str(incomecensus)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 32561 obs. of  15 variables:
##  $ age           : num  90 82 66 54 41 34 38 74 68 41 ...
##  $ workclass     : chr  "?" "Private" "?" "Private" ...
##  $ fnlwgt        : num  77053 132870 186061 140359 264663 ...
##  $ education     : chr  "HS-grad" "HS-grad" "Some-college" "7th-8th" ...
##  $ education.num : num  9 9 10 4 10 9 6 16 9 10 ...
##  $ marital.status: chr  "Widowed" "Widowed" "Widowed" "Divorced" ...
##  $ occupation    : chr  "?" "Exec-managerial" "?" "Machine-op-inspct" ...
##  $ relationship  : chr  "Not-in-family" "Not-in-family" "Unmarried" "Unmarried" ...
##  $ race          : chr  "White" "White" "Black" "White" ...
##  $ sex           : chr  "Female" "Female" "Female" "Female" ...
##  $ capital.gain  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ capital.loss  : num  4356 4356 4356 3900 3900 ...
##  $ hours.per.week: num  40 18 40 40 40 45 40 20 40 60 ...
##  $ native.country: chr  "United-States" "United-States" "United-States" "United-States" ...
##  $ income        : chr  "<=50K" "<=50K" "<=50K" "<=50K" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   workclass = col_character(),
##   ..   fnlwgt = col_double(),
##   ..   education = col_character(),
##   ..   education.num = col_double(),
##   ..   marital.status = col_character(),
##   ..   occupation = col_character(),
##   ..   relationship = col_character(),
##   ..   race = col_character(),
##   ..   sex = col_character(),
##   ..   capital.gain = col_double(),
##   ..   capital.loss = col_double(),
##   ..   hours.per.week = col_double(),
##   ..   native.country = col_character(),
##   ..   income = col_character()
##   .. )
incomecensus <- incomecensus %>% 
  select(-c(fnlwgt))
head(incomecensus)
## # A tibble: 6 x 14
##     age workclass education education.num marital.status occupation relationship
##   <dbl> <chr>     <chr>             <dbl> <chr>          <chr>      <chr>       
## 1    90 ?         HS-grad               9 Widowed        ?          Not-in-fami~
## 2    82 Private   HS-grad               9 Widowed        Exec-mana~ Not-in-fami~
## 3    66 ?         Some-col~            10 Widowed        ?          Unmarried   
## 4    54 Private   7th-8th               4 Divorced       Machine-o~ Unmarried   
## 5    41 Private   Some-col~            10 Separated      Prof-spec~ Own-child   
## 6    34 Private   HS-grad               9 Divorced       Other-ser~ Unmarried   
## # ... with 7 more variables: race <chr>, sex <chr>, capital.gain <dbl>,
## #   capital.loss <dbl>, hours.per.week <dbl>, native.country <chr>,
## #   income <chr>

Dealing with rows containing “?”

We can see that there are some rows with “?”.

For example

incomecensus %>% 
  group_by(workclass) %>% 
  summarise(count = n())
## # A tibble: 9 x 2
##   workclass        count
##   <chr>            <int>
## 1 ?                 1836
## 2 Federal-gov        960
## 3 Local-gov         2093
## 4 Never-worked         7
## 5 Private          22696
## 6 Self-emp-inc      1116
## 7 Self-emp-not-inc  2541
## 8 State-gov         1298
## 9 Without-pay         14

We will check if there are any other columns containings “?”

First, check for the records with any “?”

missing_count <- purrr::map_df(incomecensus, ~ stringr::str_detect(., pattern = "\\?")) %>%
  rowSums() %>%
  tbl_df() %>%
  filter(value > 0) %>%
  summarize(missing_count = n()) 

missing_count
## # A tibble: 1 x 1
##   missing_count
##           <int>
## 1          2399

In total, there are 2399 rows containing “?”

count.NA.per.col <- plyr::ldply(incomecensus, function(c) sum(c == "?"))
count.NA.per.col %>% 
  pander()
.id V1
age 0
workclass 1836
education 0
education.num 0
marital.status 0
occupation 1843
relationship 0
race 0
sex 0
capital.gain 0
capital.loss 0
hours.per.week 0
native.country 583
income 0

There are 3 columns that contain “?” as NA: workclass, occupation, and native.country

We will remove all of these records.

incomecensus <- incomecensus %>% 
  filter(!workclass=="?", !occupation=="?", !native.country=="?")

The resulting data contains 30162 rows.

Assigning correct R data types to each column

Converting character to factor.m

incomecensus <- incomecensus %>%
  mutate_if(is.character,as.factor)

str(incomecensus)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 30162 obs. of  14 variables:
##  $ age           : num  82 54 41 34 38 74 68 45 38 52 ...
##  $ workclass     : Factor w/ 7 levels "Federal-gov",..: 3 3 3 3 3 6 1 3 5 3 ...
##  $ education     : Factor w/ 16 levels "10th","11th",..: 12 6 16 12 1 11 12 11 15 10 ...
##  $ education.num : num  9 4 10 9 6 16 9 16 15 13 ...
##  $ marital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 7 1 6 1 6 5 1 1 5 7 ...
##  $ occupation    : Factor w/ 14 levels "Adm-clerical",..: 4 7 10 8 1 10 10 10 10 8 ...
##  $ relationship  : Factor w/ 6 levels "Husband","Not-in-family",..: 2 5 4 5 5 3 2 5 2 2 ...
##  $ race          : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 5 5 5 5 3 5 5 ...
##  $ sex           : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 1 1 2 1 ...
##  $ capital.gain  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ capital.loss  : num  4356 3900 3900 3770 3770 ...
##  $ hours.per.week: num  18 40 40 45 40 20 40 35 45 20 ...
##  $ native.country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 39 39 39 39 39 39 ...
##  $ income        : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 2 1 2 2 2 ...
  1. Detecting the inconsistency of the data

3.1. Missing Values

plot_missing(incomecensus)

We will remove the fnlwgt column

3.2. Checking duplicate records

    # to check for duplicate records
incomecensus %>%
  summarize(record_count = n(),
            distinct_records = n_distinct(.))
## # A tibble: 1 x 2
##   record_count distinct_records
##          <int>            <int>
## 1        30162            26904

The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame.

If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().

incomecensus <- incomecensus %>% 
  distinct()

The resulting data frame contain 26904 records.

3.3. Checking the consistence of the data

  1. Age and education
incomecensus %>%
  ggplot(aes(age, education)) + 
  geom_point(alpha = 0.3, col = "#00AFBB") + 
  geom_point(aes(col = (age<= 20 & education == 'Masters')), 
             alpha = 1, 
             size = 3) + 
  scale_colour_manual(values = setNames(c('blue','grey'),
                                        c(T, F))) +
  theme(legend.position="bottom") + 
  scale_x_continuous(breaks = seq(0,100,10))+ 
  xlab("Age") + 
  ylab("Education Level") +
  ggtitle("Age vs Education Level")

There are a few data points (in blue) are outlier datapoints.

It is highly unlikely that the people with age of 20 years and less can complete Masters degree. In reality, it might be possible.

We can take these data points out.

incomecensus <- incomecensus %>%
  filter(!(age <= 20 & education == 'Masters'))

B. sex and relationship status

table(incomecensus$sex, incomecensus$relationship)
##         
##          Husband Not-in-family Other-relative Own-child Unmarried  Wife
##   Female       1          3262            384      1576      2354  1365
##   Male     10808          3853            489      2077       732     1

There is one record where Husband is female

There is also one record where wife is female.

In 1994, this situation was not possible in the US

Where are these records ?

incomecensus %>%
  filter((sex == 'Male' & relationship == 'Wife')  | (sex == 'Female' & relationship == 'Husband'))
## # A tibble: 2 x 14
##     age workclass education education.num marital.status occupation relationship
##   <dbl> <fct>     <fct>             <dbl> <fct>          <fct>      <fct>       
## 1    29 Private   Bachelors            13 Married-civ-s~ Exec-mana~ Wife        
## 2    34 Private   HS-grad               9 Married-civ-s~ Sales      Husband     
## # ... with 7 more variables: race <fct>, sex <fct>, capital.gain <dbl>,
## #   capital.loss <dbl>, hours.per.week <dbl>, native.country <fct>,
## #   income <fct>

These points can be removed using the code below.

incomecensus <- incomecensus %>%
  filter(!((sex == 'Male' & relationship == 'Wife') | (sex == 'Female' & relationship == 'Husband')))

C. Relationship and marital status

Marital Status: The marital status classification identifies five major categories: never married, married, widowed, and divorced.

table(incomecensus$relationship, incomecensus$marital.status) 
##                 
##                  Divorced Married-AF-spouse Married-civ-spouse
##   Husband               0                 9              10799
##   Not-in-family      2153                 0                 14
##   Other-relative      102                 1                118
##   Own-child           305                 1                 83
##   Unmarried          1449                 0                  0
##   Wife                  0                10               1355
##                 
##                  Married-spouse-absent Never-married Separated Widowed
##   Husband                            0             0         0       0
##   Not-in-family                    181          3959       382     426
##   Other-relative                    26           533        53      40
##   Own-child                         43          3120        89      12
##   Unmarried                        120           774       404     339
##   Wife                               0             0         0       0

The group “other married, spouse absent” includes married people living apart because either the husband or wife was employed and living at a considerable distance from home, was serving away from home in the Armed Forces, had moved to another area, or had a different place of residence for any other reason except separation as defined above. (https://www.unmarried.org/government-terminology/)

incomecensus %>%
  filter(marital.status == 'Married-spouse-absent' & relationship == 'Unmarried') %>% 
  summarise(count = n())
## # A tibble: 1 x 1
##   count
##   <int>
## 1   120

There are 120 rows where people are ‘Married-spouse-absent’ and ‘Unmarried’ at the same time.

Thers records will be removed.

incomecensus <- incomecensus %>%
  filter(!(marital.status == 'Married-spouse-absent' & relationship == 'Unmarried'))

D. age and weekly working hour

incomecensus %>%
  ggplot(aes(age, hours.per.week)) + 
  geom_point(size = 2, shape = 23, color = "blue") + 
  theme(legend.position="bottom") + 
  scale_x_continuous(breaks = seq(0,100,10))+ 
  scale_y_continuous(breaks = seq(0,100,10))+ 
  xlab("Age") + 
  ylab("Weekly working hours") +
  ggtitle("Age vs Weekly working hours ")

Some people aged older than 70 and work more than 80 hours a week.

Even for people with 90 years of age, there are records of 100 hours per week.

To lighlight these data point

incomecensus %>%
  ggplot(aes(age, hours.per.week)) + 
  geom_point(alpha = 0.3, col = "#00AFBB") +
  geom_point(aes(col = (age >= 70 & hours.per.week >= 40)), 
             alpha = 1, 
             size = 3) + 
  scale_colour_manual(values = setNames(c('red','grey'),
                                        c(T, F))) +
  theme(legend.position="bottom") + 
  scale_x_continuous(breaks = seq(0,100,10))+ 
  scale_y_continuous(breaks = seq(0,100,10))+ 
  xlab("Age") + 
  ylab("Weekly working hours") +
  ggtitle("Age vs Weekly working hours ")

incomecensus %>%
  filter(age >= 70 & hours.per.week > 50) %>% 
  summarise(counts =n())
## # A tibble: 1 x 1
##   counts
##    <int>
## 1     25

Data points with age more than and equal to 70 years and working hours greater than 50 hours can be removed.

incomecensus <- incomecensus %>%
   filter(!(age >= 70 & hours.per.week > 50))

E. capital loss and capital gain

There are many points with 99999 as the Capital which seems suspicious and hence, should be eliminated from the analysis.

incomecensus %>%
  group_by(capital.gain) %>% 
  summarise(counts =n()) %>% 
  arrange(desc(capital.gain)) %>% 
  head(10)
## # A tibble: 10 x 2
##    capital.gain counts
##           <dbl>  <int>
##  1        99999    147
##  2        41310      2
##  3        34095      3
##  4        27828     32
##  5        25236     10
##  6        25124      1
##  7        22040      1
##  8        20051     31
##  9        18481      2
## 10        15831      6

We can remove all the records where the capital gain is equal to 99999.

incomecensus <- incomecensus %>%
   filter(!(capital.gain == 99999))
incomecensus %>%
  group_by(capital.loss) %>% 
  summarise(counts =n()) %>% 
  arrange(desc(capital.loss)) %>% 
  head(10)
## # A tibble: 10 x 2
##    capital.loss counts
##           <dbl>  <int>
##  1         4356      1
##  2         3900      2
##  3         3770      2
##  4         3683      2
##  5         3004      1
##  6         2824      8
##  7         2754      2
##  8         2603      4
##  9         2559     12
## 10         2547      4

F. Work class and occupation

table(incomecensus$occupation, incomecensus$workclass)
##                    
##                     Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc
##   Adm-clerical              308       269    2360           28               46
##   Armed-Forces                9         0       0            0                0
##   Craft-repair               62       138    2377           96              485
##   Exec-managerial           175       209    2296          360              369
##   Farming-fishing             8        27     427           51              415
##   Handlers-cleaners          21        44    1056            2               15
##   Machine-op-inspct          14        11    1571           10               35
##   Other-service              33       187    2349           27              171
##   Priv-house-serv             0         0     137            0                0
##   Prof-specialty            164       664    2001          138              346
##   Protective-serv            27       291     183            5                6
##   Sales                      14         7    2508          266              362
##   Tech-support               66        38     667            3               26
##   Transport-moving           24       112    1093           26              118
##                    
##                     State-gov Without-pay
##   Adm-clerical            245           3
##   Armed-Forces              0           0
##   Craft-repair             52           1
##   Exec-managerial         181           0
##   Farming-fishing          15           5
##   Handlers-cleaners         9           1
##   Machine-op-inspct        13           1
##   Other-service           119           0
##   Priv-house-serv           0           0
##   Prof-specialty          392           0
##   Protective-serv         112           0
##   Sales                    10           0
##   Tech-support             55           0
##   Transport-moving         40           1

We might remove the two levels: without-pay

We also might regroup these levels into 3 categoris : public, private, self-employeed

  1. Exploratory data analysis (EDA)

A. income Categorical variable that contains yearly income of the respondent (“<=$50K” or “>50K”).

incomecensus %>% 
  group_by(income) %>% 
  summarise(count = n())
## # A tibble: 2 x 2
##   income count
##   <fct>  <int>
## 1 <=50K  19888
## 2 >50K    6720
income_prop <- incomecensus %>% 
  group_by(income) %>% 
  summarise(count = n()) %>% 
  ungroup()%>% 
  arrange(desc(income)) %>%
  mutate(percentage = round(count/sum(count),4)*100,
         lab.pos = cumsum(percentage)-0.5*percentage)

ggplot(data = income_prop, 
       aes(x = "", 
           y = percentage, 
           fill = income))+
  geom_bar(stat = "identity")+
  coord_polar("y") +
  geom_text(aes(y = lab.pos, 
                label = paste(percentage,"%", sep = "")), col = "blue", size = 5) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  theme_void() +
  theme(legend.title = element_text(color = "black", size = 14),
        legend.text = element_text(color = "black", size = 14))

The summary of the data shows that 75 % of the observations have an income less than or equal to 50k dollars.

Specifically, 22654 persons have an income <=50k dollars, while 7508 people earn more than 50k.

B. Age

incomecensus %>% ggplot(aes(age)) + 
  geom_histogram(fill= "lightblue",
                 color = 'blue',
                 binwidth = 5) +   
  labs(title= "Age Distribution") +
  theme(plot.title = element_text(hjust = 0.5))

summary(incomecensus$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   28.00   38.00   38.97   48.00   90.00
incomecensus %>% ggplot(aes(age)) + 
  geom_histogram(aes(fill=income),
                 color = 'grey',
                 binwidth = 1) +   
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  labs(title= "Age Distribution for Income")+
  theme(plot.title = element_text(hjust = 0.5))

incomecensus %>% 
  ggplot(aes(age, 
             fill= income)) +
  geom_density(alpha= 0.7, color = 'blue') +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  labs(x = "Age", y = "Density", title = "Density graph of age distribution")

Older people tends to earn more.

On the other hand, majority of the people with an age around 25 years earns less than 50k per annum.

C. workclass

library(ggthemes)
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 3.6.3
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
incomecensus %>%
  group_by(workclass) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(workclass, -counts),
             y  = counts,
             fill = Percentage)) +
  geom_bar(stat = "identity",
           width = 0.6) +
  scale_fill_gradient(low="skyblue1", high="royalblue4")+
  geom_text(aes(label = paste0(round(counts,1), "\n",Percentage,"%")), 
            vjust = -0.1, 
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,25000)) +
  theme_minimal() +
  labs(x = "Work class",
       y = "Frequency",
       caption = "Income census US 1994") 

Almost three-quarters of sample work in the private sector.

In this data, there are dictinction between Local-gov, State-gov, and Federal-gov.

We might group these three category if there is no difference in income level between them.

library(ggpubr)
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
p1 <- incomecensus %>%
  group_by(workclass) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(workclass, counts),
             y  = counts)) +
  geom_bar(stat = "identity",
           width = 0.6,
           fill = "steelblue") +
  geom_text(aes(label = paste0(round(counts,1),"\n",Percentage,"%")), 
            vjust = 0.5, 
            hjust = -0.5,
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,20000)) +
  theme_minimal() +
  labs(x = "Work class",y = "Frequency") + 
  coord_flip()

p2 <- incomecensus %>% 
  group_by(workclass, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(workclass, n), 
             y = pct/100, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_y_continuous(name= "Percentage", 
                     labels = percent) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) + 
  coord_flip()

ggarrange(p1, p2, nrow = 1)

income_gov <- incomecensus %>% 
  filter(workclass %in% c("Local-gov", "State-gov", "Federal-gov")) %>% 
  group_by(workclass, income) %>%
  summarise(count = n()) %>% 
  mutate(pct = count/sum(count)) %>%
  arrange(desc(income), pct)

income_gov <-  income_gov %>% transmute(income, percent = count*100/sum(count))
income_gov 
## # A tibble: 6 x 3
## # Groups:   workclass [3]
##   workclass   income percent
##   <fct>       <fct>    <dbl>
## 1 State-gov   >50K      27.0
## 2 Local-gov   >50K      29.4
## 3 Federal-gov >50K      38.4
## 4 Federal-gov <=50K     61.6
## 5 Local-gov   <=50K     70.6
## 6 State-gov   <=50K     73.0
incomecensus %>% 
  filter(workclass %in% c("Local-gov", "State-gov", "Federal-gov")) %>% 
  group_by(workclass, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(workclass, n), 
             y = pct, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank())

There is a slight difference between Local-gov and Stave-gov.

Working for the federal government can have a greater chance of getting an income higher than 50K.

incomecensus %>% 
  filter(workclass %in% c("Self-emp-not-inc", "Self-emp-inc", "Without-pay")) %>% 
  group_by(workclass, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(workclass, n), 
             y = pct, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank())

education

incomecensus %>% 
  group_by(education) %>% 
  summarise(counts = n()) %>% 
  arrange(desc(counts)) %>% 
  pander()
education counts
HS-grad 8201
Some-college 5853
Bachelors 4447
Masters 1534
Assoc-voc 1248
Assoc-acdm 990
11th 937
10th 763
7th-8th 534
Prof-school 486
9th 445
12th 356
Doctorate 349
5th-6th 275
1st-4th 146
Preschool 44
incomecensus %>% 
  group_by(education.num) %>% 
  summarise(counts = n()) %>% 
  arrange(desc(counts)) %>% 
  pander()
education.num counts
9 8201
10 5853
13 4447
14 1534
11 1248
12 990
7 937
6 763
4 534
15 486
5 445
8 356
16 349
3 275
2 146
1 44

We can see that these two columns are identical.

The column education.num is simply converted from the column education.

If we keep the column education which is factor type, we might regroup some categories in this column.

If we keep the column education.num which is numerical type,, we can keep the column as it is (numeric).

eudcation_df <- incomecensus %>%
  group_by(education.num) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(education.num)

eudcation_df %>% 
  pander()
education.num counts Percentage
1 44 0.17
2 146 0.55
3 275 1.03
4 534 2.01
5 445 1.67
6 763 2.87
7 937 3.52
8 356 1.34
9 8201 30.82
10 5853 22
11 1248 4.69
12 990 3.72
13 4447 16.71
14 1534 5.77
15 486 1.83
16 349 1.31
ggplot(data = eudcation_df ,
           aes(x= education.num,
               y  = counts,
               fill = Percentage)) +
  geom_bar(stat = "identity",
           width = 0.6) +
  scale_x_continuous(breaks = c(0:16)) +
  scale_fill_gradient(low="skyblue1", high="royalblue4")+
  geom_text(aes(label = paste0(round(counts,1), "\n","(",Percentage,"%)")), 
            vjust = -0.1, 
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,12000)) +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Frequency",
       caption = "Income census US 1994") +
  ggtitle("Distribution of years of education") +
    theme(plot.title = element_text(hjust = 0.5), 
        axis.title = element_text(size = 12), 
        axis.text.x = element_text(size = 12), 
        axis.text.y = element_text(size = 12),
        legend.position = "right", 
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(colour="black", size = 12))

About 50% of respondents have 9-10 years of education.

About 17% of people in the sample hold bachelor degree (13 years of education).

education.pct <- incomecensus %>%
  group_by(education.num, income) %>%
  summarize(count = n()) %>%
  mutate(pct = count/sum(count)) %>%
  arrange(desc(income), pct)

ggplot(education.pct, 
       aes(education.num, pct, fill = income)) + 
  geom_bar(stat="identity", position = "fill") + 
  geom_hline(yintercept = 0.2489, col = "blue") +
  scale_x_continuous(breaks = c(0:16)) +
  scale_y_continuous(labels=scales::percent) + 
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) + 
  ggtitle("Income by years of education") + 
  xlab("Education (years)") + 
  ylab("Percentage") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.title = element_text(size = 12), 
        axis.text.x = element_text(size = 12), 
        axis.text.y = element_text(size = 12),
        legend.position = "right", 
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(colour="black", size = 12)) 

Higher years of education led to a higher chance of having >50K.

marital status

incomecensus %>% 
  group_by(marital.status) %>% 
  summarise(counts = n()) %>% 
  arrange(desc(counts))
## # A tibble: 7 x 2
##   marital.status        counts
##   <fct>                  <int>
## 1 Married-civ-spouse     12236
## 2 Never-married           8369
## 3 Divorced                3996
## 4 Separated                926
## 5 Widowed                  811
## 6 Married-spouse-absent    249
## 7 Married-AF-spouse         21

we can put all the married people in the same group

This column have 7 labels, three of them are:

These levels can be grouped into the group “married”.

Replace “Married-AF-spouse”, “Married-civ-spouse”, and “Married-spouse-absent” by “Married”

pat = c("Married-AF-spouse|Married-civ-spouse|Married-spouse-absent")
incomecensus <- incomecensus %>% 
  mutate(marital_status = stringr::str_replace_all(marital.status, pat, "Married"))
incomecensus <- incomecensus %>% 
  select(-marital.status)
marital_df <- incomecensus %>%
  group_by(marital_status) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),1)) %>% 
  arrange(desc(counts))

marital_df %>% 
  pander()
marital_status counts Percentage
Married 12506 47
Never-married 8369 31.5
Divorced 3996 15
Separated 926 3.5
Widowed 811 3

The proportion of married people that have an income higher than 50K is highest.

p3 <- incomecensus %>%
  group_by(marital_status) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(marital_status, counts),
             y  = counts)) +
  geom_bar(stat = "identity",
           width = 0.6,
           fill = "steelblue") +
  geom_text(aes(label = paste0(round(counts,1),"\n","(",Percentage,"%)")), 
            vjust = 0.4, 
            hjust = -0.5,
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,20000)) +
  theme_minimal() +
  labs(x = "Marital status",y = "Frequency") + 
  coord_flip()

p4 <- incomecensus %>% 
  group_by(marital_status, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(marital_status, n), 
             y = pct/100, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "Percentage", 
                     labels = scales::percent) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) + 
  coord_flip()

ggarrange(p3, p4, nrow = 1)

occupation

Category_df <- incomecensus %>%
  group_by(occupation) %>% 
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(counts)

ggplot(data = Category_df,
           aes(x= reorder(occupation, -counts),
               y  = counts,
               fill = Percentage)) +
  geom_bar(stat = "identity",
           width = 0.7) +
  geom_text(aes(label = paste0(counts,", ", Percentage,"%")), 
            vjust = 0.2, 
            hjust = -0.1,
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,5000)) +
  scale_fill_distiller(palette = "Spectral") +
  theme_minimal() +
  labs(x = "Category",
       y = "Frequency",
       caption = "Income census US 1994") +
  coord_flip()

occupation.pct <- incomecensus %>%
  group_by(occupation, income) %>%
  summarize(count = n()) %>%
  mutate(pct = count/sum(count)) %>%
  arrange(desc(income), pct)

occupation.pct$occupation <- factor(occupation.pct$occupation,
                                    levels = occupation.pct$occupation[1:(nrow(occupation.pct)/2)])

ggplot(occupation.pct, aes(reorder(occupation,-pct), pct, fill = income)) + 
  geom_bar(stat="identity", position = "fill") + 
  geom_hline(yintercept = 0.2489, col = "blue") +
  ggtitle("Income by occupation") + 
  xlab("Occupation") + 
  ylab("Percentage") + 
  scale_y_continuous(labels=scales::percent) + 
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) + 
  theme(plot.title = element_text(hjust = 0.5), 
        axis.title = element_text(size = 12), 
        axis.text.x = element_text(size = 12), 
        axis.text.y = element_text(size = 12),
        legend.position = "bottom", 
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(colour="black", size = 12)) + 
  coord_flip()

About 25% of the sample has an income of more than 50K.

There are five categories of jobs that have a higher percentage: Protective service, tech-support, Sales, Exec-managerial, and Prof-speciality.

Question: Are these people work in the private sector?

incomecensus %>% 
  filter(occupation %in% c("Protective-serv", "Tech-support", "Sales", "Exec-managerial", "Prof-specialty")) %>% 
  group_by(workclass) %>% 
  summarise(freq =  n()) %>% 
  mutate(percentage = round(freq*100/sum(freq),1)) %>%
  arrange(desc(percentage)) %>% 
  pander()
workclass freq percentage
Private 7655 64.1
Local-gov 1209 10.1
Self-emp-not-inc 1109 9.3
Self-emp-inc 772 6.5
State-gov 750 6.3
Federal-gov 446 3.7

About two-third work in the private sector.

Governmental job accounts for 18.81 %.

Blue_Collar (Craft-repair,Farming-fishing,Handlers-cleaners,Machine-op-inspct,Transport-moving)

White_Collar (Adm-clerical, Sales, Tech-support,Protective-serv),

Exec_mgr_prof (Exec-managerial,Prof-specialty),

and Service_other (Armed-Forces, Other-service, Priv-house-serv)

occupation_distr <- incomecensus %>%
  group_by(occupation) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(occupation, counts),
             y  = counts)) +
  geom_bar(stat = "identity",
           width = 0.6,
           fill = "steelblue") +
  geom_text(aes(label = paste0(round(counts,1),"\n","(",Percentage,"%)")), 
            vjust = 0.4, hjust = -0.5,color = "darkblue", size = 4) +
  scale_y_continuous(limits = c(0,20000)) +
  theme_minimal() +
  labs(x = "Grouped Occupation", y = "Frequency") + 
  coord_flip()

occupation_pct <- incomecensus %>% 
  group_by(occupation, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(occupation, n), 
             y = pct/100, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "Percentage", 
                     labels = scales::percent) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) + 
  coord_flip()

ggarrange(occupation_distr, occupation_pct, nrow = 1)

incomecensus %>% 
  ggplot(aes(income, color= income, fill= income)) +
  geom_bar( alpha = 0.8, width = 0.8) +
  facet_grid(~ relationship) + 
  labs(x = "Incomes", y = "Count", title = "Incomes by relationship")

This column contains categories that might be overlap with other feature.

For example, unmarried people in the relationship attribute is the same as the unmarried level in the marital column.

We will remove this column

Another way is to conver this column into categories :

having child (leaving _with_child), with other relative, in couple not in family and unmarried = living alone

relationship_distr <- incomecensus %>%
  group_by(relationship) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(relationship, counts),
             y  = counts)) +
  geom_bar(stat = "identity",
           width = 0.6,
           fill = "steelblue") +
  geom_text(aes(label = paste0(round(counts,1),"\n","(",Percentage,"%)")), 
            vjust = 0.4, hjust = -0.5,color = "darkblue", size = 4) +
  scale_y_continuous(limits = c(0,20000)) +
  theme_minimal() +
  labs(x = "Grouped Occupation", y = "Frequency") + 
  coord_flip()

relationship_pct <- incomecensus %>% 
  group_by(relationship, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(relationship, n), 
             y = pct/100, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "Percentage", 
                     labels = scales::percent) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) + 
  coord_flip()

ggarrange(relationship_distr, relationship_pct, nrow = 1)

This variable may be redundant as it collapse somehow with the feature marital_status.

incomecensus <- incomecensus %>% 
  select(-relationship)

race

incomecensus %>% 
  ggplot(aes(race, fill= income)) +
  geom_bar(position = "fill") +
  labs(x = "Race", 
       y = "Proportion", 
       title = "Incomes by race")+
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  geom_hline(yintercept = 0.2489, col="blue") +
  coord_flip()

Black and “Amer-Indio-Eskima” earn less than 50k more frequently than the general population.

However, the percent of whites and “Asian-Pac-Islander” that earn more than 50k is over the general population average.

Sex Here we can see that the vast majority of people having an income greater than 50000 dollars are males.

gender_prop <- incomecensus %>% 
  group_by(sex) %>% 
  summarise(count = n()) %>% 
  ungroup()%>% 
  arrange(desc(sex)) %>%
  mutate(percentage = round(count/sum(count),4)*100,
         lab.pos = cumsum(percentage)-0.5*percentage)

gender_distr <- ggplot(data = gender_prop, 
       aes(x = "", 
           y = percentage, 
           fill = sex))+
  geom_bar(stat = "identity")+
  coord_polar("y") +
  geom_text(aes(y = lab.pos, 
                label = paste(percentage,"%", sep = "")), col = "blue", size = 4) +
  scale_fill_manual(values=c("orange", "lightblue"),
                    name = "Gender") +
  theme_void() +
  theme(legend.title = element_text(color = "black", size = 12),
        legend.text = element_text(color = "black", size = 12))


gender_prop <- incomecensus %>% 
  group_by(sex, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(sex, n), 
             y = pct, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(axis.text.y = element_blank(),
        axis.text.x = element_text(color = "black", size = 12),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.title = element_text(color = "black", size = 12),
        legend.text = element_text(color = "black", size = 12))

ggarrange(gender_distr, gender_prop, nrow = 1)

Two-third of respondents are male.

The proportion of women that earn more than 50k is much lower than that of their male counterparts.

hour per week

Distribution of hours per week

incomecensus %>% ggplot(aes(hours.per.week)) + 
  geom_histogram(fill= "orange",
                 color = 'blue',
                 binwidth = 5) +   
  labs(title= "Age Distribution") +
  theme(plot.title = element_text(hjust = 0.5))

incomecensus %>% 
  group_by(income) %>%
  summarise("Mean hours per week" = mean(hours.per.week),
            "Standard deviatrion" = sd(hours.per.week))
## # A tibble: 2 x 3
##   income `Mean hours per week` `Standard deviatrion`
##   <fct>                  <dbl>                 <dbl>
## 1 <=50K                   39.5                  12.3
## 2 >50K                    45.8                  11.0

Contrusting box-plot

ggplot(data = incomecensus, 
       aes(income, 
           hours.per.week, 
           fill = income))+
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) + 
  geom_boxplot()+
  labs(x = "Incomes", 
       y = "Worked hours per week", 
       title = "Incomes by working hours")

There are many outliers.

Some people work for 100 hours a week (which is possible)

native.country

Incomes by origin

There are 40 different native countries and some of them have just a few cases, for instance, honduras have just 10 cases.

Therefore, the first attempt is to group the countries by continent.

incomecensus <- incomecensus %>% 
  mutate(native_continent = case_when(
    native.country %in% c("France", "Greece", "Hungary", "Italy", "Portugal",
                          "Scotland", "England", "Germany", "Holand-Netherlands",
                          "Ireland", "Poland", "Yugoslavia") ~ "Europe",
    native.country %in% c("Columbia", "Dominican-Republic", "El-Salvador", "Haiti",
                          "Honduras", "Mexico", "Outlying-US(Guam-USVI-etc)",
                          "Cuba", "Ecuador", "Guatemala", "Jamaica", "Nicaragua",
                          "Peru", "Puerto-Rico", "Trinadad&Tobago") ~ "Latin America", # "Trinadad&Tobago" and "Outlying-US(Guam-USVI-etc" 
    native.country %in% c("Iran", "Japan", "Philippines", "Taiwan", "Vietnam", "Cambodia",
                          "China", "Hong", "India", "Laos", "South", "Thailand") ~ "Asia",
    native.country %in% c("United-States","Canada") ~ "USA/Canada"))
incomecensus %>% 
  group_by(native_continent) %>% 
  count()
## # A tibble: 4 x 2
## # Groups:   native_continent [4]
##   native_continent     n
##   <chr>            <int>
## 1 Asia               693
## 2 Europe             490
## 3 Latin America     1314
## 4 USA/Canada       24111
incomecensus %>% 
  group_by(native_continent, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(native_continent, n), 
             y = pct, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "blue") +
  theme(axis.text.y = element_blank(),
        axis.text.x = element_text(color = "black", size = 12),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.title = element_text(color = "black", size = 12),
        legend.text = element_text(color = "black", size = 12))

We will remove some columns

incomecensus <- incomecensus %>% 
  select(-c('education',"capital.gain","capital.loss", "native.country"))

The final data

str(incomecensus)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 26608 obs. of  10 variables:
##  $ age             : num  82 54 41 34 38 74 68 45 38 52 ...
##  $ workclass       : Factor w/ 7 levels "Federal-gov",..: 3 3 3 3 3 6 1 3 5 3 ...
##  $ education.num   : num  9 4 10 9 6 16 9 16 15 13 ...
##  $ occupation      : Factor w/ 4 levels "White_collar",..: 4 3 4 2 1 4 4 4 4 2 ...
##  $ race            : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 5 5 5 5 3 5 5 ...
##  $ sex             : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 1 1 2 1 ...
##  $ hours.per.week  : num  18 40 40 45 40 20 40 35 45 20 ...
##  $ income          : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 2 1 2 2 2 ...
##  $ marital_status  : chr  "Widowed" "Divorced" "Separated" "Divorced" ...
##  $ native_continent: chr  "USA/Canada" "USA/Canada" "USA/Canada" "USA/Canada" ...

we need to convert two columns marital_status and native_continent in factor

incomecensus$marital_status <- as.factor(incomecensus$marital_status)
incomecensus$native_continent <- as.factor(incomecensus$native_continent)

we will save this data as income.csv

write.csv(incomecensus,file = “income.csv”, row.names= FALSE)