Ensuring the quality of data

In this post, we are going to creat a consistent US Census Data

Data Set Information:

Extraction was done by Barry Becker from the 1994 Census database.

The dataset was downloaded from: “https://www.kaggle.com/uciml/incomecensus-census-income”

*** Response variable A. income: Categorical variable that contains yearly income of the respondent (“<=$50K” or “>50K”).

*** Independent variables B. age: Numerical variable that contains the age of the respondent.

C. workclass : Categorical variable that contains the type of employer of the respondent.

D. fnlwgt: Numerical variable that contains the number of respondents that each row of the data set represents.

E. education: Categorical variable that represents the level of education of the respondent (Doctorate, Prof-school, Masters, Bachelors, Assoc-acdm, Assoc-voc, Some-college, HS-grad, 12th, 11th, 10th, 9th, 7th-8th, 5th-6th, 1st-4th, Preschool).

F. education.num: Numerical variable that represents the education variable.

G. marital.status: The marital status of the respondent.

H. occupation: Categorical variable that represents the type of employment of the respondent (?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving).

I. relationship: Categorical variable that represents the position in the family of the respondent (Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife).

J. race: Categorical variable that represents the race of the respondent (Amer-Indian-Eskimo, Asian-Pac-Islander, Black,Other, White).

K. sex: Categorical variable that represent the sex of the respondent (Female, Male).

L. capital.gain: Numerical variable that represents the income gained by the respondent from sources other than salary/wages.

M. capital.loss: Numerical variable that represents the income lost by the respondent from sources other than salary/wages.

N. hours.per.week: Numerical variable that represents the hours worked per week by the respondent.

O. native.country: Categorical variable that represents the native country of the respondent.

Importing data

We will read in our data using the read_csv() function, from the tidyverse package readr, instead of read.csv().

library(readr)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.0     v dplyr   0.8.4
## v tibble  2.1.3     v stringr 1.4.0
## v tidyr   1.0.2     v forcats 0.4.0
## v purrr   0.3.3

## Warning: package 'ggplot2' was built under R version 3.6.3

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
library(dplyr)
library(DataExplorer)
library(pander)

incomecensus <- read_csv("adultincome.csv")

## Parsed with column specification:
## cols(
##   age = col_double(),
##   workclass = col_character(),
##   fnlwgt = col_double(),
##   education = col_character(),
##   education.num = col_double(),
##   marital.status = col_character(),
##   occupation = col_character(),
##   relationship = col_character(),
##   race = col_character(),
##   sex = col_character(),
##   capital.gain = col_double(),
##   capital.loss = col_double(),
##   hours.per.week = col_double(),
##   native.country = col_character(),
##   income = col_character()
## )

Inspecting the data frame

First look at the data

str(incomecensus)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 32561 obs. of  15 variables:
##  $ age           : num  90 82 66 54 41 34 38 74 68 41 ...
##  $ workclass     : chr  "?" "Private" "?" "Private" ...
##  $ fnlwgt        : num  77053 132870 186061 140359 264663 ...
##  $ education     : chr  "HS-grad" "HS-grad" "Some-college" "7th-8th" ...
##  $ education.num : num  9 9 10 4 10 9 6 16 9 10 ...
##  $ marital.status: chr  "Widowed" "Widowed" "Widowed" "Divorced" ...
##  $ occupation    : chr  "?" "Exec-managerial" "?" "Machine-op-inspct" ...
##  $ relationship  : chr  "Not-in-family" "Not-in-family" "Unmarried" "Unmarried" ...
##  $ race          : chr  "White" "White" "Black" "White" ...
##  $ sex           : chr  "Female" "Female" "Female" "Female" ...
##  $ capital.gain  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ capital.loss  : num  4356 4356 4356 3900 3900 ...
##  $ hours.per.week: num  40 18 40 40 40 45 40 20 40 60 ...
##  $ native.country: chr  "United-States" "United-States" "United-States" "United-States" ...
##  $ income        : chr  "<=50K" "<=50K" "<=50K" "<=50K" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   workclass = col_character(),
##   ..   fnlwgt = col_double(),
##   ..   education = col_character(),
##   ..   education.num = col_double(),
##   ..   marital.status = col_character(),
##   ..   occupation = col_character(),
##   ..   relationship = col_character(),
##   ..   race = col_character(),
##   ..   sex = col_character(),
##   ..   capital.gain = col_double(),
##   ..   capital.loss = col_double(),
##   ..   hours.per.week = col_double(),
##   ..   native.country = col_character(),
##   ..   income = col_character()
##   .. )

incomecensus <- incomecensus %>% 
  select(-c(fnlwgt))

head(incomecensus)

## # A tibble: 6 x 14
##     age workclass education education.num marital.status occupation relationship
##   <dbl> <chr>     <chr>             <dbl> <chr>          <chr>      <chr>       
## 1    90 ?         HS-grad               9 Widowed        ?          Not-in-fami~
## 2    82 Private   HS-grad               9 Widowed        Exec-mana~ Not-in-fami~
## 3    66 ?         Some-col~            10 Widowed        ?          Unmarried   
## 4    54 Private   7th-8th               4 Divorced       Machine-o~ Unmarried   
## 5    41 Private   Some-col~            10 Separated      Prof-spec~ Own-child   
## 6    34 Private   HS-grad               9 Divorced       Other-ser~ Unmarried   
## # ... with 7 more variables: race <chr>, sex <chr>, capital.gain <dbl>,
## #   capital.loss <dbl>, hours.per.week <dbl>, native.country <chr>,
## #   income <chr>

Dealing with rows containing “?”

We can see that there are some rows with “?”.

For example

incomecensus %>% 
  group_by(workclass) %>% 
  summarise(count = n())

## # A tibble: 9 x 2
##   workclass        count
##   <chr>            <int>
## 1 ?                 1836
## 2 Federal-gov        960
## 3 Local-gov         2093
## 4 Never-worked         7
## 5 Private          22696
## 6 Self-emp-inc      1116
## 7 Self-emp-not-inc  2541
## 8 State-gov         1298
## 9 Without-pay         14

We will check if there are any other columns containings “?”

First, check for the records with any “?”

missing_count <- purrr::map_df(incomecensus, ~ stringr::str_detect(., pattern = "\\?")) %>%
  rowSums() %>%
  tbl_df() %>%
  filter(value > 0) %>%
  summarize(missing_count = n()) 

missing_count

## # A tibble: 1 x 1
##   missing_count
##           <int>
## 1          2399

In total, there are 2399 rows containing “?”

count.NA.per.col <- plyr::ldply(incomecensus, function(c) sum(c == "?"))
count.NA.per.col %>% 
  pander()

.id	V1
age	0
workclass	1836
education	0
education.num	0
marital.status	0
occupation	1843
relationship	0
race	0
sex	0
capital.gain	0
capital.loss	0
hours.per.week	0
native.country	583
income	0

There are 3 columns that contain “?” as NA: workclass, occupation, and native.country

We will remove all of these records.

incomecensus <- incomecensus %>% 
  filter(!workclass=="?", !occupation=="?", !native.country=="?")

The resulting data contains 30162 rows.

Assigning correct R data types to each column

Converting character to factor.m

incomecensus <- incomecensus %>%
  mutate_if(is.character,as.factor)

str(incomecensus)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 30162 obs. of  14 variables:
##  $ age           : num  82 54 41 34 38 74 68 45 38 52 ...
##  $ workclass     : Factor w/ 7 levels "Federal-gov",..: 3 3 3 3 3 6 1 3 5 3 ...
##  $ education     : Factor w/ 16 levels "10th","11th",..: 12 6 16 12 1 11 12 11 15 10 ...
##  $ education.num : num  9 4 10 9 6 16 9 16 15 13 ...
##  $ marital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 7 1 6 1 6 5 1 1 5 7 ...
##  $ occupation    : Factor w/ 14 levels "Adm-clerical",..: 4 7 10 8 1 10 10 10 10 8 ...
##  $ relationship  : Factor w/ 6 levels "Husband","Not-in-family",..: 2 5 4 5 5 3 2 5 2 2 ...
##  $ race          : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 5 5 5 5 3 5 5 ...
##  $ sex           : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 1 1 2 1 ...
##  $ capital.gain  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ capital.loss  : num  4356 3900 3900 3770 3770 ...
##  $ hours.per.week: num  18 40 40 45 40 20 40 35 45 20 ...
##  $ native.country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 39 39 39 39 39 39 ...
##  $ income        : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 2 1 2 2 2 ...

Detecting the inconsistency of the data

3.1. Missing Values

plot_missing(incomecensus)

We will remove the fnlwgt column

3.2. Checking duplicate records

    # to check for duplicate records
incomecensus %>%
  summarize(record_count = n(),
            distinct_records = n_distinct(.))

## # A tibble: 1 x 2
##   record_count distinct_records
##          <int>            <int>
## 1        30162            26904

The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame.

If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().

incomecensus <- incomecensus %>% 
  distinct()

The resulting data frame contain 26904 records.

3.3. Checking the consistence of the data

Age and education

incomecensus %>%
  ggplot(aes(age, education)) + 
  geom_point(alpha = 0.3, col = "#00AFBB") + 
  geom_point(aes(col = (age<= 20 & education == 'Masters')), 
             alpha = 1, 
             size = 3) + 
  scale_colour_manual(values = setNames(c('blue','grey'),
                                        c(T, F))) +
  theme(legend.position="bottom") + 
  scale_x_continuous(breaks = seq(0,100,10))+ 
  xlab("Age") + 
  ylab("Education Level") +
  ggtitle("Age vs Education Level")

There are a few data points (in blue) are outlier datapoints.

It is highly unlikely that the people with age of 20 years and less can complete Masters degree. In reality, it might be possible.

We can take these data points out.

incomecensus <- incomecensus %>%
  filter(!(age <= 20 & education == 'Masters'))

B. sex and relationship status

table(incomecensus$sex, incomecensus$relationship)

##         
##          Husband Not-in-family Other-relative Own-child Unmarried  Wife
##   Female       1          3262            384      1576      2354  1365
##   Male     10808          3853            489      2077       732     1

There is one record where Husband is female

There is also one record where wife is female.

In 1994, this situation was not possible in the US

Where are these records ?

incomecensus %>%
  filter((sex == 'Male' & relationship == 'Wife')  | (sex == 'Female' & relationship == 'Husband'))

## # A tibble: 2 x 14
##     age workclass education education.num marital.status occupation relationship
##   <dbl> <fct>     <fct>             <dbl> <fct>          <fct>      <fct>       
## 1    29 Private   Bachelors            13 Married-civ-s~ Exec-mana~ Wife        
## 2    34 Private   HS-grad               9 Married-civ-s~ Sales      Husband     
## # ... with 7 more variables: race <fct>, sex <fct>, capital.gain <dbl>,
## #   capital.loss <dbl>, hours.per.week <dbl>, native.country <fct>,
## #   income <fct>

These points can be removed using the code below.

incomecensus <- incomecensus %>%
  filter(!((sex == 'Male' & relationship == 'Wife') | (sex == 'Female' & relationship == 'Husband')))

C. Relationship and marital status

Marital Status: The marital status classification identifies five major categories: never married, married, widowed, and divorced.

table(incomecensus$relationship, incomecensus$marital.status)

##                 
##                  Divorced Married-AF-spouse Married-civ-spouse
##   Husband               0                 9              10799
##   Not-in-family      2153                 0                 14
##   Other-relative      102                 1                118
##   Own-child           305                 1                 83
##   Unmarried          1449                 0                  0
##   Wife                  0                10               1355
##                 
##                  Married-spouse-absent Never-married Separated Widowed
##   Husband                            0             0         0       0
##   Not-in-family                    181          3959       382     426
##   Other-relative                    26           533        53      40
##   Own-child                         43          3120        89      12
##   Unmarried                        120           774       404     339
##   Wife                               0             0         0       0

The group “other married, spouse absent” includes married people living apart because either the husband or wife was employed and living at a considerable distance from home, was serving away from home in the Armed Forces, had moved to another area, or had a different place of residence for any other reason except separation as defined above. (https://www.unmarried.org/government-terminology/)

incomecensus %>%
  filter(marital.status == 'Married-spouse-absent' & relationship == 'Unmarried') %>% 
  summarise(count = n())

## # A tibble: 1 x 1
##   count
##   <int>
## 1   120

There are 120 rows where people are ‘Married-spouse-absent’ and ‘Unmarried’ at the same time.

Thers records will be removed.

incomecensus <- incomecensus %>%
  filter(!(marital.status == 'Married-spouse-absent' & relationship == 'Unmarried'))

D. age and weekly working hour

incomecensus %>%
  ggplot(aes(age, hours.per.week)) + 
  geom_point(size = 2, shape = 23, color = "blue") + 
  theme(legend.position="bottom") + 
  scale_x_continuous(breaks = seq(0,100,10))+ 
  scale_y_continuous(breaks = seq(0,100,10))+ 
  xlab("Age") + 
  ylab("Weekly working hours") +
  ggtitle("Age vs Weekly working hours ")

Some people aged older than 70 and work more than 80 hours a week.

Even for people with 90 years of age, there are records of 100 hours per week.

To lighlight these data point

incomecensus %>%
  ggplot(aes(age, hours.per.week)) + 
  geom_point(alpha = 0.3, col = "#00AFBB") +
  geom_point(aes(col = (age >= 70 & hours.per.week >= 40)), 
             alpha = 1, 
             size = 3) + 
  scale_colour_manual(values = setNames(c('red','grey'),
                                        c(T, F))) +
  theme(legend.position="bottom") + 
  scale_x_continuous(breaks = seq(0,100,10))+ 
  scale_y_continuous(breaks = seq(0,100,10))+ 
  xlab("Age") + 
  ylab("Weekly working hours") +
  ggtitle("Age vs Weekly working hours ")

incomecensus %>%
  filter(age >= 70 & hours.per.week > 50) %>% 
  summarise(counts =n())

## # A tibble: 1 x 1
##   counts
##    <int>
## 1     25

Data points with age more than and equal to 70 years and working hours greater than 50 hours can be removed.

incomecensus <- incomecensus %>%
   filter(!(age >= 70 & hours.per.week > 50))

E. capital loss and capital gain

There are many points with 99999 as the Capital which seems suspicious and hence, should be eliminated from the analysis.

incomecensus %>%
  group_by(capital.gain) %>% 
  summarise(counts =n()) %>% 
  arrange(desc(capital.gain)) %>% 
  head(10)

## # A tibble: 10 x 2
##    capital.gain counts
##           <dbl>  <int>
##  1        99999    147
##  2        41310      2
##  3        34095      3
##  4        27828     32
##  5        25236     10
##  6        25124      1
##  7        22040      1
##  8        20051     31
##  9        18481      2
## 10        15831      6

We can remove all the records where the capital gain is equal to 99999.

incomecensus <- incomecensus %>%
   filter(!(capital.gain == 99999))

incomecensus %>%
  group_by(capital.loss) %>% 
  summarise(counts =n()) %>% 
  arrange(desc(capital.loss)) %>% 
  head(10)

## # A tibble: 10 x 2
##    capital.loss counts
##           <dbl>  <int>
##  1         4356      1
##  2         3900      2
##  3         3770      2
##  4         3683      2
##  5         3004      1
##  6         2824      8
##  7         2754      2
##  8         2603      4
##  9         2559     12
## 10         2547      4

F. Work class and occupation

table(incomecensus$occupation, incomecensus$workclass)

##                    
##                     Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc
##   Adm-clerical              308       269    2360           28               46
##   Armed-Forces                9         0       0            0                0
##   Craft-repair               62       138    2377           96              485
##   Exec-managerial           175       209    2296          360              369
##   Farming-fishing             8        27     427           51              415
##   Handlers-cleaners          21        44    1056            2               15
##   Machine-op-inspct          14        11    1571           10               35
##   Other-service              33       187    2349           27              171
##   Priv-house-serv             0         0     137            0                0
##   Prof-specialty            164       664    2001          138              346
##   Protective-serv            27       291     183            5                6
##   Sales                      14         7    2508          266              362
##   Tech-support               66        38     667            3               26
##   Transport-moving           24       112    1093           26              118
##                    
##                     State-gov Without-pay
##   Adm-clerical            245           3
##   Armed-Forces              0           0
##   Craft-repair             52           1
##   Exec-managerial         181           0
##   Farming-fishing          15           5
##   Handlers-cleaners         9           1
##   Machine-op-inspct        13           1
##   Other-service           119           0
##   Priv-house-serv           0           0
##   Prof-specialty          392           0
##   Protective-serv         112           0
##   Sales                    10           0
##   Tech-support             55           0
##   Transport-moving         40           1

We might remove the two levels: without-pay

We also might regroup these levels into 3 categoris : public, private, self-employeed

Exploratory data analysis (EDA)

A. income Categorical variable that contains yearly income of the respondent (“<=$50K” or “>50K”).

incomecensus %>% 
  group_by(income) %>% 
  summarise(count = n())

## # A tibble: 2 x 2
##   income count
##   <fct>  <int>
## 1 <=50K  19888
## 2 >50K    6720

income_prop <- incomecensus %>% 
  group_by(income) %>% 
  summarise(count = n()) %>% 
  ungroup()%>% 
  arrange(desc(income)) %>%
  mutate(percentage = round(count/sum(count),4)*100,
         lab.pos = cumsum(percentage)-0.5*percentage)

ggplot(data = income_prop, 
       aes(x = "", 
           y = percentage, 
           fill = income))+
  geom_bar(stat = "identity")+
  coord_polar("y") +
  geom_text(aes(y = lab.pos, 
                label = paste(percentage,"%", sep = "")), col = "blue", size = 5) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  theme_void() +
  theme(legend.title = element_text(color = "black", size = 14),
        legend.text = element_text(color = "black", size = 14))

The summary of the data shows that 75 % of the observations have an income less than or equal to 50k dollars.

Specifically, 22654 persons have an income <=50k dollars, while 7508 people earn more than 50k.

B. Age

incomecensus %>% ggplot(aes(age)) + 
  geom_histogram(fill= "lightblue",
                 color = 'blue',
                 binwidth = 5) +   
  labs(title= "Age Distribution") +
  theme(plot.title = element_text(hjust = 0.5))

summary(incomecensus$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   28.00   38.00   38.97   48.00   90.00

incomecensus %>% ggplot(aes(age)) + 
  geom_histogram(aes(fill=income),
                 color = 'grey',
                 binwidth = 1) +   
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  labs(title= "Age Distribution for Income")+
  theme(plot.title = element_text(hjust = 0.5))

incomecensus %>% 
  ggplot(aes(age, 
             fill= income)) +
  geom_density(alpha= 0.7, color = 'blue') +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  labs(x = "Age", y = "Density", title = "Density graph of age distribution")

Older people tends to earn more.

On the other hand, majority of the people with an age around 25 years earns less than 50k per annum.

C. workclass

library(ggthemes)
library(hrbrthemes)

## Warning: package 'hrbrthemes' was built under R version 3.6.3

## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.

##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and

##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

incomecensus %>%
  group_by(workclass) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(workclass, -counts),
             y  = counts,
             fill = Percentage)) +
  geom_bar(stat = "identity",
           width = 0.6) +
  scale_fill_gradient(low="skyblue1", high="royalblue4")+
  geom_text(aes(label = paste0(round(counts,1), "\n",Percentage,"%")), 
            vjust = -0.1, 
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,25000)) +
  theme_minimal() +
  labs(x = "Work class",
       y = "Frequency",
       caption = "Income census US 1994")

Almost three-quarters of sample work in the private sector.

In this data, there are dictinction between Local-gov, State-gov, and Federal-gov.

We might group these three category if there is no difference in income level between them.

library(ggpubr)

## Loading required package: magrittr

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

p1 <- incomecensus %>%
  group_by(workclass) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(workclass, counts),
             y  = counts)) +
  geom_bar(stat = "identity",
           width = 0.6,
           fill = "steelblue") +
  geom_text(aes(label = paste0(round(counts,1),"\n",Percentage,"%")), 
            vjust = 0.5, 
            hjust = -0.5,
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,20000)) +
  theme_minimal() +
  labs(x = "Work class",y = "Frequency") + 
  coord_flip()

p2 <- incomecensus %>% 
  group_by(workclass, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(workclass, n), 
             y = pct/100, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_y_continuous(name= "Percentage", 
                     labels = percent) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) + 
  coord_flip()

ggarrange(p1, p2, nrow = 1)

income_gov <- incomecensus %>% 
  filter(workclass %in% c("Local-gov", "State-gov", "Federal-gov")) %>% 
  group_by(workclass, income) %>%
  summarise(count = n()) %>% 
  mutate(pct = count/sum(count)) %>%
  arrange(desc(income), pct)

income_gov <-  income_gov %>% transmute(income, percent = count*100/sum(count))
income_gov

## # A tibble: 6 x 3
## # Groups:   workclass [3]
##   workclass   income percent
##   <fct>       <fct>    <dbl>
## 1 State-gov   >50K      27.0
## 2 Local-gov   >50K      29.4
## 3 Federal-gov >50K      38.4
## 4 Federal-gov <=50K     61.6
## 5 Local-gov   <=50K     70.6
## 6 State-gov   <=50K     73.0

incomecensus %>% 
  filter(workclass %in% c("Local-gov", "State-gov", "Federal-gov")) %>% 
  group_by(workclass, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(workclass, n), 
             y = pct, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank())

There is a slight difference between Local-gov and Stave-gov.

Working for the federal government can have a greater chance of getting an income higher than 50K.

incomecensus %>% 
  filter(workclass %in% c("Self-emp-not-inc", "Self-emp-inc", "Without-pay")) %>% 
  group_by(workclass, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(workclass, n), 
             y = pct, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank())

education

incomecensus %>% 
  group_by(education) %>% 
  summarise(counts = n()) %>% 
  arrange(desc(counts)) %>% 
  pander()

education	counts
HS-grad	8201
Some-college	5853
Bachelors	4447
Masters	1534
Assoc-voc	1248
Assoc-acdm	990
11th	937
10th	763
7th-8th	534
Prof-school	486
9th	445
12th	356
Doctorate	349
5th-6th	275
1st-4th	146
Preschool	44

incomecensus %>% 
  group_by(education.num) %>% 
  summarise(counts = n()) %>% 
  arrange(desc(counts)) %>% 
  pander()

education.num	counts
9	8201
10	5853
13	4447
14	1534
11	1248
12	990
7	937
6	763
4	534
15	486
5	445
8	356
16	349
3	275
2	146
1	44

We can see that these two columns are identical.

The column education.num is simply converted from the column education.

If we keep the column education which is factor type, we might regroup some categories in this column.

If we keep the column education.num which is numerical type,, we can keep the column as it is (numeric).

eudcation_df <- incomecensus %>%
  group_by(education.num) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(education.num)

eudcation_df %>% 
  pander()

education.num	counts	Percentage
1	44	0.17
2	146	0.55
3	275	1.03
4	534	2.01
5	445	1.67
6	763	2.87
7	937	3.52
8	356	1.34
9	8201	30.82
10	5853	22
11	1248	4.69
12	990	3.72
13	4447	16.71
14	1534	5.77
15	486	1.83
16	349	1.31

ggplot(data = eudcation_df ,
           aes(x= education.num,
               y  = counts,
               fill = Percentage)) +
  geom_bar(stat = "identity",
           width = 0.6) +
  scale_x_continuous(breaks = c(0:16)) +
  scale_fill_gradient(low="skyblue1", high="royalblue4")+
  geom_text(aes(label = paste0(round(counts,1), "\n","(",Percentage,"%)")), 
            vjust = -0.1, 
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,12000)) +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Frequency",
       caption = "Income census US 1994") +
  ggtitle("Distribution of years of education") +
    theme(plot.title = element_text(hjust = 0.5), 
        axis.title = element_text(size = 12), 
        axis.text.x = element_text(size = 12), 
        axis.text.y = element_text(size = 12),
        legend.position = "right", 
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(colour="black", size = 12))

About 50% of respondents have 9-10 years of education.

About 17% of people in the sample hold bachelor degree (13 years of education).

education.pct <- incomecensus %>%
  group_by(education.num, income) %>%
  summarize(count = n()) %>%
  mutate(pct = count/sum(count)) %>%
  arrange(desc(income), pct)

ggplot(education.pct, 
       aes(education.num, pct, fill = income)) + 
  geom_bar(stat="identity", position = "fill") + 
  geom_hline(yintercept = 0.2489, col = "blue") +
  scale_x_continuous(breaks = c(0:16)) +
  scale_y_continuous(labels=scales::percent) + 
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) + 
  ggtitle("Income by years of education") + 
  xlab("Education (years)") + 
  ylab("Percentage") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.title = element_text(size = 12), 
        axis.text.x = element_text(size = 12), 
        axis.text.y = element_text(size = 12),
        legend.position = "right", 
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(colour="black", size = 12))

Higher years of education led to a higher chance of having >50K.

marital status

incomecensus %>% 
  group_by(marital.status) %>% 
  summarise(counts = n()) %>% 
  arrange(desc(counts))

## # A tibble: 7 x 2
##   marital.status        counts
##   <fct>                  <int>
## 1 Married-civ-spouse     12236
## 2 Never-married           8369
## 3 Divorced                3996
## 4 Separated                926
## 5 Widowed                  811
## 6 Married-spouse-absent    249
## 7 Married-AF-spouse         21

we can put all the married people in the same group

This column have 7 labels, three of them are:

Married-AF-spouse: Married armed forces spouse
Married-civ-spouse: Married civilian spouse
Married-spouse-absent

These levels can be grouped into the group “married”.

Replace “Married-AF-spouse”, “Married-civ-spouse”, and “Married-spouse-absent” by “Married”

pat = c("Married-AF-spouse|Married-civ-spouse|Married-spouse-absent")
incomecensus <- incomecensus %>% 
  mutate(marital_status = stringr::str_replace_all(marital.status, pat, "Married"))

incomecensus <- incomecensus %>% 
  select(-marital.status)

marital_df <- incomecensus %>%
  group_by(marital_status) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),1)) %>% 
  arrange(desc(counts))

marital_df %>% 
  pander()

marital_status	counts	Percentage
Married	12506	47
Never-married	8369	31.5
Divorced	3996	15
Separated	926	3.5
Widowed	811	3

The proportion of married people that have an income higher than 50K is highest.

p3 <- incomecensus %>%
  group_by(marital_status) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(marital_status, counts),
             y  = counts)) +
  geom_bar(stat = "identity",
           width = 0.6,
           fill = "steelblue") +
  geom_text(aes(label = paste0(round(counts,1),"\n","(",Percentage,"%)")), 
            vjust = 0.4, 
            hjust = -0.5,
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,20000)) +
  theme_minimal() +
  labs(x = "Marital status",y = "Frequency") + 
  coord_flip()

p4 <- incomecensus %>% 
  group_by(marital_status, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(marital_status, n), 
             y = pct/100, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "Percentage", 
                     labels = scales::percent) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) + 
  coord_flip()

ggarrange(p3, p4, nrow = 1)

occupation

Category_df <- incomecensus %>%
  group_by(occupation) %>% 
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(counts)

ggplot(data = Category_df,
           aes(x= reorder(occupation, -counts),
               y  = counts,
               fill = Percentage)) +
  geom_bar(stat = "identity",
           width = 0.7) +
  geom_text(aes(label = paste0(counts,", ", Percentage,"%")), 
            vjust = 0.2, 
            hjust = -0.1,
            color = "darkblue", 
            size = 4) +
  scale_y_continuous(limits = c(0,5000)) +
  scale_fill_distiller(palette = "Spectral") +
  theme_minimal() +
  labs(x = "Category",
       y = "Frequency",
       caption = "Income census US 1994") +
  coord_flip()

occupation.pct <- incomecensus %>%
  group_by(occupation, income) %>%
  summarize(count = n()) %>%
  mutate(pct = count/sum(count)) %>%
  arrange(desc(income), pct)

occupation.pct$occupation <- factor(occupation.pct$occupation,
                                    levels = occupation.pct$occupation[1:(nrow(occupation.pct)/2)])

ggplot(occupation.pct, aes(reorder(occupation,-pct), pct, fill = income)) + 
  geom_bar(stat="identity", position = "fill") + 
  geom_hline(yintercept = 0.2489, col = "blue") +
  ggtitle("Income by occupation") + 
  xlab("Occupation") + 
  ylab("Percentage") + 
  scale_y_continuous(labels=scales::percent) + 
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) + 
  theme(plot.title = element_text(hjust = 0.5), 
        axis.title = element_text(size = 12), 
        axis.text.x = element_text(size = 12), 
        axis.text.y = element_text(size = 12),
        legend.position = "bottom", 
        legend.title = element_text(size = 12, face = "bold"),
        legend.text = element_text(colour="black", size = 12)) + 
  coord_flip()

About 25% of the sample has an income of more than 50K.

There are five categories of jobs that have a higher percentage: Protective service, tech-support, Sales, Exec-managerial, and Prof-speciality.

Question: Are these people work in the private sector?

incomecensus %>% 
  filter(occupation %in% c("Protective-serv", "Tech-support", "Sales", "Exec-managerial", "Prof-specialty")) %>% 
  group_by(workclass) %>% 
  summarise(freq =  n()) %>% 
  mutate(percentage = round(freq*100/sum(freq),1)) %>%
  arrange(desc(percentage)) %>% 
  pander()

workclass	freq	percentage
Private	7655	64.1
Local-gov	1209	10.1
Self-emp-not-inc	1109	9.3
Self-emp-inc	772	6.5
State-gov	750	6.3
Federal-gov	446	3.7

About two-third work in the private sector.

Governmental job accounts for 18.81 %.

Blue_Collar (Craft-repair,Farming-fishing,Handlers-cleaners,Machine-op-inspct,Transport-moving)

White_Collar (Adm-clerical, Sales, Tech-support,Protective-serv),

Exec_mgr_prof (Exec-managerial,Prof-specialty),

and Service_other (Armed-Forces, Other-service, Priv-house-serv)

occupation_distr <- incomecensus %>%
  group_by(occupation) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(occupation, counts),
             y  = counts)) +
  geom_bar(stat = "identity",
           width = 0.6,
           fill = "steelblue") +
  geom_text(aes(label = paste0(round(counts,1),"\n","(",Percentage,"%)")), 
            vjust = 0.4, hjust = -0.5,color = "darkblue", size = 4) +
  scale_y_continuous(limits = c(0,20000)) +
  theme_minimal() +
  labs(x = "Grouped Occupation", y = "Frequency") + 
  coord_flip()

occupation_pct <- incomecensus %>% 
  group_by(occupation, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(occupation, n), 
             y = pct/100, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "Percentage", 
                     labels = scales::percent) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) + 
  coord_flip()

ggarrange(occupation_distr, occupation_pct, nrow = 1)

incomecensus %>% 
  ggplot(aes(income, color= income, fill= income)) +
  geom_bar( alpha = 0.8, width = 0.8) +
  facet_grid(~ relationship) + 
  labs(x = "Incomes", y = "Count", title = "Incomes by relationship")

This column contains categories that might be overlap with other feature.

For example, unmarried people in the relationship attribute is the same as the unmarried level in the marital column.

We will remove this column

Another way is to conver this column into categories :

having child (leaving _with_child), with other relative, in couple not in family and unmarried = living alone

relationship_distr <- incomecensus %>%
  group_by(relationship) %>%
  summarise(counts = n()) %>% 
  mutate(Percentage = round(counts*100/sum(counts),2)) %>% 
  arrange(desc(counts)) %>% 
  ggplot(aes(x= reorder(relationship, counts),
             y  = counts)) +
  geom_bar(stat = "identity",
           width = 0.6,
           fill = "steelblue") +
  geom_text(aes(label = paste0(round(counts,1),"\n","(",Percentage,"%)")), 
            vjust = 0.4, hjust = -0.5,color = "darkblue", size = 4) +
  scale_y_continuous(limits = c(0,20000)) +
  theme_minimal() +
  labs(x = "Grouped Occupation", y = "Frequency") + 
  coord_flip()

relationship_pct <- incomecensus %>% 
  group_by(relationship, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(relationship, n), 
             y = pct/100, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "Percentage", 
                     labels = scales::percent) +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) + 
  coord_flip()

ggarrange(relationship_distr, relationship_pct, nrow = 1)

This variable may be redundant as it collapse somehow with the feature marital_status.

incomecensus <- incomecensus %>% 
  select(-relationship)

race

incomecensus %>% 
  ggplot(aes(race, fill= income)) +
  geom_bar(position = "fill") +
  labs(x = "Race", 
       y = "Proportion", 
       title = "Incomes by race")+
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  geom_hline(yintercept = 0.2489, col="blue") +
  coord_flip()

Black and “Amer-Indio-Eskima” earn less than 50k more frequently than the general population.

However, the percent of whites and “Asian-Pac-Islander” that earn more than 50k is over the general population average.

Sex Here we can see that the vast majority of people having an income greater than 50000 dollars are males.

gender_prop <- incomecensus %>% 
  group_by(sex) %>% 
  summarise(count = n()) %>% 
  ungroup()%>% 
  arrange(desc(sex)) %>%
  mutate(percentage = round(count/sum(count),4)*100,
         lab.pos = cumsum(percentage)-0.5*percentage)

gender_distr <- ggplot(data = gender_prop, 
       aes(x = "", 
           y = percentage, 
           fill = sex))+
  geom_bar(stat = "identity")+
  coord_polar("y") +
  geom_text(aes(y = lab.pos, 
                label = paste(percentage,"%", sep = "")), col = "blue", size = 4) +
  scale_fill_manual(values=c("orange", "lightblue"),
                    name = "Gender") +
  theme_void() +
  theme(legend.title = element_text(color = "black", size = 12),
        legend.text = element_text(color = "black", size = 12))


gender_prop <- incomecensus %>% 
  group_by(sex, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(sex, n), 
             y = pct, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "black") +
  theme(axis.text.y = element_blank(),
        axis.text.x = element_text(color = "black", size = 12),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.title = element_text(color = "black", size = 12),
        legend.text = element_text(color = "black", size = 12))

ggarrange(gender_distr, gender_prop, nrow = 1)

Two-third of respondents are male.

The proportion of women that earn more than 50k is much lower than that of their male counterparts.

hour per week

Distribution of hours per week

incomecensus %>% ggplot(aes(hours.per.week)) + 
  geom_histogram(fill= "orange",
                 color = 'blue',
                 binwidth = 5) +   
  labs(title= "Age Distribution") +
  theme(plot.title = element_text(hjust = 0.5))

incomecensus %>% 
  group_by(income) %>%
  summarise("Mean hours per week" = mean(hours.per.week),
            "Standard deviatrion" = sd(hours.per.week))

## # A tibble: 2 x 3
##   income `Mean hours per week` `Standard deviatrion`
##   <fct>                  <dbl>                 <dbl>
## 1 <=50K                   39.5                  12.3
## 2 >50K                    45.8                  11.0

Contrusting box-plot

ggplot(data = incomecensus, 
       aes(income, 
           hours.per.week, 
           fill = income))+
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) + 
  geom_boxplot()+
  labs(x = "Incomes", 
       y = "Worked hours per week", 
       title = "Incomes by working hours")

There are many outliers.

Some people work for 100 hours a week (which is possible)

native.country

Incomes by origin

There are 40 different native countries and some of them have just a few cases, for instance, honduras have just 10 cases.

Therefore, the first attempt is to group the countries by continent.

incomecensus <- incomecensus %>% 
  mutate(native_continent = case_when(
    native.country %in% c("France", "Greece", "Hungary", "Italy", "Portugal",
                          "Scotland", "England", "Germany", "Holand-Netherlands",
                          "Ireland", "Poland", "Yugoslavia") ~ "Europe",
    native.country %in% c("Columbia", "Dominican-Republic", "El-Salvador", "Haiti",
                          "Honduras", "Mexico", "Outlying-US(Guam-USVI-etc)",
                          "Cuba", "Ecuador", "Guatemala", "Jamaica", "Nicaragua",
                          "Peru", "Puerto-Rico", "Trinadad&Tobago") ~ "Latin America", # "Trinadad&Tobago" and "Outlying-US(Guam-USVI-etc" 
    native.country %in% c("Iran", "Japan", "Philippines", "Taiwan", "Vietnam", "Cambodia",
                          "China", "Hong", "India", "Laos", "South", "Thailand") ~ "Asia",
    native.country %in% c("United-States","Canada") ~ "USA/Canada"))

incomecensus %>% 
  group_by(native_continent) %>% 
  count()

## # A tibble: 4 x 2
## # Groups:   native_continent [4]
##   native_continent     n
##   <chr>            <int>
## 1 Asia               693
## 2 Europe             490
## 3 Latin America     1314
## 4 USA/Canada       24111

incomecensus %>% 
  group_by(native_continent, income) %>% 
  summarize(n = n()) %>% 
  mutate(pct = n*100/sum(n)) %>% 
  ggplot(aes(x = reorder(native_continent, n), 
             y = pct, 
             fill = income)) +
  geom_bar(stat = "identity", width = 0.6) + 
  scale_x_discrete(name = "") +
  scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
  geom_text(aes(label = paste0(round(pct,0),"%")), 
            position = position_stack(vjust = 0.5), 
            size = 4, 
            color = "blue") +
  theme(axis.text.y = element_blank(),
        axis.text.x = element_text(color = "black", size = 12),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.title = element_text(color = "black", size = 12),
        legend.text = element_text(color = "black", size = 12))

We will remove some columns

incomecensus <- incomecensus %>% 
  select(-c('education',"capital.gain","capital.loss", "native.country"))

The final data

str(incomecensus)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 26608 obs. of  10 variables:
##  $ age             : num  82 54 41 34 38 74 68 45 38 52 ...
##  $ workclass       : Factor w/ 7 levels "Federal-gov",..: 3 3 3 3 3 6 1 3 5 3 ...
##  $ education.num   : num  9 4 10 9 6 16 9 16 15 13 ...
##  $ occupation      : Factor w/ 4 levels "White_collar",..: 4 3 4 2 1 4 4 4 4 2 ...
##  $ race            : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 5 5 5 5 3 5 5 ...
##  $ sex             : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 1 1 2 1 ...
##  $ hours.per.week  : num  18 40 40 45 40 20 40 35 45 20 ...
##  $ income          : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 2 1 2 2 2 ...
##  $ marital_status  : chr  "Widowed" "Divorced" "Separated" "Divorced" ...
##  $ native_continent: chr  "USA/Canada" "USA/Canada" "USA/Canada" "USA/Canada" ...

we need to convert two columns marital_status and native_continent in factor

incomecensus$marital_status <- as.factor(incomecensus$marital_status)
incomecensus$native_continent <- as.factor(incomecensus$native_continent)

we will save this data as income.csv

write.csv(incomecensus,file = “income.csv”, row.names= FALSE)

Ensuring the quality of data

Minh Trung DANG

27/05/2020