This Tidyverse EXTEND assignment was performed on the basis of the Tidyverse CREATE assignment by Noori Selina. I am adding more Graphical Representation of census-income data analysis of annotated codes to her existing examples.

# Load necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Introduction

This vignette demonstrates how to perform data manipulation and analysis using the dplyr package. We will work with the Census Income dataset, which contains information on individuals’ demographics and income levels.The data set can be found here: https://www.kaggle.com/datasets/tawfikelmetwally/census-income-dataset/data

Data Loading

To load the data, I will be importing the data from a GitHub link.

census_data <- read_csv("https://raw.githubusercontent.com/NooriSelina/Data-607/main/censusincome.csv")

## Rows: 32561 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Workclass, Education, Marital Status, Occupation, Relationship, Rac...
## dbl (6): Age, Final Weight, EducationNum, Capital Gain, capital loss, Hours ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(census_data)

## # A tibble: 6 × 15
##     Age Workclass        `Final Weight` Education EducationNum `Marital Status` 
##   <dbl> <chr>                     <dbl> <chr>            <dbl> <chr>            
## 1    39 State-gov                 77516 Bachelors           13 Never-married    
## 2    50 Self-emp-not-inc          83311 Bachelors           13 Married-civ-spou…
## 3    38 Private                  215646 HS-grad              9 Divorced         
## 4    53 Private                  234721 11th                 7 Married-civ-spou…
## 5    28 Private                  338409 Bachelors           13 Married-civ-spou…
## 6    37 Private                  284582 Masters             14 Married-civ-spou…
## # ℹ 9 more variables: Occupation <chr>, Relationship <chr>, Race <chr>,
## #   Gender <chr>, `Capital Gain` <dbl>, `capital loss` <dbl>,
## #   `Hours per Week` <dbl>, `Native Country` <chr>, Income <chr>

Data Manipulation with dplyr

The dplyr package provides a set of functions for data manipulation. We will use the following functions to explore the dataset.

Filtering - Using the dplyr package, I will use the filter() function to filter the dataset based on specific criteria. In this case, we are interested in individuals with incomes greater than $50,000, so we filter the dataset to include only such individuals.

high_income_data <- census_data %>%
  filter(Income == ">50K")
head(high_income_data)

## # A tibble: 6 × 15
##     Age Workclass        `Final Weight` Education  EducationNum `Marital Status`
##   <dbl> <chr>                     <dbl> <chr>             <dbl> <chr>           
## 1    52 Self-emp-not-inc         209642 HS-grad               9 Married-civ-spo…
## 2    31 Private                   45781 Masters              14 Never-married   
## 3    42 Private                  159449 Bachelors            13 Married-civ-spo…
## 4    37 Private                  280464 Some-coll…           10 Married-civ-spo…
## 5    30 State-gov                141297 Bachelors            13 Married-civ-spo…
## 6    40 Private                  121772 Assoc-voc            11 Married-civ-spo…
## # ℹ 9 more variables: Occupation <chr>, Relationship <chr>, Race <chr>,
## #   Gender <chr>, `Capital Gain` <dbl>, `capital loss` <dbl>,
## #   `Hours per Week` <dbl>, `Native Country` <chr>, Income <chr>

Grouping and Summarizing - The group_by() and summarize() functions of the dplyr package are valuable for aggregating data. We will group the data by education level and calculate summary statistics for age and hours worked per week within each education category.

income_summary <- high_income_data %>%
  group_by(Income) %>%
  summarize(
    mean_age = mean(Age, na.rm = TRUE),
    median_hours = median(`Hours per Week`, na.rm = TRUE)
  )

print(income_summary)

## # A tibble: 1 × 3
##   Income mean_age median_hours
##   <chr>     <dbl>        <dbl>
## 1 >50K       44.2           40

Sorting - The arrange() function is used to sort the summarized data. In this example, we sort the education summary by mean age in descending order, which helps identify the education categories with the highest mean age.

high_income_data <- high_income_data %>%
  arrange(desc(Age))

print(head(high_income_data, 10))

## # A tibble: 10 × 15
##      Age Workclass    `Final Weight` Education    EducationNum `Marital Status` 
##    <dbl> <chr>                 <dbl> <chr>               <dbl> <chr>            
##  1    90 Local-gov            227796 Masters                14 Married-civ-spou…
##  2    90 Private               51744 Masters                14 Never-married    
##  3    90 Private               87372 Prof-school            15 Married-civ-spou…
##  4    90 Private               46786 Bachelors              13 Married-civ-spou…
##  5    90 Private              175491 HS-grad                 9 Married-civ-spou…
##  6    90 Private               88991 Bachelors              13 Married-civ-spou…
##  7    90 Private              206667 Masters                14 Married-civ-spou…
##  8    90 ?                    313986 HS-grad                 9 Married-civ-spou…
##  9    84 Self-emp-inc         172907 Some-college           10 Married-civ-spou…
## 10    83 Self-emp-inc         240150 10th                    6 Married-civ-spou…
## # ℹ 9 more variables: Occupation <chr>, Relationship <chr>, Race <chr>,
## #   Gender <chr>, `Capital Gain` <dbl>, `capital loss` <dbl>,
## #   `Hours per Week` <dbl>, `Native Country` <chr>, Income <chr>

Conclusion

In this vignette, the dplyr package from the tidyverse collection was used to efficiently manage and analyze the Census Income dataset. Specifically, the data was filtered using the filter() function to focus on high-income individuals, grouped by income levels using the group_by() function, and then sorted by mean age with the arrange() function.

This approach uncovered valuable insights about the demographics of high-income individuals. The dplyr package, along with other tools from the tidyverse, made data manipulation tasks straightforward and effective.

TidyVerse EXTEND: Graphical Representation of census-income data analysis

The first variable Age is a continuous variable. As an initial step, two histograms are plotted.

# histogram of age by income group
ggplot(census_data) + aes(x=as.numeric(Age), group=Income, fill=Income) + 
  geom_histogram(binwidth=1, color='black')

# histogram of age by gender group
ggplot(census_data) + aes(x=as.numeric(Age), group=Gender, fill=Gender) + 
  geom_histogram(binwidth=1, color='black')

It is noticed that majority of the female has income less than 50K but make can make money more than $50,000 a year. For those do make over $50,000 annually, they are mainly in midcareer. Interestingly, females are It is noticed that majority of the observations make less than $50,000 a year. For those do make over $50,000 annually, they are mainly in midcareer. Interestingly, females are overrepresented. This could be possibly caused by census bias. The average age of most working group lies around 42 for both gender who earn maximum annual income and there are more male than female in the working industries.

The variable Workclass stands for the industry in which the responding unit is employed.

library(ggplot2)
library(dplyr)

# Delete rows with "?" and "Armed-Forces" in Workclass Column
census_data <- census_data %>%
  filter(!(Workclass %in% c("?", "Never-worked")))

# combine into Government job
census_data$Workclass<- gsub('^Federal-gov', 'Government', census_data$Workclass)
census_data$Workclass <- gsub('^Local-gov', 'Government', census_data$Workclass)
census_data$Workclass <- gsub('^State-gov', 'Government', census_data$Workclass) 

# combine into Sele-Employed job
census_data$Workclass <- gsub('Self-emp-inc','Self-Employed', census_data$Workclass)
census_data$Workclass <- gsub('Self-emp-not-inc','Self-Employed', census_data$Workclass)

census_data$Workclass <- as.factor(census_data$Workclass)
summary(census_data$Workclass)

##    Government       Private Self-Employed   Without-pay 
##          4351         22696          3657            14

Notice that there are two small groups – Never-worked and Without-pay. I will combine them with Unknowns into a group called Other/Unknown. Those who work in the government are further break down into federal, state, and local levels. To facilitate the analysis, I group them into one group called Government. While those who are self-employed fall into two groups, incorporated and not incorporated, and are grouped into Self-Employed.

To calculate counts and proportions, and then creates a bar plot showing the relationship between industry and income across different Workclass categories.

# barplot of job type by income group
# get the counts by industry and income group
count <- table(census_data[census_data$Workclass == 'Government',]$Income)["<=50K"]
count <- c(count, table(census_data[census_data$Workclass == 'Government',]$Income)[">50K"])

count <- c(count, table(census_data[census_data$Workclass == 'Private',]$Income)["<=50K"])
count <- c(count, table(census_data[census_data$Workclass == 'Private',]$Income)[">50K"])
count <- c(count, table(census_data[census_data$Workclass == 'Self-Employed',]$Income)["<=50K"])
count <- c(count, table(census_data[census_data$Workclass == 'Self-Employed',]$Income)[">50K"])
count <- c(count, table(census_data[census_data$Workclass == 'Without-pay',]$Income)["<=50K"])
count <- c(count, table(census_data[census_data$Workclass == 'Without-pay',]$Income)[">50K"])
count <- as.numeric(count)

# create a dataframe
industry <- rep(levels(census_data$Workclass), each = 2)
Income <- rep(c('<=50K', '>50K'), 4)
df <- data.frame(industry, Income, count)
df

##        industry Income count
## 1    Government  <=50K  3010
## 2    Government   >50K  1341
## 3       Private  <=50K 17733
## 4       Private   >50K  4963
## 5 Self-Employed  <=50K  2311
## 6 Self-Employed   >50K  1346
## 7   Without-pay  <=50K    14
## 8   Without-pay   >50K    NA

# Get the counts by Workclass and Income group
counts_workclass_income <- sapply(unique(census_data$Workclass), function(w) {
  sapply(c('<=50K', '>50K'), function(inc) {
    sum(census_data$Income[census_data$Workclass == w] == inc)
  })
})

# Create a dataframe
df <- data.frame(
  Workclass = rep(unique(census_data$Workclass), each = 2),
  Income = rep(c('<=50K', '>50K'), times = length(unique(census_data$Workclass))),
  Count = as.vector(counts_workclass_income)
)

# Plotting the bar plot
ggplot(df, aes(x = Workclass, y = Count, fill = Income)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Workclass", y = "Count", title = "Counts of Income Group by Workclass") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

Those who are self employed have the highest tendency of making greater than $50,000 a year.

Since EducationNum is a continuous representation of education, a stacked bar plot is used to visualize the relationship between EducationNum and Income, in-group proportions are calculated as well.

# create a dataframe
df1 <- data.frame(table(census_data$Income, census_data$EducationNum))
names(df1) <- c('Income', 'EducationNum', 'count')
df1

##    Income EducationNum count
## 1   <=50K            1    46
## 2    >50K            1     0
## 3   <=50K            2   150
## 4    >50K            2     6
## 5   <=50K            3   289
## 6    >50K            3    14
## 7   <=50K            4   535
## 8    >50K            4    38
## 9   <=50K            5   437
## 10   >50K            5    26
## 11  <=50K            6   771
## 12   >50K            6    60
## 13  <=50K            7   996
## 14   >50K            7    60
## 15  <=50K            8   362
## 16   >50K            8    31
## 17  <=50K            9  8339
## 18   >50K            9  1629
## 19  <=50K           10  5423
## 20   >50K           10  1352
## 21  <=50K           11   973
## 22   >50K           11   348
## 23  <=50K           12   761
## 24   >50K           12   259
## 25  <=50K           13  3006
## 26   >50K           13  2176
## 27  <=50K           14   734
## 28   >50K           14   941
## 29  <=50K           15   143
## 30   >50K           15   415
## 31  <=50K           16   103
## 32   >50K           16   295

# Plotting the bar plot
ggplot(df1, aes(x = as.factor(EducationNum), y = count, fill = Income)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Education Number", y = "Count", title = "Counts of Income by Education Number") +
  theme_minimal()

The bars illustrates that the level of education rises, so does the likelihood of earning more than $50,000 annually. Individuals with less than or equal to 8 years of education have less than 10% chance of achieving this income level. Conversely, for those with doctorate degrees, almost 75% surpass the $50,000 yearly income threshold.

# Delete rows with "?" and "Armed-Forces" in Occupation Column
census_data <- census_data %>%
  filter(!(Occupation %in% c("?", "Armed-Forces")))

census_data$Occupation <- gsub('Adm-clerical', 'White-Collar', census_data$Occupation)
census_data$Occupation <- gsub('Craft-repair', 'Blue-Collar', census_data$Occupation)
census_data$Occupation <- gsub('Exec-managerial', 'White-Collar', census_data$Occupation)
census_data$Occupation <- gsub('Farming-fishing', 'Blue-Collar', census_data$Occupation)
census_data$Occupation <- gsub('Handlers-cleaners', 'Blue-Collar', census_data$Occupation)
census_data$Occupation <- gsub('Machine-op-inspct', 'Blue-Collar', census_data$Occupation)
census_data$Occupation <- gsub('Other-service', 'Service', census_data$Occupation)
census_data$Occupation <- gsub('Priv-house-serv', 'Service', census_data$Occupation)
census_data$Occupation <- gsub('Prof-specialty', 'Professional', census_data$Occupation)
census_data$Occupation <- gsub('Protective-serv', 'Service', census_data$Occupation)
census_data$Occupation <- gsub('Tech-support', 'Service', census_data$Occupation)
census_data$Occupation <- gsub('Transport-moving', 'Blue-Collar', census_data$Occupation)

census_data$Occupation <- as.factor(census_data$Occupation)
summary(census_data$Occupation)

##  Blue-Collar Professional        Sales      Service White-Collar 
##        10062         4140         3650         5021         7836

# create a dataframe
df2 <- data.frame(table(census_data$Income, census_data$Occupation))
names(df2) <- c('Income', 'Occupation', 'count')
df2

##    Income   Occupation count
## 1   <=50K  Blue-Collar  8362
## 2    >50K  Blue-Collar  1700
## 3   <=50K Professional  2281
## 4    >50K Professional  1859
## 5   <=50K        Sales  2667
## 6    >50K        Sales   983
## 7   <=50K      Service  4389
## 8    >50K      Service   632
## 9   <=50K White-Collar  5361
## 10   >50K White-Collar  2475

# using dplyr functions
df2 <- df2 %>%
  group_by(Occupation) %>%
  mutate(percent = count/sum(count) * 100,
         pos = cumsum(count) - 0.5 * count)

# Plotting the bar plot for df2
ggplot(df2, aes(x = Occupation, y = count, fill = Income)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Occupation", y = "Count", fill = "Income", title = "Count of Income Levels by Occupation") +
  geom_text(aes(label = paste0(round(percent), "%")), position = position_dodge(width = 0.9), vjust = -0.5) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

There’s a significant income disparity among distinct occupations. Approximately half of those in Professional roles earn more than $50,000 yearly, contrasting with just 14% within the Service occupation achieving this income level.

DATA 607 Tidyverse Extend Assignment

Lwin Shwe

2023-11-18

Introduction

Data Loading

Data Manipulation with dplyr

Conclusion

TidyVerse EXTEND: Graphical Representation of census-income data analysis