First, we need to load the tidyverse library, which will help us manipulate the data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We then import the CSV file. From this Week 3 Data Dive onward, we are working with the Census Income Data Set.
income_data <- read.csv("./censusincome.csv")
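Since the tidyverse is already loaded, the same file could alternatively be read with readr's read_csv(), which trims surrounding whitespace in character fields by default (trim_ws = TRUE). The sketch below assumes the same file path and column names, and the object name income_data_trimmed is only illustrative; note that the rest of this analysis keeps the read.csv() values, which retain a leading space (e.g. " ?", " Private", " <=50K").
# alternative import (sketch): readr trims surrounding whitespace,
# so categories would come back as "?", "Private", "<=50K" rather than " ?", " Private", " <=50K"
income_data_trimmed <- read_csv("./censusincome.csv", trim_ws = TRUE, show_col_types = FALSE)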
group_by_workclass <- income_data |>
  group_by(workclass) |>
  summarize(avg_hours_per_week = mean(hours_per_work, na.rm = TRUE))
head(group_by_workclass)
## # A tibble: 6 × 2
## workclass avg_hours_per_week
## <chr> <dbl>
## 1 " ?" 31.9
## 2 " Federal-gov" 41.4
## 3 " Local-gov" 41.0
## 4 " Never-worked" 28.4
## 5 " Private" 40.3
## 6 " Self-emp-inc" 48.8
# drop rows where the workclass is unknown (coded as " ?")
cleaned_income_data <- income_data |>
  filter(workclass != " ?")
# creating the grouping by workclass
group_by_workclass <- cleaned_income_data |>
  group_by(workclass) |>
  summarise(
    avg_hours_per_week = mean(hours_per_work, na.rm = TRUE),
    number_of_employees = n()
  )
group_by_workclass
## # A tibble: 8 × 3
## workclass avg_hours_per_week number_of_employees
## <chr> <dbl> <int>
## 1 " Federal-gov" 41.4 960
## 2 " Local-gov" 41.0 2093
## 3 " Never-worked" 28.4 7
## 4 " Private" 40.3 22696
## 5 " Self-emp-inc" 48.8 1116
## 6 " Self-emp-not-inc" 44.4 2541
## 7 " State-gov" 39.0 1298
## 8 " Without-pay" 32.7 14
# creating a line plot of average hours per week by workclass
ggplot(group_by_workclass, aes(x = workclass, y = avg_hours_per_week, group = 1)) +
  geom_line(color = "blue", linewidth = 0.9) +
  geom_point(color = "blue", size = 2) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Average Hours Worked Per Workclass", x = "Workclass", y = "Average Hours Per Week")
# creating the group by education
group_by_education <- cleaned_income_data |>
  group_by(education) |>
  summarise(
    avg_hours_per_week = mean(hours_per_work),
    number_of_employees = n()
  )
group_by_education
## # A tibble: 16 × 3
## education avg_hours_per_week number_of_employees
## <chr> <dbl> <int>
## 1 " 10th" 37.5 833
## 2 " 11th" 34.2 1057
## 3 " 12th" 35.9 393
## 4 " 1st-4th" 38.5 156
## 5 " 5th-6th" 39.1 303
## 6 " 7th-8th" 40.2 574
## 7 " 9th" 38.7 463
## 8 " Assoc-acdm" 41.2 1020
## 9 " Assoc-voc" 42.0 1321
## 10 " Bachelors" 43.0 5182
## 11 " Doctorate" 47.6 398
## 12 " HS-grad" 41.0 9969
## 13 " Masters" 44.2 1675
## 14 " Preschool" 36.9 46
## 15 " Prof-school" 48.0 558
## 16 " Some-college" 39.4 6777
For education level versus average hours per week, we can build a visualization as follows:
ggplot(group_by_education, aes(x = education, y = avg_hours_per_week)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Education Level vs. Average Hours Per Week",
       x = "Education Level",
       y = "Average Hours Per Week")
From the above plot, we can see that the Prof-school education level has the highest average hours per week of all the education levels in our data set. However, there is no glaring difference in average hours per week across the groupings: the averages lie at roughly 40 hours per week for all the categories.
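To check this numerically rather than by eye, one can look at the spread of the group averages already computed above; a minimal sketch:
# spread of the average weekly hours across education levels (sketch)
range(group_by_education$avg_hours_per_week)
diff(range(group_by_education$avg_hours_per_week))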
Group by marital status
# creating the grouping by marital status
group_by_marital <- cleaned_income_data |>
  group_by(marital_status) |>
  summarise(
    avg_hours_per_week = mean(hours_per_work),
    number_of_employees = n()
  )
group_by_marital
## # A tibble: 7 × 3
## marital_status avg_hours_per_week number_of_employees
## <chr> <dbl> <int>
## 1 " Divorced" 41.5 4259
## 2 " Married-AF-spouse" 44.2 21
## 3 " Married-civ-spouse" 43.8 14340
## 4 " Married-spouse-absent" 40.0 389
## 5 " Never-married" 37.3 9917
## 6 " Separated" 39.7 959
## 7 " Widowed" 34.6 840
For the marital status categories versus average working hours per week (the results above), we can create the plot below to visualize them better.
# creating a line plot of average hours per week by marital status category
ggplot(group_by_marital, aes(x = marital_status, y = avg_hours_per_week, group = 1)) +
  geom_line(color = "blue", linewidth = 0.9) +
  geom_point(color = "blue", size = 2) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Average Hours Worked Per Marital Status Category", x = "Marital Status", y = "Average Hours Per Week")
From the plot above, we can see that people married to a spouse in the Armed Forces tend to work the longest hours per week. This could probably be explained by the fact that their partner is away on deployment much of the time, exposing them to slightly increased financial pressure and hence the need to work longer hours.
Interestingly, the Widowed group has the lowest average hours per week. One would have expected the contrary, since a widowed spouse has to shoulder more of the financial responsibilities alone. One could argue, however, that they are grieving and thus not able to work as they normally would, which also makes logical sense.
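Both of these extremes can also be confirmed directly from the grouped table rather than read off the plot; a short sketch using dplyr:
# marital status categories with the longest and shortest average work weeks (sketch)
group_by_marital |> slice_max(avg_hours_per_week, n = 1)
group_by_marital |> slice_min(avg_hours_per_week, n = 1)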
group_by_workclass_income <- cleaned_income_data |>
  group_by(workclass, income) |>
  summarise(number_of_employees = n())
## `summarise()` has grouped output by 'workclass'. You can override using the
## `.groups` argument.
group_by_workclass_income
## # A tibble: 14 × 3
## # Groups: workclass [8]
## workclass income number_of_employees
## <chr> <chr> <int>
## 1 " Federal-gov" " <=50K" 589
## 2 " Federal-gov" " >50K" 371
## 3 " Local-gov" " <=50K" 1476
## 4 " Local-gov" " >50K" 617
## 5 " Never-worked" " <=50K" 7
## 6 " Private" " <=50K" 17733
## 7 " Private" " >50K" 4963
## 8 " Self-emp-inc" " <=50K" 494
## 9 " Self-emp-inc" " >50K" 622
## 10 " Self-emp-not-inc" " <=50K" 1817
## 11 " Self-emp-not-inc" " >50K" 724
## 12 " State-gov" " <=50K" 945
## 13 " State-gov" " >50K" 353
## 14 " Without-pay" " <=50K" 14
We can see from the above results that our groupings for the workclass category have now been subdivided even further, i.e. into \(\le 50k\) and \(> 50k\) income categories. To visualize this better, we can plot the results above using a bar plot as follows:
ggplot(group_by_workclass_income, aes(x = workclass, y = number_of_employees, fill = income)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Number of Employees by Workclass and Income", x = "Workclass", y = "Number of Employees") +
  scale_fill_manual(values = c(" <=50K" = "red", " >50K" = "blue"))
From the plot above, we can see that the split between people above and below the 50k limit does not vary dramatically for most of the working classes, but it differs hugely for the Private class. This makes logical sense because, as we saw earlier when analyzing workclass vs. average hours per week, the Private class has by far the largest number of employees of all the working classes in our data set.
We can also see that, among people earning above 50k, the most common group is the Private working class and the least common is the Never-worked class. In fact, the latter combination is completely missing, i.e. the probability of someone in the Never-worked class earning above 50k is zero. This makes sense, because it is hard to find a person who has never held a formal job earning an income above 50k.
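To put concrete numbers on these shares, the counts above can be converted into within-workclass proportions; a minimal sketch:
# share of each income bracket within every workclass (sketch)
group_by_workclass_income |>
  group_by(workclass) |>
  mutate(share = number_of_employees / sum(number_of_employees)) |>
  arrange(workclass, income)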
Groupings help us create categories within our data set and visualize it better. For the Census Income data, we luckily had enough categorical columns, so we did not have to create bins from continuous columns (a sketch of how one could do so is shown below). We were nevertheless able to use the grouping and probability concepts to derive key insights from our data set. Of course, even more groupings and combinations could be created for deeper analysis.
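If we did need to group on a continuous column, one option would be to cut it into bins first and then group on the result. The sketch below bins hours_per_work into broad bands; the breaks, labels, and the object name hours_binned are illustrative only, not part of the original analysis.
# illustrative binning of a continuous column before grouping (sketch)
hours_binned <- cleaned_income_data |>
  mutate(hours_band = cut(hours_per_work,
                          breaks = c(0, 20, 40, 60, Inf),
                          labels = c("<=20", "21-40", "41-60", "60+"),
                          include.lowest = TRUE)) |>
  group_by(hours_band) |>
  summarise(number_of_employees = n())
hours_binned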