This report explores two core datasets: one focused on educational attainment and income in the United States, and the other on global annual mean temperatures. Using R, I apply data wrangling, custom function creation, visualization, and classification techniques to uncover trends and insights across both domains.
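# Packages used throughout the report (versions are listed in the session info).
library(tidyverse)
library(janitor)
library(patchwork)

# Load the ACS data, standardise column names, and order the education levels.
acs_df <- readRDS("acs.rds") %>%
  clean_names() %>%
  mutate(
    edu = factor(edu, levels = c("Less than HS", "HS", "Some College",
                                 "Associate", "Bachelor", "Master",
                                 "Professional", "Doctorate"))
  )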
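Setting explicit factor levels controls the order used in later tables and plots, but it also has a side effect worth checking: any label that does not exactly match one of the levels is silently turned into NA. The following is a minimal sketch of that behavior, assuming hypothetical sample values (the vector `edu_raw` and the misspelled label "Bachelors" are made up for illustration and are not taken from acs.rds).

```r
# Hypothetical sample values; "Bachelors" is a deliberate mismatch with the level "Bachelor".
edu_raw <- c("HS", "Bachelor", "Less than HS", "Bachelors", "Doctorate")

edu_levels <- c("Less than HS", "HS", "Some College",
                "Associate", "Bachelor", "Master",
                "Professional", "Doctorate")

edu_fct <- factor(edu_raw, levels = edu_levels)

# Labels missing from `levels` become NA without any warning.
is_unmatched <- is.na(edu_fct) & !is.na(edu_raw)
n_unmatched <- sum(is_unmatched)
unmatched_labels <- unique(edu_raw[is_unmatched])

# Sorting follows the level order, not alphabetical order (NA values are dropped by sort()).
sorted_labels <- as.character(sort(edu_fct))

n_unmatched
unmatched_labels
sorted_labels
```

Running a check like this on the real column before plotting would show whether any education labels were lost during the conversion.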
# Count respondents and compute the median income for each education level.
acs_df %>%
  group_by(edu) %>%
  summarise(
    count = n(),
    median_income = median(income, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(median_income))
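To make the behavior of this summary concrete, here is a small sketch on hypothetical toy data (the data frame `toy_df` and its values are invented for illustration). It shows one detail that is easy to miss: `n()` counts every row in the group, while `median(income, na.rm = TRUE)` ignores rows with a missing income, so the two columns can be based on different numbers of observations.

```r
library(dplyr)

# Hypothetical toy data; the values are illustrative, not taken from acs.rds.
toy_df <- data.frame(
  edu    = c("HS", "HS", "HS", "Bachelor", "Bachelor", "Master"),
  income = c(30000, 40000, NA, 60000, 80000, 90000)
)

toy_summary <- toy_df %>%
  group_by(edu) %>%
  summarise(
    count = n(),                                   # all rows, including the NA income
    median_income = median(income, na.rm = TRUE),  # NA dropped before taking the median
    .groups = "drop"
  ) %>%
  arrange(desc(median_income))

toy_summary
```

In this toy example the HS group has a count of 3 but its median is computed from only two incomes.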
# Bar chart: number of respondents at each education level.
plot_a <- ggplot(acs_df, aes(x = edu)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Number of People by Education Level", x = "Education", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

# Box plot: income distribution at each education level.
plot_b <- ggplot(acs_df, aes(x = edu, y = income)) +
  geom_boxplot(fill = "purple") +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(title = "Household Income by Education Level", x = "Education", y = "Income") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

# Stack the two plots vertically with patchwork.
plot_a / plot_b
Insight: Median household income increases consistently with higher education levels.
# Load the temperature data and standardise column names.
temp_df <- read_csv("annual_mean_temperature.csv") %>%
  clean_names()
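# Convert Celsius to Fahrenheit, optionally lowering the Celsius value by
# 0.07 degrees per unit of `speed` before converting.
temp_fun <- function(celsius, speed = 0) {
  if (!is.numeric(celsius) || !is.numeric(speed)) {
    stop("This function only works for numeric inputs! You provided: ",
         class(celsius)[1], " (celsius) and ", class(speed)[1], " (speed)")
  }
  adjusted_temp <- celsius - (0.7 * speed / 10)
  fahrenheit <- adjusted_temp * 9 / 5 + 32
  return(fahrenheit)
}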
# Convert the annual mean from Celsius to Fahrenheit (speed left at its default of 0).
temp_df <- temp_df %>%
  mutate(annual_mean = temp_fun(annual_mean))
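A few spot checks help confirm that the conversion behaves as intended. The sketch below restates `temp_fun` with the same arithmetic so that it runs on its own, then calls it on well-known reference points. The input values are chosen for illustration, and since the report does not state the units of `speed`, the example only shows its arithmetic effect.

```r
# Copy of temp_fun (same arithmetic as in the report) so the example is self-contained.
temp_fun <- function(celsius, speed = 0) {
  if (!is.numeric(celsius) || !is.numeric(speed)) {
    stop("This function only works for numeric inputs!")
  }
  adjusted_temp <- celsius - (0.7 * speed / 10)
  adjusted_temp * 9 / 5 + 32
}

freezing  <- temp_fun(0)               # 32
boiling   <- temp_fun(100)             # 212
vectored  <- temp_fun(c(-40, 20))      # works element-wise: -40 and 68
with_adj  <- temp_fun(20, speed = 10)  # 0.7 is subtracted from 20 first: 66.74

# Non-numeric input triggers the guard instead of returning a value.
bad_input <- tryCatch(temp_fun("20"), error = function(e) "error")
```

Because the `mutate()` step overwrites `annual_mean` in place, running that step a second time would convert values that are already in Fahrenheit. Writing the result to a new column is one way to avoid that.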
# Classify each observation by its annual mean temperature (in Fahrenheit).
temp_df <- temp_df %>%
  mutate(
    climate = case_when(
      annual_mean < 40 ~ "cold",
      annual_mean >= 40 & annual_mean <= 60 ~ "temperate",
      annual_mean > 60 ~ "warm"
    )
  )
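The boundary behavior of this classification is easiest to see on a handful of values. The sketch below applies the same three conditions to a hypothetical vector `sample_f` (values chosen to sit on and around the cut-offs, not taken from the dataset). It shows that both 40 and 60 count as "temperate" and that a missing temperature stays NA because no condition matches.

```r
library(dplyr)

# Hypothetical Fahrenheit values chosen to hit each boundary, plus a missing value.
sample_f <- c(39.9, 40, 60, 60.1, NA)

climate <- case_when(
  sample_f < 40 ~ "cold",
  sample_f >= 40 & sample_f <= 60 ~ "temperate",
  sample_f > 60 ~ "warm"
)

climate
```

Rows with a missing `annual_mean` would therefore appear as an NA bar in any plot of `climate` unless they are filtered out first.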
# Keep only 2022 and plot the number of countries in each climate category.
temp_df_2022 <- temp_df %>% filter(year == 2022)

ggplot(temp_df_2022, aes(x = climate)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Country Counts by Climate (2022)", x = "Climate Category", y = "Number of Countries") +
  theme_minimal()
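Insight: In 2022, most countries fall into the “temperate” and “warm” categories. A single-year snapshot describes the current distribution only; showing a warming trend would require comparing these counts across years.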
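One way to see whether the mix of climate categories shifts over time is to compute each category's share per year. This is a rough sketch on hypothetical toy data: the data frame `toy_temp` and its values are invented, and only the column names `year` and `climate` mirror the report.

```r
library(dplyr)

# Hypothetical toy data with the same column names as temp_df; values are made up.
toy_temp <- data.frame(
  year    = c(2000, 2000, 2000, 2000, 2022, 2022, 2022, 2022),
  climate = c("cold", "temperate", "temperate", "warm",
              "cold", "temperate", "warm", "warm")
)

climate_share <- toy_temp %>%
  count(year, climate, name = "n_countries") %>%   # countries per year and category
  group_by(year) %>%
  mutate(share = n_countries / sum(n_countries)) %>%  # share within each year
  ungroup()

climate_share
```

Applied to the full `temp_df`, a table like this could be plotted with year on the x-axis to compare categories over time.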
This project showcases a range of data analysis skills, including data wrangling with dplyr and janitor, visualization with ggplot2 and patchwork, and classification with case_when().
Both parts demonstrate how data storytelling and technical skills come together to derive meaningful conclusions.
sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.2.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] janitor_2.2.1 patchwork_1.3.0 lubridate_1.9.3 forcats_1.0.0
## [5] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
## [9] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 stringi_1.8.3
## [5] hms_1.1.3 digest_0.6.35 magrittr_2.0.3 evaluate_1.0.3
## [9] grid_4.3.3 timechange_0.3.0 fastmap_1.2.0 jsonlite_1.8.8
## [13] fansi_1.0.6 scales_1.3.0 jquerylib_0.1.4 cli_3.6.2
## [17] crayon_1.5.2 rlang_1.1.3 bit64_4.0.5 munsell_0.5.0
## [21] withr_3.0.2 cachem_1.1.0 yaml_2.3.8 parallel_4.3.3
## [25] tools_4.3.3 tzdb_0.4.0 colorspace_2.1-0 vctrs_0.6.5
## [29] R6_2.5.1 lifecycle_1.0.4 snakecase_0.11.1 bit_4.0.5
## [33] vroom_1.6.5 pkgconfig_2.0.3 pillar_1.9.0 bslib_0.6.2
## [37] gtable_0.3.4 glue_1.7.0 xfun_0.43 tidyselect_1.2.1
## [41] highr_0.10 rstudioapi_0.16.0 knitr_1.45 farver_2.1.1
## [45] htmltools_0.5.8 rmarkdown_2.26 labeling_0.4.3 compiler_4.3.3