Final Project: National Employment in the U.S.

Author

Qian He

Introduction:

My dataset is from the U.S. Bureau of Labor Statistics (BLS) and focuses on national employment in 2023 and projected employment in 2033 across different industries in the United States. It includes several quantitative variables, such as 2023 Employment, Projected 2033 Employment, Employment Change between 2023 and 2033, and Employment Percent Change. The dataset also contains categorical variables, including Industry Title and Display Level, which represent different industry categories and hierarchy levels within the employment classification system.

For the data cleaning part, I first removed missing values using filter() for variables such as Industry Title, Display Level, 2023 Employment, Projected 2033 Employment, Employment Percent Change, and related percentage variables. Then I filtered out general summary rows such as “Total employment” and “Self-employed workers” because they were not specific industries and could affect the visualization results. I also removed duplicated industry records with different display levels to make the dataset cleaner and easier to visualize.

I chose this dataset mainly because it allows me to compare current and projected employment distributions across industries. It also helps me understand how employment patterns may change and which industries are expected to grow or remain dominant, which is useful for my own future job search.

This topic is meaningful to me because employment trends are closely connected to career planning, economic development, and future job opportunities. As a data science student, understanding projected employment patterns helps me explore which industries may become more important in the future and how labor market changes can impact society and individual career decisions.

Background Research (reference: https://www.bls.gov/news.release/archives/ecopro_08292024.pdf)

According to employment projections for 2023–2033, data science and information security are among the fastest-growing occupations in the United States. The report also suggests that advances in AI and automation may reduce demand for some traditional office and sales jobs while increasing demand for technology-related careers.

Variable Definitions (reference: https://quarto.org/docs/authoring/tables.html)

Variable	Type	Meaning
Industry Title	Categorical	Name of the industry
Display Level	Categorical	Hierarchy level of the industry classification (lower levels: broader industries; higher levels: specialized sub-industries)
2023 Employment	Quantitative	Number of employed workers in the industry in 2023 (in millions)
2023 Percent of Occupation	Quantitative	Percentage of total occupations represented by the industry in 2023
2023 Percent of Industry	Quantitative	Percentage the sub-industry contributes within its broader industry category in 2023
Projected 2033 Employment	Quantitative	Predicted number of employed workers in the industry in 2033 (in millions)
Projected 2033 Percent of Occupation	Quantitative	Predicted percentage of total occupations represented by the industry in 2033
Projected 2033 Percent of Industry	Quantitative	Predicted percentage the sub-industry contributes within its broader industry category in 2033
Employment Change, 2023-2033	Quantitative	Numeric change in employment between 2023 and 2033
Employment Percent Change, 2023-2033	Quantitative	Percentage increase or decrease in employment between 2023 and 2033
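The two change variables can be related to the two employment variables with simple arithmetic. The sketch below assumes the usual definitions (change = projected level minus base-year level, percent change = change relative to the base-year level). The numbers are made-up toy values, not rows from the BLS file, and published figures are rounded, so recomputed values may differ slightly from the dataset.

```r
# Toy values (hypothetical), in the same units as the employment columns
emp_2023 <- c(12.1, 2.2, 0.6)
emp_2033 <- c(12.9, 2.1, 0.6)

# Numeric change: projected level minus base-year level
emp_change <- emp_2033 - emp_2023

# Percent change: change relative to the base-year level
emp_pct_change <- emp_change / emp_2023 * 100

result <- data.frame(
  emp_2023,
  emp_2033,
  emp_change     = round(emp_change, 1),
  emp_pct_change = round(emp_pct_change, 1)
)
print(result)
```

A negative value in either change column means the industry is projected to shrink.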

Load the dataset and filter NAs

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.1     ✔ readr     2.1.6
✔ ggplot2   4.0.3     ✔ stringr   1.6.0
✔ lubridate 1.9.5     ✔ tibble    3.3.1
✔ purrr     1.2.1     ✔ tidyr     1.3.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

employment <- read_csv("National Employment_dataBLSgov_projections.csv")

Rows: 84 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Industry Title, Industry Code, Industry Type
dbl (10): 2023 Employment, 2023 Percent of Occupation, 2023 Percent of Indus...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# drop rows with missing values in any of the analysis variables
employment_nona <- employment |>
  filter(!is.na(`Industry Title`),
         !is.na(`Display Level`),
         !is.na(`2023 Percent of Occupation`),
         !is.na(`Projected 2033 Employment`),
         !is.na(`Projected 2033 Percent of Industry`),
         !is.na(`Employment Percent Change, 2023-2033`),
         !is.na(`2023 Employment`),
         !is.na(`2023 Percent of Industry`),
         !is.na(`Projected 2033 Percent of Occupation`),
         !is.na(`Employment Change, 2023-2033`),
         # drop summary rows, which are aggregates rather than individual industries
         `Industry Title` != "Total employment",
         `Industry Title` != "Self-employed workers",
         `Industry Title` != "Total wage and salary employment",
         # drop duplicated records (same values, only the display level differs)
         !(`Industry Title` == "Professional, scientific, and technical services" & `Display Level` == 3),
         !(`Industry Title` == "Management of companies and enterprises" & `Display Level` == 3)) |>
  select(-`Industry Type`,
         -`Industry Code`,
         -`Industry Sort`)
employment_nona

# A tibble: 79 × 10
   `Industry Title`                     `2023 Employment` 2023 Percent of Occu…¹
   <chr>                                            <dbl>                  <dbl>
 1 Mining, quarrying, and oil and gas …               2.2                    0.4
 2 Oil and gas extraction                             1.7                    0.3
 3 Support activities for mining                      0.5                    0.1
 4 Utilities                                          0.6                    0.1
 5 Utilities                                          0.6                    0.1
 6 Electric power generation, transmis…               0.4                    0.1
 7 Electric power generation                          0.1                    0  
 8 Natural gas distribution                           0.2                    0  
 9 Construction                                      12.1                    2.2
10 Construction of buildings                         10.6                    1.9
# ℹ 69 more rows
# ℹ abbreviated name: ¹`2023 Percent of Occupation`
# ℹ 7 more variables: `2023 Percent of Industry` <dbl>,
#   `Projected 2033 Employment` <dbl>,
#   `Projected 2033 Percent of Occupation` <dbl>,
#   `Projected 2033 Percent of Industry` <dbl>,
#   `Employment Change, 2023-2033` <dbl>, …

#View(employment_nona)
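The printed tibble still lists Utilities twice (rows 4 and 5) with the same employment values, so naming duplicated titles one by one may miss some pairs. A more general approach, sketched here in base R on hypothetical toy data, is to compare rows on every column except the display level and keep the broadest level in each group. The column names are simplified stand-ins, not the dataset's real names.

```r
# Hypothetical toy data shaped like the cleaned table
toy <- data.frame(
  industry_title = c("Utilities", "Utilities", "Construction"),
  display_level  = c(2, 3, 2),
  employment     = c(0.6, 0.6, 12.1)
)

# Compare rows on every column except the display level
key_cols <- setdiff(names(toy), "display_level")

# Sort so the lowest (broadest) display level comes first within each title,
# then keep only the first row of each group of otherwise identical rows
toy_sorted <- toy[order(toy$industry_title, toy$display_level), ]
deduped <- toy_sorted[!duplicated(toy_sorted[key_cols]), ]
print(deduped)
```

Rows that share a title but differ in any other value are kept, so this only removes records that are true repeats.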

Multiple Linear Regression & Modeling

# correlation between y = Employment Change, 2023-2033 and x = 2023 Employment
library(ggfortify)
cor(employment_nona$`Employment Change, 2023-2033`,employment_nona$`2023 Employment`)

[1] 0.9986963

# fit a multiple linear regression of Employment Change, 2023-2033 on 2023 Employment,
# 2023 Percent of Industry, and 2023 Percent of Occupation
p1 <- lm(`Employment Change, 2023-2033` ~ `2023 Employment`+ `2023 Percent of Industry`+ `2023 Percent of Occupation`, data=employment_nona)
summary(p1)


Call:
lm(formula = `Employment Change, 2023-2033` ~ `2023 Employment` + 
    `2023 Percent of Industry` + `2023 Percent of Occupation`, 
    data = employment_nona)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.28037 -0.01545  0.01194  0.01288  0.30219 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                  -0.012770   0.011559  -1.105    0.273
`2023 Employment`             0.008271   0.075330   0.110    0.913
`2023 Percent of Industry`   -0.018670   0.021778  -0.857    0.394
`2023 Percent of Occupation`  0.257409   0.409011   0.629    0.531

Residual standard error: 0.09527 on 75 degrees of freedom
Multiple R-squared:  0.9974,    Adjusted R-squared:  0.9973 
F-statistic:  9733 on 3 and 75 DF,  p-value: < 2.2e-16

#diagnostic plots
autoplot(p1,1:4,nrow=2,ncol=2)

Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

options(scipen=0)

Regression Analysis

The regression model is: (Employment Change, 2023-2033) = -0.0128 + 0.0083(2023 Employment) - 0.0187(2023 Percent of Industry) + 0.2574(2023 Percent of Occupation).

The multiple R-squared is 0.9974 and the adjusted R-squared is 0.9973, which means about 99.7% of the variation in employment change can be explained by this model. This suggests the model has a very strong overall fit to the dataset.
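As a quick check, the adjusted value can be recovered by hand from the summary output, using n = 79 observations, p = 3 predictors (so 79 - 3 - 1 = 75 residual degrees of freedom), and the multiple R-squared of 0.9974. Because 0.9974 is itself rounded, the result is approximate.

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          = 1 - (1 - 0.9974)\,\frac{78}{75}
          \approx 0.9973
```

The adjustment is tiny here because the sample is much larger than the number of predictors.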

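The overall model is statistically significant because the p-value is less than 2.2e-16. However, when examining the individual variables, the p-values for 2023 Employment, 2023 Percent of Industry, and 2023 Percent of Occupation are all greater than 0.05. This indicates that none of these predictors is individually significant when they are combined in the same model. A likely explanation is multicollinearity: in the printed rows, 2023 Percent of Occupation rises almost in proportion to 2023 Employment, so the two predictors carry nearly the same information and the model cannot separate their individual effects, even though together they explain employment change very well.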
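One common way to check whether predictors overlap this much is the variance inflation factor (VIF). The sketch below computes it in base R on simulated stand-in data, since the real dataset is not reproduced here. `x1`, `x2`, and `x3` are hypothetical names, with `x2` built as a rounded rescaling of `x1` to mimic an employment count and its percentage share.

```r
set.seed(1)
n <- 79

# Simulated stand-ins (hypothetical): x2 is almost a rescaled copy of x1,
# while x3 is unrelated to both
x1 <- runif(n, 0, 50)
x2 <- round(x1 / 5.5, 1)
x3 <- runif(n, 0, 10)
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif, 1))
```

A VIF above roughly 10 is often read as a sign of serious multicollinearity. If the same pattern held in the real data, dropping one of the overlapping predictors would be a reasonable next step. The `car` package provides `vif()` as a ready-made alternative.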

From the diagnostic plots, most residuals are centered around zero, which suggests the model generally fits the data reasonably well. The Residuals vs Fitted plot shows slight curvature, meaning there may be some non-linear patterns in the data. The Normal Q-Q plot shows several points deviating from the straight line at both ends, suggesting a minor deviation from normality and the presence of a few outliers.

The Scale-Location plot shows some uneven spread of residuals, indicating slight heteroscedasticity (non-constant residual variance). In the Cook’s Distance plot, observations 39, 43, and 75 appear to have relatively larger influence on the model than the other observations.
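Influential observations can also be listed numerically instead of being read off the plot. The sketch below uses base R's `cooks.distance()` on simulated data with one deliberately planted unusual point. The 4/n cutoff is a common rule of thumb, not a formal test, and the data are hypothetical.

```r
set.seed(2)
n <- 79

# Simulated data (hypothetical) with a simple linear trend
x <- runif(n, 0, 10)
y <- 0.05 * x + rnorm(n, sd = 0.1)

# Plant one high-leverage point that does not follow the trend
x[n] <- 50
y[n] <- 0

fit <- lm(y ~ x)

# Cook's distance measures how much the fitted values move
# when a single observation is left out
cooks <- cooks.distance(fit)
threshold <- 4 / n
flagged <- which(cooks > threshold)
print(flagged)
```

Refitting the model without the flagged rows and comparing coefficients shows whether the conclusions depend on those few observations.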

Overall, the regression model demonstrates a strong relationship between projected employment change and the selected employment-related variables, although some individual predictors may not be statistically significant on their own.

Data Visualization

Animated Bar Graph

library(ggplot2)
library(gganimate)
library(gifski)

#create 2 new datasets for animation:data2023 , data2033
data2023<-employment_nona |>
  mutate(Year=2023,
         Employment=`2023 Employment`) 

data2033<-employment_nona |>
  mutate(Year=2033,
         Employment=`Projected 2033 Employment`)

#using rbind to combine data 2023 and data 2033 for animated bar graph
#(reference:https://www.r-bloggers.com/2024/04/data-frame-merging-in-r-with-examples/#google_vignette)
bar_data<-rbind(data2023,data2033)

# select the top 8 industries by 2023 employment for plotting

top8industries<-employment_nona |>
  arrange(desc(`2023 Employment`)) |>
  head(8)

#top8industries

#filter the combined data with dplyr for the 8 desired industries
top8<- bar_data |>
  filter( `Industry Title`%in% top8industries$`Industry Title`)
#top8

# abbreviate long industry names so the labels fit on the plot

top8a <- top8 |>
  #round employment to integers
  mutate(Employment = round(Employment)) 
top8a$`Industry Title`[top8a$`Industry Title` =="Real estate and rental and leasing"] <- "R.E. Rental& Leasing"
top8a$`Industry Title`[top8a$`Industry Title` =="Professional, scientific, and technical services"] <- "Scientific & Tech"
top8a$`Industry Title`[top8a$`Industry Title` =="Residential building construction"] <- "Residential Const."


#create an animated bar graph
ani_bar <- ggplot(
  top8a,
  aes(x = reorder(`Industry Title`, Employment),
      y = Employment,
      fill = `Industry Title`)) +
  geom_col() +
  geom_text(aes(label = Employment), vjust = 0.5)+
  scale_fill_brewer(palette = "Set3")+
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Top Industry Employment in {(closest_state)} ",
    x = "Industry Title",
    y = "Employment (in millions)",
    caption = "  Data in 2033 is projected.\nSource: U.S. Bureau of Labor Statistics (BLS)") +
  #animate between different years
  transition_states(Year,transition_length = 8,state_length = 5) +
  # use smooth linear movement during animation
  ease_aes("linear")+
  theme(plot.title=element_text(hjust = 0.5))
#(animation reference: https://www.r-bloggers.com/2024/06/r-gganimate-how-to-make-stunning-chart-animations-with-ggplot2/)
animate(
  # animated plot object
  ani_bar,
  # width of the gif
  width = 600,
  # height of the gif
  height = 400,
  # resolution of the gif
  res = 60,
  # total number of frames
  nframes = 8,
  # frames shown per second
  fps = 2,
  # save animation as a gif file
  renderer = gifski_renderer(file = "animated_bar_employment.gif"))

Tableau Dashboard

https://public.tableau.com/app/profile/vanessa.he8144/viz/Treemapbarchart_17778568684320/Dashboard1?publish=yes

Essay

The animated bar graph shows the employment distribution across the top 8 industries, moving from 2023 values to projected 2033 values. The real estate related industries remain the largest throughout the animation, while Finance and Insurance and Scientific and Tech remain relatively small. The visualization suggests that projected employment growth is concentrated in a few dominant industries rather than being evenly distributed across all sectors. Scientific and Tech, Residential Construction, Real Estate, and Real Estate Rental and Leasing show relatively larger employment changes from 2023 to 2033, while the remaining industries stay more stable.

The Tableau treemap highlights the relative size differences between industries in both 2023 and projected 2033 employment data. The real estate related industries occupy the largest portion of the treemap, indicating their dominant contribution to total employment. Smaller industries such as finance and insurance occupy much smaller areas, showing limited relative employment size. The dashboard makes it easier to visually compare proportional differences across industries.

One interesting observation from the visualization is that Real Estate appears to be a steadily growing industry, suggesting it may continue to provide strong employment opportunities in the future. In addition, the Scientific and Tech industry also shows growth potential and may become an increasingly important field over time.

One limitation of this visualization is that the dataset is relatively small, which limits the overall depth of the analysis. Another limitation is that the employment values vary greatly across industries, and the large differences make smaller industries difficult to see clearly in the bar graph and treemap. In the future, it would be helpful to include more industries and variables to provide a broader understanding of employment trends. Applying different scaling methods may also help improve the balance of the visualization.
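One scaling method that could address the size imbalance is a logarithmic axis. The base R sketch below uses made-up values, not dataset rows, to show how a log10 transform compresses a wide range so that small industries are no longer squeezed against zero.

```r
# Hypothetical employment values spanning more than two orders of magnitude
employment <- c(small = 0.2, medium = 2.2, large = 48.0)

# On a linear axis the smallest bar is a tiny fraction of the largest
linear_share <- employment / max(employment)

# After a log10 transform, equal ratios map to equal distances
log_position <- log10(employment)

print(round(linear_share, 3))
print(round(log_position, 2))
```

In ggplot2 the same idea is available through `scale_y_log10()`. Because a log axis has no zero baseline, points or a dot plot are usually easier to read than bars on that scale.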

Citation

Source: U.S. Bureau of Labor Statistics (BLS)

https://www.bls.gov/news.release/archives/ecopro_08292024.pdf (background reference)

https://quarto.org/docs/authoring/tables.html (table format)

https://www.r-bloggers.com/2024/04/data-frame-merging-in-r-with-examples/#google_vignette (rbind)

https://www.r-bloggers.com/2024/06/r-gganimate-how-to-make-stunning-chart-animations-with-ggplot2/ (bar graph animation)