My dataset is from the U.S. Bureau of Labor Statistics (BLS) and focuses on national employment in 2023 and projected employment in 2033 across different industries in the United States. It includes several quantitative variables, such as 2023 Employment, Projected 2033 Employment, Employment Change between 2023 and 2033, and Employment Percent Change. The dataset also contains categorical variables, including Industry Title and Display Level, which represent different industry categories and hierarchy levels within the employment classification system.
For the data cleaning part, I first removed missing values using filter() for variables such as Industry Title, Display Level, 2023 Employment, Projected 2033 Employment, Employment Percent Change, and related percentage variables. Then I filtered out general summary rows such as “Total employment” and “Self-employed workers” because they were not specific industries and could affect the visualization results. I also removed duplicated industry records with different display levels to make the dataset cleaner and easier to visualize.
I chose this dataset mainly because it allows me to compare current and projected employment distributions across industries. It also helps me understand how employment patterns may change and which industries are expected to grow or remain dominant over time, which is useful for my own career planning.
This topic is meaningful to me because employment trends are closely connected to career planning, economic development, and future job opportunities. As a data science student, understanding projected employment patterns helps me explore which industries may become more important in the future and how labor market changes can impact society and individual career decisions.
Background Research (reference: https://www.bls.gov/news.release/archives/ecopro_08292024.pdf)
According to employment projections for 2023–2033, data science and information security are among the fastest-growing occupations in the United States. The report also suggests that advances in AI and automation may reduce demand for some traditional office and sales jobs while increasing demand for technology-related careers.
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 84 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Industry Title, Industry Code, Industry Type
dbl (10): 2023 Employment, 2023 Percent of Occupation, 2023 Percent of Indus...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# filter all NAs
employment_nona <- employment |>
  filter(
    !is.na(`Industry Title`),
    !is.na(`Display Level`),
    !is.na(`2023 Percent of Occupation`),
    !is.na(`Projected 2033 Employment`),
    !is.na(`Projected 2033 Percent of Industry`),
    !is.na(`Employment Percent Change, 2023-2033`),
    !is.na(`2023 Employment`),
    !is.na(`2023 Percent of Industry`),
    !is.na(`Projected 2033 Percent of Occupation`),
    !is.na(`Employment Change, 2023-2033`),
    # filter all the outliers
    `Industry Title` != "Total employment",
    `Industry Title` != "Self-employed workers",
    `Industry Title` != "Total wage and salary employment",
    # filter duplicated data (other information is the same, only display level is different)
    !(`Industry Title` == "Professional, scientific, and technical services" & `Display Level` == 3),
    !(`Industry Title` == "Management of companies and enterprises" & `Display Level` == 3)
  ) |>
  select(-`Industry Type`, -`Industry Code`, -`Industry Sort`)
employment_nona
# A tibble: 79 × 10
`Industry Title` `2023 Employment` 2023 Percent of Occu…¹
<chr> <dbl> <dbl>
1 Mining, quarrying, and oil and gas … 2.2 0.4
2 Oil and gas extraction 1.7 0.3
3 Support activities for mining 0.5 0.1
4 Utilities 0.6 0.1
5 Utilities 0.6 0.1
6 Electric power generation, transmis… 0.4 0.1
7 Electric power generation 0.1 0
8 Natural gas distribution 0.2 0
9 Construction 12.1 2.2
10 Construction of buildings 10.6 1.9
# ℹ 69 more rows
# ℹ abbreviated name: ¹`2023 Percent of Occupation`
# ℹ 7 more variables: `2023 Percent of Industry` <dbl>,
# `Projected 2033 Employment` <dbl>,
# `Projected 2033 Percent of Occupation` <dbl>,
# `Projected 2033 Percent of Industry` <dbl>,
# `Employment Change, 2023-2033` <dbl>, …
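To see how many rows each cleaning step actually drops, a small audit along the following lines may help. This is only a sketch on a hypothetical toy tibble (`toy`, with made-up values), not the BLS file; the column names `Industry Title` and `2023 Employment` are borrowed from the dataset, and the names `step1`, `step2`, and `row_audit` are illustrative.

```r
library(dplyr)

# Hypothetical toy data standing in for the BLS table (values are made up)
toy <- tibble(
  `Industry Title` = c("Total employment", "Utilities", "Construction", NA),
  `2023 Employment` = c(100, 0.6, NA, 3)
)

# Count missing values per column before filtering
na_counts <- colSums(is.na(toy))

# Step 1: drop rows with missing values in the columns of interest
step1 <- toy |>
  filter(!is.na(`Industry Title`), !is.na(`2023 Employment`))

# Step 2: drop summary rows that are not specific industries
step2 <- step1 |>
  filter(`Industry Title` != "Total employment")

# Rows remaining after each step
row_audit <- c(original = nrow(toy), no_na = nrow(step1), no_summary = nrow(step2))
print(na_counts)
print(row_audit)
```

In the output above, the data goes from 84 rows to 79, which would be consistent with the five rows excluded by name (three summary rows and two duplicates) and with few or no rows being dropped for missing values, although an audit like this on the real data would be needed to confirm it.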
# find out the correlation between Employment Change, 2023-2033, 2023 Employment,
# 2023 Percent of Industry and 2023 Percent of Occupation
p1 <- lm(
  `Employment Change, 2023-2033` ~ `2023 Employment` + `2023 Percent of Industry` + `2023 Percent of Occupation`,
  data = employment_nona
)
summary(p1)
Call:
lm(formula = `Employment Change, 2023-2033` ~ `2023 Employment` +
`2023 Percent of Industry` + `2023 Percent of Occupation`,
data = employment_nona)
Residuals:
Min 1Q Median 3Q Max
-0.28037 -0.01545 0.01194 0.01288 0.30219
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.012770 0.011559 -1.105 0.273
`2023 Employment` 0.008271 0.075330 0.110 0.913
`2023 Percent of Industry` -0.018670 0.021778 -0.857 0.394
`2023 Percent of Occupation` 0.257409 0.409011 0.629 0.531
Residual standard error: 0.09527 on 75 degrees of freedom
Multiple R-squared: 0.9974, Adjusted R-squared: 0.9973
F-statistic: 9733 on 3 and 75 DF, p-value: < 2.2e-16
# diagnostic plots
autoplot(p1, 1:4, nrow = 2, ncol = 2)
Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
options(scipen=0)
Regression Analysis
The regression model is: (Employment Change, 2023-2033) = -0.0128 + 0.0083 (2023 Employment) - 0.0187 (2023 Percent of Industry) + 0.2574 (2023 Percent of Occupation).
The Adjusted R-squared value is 0.9973, which means about 99.7% of the variation in employment change can be explained by this model. This suggests the model has a very strong overall fit to the dataset.
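As a quick check on how the adjusted value relates to the Multiple R-squared, the short calculation below applies the standard adjustment formula to the numbers printed by `summary(p1)`. The variable names are illustrative, and `r_squared` is the rounded value from the output, so the result matches the reported figure only up to rounding.

```r
# Values taken from the summary(p1) output above
r_squared <- 0.9974   # Multiple R-squared (rounded, as printed)
n <- 79               # rows in employment_nona
p <- 3                # number of predictors

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 4))
```

With only three predictors and 79 observations, the penalty is small, which is why the two values are nearly identical.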
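The overall model is statistically significant because the p-value is less than 2.2e-16. However, when examining the individual variables, the p-values for 2023 Employment, 2023 Percent of Industry, and 2023 Percent of Occupation are all greater than 0.05. A very high R-squared combined with no individually significant predictor is a common sign of multicollinearity: the predictors are likely highly correlated with one another (for example, 2023 Percent of Occupation appears to be roughly proportional to 2023 Employment in the table above), so the model cannot separate their individual effects when they are combined in the same model.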
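One way to check whether predictors overlap is the variance inflation factor (VIF). The sketch below computes it by hand in base R on hypothetical toy data in which `x2` is almost a rescaled copy of `x1`; the helper `vif_manual` and all variable names are illustrative and are not part of the original analysis. A common rule of thumb treats values above roughly 5 to 10 as a warning sign.

```r
set.seed(1)

# Hypothetical toy data: x2 is almost an exact rescaling of x1
n <- 50
x1 <- runif(n, 0, 10)
x2 <- x1 / 5 + rnorm(n, sd = 0.01)
x3 <- rnorm(n)
y  <- 2 * x1 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on the remaining predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(toy, c("x1", "x2", "x3"))
print(round(vifs, 1))
```

Applied to the real model, the same idea would use the three 2023 columns as predictors; if their VIFs turn out to be large, dropping or combining predictors may give more interpretable coefficients.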
From the diagnostic plots, most residuals are centered around zero, which suggests the model generally fits the data reasonably well. The Residuals vs Fitted plot shows slight curvature, meaning there may be some non-linear patterns in the data. The Normal Q-Q plot shows several points deviating from the straight line at both ends, suggesting a minor deviation from normality and the presence of a few outliers.
The Scale-Location plot shows some uneven spread of residuals, indicating slight heteroscedasticity (non-constant residual variance). In the Cook's Distance plot, observations 39, 43, and 75 appear to have a relatively larger influence on the model than the other observations.
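Influential observations can also be listed numerically instead of being read off the plot. The sketch below uses base R's `cooks.distance()` on hypothetical toy data with one deliberately extreme point; the 4 / n cutoff is a common rule of thumb rather than a strict test, and the data and names are made up for illustration.

```r
set.seed(2)

# Hypothetical toy data with one deliberately extreme observation (row 30)
x <- c(rnorm(29), 8)
y <- c(2 * x[1:29] + rnorm(29), -20)
fit <- lm(y ~ x)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# A common rule of thumb flags points with distance above 4 / n
threshold <- 4 / length(cd)
flagged <- which(cd > threshold)
print(flagged)
```

Running the same two lines on `p1` would show whether observations 39, 43, and 75 exceed the cutoff, and refitting the model without them would indicate how much they drive the coefficients.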
Overall, the regression model demonstrates a strong relationship between projected employment change and the selected employment-related variables, although some individual predictors may not be statistically significant on their own.
Data Visualization
Animated Bar Graph
library(ggplot2)
library(gganimate)
library(gifski)

# create 2 new datasets for animation: data2023, data2033
data2023 <- employment_nona |>
  mutate(Year = 2023, Employment = `2023 Employment`)
data2033 <- employment_nona |>
  mutate(Year = 2033, Employment = `Projected 2033 Employment`)

# using rbind to combine data2023 and data2033 for the animated bar graph
# (reference: https://www.r-bloggers.com/2024/04/data-frame-merging-in-r-with-examples/#google_vignette)
bar_data <- rbind(data2023, data2033)

# filter the desired top 8 industries for plotting
top8industries <- employment_nona |>
  arrange(desc(`2023 Employment`)) |>
  head(8)
# top8industries

# filter the data with dplyr for the 8 desired industries
top8 <- bar_data |>
  filter(`Industry Title` %in% top8industries$`Industry Title`)
# top8

# change names into abbreviations
top8a <- top8 |>
  # round employment to integers
  mutate(Employment = round(Employment))
top8a$`Industry Title`[top8a$`Industry Title` == "Real estate and rental and leasing"] <- "R.E. Rental& Leasing"
top8a$`Industry Title`[top8a$`Industry Title` == "Professional, scientific, and technical services"] <- "Scientific & Tech"
top8a$`Industry Title`[top8a$`Industry Title` == "Residential building construction"] <- "Residential Const."

# create an animated bar graph
ani_bar <- ggplot(
  top8a,
  aes(
    x = reorder(`Industry Title`, Employment),
    y = Employment,
    fill = `Industry Title`
  )
) +
  geom_col() +
  geom_text(aes(label = Employment), vjust = 0.5) +
  scale_fill_brewer(palette = "Set3") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Top Industry Employment in {(closest_state)} ",
    x = "Industry Title",
    y = "Employment(in millions)",
    caption = " Data in 2033 is projected.\nSource: U.S. Bureau of Labor Statistics (BLS)"
  ) +
  # animate between different years
  transition_states(Year, transition_length = 8, state_length = 5) +
  # use smooth linear movement during animation
  ease_aes("linear") +
  theme(plot.title = element_text(hjust = 0.5))

# (animation reference: https://www.r-bloggers.com/2024/06/r-gganimate-how-to-make-stunning-chart-animations-with-ggplot2/)
animate(
  # animated plot object
  ani_bar,
  # width of the gif
  width = 600,
  # height of the gif
  height = 400,
  # resolution of the gif
  res = 60,
  # total number of frames
  nframes = 8,
  # frames shown per second
  fps = 2,
  # save animation as a gif file
  renderer = gifski_renderer(file = "animated_bar_employment.gif")
)
The animated bar graph shows the employment distribution across the top 8 industries, moving from 2023 values to projected 2033 values. Real estate and rental and leasing remains the largest industry throughout the animation, while finance and insurance and scientific and technical services remain relatively small. The visualization suggests that projected employment growth is concentrated in a few dominant industries rather than being evenly distributed across all sectors.
The animation also shows that Scientific and Tech, Residential Construction, Real Estate, and Real Estate Rental and Leasing experience relatively larger employment changes from 2023 to 2033, while the remaining industries stay more stable.
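Statements about which industries change the most can be backed by ranking the change directly. The dataset already provides change columns, so the sketch below only illustrates the calculation on a hypothetical toy tibble; the industry labels, the figures, and the column names `emp_2023` and `emp_2033` are all made up.

```r
library(dplyr)

# Hypothetical employment figures (made up for illustration)
toy <- tibble(
  industry = c("A", "B", "C"),
  emp_2023 = c(10, 4, 2),
  emp_2033 = c(11, 5, 2)
)

# Absolute and percent change between the two years, largest first
changes <- toy |>
  mutate(
    change = emp_2033 - emp_2023,
    pct_change = 100 * change / emp_2023
  ) |>
  arrange(desc(pct_change))

print(changes)
```

Note that absolute and percent change can rank industries differently: in this toy data, A and B gain the same amount, but B grows faster in relative terms because it starts from a smaller base.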
The Tableau treemap highlights the relative size differences between industries in both the 2023 and projected 2033 employment data. The real estate-related industries occupy the largest portion of the treemap, indicating their dominant contribution to total employment. Smaller industries such as finance and insurance occupy much smaller areas, reflecting their smaller share of employment. The dashboard makes it easier to visually compare proportional differences across industries.
One interesting observation from the visualization is that Real Estate appears to be a steadily growing industry, suggesting it may continue to provide strong employment opportunities in the future. In addition, the Scientific and Tech industry also shows growth potential and may become an increasingly important field over time.
One limitation of this visualization is that the dataset is relatively small, which limits the overall depth of the analysis. Another limitation is that the employment values vary greatly across industries. The large differences make smaller industries difficult to see clearly in the bar graph and treemap. In the future, it would be helpful to include more industries and variables to provide a broader understanding of employment trends, and applying different scaling methods may help improve the balance of the visualization.
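One possible scaling method is a logarithmic axis. The sketch below draws a dot plot on a log10 scale with ggplot2, using hypothetical values that span several orders of magnitude; the data and the object name `p_log` are illustrative. Points are used instead of bars because bar length is hard to interpret on a log axis.

```r
library(ggplot2)

# Hypothetical employment values spanning several orders of magnitude
toy <- data.frame(
  industry = c("A", "B", "C", "D"),
  employment = c(0.2, 1.5, 12, 90)
)

# Dot plot on a log10 axis so small and large industries are both readable
p_log <- ggplot(toy, aes(x = employment, y = reorder(industry, employment))) +
  geom_point(size = 3) +
  scale_x_log10() +
  theme_minimal() +
  labs(x = "Employment (log scale)", y = "Industry")

print(p_log)
```

The trade-off is that a log axis makes ratios easy to compare but hides absolute differences, so the axis label should state the scale clearly.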