Harp 325 Midterm

Use the following data to produce 1 table of summary information and 2-3 graphs.

Notes on the data: the total, professional, and computer_all job groups are available for all four years of data. Focus on these if you want to produce graphs of jobs over time. The individual occupations change from one year to the next, so you will not be able to graph them over time (with the exception of computer programmers). You may, however, filter one year of data and make a bar or point plot for each occupation in that year.

The All variable is measured as total jobs in that category, while Women, Black, Asian, and Latino are the percent of workers who identify with each group. You cannot put All on the same plot as one of these variables, since they are measured in different units.

The goal of the assignment is not only to practice making plots: your task is to present the data in a clear and meaningful way. Points will be deducted, for example, from plots that are hard to read or understand. You are also encouraged to think about how to use colors, labels, and themes effectively. Graphs with multiple groups must have a legend.

Although you will need to filter the data, it does not need to be cleaned up too much in order to be graphed. Don’t overthink it!

rm(list=ls())

employment_data <- read.csv("http://tinyurl.com/dida325midtermdata", stringsAsFactors = F, fileEncoding="UTF-8-BOM")

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

library(tidyr)
library(ggthemes)
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(cowplot)

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:ggthemes':
## 
##     theme_map

library(corrplot)

## corrplot 0.92 loaded

First load and name data; then load all necessary libraries

summary(employment_data)

##    job_type         description             year           All          
##  Length:70          Length:70          Min.   :2005   Min.   :     3.0  
##  Class :character   Class :character   1st Qu.:2010   1st Qu.:    83.0  
##  Mode  :character   Mode  :character   Median :2015   Median :   223.5  
##                                        Mean   :2014   Mean   : 10589.8  
##                                        3rd Qu.:2020   3rd Qu.:   774.2  
##                                        Max.   :2020   Max.   :148834.0  
##                                                                         
##      Women           Black            Asian       Hispanic.Latino 
##  Min.   : 9.30   Min.   : 3.000   Min.   : 3.40   Min.   : 2.000  
##  1st Qu.:21.05   1st Qu.: 6.450   1st Qu.: 8.65   1st Qu.: 5.550  
##  Median :27.60   Median : 9.100   Median :11.80   Median : 6.900  
##  Mean   :31.18   Mean   : 8.997   Mean   :14.69   Mean   : 7.853  
##  3rd Qu.:43.85   3rd Qu.:11.600   3rd Qu.:19.75   3rd Qu.: 9.350  
##  Max.   :57.40   Max.   :15.300   Max.   :34.10   Max.   :17.600  
##  NA's   :11      NA's   :11       NA's   :11      NA's   :11

head(employment_data)

Get summary and use head to few first few rows of data to get sense of the data and find na’s.

employment_data <- na.omit(employment_data)

Remove na’s from data.

str(employment_data)

## 'data.frame':    59 obs. of  8 variables:
##  $ job_type       : chr  "total" "professional" "computer_all" "computer" ...
##  $ description    : chr  "Total, 16 years and over" "Professional and related occupations" "Computer and mathematical occupations" "Computer systems analysts" ...
##  $ year           : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ All            : num  147795 36502 5603 594 137 ...
##  $ Women          : num  46.8 57 25.2 35.6 11.4 21.1 19.4 25.1 27.8 44.8 ...
##  $ Black          : num  12.1 10.5 9.1 9.7 11.9 6.3 6.2 12 3.7 5.9 ...
##  $ Asian          : num  6.4 10.1 23 18.7 6.9 28.3 34.1 29.6 16.2 9.9 ...
##  $ Hispanic.Latino: num  17.6 10.1 8.4 8.1 15.8 6.6 5.9 9.2 5.9 15.8 ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 4 17 18 25 36 37 40 53 54 56 ...
##   ..- attr(*, "names")= chr [1:11] "4" "17" "18" "25" ...

View(employment_data)

filtered_data <- employment_data %>% 
  filter(description == "Computer programmers" | job_type == "total" | 
           job_type == "professional" | job_type == "computer_all")

Since there are multiple job types listed as computer, I filtered by description for computer programmer since each is unique. Filtered by job type for others. Used Filter to create new a variable for the data with just the four job types I wish to conduct my analysis on. The four I chose to examine were total, professional, computer programmers, and professional, as they exist over all four years of data collection.

filtered_data$job_type[filtered_data$job_type == "computer"] <- "computer_programmer"

Rename the job_type computer to computer_programmer for clarity.

https://sparkbyexamples.com/r-programming/replace-values-in-r/

View(filtered_data)
print(filtered_data)

##               job_type                           description year    All Women
## 1                total              Total, 16 years and over 2020 147795  46.8
## 2         professional  Professional and related occupations 2020  36502  57.0
## 3         computer_all Computer and mathematical occupations 2020   5603  25.2
## 4  computer_programmer                  Computer programmers 2020    417  21.1
## 5                total              Total, 16 years and over 2015 148834  46.8
## 6         professional  Professional and related occupations 2015  33852  57.2
## 7         computer_all Computer and mathematical occupations 2015   4369  24.7
## 8  computer_programmer                  Computer programmers 2015    480  21.0
## 9                total              Total, 16 years and over 2010 139064  47.2
## 10        professional  Professional and related occupations 2010  30805  57.4
## 11        computer_all Computer and mathematical occupations 2010   3531  25.8
## 12 computer_programmer                  Computer programmers 2010    470  22.0
## 13               total              Total, 16 years and over 2005 141730  46.4
## 14        professional Professional and related occupations  2005  28795  56.3
## 15        computer_all Computer and mathematical occupations 2005   3246  27.0
## 16 computer_programmer                  Computer programmers 2005    581  26.0
##    Black Asian Hispanic.Latino
## 1   12.1   6.4            17.6
## 2   10.5  10.1            10.1
## 3    9.1  23.0             8.4
## 4    6.3  28.3             6.6
## 5   11.7   5.8            16.4
## 6    9.8   8.7             8.8
## 7    8.6  19.9             6.8
## 8    7.0  18.9             6.9
## 9   10.8   4.8            14.3
## 10   9.2   7.0             7.1
## 11   6.7  16.1             5.5
## 12   5.1  12.4             6.5
## 13  10.8   4.4            13.1
## 14   8.8   6.6             6.4
## 15   6.9  14.7             5.3
## 16   4.6  18.0             5.7

I have now cleaned my data of na’s filtered for the specific job types I wish to exam further. So, lets view the new filtered data and make sure it looks satisfactory.

summary_table <- filtered_data %>% 
  group_by(job_type) %>% 
  summarise(
    Avg_Women = mean(Women),
    Avg_Black = mean(Black),
    Avg_Asian = mean(Asian),
    Avg_His.Lat = mean(Hispanic.Latino),
    sd_Women = sd(Women),
    sd_Black = sd(Black),
    sd_Asian = sd(Asian),
    sd_His.Lat = sd(Hispanic.Latino)
    )
  
View(summary_table)

I created this table to show the mean and standard deviation for each variable across all four years, grouped by job type. I choose to create each variable individually rather than use across all so I could create the table all in one function without needing to use r bind to merge variables. Additionally, each variable is already named, and there isn’t to many values to deal with so it is manageable to work with.

 summary_table <- summary_table %>% 
  mutate_if(is.numeric, round, digits=2)

Mutated the table to round values to two decimal places.

summary_table <- summary_table %>% 
  mutate(
    Most_Rep_Race = case_when(
      Avg_Black == pmax(Avg_Black, Avg_Asian, Avg_His.Lat) ~ "Black",
      Avg_Asian == pmax(Avg_Black, Avg_Asian, Avg_His.Lat) ~ "Asian",
      Avg_His.Lat == pmax(Avg_Black, Avg_Asian, Avg_His.Lat) ~ "Hispanic.Latino"
    ),
    Least_Rep_Race = case_when(
      Avg_Black == pmin(Avg_Black, Avg_Asian, Avg_His.Lat) ~ "Black",
      Avg_Asian == pmin(Avg_Black, Avg_Asian, Avg_His.Lat) ~ "Asian",
      Avg_His.Lat == pmin(Avg_Black, Avg_Asian, Avg_His.Lat) ~ "Hispanic.Latino"
    ),
  
  )
View(summary_table)

Next I mutated the table again to add two new columns that use the previous calculations to find the minimum/maximum average value across the different races and return that race for the least represented and most represented race columns respectively.

sources:https://www.statology.org/r-pmax-pmin/ https://www.geeksforgeeks.org/case-when-statement-in-r-dplyr-package-using-case_when-function/

summary_table1 <- summary_table %>%
  kbl(escape = F, align = "c", caption = "<center><strong>Summary Statistics: Job Data</strong></center>",
      format = "html") %>%
  kable_paper("striped", full_width = F) %>% 
  column_spec(c(2:5), bold = T, color = "darkorange") %>% 
  column_spec(c(6:9), bold = T, color = "red") %>% 
  column_spec(c(10:11), bold = T, color = "forestgreen") 
  
print(summary_table1)

## <table class=" lightable-paper lightable-striped" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
## <caption><center><strong>Summary Statistics: Job Data</strong></center></caption>
##  <thead>
##   <tr>
##    <th style="text-align:center;"> job_type </th>
##    <th style="text-align:center;"> Avg_Women </th>
##    <th style="text-align:center;"> Avg_Black </th>
##    <th style="text-align:center;"> Avg_Asian </th>
##    <th style="text-align:center;"> Avg_His.Lat </th>
##    <th style="text-align:center;"> sd_Women </th>
##    <th style="text-align:center;"> sd_Black </th>
##    <th style="text-align:center;"> sd_Asian </th>
##    <th style="text-align:center;"> sd_His.Lat </th>
##    <th style="text-align:center;"> Most_Rep_Race </th>
##    <th style="text-align:center;"> Least_Rep_Race </th>
##   </tr>
##  </thead>
## <tbody>
##   <tr>
##    <td style="text-align:center;"> computer_all </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 25.68 </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 7.82 </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 18.42 </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 6.50 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 0.99 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 1.20 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 3.76 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 1.43 </td>
##    <td style="text-align:center;font-weight: bold;color: forestgreen !important;"> Asian </td>
##    <td style="text-align:center;font-weight: bold;color: forestgreen !important;"> Hispanic.Latino </td>
##   </tr>
##   <tr>
##    <td style="text-align:center;"> computer_programmer </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 22.52 </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 5.75 </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 19.40 </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 6.42 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 2.36 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 1.10 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 6.59 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 0.51 </td>
##    <td style="text-align:center;font-weight: bold;color: forestgreen !important;"> Asian </td>
##    <td style="text-align:center;font-weight: bold;color: forestgreen !important;"> Black </td>
##   </tr>
##   <tr>
##    <td style="text-align:center;"> professional </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 56.98 </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 9.57 </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 8.10 </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 8.10 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 0.48 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 0.74 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 1.61 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 1.67 </td>
##    <td style="text-align:center;font-weight: bold;color: forestgreen !important;"> Black </td>
##    <td style="text-align:center;font-weight: bold;color: forestgreen !important;"> Asian </td>
##   </tr>
##   <tr>
##    <td style="text-align:center;"> total </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 46.80 </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 11.35 </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 5.35 </td>
##    <td style="text-align:center;font-weight: bold;color: darkorange !important;"> 15.35 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 0.33 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 0.66 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 0.91 </td>
##    <td style="text-align:center;font-weight: bold;color: red !important;"> 2.03 </td>
##    <td style="text-align:center;font-weight: bold;color: forestgreen !important;"> Hispanic.Latino </td>
##    <td style="text-align:center;font-weight: bold;color: forestgreen !important;"> Asian </td>
##   </tr>
## </tbody>
## </table>

In this table, I’ve implemented stylistic enhancements to improve readability and presentation. Key changes include the addition of a clear title to summarize the content of the table and the strategic use of colors to highlight different statistics and demographic groups.

The decision to maintain the data in wide format facilitates easier analysis across demographic groups within each job type, enhancing interpretability and allowing for more efficient comparison.

I hypothesize that there is a significant association between demographic factors, specifically gender and ethnicity, and job type within a given population. Additionally, I expect to find correlations between these demographic variables and the distribution of job types.

filtered_long <- filtered_data %>% 
  pivot_longer(cols = 5:8, names_to = "Group", values_to = "Percentage") %>% 
  select(job_type, Group, Percentage, year) %>% 
  filter(job_type != "professional")
  
View(filtered_long)

Here I transform the filtered data to long format to be able to construct a graph. Source:https://www.youtube.com/watch?v=CshEAjYQN8U

ggplot(filtered_long, aes(x = year, y = Percentage, color = Group)) +
  geom_line(linewidth = 2) +
  facet_wrap(~job_type) +
  labs(y = "Percentage", x = "Year", 
       title = "Demographics of Jobs Over Time",
       color = "Group")+
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

For this plot I created three graphs for three job types and choose to omit professional as it is similar to total and felt it was redundant in this case. This graph shows the trend in how different demographic groups representation in various jobs has changed from 2005 to 2020.

Some key take a ways to note include the following:

First, when comparing the demographic representation of Hispanic/Latino and Black individuals in computer programming and computer-related occupations to the total demographic representation, a similar pattern of growth is evident. This suggests a consistent increase in the representation of these demographic groups within these specific job types over the analyzed period.

In contrast to the total demographic representation, the representation of Asian individuals in computer programming and related occupations exhibits a notably faster rate of growth. This implies a significant rise in the presence of Asian individuals within these job types compared to the overall demographic trends.

Despite the overall stability in female demographic makeup across all job types over the studied period, there is a concerning decline in female representation within computer programming and related fields. This decline is noteworthy considering the relatively stable female representation across all job types, indicating a unique challenge or trend specific to these technology-related occupations.

corr_data <- filtered_data %>% 
  select(Women, Black, Asian, Hispanic.Latino)

corr_table <- cor(corr_data)

corrplot(corr_table, tl.cex = 1, cl.cex = 0.5)

This correlation table is showing correlation between variables across all job types from filtered_data. In this table we can gather both the strength and direction of the correlation between variables.

The strongest positive correlation is observed between the Black and Hispanic/Latino demographic groups within this dataset. This indicates a consistent trend of these demographic groups exhibiting similar patterns or proportions across different job types, suggesting potential shared experiences or representation within the workforce.

Conversely, the strongest negative correlation is found between the Women and Asian demographic groups. This suggests an inverse relationship in the representation or proportions of these demographic groups across job types within the dataset. Such findings align with the trends observed in the data, reinforcing the significance of these correlations in understanding demographic dynamics within the workforce.

a <- ggplot(filtered_data)+
  geom_point(aes(x = Black, y = Women, color = job_type))+
  geom_smooth(method = "lm", aes(x = Black, y = Women ))+
  theme_minimal()+
  labs(title = "Women vs Black")+
  scale_color_brewer(palette = "Dark2")

b <- ggplot(filtered_data)+
  geom_point(aes(x = Asian, y = Women, color = job_type))+
  geom_smooth(method = "lm", aes(x = Asian, y = Women ))+
  theme_minimal()+
  labs(title = "Women vs Asian")+
  scale_color_brewer(palette = "Dark2")

c <- ggplot(filtered_data)+
  geom_point(aes(x = Hispanic.Latino , y = Women, color = job_type))+
  geom_smooth(method = "lm", aes(x = Hispanic.Latino, y = Women ))+
  theme_minimal()+
  labs(title = "Women vs Hispanic.Latino")+
  scale_color_brewer(palette = "Dark2")
  

plot_grid(a, b, c)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

This set of graphs is specifically showing the correlation between each race compared to women.

The graphs depicting the correlation between women and Hispanic/Latino, as well as women and Black individuals, exhibit similar patterns with relatively strong positive linear relationships. This finding is consistent with the previously observed strong correlation between Black and Hispanic/Latino demographic groups. It suggests a tendency for these demographic groups to exhibit similar representation patterns across various job types.

The graph depicting the correlation between women and Asian individuals stands out with the strongest relationship, characterized by a negative linear relationship. This indicates that as the percentage of Asians in a job type increases, the likelihood of women being in that particular job type decreases. This observation highlights a distinct trend in the dataset, emphasizing potential disparities or differential representation between Asian individuals and women across different job types.

long_table <- summary_table %>% 
  pivot_longer(cols = 3:5, names_to = "Race", values_to = "Percentage") %>% 
  select(job_type, Race, Percentage)
View(long_table)

Used pivot Longer so each Avg_*race was listed under the new column Race, and each of their values is listed under the new column Percentage. Used select so only the variables I needed were included.

ggplot(long_table, aes(x = job_type, y = Percentage, fill = Race)) +
  geom_bar(width= 0.75, stat = "identity", position = "stack") +
  labs(x = "Job Type", y = "Percentage", title="Race Distribution By Job Type") +
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()

The bar chart provides a clear visualization of the demographic breakdown within each job type, segmented by race. Several key observations can be drawn from the chart. Note the percentages do not go to 100 as this does not show every possible race.

The chart effectively translates the information from the summary table into a visual format, allowing for easier comparison and interpretation of demographic proportions across different job types. By segmenting the bars by race, the chart provides a comprehensive overview of demographic diversity within each job category.

What stands out the most is despite comprising the smallest percentage of the total and professional job types, Asians exhibit a disproportionately higher representation in the computer all and computer programmer job types. Specifically, Asians are approximately twice as likely to be found working in these technology-related occupations compared to their representation in the total or professional job types. This highlights a notable trend in workforce distribution, underscoring the significance of Asian representation within the technology sector.

Overall, the findings align with my hypothesis by demonstrating distinct relationships between gender, ethnicity, and job type. Visualizations such as line graphs and bar charts effectively portray these associations, providing clarity and insight into demographic distributions within different occupational categories.

These visualizations serve as powerful tools for highlighting trends and patterns within the data. However, to advance our understanding, the next logical step in the analysis involves delving deeper into the underlying causes driving these observed trends.

Moving forward, it would be valuable to explore potential causal factors contributing to the identified associations. This could involve conducting further statistical analyses or qualitative investigations to uncover the mechanisms influencing demographic representation within specific job types.