DATA 606 Project Proposal

Data Preparation

We read in our automation data from GitHub (in its raw form).

library(tidyverse) #provides %>%, mutate(), select() and ggplot()
data <- read.csv("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-606/Data-Project/automation_data_by_state.csv", header = TRUE)

We then sum the state columns of the automation dataset into a single column representing total employment across the nation. We call this new variable \(Total.Employed\).

#Drop SOC column
data <- subset(data, select = -c(SOC) )

#Add Total.Employed column (use = inside mutate(); with <- the column is not named Total.Employed)
automation <- data %>%
  mutate(Total.Employed = select(., Alabama:Washington) %>% rowSums(na.rm = TRUE))

#Verify addition of new column
head(automation$Total.Employed)

##       1       2       3       4       5       6 
##  218390 2141480   27840  202630  358740   62810

The last line of code verifies that the \(Total.Employed\) variable was added successfully. Displaying all 702 observations added 2-3 pages to the report, so for the sake of conciseness I show only the head (the first 6 observations). These few entries are enough to confirm that the column exists and that its values can be accessed.
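
As a minimal sketch of the row-sum step, the same pattern can be tried on a toy data frame. The column names `StateA`, `StateB`, `StateC` and every value below are made up for illustration and are not taken from the dataset; the point is only that `mutate()` names the new column with `=` and that `na.rm = TRUE` skips missing values.

```r
library(dplyr)

# Toy stand-in for the automation data; names and values are hypothetical
toy <- data.frame(
  Occupation = c("job1", "job2"),
  StateA = c(10, 5),
  StateB = c(20, NA),
  StateC = c(30, 15)
)

# Sum the state columns row-wise, ignoring missing values
toy <- toy %>%
  mutate(Total.Employed = rowSums(select(., StateA:StateC), na.rm = TRUE))

print(toy$Total.Employed)
```

Because `StateA:StateC` selects a positional range of columns, it is worth confirming that the range covers every column meant to be summed.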

Research question

What are the 5 occupations most likely and the 5 least likely to be automated, and what characteristics do the occupations in each group share?
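
One way the first half of this question might be approached is to sort by probability and take both ends of the ordering. The sketch below uses a made-up data frame (the occupation labels and probabilities are hypothetical) and `n = 2` to keep it short; the proposal itself would use `n = 5`.

```r
# Hypothetical occupations and probabilities, for illustration only
toy <- data.frame(
  Occupation = c("a", "b", "c", "d", "e", "f", "g"),
  Probability = c(0.90, 0.05, 0.50, 0.99, 0.01, 0.70, 0.30),
  stringsAsFactors = FALSE
)

n <- 2  # the proposal would use n = 5

# Sort once by Probability, then take both ends of the ordering
ranked <- toy[order(toy$Probability), ]
least_likely <- head(ranked$Occupation, n)
most_likely <- rev(tail(ranked$Occupation, n))

print(least_likely)
print(most_likely)
```

If several occupations share the same probability, `order()` breaks ties by row position, so a cutoff at exactly 5 may be somewhat arbitrary.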

Cases

The cases are Occupations, of which there are 702:

nrow(automation)

## [1] 702

Data collection

The data was downloaded from Kaggle and then uploaded to GitHub, from which it is read in its raw form.

Type of study

This is an observational study. The data is observed, probabilities were measured, and no attempt was made to affect the outcome.

Data Source

The source of data is cited (APA-style) below:

Larxel. (2020). Occupation, Salary and Likelihood of Automation [Data file]. Retrieved from https://www.kaggle.com/andrewmvd/occupation-salary-and-likelihood-of-automation

The author of this dataset asked for the following data citation(s) as well:

US Bureau of Labor Statistics. (2019). May 2019 National Occupational Employment and Wage Estimates United States [Salary data]. Retrieved from https://www.bls.gov/oes/current/oes_nat.htm
Frey, C. B., & Osborne, M. A. (2013). The Future of Employment [Article]. Retrieved from https://ora.ox.ac.uk/objects/uuid:4ed9f1bd-27e9-4e30-997e-5fc8405b0491/download_file?safe_filename=future-of-employment.pdf&file_format=application%2Fpdf&type_of_work=Journal%2Barticle

Dependent Variable

Probability: this is a quantitative (numeric) variable that gives the estimated probability that the Occupation will be automated in the future.

Independent Variables

Occupation: this is a qualitative variable that describes the type of role that is in question.
Total.Employed: this is a quantitative variable that describes the total number of people employed in this position (nationally) as of the date the data was collected.

Relevant summary statistics

Below are the summary statistics for the variables of our consideration:

#Probability
summary(automation$Probability)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0028  0.1100  0.6400  0.5355  0.8900  0.9900

#Occupation
summary(automation$Occupation)

##    Length     Class      Mode 
##       702 character character

#Total.Employed
summary(automation$Total.Employed)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   13802   43600  173367  137162 4412910

These summary statistics give an idea of the type and range of values that we’re dealing with. For instance, the \(Probability\) variable has decimal values between 0.0028 and 0.99 with an average value of 0.5355. These values indicate the probability that the corresponding \(Occupation\) will be automated.
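
The minimum of 0 for \(Total.Employed\) may deserve a second look. One possible explanation, offered here as an assumption rather than something verified against the data, is that `rowSums(na.rm = TRUE)` returns 0 for a row whose values are all missing. A small sketch with a made-up matrix shows the behavior and one way to flag such rows:

```r
# Made-up employment counts; the second row has no recorded values
m <- matrix(c(10, NA, 20, NA), nrow = 2,
            dimnames = list(NULL, c("StateA", "StateB")))

totals <- rowSums(m, na.rm = TRUE)

# Flag rows where every value is missing so a 0 is not read as "nobody employed"
all_missing <- rowSums(!is.na(m)) == 0
totals_checked <- ifelse(all_missing, NA, totals)

print(totals)
print(totals_checked)
```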

Once we’ve taken stock of our variables, we visualize them with ggplot2. To quote the tidyverse site:

It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).

#Scatterplot of Employment vs. Probability of Automation

ggplot(automation, aes(x=Probability, y=automation$Total.Employed)) + 
  geom_point() +
  labs(title = "Employment vs. Probability of Automation", subtitle = "(A visualization of 702 Occupations)", x = "Probability of Automation", y = "Total Employed")

## Warning: Use of `automation$Total.Employed` is discouraged. Use `Total.Employed`
## instead.

#Refine this plot based on Probability of Automation
automation <- automation %>%
  mutate(High.Prob = ifelse(Probability > 0.8, "yes", "no") )

ggplot(automation, aes(x=Probability, y=automation$Total.Employed, color = automation$High.Prob)) + 
  geom_point() +
  labs(title = "Employment vs. Probability of Automation", subtitle = "(A visualization of 702 Occupations)", x = "Probability of Automation", y = "Total Employed")

## Warning: Use of `automation$Total.Employed` is discouraged. Use `Total.Employed`
## instead.

## Warning: Use of `automation$High.Prob` is discouraged. Use `High.Prob` instead.
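
The warnings above suggest referring to columns by bare name inside `aes()`, since ggplot2 looks them up in the data frame passed to `ggplot()`. A minimal sketch on made-up data (the column names mirror the report, the values are hypothetical) might look like this:

```r
library(ggplot2)

# Hypothetical data; column names mirror the report, values are made up
toy <- data.frame(
  Probability = c(0.05, 0.40, 0.85, 0.95),
  Total.Employed = c(1200, 300, 5000, 800)
)
toy$High.Prob <- ifelse(toy$Probability > 0.8, "yes", "no")

# Dataset + aesthetic mapping with bare column names, then a geom layer, then labels
p <- ggplot(toy, aes(x = Probability, y = Total.Employed, color = High.Prob)) +
  geom_point() +
  labs(title = "Toy example", x = "Probability of Automation", y = "Total Employed")
```

Printing `p` draws the plot. This form only works when each name used in `aes()` is an actual column of the data frame.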

The visualizations above show that I should refine how I filter the data to gain more insight into the question at hand. Such a filter could consider both the probability of automation and the total number of people employed who may be affected by automation, to see what we may infer …
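
One way such a combined filter might look, sketched on made-up data, is to weight each occupation's probability by its headcount to get a rough count of jobs exposed to automation. The `Jobs.At.Risk` column and the reuse of the 0.8 cutoff are illustrative assumptions, not part of the proposal.

```r
# Hypothetical occupations; all values are made up for illustration
toy <- data.frame(
  Occupation = c("a", "b", "c", "d"),
  Probability = c(0.95, 0.90, 0.10, 0.85),
  Total.Employed = c(1000, 50000, 200000, 300),
  stringsAsFactors = FALSE
)

# Hypothetical measure: probability weighted by the number of people employed
toy$Jobs.At.Risk <- toy$Probability * toy$Total.Employed

# Keep high-probability occupations, largest exposure first
high <- toy[toy$Probability > 0.8, ]
high <- high[order(-high$Jobs.At.Risk), ]

print(high[, c("Occupation", "Jobs.At.Risk")])
```

Ranking by this product separates occupations that are likely to be automated but employ few people from those where automation would affect a large workforce.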

The summary statistics and plots above provide a launch pad for answering the research question and for exploring which occupations appear safe from automation versus which are highly likely to be automated in the not-so-distant future.