Data Preparation

Load data

# load data
library(tidyverse)
library(stringr)
library(ggpubr)
library(DATA606)

raw_data <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv', stringsAsFactors = FALSE)
raw_data

Select the distinct values in the Major_category column, and create a character vector listing the major categories that will be treated as STEM in the raw dataset.

majors <- raw_data$Major_category %>%
  unique()

stem_majors <- c('Engineering','Physical Sciences','Computers & Mathematics',
                 'Agriculture & Natural Resources','Health','Social Science',
                 'Biology & Life Science')
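A quick sanity check can confirm that every label in a STEM list actually occurs in the data, since a misspelled category would silently end up classified as Non-STEM. The sketch below is only an illustration: toy_majors and toy_stem are hypothetical stand-ins for majors and stem_majors, and only base R is assumed.

```r
# Hypothetical stand-in for unique(raw_data$Major_category)
toy_majors <- c("Engineering", "Business", "Health", "Arts")

# Hypothetical STEM list containing one deliberate misspelling
toy_stem <- c("Engineering", "Health", "Physical Science")

# Labels in the STEM list that never occur in the data;
# an empty result would mean every label matched
unmatched <- setdiff(toy_stem, toy_majors)
unmatched
```

With the real objects, the same call would be setdiff(stem_majors, majors).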

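Create a new column named Is_STEM: if the major's category is in the STEM list, or the major's name contains 'TECHNOLOG', the value is STEM; otherwise it is Non-STEM. Keep the columns Major, Major_category, Is_STEM, Median, and Unemployment_rate as the cleaned output data.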

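clean_data <- raw_data %>% 
  # drop every row that has a missing value in any column
  drop_na() %>%
  mutate(Is_STEM = 
           case_when(
             # the major's category is in the STEM list
             Major_category %in% stem_majors ~ 'STEM',
             # technology majors filed under other categories
             str_detect(Major,'TECHNOLOG') ~ 'STEM',
             TRUE ~ 'Non-STEM')) %>%
  select(Major, 
         Major_category, 
         Is_STEM, 
         Median, 
         Unemployment_rate) %>%
  arrange(Major)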
clean_data
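Once the Is_STEM column exists, it can be useful to see how many majors fall into each group and what the average unemployment rate looks like per group. The following is a minimal base R sketch on a hypothetical toy_clean data frame with made-up values; it is not output from the real dataset.

```r
# Hypothetical miniature version of clean_data (values are invented)
toy_clean <- data.frame(
  Major = c("A", "B", "C", "D"),
  Is_STEM = c("STEM", "STEM", "Non-STEM", "Non-STEM"),
  Unemployment_rate = c(0.04, 0.06, 0.07, 0.09)
)

# Number of majors in each group
group_sizes <- table(toy_clean$Is_STEM)

# Mean unemployment rate per group
group_means <- aggregate(Unemployment_rate ~ Is_STEM,
                         data = toy_clean, FUN = mean)

group_sizes
group_means
```

The same two calls should work on clean_data, assuming its columns keep the names selected above.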

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Answer:

Research questions:

Is the average unemployment rate of STEM majors different from that of non-STEM majors?
Can the unemployment rate be predicted from the type of major (STEM or non-STEM) and the median income?
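One possible way to approach these two questions, sketched here on a hypothetical toy data frame with invented values, is a two-sample t-test for the difference in means and a linear regression with one qualitative and one quantitative predictor. The proposal itself does not name the methods, so the choice of t.test() and lm() is an assumption.

```r
# Hypothetical data with the same column names as clean_data (values invented)
toy <- data.frame(
  Is_STEM = rep(c("STEM", "Non-STEM"), each = 5),
  Median = c(60000, 55000, 65000, 50000, 58000,
             35000, 40000, 38000, 42000, 36000),
  Unemployment_rate = c(0.05, 0.04, 0.06, 0.05, 0.045,
                        0.07, 0.08, 0.065, 0.075, 0.09)
)

# Question 1: do the two groups differ in mean unemployment rate?
# t.test() with a formula runs Welch's two-sample test by default.
welch <- t.test(Unemployment_rate ~ Is_STEM, data = toy)

# Question 2: model unemployment rate from major type and median income
fit <- lm(Unemployment_rate ~ Is_STEM + Median, data = toy)

welch$p.value
coef(fit)
```

Because the cases are majors rather than individual graduates, any conclusion from such a test would describe differences between majors, not between people.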

Cases

What are the cases, and how many are there?

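Answer: Each case is one college major together with its employment statistics. The number of cases is the number of rows in clean_data, that is, the majors that remain after rows with missing values are dropped.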
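The number of cases before and after removing incomplete rows can be counted directly. The sketch below uses a hypothetical toy_raw data frame; complete.cases() is used as the base R counterpart of keeping only the rows that drop_na() would keep.

```r
# Hypothetical miniature version of raw_data with one incomplete row
toy_raw <- data.frame(
  Major = c("A", "B", "C"),
  Total = c(100, NA, 300),
  Unemployment_rate = c(0.05, 0.06, 0.07)
)

n_raw <- nrow(toy_raw)                      # cases before cleaning
n_complete <- sum(complete.cases(toy_raw))  # cases with no missing values
n_dropped <- n_raw - n_complete             # cases removed by cleaning

c(raw = n_raw, complete = n_complete, dropped = n_dropped)
```

On the real data, nrow(raw_data) and nrow(clean_data) give the corresponding counts.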

Data collection

Describe the method of data collection.

Answer: The data come from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series.

The American Community Survey (ACS) is an ongoing survey that provides vital information on a yearly basis about our nation and its people. Information from the survey generates data that help determine how more than $675 billion in federal and state funds are distributed each year.

Through the ACS, we know more about jobs and occupations, educational attainment, veterans, whether people own or rent their homes, and other topics.

*Reference: https://www.census.gov/programs-surveys/acs/about.html

Type of study

What type of study is this (observational/experiment)?

Answer: This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

Answer: The data is from the source below:

https://github.com/fivethirtyeight/data/tree/master/college-majors

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

Answer: The response variable is the unemployment rate (Unemployment_rate). It is quantitative.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.