Load packages

Chapter 7 Homework

In Chapter 5, we briefly explored data on the salaries of engineering graduates from the National Science Foundation 2017 National Survey of College Graduates from a univariate perspective. Now, let’s explore the relationships between multiple variables.

When a question asks you to make a plot, remember to set a theme, title, subtitle, labels, colors, etc. It is up to you how to personalize your plots, but put in some effort, and make the plotting approach consistent throughout the document. For example, you could use the same theme for all plots. I also like to use the subtitle as a place for the main summary for the viewer.

Question 1: Data wrangling

Within a pipeline, import the data from the .csv file, convert all column names to lowercase text (either “manually” with dplyr::rename(), or use clean_names() from the janitor package), convert gender from “numeric” to “factor”, and drop any and all observations with salary recorded as 0. Assign this to a dataframe object with a meaningful name.

## # A tibble: 10 × 3
##    salary   age gender
##     <dbl> <dbl> <fct> 
##  1  89000    32 M     
##  2  90000    31 M     
##  3  95000    38 M     
##  4 130000    36 F     
##  5  25000    60 M     
##  6 204000    71 M     
##  7  80000    52 M     
##  8  80350    31 M     
##  9 120000    34 M     
## 10  96000    36 M

How many observations have a 0 (zero) value for salary? Note: The last question asked you to remove these observations from the resultant data frame.

## [1] 15

There are 15 observations that have a 0 value for salary, meaning that 15 respondents in the survey reported no salary.
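As a cross-check on the row-count approach, the zeros can also be counted directly. The sketch below uses a small hypothetical data frame (toy_salaries is not the survey data) to show that the two counts agree only when the salary column has no missing values, because a non-zero filter also drops NA rows.

```r
# Hypothetical toy data; the real values come from the survey .csv file
toy_salaries <- data.frame(salary = c(89000, 0, 95000, NA, 0, 120000))

# Direct count of zero salaries (missing values are ignored)
n_zero <- sum(toy_salaries$salary == 0, na.rm = TRUE)

# Row-count difference after keeping only non-zero salaries.
# subset() drops rows where the condition is NA, as dplyr::filter() does.
kept <- subset(toy_salaries, salary != 0)
n_dropped <- nrow(toy_salaries) - nrow(kept)

n_zero     # 2
n_dropped  # 3: the two zeros plus the one missing value
```

If the real salary column contains no NA values, both approaches return the same number.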

What are the levels in gender? (Ignore the fact that the observations refer to “biological sex”, not “gender”. Gender is now recognized as a fluid term with more than two options; biological sex - what was assigned at birth - is a binary term).

## [1] "F" "M"

There are two levels in the gender column which are “F” for female and “M” for male.

Question 2: Univariate EDA

Using what you learned in Chapter 5, generate basic plots and/or descriptive statistics to explore age, gender, and salary. List whether each variable is continuous or categorical, and explain how and why you adjusted your EDA approach accordingly.

Question 3: Multivariate histograms

Create a histogram of salary, faceted by gender. Add bins = 50 and color = "lightgrey".

Create a histogram of age, faceted by gender. Add bins = 50 and color = "lightgrey".

Question 4: Multivariate boxplots

Create a boxplot of salary, faceted by gender. Use outlier.shape = 1 to better visualize the outliers.

Create a boxplot of age, faceted by gender.

Question 5: Scatterplot and correlation

Create a scatterplot of age (x-axis) and salary, differentiating by gender.

Question 6: Cumulative distribution function

Plot the cumulative distribution function of salary by gender. Adjust the x-axis with scale_x_log10(limits = c(5e4, 5e5)) to zoom in a bit. What do you notice about the salaries for men and women? Hint: Remember there are greater differences the farther up you go on a log scale axis.

The two cumulative distribution function plots show a difference in salaries between men and women. The CDF for women flattens quickly at both the bottom and the top of the plot, which indicates that most female salaries in engineering fall within a narrow range. The CDF for men tapers off more gradually at the upper end, which indicates that a larger share of men earn well above the median salary.
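One way to put numbers on a CDF comparison is to evaluate each group's empirical CDF at a chosen salary. The following is a minimal sketch using base R's ecdf() on hypothetical values (the toy data frame is made up for illustration and is not the survey data).

```r
# Hypothetical salaries for illustration only; not the survey data
toy <- data.frame(
  gender = c("F", "F", "F", "F", "M", "M", "M", "M"),
  salary = c(60000, 80000, 90000, 110000, 70000, 95000, 120000, 200000)
)

# stats::ecdf() returns a step function giving the proportion of values <= x
cdf_f <- ecdf(toy$salary[toy$gender == "F"])
cdf_m <- ecdf(toy$salary[toy$gender == "M"])

# Proportion of each group earning $100,000 or less
cdf_f(1e5)  # 0.75
cdf_m(1e5)  # 0.5
```

A lower CDF value at the same salary means a larger share of that group earns more than the threshold.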

Question 7: Quantiles

Calculate the quantiles of salary by gender. You can either subset the data with dplyr::filter() and dataframe assignment, or you can group by, summarize by quantile, and ungroup.

Bonus point: Assign the output to a dataframe, and use inline code to call individual values when answering the following questions. Do not let R use scientific notation in the text output; check the knitted document.

## # A tibble: 2 × 6
##   gender   min    Q1 median     Q3     max
##   <fct>  <dbl> <dbl>  <dbl>  <dbl>   <dbl>
## 1 F        140 68000  90000 110513  350000
## 2 M        105 75000  97000 123000 1027653

What is the difference in salary between men and women at the median?

Median salary for women is $90,000
Median salary for men is $97,000
The difference at the median is $7,000

At the top percentile (maximum)?

Maximum salary for women is $350,000
Maximum salary for men is $1,027,653
The difference at the maximum is $677,653
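The differences above can be computed and formatted in code so that the knitted text never shows scientific notation. This is a minimal sketch: the data frame repeats the median and maximum values from the quantile table, and fmt_dollars() is a hypothetical helper, not a function from any package.

```r
# Median and maximum salaries copied from the quantile table above
quantiles_sal_gen <- data.frame(
  gender = c("F", "M"),
  median = c(90000, 97000),
  max    = c(350000, 1027653)
)

# Hypothetical helper: format a dollar amount without scientific notation
fmt_dollars <- function(x) {
  paste0("$", format(x, big.mark = ",", scientific = FALSE, trim = TRUE))
}

# diff() on the two-row columns gives the M value minus the F value
median_diff <- diff(quantiles_sal_gen$median)
max_diff    <- diff(quantiles_sal_gen$max)

fmt_dollars(median_diff)  # "$7,000"
fmt_dollars(max_diff)     # "$677,653"
```

In an R Markdown document, a call such as fmt_dollars(max_diff) can be placed in inline code so the text updates whenever the data change.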

Do you think there is a salary difference by gender across the pay scale? What other information would you need to test your hypothesis?

Yes, I think that there is a salary difference across the pay scale for men and women. At the first quartile, the median, and the third quartile, men earn more than women, and the maximum salary for a man in this data frame of engineering graduates is far greater than the maximum salary for a woman. I think that the available information is sufficient to make this inference, but other factors that would help test it include: types of engineering jobs/fields, the year that the graduates joined the workforce, how many years they have been working in their job, and ignoring extreme outliers (especially for men’s salaries).

Question 8: Hypothetical analysis

Think about what other variables you would like to include in a hypothetical analysis. From your perspective, what are the most important individual, family, and workforce factors related to salary, beyond gender and age?

Some other important factors/variables that relate to salary might include:
- Education
- Experience and skills
- Type of degree (bachelor’s, master’s, or PhD)
- Marital status
- Number of children
- Nepotism
- Background
- Location

Question 9: Recreate plot

Recreate this plot with the mpg dataset. Remember to use ?mpg for information on the dataset and the variables. How would you describe the correlation between the independent variable and dependent variable? Do you see any patterns when considering the third variable?

(View R Markdown PDF for image)

## # A tibble: 5 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…

This plot of mileage by engine displacement shows a negative correlation between the independent variable (engine displacement) and the dependent variable (highway mileage). For the third variable (car class), it looks like SUVs and pickup trucks have the worst mpg, while two-seaters, compact, and subcompact cars have the best mpg. This is most likely because of the weight of the car and the corresponding power of the engine. If a car weighs more, then the engine will have to work harder, or a higher power engine that uses more fuel will be needed.
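The visual impression of a negative relationship can be checked numerically. The sketch below assumes the ggplot2 package is installed (it provides the mpg dataset) and computes the Pearson correlation overall and within each car class; no output values are quoted here because they were not part of the original analysis.

```r
library(ggplot2)  # provides the mpg dataset

# Pearson correlation between engine displacement and highway mileage
r_all <- cor(mpg$displ, mpg$hwy)

# Correlation within each car class, to see whether the pattern holds per class
r_by_class <- sapply(split(mpg, mpg$class),
                     function(d) cor(d$displ, d$hwy))

r_all
r_by_class
```

A per-class breakdown like this is one way to see whether the overall trend is driven mainly by differences between classes or also appears within them.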

Appendix

# set global options for figures, code, warnings, and messages
knitr::opts_chunk$set(fig.width=6, fig.height=4, fig.path="../figs/",
                      echo=FALSE, warning=FALSE, message=FALSE)
# load packages for current session
library(tidyverse)
library(dplyr)
library(ggplot2)
library(stringr)
library(lubridate)
library(janitor)
# import and tidy salary data
raw_salaries <- readr::read_csv(file = "salaries.csv")

eng_salaries <- raw_salaries %>%
  clean_names() %>%
  mutate(gender = as.factor(gender)) %>%
  filter(salary != 0)

head(eng_salaries, n=10)

# number of observations with salary as 0 
zero_salaries <- nrow(raw_salaries) - nrow(eng_salaries)
print(zero_salaries)

# number of factor levels
levels(eng_salaries$gender)


# Age is a continuous variable, so I used a histogram to show the distribution of ages
age_plot <- ggplot(data = eng_salaries,
                   aes(x = age)) +
  # binwidth = 1 gives one bar per year of age; bins is omitted because binwidth overrides it
  geom_histogram(binwidth = 1,
                 linewidth = .3,
                 fill = "mediumpurple",
                 color = "black") +
  labs(title = "Age Histogram of Engineering Graduates",
       x = "Age",
       y = "Number of Graduates",
       subtitle = "2017 National Survey of College Graduates") +
  scale_x_continuous(breaks = seq(from = 20,
                                  to = 80,
                                  by = 5)) +
  theme_minimal()
age_plot


# Gender is a categorical variable with two levels (M, F), so I used a bar plot to show the number of female and male graduates
gender_plot <- ggplot(data = eng_salaries,
                      aes(x = gender,
                          fill = gender)) +
  geom_bar(width = 0.5,
           position = "dodge",
           color = "black") +
  labs(title = "Gender Barplot of Engineering Graduates",
       x = "Gender",
       y = "Number of Graduates",
       subtitle = "2017 National Survey of College Graduates") +
  scale_y_continuous(breaks = seq(from = 0,
                                  to = 3500,
                                  by = 500)) +
  scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
  theme_minimal()
gender_plot


# Salary is a continuous variable, so I used a histogram to show the distribution of salaries for the graduates
salary_plot <- ggplot(data = eng_salaries,
                      aes(x = salary)) +
  geom_histogram(bins = 50,
                 fill = "seagreen",
                 color = "black") +
  labs(title = "Salary Histogram of Engineering Graduates",
       x = "Salary ($)",
       y = "Number of Graduates",
       subtitle = "2017 National Survey of College Graduates") +
  theme_minimal()

salary_plot
# histogram of salaries split by gender

salary_gender_plot <- ggplot(data = eng_salaries,
                             aes(x = salary,
                                 fill = gender)) +
  geom_histogram(bins = 50,
                 color = "lightgrey") +
  facet_wrap(~ gender) +
  scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
  labs(title = "Histogram of Engineering Salaries, Faceted by Gender",
       x = "Salary ($)",
       y = "Number of Graduates",
       subtitle = "2017 National Survey of College Graduates") 
  

salary_gender_plot
# histogram of ages split by gender
age_gender_plot <- ggplot(data = eng_salaries, 
                        aes(x = age,
                            fill = gender)) +
  geom_histogram(bins = 50,
                 color = "lightgrey") +
  facet_wrap(~ gender) +
  scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
  labs(title = "Age Histogram of Engineering Graduates, Faceted by Gender",
       x = "Age",
       y = "Number of Graduates",
       subtitle = "2017 National Survey of College Graduates") +
  scale_x_continuous(breaks = seq(from = 20,
                                  to = 80,
                                  by = 5))
                        
age_gender_plot                        
# boxplots of salary data by gender

sal_gen_boxplot <- ggplot(data = eng_salaries,
                          aes(x = salary,
                              fill = gender)) +
  geom_boxplot(outlier.shape = 1) +
  facet_wrap(~ gender) +
  scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
  # a horizontal boxplot has no count axis, so the y label is dropped
  labs(title = "Boxplot of Engineering Salaries, Faceted by Gender",
       x = "Salary ($)",
       y = NULL,
       subtitle = "2017 National Survey of College Graduates")
sal_gen_boxplot
# boxplots of age data by gender

age_gen_boxplot <- ggplot(data = eng_salaries, 
                        aes(x = age,
                            fill = gender,
                            y = "")) +
  geom_boxplot(outlier.shape = 1) +
  facet_wrap(~ gender) +
  scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
  labs(title = "Age Boxplot of Engineering Graduates, Faceted by Gender",
       x = "Age",
       y = "",
       subtitle = "2017 National Survey of College Graduates") +
  scale_x_continuous(breaks = seq(from = 20,
                                  to = 80,
                                  by = 5))
age_gen_boxplot
# scatterplot of salary across age by gender

age_sal_scat <- ggplot(data = eng_salaries,
                       aes(x = age,
                           y = salary,
                           color = gender)) +
  geom_point(alpha = 0.35) +
  facet_wrap(~ gender) +
  scale_x_continuous(breaks = seq(from = 20,
                                  to = 80,
                                  by = 5)) +
  labs(title = "Scatterplot of Engineering Graduates' Age vs. Salary, Faceted by Gender",
       x = "Age",
       y = "Salary ($)",
       subtitle = "2017 National Survey of College Graduates")
  
  

age_sal_scat
  
# plot cdf of salary by gender

sal_gen_ecdf <- ggplot(data = eng_salaries,
                       aes(x = salary,
                           color = gender)) + 
  stat_ecdf() + 
  facet_wrap(~ gender) +
  scale_x_log10(limits = c(5e4, 5e5)) +
  labs(title = "CDF of Graduates' Salaries by Gender",
       x = "Salary ($, log scale)",
       y = "Cumulative Proportion",
       subtitle = "2017 National Survey of College Graduates")

sal_gen_ecdf
  

# calculate quantiles of salary by gender

quantiles_sal_gen <- eng_salaries %>%
  group_by(gender) %>%
  summarize(min = min(salary),
            Q1 = quantile(salary, 0.25),
            median = quantile(salary, 0.5),
            Q3 = quantile(salary, 0.75),
            max = max(salary)) %>%
  ungroup()


quantiles_sal_gen
# information on mpg dataset 
?mpg
head(mpg, n=5)

# Recreate mpg plot
mpg_plot <- ggplot(data = mpg,
                   aes(x = displ,
                       y = hwy,
                       color = class)) +
  geom_point() +
  labs(title = "Mileage by Engine Displacement",
       x = "Engine Displacement (litres)",
       y = "Highway Miles per Gallon",
       subtitle = "Data from 1999 and 2008",
       caption = "Source: EPA (http://fueleconomy.gov)",
       color = "Car Class") +
  theme_minimal()
  

mpg_plot

MECH476: Engineering Data Analysis in R

Chapter 7 Homework: Multivariate Exploratory Data Analysis

Connor Stephan

08 November, 2023