In Chapter 5, we briefly explored data on the salaries of engineering graduates from the National Science Foundation 2017 National Survey of College Graduates from a univariate perspective. Now, let’s explore the relationships between multiple variables.
When a question asks you to make a plot, remember to set a theme, title, subtitle, labels, colors, etc. It is up to you how to personalize your plots, but put in some effort, and make the plotting approach consistent throughout the document. For example, you could use the same theme for all plots. I also like to use the subtitle as a place for the main summary for the viewer.
Within a pipeline, import the data from the .csv file, convert all
column names to lowercase text (either “manually” with
dplyr::rename(), or use clean_names()
from the janitor package), convert
gender from “numeric” to “factor”, and drop any and all
observations with salary recorded as 0. Assign this to a
dataframe object with a meaningful name.
## # A tibble: 10 × 3
## salary age gender
## <dbl> <dbl> <fct>
## 1 89000 32 M
## 2 90000 31 M
## 3 95000 38 M
## 4 130000 36 F
## 5 25000 60 M
## 6 204000 71 M
## 7 80000 52 M
## 8 80350 31 M
## 9 120000 34 M
## 10 96000 36 M
How many observations have a 0 (zero) value for salary? Note: The last question asked you to remove these observations from the resultant data frame.
## [1] 15
There are 15 observations with a salary value of 0, which means that 15 respondents in the survey reported no salary.
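The zero-salary count can also be taken directly from the raw data instead of from a difference in row counts. The following is a minimal sketch on a small made-up data frame; `toy_salaries` and its values are hypothetical and are not the survey data.

```r
# Hypothetical toy data standing in for the raw salary file
toy_salaries <- data.frame(
  salary = c(89000, 0, 95000, 0, 25000),
  age    = c(32, 31, 38, 36, 60),
  gender = factor(c("M", "M", "M", "F", "M"))
)

# Count rows where salary is exactly 0 (na.rm guards against missing salaries)
n_zero <- sum(toy_salaries$salary == 0, na.rm = TRUE)

# Drop the zero-salary rows; subset() also drops rows where the condition is NA
toy_nonzero <- subset(toy_salaries, salary != 0)

n_zero
nrow(toy_nonzero)
```

Comparing the direct count with the row-count difference is a cheap sanity check: the two agree only when no salary values are missing, because filtering on `salary != 0` also drops rows where salary is `NA`.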
What are the levels in gender? (Ignore the fact that the
observations refer to “biological sex”, not “gender”. Gender is
now recognized as a fluid term with more than two options;
biological sex - what was assigned at birth - is a binary
term).
## [1] "F" "M"
There are two levels in the gender column which are “F” for female and “M” for male.
Using what you learned in Chapter 5, generate basic plots and/or
descriptive statistics to explore age, gender,
and salary. List whether each variable is continuous or
categorical, and explain how and why you adjusted your EDA approach
accordingly.
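One way to choose the EDA approach is to check each column’s type first and then summarize accordingly. The sketch below uses only the ten rows printed earlier; the name `sample_rows` is an assumption, and ten rows are not representative of the full dataset.

```r
# The ten rows shown in the printed tibble (not the full survey data)
sample_rows <- data.frame(
  salary = c(89000, 90000, 95000, 130000, 25000,
             204000, 80000, 80350, 120000, 96000),
  age    = c(32, 31, 38, 36, 60, 71, 52, 31, 34, 36),
  gender = factor(c("M", "M", "M", "F", "M", "M", "M", "M", "M", "M"))
)

# Treat factor columns as categorical and numeric columns as continuous
var_types <- sapply(sample_rows, function(col) {
  if (is.factor(col)) "categorical" else "continuous"
})
var_types

# Continuous variables: five-number summary plus the mean
summary(sample_rows$salary)
summary(sample_rows$age)

# Categorical variable: counts per level
gender_counts <- table(sample_rows$gender)
gender_counts
```

Continuous variables lend themselves to histograms and boxplots, while a categorical variable is better summarized with counts and a bar plot.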
Create a histogram of salary, faceted by
gender. Add bins = 50 and
color = "lightgrey".
Create a histogram of age, faceted by
gender. Add bins = 50 and
color = "lightgrey".
Create a boxplot of salary, faceted by
gender. Use outlier.shape = 1 to better
visualize the outliers.
Create a boxplot of age, faceted by
gender.
Create a scatterplot of age (x-axis) and
salary, differentiating by gender.
Plot the cumulative distribution function of salary by
gender. Adjust the x-axis with
scale_x_log10(limits = c(5e4, 5e5)) to zoom in a bit. What
do you notice about the salaries for men and women? Hint: Remember there
are greater differences the farther up you go on a log scale axis.
The two cumulative distribution function plots show a difference in salaries between men and women. The CDF for women flattens quickly at both the bottom and the top of the plot, which indicates that most women’s salaries in engineering fall within a narrow range. The CDF for men tapers off more gradually at the upper end, which indicates that a larger share of men earn well above the median salary.
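The hint about the log scale can be checked with a little arithmetic: equal horizontal distances on a log axis correspond to equal ratios, so the same visual gap means a larger dollar gap at higher salaries. The sketch below assumes a hypothetical 10% gap at three salary levels chosen for illustration.

```r
# Illustrative salary levels within the plotted range (assumed values)
base_salary <- c(5e4, 1e5, 5e5)

# Hypothetical 10% gap between two groups at each level
ratio <- 1.10
higher_salary <- base_salary * ratio

# Horizontal distance on a log10 axis: identical at every level
log_distance <- log10(higher_salary) - log10(base_salary)

# Dollar difference: grows in proportion to the salary level
dollar_gap <- higher_salary - base_salary

data.frame(base_salary, log_distance, dollar_gap)
```

Two curves that sit the same distance apart along the log axis therefore represent a gap that is ten times larger in dollars at 500,000 than at 50,000.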
Calculate the quantiles of salary by
gender. You can either subset the data with
dplyr::filter() and dataframe assignment, or you can group
by, summarize by quantile, and ungroup.
Bonus point: Assign the output to a dataframe, and use inline code to call individual values when answering the following questions. Do not let R use scientific notation in the text output; check the knitted document.
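The first option named in the question, subsetting with `dplyr::filter()` and assigning the result to its own data frame, might look like the sketch below. It uses only the ten rows printed earlier (the names `sample_rows` and `men` are assumptions), so the values do not match the full-data results.

```r
library(dplyr)

# The ten rows shown in the printed tibble (not the full survey data)
sample_rows <- data.frame(
  salary = c(89000, 90000, 95000, 130000, 25000,
             204000, 80000, 80350, 120000, 96000),
  age    = c(32, 31, 38, 36, 60, 71, 52, 31, 34, 36),
  gender = factor(c("M", "M", "M", "F", "M", "M", "M", "M", "M", "M"))
)

# Subset to one gender and assign the result to its own data frame
men <- sample_rows %>%
  filter(gender == "M")

# Quantiles of salary for that subset (R's default quantile type)
men_quantiles <- quantile(men$salary, probs = c(0, 0.25, 0.5, 0.75, 1))
men_quantiles
```

The same two steps would be repeated for `gender == "F"`. The group-by approach avoids that repetition and returns both genders in one table.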
## # A tibble: 2 × 6
## gender min Q1 median Q3 max
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 140 68000 90000 110513 350000
## 2 M 105 75000 97000 123000 1027653
What is the difference in salary between men and women at the median?
At the top percentile (maximum)?
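Both differences can be computed from the quantile table printed above. The sketch below rebuilds that table by hand from the printed output; `quantiles_printed`, `get_stat()`, and `fmt_dollars()` are hypothetical names introduced for illustration.

```r
# Values copied from the printed quantile table
quantiles_printed <- data.frame(
  gender = factor(c("F", "M")),
  min    = c(140, 105),
  Q1     = c(68000, 75000),
  median = c(90000, 97000),
  Q3     = c(110513, 123000),
  max    = c(350000, 1027653)
)

# Pull one statistic for one gender
get_stat <- function(df, g, stat) df[df$gender == g, stat]

# Men's value minus women's value at the median and at the maximum
median_diff <- get_stat(quantiles_printed, "M", "median") -
  get_stat(quantiles_printed, "F", "median")
max_diff <- get_stat(quantiles_printed, "M", "max") -
  get_stat(quantiles_printed, "F", "max")

# Format for text output with thousands separators and no scientific notation
fmt_dollars <- function(x) format(x, big.mark = ",", scientific = FALSE)

fmt_dollars(median_diff)
fmt_dollars(max_diff)
```

With the printed values, the median difference is 7,000 and the difference at the maximum is 677,653. In the R Markdown source, an inline call such as `r fmt_dollars(median_diff)` would place the formatted value in the text.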
Do you think there is a salary difference by gender across the pay scale? What other information would you need to test your hypothesis?
Yes, I think that there is a salary difference across the pay scale for men and women. Men earn more than women at the first quartile, the median, and the third quartile, and the maximum salary for a man in this data frame of engineering graduates is far greater than the maximum salary for a woman. I think the available information is sufficient to make this inference, but other information that would help test it includes: the type of engineering job or field, the year the graduates joined the workforce, how many years they have worked in their job, and an analysis that excludes extreme outliers (especially for men’s salaries).
Think about what other variables you would like to include in a hypothetical analysis. From your perspective, what are the most important individual, family, and workforce factors related to salary, beyond gender and age?
Some other important factors/variables that relate to salary might include:
- Education
- Experience and skills
- Type of degree (bachelor’s, master’s, or PhD)
- Marital status
- Number of children
- Nepotism
- Background
- Location
Recreate this plot with the mpg dataset. Remember to use
?mpg for information on the dataset and the variables. How
would you describe the correlation between the independent variable and
dependent variable? Do you see any patterns when considering the third
variable?
(View R Markdown PDF for image)
## # A tibble: 5 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
This plot of highway mileage by engine displacement shows a negative correlation between the independent variable (engine displacement) and the dependent variable (highway miles per gallon). For the third variable (car class), SUVs and pickup trucks have the worst mpg, and compact and subcompact cars have the best mpg. Two-seaters have better mpg than other cars with similarly large engines. This is most likely because of the weight of the car and the corresponding power of the engine. If a car weighs more, then the engine has to work harder, or a higher-power engine that uses more fuel is needed.
# set global options for figures, code, warnings, and messages
knitr::opts_chunk$set(fig.width=6, fig.height=4, fig.path="../figs/",
echo=FALSE, warning=FALSE, message=FALSE)
# load packages for current session
library(tidyverse)
library(dplyr)
library(ggplot2)
library(stringr)
library(lubridate)
library(janitor)
# import and tidy salary data
raw_salaries <- readr::read_csv(file = "salaries.csv")
eng_salaries <- raw_salaries %>%
clean_names() %>%
mutate(gender = as.factor(gender)) %>%
filter(salary != 0)
head(eng_salaries, n=10)
# number of observations with salary as 0
zero_salaries <- nrow(raw_salaries) - nrow(eng_salaries)
print(zero_salaries)
# number of factor levels
levels(eng_salaries$gender)
# Age is continuous, so I used a histogram with one-year bins to show the range of ages
# (binwidth overrides bins, so only binwidth is set)
age_plot <- ggplot(data = eng_salaries,
aes(x = age)) +
geom_histogram(binwidth = 1,
linewidth = .3,
fill = "mediumpurple",
color = "black") +
labs(title = "Age Histogram of Engineering Graduates",
x = "Age",
y = "Number of Graduates",
subtitle = "2017 National Survey of College Graduates") +
scale_x_continuous(breaks = seq(from = 20,
to = 80,
by = 5)) +
theme_minimal()
age_plot
# Gender is categorical with two levels (F and M), so I used a bar plot to show the number of female and male graduates
gender_plot <- ggplot(data = eng_salaries,
aes(x = gender,
fill = gender)) +
geom_bar(width = 0.5,
position = "dodge",
color = "black") +
labs(title = "Gender Barplot of Engineering Graduates",
x = "Gender",
y = "Number of Graduates",
subtitle = "2017 National Survey of College Graduates") +
scale_y_continuous(breaks = seq(from = 0,
to = 3500,
by = 500)) +
scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
theme_minimal()
gender_plot
# Salary is continuous, so I used a histogram to show the range of salaries for the graduates
salary_plot <- ggplot(data = eng_salaries,
aes(x = salary)) +
geom_histogram(bins = 50,
fill = "seagreen",
color = "black") +
labs(title = "Salary Histogram of Engineering Graduates",
x = "Salary ($)",
y = "Number of Graduates",
subtitle = "2017 National Survey of College Graduates") +
theme_minimal()
salary_plot
# histogram of salaries split by gender
salary_gender_plot <- ggplot(data = eng_salaries,
aes(x = salary,
fill = gender)) +
geom_histogram(bins = 50,
color = "lightgrey") +
facet_wrap(~ gender) +
scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
labs(title = "Histogram of Engineering Salaries, Faceted by Gender",
x = "Salary ($)",
y = "Number of Graduates",
subtitle = "2017 National Survey of College Graduates")
salary_gender_plot
# histogram of ages split by gender
age_gender_plot <- ggplot(data = eng_salaries,
aes(x = age,
fill = gender)) +
geom_histogram(bins = 50,
color = "lightgrey") +
facet_wrap(~ gender) +
scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
labs(title = "Age Histogram of Engineering Graduates, Faceted by Gender",
x = "Age",
y = "Number of Graduates",
subtitle = "2017 National Survey of College Graduates") +
scale_x_continuous(breaks = seq(from = 20,
to = 80,
by = 5))
age_gender_plot
# boxplots of salary data by gender
sal_gen_boxplot <- ggplot(data = eng_salaries,
aes(x = salary,
fill = gender,
y = "")) +
geom_boxplot(outlier.shape = 1) +
facet_wrap(~ gender) +
scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
# a boxplot has no count axis, so the y label is left blank
labs(title = "Boxplot of Engineering Salaries, Faceted by Gender",
x = "Salary ($)",
y = "",
subtitle = "2017 National Survey of College Graduates")
sal_gen_boxplot
# boxplots of age data by gender
age_gen_boxplot <- ggplot(data = eng_salaries,
aes(x = age,
fill = gender,
y = "")) +
geom_boxplot(outlier.shape = 1) +
facet_wrap(~ gender) +
scale_fill_manual(values = c("M" = "skyblue", "F" = "pink")) +
labs(title = "Age Boxplot of Engineering Graduates, Faceted by Gender",
x = "Age",
y = "",
subtitle = "2017 National Survey of College Graduates") +
scale_x_continuous(breaks = seq(from = 20,
to = 80,
by = 5))
age_gen_boxplot
# scatterplot of salary across age by gender
age_sal_scat <- ggplot(data = eng_salaries,
aes(x = age,
y = salary,
color = gender)) +
geom_point(alpha = 0.35) +
facet_wrap(~ gender) +
scale_x_continuous(breaks = seq(from = 20,
to = 80,
by = 5)) +
labs(title = "Scatterplot of Engineering Graduates' Age vs. Salary, Faceted by Gender",
x = "Age",
y = "Salary ($)",
subtitle = "2017 National Survey of College Graduates")
age_sal_scat
# plot cdf of salary by gender
sal_gen_ecdf <- ggplot(data = eng_salaries,
aes(x = salary,
color = gender)) +
stat_ecdf() +
facet_wrap(~ gender) +
scale_x_log10(limits = c(5e4, 5e5)) +
labs(title = "CDF of Graduates' Salary by Gender",
x = "Salary ($, log scale)",
y = "Cumulative Proportion of Graduates",
subtitle = "2017 National Survey of College Graduates")
sal_gen_ecdf
# calculate quantiles of salary by gender
quantiles_sal_gen <- eng_salaries %>%
group_by(gender) %>%
summarize(min = min(salary),
Q1 = quantile(salary, 0.25),
median = quantile(salary, 0.5),
Q3 = quantile(salary, 0.75),
max = max(salary)) %>%
ungroup()
quantiles_sal_gen
# information on mpg dataset
?mpg
head(mpg, n=5)
# Recreate mpg plot
mpg_plot <- ggplot(data = mpg,
aes(x = displ,
y = hwy,
color = class)) +
geom_point() +
labs(title = "Mileage by Engine Displacement",
x = "Engine Displacement (litres)",
y = "Highway Miles per Gallon",
subtitle = "Data from 1999 and 2008",
caption = "Source: EPA (http://fueleconomy.gov)",
color = "Car Class") +
theme_minimal()
mpg_plot