Rename this file to your [LAST NAME]_ps2.Rmd file. Then change the author: [Your Name] to your name.
We will be using the sc_debt.Rds file.
All of the following questions should be answered in this .Rmd file. There are code chunks with incomplete code that need to be filled in.
This problem set is worth 5 total points, plus 1 extra credit point. The point values for each question are indicated in brackets below. To receive full credit, you must have the correct code. In addition, some questions ask you to provide a written response in addition to the code.
You are free to rely on whatever resources you need to complete this problem set, including lecture notes, lecture presentations, Google, your classmates…you name it. However, the final submission must be completed by you. There are no group assignments. To submit, compile the completed problem set and upload the HTML file to Brightspace on Friday by midnight.
Note: Do NOT ask for office hours the day before this assignment is due if you have not started the assignment. Lack of planning & effort on your part does not constitute an emergency on mine.
Good luck!
*Copy the link to ChatGPT you used here: _________________
Require tidyverse and load the sc_debt.Rds data by assigning it to an object named df.
require('tidyverse') # Load tidyverse
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_rds('~/DS1000/sc_debt.Rds') # Load the dataset
Research Question: Do students who graduate from smaller schools (i.e., schools with smaller student bodies) make more money in their future careers? Before looking at the data, write out what you think the answer is, and explain why you think so.
I do not expect school size to have a clear causal effect on earnings, because there are good and bad liberal arts schools, and there are good and bad universities.
Based on this research question, what is the outcome / dependent / \(Y\) variable and what is the explanatory / independent / \(X\) variable? What are their average values in the data?
df %>%
  summarise(
    avg_x = mean(ugds, na.rm = TRUE),
    avg_y = mean(md_earn_wne_p6, na.rm = TRUE)
  )
## # A tibble: 1 × 2
## avg_x avg_y
## <dbl> <dbl>
## 1 4861. 33028.
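The dependent variable (Y) is the salary in future careers (md_earn_wne_p6), and the independent variable (X) is the size of the school (ugds). Their averages in the data are about 4,861 for X (avg_x) and about 33,028 for Y (avg_y).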
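The na.rm = TRUE argument matters in the summary above, because a single missing value otherwise turns the whole mean into NA. A minimal sketch on a made-up toy table (the values are hypothetical and are not taken from sc_debt.Rds; only the column names match):

```r
library(dplyr)

# Hypothetical toy data standing in for sc_debt.Rds; values are made up.
toy <- tibble(
  ugds           = c(1000, 5000, NA, 12000),
  md_earn_wne_p6 = c(30000, NA, 28000, 41000)
)

toy_means <- toy %>%
  summarise(
    avg_x_raw = mean(ugds),                        # NA: the missing value propagates
    avg_x     = mean(ugds, na.rm = TRUE),          # mean of the 3 observed sizes
    avg_y     = mean(md_earn_wne_p6, na.rm = TRUE) # mean of the 3 observed earnings
  )

toy_means
```

Note that each mean drops missing values separately, so avg_x and avg_y can be based on different sets of schools.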
Create the scatterplot of the data that analyzes your hypothesis, along with a line of best fit. Then, describe the result. Is your answer to the research question supported?
df %>%
  ggplot(aes(x = ugds,
             y = md_earn_wne_p6)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'relationship between size of school and salary after graduation',
       x = 'size of school',
       y = 'salary after graduation')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 241 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 241 rows containing missing values or values outside the scale range
## (`geom_point()`).
There does not seem to be a strong relationship between X and Y: the points form a roughly round cloud, meaning that both high and low values of X occur with both high and low values of Y. So my answer to the research question is supported.
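One way to back up a visual judgment like this with a number is the correlation coefficient. The sketch below uses simulated data, not the real sc_debt.Rds values, so the result says nothing about the actual schools; it only shows the call and how missing values are handled:

```r
# Simulated stand-in data (an assumption for illustration only).
set.seed(1)
toy <- data.frame(
  ugds           = runif(200, min = 500, max = 30000),
  md_earn_wne_p6 = rnorm(200, mean = 33000, sd = 8000)
)
toy$md_earn_wne_p6[1:5] <- NA  # mimic schools with missing earnings

# With the default settings, any NA makes the result NA.
r_default <- cor(toy$ugds, toy$md_earn_wne_p6)

# use = "complete.obs" keeps only rows where both variables are observed.
r <- cor(toy$ugds, toy$md_earn_wne_p6, use = "complete.obs")
r
```

A value near 0 would be consistent with a round cloud of points, while values near -1 or 1 would indicate a strong linear relationship.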
Does this relationship change by whether the school is a research university? Using the filter() function, create two versions of the plot, one for research universities and the other for non-research universities. What do you find?
df %>%
  filter(research_u == 1) %>%
  ggplot(aes(x = ugds,
             y = md_earn_wne_p6)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'relationship between size of school and salary after graduation',
       x = 'size of school',
       y = 'salary after graduation',
       subtitle = 'for research university')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
df %>%
  filter(research_u == 0) %>%
  ggplot(aes(x = ugds,
             y = md_earn_wne_p6)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = 'relationship between size of school and salary after graduation',
       x = 'size of school',
       y = 'salary after graduation',
       subtitle = 'for non research university')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 240 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 240 rows containing missing values or values outside the scale range
## (`geom_point()`).
Visually, research universities show a negative relationship, meaning that larger schools are associated with lower salaries, whereas non-research universities show a positive relationship. That said, both relationships are too weak to support a firm conclusion.
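The direction of each fitted line can also be read off numerically by fitting a separate linear model per group and looking at the sign of the slope. A minimal sketch, assuming a tiny made-up data set constructed so that the two groups slope in opposite directions (slope_for is a hypothetical helper, not part of the assignment):

```r
# Made-up values chosen for illustration; not taken from sc_debt.Rds.
toy <- data.frame(
  ugds           = c(2000, 8000, 15000, 30000, 1000, 3000, 6000, 9000),
  md_earn_wne_p6 = c(48000, 47000, 45000, 41000, 26000, 29000, 31000, 35000),
  research_u     = c(1, 1, 1, 1, 0, 0, 0, 0)
)

# Hypothetical helper: slope of earnings on school size within one subset.
slope_for <- function(data) {
  coef(lm(md_earn_wne_p6 ~ ugds, data = data))[["ugds"]]
}

slope_research     <- slope_for(subset(toy, research_u == 1))
slope_non_research <- slope_for(subset(toy, research_u == 0))

c(research = slope_research, non_research = slope_non_research)
```

The sign of the slope gives the direction only; how much weight to put on it still depends on how scattered the points are around the line.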
Instead of creating two separate plots, color the points by whether the school is a research university. To do this, you first need to modify the research_u variable to be categorical (it is currently stored as numeric). To do this, use the mutate command with ifelse() to create a new variable called research_u_cat which is either “Research” if research_u is equal to 1, and “Non-Research” otherwise.
df <- df %>%
  mutate(research_u_cat = ifelse(research_u == 1,
                                 'Research',      # labels as specified in the question
                                 'Non-Research'))
df %>%
  ggplot(aes(x = ugds,
             y = md_earn_wne_p6,
             color = research_u_cat)) +
  geom_point() +
  geom_smooth() +
  labs(title = 'relationship between size of school and salary after graduation',
       subtitle = 'by research vs non research university',
       x = 'size of school',
       y = 'salary after graduation')
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 241 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 241 rows containing missing values or values outside the scale range
## (`geom_point()`).
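The ifelse() recode used for the coloring can be sanity-checked on a small vector before trusting it in a plot. A minimal sketch with a made-up toy table, using the labels the question asks for:

```r
library(dplyr)

# Hypothetical toy column; 1 = research university, 0 = not.
toy <- tibble(research_u = c(1, 0, 0, 1, 0))

toy <- toy %>%
  mutate(research_u_cat = ifelse(research_u == 1,
                                 "Research",
                                 "Non-Research"))

# Cross-tabulate old and new values: each numeric code should map to one label.
toy %>% count(research_u, research_u_cat)
```

Because the new column is character rather than numeric, ggplot treats it as a discrete variable and assigns one color per category instead of a continuous color scale.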
Write a short paragraph discussing your findings. What do you think the data is telling us?
From a statistical perspective, the data tell us very little, because the goodness of fit is very low. If we tolerate that, the result is that there is some positive relationship between school size and salary after graduation. But if we split schools into research and non-research universities, we find that larger research universities are associated with lower salaries, perhaps because the rigor of a larger university, which would otherwise help students get a job, instead inspires them to pursue research.
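The "goodness of fit" mentioned here can be quantified with R-squared from a linear model. The sketch below runs on simulated data with a weak built-in relationship (an assumption for illustration, not the real sc_debt.Rds values), so it shows only how the number is obtained:

```r
# Simulated stand-in data: a small positive slope buried in a lot of noise.
set.seed(42)
n <- 300
toy <- data.frame(ugds = runif(n, min = 500, max = 30000))
toy$md_earn_wne_p6 <- 32000 + 0.05 * toy$ugds + rnorm(n, sd = 8000)

fit <- lm(md_earn_wne_p6 ~ ugds, data = toy)

# Share of the variation in earnings explained by school size (between 0 and 1).
r2 <- summary(fit)$r.squared
r2
```

With a single predictor, R-squared equals the squared correlation between X and Y, so a weak correlation implies a low R-squared.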