What factors influence the cost of attendance?
How do cost of attendance and unemployment rate relate?
This analysis uses the College Scorecard dataset from the US Department of Education. We center our exploration of the dataset on the variable reflected in our question, our target variable: cost of attendance (costt4_a).
library(tidyverse)  # provides %>%, the dplyr verbs, ggplot2, and fct_reorder() used below
cor.test(college_sc$costt4_a, college_sc$sat_avg)
##
## Pearson's product-moment correlation
##
## data: college_sc$costt4_a and college_sc$sat_avg
## t = 23.05, df = 1306, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4980670 0.5751889
## sample estimates:
## cor
## 0.5377519
college_sc %>%
ggplot(aes(x = sat_avg, y = costt4_a)) + geom_point() + geom_smooth(method = "lm") +
annotate("text", x = 750, y = 75000, label= "r = 0.54, p < 0.001")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 5804 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 5804 rows containing missing values or values outside the scale range
## (`geom_point()`).
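The test above reports df = 1306, which means 1308 institutions have both an average SAT score and a cost of attendance; the plot warnings show the remaining 5804 rows being dropped. The following is a minimal sketch, using a made-up data frame (`toy` is hypothetical and does not contain College Scorecard values), of how `cor.test()` uses only complete pairs and how that count can be checked directly.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  sat_avg  = c(900, 1000, NA, 1200, 1300, 1100, NA, 1400),
  costt4_a = c(20000, 25000, 30000, NA, 45000, 32000, 28000, 52000)
)

# Rows where both variables are observed
complete_pairs <- complete.cases(toy$sat_avg, toy$costt4_a)
n_used <- sum(complete_pairs)

# cor.test() drops incomplete pairs; its degrees of freedom equal n_used - 2
ct <- cor.test(toy$costt4_a, toy$sat_avg)

n_used                 # number of pairs actually used
unname(ct$parameter)   # degrees of freedom reported by the test
```

Reporting the number of complete pairs next to r makes it clear that the correlation describes only the subset of institutions that publish both values.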
## Exploring the cost of attendance variable
summary(college_sc$costt4_a)
college_sc %>% ggplot(aes(x = costt4_a)) + geom_density()
Which states have the highest average unemployment rate?
college_sc %>%
group_by(state) %>%
summarise(avg_unemp_rate = mean(unemp_rate, na.rm = TRUE)) %>%
arrange(desc(avg_unemp_rate))
## # A tibble: 59 × 2
## state avg_unemp_rate
## <chr> <dbl>
## 1 PR 7.88
## 2 AK 6.61
## 3 MT 5.93
## 4 DC 4.70
## 5 MS 4.54
## 6 NM 4.51
## 7 CA 4.49
## 8 ND 4.35
## 9 LA 4.35
## 10 VI 4.29
## # ℹ 49 more rows
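The table has 59 rows because the state column also includes DC and territories such as PR and VI. Some of these groups may contain only a few institutions, so a group mean can rest on very little data. Below is a rough sketch, on made-up data and in base R so that it runs without extra packages, of reporting the number of non-missing values next to each average. The data frame `toy` and the state codes in it are hypothetical.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  state      = c("AA", "AA", "AA", "BB", "CC", "CC"),
  unemp_rate = c(4.0, 5.0, NA, 8.0, 3.0, 3.5)
)

# Mean per state, ignoring missing values
avg <- tapply(toy$unemp_rate, toy$state, mean, na.rm = TRUE)

# Number of non-missing values behind each mean
n <- tapply(!is.na(toy$unemp_rate), toy$state, sum)

by_state <- data.frame(
  state          = names(avg),
  avg_unemp_rate = as.vector(avg),
  n              = as.vector(n)
)

# Sort from highest to lowest average, as in the summary above
by_state <- by_state[order(-by_state$avg_unemp_rate), ]
by_state
```

In this toy example the top-ranked group is based on a single row, which is the kind of case the extra column is meant to expose.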
college_sc %>%
ggplot(aes(x = unemp_rate, y = costt4_a, color = state)) + geom_point()
## Warning: Removed 3909 rows containing missing values or values outside the scale range
## (`geom_point()`).
Performing linear regression to relate cost of attendance to unemployment rate
lnr_reg <- lm(costt4_a ~ unemp_rate, college_sc)
summary(lnr_reg)
##
## Call:
## lm(formula = costt4_a ~ unemp_rate, data = college_sc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24696 -11640 -3277 8793 44978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41093.7 833.5 49.30 <2e-16 ***
## unemp_rate -3841.8 209.1 -18.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14750 on 3201 degrees of freedom
## (3909 observations deleted due to missingness)
## Multiple R-squared: 0.09542, Adjusted R-squared: 0.09514
## F-statistic: 337.7 on 1 and 3201 DF, p-value: < 2.2e-16
college_sc %>%
ggplot(aes(x = unemp_rate, y = costt4_a)) + geom_point() +
geom_smooth(method = "lm") +
annotate("text", x = 6, y = 75000, label= "r = -0.31, p < 0.001")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 3909 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 3909 rows containing missing values or values outside the scale range
## (`geom_point()`).
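In a simple linear regression with one predictor, the correlation coefficient has the sign of the slope and its square equals the multiple R-squared. For the model fitted above (R-squared 0.09542, negative slope), that gives r of about -0.31. The sketch below, on simulated data that is not taken from the College Scorecard, shows one way to compute r from a fitted model and build the annotation text from it rather than typing it by hand. The names `toy`, `fit`, and `label_text` are hypothetical.

```r
# Simulated data, invented for illustration only
set.seed(1)
toy <- data.frame(unemp_rate = runif(50, 2, 8))
toy$costt4_a <- 40000 - 3000 * toy$unemp_rate + rnorm(50, sd = 10000)

fit <- lm(costt4_a ~ unemp_rate, data = toy)

# r takes the sign of the slope; its magnitude is the square root of R-squared
r_from_fit <- sign(coef(fit)[["unemp_rate"]]) * sqrt(summary(fit)$r.squared)

# Cross-check against the correlation computed directly
r_direct <- cor(toy$unemp_rate, toy$costt4_a)
all.equal(r_from_fit, r_direct)

# Label text derived from the computed value
label_text <- sprintf("r = %.2f", r_from_fit)
label_text
```

A string built this way can be passed to the label argument of annotate(), so the text on the plot always matches the model it describes.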
college_sc %>%
group_by(state) %>%
summarise(avg_cost = mean(costt4_a, na.rm = TRUE)) %>%
arrange(desc(avg_cost)) %>%
head(10) %>%
# Create a variable to indicate whether state == "PA"
mutate(state_pa = ifelse(state == "PA", 1, 0)) %>%
# Adjust bar fill by values of new state_pa variable
ggplot(aes(x = fct_reorder(state, avg_cost), y = avg_cost, fill=state_pa)) +
geom_col() +
labs(
x = "State",
y = "Average cost of attendance",
title = "Which 10 states have the highest average cost of attendance?",
# Change subtitle to highlight our interest in Pennsylvania
subtitle = "Pennsylvania is among the top ten",
caption = "Source: US Dept. of Education"
) +
scale_y_continuous(labels = scales::dollar) +
# Suppress the fill color legend
theme(legend.position = "none")