What factors influence the cost of attendance?

How do cost of attendance and unemployment rate relate?

We will work with the College Scorecard dataset from the U.S. Department of Education. The exploration centers on the variable named in our question, our target variable: cost of attendance (costt4_a).

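Load ggplot2, leaflet, and the dataset. The code below also relies on dplyr (for %>%, group_by, summarise, arrange, mutate) and forcats (for fct_reorder).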
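
The loading step itself is not shown in this report. A minimal sketch of what it might look like follows. The file name `college_scorecard.csv` is a placeholder assumption, not the original path, and the column names are assumed to already be lowercase as they are used in the rest of the analysis.

```r
library(ggplot2)  # plotting
library(dplyr)    # %>%, group_by(), summarise(), arrange(), mutate()
library(forcats)  # fct_reorder()
library(leaflet)  # listed in the text; not used in the code shown here

# Hypothetical file name: replace with the actual location of the data
scorecard_path <- "college_scorecard.csv"

# Read the data only if the file is present, so the sketch does not fail
if (file.exists(scorecard_path)) {
  college_sc <- read.csv(scorecard_path, stringsAsFactors = FALSE)
}
```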

Perform a correlation test between costt4_a and sat_avg

cor.test(college_sc$costt4_a, college_sc$sat_avg)
## 
##  Pearson's product-moment correlation
## 
## data:  college_sc$costt4_a and college_sc$sat_avg
## t = 23.05, df = 1306, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4980670 0.5751889
## sample estimates:
##       cor 
## 0.5377519
college_sc %>%
        ggplot(aes(x = sat_avg, y = costt4_a)) + geom_point() + geom_smooth(method = "lm") +
        annotate("text", x = 750, y = 75000, label= "r = 0.54, p < 0.001")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 5804 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 5804 rows containing missing values or values outside the scale range
## (`geom_point()`).
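
The annotation text in the plot above is typed in by hand, which makes it easy for the label and the test result to drift apart. One way to avoid that is to build the label from `cor.test()` directly. The helper `cor_label` below is a hypothetical name introduced for illustration and is not part of the original analysis.

```r
# Build a label such as "r = 0.54, p < 0.001" from two numeric vectors.
# cor.test() runs a Pearson test by default and drops incomplete pairs.
cor_label <- function(x, y, digits = 2) {
  test  <- cor.test(x, y)
  r_txt <- formatC(unname(test$estimate), format = "f", digits = digits)
  p_txt <- if (test$p.value < 0.001) {
    "p < 0.001"
  } else {
    paste0("p = ", formatC(test$p.value, format = "f", digits = 3))
  }
  paste0("r = ", r_txt, ", ", p_txt)
}

# Small toy vectors, only to show the shape of the result
toy_x <- c(1, 2, 3, 4, 5, 6, 7, 8)
toy_y <- c(2, 1, 4, 3, 6, 5, 8, 7)
toy_label <- cor_label(toy_x, toy_y)
toy_label
```

With the real data, the label could then be passed to the plot as `annotate("text", x = 750, y = 75000, label = cor_label(college_sc$sat_avg, college_sc$costt4_a))`.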

## Explore the cost of attendance variable
summary(college_sc$costt4_a)
college_sc %>%  ggplot(aes(x = costt4_a)) + geom_density()

Which states have the highest average unemployment rate?

college_sc %>%
  group_by(state) %>%
  summarise(avg_unemp_rate = mean(unemp_rate, na.rm = TRUE)) %>%
  arrange(desc(avg_unemp_rate))
## # A tibble: 59 × 2
##    state avg_unemp_rate
##    <chr>          <dbl>
##  1 PR              7.88
##  2 AK              6.61
##  3 MT              5.93
##  4 DC              4.70
##  5 MS              4.54
##  6 NM              4.51
##  7 CA              4.49
##  8 ND              4.35
##  9 LA              4.35
## 10 VI              4.29
## # ℹ 49 more rows
college_sc %>%
  ggplot(aes(x = unemp_rate, y = costt4_a, color = state)) + geom_point()
## Warning: Removed 3909 rows containing missing values or values outside the scale range
## (`geom_point()`).

Perform a linear regression of cost of attendance on unemployment rate

lnr_reg <- lm(costt4_a ~ unemp_rate, college_sc)
summary(lnr_reg)
## 
## Call:
## lm(formula = costt4_a ~ unemp_rate, data = college_sc)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24696 -11640  -3277   8793  44978 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41093.7      833.5   49.30   <2e-16 ***
## unemp_rate   -3841.8      209.1  -18.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14750 on 3201 degrees of freedom
##   (3909 observations deleted due to missingness)
## Multiple R-squared:  0.09542,    Adjusted R-squared:  0.09514 
## F-statistic: 337.7 on 1 and 3201 DF,  p-value: < 2.2e-16
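
To read the coefficients above: each one-unit increase in unemp_rate is associated with a cost of attendance about $3,842 lower on average, starting from an intercept of about $41,094. The sketch below works through a few predictions using the printed estimates. The unemployment rates chosen are illustrative values, not taken from the analysis.

```r
# Coefficients copied from the summary(lnr_reg) output above
intercept <- 41093.7
slope     <- -3841.8

# Predicted average cost at a few illustrative unemployment rates
rates <- c(3, 4, 5)
predicted_cost <- intercept + slope * rates
predicted_cost

# With the fitted model available, the equivalent call would be:
# predict(lnr_reg, newdata = data.frame(unemp_rate = rates))
```

The multiple R-squared of about 0.095 means that unemployment rate alone accounts for roughly 9.5% of the variation in cost, so these predictions carry wide uncertainty (residual standard error of 14750).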
college_sc %>%
        ggplot(aes(x = unemp_rate, y = costt4_a)) + geom_point() +
        geom_smooth(method = "lm") +
        annotate("text", x = 6, y = 75000, label= "r = -0.31, p < 0.001")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 3909 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 3909 rows containing missing values or values outside the scale range
## (`geom_point()`).
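
For a regression with a single predictor, the correlation coefficient can be recovered from the regression output: it is the square root of the multiple R-squared, carrying the sign of the slope. A small sketch using the values printed by summary(lnr_reg):

```r
# Values copied from the summary(lnr_reg) output
r_squared <- 0.09542   # Multiple R-squared
slope     <- -3841.8   # Estimate for unemp_rate

# In simple linear regression, r = sign(slope) * sqrt(R^2)
r <- sign(slope) * sqrt(r_squared)
round(r, 2)

# With the fitted model available, the same quantity would be:
# sign(coef(lnr_reg)[["unemp_rate"]]) * sqrt(summary(lnr_reg)$r.squared)
```

This gives a moderate negative correlation of about -0.31 between unemployment rate and cost of attendance, which is distinct from the 0.54 found earlier for SAT average.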

Use a plot to compare average cost of attendance across states

college_sc %>% 
        group_by(state) %>% 
        summarise(avg_cost = mean(costt4_a, na.rm = TRUE)) %>%
        arrange(desc(avg_cost)) %>% 
        head(10) %>%
        # Flag Pennsylvania so its bar can be filled differently
        mutate(state_pa = state == "PA") %>%
        # Fill bars by the logical state_pa flag (a discrete scale, not a gradient)
        ggplot(aes(x = fct_reorder(state, avg_cost), y = avg_cost, fill = state_pa)) +
                geom_col() +
                labs(
                        x = "State",
                        y = "Average cost of attendance",
                        title = "Which 10 states have the highest average cost of attendance?",
                        # Change subtitle to highlight our interest in Pennsylvania
                        subtitle = "Pennsylvania is among the top ten",
                        caption = "Source: US Dept. of Education"
                ) +
                scale_y_continuous(labels = scales::dollar) +
                # Suppress the fill color legend
                theme(legend.position = "none")
