library(tidyverse)
library(scales)
# Import data
recent_grads <- read.csv("file:///C:/Users/User/Documents/recent_grads.csv") %>% as_tibble()
# Create a scatterplot
recent_grads %>%
  ggplot(aes(ShareWomen, Unemployment_rate)) +
  geom_point()
# Compute correlation between unemployment rate and share of women
cor(recent_grads$Unemployment_rate, recent_grads$ShareWomen, use = "pairwise.complete.obs")
## [1] 0.07320458
# Fit linear model 1: unemployment rate as a function of share of women
mod_1 <- lm(Unemployment_rate ~ ShareWomen, data = recent_grads)
# View summary of model 1
summary(mod_1)
##
## Call:
## lm(formula = Unemployment_rate ~ ShareWomen, data = recent_grads)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.069268 -0.017685 -0.001467  0.018476  0.112827
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.063007   0.005730  10.996   <2e-16 ***
## ShareWomen  0.009606   0.010037   0.957     0.34
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03035 on 170 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.005359, Adjusted R-squared: -0.0004919
## F-statistic: 0.9159 on 1 and 170 DF, p-value: 0.3399
# Compute correlation between median income and share of women
cor(recent_grads$Median, recent_grads$ShareWomen, use = "pairwise.complete.obs")
## [1] -0.6186898
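The residual standard error is the standard deviation of the residuals, that is, the typical distance between an observed value and the value the model predicts for it; it is not the standard deviation from the mean. For the model of median income on share of women, it is $9,031 on 170 degrees of freedom (170 is the number of observations used minus the two estimated coefficients, not the number of datapoints), so the model's predictions of median income are typically off by about $9,031.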
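To make the definition concrete, the sketch below recomputes a residual standard error by hand and compares it with the value that `summary()` reports. The data frame `toy` holds made-up values that only borrow the column names `ShareWomen` and `Median`; it is an assumption for illustration, not the `recent_grads` data, so the numbers it produces say nothing about the real model.

```r
# Toy stand-in for recent_grads: made-up values, NOT the real data
toy <- data.frame(
  ShareWomen = c(0.10, 0.25, 0.40, 0.55, 0.70, 0.85),
  Median     = c(62000, 50000, 47000, 38000, 36000, 30000)
)

# Simple linear regression of median income on share of women
mod_toy <- lm(Median ~ ShareWomen, data = toy)

# Residual standard error by hand: sqrt(residual sum of squares / residual df)
rss        <- sum(residuals(mod_toy)^2)
df_res     <- df.residual(mod_toy)  # n - 2: intercept and slope are estimated
rse_manual <- sqrt(rss / df_res)

# The value reported on the "Residual standard error" line of summary()
rse_summary <- summary(mod_toy)$sigma

all.equal(rse_manual, rse_summary)
```

Swapping `toy` for `recent_grads` should reproduce the figure printed by `summary()` for the corresponding model, with `df_res` giving the degrees of freedom quoted next to it.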
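The correlation of -0.62 between median income and share of women is moderate and negative: majors with a higher share of women tend to have lower median incomes. In a simple linear regression, R-squared equals the squared correlation, here about 0.38, so share of women accounts for roughly 38% of the variance in median income, and the adjusted R-squared is only slightly lower. The model therefore captures the overall downward trend, but with most of the variance left unexplained it should not be trusted for a precise prediction for any individual major.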
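The link between the correlation coefficient and R-squared can be checked directly. The sketch below again uses a made-up data frame `toy` (an assumption for illustration, not the `recent_grads` data) to show that, with a single predictor, R-squared is the squared correlation, and that adjusted R-squared follows from it by a degrees-of-freedom correction.

```r
# Toy stand-in for recent_grads: made-up values, NOT the real data
toy <- data.frame(
  ShareWomen = c(0.10, 0.25, 0.40, 0.55, 0.70, 0.85),
  Median     = c(62000, 50000, 47000, 38000, 36000, 30000)
)
mod_toy <- lm(Median ~ ShareWomen, data = toy)

# With one predictor, R-squared equals the squared correlation
r  <- cor(toy$Median, toy$ShareWomen)
r2 <- summary(mod_toy)$r.squared
all.equal(r^2, r2)

# Adjusted R-squared penalises R-squared for the number of predictors p
n          <- nrow(toy)
p          <- 1
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_summary <- summary(mod_toy)$adj.r.squared
all.equal(adj_manual, adj_summary)

# Applied to the correlation reported above for the real data
(-0.6186898)^2  # roughly 0.38
```

The penalty factor `(n - 1) / (n - p - 1)` approaches 1 as the number of observations grows, which is why the adjusted and unadjusted values are close for a dataset with well over a hundred rows and one predictor.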