Correlation and Regression

Q1. Create the same scatterplot to examine the relationship between ShareWomen and median earnings.
Q2. Compute the correlation coefficient between the two variables and interpret it.
Q3. Build a regression model to predict median earnings using the share of women.
Q4. Is the coefficient of ShareWomen statistically significant at 5%? Interpret the coefficient.
Q5. How much median earnings does the model predict for a major that has 60% as the share of women?
Q6. Interpret the reported residual standard error.
Q7. Interpret the reported adjusted R squared.

library(tidyverse)
library(scales)

# Import data
recent_grads <- read.csv("file:///C:/Users/User/Documents/recent_grads.csv") %>% as_tibble()

# Create a scatterplot
recent_grads %>%
  ggplot(aes(ShareWomen, Unemployment_rate)) +
  geom_point()


# Compute correlation coefficient
cor(recent_grads$Unemployment_rate, recent_grads$ShareWomen, use = "pairwise.complete.obs")
## [1] 0.07320458

# Create a linear model 1
mod_1 <- lm(Unemployment_rate ~ ShareWomen, data = recent_grads)

# View summary of model 1
summary(mod_1)
## 
## Call:
## lm(formula = Unemployment_rate ~ ShareWomen, data = recent_grads)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.069268 -0.017685 -0.001467  0.018476  0.112827 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.063007   0.005730  10.996   <2e-16 ***
## ShareWomen  0.009606   0.010037   0.957     0.34    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03035 on 170 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.005359,   Adjusted R-squared:  -0.0004919 
## F-statistic: 0.9159 on 1 and 170 DF,  p-value: 0.3399

Q1. Create the same scatterplot to examine the relationship between ShareWomen and median earnings.

# Create a scatterplot
recent_grads %>%
  ggplot(aes(ShareWomen, Median)) +
  geom_point()

Q2. Compute the correlation coefficient between the two variables and interpret it.

# Compute correlation coefficient
cor(recent_grads$Median, recent_grads$ShareWomen, use = "pairwise.complete.obs")
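## [1] -0.6186898

The correlation coefficient is about -0.62, which indicates a moderately strong negative linear relationship: majors with a larger share of women tend to have lower median earnings.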

Q3. Build a regression model to predict median earnings using the share of women.

# Create a linear model 1
mod_1 <- lm(Median ~ ShareWomen, data = recent_grads)

# View summary of model 1
summary(mod_1)
## 
## Call:
## lm(formula = Median ~ ShareWomen, data = recent_grads)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17261  -5474  -1007   3502  57604 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    56093       1705   32.90   <2e-16 ***
## ShareWomen    -30670       2987  -10.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9031 on 170 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3828, Adjusted R-squared:  0.3791 
## F-statistic: 105.4 on 1 and 170 DF,  p-value: < 2.2e-16
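
As a check on the coefficient table above, the t value and p-value for ShareWomen can be recomputed by hand. This is a minimal sketch that assumes only the rounded estimate, standard error, and residual degrees of freedom printed in the summary; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the printed summary(mod_1) output (rounded)
estimate  <- -30670   # ShareWomen coefficient
std_error <- 2987     # its standard error
df_resid  <- 170      # residual degrees of freedom

# t statistic: estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 2)   # -10.27, matching the table
p_value < 0.05      # TRUE means significant at the 5% level
```

Because the inputs are rounded, the recomputed values agree with the table only to the digits shown.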

Q4. Is the coefficient of ShareWomen statistically significant at 5%? Interpret the coefficient.

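Yes. The p-value for ShareWomen is below 2e-16, which is far smaller than 0.05, so the coefficient is statistically significant at the 5% level. The 5% refers to the significance threshold the p-value is compared against, not to the size of the coefficient. The estimate of -30670 means that a one-unit increase in ShareWomen is associated with $30,670 lower median earnings. Because ShareWomen is a proportion between 0 and 1, a one-unit increase is the difference between a major with no women and a major made up entirely of women; equivalently, each additional percentage point in the share of women is associated with about $307 lower median earnings.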
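
Q5. How much median earnings does the model predict for a major that has 60% as the share of women?

The prediction follows from plugging ShareWomen = 0.60 into the fitted line. A minimal sketch, assuming the rounded coefficients printed in the summary and that ShareWomen is stored as a proportion between 0 and 1 (the variable names below are illustrative):

```r
# Rounded coefficients copied from the printed summary(mod_1) output
intercept <- 56093
slope     <- -30670

# A 60% share of women, expressed as a proportion
share_women <- 0.60

# Predicted median earnings from the fitted line
predicted_median <- intercept + slope * share_women
predicted_median   # 37691

# With the fitted model in the session, predict() gives the unrounded value:
# predict(mod_1, newdata = data.frame(ShareWomen = 0.60))
```

The model therefore predicts median earnings of roughly $37,700 for a major in which 60% of graduates are women.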

Q6. Interpret the reported residual standard error.

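The residual standard error is the typical size of a residual, that is, the typical distance between a major's actual median earnings and the value the model predicts for it. Here it is 9031, so the model's predictions miss the actual median earnings by about $9,031 on average. The 170 is the residual degrees of freedom (172 majors used in the fit minus the 2 estimated coefficients), not the number of observations.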
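
To make the definition concrete, the residual standard error can be recomputed as the square root of the residual sum of squares divided by the residual degrees of freedom. The sketch below uses a small made-up data frame (toy, x, and y are hypothetical and are not the recent_grads data) only so that it runs on its own.

```r
# Made-up illustrative data, NOT from recent_grads
toy <- data.frame(
  x = c(0.1, 0.3, 0.4, 0.6, 0.8, 0.9),
  y = c(52, 49, 41, 40, 33, 30)
)

fit <- lm(y ~ x, data = toy)

# Residual standard error by hand: sqrt(RSS / residual df)
rss <- sum(residuals(fit)^2)
rse_by_hand <- sqrt(rss / df.residual(fit))

# sigma() returns the same quantity that summary() reports
all.equal(rse_by_hand, sigma(fit))
```

Applied to the model in this assignment, sigma(mod_1) should reproduce the 9031 shown in the summary.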

Q7. Interpret the reported adjusted R squared.

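The adjusted R-squared of 0.3791 indicates that the share of women explains about 38% of the variation in median earnings across majors, after adjusting for the number of predictors in the model. This is a moderate fit: the model captures a real overall trend, but about 62% of the variation is left unexplained, so predictions for an individual major can be far off, as the residual standard error of about $9,031 also shows.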
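
The R-squared and adjusted R-squared can be reproduced from numbers already shown above. A minimal sketch, assuming the correlation from Q2 and the 172 complete observations implied by 170 residual degrees of freedom plus two estimated coefficients (the variable names are illustrative):

```r
r <- -0.6186898   # correlation between Median and ShareWomen (Q2)
n <- 170 + 2      # residual df plus intercept and slope
p <- 1            # number of predictors

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2

# Adjusted R-squared penalizes for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(r_squared, 4)       # 0.3828
round(adj_r_squared, 4)   # 0.3791
```

With a single predictor and 172 observations the adjustment is small, which is why the two values are nearly identical.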