# Load readr, which provides read_csv()
library(readr)
# Load both datasets and suppress the column specification message
econ <- read_csv("econ.csv", show_col_types = FALSE)
gpasalary <- read_csv("gpasalary.csv", show_col_types = FALSE)

Question 6

a)

The data set has 75 observations. Of the 75 respondents, 74 answered the gender question and 60 answered the high school GPA question.
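Counts like these can be obtained by counting rows and non-missing values per column. The sketch below uses a small made-up data frame rather than econ.csv, and `hsgpa` is a hypothetical column name for high school GPA (the real column name is not shown in this report).

```r
# Toy data standing in for econ.csv; values are made up for illustration.
# `hsgpa` is a hypothetical column name for high school GPA.
toy <- data.frame(
  female = c(0, 1, NA, 1, 0),
  hsgpa  = c(3.2, NA, 3.8, NA, 3.5)
)

n_obs <- nrow(toy)                  # total number of observations
n_answered <- colSums(!is.na(toy))  # non-missing responses per column

n_obs
n_answered
```

Applied to the real data, the same two calls would give the total number of observations and the number of respondents who answered each question.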

b)

I believe the missing high school GPA values are not random. There were probably several reasons why individuals chose not to share their score, and students with lower GPAs may have been less willing to report them. Because of this, I believe the observed sample could have higher GPAs than the full population, leading to a potential upward bias in the average high school GPA for the sample that provided this data.

c)

The mean GPA is 3.571014, the median is 3.6, the standard deviation is 0.307599, the minimum is 2.3, and the maximum is 4.

mean(econ$gpa, na.rm = TRUE)
## [1] 3.571014
median(econ$gpa, na.rm = TRUE)
## [1] 3.6
sd(econ$gpa, na.rm = TRUE)
## [1] 0.307599
min(econ$gpa, na.rm = TRUE)
## [1] 2.3
max(econ$gpa, na.rm = TRUE)
## [1] 4

d)

The mean GPA of men is 3.540278 and the mean GPA of women is 3.606081.
mean(econ$gpa[econ$female == 0], na.rm = TRUE)
## [1] 3.540278
mean(econ$gpa[econ$female == 1], na.rm = TRUE)
## [1] 3.606081

e)

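H0: μ1 − μ2 = 0 and H1: μ1 − μ2 > 0, where μ1 is the mean GPA of men (group 0) and μ2 is the mean GPA of women (group 1), matching the group order R reports in the output.
We fail to reject the null hypothesis. With a p-value of 0.8168, well above the 0.05 level of significance, the test provides no evidence that the mean GPA of men is higher than that of women. As coded, this one-sided test does not check whether women's GPAs are higher than men's; that claim would require alternative = "less". The observed difference in the sample (3.540278 versus 3.606081) could be due to random chance, although failing to reject H0 does not prove that the two means are equal.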
t_test_result <- t.test(econ$gpa ~ econ$female, alternative = "greater", var.equal = TRUE)

print(t_test_result)
## 
##  Two Sample t-test
## 
## data:  econ$gpa by econ$female
## t = -0.90886, df = 71, p-value = 0.8168
## alternative hypothesis: true difference in means between group 0 and group 1 is greater than 0
## 95 percent confidence interval:
##  -0.186468       Inf
## sample estimates:
## mean in group 0 mean in group 1 
##        3.540278        3.606081
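Because the direction of a one-sided test with the formula interface depends on the order of the groups, it can help to check it on a small example. The sketch below uses made-up GPAs (not the econ data) and assumes the same 0 = men, 1 = women coding as the `female` column.

```r
# Made-up GPAs for illustration only; 0 = men, 1 = women.
toy <- data.frame(
  gpa    = c(3.1, 3.4, 3.6, 3.3, 3.7, 3.5, 3.8, 3.6),
  female = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# "greater" tests H1: mean(group 0) - mean(group 1) > 0 (men higher)
p_greater <- t.test(gpa ~ female, data = toy,
                    alternative = "greater", var.equal = TRUE)$p.value

# "less" tests H1: mean(group 0) - mean(group 1) < 0 (women higher)
p_less <- t.test(gpa ~ female, data = toy,
                 alternative = "less", var.equal = TRUE)$p.value

# The two one-sided p-values are complements of each other
p_greater + p_less
```

By the same complement rule, the "less" version of the test on the econ data would have a p-value of about 1 − 0.8168 = 0.1832, which is still above 0.05.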

f)

The correlation is positive, meaning that as the mother's years of education increase, the student's GPA tends to increase as well. However, at about 0.199 the correlation is weak. This is around what I expected: a more educated mother may be better able to instill the discipline a student needs to spend time on their grades. Still, I thought it would be only a small positive correlation because ultimately it is the student doing the work and not the parent. There are also other factors, such as mothers not being with their children while they are away at college, and students being able to hide their grades from their mothers.
cor(econ$yrsedmom, econ$gpa, use = "complete.obs")
## [1] 0.1990016
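One way to gauge how weak this correlation is, sketched here using only the value printed above, is to square it. The squared correlation is the share of the variation in GPA that is linearly associated with the mother's years of education.

```r
r <- 0.1990016    # correlation reported by cor() above
r_squared <- r^2  # share of GPA variation linearly associated with yrsedmom

round(r_squared, 4)
```

The result is roughly 0.04, so only about 4% of the variation in GPA lines up with the mother's education in this sample.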

Question 7

a)

The scatterplot and fitted regression line show a strong, positive linear relationship between GPA and salary.
# Scatterplot of GPA vs Salary
plot(gpasalary$GPA, gpasalary$Salary,
     xlab = "GPA",
     ylab = "Salary",
     main = "Scatterplot of GPA vs. Salary",
     pch = 16, # solid circles for points
     col = "blue")

# Fit a linear model
model <- lm(Salary ~ GPA, data = gpasalary)

# Add the regression line to the plot
abline(model, col = "red", lwd = 2)

# Display the summary of the model
summary(model)
## 
## Call:
## lm(formula = Salary ~ GPA, data = gpasalary)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15105  -5311  -2954   1203  19901 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -110210      21083  -5.228 0.000795 ***
## GPA            72990       7147  10.212 7.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11440 on 8 degrees of freedom
## Multiple R-squared:  0.9288, Adjusted R-squared:  0.9199 
## F-statistic: 104.3 on 1 and 8 DF,  p-value: 7.254e-06

b)

The regression shows that GPA is a strong predictor of salary 10 years after graduation. For each additional GPA point, predicted salary increases by approximately $72,990.51. The estimated equation is: Salary = −110210.4 + 72990.51 * GPA
model <- lm(Salary ~ GPA, data = gpasalary)

summary(model)
## 
## Call:
## lm(formula = Salary ~ GPA, data = gpasalary)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15105  -5311  -2954   1203  19901 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -110210      21083  -5.228 0.000795 ***
## GPA            72990       7147  10.212 7.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11440 on 8 degrees of freedom
## Multiple R-squared:  0.9288, Adjusted R-squared:  0.9199 
## F-statistic: 104.3 on 1 and 8 DF,  p-value: 7.254e-06
intercept <- coef(model)[1]
slope <- coef(model)[2]

cat("The estimated regression equation is: Salary =", intercept, "+", slope, "* GPA\n")
## The estimated regression equation is: Salary = -110210.4 + 72990.51 * GPA
cat("Interpretation of coefficients:\n")
## Interpretation of coefficients:
cat("Intercept (", intercept, "): This is the estimated salary when GPA is 0. Though GPA of 0 isn't realistic, the intercept helps to position the regression line.\n")
## Intercept ( -110210.4 ): This is the estimated salary when GPA is 0. Though GPA of 0 isn't realistic, the intercept helps to position the regression line.
cat("Slope (", slope, "): For each additional point increase in GPA, the salary is predicted to increase by", slope, "dollars.\n")
## Slope ( 72990.51 ): For each additional point increase in GPA, the salary is predicted to increase by 72990.51 dollars.
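As a quick illustration of how the estimated equation is used, the sketch below plugs two GPA values into it. The coefficients are the ones printed above; the GPA values 3.0 and 3.5 are hypothetical inputs chosen for illustration, and predictions are only meaningful for GPAs within the range of the data.

```r
b0 <- -110210.4  # estimated intercept from the output above
b1 <- 72990.51   # estimated slope from the output above

gpa_values <- c(3.0, 3.5)  # hypothetical GPAs, chosen for illustration
predicted <- b0 + b1 * gpa_values

predicted
diff(predicted)  # a 0.5-point GPA difference changes the prediction by half the slope
```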

c)

There is a significant relationship since the p-value is 7.253817e-06. The null hypothesis states that the slope on GPA is zero, meaning there is no linear relationship between GPA and salary. Since the p-value is below 0.05, we reject the null hypothesis and conclude that GPA and salary are linearly related.
model <- lm(Salary ~ GPA, data = gpasalary)

summary_model <- summary(model)

# Extract the p-value for the GPA coefficient by row and column name
p_value <- summary_model$coefficients["GPA", "Pr(>|t|)"]

cat("P-value for the GPA coefficient:", p_value, "\n")
## P-value for the GPA coefficient: 7.253817e-06
if (p_value < 0.05) {
    cat("At the 0.05 level of significance, there is a significant relationship between GPA and Salary.\n")
} else {
    cat("At the 0.05 level of significance, there is NOT a significant relationship between GPA and Salary.\n")
}
## At the 0.05 level of significance, there is a significant relationship between GPA and Salary.
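The p-value can be checked by hand from the rounded estimate and standard error in the coefficient table. This is a minimal sketch that uses only numbers printed in the summary output, so the result matches the reported values only up to rounding.

```r
estimate  <- 72990  # GPA coefficient (rounded, from the summary output)
std_error <- 7147   # its standard error (rounded, from the summary output)
df_resid  <- 8      # residual degrees of freedom

# t statistic for H0: slope = 0
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_two_sided <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

t_stat
p_two_sided
```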

d)

Multiple R-squared: 0.9288, Adjusted R-squared: 0.9199 (found from the previous R script). R-squared represents the proportion of the variance in the dependent variable that is explained by the independent variable. With a multiple R-squared of 0.9288, GPA explains 92.88% of the variation in salary. Since the R-squared is so close to one, GPA explains nearly all of the variation in salary in this sample.
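The adjusted R-squared can be reproduced from the multiple R-squared, which shows how the two are related. The sketch below assumes n = 10 observations, inferred from the 8 residual degrees of freedom plus the 2 estimated coefficients.

```r
r_squared <- 0.9288  # multiple R-squared from the summary output
n <- 10              # assumed: 8 residual df + 2 estimated coefficients
p <- 1               # number of predictors (GPA)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(adj_r_squared, 4)
```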

e)

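The estimated standard error for the intercept is 21083 and for GPA is 7147 (found in the Std. Error column of the previous R output). The values -110210 and 72990 are the coefficient estimates, not the standard errors.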
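One use of a coefficient's standard error is to build a confidence interval around the estimate. The sketch below does this for the GPA slope with the rounded values from the coefficient table in the summary output; the 95% level is an assumption chosen to match the 0.05 significance level used earlier.

```r
estimate  <- 72990  # GPA coefficient (rounded, from the summary output)
std_error <- 7147   # standard error of the GPA coefficient
df_resid  <- 8      # residual degrees of freedom

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate plus or minus t_crit standard errors
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

If the model object is available, `confint(model)` should give the same interval without rounding.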