#1a. What is the estimated coefficient for Private or Public School (X3)? Is this estimate significant at the α=0.05 level? Clearly write out the null and alternative hypotheses, observed t-statistic, and p-value.
alumni <- read.csv("alumni.csv")
tail(alumni)
## school percent_of_classes_under_20
## 43 U. of Washington 37
## 44 U. of Wisconsin-Madison 37
## 45 Vanderbuilt University 68
## 46 Wake Forest University 59
## 47 Washington University-St. Louis 73
## 48 Yale University 77
## student_faculty_ratio alumni_giving_rate private
## 43 12 12 0
## 44 13 13 0
## 45 9 31 1
## 46 11 38 1
## 47 7 33 1
## 48 7 50 1
attach(alumni) # not strictly needed, since data = alumni is passed to lm() below
model <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio + private, data = alumni)
summary(model)
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20 +
## student_faculty_ratio + private, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.757 -6.320 -2.273 5.152 25.669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.78364 13.67220 2.690 0.01005 *
## percent_of_classes_under_20 0.07725 0.17873 0.432 0.66768
## student_faculty_ratio -1.39835 0.51075 -2.738 0.00889 **
## private 6.28534 5.35633 1.173 0.24693
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.06 on 44 degrees of freedom
## Multiple R-squared: 0.5747, Adjusted R-squared: 0.5457
## F-statistic: 19.81 on 3 and 44 DF, p-value: 2.818e-08
The estimated coefficient for Private or Public School (X3) is 6.28534, with a standard error of 5.35633.
Null hypothesis: H0: β3 = 0, i.e., the alumni giving rate does not depend on private/public status once X1 and X2 are in the model.
Alternative hypothesis: Ha: β3 ≠ 0, i.e., the alumni giving rate does depend on private/public status.
The observed t-statistic for X3 is 1.173 with a p-value of 0.24693. Since 0.24693 > 0.05, we fail to reject the null hypothesis: the estimate for X3 is not significant at the α=0.05 level.
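As a quick check (not part of the original output), the reported p-value can be reproduced from the t-statistic and the residual degrees of freedom:
2 * pt(1.173, df = 44, lower.tail = FALSE) # two-sided p-value, ~0.247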
#b. Is the overall regression model significant at the α=0.05 level? Clearly write out the null and alternative hypotheses, observed F-statistic, and p-value.
Null hypothesis: H0: β1 = β2 = β3 = 0, i.e., Y does not depend on any of the predictors X1, X2, and X3.
Alternative hypothesis: Ha: at least one of the coefficients β1, β2, β3 is non-zero.
The observed F-statistic is 19.81 on 3 and 44 degrees of freedom, with a p-value of 2.818e-08. Since the p-value is far below 0.05, we reject the null hypothesis: the overall model is significant at the α=0.05 level.
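The overall p-value can likewise be recovered from the F-statistic (a small check, not in the original output):
pf(19.81, df1 = 3, df2 = 44, lower.tail = FALSE) # ~2.8e-08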
#c. Repeat parts a and b, now adding the interaction between Student/Faculty Ratio (X2) and Private or Public School (X3). What is the estimated slope for X2 now? Interpret the results with the interaction term.
model_interaction <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio * private, data = alumni)
summary(model_interaction)
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20 +
## student_faculty_ratio * private, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.021 -6.275 -2.207 6.439 23.455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.572987 15.388370 1.662 0.1038
## percent_of_classes_under_20 0.005511 0.182487 0.030 0.9760
## student_faculty_ratio -0.584364 0.737736 -0.792 0.4326
## private 27.907119 15.265512 1.828 0.0745 .
## student_faculty_ratio:private -1.477699 0.978896 -1.510 0.1385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.931 on 43 degrees of freedom
## Multiple R-squared: 0.5961, Adjusted R-squared: 0.5585
## F-statistic: 15.86 on 4 and 43 DF, p-value: 4.745e-08
Slope
The estimated slope for X2 is -0.584364. The negative sign indicates a negative association between the student/faculty ratio and the alumni giving rate for public schools (X3 = 0). This slope is not statistically significant (p-value 0.4326).
The estimated coefficient for the interaction X2:X3 is -1.477699. The negative sign indicates that the adverse effect of the student/faculty ratio on the alumni giving rate is stronger for private schools: the implied slope for private schools (X3 = 1) is -0.584364 + (-1.477699) = -2.062063. The interaction effect is also not statistically significant (p-value 0.1385).
In summary, although a negative trend exists between the student/faculty ratio and the alumni giving rate, the interaction term makes that slope depend on whether the institution is private or public. Neither the X2 slope nor the interaction is significant at the 0.05 level, so these interpretations should be treated with caution.
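The private-school slope quoted above can be computed directly from the fitted coefficients (a small check, not in the original output):
coef(model_interaction)["student_faculty_ratio"] + coef(model_interaction)["student_faculty_ratio:private"] # ~ -2.062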
#d. Compare the models with and without the interaction term. Which model would you choose based on the data? Please explain your choice.
anova(model, model_interaction)
## Analysis of Variance Table
##
## Model 1: alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio +
## private
## Model 2: alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio *
## private
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 44 3611.8
## 2 43 3430.1 1 181.77 2.2788 0.1385
The partial F-test for adding the interaction gives F = 2.2788 with a p-value of 0.1385 (matching the t-test on the interaction term, since F = t^2 when a single term is added). Because 0.1385 > 0.05, the interaction does not significantly improve the fit, so I would choose the simpler model without the interaction term.
#e. What is the predicted alumni giving rate for an observation with (X1=40, X2=5, X3=1)?
data_frame <- data.frame(
percent_of_classes_under_20 = 40,
student_faculty_ratio = 5,
private = 1
)
predicted_value1 <- predict(model, newdata = data_frame)
predicted_value1
## 1
## 39.16739
predicted_value2 <- predict(model_interaction, newdata = data_frame)
predicted_value2
## 1
## 43.39025
The predicted alumni giving rate from the model without the interaction is 39.16739; from the model with the interaction, it is 43.39025.
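Both predictions can be hand-checked by plugging (X1=40, X2=5, X3=1) into the fitted equations (a sketch using the rounded coefficients, so the results match only up to rounding):
36.78364 + 0.07725*40 - 1.39835*5 + 6.28534*1 # ~39.167 (no interaction)
25.572987 + 0.005511*40 - 0.584364*5 + 27.907119*1 - 1.477699*5*1 # ~43.390 (with interaction)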
#2a. Generate data with X1 ∼ N(μ=3, σ=0.5), sample size n=200, and error term ϵ ∼ N(μ=0, σ=0.5).
set.seed(123)
n <- 200
X <- rnorm(n, mean=3, sd=0.5)       # predictor
epsilon <- rnorm(n, mean=0, sd=0.5) # error term
Y <- 10 + 5*X - 2*X^2 + epsilon     # true relationship is quadratic in X
data <- data.frame(X, Y)
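A quick scatterplot (not part of the original output) makes the curvature visible before any model is fit:
plot(X, Y) # clearly curved, so a straight-line fit should underfit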
#b. Fit a simple linear regression using just X. What is the estimated regression equation? Please conduct model estimation, inference, and residual diagnostics. What do you conclude?
fit_model <- lm(X ~ Y) # note: as written this regresses X on Y; part b asks for Y as the response, i.e., lm(Y ~ X)
summary(fit_model)
##
## Call:
## lm(formula = X ~ Y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41546 -0.05480 0.01936 0.05846 0.19961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.854949 0.015096 255.37 <2e-16 ***
## Y -0.130020 0.002014 -64.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1007 on 198 degrees of freedom
## Multiple R-squared: 0.9546, Adjusted R-squared: 0.9544
## F-statistic: 4167 on 1 and 198 DF, p-value: < 2.2e-16
Because the model was fit as lm(X ~ Y), the output above describes X as a function of Y: the estimated regression equation is X = 3.8549 - 0.1300*Y. The estimated intercept is 3.855 with a standard error of 0.015, and the estimated slope is -0.130 with a standard error of 0.002; both are highly significant (p < 2e-16).
par(mfrow=c(2,2))
plot(fit_model, which=1:4)
Since there is a clear pattern in the Residuals vs Fitted plot, the straight-line model does not fit the data properly. Hence we need to add a higher-order term.
#c. Update the model from part b by adding a quadratic term. Conduct model estimation, inference, and residual diagnostics. What do you conclude? Does this model seem to fit the data better? Please explain.
new_model <- lm(Y ~ X + I(X^2))
summary(new_model)
##
## Call:
## lm(formula = Y ~ X + I(X^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.22766 -0.33289 -0.03275 0.33394 1.29078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.9551 1.0505 11.381 < 2e-16 ***
## X 3.7415 0.6870 5.446 1.52e-07 ***
## I(X^2) -1.8003 0.1109 -16.229 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4962 on 197 degrees of freedom
## Multiple R-squared: 0.9806, Adjusted R-squared: 0.9804
## F-statistic: 4976 on 2 and 197 DF, p-value: < 2.2e-16
The estimated equation is Y = 11.9551 + 3.7415*X - 1.8003*X^2. All three coefficients are highly significant, and each estimate lies within about two standard errors of the true value (10, 5, and -2) used to generate the data.
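As a quick check (not in the original output), the 95% confidence intervals should cover the true coefficients:
confint(new_model) # intervals for (Intercept), X, I(X^2); compare with 10, 5, -2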
par(mfrow=c(2,2))
plot(new_model, which=1:4)
The Residuals vs Fitted plot shows a random scatter of points, indicating constant error variance. The Q-Q plot follows the reference line closely, supporting the normality assumption. The Scale-Location plot displays no trend in the spread of the residuals, and the Cook's distance plot flags only a few mildly influential points. Overall, the model with the quadratic term fits the data much better than the straight-line model.
#d. Generate Y data using E(√Y | X) = 10 + 5X − 2X^2. Fit the data with the model from part c with the quadratic term, i.e., lm(y ~ x + I(x^2)). Does the assumption of constant error variance appear to be violated?
set.seed(123)
n1 <- 200
X1 <- rnorm(n1, mean=3, sd=0.5)     # same predictor distribution as part a
Y1 <- (10 + 5*X1 - 2*X1^2)^2        # squared so that sqrt(Y1) follows the quadratic mean function
fit_model2 <- lm(Y1 ~ X1 + I(X1^2))
summary(fit_model2)
##
## Call:
## lm(formula = Y1 ~ X1 + I(X1^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.817 -3.936 1.276 4.286 72.258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 638.056 16.416 38.87 <2e-16 ***
## X1 -314.395 10.736 -29.29 <2e-16 ***
## I(X1^2) 39.085 1.733 22.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.754 on 197 degrees of freedom
## Multiple R-squared: 0.9582, Adjusted R-squared: 0.9578
## F-statistic: 2257 on 2 and 197 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit_model2, which=1:4)
Since the Residuals vs Fitted plot shows a clear pattern and the spread of the residuals changes across the fitted values, the constant-variance assumption is violated. A variance-stabilizing transformation of the response, such as the Box-Cox transformation used in part e, is a natural remedy.
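As a supplementary formal check (not in the original output, and assuming the lmtest package is installed), the Breusch-Pagan test can confirm the non-constant variance:
library(lmtest)
bptest(fit_model2) # a small p-value indicates heteroscedastic errors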
#e. Now repeat part d but transform Y with MASS::boxcox. What is the lambda value you choose? Does this model seem to fit the data better?
library(MASS)
bc <- boxcox(fit_model2)        # profile log-likelihood over a grid of lambda values
lambda <- bc$x[which.max(bc$y)] # lambda that maximizes the log-likelihood
lambda
## [1] 1.79798
par(mfrow=c(2,2))
plot(fit_model2)
Y_new <- (Y^lambda - 1) / lambda # Box-Cox power transform; note this uses Y from part a, whose negative values cannot be power-transformed (producing NAs)
fit_new <- lm(Y_new ~ X + I(X^2))
fit_new
##
## Call:
## lm(formula = Y_new ~ X + I(X^2))
##
## Coefficients:
## (Intercept) X I(X^2)
## 134.140 -46.748 2.696
par(mfrow=c(2,2))
plot(fit_new)
list(lambda=lambda, summary=summary(fit_new))
## $lambda
## [1] 1.79798
##
## $summary
##
## Call:
## lm(formula = Y_new ~ X + I(X^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1407 -1.6383 -0.1923 1.2853 7.8036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.1397 8.2836 16.193 < 2e-16 ***
## X -46.7476 5.7496 -8.131 6e-14 ***
## I(X^2) 2.6957 0.9865 2.733 0.00689 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.655 on 185 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9573, Adjusted R-squared: 0.9568
## F-statistic: 2072 on 2 and 185 DF, p-value: < 2.2e-16
The value of lambda is 1.79798. Note that the transformation above was applied to Y from part a rather than Y1 from part d; because Y contains negative values, 12 observations could not be transformed and were dropped (see the summary output). The adjusted R-squared of the transformed model is 0.9568, essentially unchanged from before, but the residual diagnostic plots after the transformation show a more even spread of residuals, so the transformed model better satisfies the constant-variance assumption and fits the data better in that sense.
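Since part e asks to repeat part d, the transformation arguably should be applied to Y1, which is non-negative by construction, so no observations would be lost. A minimal sketch, reusing the lambda selected above (fit_new1 is a new name introduced here):
Y1_new <- (Y1^lambda - 1) / lambda # Box-Cox transform of the part-d response
fit_new1 <- lm(Y1_new ~ X1 + I(X1^2))
par(mfrow=c(2,2))
plot(fit_new1, which=1:4) # check whether the residual spread now looks constant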