#1a. What is the estimated coefficient for Private or Public School (X3)? Is this estimate significant at the α=0.05 level? Clearly write out the null and alternative hypotheses, observed t-statistic, and p-value.
alumni <- read.csv("alumni.csv")
tail(alumni)
## school percent_of_classes_under_20
## 43 U. of Washington 37
## 44 U. of Wisconsin-Madison 37
## 45 Vanderbuilt University 68
## 46 Wake Forest University 59
## 47 Washington University-St. Louis 73
## 48 Yale University 77
## student_faculty_ratio alumni_giving_rate private
## 43 12 12 0
## 44 13 13 0
## 45 9 31 1
## 46 11 38 1
## 47 7 33 1
## 48 7 50 1
attach(alumni) # not strictly needed, since data = alumni is passed to lm() below
model <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio + private, data = alumni)
summary(model)
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20 +
## student_faculty_ratio + private, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.757 -6.320 -2.273 5.152 25.669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.78364 13.67220 2.690 0.01005 *
## percent_of_classes_under_20 0.07725 0.17873 0.432 0.66768
## student_faculty_ratio -1.39835 0.51075 -2.738 0.00889 **
## private 6.28534 5.35633 1.173 0.24693
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.06 on 44 degrees of freedom
## Multiple R-squared: 0.5747, Adjusted R-squared: 0.5457
## F-statistic: 19.81 on 3 and 44 DF, p-value: 2.818e-08
The estimated coefficient for Private or Public School (X3) is 6.28534, with a standard error of 5.35633.
Null hypothesis: H0: β3 = 0, i.e., the alumni giving rate does not depend on private/public status once X1 and X2 are in the model.
Alternative hypothesis: Ha: β3 ≠ 0, i.e., the alumni giving rate does depend on private/public status.
The observed t-statistic for X3 is 1.173 with a p-value of 0.24693. Since 0.24693 > 0.05, we fail to reject the null hypothesis: the estimate for X3 is not significant at the α=0.05 level.
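As a quick check (not part of the original output), the reported p-value can be reproduced from the t-statistic and the residual degrees of freedom:
2 * pt(1.173, df = 44, lower.tail = FALSE) # two-sided p-value, ~0.247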
#b. Is the overall regression model significant at the α=0.05 level? Clearly write out the null and alternative hypotheses, observed F-statistic, and p-value.
Null hypothesis: H0: β1 = β2 = β3 = 0, i.e., Y does not depend on any of the predictors X1, X2, and X3.
Alternative hypothesis: Ha: at least one of the coefficients β1, β2, β3 is non-zero.
The observed F-statistic is 19.81 on 3 and 44 degrees of freedom, with a p-value of 2.818e-08. Since the p-value is far below 0.05, we reject the null hypothesis: the overall model is significant at the α=0.05 level.
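The overall p-value can likewise be recovered from the F-statistic (a small check, not in the original output):
pf(19.81, df1 = 3, df2 = 44, lower.tail = FALSE) # ~2.8e-08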
#c. Repeat parts a and b, now adding the interaction between Student/Faculty Ratio (X2) and Private or Public School (X3). What is the estimated slope for X2 now? Interpret the results with the interaction term.
model_interaction <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio * private, data = alumni)
summary(model_interaction)
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20 +
## student_faculty_ratio * private, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.021 -6.275 -2.207 6.439 23.455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.572987 15.388370 1.662 0.1038
## percent_of_classes_under_20 0.005511 0.182487 0.030 0.9760
## student_faculty_ratio -0.584364 0.737736 -0.792 0.4326
## private 27.907119 15.265512 1.828 0.0745 .
## student_faculty_ratio:private -1.477699 0.978896 -1.510 0.1385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.931 on 43 degrees of freedom
## Multiple R-squared: 0.5961, Adjusted R-squared: 0.5585
## F-statistic: 15.86 on 4 and 43 DF, p-value: 4.745e-08
Slope
The estimated slope for X2 is -0.584364. The negative sign indicates a negative association between the student/faculty ratio and the alumni giving rate for public schools (X3 = 0). This slope is not statistically significant (p-value 0.4326).
The estimated coefficient for the interaction X2:X3 is -1.477699. The negative sign indicates that the adverse effect of the student/faculty ratio on the alumni giving rate is stronger for private schools: the implied slope for private schools (X3 = 1) is -0.584364 + (-1.477699) = -2.062063. The interaction effect is also not statistically significant (p-value 0.1385).
In summary, although a negative trend exists between the student/faculty ratio and the alumni giving rate, the interaction term makes that slope depend on whether the institution is private or public. Neither the X2 slope nor the interaction is significant at the 0.05 level, so these interpretations should be treated with caution.
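The private-school slope quoted above can be computed directly from the fitted coefficients (a small check, not in the original output):
coef(model_interaction)["student_faculty_ratio"] + coef(model_interaction)["student_faculty_ratio:private"] # ~ -2.062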
#d. Compare the models with and without the interaction term. Which model would you choose based on the data? Please explain your choice.
anova(model, model_interaction)
## Analysis of Variance Table
##
## Model 1: alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio +
## private
## Model 2: alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio *
## private
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 44 3611.8
## 2 43 3430.1 1 181.77 2.2788 0.1385
The partial F-test for adding the interaction gives F = 2.2788 with a p-value of 0.1385 (matching the t-test on the interaction term, since F = t^2 when a single term is added). Because 0.1385 > 0.05, the interaction does not significantly improve the fit, so I would choose the simpler model without the interaction term.
#e. What is the predicted alumni giving rate for an observation with (X1=40, X2=5, X3=1)?
data_frame <- data.frame(
percent_of_classes_under_20 = 40,
student_faculty_ratio = 5,
private = 1
)
predicted_value1 <- predict(model, newdata = data_frame)
predicted_value1
## 1
## 39.16739
predicted_value2 <- predict(model_interaction, newdata = data_frame)
predicted_value2
## 1
## 43.39025
The predicted alumni giving rate from the model without the interaction is 39.16739; from the model with the interaction, it is 43.39025.
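Both predictions can be hand-checked by plugging (X1=40, X2=5, X3=1) into the fitted equations (a sketch using the rounded coefficients, so the results match only up to rounding):
36.78364 + 0.07725*40 - 1.39835*5 + 6.28534*1 # ~39.167 (no interaction)
25.572987 + 0.005511*40 - 0.584364*5 + 27.907119*1 - 1.477699*5*1 # ~43.390 (with interaction)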
#2a. Generate data with X1 ∼ N(μ=3, σ=0.5), sample size n=200, and error term ϵ ∼ N(μ=0, σ=0.5).
set.seed(123)
n <- 200
X <- rnorm(n, mean=3, sd=0.5)       # predictor
epsilon <- rnorm(n, mean=0, sd=0.5) # error term
Y <- 10 + 5*X - 2*X^2 + epsilon     # true relationship is quadratic in X
data <- data.frame(X, Y)
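A quick scatterplot (not part of the original output) makes the curvature visible before any model is fit:
plot(X, Y) # clearly curved, so a straight-line fit should underfit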
#b. Fit a simple linear regression using just X. What is the estimated regression equation? Please conduct model estimation, inference, and residual diagnostics. What do you conclude?
fit_model <- lm(X ~ Y) # note: as written this regresses X on Y; part b asks for Y as the response, i.e., lm(Y ~ X)
summary(fit_model)
##
## Call:
## lm(formula = X ~ Y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41546 -0.05480 0.01936 0.05846 0.19961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.854949 0.015096 255.37 <2e-16 ***
## Y -0.130020 0.002014 -64.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1007 on 198 degrees of freedom
## Multiple R-squared: 0.9546, Adjusted R-squared: 0.9544
## F-statistic: 4167 on 1 and 198 DF, p-value: < 2.2e-16
Because the model was fit as lm(X ~ Y), the output above describes X as a function of Y: the estimated regression equation is X = 3.8549 - 0.1300*Y. The estimated intercept is 3.855 with a standard error of 0.015, and the estimated slope is -0.130 with a standard error of 0.002; both are highly significant (p < 2e-16).
par(mfrow=c(2,2))
plot(fit_model, which=1:4)
Since there is a clear pattern in the Residuals vs Fitted plot, the straight-line model does not fit the data properly. Hence we need to add a higher-order term.
#c. Update the model from part b by adding a quadratic term. Conduct model estimation, inference, and residual diagnostics. What do you conclude? Does this model seem to fit the data better? Please explain.
new_model <- lm(Y ~ X + I(X^2))
summary(new_model)
##
## Call:
## lm(formula = Y ~ X + I(X^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.22766 -0.33289 -0.03275 0.33394 1.29078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.9551 1.0505 11.381 < 2e-16 ***
## X 3.7415 0.6870 5.446 1.52e-07 ***
## I(X^2) -1.8003 0.1109 -16.229 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4962 on 197 degrees of freedom
## Multiple R-squared: 0.9806, Adjusted R-squared: 0.9804
## F-statistic: 4976 on 2 and 197 DF, p-value: < 2.2e-16
The estimated equation is Y = 11.9551 + 3.7415*X - 1.8003*X^2. All three coefficients are highly significant, and each estimate lies within about two standard errors of the true value (10, 5, and -2) used to generate the data.
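As a quick check (not in the original output), the 95% confidence intervals should cover the true coefficients:
confint(new_model) # intervals for (Intercept), X, I(X^2); compare with 10, 5, -2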
par(mfrow=c(2,2))
plot(new_model, which=1:4)
The Residuals vs Fitted plot shows a random scatter of points, indicating constant error variance. The Q-Q plot follows the reference line closely, supporting the normality assumption. The Scale-Location plot displays no trend in the spread of the residuals, and the Cook's distance plot flags only a few mildly influential points. Overall, the model with the quadratic term fits the data much better than the straight-line model.
#d. Generate Y data using E(√Y | X) = 10 + 5X − 2X^2. Fit the data with the model from part c with the quadratic term, i.e., lm(y ~ x + I(x^2)). Does the assumption of constant error variance appear to be violated?
set.seed(123)
n1 <- 200
X1 <- rnorm(n1, mean=3, sd=0.5)     # same predictor distribution as part a
Y1 <- (10 + 5*X1 - 2*X1^2)^2        # squared so that sqrt(Y1) follows the quadratic mean function
fit_model2 <- lm(Y1 ~ X1 + I(X1^2))
summary(fit_model2)
##
## Call:
## lm(formula = Y1 ~ X1 + I(X1^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.817 -3.936 1.276 4.286 72.258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 638.056 16.416 38.87 <2e-16 ***
## X1 -314.395 10.736 -29.29 <2e-16 ***
## I(X1^2) 39.085 1.733 22.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.754 on 197 degrees of freedom
## Multiple R-squared: 0.9582, Adjusted R-squared: 0.9578
## F-statistic: 2257 on 2 and 197 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit_model2, which=1:4)
Since the Residuals vs Fitted plot shows a clear pattern and the spread of the residuals changes across the fitted values, the constant-variance assumption is violated. A variance-stabilizing transformation of the response, such as the Box-Cox transformation used in part e, is a natural remedy.
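As a supplementary formal check (not in the original output, and assuming the lmtest package is installed), the Breusch-Pagan test can confirm the non-constant variance:
library(lmtest)
bptest(fit_model2) # a small p-value indicates heteroscedastic errors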
#e. Now repeat part d but transform Y with MASS::boxcox. What is the lambda value you choose? Does this model seem to fit the data better?
library(MASS)
bc <- boxcox(fit_model2)        # profile log-likelihood over a grid of lambda values
lambda <- bc$x[which.max(bc$y)] # lambda that maximizes the log-likelihood
lambda
## [1] 1.79798
par(mfrow=c(2,2))
plot(fit_model2)
Y_new <- (Y^lambda - 1) / lambda # Box-Cox power transform; note this uses Y from part a, whose negative values cannot be power-transformed (producing NAs)
fit_new <- lm(Y_new ~ X + I(X^2))
fit_new
##
## Call:
## lm(formula = Y_new ~ X + I(X^2))
##
## Coefficients:
## (Intercept) X I(X^2)
## 134.140 -46.748 2.696
par(mfrow=c(2,2))
plot(fit_new)
list(lambda=lambda, summary=summary(fit_new))
## $lambda
## [1] 1.79798
##
## $summary
##
## Call:
## lm(formula = Y_new ~ X + I(X^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1407 -1.6383 -0.1923 1.2853 7.8036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.1397 8.2836 16.193 < 2e-16 ***
## X -46.7476 5.7496 -8.131 6e-14 ***
## I(X^2) 2.6957 0.9865 2.733 0.00689 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.655 on 185 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9573, Adjusted R-squared: 0.9568
## F-statistic: 2072 on 2 and 185 DF, p-value: < 2.2e-16
The value of lambda is 1.79798. Note that the transformation above was applied to Y from part a rather than Y1 from part d; because Y contains negative values, 12 observations could not be transformed and were dropped (see the summary output). The adjusted R-squared of the transformed model is 0.9568, essentially unchanged from before, but the residual diagnostic plots after the transformation show a more even spread of residuals, so the transformed model better satisfies the constant-variance assumption and fits the data better in that sense.
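Since part e asks to repeat part d, the transformation arguably should be applied to Y1, which is non-negative by construction, so no observations would be lost. A minimal sketch, reusing the lambda selected above (fit_new1 is a new name introduced here):
Y1_new <- (Y1^lambda - 1) / lambda # Box-Cox transform of the part-d response
fit_new1 <- lm(Y1_new ~ X1 + I(X1^2))
par(mfrow=c(2,2))
plot(fit_new1, which=1:4) # check whether the residual spread now looks constant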