Members of the group: Keerthi Chereddy, Devashish Kulkarni, Shashank Tippa, Trishla Thakur, Akash Rachewad


1a. Start with a basic exploratory data analysis. Show summary statistics of the response variable and predictor variable.
summary_X <- summary(alumni$'percent_of_classes_under_20')
summary_X
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   44.75   59.50   55.73   66.25   77.00
summary_Y <- summary(alumni$'alumni_giving_rate')
summary_Y
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   18.75   29.00   29.27   38.50   67.00


1b. What is the nature of the variables X and Y? Are there outliers? What is the correlation coefficient? Draw a scatter plot. Any major comments about the data?
# Two panels per row with tight margins for the box plot / histogram pairs
par(mfrow=c(1,2), mar=c(2,2,2,2), oma=c(0,0,0,0))

# Predictor X: distribution and potential outliers
boxplot(alumni$`percent_of_classes_under_20`)
hist(alumni$`percent_of_classes_under_20`, main = 'percent_of_classes_under_20')

# Response Y: distribution and potential outliers
boxplot(alumni$`alumni_giving_rate`)
hist(alumni$`alumni_giving_rate`, main = 'alumni_giving_rate')

# Scatter plot of Y against X
plot(alumni$`percent_of_classes_under_20`, alumni$`alumni_giving_rate`, 
     xlab="percent_of_classes_under_20", ylab="alumni_giving_rate", pch  = 20)
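
The outlier question can also be checked numerically from the quartiles reported in 1a. The sketch below applies the usual 1.5 * IQR rule to those printed values; `iqr_fences` is a hypothetical helper introduced here for illustration, not part of the original analysis. Note that `boxplot()` uses hinges rather than the quartiles from `summary()`, so its whiskers may differ slightly from these fences.

```r
# Min, quartiles and max copied from the summary() output in 1a.
quartiles <- list(
  percent_of_classes_under_20 = c(min = 29.00, q1 = 44.75, q3 = 66.25, max = 77.00),
  alumni_giving_rate          = c(min = 7.00,  q1 = 18.75, q3 = 38.50, max = 67.00)
)

# Hypothetical helper: fences at 1.5 * IQR beyond the first and third quartiles.
iqr_fences <- function(s) {
  iqr <- s[["q3"]] - s[["q1"]]
  c(lower = s[["q1"]] - 1.5 * iqr, upper = s[["q3"]] + 1.5 * iqr)
}

fences <- lapply(quartiles, iqr_fences)

# TRUE would mean the minimum or maximum falls outside the fences.
has_outliers <- mapply(
  function(s, f) s[["min"]] < f[["lower"]] || s[["max"]] > f[["upper"]],
  quartiles, fences
)

print(fences)
print(has_outliers)
```

Under this rule neither variable has a value outside its fences: the largest giving rate (67) sits just below the upper fence of 68.125.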

fit <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
#print(fit)
summary(fit)
## 
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20, 
##     data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.053  -7.158  -1.660   6.734  29.658 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -7.3861     6.5655  -1.125    0.266    
## percent_of_classes_under_20   0.6578     0.1147   5.734 7.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared:  0.4169, Adjusted R-squared:  0.4042 
## F-statistic: 32.88 on 1 and 46 DF,  p-value: 7.228e-07
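
Question 1b also asks for the correlation coefficient. In a simple linear regression it can be recovered from the output above, because the correlation has the sign of the slope and its square equals the Multiple R-squared. The sketch below uses only the printed values; the guarded `cor()` call assumes the `alumni` data frame is loaded in the session.

```r
r_squared <- 0.4169   # Multiple R-squared from summary(fit)
slope     <- 0.6578   # estimated coefficient of percent_of_classes_under_20

# In simple linear regression, r = sign(slope) * sqrt(R-squared).
r <- sign(slope) * sqrt(r_squared)
print(round(r, 4))    # 0.6457

# Direct computation, only run if the data frame is available (assumption).
if (exists("alumni")) {
  print(cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate))
}
```

A value of about 0.65 indicates a moderately strong positive linear association between the two variables.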


1c. Fit a simple linear regression to the data. What is your estimated regression equation?

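From the regression output, the estimated intercept is b0 = -7.3861 and the estimated slope is b1 = 0.6578. Hence the estimated regression equation is Ŷ = -7.3861 + 0.6578 * X, where X is the percentage of classes under 20 and Ŷ is the predicted alumni giving rate.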
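
As a quick illustration of how the equation is used, the sketch below evaluates it at X = 50, a value chosen only as an example inside the observed range of 29 to 77. `predict_giving_rate` is a hypothetical helper; the guarded `predict()` call assumes the `fit` object from the earlier `lm()` call is still in the session.

```r
b0 <- -7.3861   # estimated intercept from summary(fit)
b1 <-  0.6578   # estimated slope from summary(fit)

# Hypothetical helper: predicted alumni giving rate for a given X.
predict_giving_rate <- function(x) b0 + b1 * x

print(predict_giving_rate(50))   # 25.5039

# Equivalent prediction from the fitted model, if it is available (assumption).
if (exists("fit") && inherits(fit, "lm")) {
  print(predict(fit, newdata = data.frame(percent_of_classes_under_20 = 50)))
}
```

Predictions for X values outside 29 to 77 would be extrapolations and should be treated with caution.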


1d. Interpret your results.

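The regression output supports the following conclusions:

  1. The R-squared value (0.4169) indicates that approximately 41.69% of the variability in alumni giving rate is explained by the percentage of classes under 20.

  2. The F-statistic (32.88 on 1 and 46 degrees of freedom) and its associated p-value (7.228e-07) show that the model as a whole is statistically significant.

  3. The slope (0.6578, t = 5.734) is positive and statistically significant: each additional percentage point of classes under 20 is associated with an estimated increase of about 0.66 percentage points in the alumni giving rate.

  4. The intercept (-7.3861) is not significantly different from zero (p-value 0.266) and has no practical interpretation, since X = 0 lies outside the observed range of 29 to 77.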
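
With a single predictor, the overall F-test and the t-test on the slope are the same test, so the F-statistic equals the squared t value. The sketch below checks this using the rounded values printed in the output and recomputes the p-value with `pf()`; small differences from the printed p-value are expected because the inputs are rounded.

```r
t_slope  <- 5.734   # t value of the slope from summary(fit)
f_stat   <- 32.88   # F-statistic from summary(fit)
df_resid <- 46      # residual degrees of freedom

# In simple linear regression, F = t^2.
print(round(t_slope^2, 2))   # 32.88

# Upper-tail probability of the F distribution with 1 and 46 degrees of freedom.
p_value <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)
print(p_value)
```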