## [1] "Summary Statistics"
## school percent_of_classes_under_20 student_faculty_ratio
## Length:48 Min. :29.00 Min. : 3.00
## Class :character 1st Qu.:44.75 1st Qu.: 8.00
## Mode :character Median :59.50 Median :10.50
## Mean :55.73 Mean :11.54
## 3rd Qu.:66.25 3rd Qu.:13.50
## Max. :77.00 Max. :23.00
## alumni_giving_rate private
## Min. : 7.00 Min. :0.0000
## 1st Qu.:18.75 1st Qu.:0.0000
## Median :29.00 Median :1.0000
## Mean :29.27 Mean :0.6875
## 3rd Qu.:38.50 3rd Qu.:1.0000
## Max. :67.00 Max. :1.0000
Data Exploration of percent_of_classes_under_20 and alumni_giving_rate
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 44.75 59.50 55.73 66.25 77.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 18.75 29.00 29.27 38.50 67.00
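A minimal sketch of the code behind these summaries; the file name alumni.csv is an assumption, so substitute the actual data source:

# Load the data (file name assumed) and summarize the two study variables.
alumni <- read.csv("alumni.csv")
summary(alumni$percent_of_classes_under_20)
summary(alumni$alumni_giving_rate)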
Correlation and Linear Regression
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
## data = alumni)
##
## Coefficients:
## (Intercept) percent_of_classes_under_20
## -7.3861 0.6578
## (Intercept) percent_of_classes_under_20
## -7.3860676 0.6577687
## Estimated Slope: 0.6577687
## T-Statistic: 5.734448
## P-Value: 7.228121e-07
## The null hypothesis is rejected as the estimated slope is significant at α=0.05 level.
## F-Statistic: 32.88389
## P-Value (F-Test): 7.228121e-07
## The slope from the F-test is statistically significant at α=0.05 level.
## The null hypothesis is rejected, indicating a correlation between class size and alumni giving rate.
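A sketch of how these statistics can be pulled from the fitted model object (the alumni data frame and the formula follow the lm() call in the output above). With a single predictor the F-statistic equals the squared slope t-statistic, which is why both tests report the same p-value:

model <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
s <- summary(model)

slope   <- coef(model)[2]
t_stat  <- s$coefficients[2, "t value"]    # slope t-test
p_value <- s$coefficients[2, "Pr(>|t|)"]
f_stat  <- s$fstatistic[1]                 # equals t_stat^2 here

cat("Estimated Slope:", slope, "\n")
cat("T-Statistic:", t_stat, " P-Value:", p_value, "\n")
cat("F-Statistic:", f_stat, "\n")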
R-Squared Interpretation
## R-squared: 0.4168645
## 41.69 % of the variation in the giving rate is explained by the percentage of classes under 20 students.
## 58.31 % of the variation in the giving rate is due to other factors not included in the model.
## Correlation Coefficient (r): 0.6456504
The relationship between r and R-squared (R²): in simple linear regression, R² is the square of the correlation coefficient, so |r| = sqrt(R²), with the sign of r matching the sign of the estimated slope.
Here sqrt(0.4169) = 0.6456, and since the slope is positive, r = 0.6456.
The correlation coefficient r measures how the variables X and Y are linearly related. Its value ranges from -1 to +1: if r is negative, the variables are inversely related (as X increases, Y decreases); if r is 0, there is no linear relationship; and if r is positive, the variables are directly related (as X increases, Y increases).
Squaring r gives R². Because squaring discards the sign, r = -1 and r = +1 both yield R² = 1, in which case the model explains the entire variance of Y with respect to X, whether the relationship is direct or inverse.
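A quick numerical check of this relationship on the alumni data, refitting the regression from above:

# r squared reproduces R-squared; r is recovered from R-squared by taking
# the square root and attaching the sign of the estimated slope.
model <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
r  <- cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate)
r2 <- summary(model)$r.squared

r^2                               # 0.4169 = R-squared
sign(coef(model)[2]) * sqrt(r2)   # 0.6457 = r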
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.725 1.923 2.001 2.004 2.070 2.243
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.09 19.67 20.11 20.17 20.70 21.80
## Coefficients of the linear model: 9.021796 5.56516
## Mean Squared Error (MSE): 0.1992276
## The confidence band is narrowest at the mean of X and widens as X moves away from it.
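A small sketch of how the band widths can be checked, assuming the simulated data frame printed below is named simulated_data, as in the lm() call further down:

# Confidence-interval width at a grid of X values; the minimum-width point
# falls near mean(X), where the band is narrowest.
grid <- data.frame(X = seq(min(simulated_data$X), max(simulated_data$X),
                           length.out = 50))
ci <- predict(lm(Y ~ X, data = simulated_data),
              newdata = grid, interval = "confidence")
band_width <- ci[, "upr"] - ci[, "lwr"]
grid$X[which.min(band_width)]   # close to mean(simulated_data$X)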
## X Y
## 1 1.907630 20.46643
## 2 1.949162 20.08051
## 3 2.029797 20.32815
## 4 2.101782 20.39166
## 5 2.000072 20.16090
## 6 1.926412 19.51731
## ... (rows 7 to 100 of the 100-row simulated data set omitted)
##
## Call:
## lm(formula = Y ~ X, data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2073 -0.3029 0.0093 0.3033 1.3545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0218 0.8336 10.82 <2e-16 ***
## X 5.5652 0.4155 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
## F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
## Mean Squared Error (MSE): 0.2032934
Explanation of results:
The estimated prediction equation is Y = 9.0218 + 5.5652 X. The estimated coefficients are 9.0218 (intercept) and 5.5652 (slope), with standard errors 0.8336 and 0.4155. The t-statistic for the slope is 13.39, with a p-value below 2.2 x 10^-16. The fitted model's MSE is 0.2033. The null hypothesis is that X does not affect the value of Y (the slope is zero); since the p-values are well below 0.05, the null hypothesis is rejected.
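For reference, a minimal sketch of how this data set was presumably generated and fit, consistent with the simulated_regression() function shown at the end of this section (true model Y = 10 + 5X + error, X ~ N(2, 0.1), error sd 0.5, n = 100); exact values depend on the seed and on the order of the rnorm() calls:

set.seed(7052)                       # seed used later in this document
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
error <- rnorm(n, mean = 0, sd = 0.5)
Y <- 10 + 5 * X + error
simulated_data <- data.frame(X, Y)

fit <- lm(Y ~ X, data = simulated_data)
summary(fit)                         # coefficients, SEs, t-statistics
mean(residuals(fit)^2)               # MSE as the mean squared residual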
## X Y
## 1 1.907630 21.39470
## 2 1.949162 20.41522
## 3 2.029797 20.50732
## 4 2.101782 20.27442
## 5 2.000072 20.32145
## 6 1.926412 19.40257
##
## Call:
## lm(formula = Y ~ X, data = simulated_data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4146 -0.6058 0.0186 0.6066 2.7090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0436 1.6673 4.824 5.16e-06 ***
## X 6.1303 0.8309 7.378 5.25e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9018 on 98 degrees of freedom
## Multiple R-squared: 0.3571, Adjusted R-squared: 0.3505
## F-statistic: 54.43 on 1 and 98 DF, p-value: 5.253e-11
## Mean Squared Error (MSE) for Alternative Model: 0.8131737
Explanation of results:
The estimated prediction equation is Y = 8.0436 + 6.1303 X. The estimated coefficients are 8.0436 (intercept) and 6.1303 (slope), with standard errors 1.6673 and 0.8309. The t-statistic for the slope is 7.378, with a p-value of 5.25 x 10^-11. The fitted model's MSE is 0.8132. The null hypothesis states that X does not affect Y; since the p-value is far below 0.05, the null hypothesis is rejected.
## [1] "First few records from the generated data:"
## X Y
## 1 1.907630 18.94756
## 2 1.949162 19.78539
## 3 2.029797 20.37686
## 4 2.101782 21.28714
## 5 2.000072 19.61935
## 6 1.926412 19.60554
##
## Call:
## lm(formula = Y ~ X, data = simulated_data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76214 -0.33740 0.03615 0.32077 1.63021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.7466 0.5015 19.44 <2e-16 ***
## X 5.1177 0.2490 20.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4887 on 398 degrees of freedom
## Multiple R-squared: 0.5149, Adjusted R-squared: 0.5137
## F-statistic: 422.5 on 1 and 398 DF, p-value: < 2.2e-16
## Mean Squared Error (MSE) with error variance 0.5 and sample size 400: 0.2388269
## X Y
## 1 1.907630 18.35697
## 2 1.949162 19.82497
## 3 2.029797 20.60473
## 4 2.101782 22.06537
## 5 2.000072 19.23834
## 6 1.926412 19.57903
##
## Call:
## lm(formula = Y ~ X, data = simulated_data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5243 -0.6748 0.0723 0.6415 3.2604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4933 1.0029 9.466 <2e-16 ***
## X 5.2355 0.4979 10.514 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9774 on 398 degrees of freedom
## Multiple R-squared: 0.2174, Adjusted R-squared: 0.2154
## F-statistic: 110.5 on 1 and 398 DF, p-value: < 2.2e-16
## Mean Squared Error (MSE) with error variance 1 and sample size 400: 0.9553077
The estimated prediction equation when the error variance is 0.5 is Y = 9.7466 + 5.1177 X. The estimated coefficients are 9.7466 and 5.1177, with standard errors 0.5015 and 0.2490. The t-statistic for the slope is 20.55, with a p-value below 2.2 x 10^-16, and the fitted model's MSE is 0.2388. The null hypothesis states that X does not affect the value of Y (the slope is zero); the alternative states that Y depends on X (the slope is nonzero). Since the p-values are below 0.05, the null hypothesis is rejected.
The estimated prediction equation when the error variance is 1 is Y = 9.4933 + 5.2355 X. The estimated coefficients are 9.4933 and 5.2355, with standard errors 1.0029 and 0.4979. The t-statistic for the slope is 10.514, with a p-value below 2.2 x 10^-16, and the fitted model's MSE is 0.9553. Again, since the p-values are below 0.05, the null hypothesis that X does not affect Y is rejected.
When the error variance was increased from 0.5 to 1, the estimated intercept decreased and the estimated slope increased. The change was substantial at the small sample size (n = 100) but amounted to only a few decimal places at n = 400. As the sample size grows, the standard errors shrink.
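The shrinking standard errors follow from the analytic formula SE(slope) = sigma / sqrt(sum((x_i - x_bar)^2)). A rough sketch using the known simulation parameters (X drawn with sd 0.1); the observed standard errors above differ somewhat because they depend on the realized sample:

# Approximate slope standard error: sigma / (sd(X) * sqrt(n - 1)).
se_slope <- function(n, sigma, sd_x = 0.1) sigma / (sd_x * sqrt(n - 1))
se_slope(100, 0.5)   # ~0.50
se_slope(400, 0.5)   # ~0.25: quadrupling n halves the SE
se_slope(100, 1.0)   # ~1.01: doubling sigma doubles the SE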
## MSE with sample size 100, error variance 0.5: 0.2032934
## MSE with sample size 100, error variance 1: 0.8131737
## MSE with sample size 400, error variance 0.5: 0.2388269
## MSE with sample size 400, error variance 1: 0.9553077
Sample Size | Error Variance | Estimated Intercept | Estimated Slope | MSE | SE of Intercept | SE of Slope |
---|---|---|---|---|---|---|
100 | 0.5 | 9.0218 | 5.5652 | 0.2033 | 0.8336 | 0.4155 |
100 | 1.0 | 8.0436 | 6.1303 | 0.8132 | 1.6673 | 0.8309 |
400 | 0.5 | 9.7466 | 5.1177 | 0.2388 | 0.5015 | 0.2490 |
400 | 1.0 | 9.4933 | 5.2355 | 0.9553 | 1.0029 | 0.4979 |
Conclusion on the effect of error variance and sample size:
Increasing the error variance from 0.5 to 1 (at both sample sizes, 100 and 400) sharply increased the MSE, roughly in proportion to the error variance, since the MSE estimates the variance of the error terms. Although one might expect the MSE to shrink as the sample size grows, the values above did not differ much between n = 100 and n = 400; the main effect of a larger sample size is on the precision of the coefficient estimates, whose standard errors got smaller.
library(dplyr)    # for bind_rows(); broom is used below via broom::tidy()
set.seed(7052)

# Simulate data from the true model Y = 10 + 5X + error, fit a simple
# linear regression, and collect the key statistics.
simulated_regression <- function(n, sigma_error) {
  X <- rnorm(n, mean = 2, sd = 0.1)
  error <- rnorm(n, mean = 0, sd = sigma_error)
  Y <- 10 + 5 * X + error
  # Fit the linear regression and tidy the coefficient table
  model <- lm(Y ~ X)
  coefficients_table <- broom::tidy(model)
  t_statistic <- coefficients_table[2, "statistic"]   # slope t-statistic
  p_value <- coefficients_table[2, "p.value"]         # slope p-value
  mse <- mean(model$residuals^2)                      # mean squared residual
  # Return one row per coefficient, with the simulation settings attached
  return(data.frame(
    n = n,
    sigma_error = sigma_error,
    coefficients = coefficients_table,
    t_statistic = t_statistic,
    p_value = p_value,
    mse = mse
  ))
}
results <- bind_rows(
simulated_regression(100, 0.5),
simulated_regression(100, 1),
simulated_regression(400, 0.5),
simulated_regression(400, 1)
)
print(results)
## n sigma_error coefficients.term coefficients.estimate
## 1 100 0.5 (Intercept) 9.021796
## 2 100 0.5 X 5.565160
## 3 100 1.0 (Intercept) 13.077796
## 4 100 1.0 X 3.460281
## 5 400 0.5 (Intercept) 9.697322
## 6 400 0.5 X 5.160805
## 7 400 1.0 (Intercept) 9.963304
## 8 400 1.0 X 5.033593
## coefficients.std.error coefficients.statistic coefficients.p.value statistic
## 1 0.8336483 10.822065 2.007658e-18 13.395491
## 2 0.4154502 13.395491 7.118010e-24 13.395491
## 3 2.0874900 6.264842 9.978287e-09 3.343580
## 4 1.0349032 3.343580 1.172512e-03 3.343580
## 5 0.5207461 18.621975 4.072601e-56 19.811650
## 6 0.2604935 19.811650 2.797442e-61 19.811650
## 7 1.0317701 9.656516 5.832549e-20 9.749772
## 8 0.5162781 9.749772 2.784618e-20 9.749772
## p.value mse
## 1 7.118010e-24 0.1992276
## 2 7.118010e-24 0.1992276
## 3 1.172512e-03 0.9106786
## 4 1.172512e-03 0.9106786
## 5 2.797442e-61 0.2568544
## 6 2.797442e-61 0.2568544
## 7 2.784618e-20 1.0589855
## 8 2.784618e-20 1.0589855
# True values for intercept and slope
true_intercept <- 10
true_slope <- 5
# Calculating bias for intercept and slope
bias_intercept <- mean(results$coefficients.estimate[results$coefficients.term == "(Intercept)"]) - true_intercept
bias_slope <- mean(results$coefficients.estimate[results$coefficients.term == "X"]) - true_slope
# Calculating variance for intercept and slope
variance_intercept <- var(results$coefficients.estimate[results$coefficients.term == "(Intercept)"])
variance_slope <- var(results$coefficients.estimate[results$coefficients.term == "X"])
# Displaying results
cat("Bias for Intercept:", bias_intercept, "\n")
## Bias for Intercept: 0.4400544
cat("Bias for Slope:", bias_slope, "\n")
## Bias for Slope: -0.1950401
cat("Variance for Intercept:", variance_intercept, "\n")
## Variance for Intercept: 3.249359
cat("Variance for Slope:", variance_slope, "\n")
## Variance for Slope: 0.8549879
Explanation of the impact of sample size and error variance:
As the sample size increases, we expect the variance of the estimated intercept and slope to decrease: with more observations the estimates stabilize around the true values, so their sampling variability shrinks. As the error variance increases, we expect the variance of the estimates to increase, because the points scatter farther from the regression line; the fitted line then depends more heavily on the particular points drawn, and the variability of the estimates is higher.
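A quick Monte Carlo check of these claims under the same data-generating process as simulated_regression(); this sketch, with its reps parameter, is an illustration added here, not part of the original analysis:

# Variance of the slope estimate across 500 replications: it should shrink
# roughly fourfold when n quadruples, and grow fourfold when sigma doubles.
slope_var <- function(n, sigma, reps = 500) {
  slopes <- replicate(reps, {
    X <- rnorm(n, mean = 2, sd = 0.1)
    Y <- 10 + 5 * X + rnorm(n, sd = sigma)
    coef(lm(Y ~ X))[2]
  })
  var(slopes)
}
slope_var(100, 0.5)   # baseline
slope_var(400, 0.5)   # smaller: larger sample
slope_var(100, 1)     # larger: noisier errors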
## Bias of MSE: 0.5964365
## ML estimate of sigma squared: 0.20124
Explanation of the difference between MSE bias and ML estimate:
The bias of the MSE (0.5964) is the difference between the average estimated MSE across the simulations and the true variance of the error terms; it is influenced by the particular random samples that were drawn. The ML estimate of sigma squared (0.2012), by contrast, is computed from the residuals of a single fitted model and estimates the true variance of the error terms directly, giving a picture of the underlying variability in the data as a whole. The MSE remains the usual summary of fit because it measures how well the predictions match the actual data.
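As a minimal sketch of the mechanical difference, assuming the simulated_data model from above: the unbiased MSE divides the residual sum of squares by the residual degrees of freedom (n - 2 in simple regression), while the ML estimate divides by n. Note that mean(model$residuals^2) in simulated_regression() above is the ML version.

# Sketch: unbiased vs. maximum-likelihood estimates of the error variance.
fit <- lm(Y ~ X, data = simulated_data)
n   <- nrow(simulated_data)
sse <- sum(residuals(fit)^2)

sse / (n - 2)   # unbiased MSE; equals summary(fit)$sigma^2
sse / n         # ML estimate; smaller by the factor (n - 2) / n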