## [1] "Summary Statistics"
## school percent_of_classes_under_20 student_faculty_ratio
## Length:48 Min. :29.00 Min. : 3.00
## Class :character 1st Qu.:44.75 1st Qu.: 8.00
## Mode :character Median :59.50 Median :10.50
## Mean :55.73 Mean :11.54
## 3rd Qu.:66.25 3rd Qu.:13.50
## Max. :77.00 Max. :23.00
## alumni_giving_rate private
## Min. : 7.00 Min. :0.0000
## 1st Qu.:18.75 1st Qu.:0.0000
## Median :29.00 Median :1.0000
## Mean :29.27 Mean :0.6875
## 3rd Qu.:38.50 3rd Qu.:1.0000
## Max. :67.00 Max. :1.0000
Data Exploration of percent_of_classes_under_20 and alumni_giving_rate
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 44.75 59.50 55.73 66.25 77.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 18.75 29.00 29.27 38.50 67.00
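A minimal sketch of the code behind these summaries; the file name alumni.csv is an assumption, so substitute the actual data source:

# Load the data (file name assumed) and summarize the two study variables.
alumni <- read.csv("alumni.csv")
summary(alumni$percent_of_classes_under_20)
summary(alumni$alumni_giving_rate)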
Correlation and Linear Regression
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
## data = alumni)
##
## Coefficients:
## (Intercept) percent_of_classes_under_20
## -7.3861 0.6578
## (Intercept) percent_of_classes_under_20
## -7.3860676 0.6577687
## Estimated Slope: 0.6577687
## T-Statistic: 5.734448
## P-Value: 7.228121e-07
## The null hypothesis is rejected as the estimated slope is significant at α=0.05 level.
## F-Statistic: 32.88389
## P-Value (F-Test): 7.228121e-07
## The slope from the F-test is statistically significant at α=0.05 level.
## The null hypothesis is rejected, indicating a correlation between class size and alumni giving rate.
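A sketch of how these statistics can be pulled from the fitted model object (the alumni data frame and the formula follow the lm() call in the output above). With a single predictor the F-statistic equals the squared slope t-statistic, which is why both tests report the same p-value:

model <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
s <- summary(model)

slope   <- coef(model)[2]
t_stat  <- s$coefficients[2, "t value"]    # slope t-test
p_value <- s$coefficients[2, "Pr(>|t|)"]
f_stat  <- s$fstatistic[1]                 # equals t_stat^2 here

cat("Estimated Slope:", slope, "\n")
cat("T-Statistic:", t_stat, " P-Value:", p_value, "\n")
cat("F-Statistic:", f_stat, "\n")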
R-Squared Interpretation
## R-squared: 0.4168645
## 41.69 % of the variation in the giving rate is explained by the percentage of classes under 20 students.
## 58.31 % of the variation in the giving rate is due to other factors not included in the model.
## Correlation Coefficient (r): 0.6456504
The relationship between r and R-squared (R²): in simple linear regression, R² is the square of the correlation coefficient, so |r| = sqrt(R²), with the sign of r matching the sign of the estimated slope.
Here sqrt(0.4169) = 0.6456, and since the slope is positive, r = 0.6456.
The correlation coefficient r measures how the variables X and Y are linearly related. Its value ranges from -1 to +1: if r is negative, the variables are inversely related (as X increases, Y decreases); if r is 0, there is no linear relationship; and if r is positive, the variables are directly related (as X increases, Y increases).
Squaring r gives R². Because squaring discards the sign, r = -1 and r = +1 both yield R² = 1, in which case the model explains the entire variance of Y with respect to X, whether the relationship is direct or inverse.
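A quick numerical check of this relationship on the alumni data, refitting the regression from above:

# r squared reproduces R-squared; r is recovered from R-squared by taking
# the square root and attaching the sign of the estimated slope.
model <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
r  <- cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate)
r2 <- summary(model)$r.squared

r^2                               # 0.4169 = R-squared
sign(coef(model)[2]) * sqrt(r2)   # 0.6457 = r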
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.725 1.923 2.001 2.004 2.070 2.243
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.09 19.67 20.11 20.17 20.70 21.80
## Coefficients of the linear model: 9.021796 5.56516
## Mean Squared Error (MSE): 0.1992276
## The confidence band is narrowest at the mean of X and widens as X moves away from it.
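A small sketch of how the band widths can be checked, assuming the simulated data frame printed below is named simulated_data, as in the lm() call further down:

# Confidence-interval width at a grid of X values; the minimum-width point
# falls near mean(X), where the band is narrowest.
grid <- data.frame(X = seq(min(simulated_data$X), max(simulated_data$X),
                           length.out = 50))
ci <- predict(lm(Y ~ X, data = simulated_data),
              newdata = grid, interval = "confidence")
band_width <- ci[, "upr"] - ci[, "lwr"]
grid$X[which.min(band_width)]   # close to mean(simulated_data$X)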
## X Y
## 1 1.907630 20.46643
## 2 1.949162 20.08051
## 3 2.029797 20.32815
## 4 2.101782 20.39166
## 5 2.000072 20.16090
## 6 1.926412 19.51731
## ... (rows 7 to 100 of the 100-row simulated data set omitted)
##
## Call:
## lm(formula = Y ~ X, data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2073 -0.3029 0.0093 0.3033 1.3545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0218 0.8336 10.82 <2e-16 ***
## X 5.5652 0.4155 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
## F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
## Mean Squared Error (MSE): 0.2032934
Explanation of results:
The estimated prediction equation is Y = 9.0218 + 5.5652 X. The estimated coefficients are 9.0218 (intercept) and 5.5652 (slope), with standard errors 0.8336 and 0.4155. The t-statistic for the slope is 13.39, with a p-value below 2.2 x 10^-16. The fitted model's MSE is 0.2033. The null hypothesis is that X does not affect the value of Y (the slope is zero); since the p-values are well below 0.05, the null hypothesis is rejected.
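For reference, a minimal sketch of how this data set was presumably generated and fit, consistent with the simulated_regression() function shown at the end of this section (true model Y = 10 + 5X + error, X ~ N(2, 0.1), error sd 0.5, n = 100); exact values depend on the seed and on the order of the rnorm() calls:

set.seed(7052)                       # seed used later in this document
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
error <- rnorm(n, mean = 0, sd = 0.5)
Y <- 10 + 5 * X + error
simulated_data <- data.frame(X, Y)

fit <- lm(Y ~ X, data = simulated_data)
summary(fit)                         # coefficients, SEs, t-statistics
mean(residuals(fit)^2)               # MSE as the mean squared residual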
## X Y
## 1 1.907630 21.39470
## 2 1.949162 20.41522
## 3 2.029797 20.50732
## 4 2.101782 20.27442
## 5 2.000072 20.32145
## 6 1.926412 19.40257
##
## Call:
## lm(formula = Y ~ X, data = simulated_data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4146 -0.6058 0.0186 0.6066 2.7090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0436 1.6673 4.824 5.16e-06 ***
## X 6.1303 0.8309 7.378 5.25e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9018 on 98 degrees of freedom
## Multiple R-squared: 0.3571, Adjusted R-squared: 0.3505
## F-statistic: 54.43 on 1 and 98 DF, p-value: 5.253e-11
## Mean Squared Error (MSE) for Alternative Model: 0.8131737
Explanation of results:
The estimated prediction equation is Y = 8.0436 + 6.1303 X. The estimated coefficients are 8.0436 (intercept) and 6.1303 (slope), with standard errors 1.6673 and 0.8309. The t-statistic for the slope is 7.378, with a p-value of 5.25 x 10^-11. The fitted model's MSE is 0.8132. The null hypothesis states that X does not affect Y; since the p-value is far below 0.05, the null hypothesis is rejected.
## [1] "First few records from the generated data:"
## X Y
## 1 1.907630 18.94756
## 2 1.949162 19.78539
## 3 2.029797 20.37686
## 4 2.101782 21.28714
## 5 2.000072 19.61935
## 6 1.926412 19.60554
##
## Call:
## lm(formula = Y ~ X, data = simulated_data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76214 -0.33740 0.03615 0.32077 1.63021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.7466 0.5015 19.44 <2e-16 ***
## X 5.1177 0.2490 20.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4887 on 398 degrees of freedom
## Multiple R-squared: 0.5149, Adjusted R-squared: 0.5137
## F-statistic: 422.5 on 1 and 398 DF, p-value: < 2.2e-16
## Mean Squared Error (MSE) with error variance 0.5 and sample size 400: 0.2388269
## X Y
## 1 1.907630 18.35697
## 2 1.949162 19.82497
## 3 2.029797 20.60473
## 4 2.101782 22.06537
## 5 2.000072 19.23834
## 6 1.926412 19.57903
##
## Call:
## lm(formula = Y ~ X, data = simulated_data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5243 -0.6748 0.0723 0.6415 3.2604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4933 1.0029 9.466 <2e-16 ***
## X 5.2355 0.4979 10.514 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9774 on 398 degrees of freedom
## Multiple R-squared: 0.2174, Adjusted R-squared: 0.2154
## F-statistic: 110.5 on 1 and 398 DF, p-value: < 2.2e-16
## Mean Squared Error (MSE) with error variance 1 and sample size 400: 0.9553077
The estimated prediction equation when the error variance is 0.5 is Y = 9.7466 + 5.1177 X. The estimated coefficients are 9.7466 and 5.1177, with standard errors 0.5015 and 0.2490. The t-statistic for the slope is 20.55, with a p-value below 2.2 x 10^-16, and the fitted model's MSE is 0.2388. The null hypothesis states that X does not affect the value of Y (the slope is zero); the alternative states that Y depends on X (the slope is nonzero). Since the p-values are below 0.05, the null hypothesis is rejected.
The estimated prediction equation when the error variance is 1 is Y = 9.4933 + 5.2355 X. The estimated coefficients are 9.4933 and 5.2355, with standard errors 1.0029 and 0.4979. The t-statistic for the slope is 10.514, with a p-value below 2.2 x 10^-16, and the fitted model's MSE is 0.9553. Again, since the p-values are below 0.05, the null hypothesis that X does not affect Y is rejected.
When the error variance was increased from 0.5 to 1, the estimated intercept decreased and the estimated slope increased. The change was substantial at the small sample size (n = 100) but amounted to only a few decimal places at n = 400. As the sample size grows, the standard errors shrink.
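The shrinking standard errors follow from the analytic formula SE(slope) = sigma / sqrt(sum((x_i - x_bar)^2)). A rough sketch using the known simulation parameters (X drawn with sd 0.1); the observed standard errors above differ somewhat because they depend on the realized sample:

# Approximate slope standard error: sigma / (sd(X) * sqrt(n - 1)).
se_slope <- function(n, sigma, sd_x = 0.1) sigma / (sd_x * sqrt(n - 1))
se_slope(100, 0.5)   # ~0.50
se_slope(400, 0.5)   # ~0.25: quadrupling n halves the SE
se_slope(100, 1.0)   # ~1.01: doubling sigma doubles the SE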
## MSE with sample size 100, error variance 0.5: 0.2032934
## MSE with sample size 100, error variance 1: 0.8131737
## MSE with sample size 400, error variance 0.5: 0.2388269
## MSE with sample size 400, error variance 1: 0.9553077
Sample Size | Error Variance | Estimated Intercept | Estimated Slope | MSE | SE of Intercept | SE of Slope |
---|---|---|---|---|---|---|
100 | 0.5 | 9.0218 | 5.5652 | 0.2033 | 0.8336 | 0.4155 |
100 | 1.0 | 8.0436 | 6.1303 | 0.8132 | 1.6673 | 0.8309 |
400 | 0.5 | 9.7466 | 5.1177 | 0.2388 | 0.5015 | 0.2490 |
400 | 1.0 | 9.4933 | 5.2355 | 0.9553 | 1.0029 | 0.4979 |
Conclusion on the effect of error variance and sample size:
Increasing the error variance from 0.5 to 1 (at both sample sizes, 100 and 400) sharply increased the MSE, roughly in proportion to the error variance, since the MSE estimates the variance of the error terms. Although one might expect the MSE to shrink as the sample size grows, the values above did not differ much between n = 100 and n = 400; the main effect of a larger sample size is on the precision of the coefficient estimates, whose standard errors got smaller.
library(dplyr)    # for bind_rows(); broom is used below via broom::tidy()
set.seed(7052)

# Simulate data from the true model Y = 10 + 5X + error, fit a simple
# linear regression, and collect the key statistics.
simulated_regression <- function(n, sigma_error) {
  X <- rnorm(n, mean = 2, sd = 0.1)
  error <- rnorm(n, mean = 0, sd = sigma_error)
  Y <- 10 + 5 * X + error
  # Fit the linear regression and tidy the coefficient table
  model <- lm(Y ~ X)
  coefficients_table <- broom::tidy(model)
  t_statistic <- coefficients_table[2, "statistic"]   # slope t-statistic
  p_value <- coefficients_table[2, "p.value"]         # slope p-value
  mse <- mean(model$residuals^2)                      # mean squared residual
  # Return one row per coefficient, with the simulation settings attached
  return(data.frame(
    n = n,
    sigma_error = sigma_error,
    coefficients = coefficients_table,
    t_statistic = t_statistic,
    p_value = p_value,
    mse = mse
  ))
}
results <- bind_rows(
simulated_regression(100, 0.5),
simulated_regression(100, 1),
simulated_regression(400, 0.5),
simulated_regression(400, 1)
)
print(results)
## n sigma_error coefficients.term coefficients.estimate
## 1 100 0.5 (Intercept) 9.021796
## 2 100 0.5 X 5.565160
## 3 100 1.0 (Intercept) 13.077796
## 4 100 1.0 X 3.460281
## 5 400 0.5 (Intercept) 9.697322
## 6 400 0.5 X 5.160805
## 7 400 1.0 (Intercept) 9.963304
## 8 400 1.0 X 5.033593
## coefficients.std.error coefficients.statistic coefficients.p.value statistic
## 1 0.8336483 10.822065 2.007658e-18 13.395491
## 2 0.4154502 13.395491 7.118010e-24 13.395491
## 3 2.0874900 6.264842 9.978287e-09 3.343580
## 4 1.0349032 3.343580 1.172512e-03 3.343580
## 5 0.5207461 18.621975 4.072601e-56 19.811650
## 6 0.2604935 19.811650 2.797442e-61 19.811650
## 7 1.0317701 9.656516 5.832549e-20 9.749772
## 8 0.5162781 9.749772 2.784618e-20 9.749772
## p.value mse
## 1 7.118010e-24 0.1992276
## 2 7.118010e-24 0.1992276
## 3 1.172512e-03 0.9106786
## 4 1.172512e-03 0.9106786
## 5 2.797442e-61 0.2568544
## 6 2.797442e-61 0.2568544
## 7 2.784618e-20 1.0589855
## 8 2.784618e-20 1.0589855
# True values for intercept and slope
true_intercept <- 10
true_slope <- 5
# Calculating bias for intercept and slope
bias_intercept <- mean(results$coefficients.estimate[results$coefficients.term == "(Intercept)"]) - true_intercept
bias_slope <- mean(results$coefficients.estimate[results$coefficients.term == "X"]) - true_slope
# Calculating variance for intercept and slope
variance_intercept <- var(results$coefficients.estimate[results$coefficients.term == "(Intercept)"])
variance_slope <- var(results$coefficients.estimate[results$coefficients.term == "X"])
# Displaying results
cat("Bias for Intercept:", bias_intercept, "\n")
## Bias for Intercept: 0.4400544
cat("Bias for Slope:", bias_slope, "\n")
## Bias for Slope: -0.1950401
cat("Variance for Intercept:", variance_intercept, "\n")
## Variance for Intercept: 3.249359
cat("Variance for Slope:", variance_slope, "\n")
## Variance for Slope: 0.8549879
Explanation of the impact of sample size and error variance:
As the sample size increases, we expect the variance of the estimated intercept and slope to decrease: with more observations the estimates stabilize around the true values, so their sampling variability shrinks. As the error variance increases, we expect the variance of the estimates to increase, because the points scatter farther from the regression line; the fitted line then depends more heavily on the particular points drawn, and the variability of the estimates is higher.
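A quick Monte Carlo check of these claims under the same data-generating process as simulated_regression(); this sketch, with its reps parameter, is an illustration added here, not part of the original analysis:

# Variance of the slope estimate across 500 replications: it should shrink
# roughly fourfold when n quadruples, and grow fourfold when sigma doubles.
slope_var <- function(n, sigma, reps = 500) {
  slopes <- replicate(reps, {
    X <- rnorm(n, mean = 2, sd = 0.1)
    Y <- 10 + 5 * X + rnorm(n, sd = sigma)
    coef(lm(Y ~ X))[2]
  })
  var(slopes)
}
slope_var(100, 0.5)   # baseline
slope_var(400, 0.5)   # smaller: larger sample
slope_var(100, 1)     # larger: noisier errors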
## Bias of MSE: 0.5964365
## ML estimate of sigma squared: 0.20124
Explanation of the difference between MSE bias and ML estimate:
The bias of the MSE (0.5964) is the difference between the average estimated MSE across the simulations and the true variance of the error terms; it is influenced by the particular random samples that were drawn. The ML estimate of sigma squared (0.2012), by contrast, is computed from the residuals of a single fitted model and estimates the true variance of the error terms directly, giving a picture of the underlying variability in the data as a whole. The MSE remains the usual summary of fit because it measures how well the predictions match the actual data.
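As a minimal sketch of the mechanical difference, assuming the simulated_data model from above: the unbiased MSE divides the residual sum of squares by the residual degrees of freedom (n - 2 in simple regression), while the ML estimate divides by n. Note that mean(model$residuals^2) in simulated_regression() above is the ML version.

# Sketch: unbiased vs. maximum-likelihood estimates of the error variance.
fit <- lm(Y ~ X, data = simulated_data)
n   <- nrow(simulated_data)
sse <- sum(residuals(fit)^2)

sse / (n - 2)   # unbiased MSE; equals summary(fit)$sigma^2
sse / n         # ML estimate; smaller by the factor (n - 2) / n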