Question 1
1a Start with a basic exploratory data analysis. Show summary statistics of the response variable and predictor variable.
summary(alumni$alumni_giving_rate) # response variable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 18.75 29.00 29.27 38.50 67.00
summary(alumni$percent_of_classes_under_20) # predictor variable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 44.75 59.50 55.73 66.25 77.00
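1b What is the nature of the variables X and Y? Are there outliers in the data? How might you define an outlier in this case?
Both X (percent of classes under 20) and Y (alumni giving rate) are continuous variables. Judging by eye from the scatter plot, the only candidate outliers are observations with X below 35 or above 70. An outlier in this case could be defined as a point that lies far from the bulk of the data or far from the fitted regression line.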
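One way to make the outlier definition concrete is the 1.5 x IQR rule (Tukey's fences). The sketch below applies it to the quartiles and ranges printed by summary() above. The helper name iqr_fences is hypothetical, and the choice of rule is an assumption rather than something the assignment specifies. Under this rule neither variable has a flagged observation, so the extreme points are unusual rather than formally outlying.

```r
# Quartiles copied from the summary() output above
# (x = percent_of_classes_under_20, y = alumni_giving_rate).
q_x <- c(q1 = 44.75, q3 = 66.25)
q_y <- c(q1 = 18.75, q3 = 38.50)

# Hypothetical helper: fences at k * IQR beyond the quartiles.
iqr_fences <- function(q, k = 1.5) {
  iqr <- unname(q["q3"] - q["q1"])
  c(lower = unname(q["q1"]) - k * iqr, upper = unname(q["q3"]) + k * iqr)
}

fences_x <- iqr_fences(q_x)  # lower = 12.5,    upper = 98.5
fences_y <- iqr_fences(q_y)  # lower = -10.875, upper = 68.125

# Observed minimum and maximum from the same summary() output.
range_x <- c(29, 77)
range_y <- c(7, 67)

# TRUE when at least one observation falls outside the fences.
any_flagged_x <- range_x[1] < fences_x["lower"] || range_x[2] > fences_x["upper"]
any_flagged_y <- range_y[1] < fences_y["lower"] || range_y[2] > fences_y["upper"]
```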
What is the correlation coefficient? Draw a scatter plot. Any major comments about the data?
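The correlation coefficient is about 0.65 (95% confidence interval 0.44 to 0.79, p-value = 7.228e-07). This is a moderately strong positive correlation: schools with a higher percentage of classes with fewer than 20 students tend to have a higher alumni giving rate.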
cor.test(alumni$alumni_giving_rate, y = alumni$percent_of_classes_under_20)
##
## Pearson's product-moment correlation
##
## data: alumni$alumni_giving_rate and alumni$percent_of_classes_under_20
## t = 5.7344, df = 46, p-value = 7.228e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4427365 0.7856553
## sample estimates:
## cor
## 0.6456504
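The t statistic reported by cor.test() can be recovered by hand from the correlation and the degrees of freedom, using t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below is only a check on the printed output; the variable names are illustrative.

```r
# Values copied from the cor.test() output above.
r   <- 0.6456504   # sample correlation
dof <- 46          # degrees of freedom, n - 2

# t statistic for H0: true correlation = 0.
t_stat <- r * sqrt(dof) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = dof)

round(t_stat, 4)   # compare with t = 5.7344 in the output
```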
1c Fit a simple linear regression to the data. What is your estimated regression equation?
The estimated regression equation is Y = -7.386 + 0.6578x
linear <- lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
data = alumni)
summary(linear)
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
## data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.053 -7.158 -1.660 6.734 29.658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.3861 6.5655 -1.125 0.266
## percent_of_classes_under_20 0.6578 0.1147 5.734 7.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared: 0.4169, Adjusted R-squared: 0.4042
## F-statistic: 32.88 on 1 and 46 DF, p-value: 7.228e-07
coef(linear)
## (Intercept) percent_of_classes_under_20
## -7.3860676 0.6577687
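1d Interpret your results (e.g., how would you interpret the slope in this application?)
The estimated slope is 0.6578: each additional percentage point of classes with fewer than 20 students is associated with an increase of about 0.66 percentage points in the expected alumni giving rate. The slope is highly significant (p = 7.23e-07), and the model explains about 42% of the variation in giving rate (R-squared = 0.4169). The intercept (-7.386) is not significantly different from zero (p = 0.266) and has no practical meaning, because no school in the data has X near 0 (the minimum is 29).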
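To see what the slope means in practice, the fitted equation can be evaluated at two values of X. This is a minimal sketch using the coefficients printed by coef(linear). The helper predict_giving_rate and the inputs 50 and 60 are illustrative choices inside the observed range of X (29 to 77), not values from the assignment.

```r
# Coefficients copied from the coef(linear) output above.
b0 <- -7.3860676
b1 <-  0.6577687

# Hypothetical helper: fitted giving rate for a given percent of classes under 20.
predict_giving_rate <- function(pct_under_20) b0 + b1 * pct_under_20

rate_50 <- predict_giving_rate(50)
rate_60 <- predict_giving_rate(60)

# A 10-point increase in X shifts the fitted value by 10 * b1.
rate_60 - rate_50
```

With the alumni data loaded, predict(linear, newdata = data.frame(percent_of_classes_under_20 = c(50, 60))) should give the same fitted values.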
2 A Simulation Study (Simple Linear Regression). Assuming the mean response is E(Y|X) = 10 + 5x
set.seed(7052)
x <- rnorm(100, mean = 2, sd = .1)
y <- rnorm(100, mean = 10 + 5*x, sd = 0.5)
lmline <- cbind(x,y)
summary(lmline)
## x y
## Min. :1.725 Min. :18.09
## 1st Qu.:1.923 1st Qu.:19.67
## Median :2.001 Median :20.11
## Mean :2.004 Mean :20.17
## 3rd Qu.:2.070 3rd Qu.:20.70
## Max. :2.243 Max. :21.80
Show summary statistics of the response variable and predictor variable. Are there outliers? What is the correlation coefficient? Draw a scatter plot
There are only a few possible outliers, near y = 21. The correlation coefficient is 0.8042.
cor.test(x, y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 13.395, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7218233 0.8641361
## sample estimates:
## cor
## 0.8042198
plot(x,y,pch=20)
abline(lm(y ~ x), lwd = 1)
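Fit a simple linear regression. What is the estimated model? Report the estimated coefficients. What is the model mean squared error (MSE)?
The estimated model is y = 9.0218 + 5.5652x, with estimated intercept 9.0218 and estimated slope 5.5652. The MSE is 0.2033, which is the square of the residual standard error (0.4509).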
fit <- lm(y ~ x)
df <- data.frame(cbind(x, y))
library(ggplot2) # required for ggplot(); presumably loaded earlier in the original session
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2073 -0.3029 0.0093 0.3033 1.3545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0218 0.8336 10.82 <2e-16 ***
## x 5.5652 0.4155 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
## F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
sigma(fit)
## [1] 0.4508807
sigma(fit)^2
## [1] 0.2032934
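The MSE above is the residual sum of squares divided by n - 2, since two coefficients are estimated. The sketch below re-creates the simulation with the same seed and checks that the manual calculation agrees with sigma(fit)^2. For comparison, the true error variance in the simulation is 0.5^2 = 0.25.

```r
# Re-create the simulated data with the same seed used above.
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)
fit <- lm(y ~ x)

# SSE: sum of squared residuals.
sse <- sum(resid(fit)^2)

# MSE divides by the residual degrees of freedom, n - 2.
mse <- sse / df.residual(fit)

# Should agree with sigma(fit)^2 up to floating-point error.
all.equal(mse, sigma(fit)^2)
```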
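What is the sample mean of both X and Y? Plot the fitted regression line and the point (mean of X, mean of Y). What do you find?
The sample means are about 2.004 for X and 20.17 for Y. The point of means lies exactly on the fitted regression line, which is true of any least-squares line fitted with an intercept.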
meanx <- mean(x)
meany <- mean(y)
dataframe.mean2 <- data.frame(cbind(meanx, meany))
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # give this layer its own single-row data; color is set outside aes() so the point is drawn in red
  geom_point(data = dataframe.mean2, aes(x = meanx, y = meany), color = "red")
## `geom_smooth()` using formula = 'y ~ x'
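The position of the point of means relative to the fitted line follows directly from the formula for the OLS intercept. A short derivation, writing the estimates as \hat\beta_0 and \hat\beta_1 (notation assumed here, not taken from the assignment):

```latex
\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}
\quad\Longrightarrow\quad
\hat{y}(\bar{x}) = \hat\beta_0 + \hat\beta_1 \bar{x}
                 = (\bar{y} - \hat\beta_1 \bar{x}) + \hat\beta_1 \bar{x}
                 = \bar{y}
```

So the fitted value at the mean of X is the mean of Y for any data set, not only for this simulation.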
3 Ordinary least squares (OLS) is typically used to estimate the regression coefficients β0 and β1 in the simple linear regression model by minimizing the residual sum of squares (RSS).
Minimizing the plain sum of residuals does not work because positive and negative residuals cancel each other out. Any line that passes through the point of means has residuals that sum to zero, however badly it fits, so this criterion cannot single out one best-fitting line.
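The cancellation can be checked numerically. The sketch below reuses the simulated data from Question 2 (same seed) and sums the raw residuals for several lines through the point of means. The helper resid_sum_through_means and the slopes tried are illustrative.

```r
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)

# Hypothetical helper: sum of raw residuals for the line through
# (mean(x), mean(y)) with an arbitrary slope.
resid_sum_through_means <- function(slope) {
  intercept <- mean(y) - slope * mean(x)
  sum(y - (intercept + slope * x))
}

# Very different slopes, including obviously poor ones.
slopes <- c(-100, 0, 5, 1000)
sums <- sapply(slopes, resid_sum_through_means)
sums   # all zero up to floating-point rounding
```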
Minimizing the sum of absolute residuals is least absolute deviations (LAD) regression. This criterion does produce a fitted line, but the objective is not differentiable everywhere, so there is no closed-form solution for the coefficients. Instead of simple calculations, it requires numerical methods (for example, linear programming) to find the coefficients, and the solution may not be unique.
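To illustrate the numerical search that LAD requires, the sketch below minimizes the sum of absolute residuals with optim() on the simulated data from Question 2. Nelder-Mead is used here only because it is derivative-free; it is a rough search, not an exact LAD solver. A dedicated routine such as quantreg::rq() with tau = 0.5 would be the more usual choice if that package is available.

```r
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)

# LAD objective: sum of absolute residuals for coefficients b = (b0, b1).
lad_loss <- function(b) sum(abs(y - b[1] - b[2] * x))

# Start the search from the OLS estimates.
ols_coef <- unname(coef(lm(y ~ x)))

# Derivative-free search, so the kinks in abs() are not a problem.
lad_fit <- optim(par = ols_coef, fn = lad_loss, method = "Nelder-Mead")

lad_fit$par                                        # approximate LAD intercept and slope
c(ols = lad_loss(ols_coef), lad = lad_fit$value)   # absolute-loss comparison
```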
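OLS is popular because, under the Gauss-Markov assumptions, it gives the best linear unbiased estimators (BLUE): among all linear unbiased estimators, the OLS estimators have the smallest variance, which makes them reliable and accurate. The coefficients also have a simple closed-form solution, and the fitted values and residuals are easy to compute and plot. This makes OLS computationally much easier than the other two criteria.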
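The closed-form OLS solution can be written in a few lines. The sketch below computes the slope as Sxy / Sxx and the intercept as mean(y) - slope * mean(x) on the simulated data from Question 2, then compares the result with lm(). Variable names are illustrative.

```r
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)

# Closed-form OLS estimates for simple linear regression.
sxy <- sum((x - mean(x)) * (y - mean(y)))
sxx <- sum((x - mean(x))^2)
b1 <- sxy / sxx
b0 <- mean(y) - b1 * mean(x)

# Compare with lm(); the two should match up to floating-point error.
manual  <- c(b0, b1)
from_lm <- unname(coef(lm(y ~ x)))
all.equal(manual, from_lm)
```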
4 Establish the following relationships for the simple linear regression model. a. b. c.
knitr::include_graphics("C:/Users/maens/OneDrive/Desktop/BANA 7052/4 answers a, b, c.jpg")
knitr::include_graphics("C:/Users/maens/OneDrive/Desktop/BANA 7052/answer d.jpg")
e.
knitr::include_graphics("C:/Users/maens/OneDrive/Desktop/BANA 7052/answer e.jpg")
f. Based on the definition of SSE, the quantity is already minimized. SSE is the
sum of squared differences between the observed values yi and the fitted
values yhati, and the OLS estimates are chosen precisely to make this sum as
small as possible. The fitted line therefore already meets the criterion for
the best-fitting line.
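A sketch of why the OLS estimates minimize SSE, assuming the usual simple linear regression notation (the symbols b_0, b_1, S_{xy}, and S_{xx} are introduced here and may differ from those in the assignment):

```latex
\mathrm{SSE}(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2

\frac{\partial\,\mathrm{SSE}}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0
\qquad
\frac{\partial\,\mathrm{SSE}}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0

\hat\beta_1 = \frac{S_{xy}}{S_{xx}}
            = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}
```

Because SSE is a convex quadratic function of (b_0, b_1), the point where both partial derivatives vanish is its minimum, so evaluating SSE at the fitted values gives the smallest attainable value.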