Question 1

Use the stat500 dataset from the “faraway” package, which records marks from one year of the Statistics 500 course at the University of Michigan, and answer the following questions.

1. Draw a scatter plot of midterm and final.

library("faraway") # load the package
## Warning: package 'faraway' was built under R version 3.6.3
grade_data <- faraway::stat500 # assign the dataset to grade_data
plot(x=grade_data$midterm, y=grade_data$final,col="blue",main="Scatter plot of midterm and final grades",xlab="Midterm grade",ylab="Final grade") # draw the scatter plot

2. Determine a line to represent the data by guessing the intercept and slope.

plot(x=grade_data$midterm, y=grade_data$final,col="blue",main="Scatter plot of midterm and final grades",xlab="Midterm grade",ylab="Final grade")
legends_coord <- locator(2) # click two positions on the plot and store their coordinates in legends_coord
x <- legends_coord$x # x-coordinates of the two points
y <- legends_coord$y # y-coordinates of the two points
segments(9.505664, 15.81806, 29.406350, 36.62465, col= 'blue') # connect the two points (hard-coded coordinates)
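
The two endpoints passed to segments() pin down the guessed line, so the guessed slope and intercept can be read off from them. A minimal sketch, reusing the coordinates above; the names guess_slope and guess_intercept are introduced here for illustration only:

```r
# Endpoints of the guessed line (the coordinates passed to segments() above)
x1 <- 9.505664;  y1 <- 15.81806
x2 <- 29.406350; y2 <- 36.62465

# Slope and intercept of the straight line through the two points
guess_slope     <- (y2 - y1) / (x2 - x1)
guess_intercept <- y1 - guess_slope * x1

c(intercept = guess_intercept, slope = guess_slope)
```

With these endpoints the guess works out to a slope of roughly 1.05 and an intercept of roughly 5.88.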

3. Estimate the simple linear regression line.

result <- lm(final~midterm,data=grade_data) # fit the simple linear regression model
summary(result) # results of the model fit
## 
## Call:
## lm(formula = final ~ midterm, data = grade_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.932 -2.657  0.527  2.984  9.286 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.0462     2.4822   6.062 1.44e-07 ***
## midterm       0.5633     0.1190   4.735 1.67e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.192 on 53 degrees of freedom
## Multiple R-squared:  0.2973, Adjusted R-squared:  0.284 
## F-statistic: 22.42 on 1 and 53 DF,  p-value: 1.675e-05
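
For a simple linear regression, the coefficients reported by lm() have a closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) minus the slope times mean(x). A small sketch that checks this against lm(), using made-up toy data (not stat500) so that it runs without the faraway package; the names toy, b0, b1 and fit are assumptions for illustration:

```r
# Hypothetical toy data, for illustration only (not the stat500 marks)
toy <- data.frame(midterm = c(10, 15, 20, 25, 30),
                  final   = c(22, 21, 28, 27, 34))

# Closed-form OLS estimates for simple linear regression
b1 <- cov(toy$midterm, toy$final) / var(toy$midterm)  # slope
b0 <- mean(toy$final) - b1 * mean(toy$midterm)        # intercept

# Compare with the coefficients computed by lm()
fit <- lm(final ~ midterm, data = toy)
all.equal(unname(coef(fit)), c(b0, b1))
```

Applied to grade_data, the same two formulas should reproduce the estimates in the summary above (15.0462 and 0.5633).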

4. Superimpose your guessed and estimated lines on the plot.

dev.new() # open a new graphics device (windows() is available only on Windows)
plot(x=grade_data$midterm, y=grade_data$final,col="blue",main="Scatter plot of midterm and final grades",xlab="Midterm grade",ylab="Final grade")
segments(9.505664, 15.81806, 29.406350, 36.62465, col= 'blue') # guessed line
abline(result,col="red") # OLS fitted line
legend(x=c(8,18),y=c(35,38),legend=c("Guessed line","OLS fitted line"),col=c("blue","red"),lty=c(1,1),cex=0.8)

5. Verify that the SSE computed from the line we chose intuitively is no smaller than the SSE computed from the OLS fit.

f_x<-function(x){15.81806-((36.62465-15.81806)/(29.406350-9.505664))*9.505664+((36.62465-15.81806)/(29.406350-9.505664))*x} # equation of the line through the two points
estimate_final_grade<-f_x(grade_data$midterm) # treat the line as an estimator of the final grade: plug in the midterm grades to get estimated final grades
SSE_based_on_f_x<-sum((grade_data$final-f_x(grade_data$midterm))^2) # sum of squared residuals; the original run squared the sum instead and printed 1207.75293089733, so the value is elided below
SSE_based_on_OLS<-sum((result$residuals)^2)
paste0("SSE_based_on_f_x = ", SSE_based_on_f_x,"   SSE_based_on_OLS = ",SSE_based_on_OLS) 
## [1] "SSE_based_on_f_x = ...   SSE_based_on_OLS = 931.285368085807"
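
Because OLS minimizes the sum of squared residuals, no other line can have a smaller SSE on the same data. A hedged sketch of that check follows; sse_line is a hypothetical helper, and the toy data and the guessed coefficients are made up so that the block runs on its own:

```r
# Hypothetical helper: SSE of the line y = intercept + slope * x
sse_line <- function(intercept, slope, x, y) {
  sum((y - (intercept + slope * x))^2)  # square each residual, then sum
}

# Made-up toy data, for illustration only (not the stat500 marks)
toy_x <- c(10, 15, 20, 25, 30)
toy_y <- c(22, 21, 28, 27, 34)

fit_toy   <- lm(toy_y ~ toy_x)
sse_ols   <- sse_line(coef(fit_toy)[1], coef(fit_toy)[2], toy_x, toy_y)
sse_guess <- sse_line(5.88, 1.05, toy_x, toy_y)  # an arbitrary guessed line

sse_guess >= sse_ols  # any non-OLS line has an SSE at least as large
```

The same helper can be called with grade_data$midterm and grade_data$final to compare the guessed line with the OLS fit.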