1. (3.12) A student does not understand why the sum of squares SSPE is called a pure error sum of squares “since the formula looks like the one for an ordinary sum of squares”. Explain.
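One way to see the distinction is to write the formula out. With replicate responses \(Y_{1j}, \ldots, Y_{n_j j}\) observed at the \(j\)-th distinct \(X\) level,

\[ SSPE = \sum_{j} \sum_{i} \left( Y_{ij} - \bar{Y}_j \right)^2 . \]

The form is indeed that of an ordinary sum of squares, but each deviation is taken around the mean \(\bar{Y}_j\) of the replicates at the same \(X\) value rather than around a fitted value. No regression function, however flexible, can reduce this variability, so it measures "pure" experimental error.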
  1. (3.19) A student fitted a linear regression function for a class assignment. The student plotted the residuals \(e_i\) against the responses \(Y_i\) and found a positive relation. When the residuals were plotted against the fitted values \(\hat{Y}_i\), the student found no relation.
  1. How could the difference arise? Which is the more meaningful plot? (See the simulation sketch below.)
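Since \(Y_i = \hat{Y}_i + e_i\) and least squares makes the residuals uncorrelated with the fitted values, the residuals inherit a positive correlation with the responses: \(\mathrm{cov}(e, Y) = \mathrm{cov}(e, \hat{Y}) + \mathrm{var}(e) = \mathrm{var}(e) > 0\). A small simulation on hypothetical data (the seed and coefficients are arbitrary) illustrates this; the plot against \(\hat{Y}_i\) is the meaningful diagnostic.

# hypothetical simulated data: residuals are uncorrelated with the
# fitted values by construction, but positively correlated with the
# observed responses
set.seed(1)                        # arbitrary seed, for reproducibility
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
cor(resid(fit), fitted(fit))       # essentially zero
cor(resid(fit), y)                 # clearly positive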
  1. (Computer project, 3.3) Refer to the GPA data from the previous homework assignments.
# read the data and give the columns descriptive names
library(tidyverse)   # for %>% and rename()
gpa <- read.table("./data/CH01PR19.txt")
gpa <- gpa %>%
  rename(GPA = V1, ACT = V2)
  1. Plot residuals \(e_i\) against the fitted values \(\hat{Y}_i\). What departures from the standard regression assumptions can be detected from this plot?

To begin with, recall the assumptions of simple linear regression: (1) independent observations; (2) normally distributed errors; (3) equal error variances; (4) no influential outliers; (5) a linear association between the mean of \(Y\) and \(X\). The residuals are \(e_i = Y_i - \hat{Y}_i\).

Based on the diagnostic plots below, we can conclude:

- Residuals vs Fitted (\(e_i\) vs \(\hat{Y}_i\)): the pattern indicates the linearity assumption is reasonable.
- Normal Q-Q: there are some potential outliers in both the upper and lower tails.
- Scale-Location: checks whether the variance is constant; here the spread is not perfectly constant.
- Residuals vs Leverage: there appear to be no influential observations.

# fit the regression and show the four standard diagnostic plots
reg <- lm(GPA ~ ACT, data = gpa)
par(mfrow = c(2, 2))
plot(reg)

  1. Prepare a Normal Q-Q plot of the residuals and use it to comment on whether the data passes or fails the assumption of normality. Conduct the Shapiro-Wilk test for normality.
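A standalone Q-Q plot can be drawn from the studentized residuals (a sketch; the corresponding panel of plot(reg) above shows essentially the same picture):

qqnorm(rstudent(reg))   # sample quantiles vs normal quantiles
qqline(rstudent(reg))   # reference line through the quartiles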
shapiro.test(rstudent(reg))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstudent(reg)
## W = 0.94176, p-value = 5.638e-05

Since the p-value is far below 0.05, we reject the null hypothesis of normality: the residuals fail the normality assumption, consistent with the outliers seen in the Q-Q plot.
  1. Test whether the residuals in this regression analysis have constant variance. The ncvTest p-value is 0.42397, so we do not have sufficient evidence of non-constant variance.
# install.packages("car")
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
ncvTest(reg)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.6392758, Df = 1, p = 0.42397
  1. Conduct the lack-of-fit test and state your conclusion. The test gives a high p-value of 0.6324, so we fail to reject the null hypothesis that the linear regression function is adequate: there is no evidence of lack of fit (see the sketch below).
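The quoted p-value should be reproducible with the same reduced-versus-full comparison used for the crime data below (a sketch, assuming replicated ACT values, which arise naturally since ACT scores are integers and many students share the same score):

# lack-of-fit test: compare the linear model to the model with one
# mean per distinct ACT level (the most general regression function)
reduced <- lm(GPA ~ ACT, data = gpa)              # assumes linearity
full <- lm(GPA ~ as.factor(ACT), data = gpa)      # no linearity assumption
anova(reduced, full)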
  1. (Computer project, ) Crime rate data set is available on our Blackboard site. A criminologist studies the relationship between level of education and crime rate in medium-sized U.S. counties. She collected data from a random sample of 84 counties; X is the percentage of individuals in the county having at least a high-school diploma, and Y is the crime rate (crimes reported per 100,000 residents) last year.

Read the data first.

# col_names describes the column names (equivalent to header in base R).
# It has three possible values:
#   TRUE will use the first row of data as column names.
#   FALSE will number the columns sequentially.
# crime <- read_delim("./data/crimeRateData.txt", delim = " ", col_names = FALSE)
# crime

crime <- read.table("./data/crimeRateData.txt")
crime <- crime %>%
  rename(crime = V1, diploma = V2)

A linear regression of Y on X is then fitted to these data. Test: (a) normal distribution of residuals;

reg2 <- lm(crime ~ diploma, data = crime)
# index plot of the studentized residuals
plot(rstudent(reg2))

par(mfrow = c(2, 2))
plot(reg2)

shapiro.test(rstudent(reg2))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstudent(reg2)
## W = 0.97782, p-value = 0.1559

With a p-value of 0.1559 > 0.05, we fail to reject the null hypothesis: there is no evidence against normally distributed residuals.
  1. constant variance of residuals;
ncvTest(reg2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.005045022, Df = 1, p = 0.94338

With p = 0.94338, there is no evidence of non-constant variance.
  1. presence of outliers;
outlierTest(reg2)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##    rstudent unadjusted p-value Bonferroni p
## 27 3.048092          0.0031098      0.26122

Observation 27 has the largest studentized residual, but its Bonferroni-adjusted p-value (0.26122) is well above 0.05, so there are no significant outliers.
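As a check on the adjustment, the Bonferroni p-value is just the unadjusted two-sided p-value multiplied by the number of tests, here the sample size: \(84 \times 0.0031098 \approx 0.26122\).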
  1. lack of fit.
reduced <- lm(crime ~ diploma, data = crime)          # assumes linearity
full <- lm(crime ~ as.factor(diploma), data = crime)  # does NOT assume linearity
anova(reduced, full)  # F test comparing the two models

A large p-value in this comparison would mean no evidence of lack of fit; a small one would indicate that the linear form is inadequate.
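For reference, the statistic anova() computes here is the lack-of-fit F from the decomposition \(SSE = SSPE + SSLF\):

\[ F = \frac{SSLF/(c-2)}{SSPE/(n-c)}, \]

where \(c\) is the number of distinct \(X\) values and \(n = 84\). Under \(H_0\) (the linear regression function is adequate) this follows an \(F_{c-2,\, n-c}\) distribution.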
  1. For the “toy” example, consider a small data set. Try to do as much as you can by hand, without the use of a computer; the numbers are quite simple!
  1. Plot these data and draw the least squares regression line, which has the expression \(\hat{y} = 1 + x\).

  1. Compute all the residuals \(e_i = Y_i - \hat{Y}_i\).

  1. Compute all sums of squares by hand, from their definitions:
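For reference, the standard definitions are

\[ SSTO = \sum_i (Y_i - \bar{Y})^2, \qquad SSE = \sum_i (Y_i - \hat{Y}_i)^2, \qquad SSR = \sum_i (\hat{Y}_i - \bar{Y})^2, \]

and they satisfy the decomposition \(SSTO = SSR + SSE\).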