1. Find some (x, y) data where you think x and y are strongly related. Make a scatterplot and find the least-squares and resistant fits. Plot the two sets of residuals. Interpret the fit and the residuals. Contrast the two methods of fitting – is it better to fit a resistant line?
college.ratings <- read.delim("~/data/college.ratings.txt")
data<-college.ratings[,c(4,5)]
(p <- ggplot(data, aes(Reputation, F.retention)) + geom_point())

We start by plotting these data on a scatterplot where the X variable is the Reputation of colleges and the Y variable is the Retention of the college. Here we see a positive trend in this graph which makes sense – higher reputation colleges tend to have higher retention - The obvious pattern shows the retention increased by reputation. Also, this scatter plot looks pretty nice, spots spreads quite evenly in graph and we can see a little right skewed in Reputation since most of the spots fall down before 4.0.

fit <- lm(F.retention ~ Reputation, data)
summary(fit)
## 
## Call:
## lm(formula = F.retention ~ Reputation, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.198754 -0.033945  0.004879  0.036055  0.136436 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.493703   0.015422   32.01   <2e-16 ***
## Reputation  0.105190   0.004984   21.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0569 on 247 degrees of freedom
## Multiple R-squared:  0.6433, Adjusted R-squared:  0.6418 
## F-statistic: 445.4 on 1 and 247 DF,  p-value: < 2.2e-16
data %>% 
  mutate(Predicted = predict(fit),
         Residual = residuals(fit)) -> data
p + geom_line(data = data, aes(Reputation, Predicted)) +
  ggtitle("Scatterplot of Reputation and Retention with Least Squares Fit")

The fitted line we got in least square fit is Y = 0.493703+0.105190X. We’ll compare it with the resistant line later. The slope is 0.105190 which means that, the mean retention of the colleges in United States increase by about 0.105190 unit for every 1 unit increasing in reputation.

Rfit <- rline(F.retention ~ Reputation, data)
data %>% 
  mutate(Predicted_new = Rfit$a + Rfit$b * (Reputation- Rfit$xC),
         Residual_new = Rfit$residual) ->data
ggplot(data, aes(Reputation, F.retention)) +
  geom_point() +
  geom_line(aes(Reputation, Predicted_new)) +
  ggtitle("Scatterplot of Reputation and Retention with Resistant Fit")

Rfit$a
## [1] 0.7935897
Rfit$b
## [1] 0.1384615
Rfit$xC
## [1] 2.8

So the three-group line is y = 0.7935897+0.1384615(x − 2.8) = 0.4058975 + 0.1384615x. This line is graphed on the scatterplot above. Similarly, every 1 unit increase in reputation will lead to 0.1384615 units increase in mean retention.

ggplot(data, aes(Reputation, Residual)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  ggtitle("Residual Plot with Least Squares Fit")

ggplot(data, aes(Reputation, Residual_new)) +
  geom_point() +geom_hline(yintercept = 0) +
  ggtitle("Residual Plot VS Reputation from Resistant Fit")

Although these two residual plots have almost the same pattern, we still can notice that the obviously difference within range 4-5. It seems that the whole pattern in least squares fit is much closer to the “0” mean, which shows this model fits better than the resistant fit. As we scan from left to right within thin vertical strips, the vertical mean seems satisfactory before 4 and negative in the right side especially between range 4-5 in resistant fit and 4.5-5 in least square fit. It means the fitted value is smaller than the observed value.

  1. Fit a resistant line to the following data (population of England and Wales in millions for different years). Data is stored in the file pop.england in the LearnEDA package. Plot the residuals and summarize the results.
pop.england <- read.delim("~/data/pop.england.txt")
Data<-pop.england
rfit <- rline(POPULATION ~ YEAR, pop.england)

## [1] 23.15333
## [1] 0.2674444
## [1] 1866

We see significant curvature in the scatter plot. The three-group line is y = 23.15333+0.2674444(x − 1866) = -475.8979 + 0.2674444x. The slope is 0.2674444 which means that,during the period of 1801-1931, the population of the British increases by about 0.2674444 million each year.

We can look more carefully at British population growth by looking at the pattern in the residual graph. Residual values above 0 correspond to years with higher than average growth, and negative values correspond to years with less than average growth. The years 1801 (exceed 30 percent),1811 and between 1851 and 1881 are notable — the yearly enrollment growth during this period exceeded 10 percent.

  1. For the following problems (from Tukey, EDA),
  1. straighten plot using transformations applied on summary points
  2. fit a line to the transformed plots and plot the residuals
  3. summarize all results
tukey.26a <- read.delim("C:/Users/ylu_local/Desktop/5470/LearnEDAfunctions-master/LearnEDAfunctions-master/data/tukey.26a.txt")
(p <- ggplot(tukey.26a, aes(temp, rate)) + geom_point())

data<-tukey.26a[c(1:8),]
plot(data)

summary.points <- data.frame(x = c(22, 24.5, 27),
                             y = c(314.6667, 517, 814.3333))
straightening.work(summary.points, 1, 1)
##           x        y   tx       ty     slope half.slope.ratio
## Left   22.0 314.6667 21.0 313.6667  80.93332         1.469522
## Center 24.5 517.0000 23.5 516.0000 118.93332               NA
## Right  27.0 814.3333 26.0 813.3333        NA               NA
straightening.work(summary.points, 2, 0.5)
##           x        y      tx       ty     slope half.slope.ratio
## Left   22.0 314.6667 241.500 33.47769 0.1720013         1.047433
## Center 24.5 517.0000 299.625 43.47527 0.1801597               NA
## Right  27.0 814.3333 364.000 55.07305        NA               NA
new.x <- data$temp ^ (2)
new.y <- data$rate ^ (0.5)
plot(new.x, new.y)

Dat<-data.frame(x=new.x,y=new.y)
fit<-lm(new.y~new.x,Dat)
plot(new.x,fit$residuals)
abline(h=0)

We can see there are 2 components in this scatter plot, the first half looks obviously curve that we have to reexpress and the second half looks quite linear that we would not adjust. Then subsample the first group as “data”, we got a slight curved relationship between x and y. We use three summary points to measure the curvature in the plot and help us choose an appropriate reexpression to straighten. Since the bulge in this curvature is towards small y and large x, we try reexpressing x by a square (moving up the ladder) and reexpressing y by a root. Luckily,the half-slope ratio is now 1.047433 after first trying. Finally, we got a really straight scatter plot which shows a linear relationship between new.x and new.y and a perfect residual plot which cannot see any obvious curvature or pattern.

tukey.26b <- read.delim("C:/Users/ylu_local/Desktop/5470/LearnEDAfunctions-master/LearnEDAfunctions-master/data/tukey.26b.txt")
(p <- ggplot(tukey.26b, aes(year, deposits)) + geom_point())

summary.points <- data.frame(x = c(41.5, 51, 60.5),
                             y = c(746.2, 1209.444, 2611.3))
straightening.work(summary.points, 1, 1)
##           x        y   tx       ty     slope half.slope.ratio
## Left   41.5  746.200 40.5  745.200  48.76253         3.026172
## Center 51.0 1209.444 50.0 1208.444 147.56379               NA
## Right  60.5 2611.300 59.5 2610.300        NA               NA
straightening.work(summary.points, 4, 0.15)
##           x        y      tx       ty        slope half.slope.ratio
## Left   41.5  746.200  741536 11.31529 1.422377e-06         1.003228
## Center 51.0 1209.444 1691300 12.66622 1.426968e-06               NA
## Right  60.5 2611.300 3349357 15.03221           NA               NA
new.x <- tukey.26b$year ^ (4)
new.y <- tukey.26b$deposits ^ (0.15)
plot(new.x, new.y)

Dat<-data.frame(x=new.x,y=new.y)
fit<-lm(new.y~new.x,Dat)
plot(new.x,fit$residuals)
abline(h=0)

In this scatter plot, we can see an obvious curve in the middle. We use three summary points to measure the curvature in the plot and help us choose an appropriate reexpression to straighten. Since the bulge in this curvature is towards small y and large x, we try reexpressing x by a square (moving up the ladder) and reexpressing y by a root. After first trying, the half-slope ratio is still high. We keep trying smaller y and larger x. When transforming x to the 4 power and y to the 0.15 power, the half-slope ratio is 1.003228. Then plot the new x with new y, we got a much straighter line. The residual plot is also acceptable and most of the spots are close to the 0 mean line.

tukey.26c <- read.delim("C:/Users/ylu_local/Desktop/5470/LearnEDAfunctions-master/LearnEDAfunctions-master/data/tukey.26c.txt")
(p <- ggplot(tukey.26c, aes(year, miles)) + geom_point())

summary.points <- data.frame(x = c(40.5, 48.5, 56.5),
                             y = c(1155.25, 7406.25, 22897))
straightening.work(summary.points, 1, 1)
##           x        y   tx       ty    slope half.slope.ratio
## Left   40.5  1155.25 39.5  1154.25  781.375         2.478124
## Center 48.5  7406.25 47.5  7405.25 1936.344               NA
## Right  56.5 22897.00 55.5 22896.00       NA               NA
straightening.work(summary.points, 2, 0.5)
##           x        y       tx        ty     slope half.slope.ratio
## Left   40.5  1155.25  819.625  65.97794 0.2925315         1.062287
## Center 48.5  7406.25 1175.625 170.11914 0.3107523               NA
## Right  56.5 22897.00 1595.625 300.63509        NA               NA
new.x <- tukey.26c$year ^ (2)
new.y <- tukey.26c$miles ^ (0.5)
plot(new.x, new.y)

Dat<-data.frame(x=new.x,y=new.y)
fit<-lm(new.y~new.x,Dat)
plot(new.x,fit$residuals)
abline(h=0)

Similarly, this plot also shows similar direction of the bulge in the curvature. We try reexpressing x by a square (moving up the ladder) and reexpressing y by a root and got a quite straight line in scatter plot which shows a strong linear relationship between new.x and new.y and a nice residual plot which cannot see any obvious curvature or pattern.