In this R activity, you’ll graphically add an outlying point and see the impact of this point on the least-squares fit. You’ll also explore three resistant fitting procedures for fitting a line to data with outlying points.

  1. Install the package TeachingDemos into R and load it.

# install.packages("TeachingDemos")  # run once to install
library(TeachingDemos)
  2. Define the following (x, y) data and run the function put.points.demo on this data:

library(TeachingDemos)
x = c(2, 2, 4, 5, 6, 7, 8, 9, 10)
y = c(7, 8, 6, 7, 4, 6, 4, 6, 3)
put.points.demo(x, y)

Record the equation of the least-squares line below:

y = 8.25 + (-0.44)x
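We can check this equation with an ordinary lm() fit (not part of the demo output; shown here only for verification):

# Least-squares coefficients; should agree with the line above,
# approximately 8.25 (intercept) and -0.44 (slope)
coef(lm(y ~ x))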

  3. One problem with least squares is that it can be heavily influenced by a single point. Using the “Add Point” feature of the put.points.demo function, add a single point that has a significant effect on the least-squares fit.
x1=c(2,2,2,4,5,6,7,8,9,10)
y1=c(2,7,8,6,7,4,6,4,6,3)

Point you added: (2,2)

New least-squares fit: y = 6.43 + (-0.20)x
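Checking the new fit with lm():

# Coefficients should be approximately 6.43 (intercept) and -0.20 (slope)
coef(lm(y1 ~ x1))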

  4. The function rline in the LearnEDAfunctions package implements the resistant line described in the Lecture Notes. Find a resistant fit of both the original (x, y) data and of the new data with the new point added.
library(LearnEDAfunctions)
df <- data.frame(x, y)
rline(y ~ x, df, iter=10)
## $a
## [1] 5.5
## 
## $b
## [1] -0.5
## 
## $xC
## [1] 6
## 
## $half.slope.ratio
## [1] 2.666667
## 
## $residual
## [1] -0.5  0.5 -0.5  1.0 -1.5  1.0 -0.5  2.0 -0.5
## 
## $spoints.x
## [1] 2 6 9
## 
## $spoints.y
## [1] 7 6 4
df1 <- data.frame(x1, y1)
rline(y1 ~ x1, df1, iter=10)
## $a
## [1] 5.5
## 
## $b
## [1] -0.5
## 
## $xC
## [1] 5.5
## 
## $half.slope.ratio
## [1] 2
## 
## $residual
##  [1] -5.25 -0.25  0.75 -0.25  1.25 -1.25  1.25 -0.25  2.25 -0.25
## 
## $spoints.x
## [1] 2.0 5.5 9.0
## 
## $spoints.y
## [1] 7 6 4

Original (x, y) data:
Least-squares line: y = 8.25 + (-0.44)x
Resistant line: y = 5.5 + (-0.5)(x − 6), which simplifies to y = 8.5 + (-0.5)x

New data with additional point:
Least-squares line: y = 6.43 + (-0.20)x
Resistant line: y = 5.5 + (-0.5)(x − 5.5), which simplifies to y = 8.25 + (-0.5)x
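rline reports its fit in centered form, y = a + b(x − xC); converting to slope-intercept form is just the arithmetic intercept = a − b·xC, which can also be done directly from the fitted object:

# Convert rline's centered form y = a + b*(x - xC) to slope-intercept form
fit <- rline(y1 ~ x1, df1, iter = 10)
c(intercept = fit$a - fit$b * fit$xC, slope = fit$b)
# 5.5 - (-0.5)*5.5 = 8.25, matching the equation above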

  5. Compare the least-squares and resistant fits for the original data and for the new data. Have you demonstrated in this example that the resistant fit is indeed resistant to outlying points?

Yes. Adding the single point (2, 2) changed the least-squares line substantially, from y = 8.25 + (-0.44)x to y = 6.43 + (-0.20)x. By contrast, the resistant line kept the same slope (-0.5) and its intercept moved only slightly, from 8.5 to 8.25. Adding one point barely changed the resistant equation, which shows that the resistant fit is indeed resistant to outlying points.

  6. Demonstrate the differences between the two fits for the new data by plotting the data and both lines on the same graph using contrasting colors.

The least-squares fit is plotted in red and the resistant fit in blue:

ggplot(df1, aes(x1, y1)) +
  geom_point() +
  geom_abline(slope = -0.20, intercept = 6.43, color = "red") +
  geom_abline(slope = -0.5, intercept = 8.25, color = "blue")
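The same plot can also be drawn without hard-coding the coefficients, pulling them from the fitted objects instead (a sketch, assuming df1 and the rline fit above are in the workspace):

ls.fit <- lm(y1 ~ x1, data = df1)
r.fit <- rline(y1 ~ x1, df1, iter = 10)
ggplot(df1, aes(x1, y1)) +
  geom_point() +
  # least-squares line from the lm coefficients
  geom_abline(slope = coef(ls.fit)[2], intercept = coef(ls.fit)[1], color = "red") +
  # resistant line, converted from rline's centered form
  geom_abline(slope = r.fit$b, intercept = r.fit$a - r.fit$b * r.fit$xC, color = "blue")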

  7. There are other “resistant” fitting methods available. In the MASS package, the functions rlm and lqs implement different resistant fits. Explain briefly the algorithms of these two fitting methods and use the rlm and lqs methods to fit lines to your new data. Contrast the resistant, rlm, lqs, and least-squares fits.

The lqs function works by fitting a regression to the “good” points in the data set: by default it performs least trimmed squares, choosing the line that minimizes the sum of the smallest squared residuals, so outlying points are effectively ignored. The rlm function uses iterated re-weighted least squares to create a resistant fit, downweighting observations with large residuals rather than discarding them.

Using rlm and lqs on our new data as outlined below, the rlm fit is y = 7.100 + (-0.291)x and the lqs fit is y = 7.633 + (-0.200)x. Compared with our previous resistant fit (y = 8.25 + (-0.5)x), both have shallower slopes and smaller intercepts; the rlm intercept (7.100) is closer to the least-squares intercept, while the lqs intercept (7.633) is closer to the resistant one. Compared with our previous least-squares fit (y = 6.43 + (-0.20)x), the lqs slope (-0.200) is nearly identical but its intercept is noticeably higher, while rlm has a steeper slope (-0.291) and a higher intercept (7.100). Both robust intercepts lie between the least-squares and resistant values, so neither method is pulled down toward the added point (2, 2) as strongly as least squares is.
library(MASS)
lqs(y1 ~ x1, data = df1)
## Call:
## lqs.formula(formula = y1 ~ x1, data = df1)
## 
## Coefficients:
## (Intercept)           x1  
##       7.633       -0.200  
## 
## Scale estimates 1.098 1.628
rlm(y1 ~ x1, data = df1)
## Call:
## rlm(formula = y1 ~ x1, data = df1)
## Converged in 6 iterations
## 
## Coefficients:
## (Intercept)          x1 
##   7.1004119  -0.2912616 
## 
## Degrees of freedom: 10 total; 8 residual
## Scale estimate: 1.88
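To contrast all four fits side by side, the slope-intercept coefficients can be collected into one table (a sketch; it assumes the fits above are in the workspace, and note that lqs uses random subsampling, so its coefficients can vary slightly from run to run):

# Intercept and slope for each of the four fits, in slope-intercept form
ls.fit <- lm(y1 ~ x1, data = df1)
r.fit <- rline(y1 ~ x1, df1, iter = 10)
rlm.fit <- rlm(y1 ~ x1, data = df1)
lqs.fit <- lqs(y1 ~ x1, data = df1)  # randomized; results may differ slightly
rbind(least.squares = coef(ls.fit),
      resistant = c(r.fit$a - r.fit$b * r.fit$xC, r.fit$b),
      rlm = coef(rlm.fit),
      lqs = coef(lqs.fit))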