In this R activity, you’ll graphically add an unusual point and see its impact on the least-squares fit. You’ll then explore three resistant fitting procedures for fitting a line to data with outlying points.

  1. Install the package TeachingDemos into R.

  2. Define the following (x, y) data and run the function put.points.demo on this data:

library(LearnEDAfunctions)
x <- c(2, 2, 4, 5, 6, 7, 8, 9, 10)
y <- c(7, 8, 6, 7, 4, 6, 4, 6, 3)
plot(x, y)
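
To actually run the interactive demo from step 2 (a sketch, assuming the TeachingDemos package from step 1 is installed and that put.points.demo accepts x and y vectors), pass the same data to the function:

library(TeachingDemos)
# interactive demo: add or drag points and watch the least-squares fit update
put.points.demo(x, y)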

Record the equation of the least-squares line below: \(Y=-0.44X+8.25\) (found using put.points.demo)
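
As a quick check (not part of the original instructions), the same coefficients can be recovered with lm:

# least-squares fit of the original data; should agree with Y = -0.44X + 8.25
coef(lm(y ~ x))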

  3. One problem with least squares is that the fit can be heavily influenced by a single point. Add a single point that will have a significant effect on the least-squares fit.
x1 <- c(2, 2, 4, 5, 6, 7, 8, 9, 10, 12)
y1 <- c(7, 8, 6, 7, 4, 6, 4, 6, 3, 18)
plot(x1, y1)

Point you added: \((12,18)\)

New least-squares fit: \(Y=0.38X+4.41\) (found using put.points.demo)
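
The same lm check on the augmented data confirms the new least-squares fit:

# least-squares fit with the outlier included; should agree with Y = 0.38X + 4.41
coef(lm(y1 ~ x1))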

  4. The function rline in the LearnEDAfunctions package implements the resistant line described in the Lecture Notes. Find a resistant fit of both the original (x, y) data and the new data with the added point.

Original (x, y) data: Equation of the resistant line: \(Y=5.52-0.43(X-6)\), or equivalently \(Y=-0.43X+8.10\). (rline reports the slope \(b\) and the level \(a\) at the centering value \(x_C\), so the fit is \(y=a+b(x-x_C)\).)

old_data <- data.frame(x, y)
old_rline <- rline(y ~ x, old_data)
print(old_rline)
## $a
## [1] 5.52381
## 
## $b
## [1] -0.4285714
## 
## $xC
## [1] 6
## 
## $half.slope.ratio
## [1] 2.666667
## 
## $residual
## [1] -0.2380952  0.7619048 -0.3809524  1.0476190 -1.5238095  0.9047619 -0.6666667
## [8]  1.7619048 -0.8095238
## 
## $spoints.x
## [1] 2 6 9
## 
## $spoints.y
## [1] 7 6 4
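
Because rline centers the fit at \(x_C\), converting it to slope-intercept form (an extra step, not part of the rline output) makes it directly comparable to the least-squares line:

# convert y = a + b * (x - xC) into y = intercept + slope * x
with(old_rline, c(intercept = a - b * xC, slope = b))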

New data with additional point: Equation of the resistant line: \(Y=5.96-0.125(X-6.5)\), or equivalently \(Y=-0.125X+6.77\)

new_data <- data.frame(x1, y1)
new_rline <- rline(y1 ~ x1, new_data)
print(new_rline)
## $a
## [1] 5.958333
## 
## $b
## [1] -0.125
## 
## $xC
## [1] 6.5
## 
## $half.slope.ratio
## [1] -0.6428571
## 
## $residual
##  [1]  0.4791667  1.4791667 -0.2708333  0.8541667 -2.0208333  0.1041667
##  [7] -1.7708333  0.3541667 -2.5208333 12.7291667
## 
## $spoints.x
## [1]  2.0  6.5 10.0
## 
## $spoints.y
## [1] 7 5 6
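
The same conversion gives the slope-intercept form of the new resistant fit, which is used in the plot for step 6:

# slope-intercept form of the resistant fit with the outlier included
with(new_rline, c(intercept = a - b * xC, slope = b))
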
  5. Compare the least-squares and resistant fits for the original data and for the new data. Have you demonstrated in this example that the resistant fit is indeed resistant to outlying points? Adding one outlier flipped the slope of the least-squares line from negative (-0.44) to positive (0.38), a drastic change. By contrast, the slope of the resistant line moved from -0.43 to -0.125, a much smaller change that keeps its sign. This demonstrates that the resistant fit is far less affected by the outlying point.

  6. Demonstrate the differences between the two fits for the new data by plotting the data and both lines on the same graph using contrasting colors. In the plot below, the blue line is the resistant line, and the red line is the least-squares line.

ggplot(new_data, aes(x1, y1)) +
  geom_point() +
  # resistant line, converted to slope-intercept form: Y = -0.125X + 6.77
  geom_abline(slope = new_rline$b,
              intercept = new_rline$a - new_rline$b * new_rline$xC, col = "blue") +
  # least-squares line from put.points.demo: Y = 0.38X + 4.41
  geom_abline(slope = 0.38, intercept = 4.41, col = "red")

  7. There are other “resistant” fitting methods available. In the MASS package, the functions rlm and lqs implement two different “resistant” fits. Briefly explain the algorithms behind these two fitting methods, use rlm and lqs to fit lines to your new data, and contrast the resistant, rlm, lqs, and least-squares fits.
library(MASS)
fit_rlm <- rlm(y1 ~ x1)
fit_lqs <- lqs(y1 ~ x1)
summary(fit_rlm)
## 
## Call: rlm(formula = y1 ~ x1)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8623 -1.1078  0.2561  0.8966 13.6644 
## 
## Coefficients:
##             Value   Std. Error t value
## (Intercept)  7.3889  1.1584     6.3784
## x1          -0.2544  0.1602    -1.5885
## 
## Residual standard error: 1.498 on 8 degrees of freedom
summary(fit_lqs)
##               Length Class      Mode     
## crit           1     -none-     numeric  
## sing           1     -none-     character
## coefficients   2     -none-     numeric  
## bestone        2     -none-     numeric  
## fitted.values 10     -none-     numeric  
## residuals     10     -none-     numeric  
## scale          2     -none-     numeric  
## terms          3     terms      call     
## call           2     -none-     call     
## xlevels        0     -none-     list     
## model          2     data.frame list
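
By default, rlm fits an M-estimator by iterated re-weighted least squares with Huber’s psi function, so observations with large residuals are progressively down-weighted rather than dropped. By default, lqs fits by least trimmed squares, choosing the line that minimizes the sum of the smallest half of the squared residuals, so a single outlier among ten points is effectively ignored. Since summary() is not very informative for an lqs object, one way to contrast the four fits (a sketch, not part of the original output) is to collect their slope-intercept coefficients in one table:

# intercept and slope of each fit to the data containing the outlier
rbind(least.squares = coef(lm(y1 ~ x1)),
      resistant     = with(new_rline, c(a - b * xC, b)),
      rlm           = coef(fit_rlm),
      lqs           = coef(fit_lqs))

Judging from the outputs above, the outlier flips the least-squares slope to +0.38, while rlm (slope about -0.25) and the resistant line (slope -0.125) both keep a negative slope; the least-trimmed-squares fit from lqs should essentially ignore the single outlier and come closest to the fit of the clean data.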