library(LearnEDAfunctions)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
head(boston.marathon.wtimes)
##   year minutes
## 1 1897     175
## 2 1898     162
## 3 1899     174
## 4 1900     159
## 5 1901     149
## 6 1902     163
ggplot(boston.marathon.wtimes,
       aes(year, minutes)) +
  geom_point()

  1. Using R, perform a resistant smooth (3RSSH, twice) to your data. Save the smooth (the fit) and the rough (the residuals).
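# 3RSS running-median smooth followed by hanning (han) gives the 3RSSH smooth
boston.marathon.wtimes <- mutate(boston.marathon.wtimes,
  smooth.3RSSH = han(as.vector(smooth(minutes, kind = "3RSS"))))
# Data as points, smooth as a red line
ggplot(boston.marathon.wtimes,
       aes(year, minutes)) +
  geom_point() +
  geom_line(aes(year, smooth.3RSSH), color = "red")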

# Rough (residual) = data minus smooth
boston.marathon.wtimes <- mutate(boston.marathon.wtimes,
  Rough = minutes - smooth.3RSSH)
slice(boston.marathon.wtimes, 1:10)
##    year minutes smooth.3RSSH  Rough
## 1  1897     175       175.00   0.00
## 2  1898     162       171.25  -9.25
## 3  1899     174       164.25   9.75
## 4  1900     159       160.25  -1.25
## 5  1901     149       160.00 -11.00
## 6  1902     163       159.25   3.75
## 7  1903     161       158.25   2.75
## 8  1904     158       158.00   0.00
## 9  1905     158       158.00   0.00
## 10 1906     165       158.00   7.00
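To make the pieces of a resistant smooth concrete, here is a minimal base-R sketch of two of its ingredients: running medians of 3 repeated until nothing changes ("3R") and hanning ("H"). It is an illustration under simplifying assumptions, not a reimplementation of `smooth()` or `han()`: the splitting step ("S") is left out and the end values are simply copied. The helper names `median3`, `median3_repeated` and `hanning` are made up for this sketch; the data are the first ten winning times from the table above.

```r
# Running median of 3; end values are copied unchanged
# (a simplification: smooth() has its own end-point rule).
median3 <- function(x) {
  n <- length(x)
  if (n < 3) return(x)
  out <- x
  for (i in 2:(n - 1)) out[i] <- median(x[(i - 1):(i + 1)])
  out
}

# "R": repeat the running median until the sequence stops changing.
median3_repeated <- function(x) {
  repeat {
    new <- median3(x)
    if (identical(new, x)) return(x)
    x <- new
  }
}

# "H": hanning, a weighted average with weights 1/4, 1/2, 1/4;
# end values are copied unchanged.
hanning <- function(x) {
  n <- length(x)
  if (n < 3) return(x)
  out <- x
  out[2:(n - 1)] <- 0.25 * x[1:(n - 2)] + 0.5 * x[2:(n - 1)] + 0.25 * x[3:n]
  out
}

# First ten winning times (minutes), copied from the table above
minutes <- c(175, 162, 174, 159, 149, 163, 161, 158, 158, 165)

fit   <- hanning(median3_repeated(minutes))  # smooth without the "S" step
rough <- minutes - fit                       # data = smooth + rough
data.frame(minutes, fit, rough)
```

Because the splitting step is omitted, the `fit` values may differ from the `smooth.3RSSH` column. The point of the sketch is only that the medians remove isolated spikes, the hanning rounds off the result, and the rough is whatever is left over.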
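# twiceit = TRUE smooths the rough and adds it back to the first smooth ("twice").
# Note: kind = "3RS3R" finishes with running medians rather than hanning.
boston.marathon.wtimes <- mutate(boston.marathon.wtimes,
  smooth.3RS3R.twice = as.vector(smooth(minutes, kind = "3RS3R", twiceit = TRUE)))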
  1. Plot the smooth (using a smooth curve) and describe the general patterns that you see. Don’t assume that anything is obvious – pretend that you are explaining this to someone who doesn’t have any background in statistics.
ggplot(boston.marathon.wtimes, aes(year, minutes)) +
  geom_point() +
  geom_line(aes(year, smooth.3RS3R.twice), color = "red")

The red curve shows the general level of the winning time in the men’s Boston Marathon once the year-to-year bumps are smoothed away. The highest point of the curve is at the very start, in 1897, the first year in the data, and the winning times then dropped steeply over the first 15 years. From about 1900 to 1927, the winning times showed a generally slow increase, with the exception of one lower winning time around 1923. Maybe the runners had to deal with particularly harsh weather conditions during this period, leading to the longer times. From 1927 until 1996 we see a general decrease in the winning times, with the exception of a steep drop in about 1955. Towards the end of the graph (1980-1996), the winning times have stayed about the same. One possible explanation is that athletes in those years became more aware of the right technique, and it seems this will be the typical winning time going forward. Top athletes will always try to beat their records, but humans are only capable of so much, so it seems reasonable that the winning times have leveled off towards the end of the graph.

  1. Plot the rough (as a time series plotting individual points). Do you see any general patterns in the rough?
# Final rough = data minus the twiced smooth
boston.marathon.wtimes <- mutate(boston.marathon.wtimes,
  FinalRough = minutes - smooth.3RS3R.twice)
# Rough against year, with a reference line at zero
ggplot(boston.marathon.wtimes,
       aes(year, FinalRough)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "blue")

There are six years in which the size of the rough is greater than 10 minutes. In these years, the winning time was more than 10 minutes above or below the smooth, that is, unusually slow or unusually fast compared with the general level of winning times around that year. This could be due to a number of different reasons. Maybe the weather conditions were particularly hard one year, making it harder to run the marathon and therefore leading to a higher winning time. It could also be that a particular year did not have especially strong contestants who would produce a winning time within the norm. On the other hand, for an unusually low winning time, it could be that a particularly skilled athlete beat his personal best that year. The one thing that stands out is that all six of these years fall among the first 34 races in the data (see the listing of sizes in the next part). Other than that, we can’t really detect any pattern: the points scatter on both sides of the zero line, and most of the rough falls between -10 and +10 minutes.

  1. Construct a stemplot and letter value display of the sizes of the rough. (The size is the absolute value of the rough.) Set up fences and look for outliers. Summarize what you have learned. (What is a typical size of a residual? Are there any unusually large residuals?)
abs(boston.marathon.wtimes$FinalRough)
##  [1]  0 12 14  0  8  6  3  0  0  7 14  3 31  6  4  4  0  0  4  1  0  1  1  5  5
## [26]  0  0  4 11  3  0  1  0 12  0  1  0  0  0  0  2  0  0  2  2  0  3  1  1  6
## [51]  0  0  1  4  4  2  2  0  4  0  3  0  2  0  1  2  1  1  1  0  8  1  5  3  0
## [76]  1  2  5  7  4  0  1  3  0  1  0  0  4  3  2  1  0  1  2  1  0  2  0  0
aplpack::stem.leaf(abs(boston.marathon.wtimes$FinalRough), depth=TRUE)
## 1 | 2: represents 1.2
##  leaf unit: 0.1
##             n: 99
##    35    0* | 00000000000000000000000000000000000
##          0. | 
##   (19)   1* | 0000000000000000000
##          1. | 
##    45    2* | 00000000000
##          2. | 
##    34    3* | 00000000
##          3. | 
##    26    4* | 000000000
##          4. | 
##    17    5* | 0000
##          5. | 
##    13    6* | 000
##          6. | 
##    10    7* | 00
##          7. | 
##     8    8* | 00
## HI: 11 12 12 14 14 31
fivenum(abs(boston.marathon.wtimes$FinalRough))
## [1]  0  0  1  4 31
lval(abs(boston.marathon.wtimes$FinalRough))
##   depth lo hi mids spreads
## M  50.0  1  1  1.0       0
## H  25.5  0  4  2.0       4
## E  13.0  0  6  3.0       6
## D   7.0  0  8  4.0       8
## C   4.0  0 12  6.0      12
## B   2.5  0 14  7.0      14
## A   1.0  0 31 15.5      31
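As a check on the letter value display, the sketch below rebuilds the depths and the letter values by hand in base R. The 99 sizes are reconstructed from the counts in the stemplot above (35 zeros, 19 ones, and so on, plus the six HI values). The stopping rule for the depths is an assumption chosen so that the depths match the display; how `lval()` decides this internally is not shown here. The helper name `value_at` is made up for this sketch.

```r
# Sizes of the rough, rebuilt from the stemplot counts (order does not matter here)
sizes <- rep(c(0:8, 11, 12, 14, 31),
             times = c(35, 19, 11, 8, 9, 4, 3, 2, 2, 1, 2, 2, 1))
n <- length(sizes)   # 99
x <- sort(sizes)

# Depths: start at the median, (n + 1) / 2; each next depth is
# (floor(previous depth) + 1) / 2. Assumed stopping rule: stop once the
# next depth would fall below 2, then finish with the extremes (depth 1).
depths <- (n + 1) / 2
repeat {
  nxt <- (floor(tail(depths, 1)) + 1) / 2
  if (nxt < 2) break
  depths <- c(depths, nxt)
}
depths <- c(depths, 1)

# Letter value at depth d: average of the two order statistics around d
value_at <- function(sorted, d) mean(sorted[c(floor(d), ceiling(d))])

lo <- sapply(depths, function(d) value_at(x, d))       # counting up from the bottom
hi <- sapply(depths, function(d) value_at(rev(x), d))  # counting down from the top

data.frame(depth = depths, lo = lo, hi = hi,
           mids = (lo + hi) / 2, spreads = hi - lo)
```

Under these assumptions the `hi` column comes out as 1, 4, 6, 8, 12, 14, 31, which agrees with the display. The mids climb steadily from 1 to 15.5, which is what we expect for sizes: they are piled up at 0 and have a long right tail.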

LO = 0
FL = 0
M = 1
FU = 4
HI = 31
STEP = 1.5 × (FU − FL) = 1.5 × (4 − 0) = 6
fence_lower = FL − STEP = 0 − 6 = -6
fence_upper = FU + STEP = 4 + 6 = 10
FENCE_lower = FL − 2 × STEP = 0 − 2 × 6 = -12
FENCE_upper = FU + 2 × STEP = 4 + 2 × 6 = 16

A typical residual is small: the median size is 1 minute, and three quarters of the residuals are 4 minutes or less (the upper fourth is 4). Because a size can never be negative, the lower fences (-6 for the inner fence and -12 for the outer fence) play no role, and only the upper fences matter. Six residuals lie above the inner upper fence of 10: the sizes 11, 12, 12, 14, 14 and 31 listed on the HI line of the stemplot. Five of these (11 to 14) fall between the inner fence and the outer fence of 16, so they are mild outliers. The residual of 31 lies beyond the outer fence; it is unusually large and would be considered an extreme outlier.
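The fence arithmetic can be checked with a short base-R sketch. As an assumption for illustration, the 99 sizes are rebuilt from the counts in the stemplot rather than taken from the data frame, and the names `inner`, `outer`, `mild` and `extreme` are made up for this sketch.

```r
# Sizes of the rough, rebuilt from the stemplot counts
sizes <- rep(c(0:8, 11, 12, 14, 31),
             times = c(35, 19, 11, 8, 9, 4, 3, 2, 2, 1, 2, 2, 1))

five <- fivenum(sizes)   # LO, FL, M, FU, HI
FL <- five[2]            # lower fourth
FU <- five[4]            # upper fourth

step  <- 1.5 * (FU - FL)
inner <- c(lower = FL - step,     upper = FU + step)      # inner fences
outer <- c(lower = FL - 2 * step, upper = FU + 2 * step)  # outer fences

# Sizes are non-negative, so only the upper fences can be crossed
mild    <- sizes[sizes > inner["upper"] & sizes <= outer["upper"]]
extreme <- sizes[sizes > outer["upper"]]

inner
outer
mild      # between the inner and outer upper fences
extreme   # beyond the outer upper fence
```

With these sizes, `mild` holds 11, 12, 12, 14, 14 and `extreme` holds 31, which matches the HI line of the stemplot.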