library(LearnEDAfunctions)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
head(boston.marathon.wtimes)
## year minutes
## 1 1897 175
## 2 1898 162
## 3 1899 174
## 4 1900 159
## 5 1901 149
## 6 1902 163
ggplot(boston.marathon.wtimes,
       aes(year, minutes)) +
  geom_point()
# 3RSS running-median smooth of the winning times, followed by hanning (the H step)
boston.marathon.wtimes <- mutate(boston.marathon.wtimes,
  smooth.3RSSH = han(as.vector(smooth(minutes, kind = "3RSS"))))
ggplot(boston.marathon.wtimes,
       aes(year, minutes)) +
  geom_point() +
  geom_line(aes(year, smooth.3RSSH), color = "red")
# rough = data - smooth
boston.marathon.wtimes <- mutate(boston.marathon.wtimes,
  Rough = minutes - smooth.3RSSH)
slice(boston.marathon.wtimes, 1:10)
## year minutes smooth.3RSSH Rough
## 1 1897 175 175.00 0.00
## 2 1898 162 171.25 -9.25
## 3 1899 174 164.25 9.75
## 4 1900 159 160.25 -1.25
## 5 1901 149 160.00 -11.00
## 6 1902 163 159.25 3.75
## 7 1903 161 158.25 2.75
## 8 1904 158 158.00 0.00
## 9 1905 158 158.00 0.00
## 10 1906 165 158.00 7.00
# 3RS3R running-median smooth, twiced (smooth the rough and add it back)
boston.marathon.wtimes <- mutate(boston.marathon.wtimes,
  smooth.3RS3R.twice = as.vector(smooth(minutes, kind = "3RS3R", twiceit = TRUE)))
ggplot(boston.marathon.wtimes, aes(year, minutes)) +
  geom_point() +
  geom_line(aes(year, smooth.3RS3R.twice), col = "red")
The winning time in the men’s Boston Marathon shows an immediate and steep drop over the first 15 years, with the peak actually being the first recorded time, in 1897. Then, from about 1900 to 1927, the winning times show a generally slow increase, with the exception of one lower winning time around 1923. Perhaps the contestants had to deal with particularly harsh weather conditions during this period, leading to the longer times. From 1927 until 1996 we see a general decrease in the winning times, aside from a steep drop around 1955. Towards the end of the graph, the winning times stay about the same (1980-1996). In those years athletes became more aware of effective training and technique, and it seems this level will be the typical winning time going forward. Of course, top athletes will always try to beat their records, but humans are only capable of so much, so it seems reasonable that the winning times have leveled off towards the end of the graph.
# final rough = data - (3RS3R, twice) smooth
boston.marathon.wtimes <- mutate(boston.marathon.wtimes,
  FinalRough = minutes - smooth.3RS3R.twice)
ggplot(boston.marathon.wtimes,
       aes(year, FinalRough)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "blue")
There are six years where the rough exceeds ±10 minutes for the men’s Boston Marathon winning times. In those years the winning time was more than 10 minutes above or below the smoothed winning time. This could be due to a number of reasons. Perhaps the weather conditions were particularly harsh one year, making the marathon harder to run and leading to a higher winning time. It could also be that a particular year did not have especially strong contestants, so the winning time fell outside the norm. On the other hand, for the unusually low winning times, a particularly skilled athlete may have topped his personal best that year. Other than these six years where the rough is around ±10 minutes or more, we cannot detect any clear pattern; most of the rough values fall within ±10 minutes.
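As a quick check of that count, one could filter on the rough directly (a sketch; it assumes the FinalRough column created above and dplyr’s filter()):

# years where the final rough is larger than 10 minutes in size
filter(boston.marathon.wtimes, abs(FinalRough) > 10)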
abs(boston.marathon.wtimes$FinalRough)
## [1] 0 12 14 0 8 6 3 0 0 7 14 3 31 6 4 4 0 0 4 1 0 1 1 5 5
## [26] 0 0 4 11 3 0 1 0 12 0 1 0 0 0 0 2 0 0 2 2 0 3 1 1 6
## [51] 0 0 1 4 4 2 2 0 4 0 3 0 2 0 1 2 1 1 1 0 8 1 5 3 0
## [76] 1 2 5 7 4 0 1 3 0 1 0 0 4 3 2 1 0 1 2 1 0 2 0 0
aplpack::stem.leaf(abs(boston.marathon.wtimes$FinalRough), depth=TRUE)
## 1 | 2: represents 1.2
## leaf unit: 0.1
## n: 99
## 35 0* | 00000000000000000000000000000000000
## 0. |
## (19) 1* | 0000000000000000000
## 1. |
## 45 2* | 00000000000
## 2. |
## 34 3* | 00000000
## 3. |
## 26 4* | 000000000
## 4. |
## 17 5* | 0000
## 5. |
## 13 6* | 000
## 6. |
## 10 7* | 00
## 7. |
## 8 8* | 00
## HI: 11 12 12 14 14 31
fivenum(abs(boston.marathon.wtimes$FinalRough))
## [1] 0 0 1 4 31
lval(abs(boston.marathon.wtimes$FinalRough))
## depth lo hi mids spreads
## M 50.0 1 1 1.0 0
## H 25.5 0 4 2.0 4
## E 13.0 0 6 3.0 6
## D 7.0 0 8 4.0 8
## C 4.0 0 12 6.0 12
## B 2.5 0 14 7.0 14
## A 1.0 0 31 15.5 31
From the letter-value display above: Lo = 0, FL = 0, M = 1, FU = 4, Hi = 31.

STEP = 1.5 × (FU − FL) = 1.5 × 4 = 6
fence lower = FL − STEP = 0 − 6 = −6
fence upper = FU + STEP = 4 + 6 = 10
FENCE lower = FL − 2 × STEP = 0 − 12 = −12
FENCE upper = FU + 2 × STEP = 4 + 12 = 16
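The same fence arithmetic can be reproduced in R (a sketch; the fourths FL and FU are read off the lval() output above):

FL <- 0; FU <- 4                 # fourths of the absolute final rough
STEP <- 1.5 * (FU - FL)          # 6
c(fence.lower = FL - STEP,       # -6
  fence.upper = FU + STEP,       # 10
  FENCE.lower = FL - 2 * STEP,   # -12
  FENCE.upper = FU + 2 * STEP)   # 16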
The typical size of a residual is between 0 and 10 minutes (the inner lower and upper fences are -6 and 10, respectively). Residuals up to 16 minutes are still within the outer fences (the outer lower and upper fences are -12 and 16, respectively). We can clearly see that there is one residual outside both the inner and outer fences: the residual of 31 minutes is unusually large and would be considered an outlier.
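To see which year produced that outlying residual, one could pull the row with the largest absolute final rough (a sketch using dplyr’s slice_max()):

# row with the largest |FinalRough| (the residual of 31 noted above)
slice_max(boston.marathon.wtimes, abs(FinalRough), n = 1)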