Part A

For both data sets:

  1. Using R, perform a resistant smooth (3RSSH, twice) to your data. Save the smooth (the fit) and the rough (the residuals).

  2. Plot the smooth (using a smooth curve) and describe the general patterns that you see. Don’t assume that anything is obvious – pretend that you are explaining this to someone who doesn’t have any background in statistics.

  3. Plot the rough (as a time series plotting individual points). Do you see any general patterns in the rough?

  4. Construct a stemplot and letter value display of the sizes of the rough. (The size is the absolute value of the rough.) Set up fences and look for outliers. Summarize what you have learned. (What is a typical size of a residual? Are there any unusually large residuals?)

Dataset boston.marathon.wtimes

boston.marathon.wtimes <- read.delim("~/data/boston.marathon.wtimes.txt")
head(boston.marathon.wtimes)
##   year minutes
## 1 1897     175
## 2 1898     162
## 3 1899     174
## 4 1900     159
## 5 1901     149
## 6 1902     163
smooth.3R <- smooth(boston.marathon.wtimes$minutes,kind="3R")
smooth.3RSS <-smooth(boston.marathon.wtimes$minutes,kind="3RSS")
plot(minutes~year,boston.marathon.wtimes,xlab="YEAR",ylab="MINUTES",main="Plot with 3R and 3RSS Smooths")
lines(smooth.3R~boston.marathon.wtimes$year, col="orange",lwd=2)
lines(smooth.3RSS~boston.marathon.wtimes$year, col="blue",lwd=2)
legend("topright",legend=c("3R", "3RSS"),lty=1, lwd=2,col=c("orange", "blue"))

smooth.3RSSH <- han(smooth.3RSS)
Rough <- boston.marathon.wtimes$minutes- smooth.3RSSH
head(cbind(boston.marathon.wtimes$minutes, smooth.3RSSH, Rough))
##      boston.marathon.wtimes$minutes smooth.3RSSH  Rough
## [1,]                            175       175.00   0.00
## [2,]                            162       171.25  -9.25
## [3,]                            174       164.25   9.75
## [4,]                            159       160.25  -1.25
## [5,]                            149       160.00 -11.00
## [6,]                            163       159.25   3.75
smooth.3RS3R.twice <- smooth(boston.marathon.wtimes$minutes,kind="3RS3R",twiceit=TRUE)
plot(smooth.3RS3R.twice~boston.marathon.wtimes$year,col="red", lwd=2, type="l",main="3RSSH Smooth VS 3RSSR, Twice Smooth")
lines(smooth.3RSSH~boston.marathon.wtimes$year,col="blue",lwd=2)
legend("topright",legend=c("3RSSH", "3RSSR.twice"),lty=1, lwd=2,col=c("blue", "Red"))

• It is a little difficult to see the general pattern across time from the scatter plot due to the variability in the vertical direction. To help see the general pattern, we use resistant smooths,such as 3RSSH smooth.

• Generally, the minutes of Boston Marathon cost less over time. They were especially uncommon in the 1910-1930 era.There was an increasing trend between 1914-1918 (World War I). There was dramatic minutes increases in 1920-1930, followed by a decreasing pattern until about 1940. That may be caused by the course lengthened to conform to the Olympic standard in 1920s.

• The increasing trend between 1940-1945 time matched the World War II. During 1955-1960, the minutes may increase by the severe weather.

• Recently (since 1980), Boston Marathon is seeing the greatest running minutes due to the scientific training,talented athletes and prize money awarded.

FinalRough <- boston.marathon.wtimes$minutes - smooth.3RS3R.twice
plot(FinalRough~boston.marathon.wtimes$year,pch=19, cex=1, xlab="YEAR", ylab="ROUGH")
abline(h=0, lwd=3, col="blue")

• There are a number of years (3 to be exact) where the rough is above/below +10/-10 –- for these particular years, the minutes was about 10 more/less than the general pattern.

• There is one year (looks like the year 1909) where the rough was about +30 –- this means that the minutes this year was about 30 larger than the general pattern in the smooth. Maybe it was due to the severe weather (1909: The temperature soared to 97 degrees–recorded in Boston Marathon Official website).

• I don’t see any general pattern in the rough when plotted as a function of years. Most of the rough values are between -6 and +10 which translate to Boston Marathon minutes values between -6 and +10.

FR<- abs(FinalRough)
aplpack::stem.leaf(FR)
## 1 | 2: represents 1.2
##  leaf unit: 0.1
##             n: 99
##    35    0* | 00000000000000000000000000000000000
##          0. | 
##   (19)   1* | 0000000000000000000
##          1. | 
##    45    2* | 00000000000
##          2. | 
##    34    3* | 00000000
##          3. | 
##    26    4* | 000000000
##          4. | 
##    17    5* | 0000
##          5. | 
##    13    6* | 000
##          6. | 
##    10    7* | 00
##          7. | 
##     8    8* | 00
## HI: 11 12 12 14 14 31
data<-cbind(as.vector(FR),boston.marathon.wtimes)
lval(FR)
##   depth lo hi mids spreads
## M  50.0  1  1  1.0       0
## H  25.5  0  4  2.0       4
## E  13.0  0  6  3.0       6
## D   7.0  0  8  4.0       8
## C   4.0  0 12  6.0      12
## B   2.5  0 14  7.0      14
## A   1.0  0 31 15.5      31
d<-lval_plus(data,FR)
filter(d, OUT == TRUE)
##   as.vector(FR) year minutes Fence_LO Fence_HI  OUT
## 1            12 1898     162       -6       10 TRUE
## 2            14 1899     174       -6       10 TRUE
## 3            14 1907     144       -6       10 TRUE
## 4            31 1909     173       -6       10 TRUE
## 5            11 1926     145       -6       10 TRUE
## 6            12 1931     166       -6       10 TRUE
plot(FR~data$year,pch=19, cex=1, xlab="YEAR", ylab=" FINAL ROUGH")
abline(h=10,col="red")

As we can see from the stemplot, the distribution of the rough seems right skewed. The Median, Fourths, Eighths,Extremes are 1, 0 and 4, 0 and 6, 0(lowest) and 31(highest). The fences are (-6, 10) and there are 6 outliers in total as shown in the residual plot.

Dataset church.tseries

##   week year month worship
## 1    1 1993     1     336
## 2    2 1993     1     264
## 3    3 1993     1     382
## 4    4 1993     1     325
## 5    5 1993     1     346
## 6    6 1993     2     464

##      Data$worship smooth.3RSSH  Rough
## [1,]          336       336.00   0.00
## [2,]          264       336.00 -72.00
## [3,]          382       338.50  43.50
## [4,]          325       343.50 -18.50
## [5,]          346       346.00   0.00
## [6,]          464       333.75 130.25

• It is a little difficult to see the general pattern across time from the scatter plot due to the variability in the vertical direction.

• The worship had an immediate drop in the early and middle of 1993(week 1- 52) after the excitement of the establishment of the church and not enough enrollment at the beginning.

• After this early drop, the worship showed a dramatic increase from week 27 to about week 50(at the end of year 1993).

• The worship reached a peak about week 70, then it dropped off suddenly and largely and there was a valley about week 84 (In the middle of the year 1994). It was also discovered in every year(from 1993-1996), the reaching of the worship valley may result from the lack of the religious holiday.

• There was a sudden rise in worship at the end and the begin of each year(week 84-120, week 132-154, week 186-206) –- at this point, maybe the Christian were getting excited about the varieties of religious holiday begin.

FinalRough <- Data$worship - smooth.3RS3R.twice
plot(FinalRough~week,pch=19, cex=1, xlab="WEEK", ylab="ROUGH")
abline(h=0, lwd=3, col="blue")

• There are a number of weeks (7 to be exact) where the rough is around/above +85 –- for these particular weeks, the worship was about 85 more than the general pattern. I think there are typically more Christian celebrating during the holidays. I suspect that these extreme high rough values correspond to the special holidays,such as Easter,Christmas and New Year.

• There is one week where the rough was about -210(looks like week 108, Jan 1995) –- this means that the worship at this week was about 210 smaller than the general pattern in the smooth.

• I don’t see any general pattern in the rough when plotted as a function of time. Most of the rough values are between -51 and +85 which translate to worship values between -51 and +85.

FR<-abs(FinalRough)
aplpack::stem.leaf(FR)
## 1 | 2: represents 12
##  leaf unit: 1
##             n: 209
##    89    0* | 00000000000000000000000000000000000000000000000000000000000000000000000000001111222223334
##   (18)   0. | 566667777888888999
##   102    1* | 001112223344444
##    87    1. | 55566677789
##    76    2* | 000133
##    70    2. | 555566679
##    61    3* | 01133333444
##    50    3. | 5799
##    46    4* | 012234444
##    37    4. | 5667789
##    30    5* | 14
##    28    5. | 56678
##    23    6* | 11113
##    18    6. | 66
##    16    7* | 24
##          7. | 
##    14    8* | 0
## HI: 87 94 97 104 115 117 118 126 131 137 160 164 211
data<-cbind(as.vector(FR),Data)
lval(FR)
##   depth lo    hi   mids spreads
## M 105.0  9   9.0   9.00     0.0
## H  53.0  0  34.0  17.00    34.0
## E  27.0  0  56.0  28.00    56.0
## D  14.0  0  80.0  40.00    80.0
## C   7.5  0 117.5  58.75   117.5
## B   4.0  0 137.0  68.50   137.0
## A   2.5  0 162.0  81.00   162.0
## Z   1.0  0 211.0 105.50   211.0
d<-lval_plus(data,FR)
filter(d, OUT == TRUE)
##    as.vector(FR) week year month worship Fence_LO Fence_HI  OUT
## 1            118    6 1993     2     464      -51       85 TRUE
## 2            115   15 1993     4     458      -51       85 TRUE
## 3            131   52 1993    12     237      -51       85 TRUE
## 4             97   66 1994     4     493      -51       85 TRUE
## 5             94   74 1994     5     302      -51       85 TRUE
## 6            126  103 1994    12     495      -51       85 TRUE
## 7            164  104 1994    12     205      -51       85 TRUE
## 8            160  105 1995     1     234      -51       85 TRUE
## 9            211  108 1995     1     186      -51       85 TRUE
## 10            87  111 1995     2     310      -51       85 TRUE
## 11           137  120 1995     4     539      -51       85 TRUE
## 12           117  169 1996     3     535      -51       85 TRUE
## 13           104  171 1996     4     522      -51       85 TRUE
plot(FR~week,pch=19, cex=1, xlab="WEEK", ylab=" FINAL ROUGH")
abline(h=85,col="red")

As we can see from the stemplot, the distribution of the rough seems right skewed. The Median, Fourths, Eighths,Extremes are 9, 0(low) and 34(high), 0(low) and 56(high), 0(lowest) and 211(highest). The fence is (-51, 85) and there are 13 outliers in total as shown in the residual plot.

Part B:

Find some data collected over time (with at least 50 values) that would benefit with a smooth. Plot the smooth and describe the basic patterns that you see. Plot the rough and look for general patterns and any unusual values.

Here is a dataset describing Yearly Pitchers Statistics for Major League Baseball in American from 1871 to 2010 with 140 observations on the following 2 variables “Year” (baseball season year) and “N.Pitch”(number of pitchers used).

##      Data$N.Pitch smooth.3RSSH   Rough
## [1,]          541       651.00 -110.00
## [2,]          664       651.00   13.00
## [3,]          651       651.00    0.00
## [4,]          666       647.00   19.00
## [5,]          635       638.25   -3.25
## [6,]          606       627.75  -21.75

• It is not hard to see the general pattern across time, and we may suspect a positive linear relationship between year and number of pitchers– that means the number of pitchers increase as the time goes by.

• The number of the pitchers had an immediate increase in the early years between 1880 to 1890 after the development of the baseball association and the rules change.

• After that it decreased a little at 1890s, which may be resulted from the globe depression.

• The number of pitchers reached a peak at 1915, then dropped off suddenly and there was a valley about 1919. Maybe the World War I affected the environment of baseball.

• There was a sudden drop in 2010 –- at this point, maybe the 2009 financial crisis and “Occupy Wall Street” affected the economic operation of the MLB.

FinalRough <- Data$N.Pitch - smooth.3RS3R.twice
plot(FinalRough~Data$Year,
     pch=19, cex=1, xlab="Year", ylab="ROUGH")
abline(h=0,col="blue",lwd=2)

• There are a number of years (11 to be exact) where the rough is around/above +28 –- for these particular years, the number of pitchers was about 28 more than the general pattern. I suspect that these extreme high rough values correspond to the wars, depressions or other significant events happened at those years.

• There are two years where the rough was about +/-100(looks like year 1884 and year 2010) –- this means that the pitchers at those years were about 100 larger/smaller than the general pattern in the smooth. The reason why the number of pitcher is much larger from the usual in 1884 may be due to the rule change–The National League agree to allow overhead pitching. However, a big decreasing in 2010 may be economy related – maybe it was unusually globe financial crisis in 2010.

• I don’t see any general pattern in the rough when plotted as a function of time. Most of the rough values are between -16 and +28 which translate to pitch values between -16 and +28.

FR<-abs(FinalRough)
plot(FR~Data$Year,
     pch=19, cex=1, xlab="Year", ylab="ROUGH")
abline(h=28.125,col="red")

data<-cbind(as.vector(FR),Data)
aplpack::stem.leaf(FR,depth=FALSE)
## 1 | 2: represents 12
##  leaf unit: 1
##             n: 140
##    0* | 0000000000000000000000000000000000000000000000000011111111
##     t | 2223333
##     f | 44444444555555555
##     s | 66667777
##    0. | 8889999
##    1* | 00011111
##     t | 22233333
##     f | 44555
##     s | 7777
##    1. | 899
##    2* | 00
##     t | 2
##     f | 
##     s | 6
##    2. | 
##    3* | 0
## HI: 32 33 34 37 38 39 49 69 97 110
lval(FR)
##   depth lo    hi  mids spreads
## M  70.5  4   4.0  4.00     0.0
## H  35.5  0  11.5  5.75    11.5
## E  18.0  0  18.0  9.00    18.0
## D   9.5  0  32.5 16.25    32.5
## C   5.0  0  39.0 19.50    39.0
## B   3.0  0  69.0 34.50    69.0
## A   2.0  0  97.0 48.50    97.0
## Z   1.0  0 110.0 55.00   110.0
d<-lval_plus(data,FR)
filter(d, OUT == TRUE)
##     as.vector(FR) Year N.Pitch Fence_LO Fence_HI  OUT
## 1             110 2010     541  -16.875   28.125 TRUE
## 17             37 1994     470  -16.875   28.125 TRUE
## 18             32 1993     507  -16.875   28.125 TRUE
## 19             34 1992     441  -16.875   28.125 TRUE
## 42             30 1969     360  -16.875   28.125 TRUE
## 43             33 1968     287  -16.875   28.125 TRUE
## 65             39 1946     275  -16.875   28.125 TRUE
## 96             38 1915     285  -16.875   28.125 TRUE
## 111            49 1900      70  -16.875   28.125 TRUE
## 121            69 1890     186  -16.875   28.125 TRUE
## 127            97 1884     205  -16.875   28.125 TRUE

As we can see from the stemplot, the distribution of the rough seems right skewed. The Median, Fourths, Eighths,Extremes are 4, 0 and 11, 0 and 18, 0(lowest) and 110(highest). The fence is (-16.875, 28.125) and there are 11 outliers in total as shown in the residual plot.