Assignment 6: Smoothing

My data choice: church.tseries

For the purpose of this assignment, I will be assigning numbers to the attendance numbers, representing the week number that the corresponding data point was collected. The first week that the attendance was counted is week 1, and the last data point is week 209.

head(church.tseries)

##   year month worship
## 1 1993     1     336
## 2 1993     1     264
## 3 1993     1     382
## 4 1993     1     325
## 5 1993     1     346
## 6 1993     2     464

church.data<-data.frame(
  Week = 1:209,
  Attendance = church.tseries$worship
)
head(church.data)

##   Week Attendance
## 1    1        336
## 2    2        264
## 3    3        382
## 4    4        325
## 5    5        346
## 6    6        464

ggplot(church.data, aes(Week, Attendance)) +
  geom_point() +
  labs(title = "Weekly Worship Attendance", subtitle="from January 1993 to December 1996")

1. Using R, perform a resistant smooth (3RSSH, twice) to your data. Save the smooth (the fit) and the rough (the residuals).

church.data <- mutate(church.data, smooth.3RS3R.twice = as.vector(smooth(Attendance, kind="3RS3R", twiceit=TRUE)))

church.data <- mutate(church.data, 
          FinalRough = Attendance - smooth.3RS3R.twice)

head(church.data)

##   Week Attendance smooth.3RS3R.twice FinalRough
## 1    1        336                336          0
## 2    2        264                336        -72
## 3    3        382                336         46
## 4    4        325                346        -21
## 5    5        346                346          0
## 6    6        464                346        118

2. Plot the smooth (using a smooth curve) and describe the general patterns that you see. Don’t assume that anything is obvious – pretend that you are explaining this to someone who doesn’t have any background in statistics.

ggplot(church.data, aes(Week, Attendance)) +
  geom_point() +
  geom_line(aes(Week, smooth.3RS3R.twice), color="red")

Based on the smooth curve, the worship attendance numbers seem to increase and decrease in cycles. This makes me believe there may be certain times of the year where attendance increases or decreases. Despite the peaks and valleys, the attendance overall seems to be trending slightly upward.

3. Plot the rough (as a time series plotting individual points). Do you see any general patterns in the rough?

ggplot(church.data, aes(Week, FinalRough)) +
  geom_point() +
  geom_hline(yintercept=0, color="blue")

Similar to the graph of the smooth curve, the rough values seem to be somewhat oscillating between negative and positive values. There is a group of three extreme low rough values just after week 100, which may indicate some major event that would make attendance decrease for a few weeks.

4. Construct a stemplot and letter value display of the sizes of the rough. (The size is the absolute value of the rough.) Set up fences and look for outliers. Summarize what you have learned. (What is a typical size of a residual? Are there any unusually large residuals?)

church.data <- mutate(church.data, RoughSize = abs(FinalRough))
stem.leaf(church.data$RoughSize, m=2)

## 1 | 2: represents 12
##  leaf unit: 1
##             n: 209
##    89    0* | 00000000000000000000000000000000000000000000000000000000000000000000000000001111222223334
##   (18)   0. | 566667777888888999
##   102    1* | 001112223344444
##    87    1. | 55566677789
##    76    2* | 000133
##    70    2. | 555566679
##    61    3* | 01133333444
##    50    3. | 5799
##    46    4* | 012234444
##    37    4. | 5667789
##    30    5* | 14
##    28    5. | 56678
##    23    6* | 11113
##    18    6. | 66
##    16    7* | 24
##          7. | 
##    14    8* | 0
## HI: 87 94 97 104 115 117 118 126 131 137 160 164 211

fivenum(church.data$RoughSize)

## [1]   0   0   9  34 211

lval(church.data$RoughSize)

##   depth lo    hi   mids spreads
## M 105.0  9   9.0   9.00     0.0
## H  53.0  0  34.0  17.00    34.0
## E  27.0  0  56.0  28.00    56.0
## D  14.0  0  80.0  40.00    80.0
## C   7.5  0 117.5  58.75   117.5
## B   4.0  0 137.0  68.50   137.0
## A   2.5  0 162.0  81.00   162.0
## Z   1.0  0 211.0 105.50   211.0

Using the five-number summary, we see the rough sizes have \(F_L=0\) and \(F_U=34\). Using these values, we can the fences for outliers: \[dF=34; STEP=1.5(34)=51\] Because \(F_L=0\) and all rough sizes are non-negative, we only need to calculate upper fence values: \[fence_{upper}=34+51=85\\FENCE_{upper}=34+2(51)=136\] We have mild outliers of 87 (week 111), 94 (week 74), 97 (week 66), 104 (week 171), 115 (week 15), 117 (week 169), 118 (week 6), 126 (week 103), and 131 (week 52). We have extreme outliers of 137 (week 120), 160 (week 105), 164 (week 104), 211 (week 108).