library(LearnEDAfunctions)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  1. Find a two-way table that you are interested in with at least 4 rows and 4 columns. Analyze the table using both additive AND multiplicative fits. Plot the additive fit. Explain your additive and multiplicative fits (what do the common, row effects, and column effects mean). Is an additive or multiplicative fit more suitable for your data? Explain.

I chose this table showing the number of people employed in a few counties in Ohio during January, March, May, and July of 2023

Employment <- read.csv("C:/Users/eclai/Downloads/Employment.csv")
Employment
##            County January March   May  July
## 1    Allen County   44300 44400 45000 44700
## 2 Crawford County   16500 16700 17300 17200
## 3 Defiance County   16300 16300 16800 16800
## 4   Fulton County   20600 20800 21200 21000
## 5  Hancock County   38000 38800 39600 39000

we will now create an additive fit for this table and plot it

empl <- Employment[, -1]
dimnames(empl)[[1]] <- Employment[, 1]
additive.fit <- medpolish(empl)
## 1: 2050
## 2: 1925
## Final: 1912.5
additive.fit
## 
## Median Polish Results (Dataset: "empl")
## 
## Overall: 20850
## 
## Row Effects:
##    Allen County Crawford County Defiance County   Fulton County  Hancock County 
##         23700.0         -3987.5         -4350.0             0.0         18050.0 
## 
## Column Effects:
## January   March     May    July 
##  -250.0  -150.0   437.5   150.0 
## 
## Residuals:
##                 January March    May  July
## Allen County        0.0   0.0   12.5   0.0
## Crawford County  -112.5 -12.5    0.0 187.5
## Defiance County    50.0 -50.0 -137.5 150.0
## Fulton County       0.0 100.0  -87.5   0.0
## Hancock County   -650.0  50.0  262.5 -50.0
Row.Part <- with(additive.fit, row + overall)
Col.Part <- additive.fit$col
plot2way(Row.Part, Col.Part,
dimnames(empl)[[1]], dimnames(empl)[[2]])

Our fit has the form COMMON + ROWEFFECT + COLEFFECT + RESIDUAL

                 January     March   May    July  REFF

Allen County 23700.0
Crawford County -3987.5 Defiance County -4350.0
Fulton County 0.0
Hancock County 18050.0 CEFF -250.0 -150.0 437.5 150.0

If we wish to focus on the row effects (counties), we can combine the common and row effects to get the fit FIT = [COMMON + ROWEFFECT] + COLEFFECT = ROWFIT + COLEFFECT

                January  March   May    July  REFF

Allen County 44550
Crawford County 16862.5 Defiance County 16500
Fulton County 20850
Hancock County 38900 CEFF -150.0 437.5 150.0 150.0

Looking at this table, specifically the row fits, we see that • the average number of people employed in Allen County is temperature is 23700 • generally, Allen County has 44550 - 16862.5 = 27687.5 more people employed than Crawford County • Hancock County tends to have 18050 more people employed than Fulton County

if we were interested in the month effects (columns), we can combine the common and column effects to get the representation FIT = [COMMON + COLEFFECT] + ROWEFFECT = COLFIT + ROWEFFECT

                January  March   May    July  REFF

Allen County 23700.0
Crawford County -3987.5 Defiance County -4350.0
Fulton County 0.0
Hancock County 18050.0 CEFF 20600 20700 21287.5 21000

We see • the Employment in March is on average 20700 people. • There tend to be 21287.5 – 20700 = 587.5 people more employed in May than March • January and March have similar employment on average – March has 100 people more employed

we will now create a multiplicative fit for this table and plot the column and row effects (for log empl)

log.empl <- log10(empl)
log.empl$residuals
## NULL
additive.fit <- medpolish(log.empl)
## 1: 0.03997254
## Final: 0.03997254
options(width=60)
additive.fit
## 
## Median Polish Results (Dataset: "log.empl")
## 
## Overall: 4.320141
## 
## Row Effects:
##    Allen County Crawford County Defiance County 
##      0.32870393     -0.09200146     -0.10139287 
##   Fulton County  Hancock County 
##      0.00000000      0.26980685 
## 
## Column Effects:
##      January        March          May         July 
## -0.006560839 -0.002077980  0.006560839  0.002077980 
## 
## Residuals:
##                     January       March         May
## Allen County     0.00411932  0.00061570 -0.00219357
## Crawford County -0.00409507 -0.00334541  0.00334541
## Defiance County  0.00000000 -0.00448286  0.00000000
## Fulton County    0.00028674  0.00000000 -0.00036629
## Hancock County  -0.00360373  0.00096154  0.00118618
##                        July
## Allen County    -0.00061570
## Crawford County  0.00531061
## Defiance County  0.00448286
## Fulton County    0.00000000
## Hancock County  -0.00096154
COMMON <- 10 ^ additive.fit$overall
ROW <- 10 ^ additive.fit$row
COL <- 10 ^ additive.fit$col
RESIDUAL <- 10 ^ additive.fit$residual
COMMON
## [1] 20899.76
ROW
##    Allen County Crawford County Defiance County 
##       2.1315913       0.8090932       0.7917847 
##   Fulton County  Hancock County 
##       1.0000000       1.8612592
COL
##   January     March       May      July 
## 0.9850066 0.9952267 1.0152216 1.0047962
RESIDUAL
##                   January     March       May      July
## Allen County    1.0095302 1.0014187 0.9949618 0.9985833
## Crawford County 0.9906151 0.9923265 1.0077328 1.0123032
## Defiance County 1.0000000 0.9897309 1.0000000 1.0103756
## Fulton County   1.0006605 1.0000000 0.9991569 1.0000000
## Hancock County  0.9917364 1.0022165 1.0027350 0.9977884
County <- c("Allen County", "Crawford County", "Defiance County", "Fulton County", "Hancock County")
Month <- c("January", "March", "May", "July")
ggplot(data.frame(County, Row_Effect = additive.fit$row),
aes(County, Row_Effect)) + geom_point()

ggplot(data.frame(Month,
Col_Effect = additive.fit$col),
aes(Month, Col_Effect)) +
geom_point()

Our fit:

                January      March       May        July      REFF

Allen County 2.1315913
Crawford County -0.8090932
Defiance County -0.7917847
Fulton County 1.0000000
Hancock County 1.8612592
CEFF 0.9850066 0.9952267 1.0152216 1.0047962 20899.76

If we wish to get fitted times for each year, we multiply the row effects (reff) by the common value to get the following table. We use the abbreviation RFITS to stand for the row fits.

                January      March       May        July      RFITS

Allen County 6869.833
Crawford County -1922.808
Defiance County -2119.087
Fulton County 0
Hancock County 5638.898
CEFF 0.9850066 0.9952267 1.0152216 1.0047962

We observe that Allen County has about 21% more number of people employed than Hancock County on average (6869.833/5638.898 = 1.218294)

After looking at the two fits, we can tell that our additive fit is better suited for our Employment data. We can say that the difference between the employment in the first two columns is roughly constant across rows. Likewise, the difference between any two columns is roughly a constant. So, the relationship between the columns of the table is additive, rather than multiplicative for each row.

  1. Fit a multiplicative fit to ONE of these two datasets:

olympics.speed.skating olympics.run.txt

These datasets give the Olympics winning time in the men’s speed skating and men’s running races as a function of the year and length of race.

olympics.speed.skating
##   YEAR X500m X1000m X1500m X5000m X10000m
## 1 1976 39.17  79.32 119.38 444.48  890.59
## 2 1980 38.03  75.18 115.44 422.29  868.13
## 3 1984 38.19  75.80 118.36 432.28  879.90
## 4 1988 36.45  73.03 112.06 404.63  828.20
## 5 1992 37.14  74.85 114.81 419.97  852.12
## 6 1994 36.33  72.43 111.29 394.96  810.55
## 7 1998 35.59  70.64 107.87 382.20  795.33
## 8 2002 34.42  67.18 103.95 374.66  778.92
## 9 2006 34.82  68.89 105.97 374.68  781.57
olympics.run
##   year X100m X200m X400m  X800m X1500m
## 1 1972 10.14 20.00 44.66 105.90 216.30
## 2 1976 10.06 20.23 44.26 103.50 219.17
## 3 1980 10.25 20.19 44.60 105.40 218.40
## 4 1984  9.99 19.80 44.27 103.00 212.53
## 5 1988  9.92 19.75 43.87 103.45 215.96
## 6 1992  9.96 20.01 43.50 103.66 220.12
## 7 1996  9.84 19.32 43.49 102.58 215.78
## 8 2000  9.87 20.09 43.84 105.08 212.07
## 9 2004  9.85 19.79 44.00 104.45 214.18

I will choose the data set olympics.run to fit a multiplicative fit to.

times <- olympics.run[, -1]
row.names(times) <- olympics.run[, 1]
times
##      X100m X200m X400m  X800m X1500m
## 1972 10.14 20.00 44.66 105.90 216.30
## 1976 10.06 20.23 44.26 103.50 219.17
## 1980 10.25 20.19 44.60 105.40 218.40
## 1984  9.99 19.80 44.27 103.00 212.53
## 1988  9.92 19.75 43.87 103.45 215.96
## 1992  9.96 20.01 43.50 103.66 220.12
## 1996  9.84 19.32 43.49 102.58 215.78
## 2000  9.87 20.09 43.84 105.08 212.07
## 2004  9.85 19.79 44.00 104.45 214.18
log.times <- log10(times)
log.times
##          X100m    X200m    X400m    X800m   X1500m
## 1972 1.0060380 1.301030 1.649919 2.024896 2.335057
## 1976 1.0025980 1.305996 1.646011 2.014940 2.340781
## 1980 1.0107239 1.305136 1.649335 2.022841 2.339253
## 1984 0.9995655 1.296665 1.646110 2.012837 2.327420
## 1988 0.9965117 1.295567 1.642168 2.014730 2.334373
## 1992 0.9982593 1.301247 1.638489 2.015611 2.342660
## 1996 0.9929951 1.286007 1.638389 2.011063 2.334011
## 2000 0.9943172 1.302980 1.641871 2.021520 2.326479
## 2004 0.9934362 1.296446 1.643453 2.018908 2.330779
additive.fit <- medpolish(log.times)
## 1: 0.1357058
## 2: 0.114946
## 3: 0.1132811
## Final: 0.1132811
options(width=60)
additive.fit
## 
## Median Polish Results (Dataset: "log.times")
## 
## Overall: 1.643266
## 
## Row Effects:
##          1972          1976          1980          1984 
##  0.0066529940  0.0047265652  0.0069015991  0.0000000000 
##          1988          1992          1996          2000 
## -0.0010980903  0.0003879230 -0.0048763170 -0.0013951792 
##          2004 
## -0.0002193961 
## 
## Column Effects:
##      X100m      X200m      X400m      X800m     X1500m 
## -0.6453943 -0.3466005  0.0000000  0.3726733  0.6890853 
## 
## Residuals:
##            X100m      X200m       X400m       X800m
## 1972  0.00151355 -0.0022882  0.00000000  0.00230395
## 1976  0.00000000  0.0046041 -0.00198088 -0.00572523
## 1980  0.00595085  0.0015695 -0.00083247  0.00000000
## 1984  0.00169407  0.0000000  0.00284380 -0.00310179
## 1988 -0.00026165  0.0000000  0.00000000 -0.00011043
## 1992  0.00000000  0.0041940 -0.00516439 -0.00071573
## 1996  0.00000000 -0.0057818  0.00000000  0.00000000
## 2000 -0.00215908  0.0077099  0.00000000  0.00697623
## 2004 -0.00421579  0.0000000  0.00040635  0.00318883
##          X1500m
## 1972 -0.0039475
## 1976  0.0037035
## 1980  0.0000000
## 1984 -0.0049308
## 1988  0.0031204
## 1992  0.0099205
## 1996  0.0065365
## 2000 -0.0044766
## 2004 -0.0013527
COMMON <- 10 ^ additive.fit$overall
ROW <- 10 ^ additive.fit$row
COL <- 10 ^ additive.fit$col
RESIDUAL <- 10 ^ additive.fit$residual
COMMON
## [1] 43.98106
ROW
##      1972      1976      1980      1984      1988      1992 
## 1.0154370 1.0109428 1.0160185 1.0000000 0.9974747 1.0008936 
##      1996      2000      2004 
## 0.9888347 0.9967926 0.9994949
COL
##     X100m     X200m     X400m     X800m    X1500m 
## 0.2262589 0.4501938 1.0000000 2.3587032 4.8874836
RESIDUAL
##          X100m     X200m     X400m     X800m    X1500m
## 1972 1.0034911 0.9947451 1.0000000 1.0053191 0.9909517
## 1976 1.0000000 1.0106578 0.9954492 0.9869037 1.0085641
## 1980 1.0137966 1.0036205 0.9980850 1.0000000 1.0000000
## 1984 1.0039084 1.0000000 1.0065696 0.9928833 0.9887106
## 1988 0.9993977 1.0000000 1.0000000 0.9997458 1.0072108
## 1992 1.0000000 1.0097038 0.9881790 0.9983533 1.0231058
## 1996 1.0000000 0.9867753 1.0000000 1.0000000 1.0151646
## 2000 0.9950409 1.0179113 1.0000000 1.0161931 0.9897451
## 2004 0.9903398 1.0000000 1.0009361 1.0073696 0.9968901

We have fitted a multiplicative fit to our data set and can observe the common, row effects, column effects, and residuals.

  1. For the following data, analyze using an extended fit. Interpret the fit and the residuals. Response: average (median) attendance of worship at Norcrest church,Findlay,OH classified by month and year. Dataset: church.2way
church.2way
##    month y1993 y1994 y1995 y1996
## 1    Jan   336   335   394   381
## 2    Feb   295   369   358   385
## 3    Mar   308   395   364   418
## 4    Apr   321   420   388   413
## 5    May   323   399   379   393
## 6   June   305   355   343   375
## 7   July   265   310   309   338
## 8    Aug   301   301   344   335
## 9   Sept   288   338   399   429
## 10   Oct   345   349   402   399
## 11   Nov   405   376   398   424
## 12   Dec   378   386   401   381
attendance <- church.2way[, -1]
row.names(attendance) <- church.2way[, 1]
attendance
##      y1993 y1994 y1995 y1996
## Jan    336   335   394   381
## Feb    295   369   358   385
## Mar    308   395   364   418
## Apr    321   420   388   413
## May    323   399   379   393
## June   305   355   343   375
## July   265   310   309   338
## Aug    301   301   344   335
## Sept   288   338   399   429
## Oct    345   349   402   399
## Nov    405   376   398   424
## Dec    378   386   401   381
(additive.fit <- medpolish(attendance))
## 1: 786
## 2: 739.25
## Final: 735
## 
## Median Polish Results (Dataset: "attendance")
## 
## Overall: 370.5156
## 
## Row Effects:
##        Jan        Feb        Mar        Apr        May 
##  -2.203125 -12.484375   7.015625  16.515625   2.015625 
##       June       July        Aug       Sept        Oct 
## -20.703125 -59.734375 -44.484375  -2.015625  11.296875 
##        Nov        Dec 
##  27.015625  22.984375 
## 
## Column Effects:
##      y1993      y1994      y1995      y1996 
## -45.296875  -4.140625   3.718750  25.578125 
## 
## Residuals:
##          y1993    y1994    y1995     y1996
## Jan   12.98438 -29.1719  21.9688 -12.89062
## Feb  -17.73438  15.1094  -3.7500   1.39062
## Mar  -24.23438  21.6094 -17.2500  14.89062
## Apr  -20.73438  37.1094  -2.7500   0.39062
## May   -4.23438  30.6094   2.7500  -5.10938
## June   0.48438   9.3281 -10.5312  -0.39062
## July  -0.48438   3.3594  -5.5000   1.64062
## Aug   20.26562 -20.8906  14.2500 -16.60938
## Sept -35.20312 -26.3594  26.7812  34.92188
## Oct    8.48438 -28.6719  16.4688  -8.39062
## Nov   52.76562 -17.3906  -3.2500   0.89062
## Dec   29.79688  -3.3594   3.7812 -38.07812
attendance <- attendance[order(additive.fit$row), ]
additive.fit <- medpolish(attendance)
## 1: 786
## 2: 739.25
## Final: 735
additive.fit$residual
##           y1993      y1994     y1995      y1996
## July  -0.484375   3.359375  -5.50000   1.640625
## Aug   20.265625 -20.890625  14.25000 -16.609375
## June   0.484375   9.328125 -10.53125  -0.390625
## Feb  -17.734375  15.109375  -3.75000   1.390625
## Jan   12.984375 -29.171875  21.96875 -12.890625
## Sept -35.203125 -26.359375  26.78125  34.921875
## May   -4.234375  30.609375   2.75000  -5.109375
## Mar  -24.234375  21.609375 -17.25000  14.890625
## Oct    8.484375 -28.671875  16.46875  -8.390625
## Apr  -20.734375  37.109375  -2.75000   0.390625
## Dec   29.796875  -3.359375   3.78125 -38.078125
## Nov   52.765625 -17.390625  -3.25000   0.890625

We have fitted an additive fit to our data set church.2way.

Looking at the row and column effects, we see • November is the month with the most church attendance. Comparing effects, November has 27.015625 - 22.984375 which is about 4 more people attend the worship at the Norcrest church,Findlay than December. July and August are the months with the lowest attendance. • In 1996, on the average, there were about 21 more people attending the worship at the Norcrest Church in Findlay than in 1995. Overall, we can see that the attendance has increased over the years. • looking at the residuals, we do see a pattern

We will now add a single term to our additive model to account for the pattern that we see in the residuals. This new model, called an extended model, has the form FIT = COMMON + ROWEFF + COLEFF + kCV where CV is the comparison value and k is a constant chosen that the new model is a good fit to the data

cv <- with(additive.fit,
outer(row, col, "*") / overall)
df <- data.frame(Comparison_Value = as.vector(cv),
Residual = as.vector(additive.fit$residuals))
ggplot(df, aes(Comparison_Value, Residual)) + geom_point()

rline(Residual ~ Comparison_Value, df)$b
## [1] 1.460397
ggplot(df, aes(Comparison_Value,
Residual - 1.460397 * Comparison_Value)) +
geom_point() +
geom_hline(yintercept = 0, color="red")

We check the suitability of our line fit by looking at the residuals and seeing if we have removed the trend from the scatterplot. It seems there was a slight improvement. The slope of this line, 1.460397, is our estimate at the coefficient k. So our improved fit to these data has the form FIT = COMMON + ROWEFF + COLEFF + 1.460397CV