library(LearnEDAfunctions)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
I chose this table showing the number of people employed in a few counties in Ohio during January, March, May, and July of 2023
Employment <- read.csv("C:/Users/eclai/Downloads/Employment.csv")
Employment
## County January March May July
## 1 Allen County 44300 44400 45000 44700
## 2 Crawford County 16500 16700 17300 17200
## 3 Defiance County 16300 16300 16800 16800
## 4 Fulton County 20600 20800 21200 21000
## 5 Hancock County 38000 38800 39600 39000
we will now create an additive fit for this table and plot it
empl <- Employment[, -1]
dimnames(empl)[[1]] <- Employment[, 1]
additive.fit <- medpolish(empl)
## 1: 2050
## 2: 1925
## Final: 1912.5
additive.fit
##
## Median Polish Results (Dataset: "empl")
##
## Overall: 20850
##
## Row Effects:
## Allen County Crawford County Defiance County Fulton County Hancock County
## 23700.0 -3987.5 -4350.0 0.0 18050.0
##
## Column Effects:
## January March May July
## -250.0 -150.0 437.5 150.0
##
## Residuals:
## January March May July
## Allen County 0.0 0.0 12.5 0.0
## Crawford County -112.5 -12.5 0.0 187.5
## Defiance County 50.0 -50.0 -137.5 150.0
## Fulton County 0.0 100.0 -87.5 0.0
## Hancock County -650.0 50.0 262.5 -50.0
Row.Part <- with(additive.fit, row + overall)
Col.Part <- additive.fit$col
plot2way(Row.Part, Col.Part,
dimnames(empl)[[1]], dimnames(empl)[[2]])
Our fit has the form COMMON + ROWEFFECT + COLEFFECT + RESIDUAL
January March May July REFF
Allen County 23700.0
Crawford County -3987.5 Defiance County -4350.0
Fulton County 0.0
Hancock County 18050.0 CEFF -250.0 -150.0 437.5 150.0
If we wish to focus on the row effects (counties), we can combine the common and row effects to get the fit FIT = [COMMON + ROWEFFECT] + COLEFFECT = ROWFIT + COLEFFECT
January March May July REFF
Allen County 44550
Crawford County 16862.5 Defiance County 16500
Fulton County 20850
Hancock County 38900 CEFF -150.0 437.5 150.0 150.0
Looking at this table, specifically the row fits, we see that • the average number of people employed in Allen County is temperature is 23700 • generally, Allen County has 44550 - 16862.5 = 27687.5 more people employed than Crawford County • Hancock County tends to have 18050 more people employed than Fulton County
if we were interested in the month effects (columns), we can combine the common and column effects to get the representation FIT = [COMMON + COLEFFECT] + ROWEFFECT = COLFIT + ROWEFFECT
January March May July REFF
Allen County 23700.0
Crawford County -3987.5 Defiance County -4350.0
Fulton County 0.0
Hancock County 18050.0 CEFF 20600 20700 21287.5 21000
We see • the Employment in March is on average 20700 people. • There tend to be 21287.5 – 20700 = 587.5 people more employed in May than March • January and March have similar employment on average – March has 100 people more employed
we will now create a multiplicative fit for this table and plot the column and row effects (for log empl)
log.empl <- log10(empl)
log.empl$residuals
## NULL
additive.fit <- medpolish(log.empl)
## 1: 0.03997254
## Final: 0.03997254
options(width=60)
additive.fit
##
## Median Polish Results (Dataset: "log.empl")
##
## Overall: 4.320141
##
## Row Effects:
## Allen County Crawford County Defiance County
## 0.32870393 -0.09200146 -0.10139287
## Fulton County Hancock County
## 0.00000000 0.26980685
##
## Column Effects:
## January March May July
## -0.006560839 -0.002077980 0.006560839 0.002077980
##
## Residuals:
## January March May
## Allen County 0.00411932 0.00061570 -0.00219357
## Crawford County -0.00409507 -0.00334541 0.00334541
## Defiance County 0.00000000 -0.00448286 0.00000000
## Fulton County 0.00028674 0.00000000 -0.00036629
## Hancock County -0.00360373 0.00096154 0.00118618
## July
## Allen County -0.00061570
## Crawford County 0.00531061
## Defiance County 0.00448286
## Fulton County 0.00000000
## Hancock County -0.00096154
COMMON <- 10 ^ additive.fit$overall
ROW <- 10 ^ additive.fit$row
COL <- 10 ^ additive.fit$col
RESIDUAL <- 10 ^ additive.fit$residual
COMMON
## [1] 20899.76
ROW
## Allen County Crawford County Defiance County
## 2.1315913 0.8090932 0.7917847
## Fulton County Hancock County
## 1.0000000 1.8612592
COL
## January March May July
## 0.9850066 0.9952267 1.0152216 1.0047962
RESIDUAL
## January March May July
## Allen County 1.0095302 1.0014187 0.9949618 0.9985833
## Crawford County 0.9906151 0.9923265 1.0077328 1.0123032
## Defiance County 1.0000000 0.9897309 1.0000000 1.0103756
## Fulton County 1.0006605 1.0000000 0.9991569 1.0000000
## Hancock County 0.9917364 1.0022165 1.0027350 0.9977884
County <- c("Allen County", "Crawford County", "Defiance County", "Fulton County", "Hancock County")
Month <- c("January", "March", "May", "July")
ggplot(data.frame(County, Row_Effect = additive.fit$row),
aes(County, Row_Effect)) + geom_point()
ggplot(data.frame(Month,
Col_Effect = additive.fit$col),
aes(Month, Col_Effect)) +
geom_point()
Our fit:
January March May July REFF
Allen County 2.1315913
Crawford County -0.8090932
Defiance County -0.7917847
Fulton County 1.0000000
Hancock County 1.8612592
CEFF 0.9850066 0.9952267 1.0152216 1.0047962 20899.76
If we wish to get fitted times for each year, we multiply the row effects (reff) by the common value to get the following table. We use the abbreviation RFITS to stand for the row fits.
January March May July RFITS
Allen County 6869.833
Crawford County -1922.808
Defiance County -2119.087
Fulton County 0
Hancock County 5638.898
CEFF 0.9850066 0.9952267 1.0152216 1.0047962
We observe that Allen County has about 21% more number of people employed than Hancock County on average (6869.833/5638.898 = 1.218294)
After looking at the two fits, we can tell that our additive fit is better suited for our Employment data. We can say that the difference between the employment in the first two columns is roughly constant across rows. Likewise, the difference between any two columns is roughly a constant. So, the relationship between the columns of the table is additive, rather than multiplicative for each row.
olympics.speed.skating olympics.run.txt
These datasets give the Olympics winning time in the men’s speed skating and men’s running races as a function of the year and length of race.
olympics.speed.skating
## YEAR X500m X1000m X1500m X5000m X10000m
## 1 1976 39.17 79.32 119.38 444.48 890.59
## 2 1980 38.03 75.18 115.44 422.29 868.13
## 3 1984 38.19 75.80 118.36 432.28 879.90
## 4 1988 36.45 73.03 112.06 404.63 828.20
## 5 1992 37.14 74.85 114.81 419.97 852.12
## 6 1994 36.33 72.43 111.29 394.96 810.55
## 7 1998 35.59 70.64 107.87 382.20 795.33
## 8 2002 34.42 67.18 103.95 374.66 778.92
## 9 2006 34.82 68.89 105.97 374.68 781.57
olympics.run
## year X100m X200m X400m X800m X1500m
## 1 1972 10.14 20.00 44.66 105.90 216.30
## 2 1976 10.06 20.23 44.26 103.50 219.17
## 3 1980 10.25 20.19 44.60 105.40 218.40
## 4 1984 9.99 19.80 44.27 103.00 212.53
## 5 1988 9.92 19.75 43.87 103.45 215.96
## 6 1992 9.96 20.01 43.50 103.66 220.12
## 7 1996 9.84 19.32 43.49 102.58 215.78
## 8 2000 9.87 20.09 43.84 105.08 212.07
## 9 2004 9.85 19.79 44.00 104.45 214.18
I will choose the data set olympics.run to fit a multiplicative fit to.
times <- olympics.run[, -1]
row.names(times) <- olympics.run[, 1]
times
## X100m X200m X400m X800m X1500m
## 1972 10.14 20.00 44.66 105.90 216.30
## 1976 10.06 20.23 44.26 103.50 219.17
## 1980 10.25 20.19 44.60 105.40 218.40
## 1984 9.99 19.80 44.27 103.00 212.53
## 1988 9.92 19.75 43.87 103.45 215.96
## 1992 9.96 20.01 43.50 103.66 220.12
## 1996 9.84 19.32 43.49 102.58 215.78
## 2000 9.87 20.09 43.84 105.08 212.07
## 2004 9.85 19.79 44.00 104.45 214.18
log.times <- log10(times)
log.times
## X100m X200m X400m X800m X1500m
## 1972 1.0060380 1.301030 1.649919 2.024896 2.335057
## 1976 1.0025980 1.305996 1.646011 2.014940 2.340781
## 1980 1.0107239 1.305136 1.649335 2.022841 2.339253
## 1984 0.9995655 1.296665 1.646110 2.012837 2.327420
## 1988 0.9965117 1.295567 1.642168 2.014730 2.334373
## 1992 0.9982593 1.301247 1.638489 2.015611 2.342660
## 1996 0.9929951 1.286007 1.638389 2.011063 2.334011
## 2000 0.9943172 1.302980 1.641871 2.021520 2.326479
## 2004 0.9934362 1.296446 1.643453 2.018908 2.330779
additive.fit <- medpolish(log.times)
## 1: 0.1357058
## 2: 0.114946
## 3: 0.1132811
## Final: 0.1132811
options(width=60)
additive.fit
##
## Median Polish Results (Dataset: "log.times")
##
## Overall: 1.643266
##
## Row Effects:
## 1972 1976 1980 1984
## 0.0066529940 0.0047265652 0.0069015991 0.0000000000
## 1988 1992 1996 2000
## -0.0010980903 0.0003879230 -0.0048763170 -0.0013951792
## 2004
## -0.0002193961
##
## Column Effects:
## X100m X200m X400m X800m X1500m
## -0.6453943 -0.3466005 0.0000000 0.3726733 0.6890853
##
## Residuals:
## X100m X200m X400m X800m
## 1972 0.00151355 -0.0022882 0.00000000 0.00230395
## 1976 0.00000000 0.0046041 -0.00198088 -0.00572523
## 1980 0.00595085 0.0015695 -0.00083247 0.00000000
## 1984 0.00169407 0.0000000 0.00284380 -0.00310179
## 1988 -0.00026165 0.0000000 0.00000000 -0.00011043
## 1992 0.00000000 0.0041940 -0.00516439 -0.00071573
## 1996 0.00000000 -0.0057818 0.00000000 0.00000000
## 2000 -0.00215908 0.0077099 0.00000000 0.00697623
## 2004 -0.00421579 0.0000000 0.00040635 0.00318883
## X1500m
## 1972 -0.0039475
## 1976 0.0037035
## 1980 0.0000000
## 1984 -0.0049308
## 1988 0.0031204
## 1992 0.0099205
## 1996 0.0065365
## 2000 -0.0044766
## 2004 -0.0013527
COMMON <- 10 ^ additive.fit$overall
ROW <- 10 ^ additive.fit$row
COL <- 10 ^ additive.fit$col
RESIDUAL <- 10 ^ additive.fit$residual
COMMON
## [1] 43.98106
ROW
## 1972 1976 1980 1984 1988 1992
## 1.0154370 1.0109428 1.0160185 1.0000000 0.9974747 1.0008936
## 1996 2000 2004
## 0.9888347 0.9967926 0.9994949
COL
## X100m X200m X400m X800m X1500m
## 0.2262589 0.4501938 1.0000000 2.3587032 4.8874836
RESIDUAL
## X100m X200m X400m X800m X1500m
## 1972 1.0034911 0.9947451 1.0000000 1.0053191 0.9909517
## 1976 1.0000000 1.0106578 0.9954492 0.9869037 1.0085641
## 1980 1.0137966 1.0036205 0.9980850 1.0000000 1.0000000
## 1984 1.0039084 1.0000000 1.0065696 0.9928833 0.9887106
## 1988 0.9993977 1.0000000 1.0000000 0.9997458 1.0072108
## 1992 1.0000000 1.0097038 0.9881790 0.9983533 1.0231058
## 1996 1.0000000 0.9867753 1.0000000 1.0000000 1.0151646
## 2000 0.9950409 1.0179113 1.0000000 1.0161931 0.9897451
## 2004 0.9903398 1.0000000 1.0009361 1.0073696 0.9968901
We have fitted a multiplicative fit to our data set and can observe the common, row effects, column effects, and residuals.
church.2way
## month y1993 y1994 y1995 y1996
## 1 Jan 336 335 394 381
## 2 Feb 295 369 358 385
## 3 Mar 308 395 364 418
## 4 Apr 321 420 388 413
## 5 May 323 399 379 393
## 6 June 305 355 343 375
## 7 July 265 310 309 338
## 8 Aug 301 301 344 335
## 9 Sept 288 338 399 429
## 10 Oct 345 349 402 399
## 11 Nov 405 376 398 424
## 12 Dec 378 386 401 381
attendance <- church.2way[, -1]
row.names(attendance) <- church.2way[, 1]
attendance
## y1993 y1994 y1995 y1996
## Jan 336 335 394 381
## Feb 295 369 358 385
## Mar 308 395 364 418
## Apr 321 420 388 413
## May 323 399 379 393
## June 305 355 343 375
## July 265 310 309 338
## Aug 301 301 344 335
## Sept 288 338 399 429
## Oct 345 349 402 399
## Nov 405 376 398 424
## Dec 378 386 401 381
(additive.fit <- medpolish(attendance))
## 1: 786
## 2: 739.25
## Final: 735
##
## Median Polish Results (Dataset: "attendance")
##
## Overall: 370.5156
##
## Row Effects:
## Jan Feb Mar Apr May
## -2.203125 -12.484375 7.015625 16.515625 2.015625
## June July Aug Sept Oct
## -20.703125 -59.734375 -44.484375 -2.015625 11.296875
## Nov Dec
## 27.015625 22.984375
##
## Column Effects:
## y1993 y1994 y1995 y1996
## -45.296875 -4.140625 3.718750 25.578125
##
## Residuals:
## y1993 y1994 y1995 y1996
## Jan 12.98438 -29.1719 21.9688 -12.89062
## Feb -17.73438 15.1094 -3.7500 1.39062
## Mar -24.23438 21.6094 -17.2500 14.89062
## Apr -20.73438 37.1094 -2.7500 0.39062
## May -4.23438 30.6094 2.7500 -5.10938
## June 0.48438 9.3281 -10.5312 -0.39062
## July -0.48438 3.3594 -5.5000 1.64062
## Aug 20.26562 -20.8906 14.2500 -16.60938
## Sept -35.20312 -26.3594 26.7812 34.92188
## Oct 8.48438 -28.6719 16.4688 -8.39062
## Nov 52.76562 -17.3906 -3.2500 0.89062
## Dec 29.79688 -3.3594 3.7812 -38.07812
attendance <- attendance[order(additive.fit$row), ]
additive.fit <- medpolish(attendance)
## 1: 786
## 2: 739.25
## Final: 735
additive.fit$residual
## y1993 y1994 y1995 y1996
## July -0.484375 3.359375 -5.50000 1.640625
## Aug 20.265625 -20.890625 14.25000 -16.609375
## June 0.484375 9.328125 -10.53125 -0.390625
## Feb -17.734375 15.109375 -3.75000 1.390625
## Jan 12.984375 -29.171875 21.96875 -12.890625
## Sept -35.203125 -26.359375 26.78125 34.921875
## May -4.234375 30.609375 2.75000 -5.109375
## Mar -24.234375 21.609375 -17.25000 14.890625
## Oct 8.484375 -28.671875 16.46875 -8.390625
## Apr -20.734375 37.109375 -2.75000 0.390625
## Dec 29.796875 -3.359375 3.78125 -38.078125
## Nov 52.765625 -17.390625 -3.25000 0.890625
We have fitted an additive fit to our data set church.2way.
Looking at the row and column effects, we see • November is the month with the most church attendance. Comparing effects, November has 27.015625 - 22.984375 which is about 4 more people attend the worship at the Norcrest church,Findlay than December. July and August are the months with the lowest attendance. • In 1996, on the average, there were about 21 more people attending the worship at the Norcrest Church in Findlay than in 1995. Overall, we can see that the attendance has increased over the years. • looking at the residuals, we do see a pattern
We will now add a single term to our additive model to account for the pattern that we see in the residuals. This new model, called an extended model, has the form FIT = COMMON + ROWEFF + COLEFF + kCV where CV is the comparison value and k is a constant chosen that the new model is a good fit to the data
cv <- with(additive.fit,
outer(row, col, "*") / overall)
df <- data.frame(Comparison_Value = as.vector(cv),
Residual = as.vector(additive.fit$residuals))
ggplot(df, aes(Comparison_Value, Residual)) + geom_point()
rline(Residual ~ Comparison_Value, df)$b
## [1] 1.460397
ggplot(df, aes(Comparison_Value,
Residual - 1.460397 * Comparison_Value)) +
geom_point() +
geom_hline(yintercept = 0, color="red")
We check the suitability of our line fit by looking at the residuals and seeing if we have removed the trend from the scatterplot. It seems there was a slight improvement. The slope of this line, 1.460397, is our estimate at the coefficient k. So our improved fit to these data has the form FIT = COMMON + ROWEFF + COLEFF + 1.460397CV