DATASET: Video games sales in millions by game and region
ADDITIVE FIT:
library(LearnEDA)
Loading required package: vcd
Loading required package: grid
Loading required package: manipulate
top10vgsales<-read.csv("https://docs.google.com/spreadsheets/d/1ZfiCXvg6FVEvGfNYiSCABtIetzvjBoHKfoRXh65f-uc/pub?output=csv")
top10 <- top10vgsales[, -1]
dimnames(top10)[[1]] <- top10vgsales[, 1]
additive.fit <- medpolish(top10)
1: 122.24
2: 111.14
Final: 110.32
additive.fit
Median Polish Results (Dataset: "top10")
Overall: 7.345
Row Effects:
Wii Sports Super Mario Bros.
13.080 0.000
Mario Kart Wii Wii Sports Resort
0.885 0.660
Pokemon Red/Pokemon Blue Tetris
-0.715 -1.390
New Super Mario Bros. Wii Play
0.405 -0.635
New Super Mario Bros. Wii
-0.735
Column Effects:
NA_Sales EU_Sales JP_Sales Other_Sales
7.745 2.260 -1.910 -5.045
Residuals:
NA_Sales EU_Sales JP_Sales Other_Sales
Wii Sports 13.320 6.335 -14.745 -6.920
Super Mario Bros. 13.990 -6.025 1.375 -1.530
Mario Kart Wii -0.125 2.390 -2.530 0.125
Wii Sports Resort 0.000 0.745 -2.815 0.000
Pokemon Red/Pokemon Blue -3.105 0.000 5.500 -0.585
Tetris 9.500 -5.955 0.175 -0.330
New Super Mario Bros. -4.115 -0.780 0.660 0.195
Wii Play -0.425 0.230 -1.870 1.185
New Super Mario Bros. Wii 0.235 -1.810 0.000 0.695
Common: Overall sales of 7345 millions
Row Effects:
-For example, the sales of Wii Sports is on average COMMON+REFF=20.425 millions -Overall, the difference in sales of Wii Sports and Super Mario Bros. is 13.080 millions -No other video games apart from Wii Sports seem to have any significant row effect
Column Effects:
-For example, the sales in NA region is on average COMMON+CEFF=15.09 millions -Overall, the difference in sales in NA and EU is 5.485 millions -There appears to be a significant impact on sales due to region; NA is has the highest sales overall followed by EU, then Japan and other locations at last which have significantly low
RECTANGLE PLOT:
Row.Part <- with(additive.fit, row + overall)
Col.Part <- additive.fit$col
plot2way(Row.Part, Col.Part,dimnames(top10)[[1]], dimnames(top10)[[2]])
Interpretation of the above plot:
-The highest number of sales in millions according to the additive fit were of Wii Sports in North Americas. Consecutively, the least number of sales were of Tetris (least in this dataset) in the other regions (apart from NA, EU and Japan).
-It can be observed that apart from WiiSports, all other video games overall are close in number of video games sold.
-The rectangle above is significantly rotated off the vertical, meaning that the variablity in sales of video games is partly due to video games and more so due to the region in which they are sold.
RESIDUALS:
aplpack::stem.leaf(c(additive.fit$residual))
1 | 2: represents 1.2
leaf unit: 0.1
n: 36
LO: -14.745 -6.92 -6.025 -5.955
5 -4* | 1
-3. |
6 -3* | 1
8 -2. | 85
-2* |
11 -1. | 885
-1* |
13 -0. | 75
16 -0* | 431
(9) 0* | 000011122
11 0. | 667
8 1* | 13
1. |
6 2* | 3
HI: 5.5 6.335 9.5 13.32 13.99
plot(as.vector(additive.fit$residuals))
abline(1,0)
plot((top10vgsales[,2]))
Residuals: There are unusually 5 high as well as 4 low residuals. Also, there appears to be non-linear relationship in data as shown in the graph above which represents sales for NA region (y axis) by video games (x axis).
MULTIPLICATIVE FIT:
top10 <- top10vgsales[, -1]
dimnames(top10)[[1]] <- top10vgsales[, 1]
log.top10<-log10(top10)
additive.fit <- medpolish(log.top10)
1: 6.179511
2: 5.598928
3: 5.52453
Final: 5.508273
additive.fit
Median Polish Results (Dataset: "log.top10")
Overall: 0.8421102
Row Effects:
Wii Sports Super Mario Bros.
0.48489767 -0.14859513
Mario Kart Wii Wii Sports Resort
0.08280953 0.05716704
Pokemon Red/Pokemon Blue Tetris
-0.05500991 -0.35239976
New Super Mario Bros. Wii Play
0.02137152 0.00000000
New Super Mario Bros. Wii
-0.05070340
Column Effects:
NA_Sales EU_Sales JP_Sales Other_Sales
0.2980033 0.1216776 -0.1193089 -0.4279855
Residuals:
NA_Sales EU_Sales JP_Sales Other_Sales
Wii Sports -0.0070678 0.014012 -0.631358 0.0283480
Super Mario Bros. 0.4720760 -0.261310 0.258941 -0.3790388
Mario Kart Wii -0.0228938 0.063319 -0.226972 0.0228938
Wii Sports Resort 0.0000000 0.020832 -0.264094 0.0000000
Pokemon Red/Pokemon Blue -0.0331797 0.040124 0.341660 -0.3591148
Tetris 0.5777742 -0.257280 0.254911 -0.2982969
New Super Mario Bros. -0.1053428 -0.019958 0.068741 0.0269018
Wii Play 0.0069442 0.000000 -0.255934 0.0407202
New Super Mario Bros. Wii 0.0746452 -0.064280 0.000000 -0.0093128
The following can be made out of the fit for log sales:
COMMON <- 10 ^ additive.fit$overall
ROW <- 10 ^ additive.fit$row
COL <- 10 ^ additive.fit$col
RESIDUAL <- 10 ^ additive.fit$residual
Based on the above values, a fit can be calculated as:
FIT = [COMMON] x [ROW] x [COL] x [RESIDUAL]
plot(additive.fit$row)
plot(additive.fit$col)
-From the plot of row effects, log sales is again very high for Wii Sports and zero or negative for other video games
-From the plot of column effects, log sales decreases constantly with region in order pf NA, EU, JP and Other_Sales
We’ll round the residuals to look for any trend or unusual observations
round(RESIDUAL,2)
NA_Sales EU_Sales JP_Sales Other_Sales
Wii Sports 0.98 1.03 0.23 1.07
Super Mario Bros. 2.97 0.55 1.82 0.42
Mario Kart Wii 0.95 1.16 0.59 1.05
Wii Sports Resort 1.00 1.05 0.54 1.00
Pokemon Red/Pokemon Blue 0.93 1.10 2.20 0.44
Tetris 3.78 0.55 1.80 0.50
New Super Mario Bros. 0.78 0.96 1.17 1.06
Wii Play 1.02 1.00 0.55 1.10
New Super Mario Bros. Wii 1.19 0.86 1.00 0.98
Based on the residuals, a lot of values deviate from 1. This certainly due to extremely high sales of Wii Sports (at least in this dataset). Sideling this video game may help us better compare the other games
-The residuals for multiplicative fit are still better than that of additive fit. There also seems non-linear relationship in data, and hence I would use multiplicative fit for the given dataset.
olympics.speed.skating
These datasets give the Olympics winning time in the men’s speed skating.
library(LearnEDA)
times <- olympics.speed.skating[, -1]
row.names(times) <- olympics.speed.skating[, 1]
times
As the skating times increase with distance, an additive model is not appropriate here. Hence we use a multiplicative model.
log.times <- log10(times)
additive.fit <- medpolish(log.times)
1: 0.135343
2: 0.1275123
Final: 0.1272076
additive.fit
Median Polish Results (Dataset: "log.times")
Overall: 2.049451
Row Effects:
1976 1980 1984 1988 1992
0.029706154 0.014653710 0.023754338 0.000000000 0.010529099
1994 1998 2002 2006
-0.005207217 -0.016549939 -0.032626124 -0.025345116
Column Effects:
X500m X1000m X1500m X5000m X10000m
-0.4839780 -0.1859493 0.0000000 0.5576075 0.8705210
Residuals:
X500m X1000m X1500m X5000m X10000m
1976 -0.0022252 0.00617525 -0.0022252 0.01108799 0.00000000
1980 0.0000000 -0.00205269 -0.0017480 0.00389901 0.00395939
1984 -0.0072773 -0.00758643 0.0000000 0.00495273 0.00070732
1988 -0.0037751 0.00000000 0.0000000 0.00000000 -0.00183643
1992 -0.0061598 0.00016141 0.0000000 0.00563109 0.00000000
1994 0.0000000 0.00162440 0.0022127 -0.00529775 -0.00598463
1998 0.0024053 0.00209933 0.0000000 -0.00821746 -0.00287436
2002 0.0039644 -0.00363518 0.0000000 -0.00079463 0.00414732
2006 0.0017013 0.00000000 0.0010774 -0.00805245 -0.00165866
The following can be made out of the fit for log times:
To get the fit in original scale, we take 10 to the power Common, Row effects, Column effects and Residuals.
COMMON <- 10 ^ additive.fit$overall
ROW <- 10 ^ additive.fit$row
COL <- 10 ^ additive.fit$col
RESIDUAL <- 10 ^ additive.fit$residual
Based on the above values, a fit can be calculated as:
FIT = [COMMON] x [ROW] x [COL] x [RESIDUAL]
plot(additive.fit$row)
plot(additive.fit$col)
From the above graph:
-The skating times fluctuate but are generally decreasing. Does this suggest better overall performance?
-The second graph shows increasing time with distance. This doesn’t tell us much as we already expect that. Looking at residuals might help.
We’ll round the residuals to look for any trend or unusual observations
round(RESIDUAL,2)
X500m X1000m X1500m X5000m X10000m
1976 0.99 1.01 0.99 1.03 1.00
1980 1.00 1.00 1.00 1.01 1.01
1984 0.98 0.98 1.00 1.01 1.00
1988 0.99 1.00 1.00 1.00 1.00
1992 0.99 1.00 1.00 1.01 1.00
1994 1.00 1.00 1.01 0.99 0.99
1998 1.01 1.00 1.00 0.98 0.99
2002 1.01 0.99 1.00 1.00 1.01
2006 1.00 1.00 1.00 0.98 1.00
If we consider the 500m skating times, 1984 was generally faster and 1998 & 2002 were generally slower considering the residuals below 0.99 and above 1.0 as ususual.
attend <- church.2way[, -1]
row.names(attend) <- church.2way[, 1]
attend
additive.fit <- medpolish(attend)
1: 786
2: 739.25
Final: 735
additive.fit
Median Polish Results (Dataset: "attend")
Overall: 370.5156
Row Effects:
Jan Feb Mar Apr May June
-2.203125 -12.484375 7.015625 16.515625 2.015625 -20.703125
July Aug Sept Oct Nov Dec
-59.734375 -44.484375 -2.015625 11.296875 27.015625 22.984375
Column Effects:
y1993 y1994 y1995 y1996
-45.296875 -4.140625 3.718750 25.578125
Residuals:
y1993 y1994 y1995 y1996
Jan 12.98438 -29.1719 21.9688 -12.89062
Feb -17.73438 15.1094 -3.7500 1.39062
Mar -24.23438 21.6094 -17.2500 14.89062
Apr -20.73438 37.1094 -2.7500 0.39062
May -4.23438 30.6094 2.7500 -5.10938
June 0.48438 9.3281 -10.5312 -0.39062
July -0.48438 3.3594 -5.5000 1.64062
Aug 20.26562 -20.8906 14.2500 -16.60938
Sept -35.20312 -26.3594 26.7812 34.92188
Oct 8.48438 -28.6719 16.4688 -8.39062
Nov 52.76562 -17.3906 -3.2500 0.89062
Dec 29.79688 -3.3594 3.7812 -38.07812
Ordering them by both row and column effects (not necessary by column here).
attend <- attend[order(additive.fit$row), ]
attend <- attend[,order(additive.fit$col)]
additive.fit <- medpolish(attend)
1: 786
2: 739.25
Final: 735
additive.fit
Median Polish Results (Dataset: "attend")
Overall: 370.5156
Row Effects:
July Aug June Feb Jan Sept
-59.734375 -44.484375 -20.703125 -12.484375 -2.203125 -2.015625
May Mar Oct Apr Dec Nov
2.015625 7.015625 11.296875 16.515625 22.984375 27.015625
Column Effects:
y1993 y1994 y1995 y1996
-45.296875 -4.140625 3.718750 25.578125
Residuals:
y1993 y1994 y1995 y1996
July -0.48438 3.3594 -5.5000 1.64062
Aug 20.26562 -20.8906 14.2500 -16.60938
June 0.48438 9.3281 -10.5312 -0.39062
Feb -17.73438 15.1094 -3.7500 1.39062
Jan 12.98438 -29.1719 21.9688 -12.89062
Sept -35.20312 -26.3594 26.7812 34.92188
May -4.23438 30.6094 2.7500 -5.10938
Mar -24.23438 21.6094 -17.2500 14.89062
Oct 8.48438 -28.6719 16.4688 -8.39062
Apr -20.73438 37.1094 -2.7500 0.39062
Dec 29.79688 -3.3594 3.7812 -38.07812
Nov 52.76562 -17.3906 -3.2500 0.89062
From the above fit, we see that there are a lot of residuals greater than 25 in magnitude which close to some effects. I think we can find a better fit for the data than this.
Finding the coefficient k of comparision value.
cv <- with(additive.fit,outer(row, col, "*") / overall)
plot(as.vector(cv), as.vector(additive.fit$residuals), xlab = "COMPARISON VALUES", ylab ="RESIDUALS")
There might be a decreasing trend in residuals. Calculating the slope of fit to residuals.
rline(as.vector(cv), as.vector(additive.fit$residuals))$b
[1] 1.460397
Now, checking to see if any trend in residuals is removed or not.
plot(as.vector(cv),as.vector(additive.fit$residuals) -1.46 * as.vector(cv), xlab = "COMPARISON VALUES", ylab = "RESIDUAL FROM FIT", main="Checking Suitability of Line Fit")
abline(h=0, col="red")
The model would be: FIT = COMMON +ROW EFF +COLEFF +1.46CV
From the above plot, it seems though that the residuals were unaffected overall. This implies that extended fit does not help fix the residuals. ,