The dataset football in the LearnEDA package gives the number of points scored by the winning team (team1) and the losing team (team2) for a large number of American football games.
football <- read.delim("~/data/football.txt")
bins <- c(-0.5, 6.5, 13.5, 20.5, 27.5, 34.5, 41.5, 48.5, 55.5, 62.5, 69.5, 76.5)
bin.mids <- (bins[-1] + bins[-length(bins)]) / 2
h<- with(football,
hist(winner, breaks = bins, xlab = "Scored Points", main = ""))
–The shape of the data distribution looks a bit right-skewed. We notice it is the approximate bell-shape of the scored points.
–The most popular number of points scored by the winning team are around 30.
h$counts <- sqrt(h$counts)
plot(h, xlab = "Scored Points", ylab = "ROOT FREQUENCY", main = "")
fivenum(football$winner)
## [1] 0 21 30 39 73
Find Mean and Standard Deviation of Matching Gaussian curve.
We use the fourth-spread of the sample to estimate the Gaussian mean and standard deviation. m = (FU + FL)/2 = 30, s = (FU − FL)/1.349 = (39-21)/1.349 = 13.3. So our matching Gaussian curve is N(30, 13.3).
s <- fit.gaussian(football$winner, bins, 30, 13.3)
options(digits=3)
data.frame(Mid=bin.mids, d=s$counts, sqrt.d=sqrt(s$counts),Prob=s$probs, e=s$expected, sqrt.e=sqrt(s$expected),Residual=s$residual)
## Mid d sqrt.d Prob e sqrt.e Residual
## 1 3 7 2.65 0.02770 12.882 3.589 -0.9434
## 2 10 30 5.48 0.06876 31.971 5.654 -0.1771
## 3 17 58 7.62 0.13015 60.519 7.779 -0.1636
## 4 24 92 9.59 0.18792 87.385 9.348 0.2437
## 5 31 110 10.49 0.20700 96.254 9.811 0.6772
## 6 38 71 8.43 0.17394 80.882 8.993 -0.5673
## 7 45 49 7.00 0.11150 51.846 7.200 -0.2004
## 8 52 26 5.10 0.05452 25.350 5.035 0.0642
## 9 59 15 3.87 0.02033 9.453 3.075 0.7984
## 10 66 6 2.45 0.00578 2.688 1.640 0.8099
## 11 73 1 1.00 0.00125 0.583 0.763 0.2366
Fitting a Gaussian comparison curve, we got the table above shown the observed count (d) and expected count (e) for all the intervals. As we know, the column RawRes contains the raw residuals that are simply d – e (no roots). The DRRes contains the so-called “double-root residuals”. Here we calculate them below.
data.frame(Count=s$counts, RawRes= s$counts-s$expected, DRRes=2*(s$residual))
## Count RawRes DRRes
## 1 7 -5.882 -1.887
## 2 30 -1.971 -0.354
## 3 58 -2.519 -0.327
## 4 92 4.615 0.487
## 5 110 13.746 1.354
## 6 71 -9.882 -1.135
## 7 49 -2.846 -0.401
## 8 26 0.650 0.128
## 9 15 5.547 1.597
## 10 6 3.312 1.620
## 11 1 0.417 0.473
plot(h, xlab = "Scored Points", ylab = "ROOT FREQUENCY", main = "")
lines(bin.mids, sqrt(s$expected))
rootogram(s$counts, s$expected, type="deviation")
plot(bin.mids, s$residual)
abline(h=0)
Although the number of points scored by the winning team are roughly normal in shape, we see that there are some large positive residuals on the right tail. If we explore further by inspecting residuals and look for patterns and unusually large and small values( Here it is harder to gauge “large or small” since these simple residuals don’t have the right scale.), we do see some consistent pattern — the residuals are at two ends (two tails) and in the middle for bins at the extremes.
Here we see some lack of fit of the Gaussian curve:
• the number of small pointed scores seems a bit low
• we see too many pointed scores around 30 and larger than 60.
R<-football[,1]
bins <- seq(min(R), max(R), length.out = 15)
N <- length(bins)
bin.midpts <- (bins[1:(N - 1)] + bins[2:N]) / 2
h <- hist(R, bins, plot = F)
d <- data.frame(BIN.LO=bins[1:(N - 1)],
BIN.HI=bins[2:N],
BIN.MID=bin.midpts,
COUNT=h$counts)
options(digits=3)
d$ROOT <- sqrt(d$COUNT)
##smooth the curve by applying 3RSSH smooth twice to the sequence of root counts##
d$Smooth.Root <- smooth(d$ROOT, twiceit=TRUE)
d
## BIN.LO BIN.HI BIN.MID COUNT ROOT Smooth.Root
## 1 0.00 5.21 2.61 6 2.45 3.80
## 2 5.21 10.43 7.82 23 4.80 4.80
## 3 10.43 15.64 13.04 28 5.29 5.29
## 4 15.64 20.86 18.25 38 6.16 6.16
## 5 20.86 26.07 23.46 79 8.89 7.76
## 6 26.07 31.29 28.68 89 9.43 7.76
## 7 31.29 36.50 33.89 54 7.35 7.35
## 8 36.50 41.71 39.11 51 7.14 7.14
## 9 41.71 46.93 44.32 36 6.00 6.00
## 10 46.93 52.14 49.54 31 5.57 5.57
## 11 52.14 57.36 54.75 15 3.87 3.87
## 12 57.36 62.57 59.96 8 2.83 2.83
## 13 62.57 67.79 65.18 3 1.73 2.00
## 14 67.79 73.00 70.39 4 2.00 2.00
plot(d$BIN.MID, d$ROOT,
xlab = "MIDPOINTS", ylab = "ROOTS",
type = "l", lwd = 3)
lines(d$BIN.MID, d$Smooth.Root, col="red", lwd=3)
##fit a symmetric curve, histogram appears to peak = 30 ##
d$Shift.Sq <- round((bin.midpts - 30) ^ 2, 2)
d
## BIN.LO BIN.HI BIN.MID COUNT ROOT Smooth.Root Shift.Sq
## 1 0.00 5.21 2.61 6 2.45 3.80 750.37
## 2 5.21 10.43 7.82 23 4.80 4.80 491.89
## 3 10.43 15.64 13.04 28 5.29 5.29 287.79
## 4 15.64 20.86 18.25 38 6.16 6.16 138.06
## 5 20.86 26.07 23.46 79 8.89 7.76 42.72
## 6 26.07 31.29 28.68 89 9.43 7.76 1.75
## 7 31.29 36.50 33.89 54 7.35 7.35 15.15
## 8 36.50 41.71 39.11 51 7.14 7.14 82.94
## 9 41.71 46.93 44.32 36 6.00 6.00 205.10
## 10 46.93 52.14 49.54 31 5.57 5.57 381.64
## 11 52.14 57.36 54.75 15 3.87 3.87 612.56
## 12 57.36 62.57 59.96 8 2.83 2.83 897.86
## 13 62.57 67.79 65.18 3 1.73 2.00 1237.53
## 14 67.79 73.00 70.39 4 2.00 2.00 1631.58
with(d, plot(Shift.Sq, Smooth.Root))
We see strong curvature in the graph and a reexpression is needed to straighten. We plot the shift squared against different power transformations of the smoothed roots using reexpressions p = 0.5 (roots), p = .001 (logs), p = -0.5 (reciprocal roots), and p = -1 (reciprocals), and look for a power p that seems to straighten the graph.
par(mfrow=c(2, 2))
with(d, plot(Shift.Sq, power.t(Smooth.Root, 0.5),
main="Power = 0.5", ylab="New Y"))
with(d, plot(Shift.Sq, power.t(Smooth.Root, 0),
main="Power = 0", ylab="New Y"))
with(d, plot(Shift.Sq, power.t(Smooth.Root, -0.5),
main="Power = -0.5", ylab="New Y"))
with(d, plot(Shift.Sq, power.t(Smooth.Root, -1),
main="Power = -1", ylab="New Y"))
Looking at the four graphs, we see substantial curvature in the p=0.5 and p= -1 plots and the straightening reexpression appears to fall between the values p = -0.5 and p = 0.001. We use a reexpression p = -0.25 and fit a resistant line below.
## Fit a resistant line with a square root reexpression ##
fit <- lm(I(Smooth.Root^(-0.25)) ~ Shift.Sq, data=d)
b <- coef(fit)
Final.Smooth <- (b[1] + b[2] * d$Shift.Sq) ^ (-2 / 0.25)
d$Final.S.Root <- sqrt(Final.Smooth)
d
## BIN.LO BIN.HI BIN.MID COUNT ROOT Smooth.Root Shift.Sq Final.S.Root
## 1 0.00 5.21 2.61 6 2.45 3.80 750.37 3.59
## 2 5.21 10.43 7.82 23 4.80 4.80 491.89 4.57
## 3 10.43 15.64 13.04 28 5.29 5.29 287.79 5.59
## 4 15.64 20.86 18.25 38 6.16 6.16 138.06 6.53
## 5 20.86 26.07 23.46 79 8.89 7.76 42.72 7.23
## 6 26.07 31.29 28.68 89 9.43 7.76 1.75 7.55
## 7 31.29 36.50 33.89 54 7.35 7.35 15.15 7.44
## 8 36.50 41.71 39.11 51 7.14 7.14 82.94 6.92
## 9 41.71 46.93 44.32 36 6.00 6.00 205.10 6.08
## 10 46.93 52.14 49.54 31 5.57 5.57 381.64 5.09
## 11 52.14 57.36 54.75 15 3.87 3.87 612.56 4.07
## 12 57.36 62.57 59.96 8 2.83 2.83 897.86 3.14
## 13 62.57 67.79 65.18 3 1.73 2.00 1237.53 2.36
## 14 67.79 73.00 70.39 4 2.00 2.00 1631.58 1.73
plot(d$BIN.MID, d$ROOT,
xlab = "MIDPOINTS", ylab = "ROOTS",
type = "l", lwd = 3)
lines(d$BIN.MID, d$Final.S.Root)
##Plot the residuals against the bin midpoints##
d$Residual = d$ROOT - d$Final.S.Root
plot(d$BIN.MID, d$Residual)
abline(h=0)
To check the goodness of fit, we compute residuals. Although there are several extremes at the bins between 20-30 and smaller than 10, we can hardly see clear pattern in the residuals, which appears to be a reasonable fit.
Use the EDA methods to fit a Gaussian comparison curve to the heights for a sample of college women who attend introductory statistics classes. The datafile studentdata in the LearnEDA package contains the data and the relevant variables are Height and Gender.
studentdata <- read.delim("~/data/studentdata.txt")
## Select Variables "gender" and "height"##
Data<-studentdata[which(studentdata$Gender=='female' ),c(2:3)]
## Checking missing values ##
summary(Data)
## Height Gender
## Min. :54.0 Length:435
## 1st Qu.:63.0 Class :character
## Median :64.5 Mode :character
## Mean :64.8
## 3rd Qu.:67.0
## Max. :84.0
## NA's :7
## Substitute missing values with median##
med.height <- median(Data$Height, na.rm = TRUE)
Data$Height[is.na(Data$Height)] <- med.height
## Bin the data and construct a histogram ##
bins <- seq(52.5,85.5,3)
bin.mids <- (bins[-1] + bins[-length(bins)]) / 2
h<- with(Data,
hist(Height, breaks = bins, xlab = "Height", main = ""))
## Rootogram the data ##
h$counts <- sqrt(h$counts)
plot(h, xlab = "Height", ylab = "ROOT FREQUENCY", main = "")
##Fit a Gaussian comparison curve using matching mean and standard deviation##
s <- fit.gaussian(Data$Height, bins, 64.5, 2.97)
options(digits=2)
##observed count (d) and expected count (e) table##
data.frame(Mid=bin.mids, d=s$counts, sqrt.d=sqrt(s$counts),Prob=s$probs, e=s$expected, sqrt.e=sqrt(s$expected),Residual=s$residual)
## Mid d sqrt.d Prob e sqrt.e Residual
## 1 54 3 1.7 1.2e-03 5.2e-01 7.2e-01 1.0111
## 2 57 7 2.6 2.0e-02 8.9e+00 3.0e+00 -0.3375
## 3 60 47 6.9 1.3e-01 5.9e+01 7.7e+00 -0.7946
## 4 63 165 12.8 3.4e-01 1.5e+02 1.2e+01 0.6165
## 5 66 142 11.9 3.4e-01 1.5e+02 1.2e+01 -0.3124
## 6 69 54 7.3 1.3e-01 5.9e+01 7.7e+00 -0.3018
## 7 72 12 3.5 2.0e-02 8.9e+00 3.0e+00 0.4808
## 8 75 2 1.4 1.2e-03 5.2e-01 7.2e-01 0.6933
## 9 78 2 1.4 2.6e-05 1.2e-02 1.1e-01 1.3069
## 10 81 0 0.0 2.2e-07 9.6e-05 9.8e-03 -0.0098
## 11 84 1 1.0 6.8e-10 2.9e-07 5.4e-04 0.9995
## Check Gaussian curve fitting the histogram ##
plot(h, xlab = "Height", ylab = "ROOT FREQUENCY", main = "")
lines(bin.mids, sqrt(s$expected))
## Hanging rootogram & checking residuals##
rootogram(s$counts, s$expected, type="deviation")
plot(bin.mids, s$residual)
abline(h=0)
Although the height of college women who attend introductory statistics classes are roughly normal in shape, we see that there are some large positive residuals at left and right ends. The height of college women have a long right-tail which means one tends to see more tall women than predicted from the G curve. Also, there is a negative deviation in the middle of the distribution.
In addition, we see a consistent pattern in the residual plot — the residuals are positive for bins at the extremes.This indicates that the distribution of height has heavier tails than the normal.