HW8

1. Exploring Football Scores

The dataset football in the LearnEDA package gives the number of points scored by the winning team (team1) and the losing team (team2) for a large number of American football games.

Using the bin boundaries -0.5, 6.5, 13.5, 20.5, 27.5, 34.5, 41.5, 48.5, 55.5, 62.5, 69.5, 76.5 , have R construct a histogram of the scores of the winning team (variable team1).

football <- read.delim("~/data/football.txt")
bins <- c(-0.5, 6.5, 13.5, 20.5, 27.5, 34.5, 41.5, 48.5, 55.5, 62.5, 69.5, 76.5)
bin.mids <- (bins[-1] + bins[-length(bins)]) / 2
h<- with(football,
         hist(winner, breaks = bins, xlab = "Scored Points", main = ""))

–The shape of the data distribution looks a bit right-skewed. We notice it is the approximate bell-shape of the scored points.

–The most popular number of points scored by the winning team are around 30.

Fit a Gaussian comparison curve to these data. Use R to compute and display the raw residuals (RawRes) and the double root residuals (DRRes) for all bins of the data.

h$counts <- sqrt(h$counts)
plot(h, xlab = "Scored Points", ylab = "ROOT FREQUENCY", main = "")

fivenum(football$winner)

## [1]  0 21 30 39 73

Find Mean and Standard Deviation of Matching Gaussian curve.

We use the fourth-spread of the sample to estimate the Gaussian mean and standard deviation. m = (FU + FL)/2 = 30, s = (FU − FL)/1.349 = (39-21)/1.349 = 13.3. So our matching Gaussian curve is N(30, 13.3).

s <- fit.gaussian(football$winner, bins, 30, 13.3)
options(digits=3)
data.frame(Mid=bin.mids, d=s$counts, sqrt.d=sqrt(s$counts),Prob=s$probs, e=s$expected, sqrt.e=sqrt(s$expected),Residual=s$residual)

##    Mid   d sqrt.d    Prob      e sqrt.e Residual
## 1    3   7   2.65 0.02770 12.882  3.589  -0.9434
## 2   10  30   5.48 0.06876 31.971  5.654  -0.1771
## 3   17  58   7.62 0.13015 60.519  7.779  -0.1636
## 4   24  92   9.59 0.18792 87.385  9.348   0.2437
## 5   31 110  10.49 0.20700 96.254  9.811   0.6772
## 6   38  71   8.43 0.17394 80.882  8.993  -0.5673
## 7   45  49   7.00 0.11150 51.846  7.200  -0.2004
## 8   52  26   5.10 0.05452 25.350  5.035   0.0642
## 9   59  15   3.87 0.02033  9.453  3.075   0.7984
## 10  66   6   2.45 0.00578  2.688  1.640   0.8099
## 11  73   1   1.00 0.00125  0.583  0.763   0.2366

Fitting a Gaussian comparison curve, we got the table above shown the observed count (d) and expected count (e) for all the intervals. As we know, the column RawRes contains the raw residuals that are simply d – e (no roots). The DRRes contains the so-called “double-root residuals”. Here we calculate them below.

data.frame(Count=s$counts, RawRes= s$counts-s$expected, DRRes=2*(s$residual))

##    Count RawRes  DRRes
## 1      7 -5.882 -1.887
## 2     30 -1.971 -0.354
## 3     58 -2.519 -0.327
## 4     92  4.615  0.487
## 5    110 13.746  1.354
## 6     71 -9.882 -1.135
## 7     49 -2.846 -0.401
## 8     26  0.650  0.128
## 9     15  5.547  1.597
## 10     6  3.312  1.620
## 11     1  0.417  0.473

Use the R rootogram function to plot the residuals. Interpret the residuals in the display. Are there any extreme residuals? Is there any distinctive pattern in the residuals? Based on your comments, is a normal curve a good model for football scores of winning teams? If the normal curve is not a good model, explain why.

plot(h, xlab = "Scored Points", ylab = "ROOT FREQUENCY", main = "")
lines(bin.mids, sqrt(s$expected))

rootogram(s$counts, s$expected, type="deviation")

plot(bin.mids, s$residual)
abline(h=0)

Although the number of points scored by the winning team are roughly normal in shape, we see that there are some large positive residuals on the right tail. If we explore further by inspecting residuals and look for patterns and unusually large and small values( Here it is harder to gauge “large or small” since these simple residuals don’t have the right scale.), we do see some consistent pattern — the residuals are at two ends (two tails) and in the middle for bins at the extremes.

Here we see some lack of fit of the Gaussian curve:

• the number of small pointed scores seems a bit low

• we see too many pointed scores around 30 and larger than 60.

For the team1 data, the data can be made more symmetric by applying a square root reexpression. Fit a Gaussian comparison curve to the root team1 data. (Bin the data using an appropriate set of bins, fit the Gaussian curve, and plot the residuals.) Comment on the goodness of the normal curve fit to the root data.

R<-football[,1]
bins <- seq(min(R), max(R), length.out = 15)
N <- length(bins)
bin.midpts <- (bins[1:(N - 1)] + bins[2:N]) / 2
h <- hist(R, bins, plot = F)
d <- data.frame(BIN.LO=bins[1:(N - 1)],
                BIN.HI=bins[2:N],
                BIN.MID=bin.midpts,
                COUNT=h$counts)
options(digits=3)
d$ROOT <- sqrt(d$COUNT)
##smooth the curve by applying 3RSSH smooth twice to the sequence of root counts##
d$Smooth.Root <- smooth(d$ROOT, twiceit=TRUE)
d

##    BIN.LO BIN.HI BIN.MID COUNT ROOT Smooth.Root
## 1    0.00   5.21    2.61     6 2.45        3.80
## 2    5.21  10.43    7.82    23 4.80        4.80
## 3   10.43  15.64   13.04    28 5.29        5.29
## 4   15.64  20.86   18.25    38 6.16        6.16
## 5   20.86  26.07   23.46    79 8.89        7.76
## 6   26.07  31.29   28.68    89 9.43        7.76
## 7   31.29  36.50   33.89    54 7.35        7.35
## 8   36.50  41.71   39.11    51 7.14        7.14
## 9   41.71  46.93   44.32    36 6.00        6.00
## 10  46.93  52.14   49.54    31 5.57        5.57
## 11  52.14  57.36   54.75    15 3.87        3.87
## 12  57.36  62.57   59.96     8 2.83        2.83
## 13  62.57  67.79   65.18     3 1.73        2.00
## 14  67.79  73.00   70.39     4 2.00        2.00

plot(d$BIN.MID, d$ROOT,
     xlab = "MIDPOINTS", ylab = "ROOTS",
     type = "l", lwd = 3)
lines(d$BIN.MID, d$Smooth.Root, col="red", lwd=3)

##fit a symmetric curve, histogram appears to peak = 30 ##
d$Shift.Sq <- round((bin.midpts - 30) ^ 2, 2)
d

##    BIN.LO BIN.HI BIN.MID COUNT ROOT Smooth.Root Shift.Sq
## 1    0.00   5.21    2.61     6 2.45        3.80   750.37
## 2    5.21  10.43    7.82    23 4.80        4.80   491.89
## 3   10.43  15.64   13.04    28 5.29        5.29   287.79
## 4   15.64  20.86   18.25    38 6.16        6.16   138.06
## 5   20.86  26.07   23.46    79 8.89        7.76    42.72
## 6   26.07  31.29   28.68    89 9.43        7.76     1.75
## 7   31.29  36.50   33.89    54 7.35        7.35    15.15
## 8   36.50  41.71   39.11    51 7.14        7.14    82.94
## 9   41.71  46.93   44.32    36 6.00        6.00   205.10
## 10  46.93  52.14   49.54    31 5.57        5.57   381.64
## 11  52.14  57.36   54.75    15 3.87        3.87   612.56
## 12  57.36  62.57   59.96     8 2.83        2.83   897.86
## 13  62.57  67.79   65.18     3 1.73        2.00  1237.53
## 14  67.79  73.00   70.39     4 2.00        2.00  1631.58

with(d, plot(Shift.Sq, Smooth.Root))

We see strong curvature in the graph and a reexpression is needed to straighten. We plot the shift squared against different power transformations of the smoothed roots using reexpressions p = 0.5 (roots), p = .001 (logs), p = -0.5 (reciprocal roots), and p = -1 (reciprocals), and look for a power p that seems to straighten the graph.

par(mfrow=c(2, 2))
with(d, plot(Shift.Sq, power.t(Smooth.Root, 0.5),
             main="Power = 0.5", ylab="New Y"))
with(d, plot(Shift.Sq, power.t(Smooth.Root, 0),
             main="Power = 0", ylab="New Y"))
with(d, plot(Shift.Sq, power.t(Smooth.Root, -0.5),
             main="Power = -0.5", ylab="New Y"))
with(d, plot(Shift.Sq, power.t(Smooth.Root, -1),
             main="Power = -1", ylab="New Y"))

Looking at the four graphs, we see substantial curvature in the p=0.5 and p= -1 plots and the straightening reexpression appears to fall between the values p = -0.5 and p = 0.001. We use a reexpression p = -0.25 and fit a resistant line below.

## Fit a resistant line with a square root reexpression ##
fit <- lm(I(Smooth.Root^(-0.25)) ~ Shift.Sq, data=d)
b <- coef(fit)
Final.Smooth <- (b[1] + b[2] * d$Shift.Sq) ^ (-2 / 0.25)
d$Final.S.Root <- sqrt(Final.Smooth)
d

##    BIN.LO BIN.HI BIN.MID COUNT ROOT Smooth.Root Shift.Sq Final.S.Root
## 1    0.00   5.21    2.61     6 2.45        3.80   750.37         3.59
## 2    5.21  10.43    7.82    23 4.80        4.80   491.89         4.57
## 3   10.43  15.64   13.04    28 5.29        5.29   287.79         5.59
## 4   15.64  20.86   18.25    38 6.16        6.16   138.06         6.53
## 5   20.86  26.07   23.46    79 8.89        7.76    42.72         7.23
## 6   26.07  31.29   28.68    89 9.43        7.76     1.75         7.55
## 7   31.29  36.50   33.89    54 7.35        7.35    15.15         7.44
## 8   36.50  41.71   39.11    51 7.14        7.14    82.94         6.92
## 9   41.71  46.93   44.32    36 6.00        6.00   205.10         6.08
## 10  46.93  52.14   49.54    31 5.57        5.57   381.64         5.09
## 11  52.14  57.36   54.75    15 3.87        3.87   612.56         4.07
## 12  57.36  62.57   59.96     8 2.83        2.83   897.86         3.14
## 13  62.57  67.79   65.18     3 1.73        2.00  1237.53         2.36
## 14  67.79  73.00   70.39     4 2.00        2.00  1631.58         1.73

plot(d$BIN.MID, d$ROOT,
     xlab = "MIDPOINTS", ylab = "ROOTS",
     type = "l", lwd = 3)
lines(d$BIN.MID, d$Final.S.Root)

##Plot the residuals against the bin midpoints##
d$Residual = d$ROOT - d$Final.S.Root
plot(d$BIN.MID, d$Residual)
abline(h=0)

To check the goodness of fit, we compute residuals. Although there are several extremes at the bins between 20-30 and smaller than 10, we can hardly see clear pattern in the residuals, which appears to be a reasonable fit.

2. Exploring Heights of Women

Use the EDA methods to fit a Gaussian comparison curve to the heights for a sample of college women who attend introductory statistics classes. The datafile studentdata in the LearnEDA package contains the data and the relevant variables are Height and Gender.

studentdata <- read.delim("~/data/studentdata.txt")
## Select Variables "gender" and "height"##
Data<-studentdata[which(studentdata$Gender=='female' ),c(2:3)]
## Checking missing values ##
summary(Data)

##      Height        Gender         
##  Min.   :54.0   Length:435        
##  1st Qu.:63.0   Class :character  
##  Median :64.5   Mode  :character  
##  Mean   :64.8                     
##  3rd Qu.:67.0                     
##  Max.   :84.0                     
##  NA's   :7

## Substitute missing values with median##
med.height <- median(Data$Height, na.rm = TRUE)
Data$Height[is.na(Data$Height)] <- med.height
## Bin the data and construct a histogram ##
bins <- seq(52.5,85.5,3)
bin.mids <- (bins[-1] + bins[-length(bins)]) / 2
h<- with(Data,
         hist(Height, breaks = bins, xlab = "Height", main = ""))

## Rootogram the data ##
h$counts <- sqrt(h$counts)
plot(h, xlab = "Height", ylab = "ROOT FREQUENCY", main = "")
##Fit a Gaussian comparison curve using matching mean and standard deviation##
s <- fit.gaussian(Data$Height, bins, 64.5, 2.97)
options(digits=2)
##observed count (d) and expected count (e) table##
data.frame(Mid=bin.mids, d=s$counts, sqrt.d=sqrt(s$counts),Prob=s$probs, e=s$expected, sqrt.e=sqrt(s$expected),Residual=s$residual)

##    Mid   d sqrt.d    Prob       e  sqrt.e Residual
## 1   54   3    1.7 1.2e-03 5.2e-01 7.2e-01   1.0111
## 2   57   7    2.6 2.0e-02 8.9e+00 3.0e+00  -0.3375
## 3   60  47    6.9 1.3e-01 5.9e+01 7.7e+00  -0.7946
## 4   63 165   12.8 3.4e-01 1.5e+02 1.2e+01   0.6165
## 5   66 142   11.9 3.4e-01 1.5e+02 1.2e+01  -0.3124
## 6   69  54    7.3 1.3e-01 5.9e+01 7.7e+00  -0.3018
## 7   72  12    3.5 2.0e-02 8.9e+00 3.0e+00   0.4808
## 8   75   2    1.4 1.2e-03 5.2e-01 7.2e-01   0.6933
## 9   78   2    1.4 2.6e-05 1.2e-02 1.1e-01   1.3069
## 10  81   0    0.0 2.2e-07 9.6e-05 9.8e-03  -0.0098
## 11  84   1    1.0 6.8e-10 2.9e-07 5.4e-04   0.9995

## Check Gaussian curve fitting the histogram ##
plot(h, xlab = "Height", ylab = "ROOT FREQUENCY", main = "")
lines(bin.mids, sqrt(s$expected))

## Hanging rootogram & checking residuals##
rootogram(s$counts, s$expected, type="deviation")

plot(bin.mids, s$residual)
abline(h=0)

Although the height of college women who attend introductory statistics classes are roughly normal in shape, we see that there are some large positive residuals at left and right ends. The height of college women have a long right-tail which means one tends to see more tall women than predicted from the G curve. Also, there is a negative deviation in the middle of the distribution.

In addition, we see a consistent pattern in the residual plot — the residuals are positive for bins at the extremes.This indicates that the distribution of height has heavier tails than the normal.

HW8

Xuejun Gu

11/8/2021

1. Exploring Football Scores

2. Exploring Heights of Women