library(LearnEDAfunctions)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(vcd)
## Warning: package 'vcd' was built under R version 4.3.2
## Loading required package: grid
  1. Exploring Football Scores The dataset football in the LearnEDA package gives the number of points scored by the winning team (team1) and the losing team (team2) for a large number of American football games.
head(football)
##   winner loser
## 1     50     0
## 2      0     0
## 3     55     0
## 4     24     3
## 5     28    20
## 6      8     7
view(football)
  1. Using the bin boundaries -0.5, 6.5, 13.5, 20.5, 27.5, 34.5, 41.5, 48.5, 55.5, 62.5, 69.5, 76.5 , have R construct a histogram of the scores of the winning team (variable team1).
bins <- seq(-0.5, 76.5, 7)
bin.mids <- (bins[-1] + bins[-length(bins)]) / 2

ggplot(football, aes(winner)) +
geom_histogram(breaks = bins,
fill = "white",
color = "red")

  1. Fit a Gaussian comparison curve to these data. Use R to compute and display the raw residuals (RawRes) and the double root residuals (DRRes) for all bins of the data.
fivenum(football$winner)
## [1]  0 21 30 39 73

Let’s find the matching Gaussian parameters for our football winner times data. Here the fourths are 21 and 39, So the matching mean is m = (21 + 39) / 2 = 30 and the matching standard deviation is s = (39 − 21) / 1.349 = 13.34 So our matching Gaussian curve is N(30, 13.34).

The table below gives the observed count (d) and expected count (e) for all the intervals

s <- fit.gaussian(football$winner, bins, 30, 13.34)
options(digits=3)
(df <- data.frame(Mid=bin.mids, d=s$counts, sqrt.d=sqrt(s$counts),
Prob=s$probs, e=s$expected, sqrt.e=sqrt(s$expected),
Residual=s$residual))
##    Mid   d sqrt.d    Prob      e sqrt.e Residual
## 1    3   7   2.65 0.02795 12.997  3.605  -0.9594
## 2   10  30   5.48 0.06900 32.084  5.664  -0.1871
## 3   17  58   7.62 0.13012 60.507  7.779  -0.1628
## 4   24  92   9.59 0.18748 87.180  9.337   0.2547
## 5   31 110  10.49 0.20640 95.974  9.797   0.6915
## 6   38  71   8.43 0.17361 80.728  8.985  -0.5587
## 7   45  49   7.00 0.11157 51.882  7.203  -0.2029
## 8   52  26   5.10 0.05478 25.474  5.047   0.0518
## 9   59  15   3.87 0.02055  9.555  3.091   0.7819
## 10  66   6   2.45 0.00589  2.737  1.654   0.7950
## 11  73   1   1.00 0.00129  0.599  0.774   0.2262
  1. Use the R rootogram function to plot the residuals. Interpret the residuals in the display. Are there any extreme residuals? Is there any distinctive pattern in the residuals? Based on your comments, is a normal curve a good model for football scores of winning teams? If the normal curve is not a good model, explain why.

The figure below is a rootogram with a smooth curve on top that corresponds to the root expected counts

p <- ggplot(football, aes(winner)) +
geom_histogram(breaks = bins)
out <- ggplot_build(p)$data[[1]]
select(out, count, x, xmin, xmax)
##    count  x xmin xmax
## 1      7  3 -0.5  6.5
## 2     30 10  6.5 13.5
## 3     58 17 13.5 20.5
## 4     92 24 20.5 27.5
## 5    110 31 27.5 34.5
## 6     71 38 34.5 41.5
## 7     49 45 41.5 48.5
## 8     26 52 48.5 55.5
## 9     15 59 55.5 62.5
## 10     6 66 62.5 69.5
## 11     1 73 69.5 76.5
ggplot(out, aes(x, sqrt(count))) +
geom_col() +
geom_line(data = df,
aes(bin.mids, sqrt.e), color="red")

We will now look at residuals which compare the counts with the expected counts using the normal mode to see if the normal curve is a good model for football scores of winning teams.

Looking at the table as generated above previously, we have a few residuals like -0.9594 that are quite large. Let’s look at the hanging rootogram for more insight.

rootogram(s$counts, s$expected)

rootogram(s$counts, s$expected, type="deviation")

Here, we see that the number of small winning scores seems a bit low, and the number of large winning scores a bit high. We have some inconsistent residuals, which is why I think it is fair to say that the normal curve is not a decent approximation for our model as our data seems to have some larger residuals. Putting this into perspective in terms of the football game, it makes sense that there are fewer games with lower scores as there are plenty of opportunities in football to score. Of course, considering these are winning scores, we would expect a large number of winning scores as higher scores usually win compared to lower scores.

  1. For the team1 data, the data can be made more symmetric by applying a square root reexpression. Fit a Gaussian comparison curve to the root team1 data. (Bin the data using an appropriate set of bins, fit the Gaussian curve, and plot the residuals.) Comment on the goodness of the normal curve fit to the root data.
football$winner <- sqrt(football$winner)
view(football)

bins <- seq(-0.5, 9.5, 1)
bin.mids <- (bins[-1] + bins[-length(bins)]) / 2

ggplot(football, aes(winner)) +
geom_histogram(breaks = bins,
fill = "white",
color = "red")

fivenum(football$winner)
## [1] 0.00 4.58 5.48 6.24 8.54

Let’s find the matching Gaussian parameters for our new football winner times data. Here the fourths are 4.58 and 6.24 , So the matching mean is m = (4.58 + 6.24) / 2 = 5.41 and the matching standard deviation is s = (6.24 − 4.58) / 1.349 = 1.23 So our matching Gaussian curve is N(5.41, 1.23).

s <- fit.gaussian(football$winner, bins, 5.41, 1.23)
options(digits=3)
(df <- data.frame(Mid=bin.mids, d=s$counts, sqrt.d=sqrt(s$counts),
Prob=s$probs, e=s$expected, sqrt.e=sqrt(s$expected),
Residual=s$residual))
##    Mid   d sqrt.d     Prob        e sqrt.e Residual
## 1    0   1   1.00 0.000032   0.0149  0.122   0.8780
## 2    1   1   1.00 0.000706   0.3285  0.573   0.4268
## 3    2   5   2.24 0.008255   3.8386  1.959   0.2768
## 4    3  27   5.20 0.051236  23.8246  4.881   0.3151
## 5    4  61   7.81 0.169469  78.8032  8.877  -1.0669
## 6    5 145  12.04 0.299466 139.2515 11.800   0.2411
## 7    6 143  11.96 0.283075 131.6297 11.473   0.4853
## 8    7  65   8.06 0.143119  66.5501  8.158  -0.0956
## 9    8  16   4.00 0.038643  17.9689  4.239  -0.2390
## 10   9   1   1.00 0.005557   2.5842  1.608  -0.6075
p <- ggplot(football, aes(winner)) +
geom_histogram(breaks = bins)
out <- ggplot_build(p)$data[[1]]
select(out, count, x, xmin, xmax)
##    count x xmin xmax
## 1      1 0 -0.5  0.5
## 2      1 1  0.5  1.5
## 3      5 2  1.5  2.5
## 4     27 3  2.5  3.5
## 5     61 4  3.5  4.5
## 6    145 5  4.5  5.5
## 7    143 6  5.5  6.5
## 8     65 7  6.5  7.5
## 9     16 8  7.5  8.5
## 10     1 9  8.5  9.5
ggplot(out, aes(x, sqrt(count))) +
geom_col() +
geom_line(data = df,
aes(bin.mids, sqrt.e), color="red")

rootogram(s$counts, s$expected)

rootogram(s$counts, s$expected, type="deviation")

As we can see, the residuals as well as fit of the curve seem to have improved after using the root reexpression on the football winning scores. We still have some residuals but overall, the normal curve seems to fit the data better and we can see a clear bell curve. Overall, the normal curve seems to fit this root data better.

  1. Exploring Heights of Women Use the EDA methods to fit a Gaussian comparison curve to the heights for a sample of college women who attend introductory statistics classes. The datafile studentdata in the LearnEDA package contains the data and the relevant variables are Height and Gender.
View(studentdata)
women <- studentdata[studentdata$Gender == "female", ]
view(women)
bins <- seq(52.5, 85.5, 3)
bin.mids <- (bins[-1] + bins[-length(bins)]) / 2


ggplot(women, aes(Height)) +
geom_histogram(breaks = bins,
fill = "white",
color = "red")
## Warning: Removed 7 rows containing non-finite values (`stat_bin()`).

fivenum(women$Height)
## [1] 54.0 63.0 64.5 67.0 84.0

Let’s find the matching Gaussian parameters for our football winner times data. Here the fourths are 63.0 and 67.0 , So the matching mean is m = (63 + 67) / 2 = 65 and the matching standard deviation is s = (67 − 63) / 1.349 = 2.97 So our matching Gaussian curve is N(65, 2.97).

The table below gives the observed count (d) and expected count (e) for all the intervals

s <- fit.gaussian(women$Height, bins, 65, 2.97)
options(digits=3)
(df <- data.frame(Mid=bin.mids, d=s$counts, sqrt.d=sqrt(s$counts),
Prob=s$probs, e=s$expected, sqrt.e=sqrt(s$expected),
Residual=s$residual))
##    Mid   d sqrt.d     Prob        e   sqrt.e Residual
## 1   54   3   1.73 6.78e-04 2.95e-01  0.54288   1.1892
## 2   57   7   2.65 1.36e-02 5.93e+00  2.43446   0.2113
## 3   60  47   6.86 1.05e-01 4.57e+01  6.75813   0.0975
## 4   63 158  12.57 3.14e-01 1.37e+02 11.68430   0.8855
## 5   66 142  11.92 3.67e-01 1.60e+02 12.63305  -0.7167
## 6   69  54   7.35 1.68e-01 7.31e+01  8.54716  -1.1987
## 7   72  12   3.46 2.99e-02 1.30e+01  3.60756  -0.1435
## 8   75   2   1.41 2.05e-03 8.92e-01  0.94461   0.4696
## 9   78   2   1.41 5.34e-05 2.32e-02  0.15246   1.2618
## 10  81   0   0.00 5.23e-07 2.27e-04  0.01508  -0.0151
## 11  84   1   1.00 1.90e-09 8.27e-07  0.00091   0.9991
p <- ggplot(women, aes(Height)) +
geom_histogram(breaks = bins)
out <- ggplot_build(p)$data[[1]]
## Warning: Removed 7 rows containing non-finite values (`stat_bin()`).
select(out, count, x, xmin, xmax)
##    count  x xmin xmax
## 1      3 54 52.5 55.5
## 2      7 57 55.5 58.5
## 3     47 60 58.5 61.5
## 4    158 63 61.5 64.5
## 5    142 66 64.5 67.5
## 6     54 69 67.5 70.5
## 7     12 72 70.5 73.5
## 8      2 75 73.5 76.5
## 9      2 78 76.5 79.5
## 10     0 81 79.5 82.5
## 11     1 84 82.5 85.5
ggplot(out, aes(x, sqrt(count))) +
geom_col() +
geom_line(data = df,
aes(bin.mids, sqrt.e), color="red")

rootogram(s$counts, s$expected)

rootogram(s$counts, s$expected, type="deviation")

Looking at the residuals, we see that the number of women with small heights as well as large heights seems to be fairly high. However, there seem to be less women with heights that are somewhat in the middle of small and large heights.