1. Exploring Football Scores

The dataset football in the LearnEDAfunctions package gives the number of points scored by the winning team (team1, stored in the column winner) and the losing team (team2, stored in the column loser) for a large number of American football games.
library(LearnEDAfunctions)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(vcd)
## Loading required package: grid
head(football)
##   winner loser
## 1     50     0
## 2      0     0
## 3     55     0
## 4     24     3
## 5     28    20
## 6      8     7
(a) Using the bin boundaries -0.5, 6.5, 13.5, 20.5, 27.5, 34.5, 41.5, 48.5, 55.5, 62.5, 69.5, 76.5, have R construct a histogram of the scores of the winning team (variable team1).
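# Bin boundaries (width 7) and bin midpoints
B_Bound <- seq(-0.5, 76.5, 7)
Bim <- (B_Bound[-1] + B_Bound[-length(B_Bound)]) / 2

# Histogram of the winning scores on the requested bins
ggplot(football, aes(winner)) +
  geom_histogram(breaks = B_Bound,
                 fill = "white",
                 color = "green")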

(b) Fit a Gaussian comparison curve to these data. Use R to compute and display the raw residuals (RawRes) and the double root residuals (DRRes) for all bins of the data.
fivenum(football$winner)
## [1]  0 21 30 39 73

We use the fourths (21,39) to determine the Gaussian parameters for our football winner data with its matching mean calculated as:

m = (21 + 39) / 2 = 30

The matching standard deviation is calculated as:

s = (39 − 21) / 1.349 = 13.34

The matching Gaussian curve will therefore be N(30, 13.34).
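
The two calculations above can be reproduced in R. The sketch below hard-codes the five-number summary printed by fivenum(football$winner), so it runs without the package; the variable names (five, lower_fourth, upper_fourth, m, s_hat) are illustrative and are not used elsewhere in this analysis.

```r
# Five-number summary copied from the fivenum() output above
five <- c(0, 21, 30, 39, 73)
lower_fourth <- five[2]
upper_fourth <- five[4]

# Matching mean: midpoint of the fourths
m <- (lower_fourth + upper_fourth) / 2

# Matching standard deviation: fourth-spread divided by 1.349,
# the fourth-spread of a standard normal curve
s_hat <- (upper_fourth - lower_fourth) / 1.349

round(c(mean = m, sd = s_hat), 2)
```

Because the fourths are resistant to outliers, these matching values are not pulled around by a few unusually high or low scores.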

# Define parameters of the matching Gaussian curve
W_Mean <- 30
S_Win <- 13.34
bins <- c(-0.5, 6.5, 13.5, 20.5, 27.5, 34.5, 41.5, 48.5, 55.5, 62.5, 69.5, 76.5)
bnmids <- (bins[-length(bins)] + bins[-1]) / 2

# Plot histogram (density scale) with Gaussian comparison curve
ggplot(football, aes(x = winner)) +
  geom_histogram(aes(y = after_stat(density)), breaks = bins, fill = "grey", color = "red") +
  stat_function(fun = dnorm, args = list(mean = W_Mean, sd = S_Win),
                color = "green", linewidth = 1.5) +
  labs(x = "Winner Score", y = "Density", title = "Gaussian Comparison Curve")

The following table shows the observed count (d) and the expected count (e) for each interval.

# Fit N(30, 13.34) to the binned winning scores
s <- fit.gaussian(football$winner, B_Bound, 30, 13.34)
options(digits=3)
(df <- data.frame(Mid=bnmids, d=s$counts, sqrt.d=sqrt(s$counts),
                  Prob=s$probs, e=s$expected, sqrt.e=sqrt(s$expected),
                  Residual=s$residual))
##    Mid   d sqrt.d    Prob      e sqrt.e Residual
## 1    3   7   2.65 0.02795 12.997  3.605  -0.9594
## 2   10  30   5.48 0.06900 32.084  5.664  -0.1871
## 3   17  58   7.62 0.13012 60.507  7.779  -0.1628
## 4   24  92   9.59 0.18748 87.180  9.337   0.2547
## 5   31 110  10.49 0.20640 95.974  9.797   0.6915
## 6   38  71   8.43 0.17361 80.728  8.985  -0.5587
## 7   45  49   7.00 0.11157 51.882  7.203  -0.2029
## 8   52  26   5.10 0.05478 25.474  5.047   0.0518
## 9   59  15   3.87 0.02055  9.555  3.091   0.7819
## 10  66   6   2.45 0.00589  2.737  1.654   0.7950
## 11  73   1   1.00 0.00129  0.599  0.774   0.2262
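
The exercise asks for raw and double root residuals, whereas the Residual column above is the simple root residual sqrt(d) − sqrt(e). The following base R sketch shows one way to obtain both, assuming the observed counts printed in the table and Tukey's double root definition, sqrt(2 + 4d) − sqrt(1 + 4e) for d > 0 and 1 − sqrt(1 + 4e) for d = 0. The names d, e, RawRes, DRRes and res_table are chosen here for illustration.

```r
# Bin boundaries and observed counts copied from the table above
bins <- seq(-0.5, 76.5, by = 7)
d <- c(7, 30, 58, 92, 110, 71, 49, 26, 15, 6, 1)

# Expected counts under N(30, 13.34): bin probability times number of games
probs <- diff(pnorm(bins, mean = 30, sd = 13.34))
e <- sum(d) * probs

# Raw residual: observed minus expected
RawRes <- d - e

# Double root residual: sqrt(2 + 4d) - sqrt(1 + 4e) when d > 0,
# and 1 - sqrt(1 + 4e) when d = 0
DRRes <- ifelse(d > 0, sqrt(2 + 4 * d), 1) - sqrt(1 + 4 * e)

res_table <- data.frame(Mid = (bins[-1] + bins[-length(bins)]) / 2,
                        d = d, e = e, RawRes = RawRes, DRRes = DRRes)
print(round(res_table, 2))
```

Raw residuals tend to be largest where the counts are largest, so the double root residuals, which are on a roughly common scale across bins, are usually the easier ones to compare.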

(c) Use the R rootogram function to plot the residuals. Interpret the residuals in the display. Are there any extreme residuals? Is there any distinctive pattern in the residuals? Based on your comments, is a normal curve a good model for football scores of winning teams? If the normal curve is not a good model, explain why.

RG <- ggplot(football, aes(winner)) +
geom_histogram(breaks = B_Bound)
out <- ggplot_build(RG)$data[[1]]
select(out, count, x, xmin, xmax)
##    count  x xmin xmax
## 1      7  3 -0.5  6.5
## 2     30 10  6.5 13.5
## 3     58 17 13.5 20.5
## 4     92 24 20.5 27.5
## 5    110 31 27.5 34.5
## 6     71 38 34.5 41.5
## 7     49 45 41.5 48.5
## 8     26 52 48.5 55.5
## 9     15 59 55.5 62.5
## 10     6 66 62.5 69.5
## 11     1 73 69.5 76.5
# Root counts as bars, root expected counts as a line
ggplot(out, aes(x, sqrt(count))) +
  geom_col() +
  geom_line(data = df,
            aes(Mid, sqrt.e), color="green")

Analyzing the residuals lets us compare the observed counts with the expected counts and shows whether the normal curve fits the football scores of the winning teams.

We use a hanging rootogram to examine the fit because the previous table shows several residuals that are noticeably larger than the rest.

rootogram(s$counts, s$expected)

rootogram(s$counts, s$expected, type="deviation")

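No residual is extreme: every root residual in the table is smaller than 1 in absolute value. There is, however, a pattern. The lowest bin (scores 0 to 6) holds fewer games than the Gaussian curve expects, while the three highest bins (scores of 56 and above) hold more.

This pattern suggests that a symmetric normal curve is only a rough model for winning scores. Football games provide numerous scoring opportunities, so a winning team rarely finishes with a very low score, while a few games produce very high scores that stretch the upper tail.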

(d) For the team1 data, the data can be made more symmetric by applying a square root reexpression. Fit a Gaussian comparison curve to the root team1 data. (Bin the data using an appropriate set of bins, fit the Gaussian curve, and plot the residuals.) Comment on the goodness of the normal curve fit to the root data.
football$root_winner <- sqrt(football$winner)

# Fit a normal distribution to the transformed data
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:LearnEDAfunctions':
## 
##     farms
## The following object is masked from 'package:dplyr':
## 
##     select
fit <- fitdistr(football$root_winner, "normal")

# Create a histogram (density scale) and overlay the Gaussian fit
ggplot(football, aes(x = root_winner)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "yellow", color = "blue") +
  stat_function(fun = dnorm,
                args = list(mean = fit$estimate['mean'], sd = fit$estimate['sd']),
                color = "green", linewidth = 1.5) +
  labs(title = "Histogram with Gaussian Fit",
       x = "Square Root of Winner Scores",
       y = "Density") +
  theme_minimal()

Judged by eye, the Gaussian curve fits the square root of the winning scores well. The peak of the curve lines up with the peak of the histogram, which suggests that the transformed scores are approximately normal.

However, as in the previous analysis, the fit is weaker for the lowest scores, which the Gaussian model does not capture well.
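
The exercise also asks for binned residuals on the root scale, which the fitdistr overlay above does not provide. The sketch below is one possible way to get them in base R, matching the curve to the fourths as in part (b). The helper fit_from_fourths is hypothetical (it is not part of LearnEDAfunctions), the bin boundaries are an assumption, and the data are simulated so that the block runs on its own.

```r
# Hypothetical helper: match a Gaussian curve to the fourths of x,
# then compare observed and expected bin counts on the root scale
fit_from_fourths <- function(x, bins) {
  x <- x[is.finite(x)]
  q <- fivenum(x)
  m <- (q[2] + q[4]) / 2          # matching mean
  s_hat <- (q[4] - q[2]) / 1.349  # matching standard deviation
  d <- as.vector(table(cut(x, bins, include.lowest = TRUE)))
  e <- length(x) * diff(pnorm(bins, mean = m, sd = s_hat))
  data.frame(Mid = (bins[-1] + bins[-length(bins)]) / 2,
             d = d, e = e, Residual = sqrt(d) - sqrt(e))
}

# Simulated stand-in for sqrt(football$winner); not the real scores
set.seed(1)
x_demo <- sqrt(rpois(500, lambda = 30))
root_fit <- fit_from_fourths(x_demo, seq(0, 9, by = 0.5))
print(round(root_fit, 2))
```

With the real data, the call would be fit_from_fourths(sqrt(football$winner), seq(0, 9, by = 0.5)), and plotting Residual against Mid would show in which bins the curve over- or under-predicts.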

Q2

Use EDA methods to fit a Gaussian comparison curve to the heights of a sample of college women who attend introductory statistics classes. The data file studentdata in the LearnEDAfunctions package contains the data, and the relevant variables are Height and Gender.

View(studentdata)
CW <- studentdata[studentdata$Gender == "female", ]
view(CW)
bins <- seq(52.5, 85.5, 3)
bin.mids <- (bins[-1] + bins[-length(bins)]) / 2


ggplot(CW, aes(Height)) +
geom_histogram(breaks = bins,
fill = "blue",
color = "green")
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).

fivenum(CW$Height)
## [1] 54.0 63.0 64.5 67.0 84.0

The fourths are 63.0 and 67.0, so the matching mean is (63 + 67) / 2 = 65 and the matching standard deviation is (67 − 63) / 1.349 = 2.97. The matching Gaussian curve is N(65, 2.97).

The table below shows the observed count (d) and the expected count (e) for each interval.

# Fit N(65, 2.97) to the binned heights
Obs <- fit.gaussian(CW$Height, bins, 65, 2.97)
options(digits=3)
# Observed and expected counts for the heights
(df <- data.frame(Mid=bin.mids, d=Obs$counts, sqrt.d=sqrt(Obs$counts),
                  Prob=Obs$probs, e=Obs$expected, sqrt.e=sqrt(Obs$expected),
                  Residual=Obs$residual))
# Extract the binned counts from the histogram of heights
p <- ggplot(CW, aes(Height)) +
  geom_histogram(breaks = bins)
Out <- ggplot_build(p)$data[[1]]
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Root counts as bars, root expected counts as a line
ggplot(Out, aes(x, sqrt(count))) +
  geom_col() +
  geom_line(data = df,
            aes(Mid, sqrt.e), color="green")

library(vcd)
rootogram(Obs$counts, Obs$expected)

rootogram(Obs$counts, Obs$expected, type="deviation")

In conclusion, the residuals show more women who are either very short or very tall than the Gaussian curve predicts, and fewer women with heights in the middle range.