library(LearnEDAfunctions)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(vcd)
## Warning: package 'vcd' was built under R version 4.3.2
## Loading required package: grid
head(football)
## winner loser
## 1 50 0
## 2 0 0
## 3 55 0
## 4 24 3
## 5 28 20
## 6 8 7
view(football)
bins <- seq(-0.5, 76.5, 7)
bin.mids <- (bins[-1] + bins[-length(bins)]) / 2
ggplot(football, aes(winner)) +
geom_histogram(breaks = bins,
fill = "white",
color = "red")
fivenum(football$winner)
## [1] 0 21 30 39 73
Let’s find the matching Gaussian parameters for our football winner times data. Here the fourths are 21 and 39, So the matching mean is m = (21 + 39) / 2 = 30 and the matching standard deviation is s = (39 − 21) / 1.349 = 13.34 So our matching Gaussian curve is N(30, 13.34).
The table below gives the observed count (d) and expected count (e) for all the intervals
s <- fit.gaussian(football$winner, bins, 30, 13.34)
options(digits=3)
(df <- data.frame(Mid=bin.mids, d=s$counts, sqrt.d=sqrt(s$counts),
Prob=s$probs, e=s$expected, sqrt.e=sqrt(s$expected),
Residual=s$residual))
## Mid d sqrt.d Prob e sqrt.e Residual
## 1 3 7 2.65 0.02795 12.997 3.605 -0.9594
## 2 10 30 5.48 0.06900 32.084 5.664 -0.1871
## 3 17 58 7.62 0.13012 60.507 7.779 -0.1628
## 4 24 92 9.59 0.18748 87.180 9.337 0.2547
## 5 31 110 10.49 0.20640 95.974 9.797 0.6915
## 6 38 71 8.43 0.17361 80.728 8.985 -0.5587
## 7 45 49 7.00 0.11157 51.882 7.203 -0.2029
## 8 52 26 5.10 0.05478 25.474 5.047 0.0518
## 9 59 15 3.87 0.02055 9.555 3.091 0.7819
## 10 66 6 2.45 0.00589 2.737 1.654 0.7950
## 11 73 1 1.00 0.00129 0.599 0.774 0.2262
The figure below is a rootogram with a smooth curve on top that corresponds to the root expected counts
p <- ggplot(football, aes(winner)) +
geom_histogram(breaks = bins)
out <- ggplot_build(p)$data[[1]]
select(out, count, x, xmin, xmax)
## count x xmin xmax
## 1 7 3 -0.5 6.5
## 2 30 10 6.5 13.5
## 3 58 17 13.5 20.5
## 4 92 24 20.5 27.5
## 5 110 31 27.5 34.5
## 6 71 38 34.5 41.5
## 7 49 45 41.5 48.5
## 8 26 52 48.5 55.5
## 9 15 59 55.5 62.5
## 10 6 66 62.5 69.5
## 11 1 73 69.5 76.5
ggplot(out, aes(x, sqrt(count))) +
geom_col() +
geom_line(data = df,
aes(bin.mids, sqrt.e), color="red")
We will now look at residuals which compare the counts with the expected counts using the normal mode to see if the normal curve is a good model for football scores of winning teams.
Looking at the table as generated above previously, we have a few residuals like -0.9594 that are quite large. Let’s look at the hanging rootogram for more insight.
rootogram(s$counts, s$expected)
rootogram(s$counts, s$expected, type="deviation")
Here, we see that the number of small winning scores seems a bit low, and the number of large winning scores a bit high. We have some inconsistent residuals, which is why I think it is fair to say that the normal curve is not a decent approximation for our model as our data seems to have some larger residuals. Putting this into perspective in terms of the football game, it makes sense that there are fewer games with lower scores as there are plenty of opportunities in football to score. Of course, considering these are winning scores, we would expect a large number of winning scores as higher scores usually win compared to lower scores.
football$winner <- sqrt(football$winner)
view(football)
bins <- seq(-0.5, 9.5, 1)
bin.mids <- (bins[-1] + bins[-length(bins)]) / 2
ggplot(football, aes(winner)) +
geom_histogram(breaks = bins,
fill = "white",
color = "red")
fivenum(football$winner)
## [1] 0.00 4.58 5.48 6.24 8.54
Let’s find the matching Gaussian parameters for our new football winner times data. Here the fourths are 4.58 and 6.24 , So the matching mean is m = (4.58 + 6.24) / 2 = 5.41 and the matching standard deviation is s = (6.24 − 4.58) / 1.349 = 1.23 So our matching Gaussian curve is N(5.41, 1.23).
s <- fit.gaussian(football$winner, bins, 5.41, 1.23)
options(digits=3)
(df <- data.frame(Mid=bin.mids, d=s$counts, sqrt.d=sqrt(s$counts),
Prob=s$probs, e=s$expected, sqrt.e=sqrt(s$expected),
Residual=s$residual))
## Mid d sqrt.d Prob e sqrt.e Residual
## 1 0 1 1.00 0.000032 0.0149 0.122 0.8780
## 2 1 1 1.00 0.000706 0.3285 0.573 0.4268
## 3 2 5 2.24 0.008255 3.8386 1.959 0.2768
## 4 3 27 5.20 0.051236 23.8246 4.881 0.3151
## 5 4 61 7.81 0.169469 78.8032 8.877 -1.0669
## 6 5 145 12.04 0.299466 139.2515 11.800 0.2411
## 7 6 143 11.96 0.283075 131.6297 11.473 0.4853
## 8 7 65 8.06 0.143119 66.5501 8.158 -0.0956
## 9 8 16 4.00 0.038643 17.9689 4.239 -0.2390
## 10 9 1 1.00 0.005557 2.5842 1.608 -0.6075
p <- ggplot(football, aes(winner)) +
geom_histogram(breaks = bins)
out <- ggplot_build(p)$data[[1]]
select(out, count, x, xmin, xmax)
## count x xmin xmax
## 1 1 0 -0.5 0.5
## 2 1 1 0.5 1.5
## 3 5 2 1.5 2.5
## 4 27 3 2.5 3.5
## 5 61 4 3.5 4.5
## 6 145 5 4.5 5.5
## 7 143 6 5.5 6.5
## 8 65 7 6.5 7.5
## 9 16 8 7.5 8.5
## 10 1 9 8.5 9.5
ggplot(out, aes(x, sqrt(count))) +
geom_col() +
geom_line(data = df,
aes(bin.mids, sqrt.e), color="red")
rootogram(s$counts, s$expected)
rootogram(s$counts, s$expected, type="deviation")
As we can see, the residuals as well as fit of the curve seem to have improved after using the root reexpression on the football winning scores. We still have some residuals but overall, the normal curve seems to fit the data better and we can see a clear bell curve. Overall, the normal curve seems to fit this root data better.
View(studentdata)
women <- studentdata[studentdata$Gender == "female", ]
view(women)
bins <- seq(52.5, 85.5, 3)
bin.mids <- (bins[-1] + bins[-length(bins)]) / 2
ggplot(women, aes(Height)) +
geom_histogram(breaks = bins,
fill = "white",
color = "red")
## Warning: Removed 7 rows containing non-finite values (`stat_bin()`).
fivenum(women$Height)
## [1] 54.0 63.0 64.5 67.0 84.0
Let’s find the matching Gaussian parameters for our football winner times data. Here the fourths are 63.0 and 67.0 , So the matching mean is m = (63 + 67) / 2 = 65 and the matching standard deviation is s = (67 − 63) / 1.349 = 2.97 So our matching Gaussian curve is N(65, 2.97).
The table below gives the observed count (d) and expected count (e) for all the intervals
s <- fit.gaussian(women$Height, bins, 65, 2.97)
options(digits=3)
(df <- data.frame(Mid=bin.mids, d=s$counts, sqrt.d=sqrt(s$counts),
Prob=s$probs, e=s$expected, sqrt.e=sqrt(s$expected),
Residual=s$residual))
## Mid d sqrt.d Prob e sqrt.e Residual
## 1 54 3 1.73 6.78e-04 2.95e-01 0.54288 1.1892
## 2 57 7 2.65 1.36e-02 5.93e+00 2.43446 0.2113
## 3 60 47 6.86 1.05e-01 4.57e+01 6.75813 0.0975
## 4 63 158 12.57 3.14e-01 1.37e+02 11.68430 0.8855
## 5 66 142 11.92 3.67e-01 1.60e+02 12.63305 -0.7167
## 6 69 54 7.35 1.68e-01 7.31e+01 8.54716 -1.1987
## 7 72 12 3.46 2.99e-02 1.30e+01 3.60756 -0.1435
## 8 75 2 1.41 2.05e-03 8.92e-01 0.94461 0.4696
## 9 78 2 1.41 5.34e-05 2.32e-02 0.15246 1.2618
## 10 81 0 0.00 5.23e-07 2.27e-04 0.01508 -0.0151
## 11 84 1 1.00 1.90e-09 8.27e-07 0.00091 0.9991
p <- ggplot(women, aes(Height)) +
geom_histogram(breaks = bins)
out <- ggplot_build(p)$data[[1]]
## Warning: Removed 7 rows containing non-finite values (`stat_bin()`).
select(out, count, x, xmin, xmax)
## count x xmin xmax
## 1 3 54 52.5 55.5
## 2 7 57 55.5 58.5
## 3 47 60 58.5 61.5
## 4 158 63 61.5 64.5
## 5 142 66 64.5 67.5
## 6 54 69 67.5 70.5
## 7 12 72 70.5 73.5
## 8 2 75 73.5 76.5
## 9 2 78 76.5 79.5
## 10 0 81 79.5 82.5
## 11 1 84 82.5 85.5
ggplot(out, aes(x, sqrt(count))) +
geom_col() +
geom_line(data = df,
aes(bin.mids, sqrt.e), color="red")
rootogram(s$counts, s$expected)
rootogram(s$counts, s$expected, type="deviation")
Looking at the residuals, we see that the number of women with small heights as well as large heights seems to be fairly high. However, there seem to be less women with heights that are somewhat in the middle of small and large heights.