Exercise 3.1 See the Arbuthnot data in HistData library. The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the christening records for children born in London for every year from 1629 to 1710. It also contains the variable Ratio, giving the ratio of male to female births.

library("HistData")
library("plotly")
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
summary(Arbuthnot)
##       Year          Males         Females         Plague        
##  Min.   :1629   Min.   :2890   Min.   :2722   Min.   :    0.00  
##  1st Qu.:1649   1st Qu.:4759   1st Qu.:4457   1st Qu.:    0.00  
##  Median :1670   Median :6073   Median :5718   Median :    3.00  
##  Mean   :1670   Mean   :5907   Mean   :5535   Mean   : 1240.70  
##  3rd Qu.:1690   3rd Qu.:7576   3rd Qu.:7150   3rd Qu.:   22.25  
##  Max.   :1710   Max.   :8426   Max.   :7779   Max.   :68596.00  
##    Mortality         Ratio           Total       
##  Min.   : 8393   Min.   :1.011   Min.   : 5.612  
##  1st Qu.:12739   1st Qu.:1.048   1st Qu.: 9.199  
##  Median :17867   Median :1.065   Median :11.813  
##  Mean   :17816   Mean   :1.071   Mean   :11.442  
##  3rd Qu.:21030   3rd Qu.:1.088   3rd Qu.:14.723  
##  Max.   :97306   Max.   :1.156   Max.   :16.145

Make a plot of Ratio over Year. What features stand out? Which plot do you prefer to display the tendency for more male births?

plot1 = ggplot(Arbuthnot, aes(x = Year, y = Ratio)) + 
          geom_line() +
          geom_hline(yintercept = mean(Arbuthnot$Ratio, na.rm = T)) +
          xlab('Year') + 
          ggtitle('Ratio Over Time') + 
          theme_classic()

ggplotly(plot1)

The Ratio Over Time chart shows the gender ratio timeline, and I noticed that the mean value is 1.07, whereby the minimum is greater than 1. Male ratio is higher than female over the entire time. I used the timeline chart with a line indicating mean to display this imbalance in gender.

plot2 = ggplot(Arbuthnot, aes(x = Year, y = Total)) + 
          geom_line() +
          xlab('Year') + 
          ylab('(Thousands)') +
          ggtitle('Total # of Christenings Over Time') + 
          theme_classic()

ggplotly(plot2)

Overall, there is an upward trend which shows increasing number of christenings. There are some huge declines as well in Year 1650, 1659 and 1704. There might be some big historical events or there were data issues.

Use the data set WomenQueue to: (c) Make a reasonable plot showing departure from the binomial distribution.

data('WomenQueue', package = 'vcd')
library(vcd)
## Loading required package: grid
str(WomenQueue)
##  table [1:11(1d)] 1 3 4 23 25 19 18 5 1 1 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ nWomen: chr [1:11] "0" "1" "2" "3" ...
barplot(WomenQueue, xlab= "Counts'", col= "pink", main = 'Barplot for Women Queue Distribution', cex.lab= 1)

plot_fit = goodfit(WomenQueue, type = "binomial")
## Warning in goodfit(WomenQueue, type = "binomial"): size was not given,
## taken as maximum count

Warning in goodfit(WomenQueue, type = “binomial”): size was not given,

taken as maximum count

plot(plot_fit, xlab="Counts", ylab = 'Square Root of Frequency', main = 'Binomial Fit Comparison')

After plotting the fitted values for binomial, we could see the gaps between actual and fitted values, further proven the deviation from binomial distribution.

Suggest some reasons why the number of women in queues of length 10 might depart from a binomial distribution, Bin(n = 10, p = 1/2).

plot_fit10 = goodfit(WomenQueue, type = "binomial", par = list(prob = .5,size = 10))
plot(plot_fit10, xlab="Counts", main = 'Binomial Fit Comparison - n = 10, p = 0.5', ylab = 'Square Root of Frequency')

Using a size of 10 and probability of 0.5, the fit plot shows gaps between fitted and actual values. Binomial is not a good fit for this distribution.

Exercise 3.4 Work on the distribution of male children in families in “Saxony” by fitting a binomial distribution, specifying equal probability for boys and girls. [Hint: you need to specify both size and prob values for goodfit ().] Carry out the GOF test for this fixed binomial distribution. What is the ratio of, What do you conclude?

data('Saxony', package = 'vcd')
str(Saxony)
##  table [1:13(1d)] 3 24 104 286 670 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ nMales: chr [1:13] "0" "1" "2" "3" ...

table [1:13(1d)] 3 24 104 286 670 …

- attr(*, “dimnames”)=List of 1 ## ..$ nMales: chr [1:13] “0” “1” “2” “3” …

saxony.fit = goodfit(Saxony, type = "binomial", par = list(prob = .5,size = 12))
result = summary(saxony.fit)
## Warning in summary.goodfit(saxony.fit): Chi-squared approximation may be
## incorrect
## 
##   Goodness-of-fit test for binomial distribution
## 
##                       X^2 df     P(> X^2)
## Pearson          249.1954 12 2.013281e-46
## Likelihood Ratio 205.4060 12 2.493625e-37

Warning in summary.goodfit(saxony.fit): Chi-squared approximation may be

incorrect

Goodness-of-fit test for binomial distribution

X^2 df P(> X^2)

Pearson 249.1954 12 2.013281e-46

Likelihood Ratio 205.4060 12 2.493625e-37

The GOF test p-value is 2.49^-37 which is less than 0.5. We reject the null hypothesis that the data has a binomial distribution. Test the additional lack of fit for the model.

saxony.fit2 = goodfit(Saxony, type = "binomial", par = list(size = 12))
summary(saxony.fit2)
## 
##   Goodness-of-fit test for binomial distribution
## 
##                      X^2 df     P(> X^2)
## Likelihood Ratio 97.0065 11 6.978187e-16
plot(saxony.fit, xlab="Number of Males", main = 'Visualization of Binomial Fit using p = 0.5', ylab = 'Square Root of Frequency')

plot(saxony.fit2, xlab="Number of Males", main = 'Visualization of Binomial Fit using estimated p from data', ylab = 'Square Root of Frequency')

Both plots shows departure from binomial distribution, using estimated p works slightly better than a p-value of 5