Exercise 3.1 See the Arbuthnot data in HistData library. The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the christening records for children born in London for every year from 1629 to 1710. It also contains the variable Ratio, giving the ratio of male to female births.
library("HistData")
library("plotly")
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
summary(Arbuthnot)
## Year Males Females Plague
## Min. :1629 Min. :2890 Min. :2722 Min. : 0.00
## 1st Qu.:1649 1st Qu.:4759 1st Qu.:4457 1st Qu.: 0.00
## Median :1670 Median :6073 Median :5718 Median : 3.00
## Mean :1670 Mean :5907 Mean :5535 Mean : 1240.70
## 3rd Qu.:1690 3rd Qu.:7576 3rd Qu.:7150 3rd Qu.: 22.25
## Max. :1710 Max. :8426 Max. :7779 Max. :68596.00
## Mortality Ratio Total
## Min. : 8393 Min. :1.011 Min. : 5.612
## 1st Qu.:12739 1st Qu.:1.048 1st Qu.: 9.199
## Median :17867 Median :1.065 Median :11.813
## Mean :17816 Mean :1.071 Mean :11.442
## 3rd Qu.:21030 3rd Qu.:1.088 3rd Qu.:14.723
## Max. :97306 Max. :1.156 Max. :16.145
Make a plot of Ratio over Year. What features stand out? Which plot do you prefer to display the tendency for more male births?
plot1 = ggplot(Arbuthnot, aes(x = Year, y = Ratio)) +
geom_line() +
geom_hline(yintercept = mean(Arbuthnot$Ratio, na.rm = T)) +
xlab('Year') +
ggtitle('Ratio Over Time') +
theme_classic()
ggplotly(plot1)
plot2 = ggplot(Arbuthnot, aes(x = Year, y = Total)) +
geom_line() +
xlab('Year') +
ylab('(Thousands)') +
ggtitle('Total # of Christenings Over Time') +
theme_classic()
ggplotly(plot2)
Use the data set WomenQueue to: (c) Make a reasonable plot showing departure from the binomial distribution.
data('WomenQueue', package = 'vcd')
library(vcd)
## Loading required package: grid
str(WomenQueue)
## table [1:11(1d)] 1 3 4 23 25 19 18 5 1 1 ...
## - attr(*, "dimnames")=List of 1
## ..$ nWomen: chr [1:11] "0" "1" "2" "3" ...
barplot(WomenQueue, xlab= "Counts'", col= "pink", main = 'Barplot for Women Queue Distribution', cex.lab= 1)
plot_fit = goodfit(WomenQueue, type = "binomial")
## Warning in goodfit(WomenQueue, type = "binomial"): size was not given,
## taken as maximum count
plot(plot_fit, xlab="Counts", ylab = 'Square Root of Frequency', main = 'Binomial Fit Comparison')
After plotting the fitted values for binomial, we could see the gaps between actual and fitted values, further proven the deviation from binomial distribution.
Suggest some reasons why the number of women in queues of length 10 might depart from a binomial distribution, Bin(n = 10, p = 1/2).
plot_fit10 = goodfit(WomenQueue, type = "binomial", par = list(prob = .5,size = 10))
plot(plot_fit10, xlab="Counts", main = 'Binomial Fit Comparison - n = 10, p = 0.5', ylab = 'Square Root of Frequency')
Using a size of 10 and probability of 0.5, the fit plot shows gaps between fitted and actual values. Binomial is not a good fit for this distribution.
Exercise 3.4 Work on the distribution of male children in families in “Saxony” by fitting a binomial distribution, specifying equal probability for boys and girls. [Hint: you need to specify both size and prob values for goodfit ().] Carry out the GOF test for this fixed binomial distribution. What is the ratio of, What do you conclude?
data('Saxony', package = 'vcd')
str(Saxony)
## table [1:13(1d)] 3 24 104 286 670 ...
## - attr(*, "dimnames")=List of 1
## ..$ nMales: chr [1:13] "0" "1" "2" "3" ...
saxony.fit = goodfit(Saxony, type = "binomial", par = list(prob = .5,size = 12))
result = summary(saxony.fit)
## Warning in summary.goodfit(saxony.fit): Chi-squared approximation may be
## incorrect
##
## Goodness-of-fit test for binomial distribution
##
## X^2 df P(> X^2)
## Pearson 249.1954 12 2.013281e-46
## Likelihood Ratio 205.4060 12 2.493625e-37
The GOF test p-value is 2.49^-37 which is less than 0.5. We reject the null hypothesis that the data has a binomial distribution. Test the additional lack of fit for the model.
saxony.fit2 = goodfit(Saxony, type = "binomial", par = list(size = 12))
summary(saxony.fit2)
##
## Goodness-of-fit test for binomial distribution
##
## X^2 df P(> X^2)
## Likelihood Ratio 97.0065 11 6.978187e-16
plot(saxony.fit, xlab="Number of Males", main = 'Visualization of Binomial Fit using p = 0.5', ylab = 'Square Root of Frequency')
plot(saxony.fit2, xlab="Number of Males", main = 'Visualization of Binomial Fit using estimated p from data', ylab = 'Square Root of Frequency')
Both plots shows departure from binomial distribution, using estimated p works slightly better than a p-value of 5