Exercise 3.1 See the Arbuthnot data in HistData library. The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the christening records for children born in London for every year from 1629 to 1710. It also contains the variable Ratio, giving the ratio of male to female births.
library(HistData)
library(ggplot2)
data('Arbuthnot', package = 'HistData')
maxRatio <- max(Arbuthnot$Ratio)
minRatio <- min(Arbuthnot$Ratio)
- Make a plot of Ratio over Year. What features stand out? Which plot do you prefer to display the tendency for more male births
ggplot(Arbuthnot, aes(x = Arbuthnot$Year, y=Arbuthnot$Ratio))+
geom_line() +
geom_point(aes(color=Ratio)) +
labs(title='Male to Female Birth Ratio',x="Year", y="Ratio") +
theme_classic()+
ylim(1, maxRatio)+
geom_hline(yintercept = maxRatio,color="green", linetype="dashed")+
geom_hline(yintercept = minRatio,color="red", linetype="dashed")+
geom_smooth(method = "lm")
The Ratio was always greater than 1 which meant that male births were always greater than female births. However, the trendline shows that male to female ratio has been declining over time.
The same line plot could be used to display the tendancy for more male births as it clearly shows the ratio being greater than 1 over time.
- Plot the total number of christenings, Males + Females or Total (in 000s) over time. What unusual features do you see?
ggplot(Arbuthnot, aes(x = Arbuthnot$Year, y=Arbuthnot$Total))+
geom_line() +
geom_point(aes(color=Total)) +
labs(title='Total Christenings year over year',x="Year", y="Total Christenings") +
theme_classic()+
geom_smooth(method = "lm")
It is observed that there was a decline in the total number of christenings from 1640 to 1650 but overall since 1660 it has been on a rise.
Exercise 3.3 Use the data set WomenQueue to:
- Make a reasonable plot showing departure from the binomial distribution.
library(vcd)
## Loading required package: grid
data('WomenQueue', package = 'vcd')
barplot(WomenQueue,xlab="Counts",ylab="Frequency")
fit <- goodfit(WomenQueue, type="binomial",par = list(size = 10))
summary(fit)
##
## Goodness-of-fit test for binomial distribution
##
## X^2 df P(> X^2)
## Likelihood Ratio 8.650999 8 0.3725869
plot(fit)
The Rootogram shows that there are deviations from the actual values and hence we can say that there are deviations from the binomial distribution
- Suggest some reasons why the number of women in queues of length 10 might depart from a binomial distribution, Bin(n = 10, p = 1/2).
fit2 <- goodfit(WomenQueue, type= "binomial", par = list(prob = .5,size = 10))
summary(fit2)
##
## Goodness-of-fit test for binomial distribution
##
## X^2 df P(> X^2)
## Pearson 32.56737 10 0.0003219281
## Likelihood Ratio 25.59893 9 0.0023753984
plot(fit2,xlab="Counts",main = 'Bin(n=10, p=1/2)')
distplot(WomenQueue, type = c("binomial"), conf_level = 0.95)
The Rootogram again shows that there is departure from Binomial Distribution. The Slope and the intercept of the distplot also suggests the same.
Exercise 3.4 Work on the distribution of male children in families in “Saxony” by fitting a binomial distribution, (n=12,p=1/2) , specifying equal probability for boys and girls. [Hint: you need to specify both size and prob values for goodfit ().]
- Carry out the GOF test for this fixed binomial distribution. What is the ratio of χ2/df? What do you conclude?
data('Saxony', package = 'vcd')
fit3 <- goodfit(Saxony, type = "binomial", par = list(prob = .5,size = 12))
summary(fit3)
##
## Goodness-of-fit test for binomial distribution
##
## X^2 df P(> X^2)
## Pearson 249.1954 12 2.013281e-46
## Likelihood Ratio 205.4060 12 2.493625e-37
The ratio is approximately 17. Since p-value is below the critical value, we can say that it is not a binomial distribution.
- Test the additional lack of fit for the model Bin(n=12,p=1/2) compared to the model Bin(n=12,p=p ̂) where p ̂ is estimated from the data
fit4 <- goodfit(Saxony, type = "binomial", par = list(size = 12))
summary(fit4)
##
## Goodness-of-fit test for binomial distribution
##
## X^2 df P(> X^2)
## Likelihood Ratio 97.0065 11 6.978187e-16
χ2/df Ratio is around 9. p-value is below the critical value, we can say that it is not a binomial distribution.
- Use the plot(goodfit ()) method to visualize these two models
plot(fit3, xlab="Males", main = 'probability= 0.5')
plot(fit4, xlab="Males", main = 'probability=p')