ANLY545 HW2_Tian Tian

Exercise 3.1 The Arbuthnot data in HistData also contains the variable Ratio, giving the ratio of male to female births.

(a) Make a plot of Ratio over Year. What features stand out? Which plot do you prefer to display the tendency for more male births?

library(HistData)
library(ggplot2)
data('Arbuthnot', package = 'HistData')
maxRatio <- max(Arbuthnot$Ratio)
maxRatio

## [1] 1.156075

minRatio <- min(Arbuthnot$Ratio)
minRatio

## [1] 1.010673

ggplot(Arbuthnot, aes(x = Arbuthnot$Year, y=Arbuthnot$Ratio))+ 
  geom_line() +
  geom_point(aes(color=Ratio)) +
  labs(title='Male to Female Birth Ratio Over Year',x="Year", y="Male to Female Birth Ratio") +
  theme_classic()+
  ylim(1, maxRatio)+
  geom_hline(yintercept = maxRatio,color="blue", linetype="dashed")+
  geom_hline(yintercept = minRatio,color="red", linetype="dashed")+
  geom_smooth(method = "lm")

First, the ratio has a min greater than 1.0, meaning that the birth of male was alwasys higher than that of female during the period of our data. Second, the trend line shows a downward slope, meaning that the birth ratio of male to female has been declining over time in our data.

I used line plot with point to show the tendency since the trend changes over time and the ratio has alwasys been above 1.0.

(b) Plot the total number of christenings, Males + Females or Total (in 000s) over time. What unusual features do you see?

ggplot(Arbuthnot, aes(x = Arbuthnot$Year, y=Arbuthnot$Total))+ 
  geom_line() +
  geom_point(aes(color=Total)) +
  labs(title='Total Christenings Over Year',x="Year", y="Total Christenings") +
  theme_classic()+
  geom_smooth(method = "lm")

From the upward-sloping trend line we can see that the total christenings have been increasing over time, but there was a period of outliers. The number started dropping rapidly in 1640 and didn’t start to go back until 1660.

Exercise 3.3 Use the data set WomenQueue to:

(c) Make a reasonable plot showing departure from the binomial distribution.

library(vcd)

## Warning: package 'vcd' was built under R version 3.4.3

## Loading required package: grid

data('WomenQueue', package = 'vcd')
barplot(WomenQueue, main="Women Queue Depature Distribution",xlab="Counts",ylab="Frequency")

WQ.fit <- goodfit(WomenQueue, type="binomial")

## Warning in goodfit(WomenQueue, type = "binomial"): size was not given,
## taken as maximum count

unlist(WQ.fit$par)

##   prob   size 
##  0.435 10.000

plot(WQ.fit, xlab="Counts",main = 'Rootogram')

From the rootgram, we can see that there are a lot of deviations from the actual value.

(d) Suggest some reasons why the number of women in queues of length 10 might depart from a binomial distribution, Bin(n = 10, p = 1/2).

WQ.fit1 <- goodfit(WomenQueue, type = "binomial", par = list(prob = .5,size = 10))
plot(WQ.fit1,xlab="Counts",main = 'Rootogram - Bin(n=10, p=1/2)')

distplot(WomenQueue, type = "binomial", conf_level = 0.95)

## Warning in distplot(WomenQueue, type = "binomial", conf_level = 0.95): size
## was not given, taken as maximum count

Again, here from the rootgram we can see that the fitted values deviated from the actual values, which means it departs from the binomial distribution. Also, from the Distplot we can see the intercept and slope is suggesting it departs from a binomial distribution.

Exercise 3.4 Work on the distribution of male children in families in “Saxony” by fitting a binomial distribution, in(n=12, p=1/2), specifying equal probability for boys and girls. [Hint: you need to specify both size and prob values for goodfit ().]

(a) Carry out the GOF test for this fixed binomial distribution. What is the ratio of χ2/df? What do you conclude?

data('Saxony', package = 'vcd')
Sa.fit <- goodfit(Saxony, type = "binomial", par = list(prob = .5,size = 12))
summary(Sa.fit)

## Warning in summary.goodfit(Sa.fit): Chi-squared approximation may be
## incorrect

## 
##   Goodness-of-fit test for binomial distribution
## 
##                       X^2 df     P(> X^2)
## Pearson          249.1954 12 2.013281e-46
## Likelihood Ratio 205.4060 12 2.493625e-37

Ratio of χ2/df is 205/12=17. Because of the ratio and the p-value (less than the critical value, thus reject null hypothesis of being a binomial distribution), we believe it is not a binomial distribution.

(b) Test the additional lack of fit for the model Bin(n=12, p=1/2) compared to the model 𝐵𝑖𝑛(𝑛 = 12, 𝑝 = 𝑝̂) where 𝑝̂ is estimated from the data.

Sa.fit1 <- goodfit(Saxony, type = "binomial", par = list(size = 12))
summary(Sa.fit1)

## 
##   Goodness-of-fit test for binomial distribution
## 
##                      X^2 df     P(> X^2)
## Likelihood Ratio 97.0065 11 6.978187e-16

Ratio of χ2/df is 97/11=8.8 and p-value is still less than 0.05, thus we reject the null hypothesis of being a binomial distribution. The chi-square ratio is smaller than question (a).

(c) Use the plot.goodfit () method to visualize these two models.

plot(Sa.fit, xlab="Number of Males", main = 'Visualization - Probability of 0.5')

plot(Sa.fit1, xlab="Number of Males", main = 'Visualization - Probability of p')

Both fit grams show there are deviations from actual values.