In this exercise we investigate a dataset consisting of the number of spelling mistakes found in 150 short paragraphs written by elementary school students.

a)

spelling<-scan(file = "/Users/Leland/Desktop/Stat 200/Datasets/spelling.txt")
fivenum(spelling) 
## [1] 0 2 2 3 7
boxplot(spelling, main = 'Number of spelling mistakes',ylab = 'Number of mistakes' )

hist(spelling,xlab = 'Number of Mistakes',prob = T)

The histogram shows that the data are skewed to the right. The five-number summary shows that the IQR is narrow (3 - 2 = 1) and that the median coincides with the lower hinge at 2.
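The five-number summary also determines which points the boxplot draws as outliers. The following is a minimal sketch, assuming the default `boxplot()` rule of 1.5 times the hinge spread; the values are copied from the `fivenum()` output above, and the variable names are chosen here for illustration.

```r
# Five-number summary copied from the fivenum() output above.
five <- c(min = 0, lower = 2, median = 2, upper = 3, max = 7)

# Hinge spread (the IQR used in the text) and the default 1.5 * IQR fences.
iqr_hinge   <- five[["upper"]] - five[["lower"]]
lower_fence <- five[["lower"]] - 1.5 * iqr_hinge
upper_fence <- five[["upper"]] + 1.5 * iqr_hinge

c(iqr = iqr_hinge, lower_fence = lower_fence, upper_fence = upper_fence)
```

Under this rule the fences sit at 0.5 and 4.5, so paragraphs with 0 mistakes and paragraphs with 5 or more mistakes would be plotted as individual points.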

b)

hist(spelling,xlab = 'Number of Mistakes',prob = T)
sd<-sd(spelling); sd
## [1] 1.440249
m<-mean(spelling); m
## [1] 2.446667
curve(dnorm(x, mean=m, sd=sd),add=T, col="red", lty=2)

The normal distribution is not a suitable choice because the data are skewed to the right. To investigate further, we compare the sample quantiles with normal quantiles using a normal Q-Q plot and run the Shapiro-Wilk test.

qqnorm(spelling)
qqline(spelling)

shapiro.test(spelling)
## 
##  Shapiro-Wilk normality test
## 
## data:  spelling
## W = 0.93631, p-value = 2.804e-06

A perfectly normal sample would fall along the straight reference line drawn by qqline. Because spelling takes only integer values, the points form horizontal steps, which makes the comparison with the normal distribution somewhat hard to read. Even so, the sample clearly does not follow the reference line well, suggesting that the normal distribution is not a good description of the data. Additionally, the very small p-value of the Shapiro-Wilk test (2.8e-06) is strong evidence against the hypothesis that the data come from a normal distribution.
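To see how integer-valued data behave in these diagnostics, here is a small sketch on simulated counts. The sample `sim` is hypothetical (Poisson draws with an assumed rate of 2.45) and is not the spelling data; the seed is arbitrary.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
sim <- rpois(150, lambda = 2.45)   # hypothetical integer-valued sample

# Compute Q-Q coordinates without plotting: y holds the sample values.
qq <- qqnorm(sim, plot.it = FALSE)

# Only a handful of distinct y values exist, so the plot shows steps.
n_levels <- length(unique(qq$y))
n_levels

# Shapiro-Wilk on the same simulated counts.
sw <- shapiro.test(sim)
sw$p.value
```

With so few distinct values, many points share the same height in the Q-Q plot, which is the step pattern described above.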

c)

table(spelling>=3)
## 
## FALSE  TRUE 
##    85    65
SampProportion<-65/150
q<-quantile(spelling)
TheoreticalProportion<-1 - pnorm(2,mean = m,sd=sd) # for integer counts, P(X >= 3) = 1 - P(X <= 2), so the cutoff is 2; note that no continuity correction is applied to the normal here

The sample proportion of paragraphs with 3 or more mistakes is

SampProportion 
## [1] 0.4333333

The estimate of the probability using the normal distribution is

TheoreticalProportion
## [1] 0.6217695
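The normal estimate depends heavily on where the cutoff is placed, because the normal is continuous while the counts are integers. The sketch below compares three cutoffs, using the mean and standard deviation copied from the output in part b); the cutoff 2.5 is the usual continuity correction and is shown only for comparison, not as part of the original solution.

```r
# Mean and standard deviation copied from the output in part b).
m <- 2.446667
s <- 1.440249

# Upper-tail normal probability at three candidate cutoffs.
cutoffs <- c(at_2 = 2, at_2.5 = 2.5, at_3 = 3)
normal_tail <- pnorm(cutoffs, mean = m, sd = s, lower.tail = FALSE)

round(normal_tail, 3)
```

The three cutoffs give roughly 0.62, 0.49, and 0.35, so part of the gap between the normal estimate and the sample proportion comes from the choice of cutoff.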

d)

The Poisson distribution might work better than the normal. Let's check.

hist(spelling,xlab = 'Number of Mistakes',prob = T)
set.seed(100)
y<-rpois(10000,m)
lines(density(y,bw=1), col='red', lwd=3)

The Poisson does appear to be a better approximation to the general shape of the data.
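One quick numerical check of the Poisson assumption is that a Poisson variable has variance equal to its mean. The sketch below applies that check to the summary statistics reported in part b); the name `dispersion` is introduced here for illustration.

```r
# Mean and standard deviation copied from the output in part b).
m <- 2.446667
s <- 1.440249

# Under a Poisson model the variance equals the mean, so this ratio
# should be close to 1.
sample_var <- s^2
dispersion <- sample_var / m

c(mean = m, variance = sample_var, ratio = dispersion)
```

The ratio is about 0.85, which is reasonably close to 1 and broadly consistent with a Poisson model, though the sample is slightly less spread out than a Poisson with this mean.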

e)

SampProportion<-65/150
TheoreticalProportionP<-1 - ppois(2,m) # ppois(2, m) is P(X <= 2), so 1 - ppois(2, m) is P(X >= 3) for integer counts

The sample proportion of paragraphs with 3 or more mistakes is

SampProportion 
## [1] 0.4333333

The estimate of the probability using the Poisson distribution is

TheoreticalProportionP
## [1] 0.4424349
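The Poisson tail probability can be reproduced by hand from the probability mass function, since P(X <= 2) = exp(-m) * (1 + m + m^2/2). A minimal sketch, using the mean copied from part b):

```r
# Mean copied from the output in part b).
m <- 2.446667

# P(X <= 2) written out term by term: P(0) + P(1) + P(2).
p_le_2 <- exp(-m) * (1 + m + m^2 / 2)
p_hand <- 1 - p_le_2

# Same quantity from the built-in upper-tail option.
p_builtin <- ppois(2, lambda = m, lower.tail = FALSE)

c(by_hand = p_hand, builtin = p_builtin)
```

Both values agree with the 0.4424349 reported above, up to the rounding of the mean.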

The Poisson is a better approximation for this data set: its estimated probability (0.442) is much closer to the sample proportion (0.433) than the estimate from the normal distribution (0.622).
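To put the two gaps on a common scale, one rough yardstick is the standard error of the sample proportion. The sketch below assumes the 150 paragraphs can be treated as independent, which is not stated in the exercise; it is an informal comparison, not a formal test.

```r
# Sample proportion and the two model estimates reported above.
p_hat <- 65 / 150
estimates <- c(normal = 0.6217695, poisson = 0.4424349)

# Binomial standard error of the sample proportion (assumes independence).
se <- sqrt(p_hat * (1 - p_hat) / 150)

# Distance of each model estimate from the sample proportion, in SE units.
z <- (estimates - p_hat) / se
round(z, 2)
```

The Poisson estimate lies well within one standard error of the sample proportion, while the normal estimate is more than four standard errors away.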