Guiding Question #1: Is a Normal model appropriate for the given data?
Guiding Question #2: Do we have strong reasons to doubt that the sample came from a Normal population? (This is an important condition for t procedures in AP Stats.)
set.seed(16) ## use this seed to reproduce the graphs shown here. To generate different random samples, change the number to whatever you'd like. I chose 16 because it's my lucky number!
x<-rnorm(100)
hist(x)
qqnorm(x)
y<-rexp(100) ## extra credit: this is an exponential distribution. What transformation could we apply so that a Normal approximation is appropriate? Scroll to the bottom to see the answer.
hist(y)
qqnorm(y)
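w<-rchisq(100,df=2) ## a chi-square sample with 2 degrees of freedom (named w rather than c so that it does not mask R's built-in c() function)
hist(w)
qqnorm(w)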
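A straight reference line can make it easier to judge how linear a Normal probability plot is. The sketch below is an addition to the examples above: it uses base R's `qqline()` to draw a line through the first and third quartiles, and computes the correlation between the theoretical and sample quantiles as a rough numeric summary. The names `qq_cor`, `x_norm`, and `y_exp` are made up for this illustration, and the correlation is only an informal aid, not a formal test.

```r
# Illustrative helper (hypothetical name): correlation between the theoretical
# Normal quantiles and the sample values plotted by qqnorm().
qq_cor <- function(v) {
  qq <- qqnorm(v, plot.it = FALSE)  # compute the plot coordinates without drawing
  cor(qq$x, qq$y)
}

set.seed(16)
x_norm <- rnorm(100)   # sample from a Normal population
y_exp  <- rexp(100)    # sample from an Exponential population

par(mfrow = c(1, 2))
qqnorm(x_norm, main = "Normal sample")
qqline(x_norm)         # reference line through the quartiles
qqnorm(y_exp, main = "Exponential sample")
qqline(y_exp)
par(mfrow = c(1, 1))   # reset the plotting layout

qq_cor(x_norm)         # close to 1 when the plot is roughly linear
qq_cor(y_exp)          # typically lower for strongly skewed data
```

Points that bend away from the line at one or both ends are the usual visual signal of skewness or heavy tails.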
Let’s generate several examples of Normal probability plots (NPPs) from samples drawn from a population that we KNOW is Normally distributed.
for(i in 1:9){
set.seed(i)
x<-rnorm(20) ## change the sample size to different sizes to see how the NPP is affected
par(mfrow=c(1,2))
hist(x,main="R.S. from NORMAL pop.")
qqnorm(x)
}
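To see how sample size affects the plot without editing the loop by hand, the sketch below draws one sample at each of several sizes and shows the four plots on one page. The sizes in `sizes` and the variable name `x_n` are arbitrary choices for illustration, not values from the examples above.

```r
set.seed(16)
sizes <- c(10, 20, 50, 200)   # illustrative sample sizes (an assumption)

par(mfrow = c(2, 2))          # 2 x 2 grid so all four plots appear together
for (n in sizes) {
  x_n <- rnorm(n)             # random sample of size n from a Normal population
  qqnorm(x_n, main = paste("Normal pop., n =", n))
  qqline(x_n)                 # reference line through the quartiles
}
par(mfrow = c(1, 1))          # reset the plotting layout
```

With small samples, expect visible wobble around the line even though the population really is Normal. The points usually hug the line more closely as n grows.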
Now, let’s look at some examples of Normal probability plots from a population that we know is not Normal. For these examples, we’ll use random samples from an Exponential distribution.
for(i in 1:10){
set.seed(i)
x<-rexp(20)
par(mfrow=c(1,2))
hist(x,main="R.S. from Exponential Pop.")
qqnorm(x)
}
Now, let’s simulate some samples from a Cauchy population distribution:
for(i in 1:9){
set.seed(i)
x<-rcauchy(20)
par(mfrow=c(1,2))
hist(x,main="R.S. from CAUCHY Pop.")
qqnorm(x)
}
Finally, let’s simulate some samples from a Uniform population distribution. This distribution can be difficult to identify in small samples!
for(i in 1:9){
set.seed(i)
x<-runif(200) ## with a large sample (n=200 here), the data are clearly not Normal. Change the sample size to 20, though, and it can be very difficult to tell that the samples do not come from a Normal distribution
par(mfrow=c(1,2))
hist(x,main="R.S. from UNIFORM Pop.")
qqnorm(x)
}
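The comparison between small and large Uniform samples can also be made side by side. This is a minimal sketch using the two sample sizes mentioned above (20 and 200). The names `u_small` and `u_large` are illustrative.

```r
set.seed(16)
u_small <- runif(20)    # small sample from a Uniform(0, 1) population
u_large <- runif(200)   # large sample from the same population

par(mfrow = c(1, 2))
qqnorm(u_small, main = "Uniform pop., n = 20")
qqline(u_small)
qqnorm(u_large, main = "Uniform pop., n = 200")
qqline(u_large)
par(mfrow = c(1, 1))    # reset the plotting layout
```

In the larger sample, look for both ends flattening away from the line, which reflects the Uniform distribution's short tails. In the smaller sample that pattern is often hard to distinguish from ordinary sampling variation.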
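Answer to the extra credit question: try a log transformation on the exponential sample y generated at the start.
hist(y) ## the original sample is strongly right-skewed
logy<-log(y) ## since the sample comes from an exponential population, a log transformation seems reasonable
hist(logy) ## the transformed data are much less skewed...
qqnorm(logy) ## the QQ plot is more linear. Do you think t methods are appropriate?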
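One way to explore the closing question is to compare a few candidate transformations numerically. The sketch below is an addition, not part of the original answer: it computes a simple moment-based sample skewness for the raw data and for the square root, cube root, and log. The helper `skew` and the sample `y_demo` are hypothetical names, and a fresh exponential sample is drawn so that the block runs on its own.

```r
# Illustrative helper (hypothetical name): moment-based sample skewness.
# Positive values indicate right skew, negative values indicate left skew.
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(16)
y_demo <- rexp(100)   # fresh sample from an Exponential population

skews <- sapply(
  list(raw = y_demo,
       sqrt = sqrt(y_demo),
       cube_root = y_demo^(1/3),
       log = log(y_demo)),
  skew
)
round(skews, 2)       # skewness closer to 0 means a more symmetric shape
```

For exponential data, the log tends to turn right skew into left skew, while a root transformation often lands closer to symmetric. Whether any of these is good enough for t methods is still a judgment call best made by looking at the plots.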