Acknowledgment: This work was adapted from the Gelman and Hill (2006) book.

Citation:

Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Install the arm package for running regressions. The plots below also use ggplot2.

Chapter 2 Exercise Questions

1

A test is graded from 0 to 50, with an average score of 35 and a standard deviation of 10. For comparison to other tests, it would be convenient to rescale to a mean of 100 and standard deviation of 15.

  1. How can the scores be linearly transformed to have this new mean and standard deviation?

Answer

old_mean <- 35
old_sd <- 10
old_var<-10^2

new_mean<- 100
new_sd <- 15
new_var<-15^2

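# Transform: y = a*x + b, so Var[Y] = a^2 * Var[X] and E[Y] = a*E[X] + b
a <- sqrt(new_var / old_var)
# The intercept comes from the means, not from the variance
b <- new_mean - a * old_mean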
a
## [1] 1.5
b
## [1] 47.5
#y=ax+b



# Equivalent form: standardize with the old mean and sd, then rescale to the new ones
linear_transform <- function(x) {
  (x - old_mean) * (new_sd / old_sd) + new_mean
}

original_scores <- c(10, 20, 30, 40, 50)
new_scores <- linear_transform(original_scores)

A linear transformation can be written using the equation for a line:

\[ Y = aX + b \]

Thus,

\[ E[Y] = a E[X] + b, \qquad Var[Y] = a^{2} Var[X] \]

We want E[Y] = 100 and Var[Y] = 15^{2} = 225.

Substituting E[X] = 35 and Var[X] = 10^{2} = 100 gives:

\[ 100 = 35a + b, \qquad 225 = 100a^{2} \]

Solving the second equation gives a = 1.5 or a = -1.5. Taking the positive root, a = 1.5, the first equation gives b = 100 - 1.5 * 35 = 47.5.

So the linear transformation required is \[ Y = 1.5X + 47.5 \]
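
As a quick check of the algebra, the sketch below applies the transformation to an illustrative score vector constructed to have mean 35 and standard deviation 10 exactly. The vector `x`, the seed, and the sample size are assumptions made for illustration. They are not data from the exercise, and the made-up scores are not restricted to the 0 to 50 range.

```r
# Illustrative check only: x is made-up data, not real test scores.
set.seed(1)                          # arbitrary seed, for reproducibility
z <- rnorm(200)                      # arbitrary raw values
z <- (z - mean(z)) / sd(z)           # standardize: sample mean 0, sample sd 1
x <- 35 + 10 * z                     # sample mean 35 and sample sd 10

a <- 15 / 10                         # ratio of new sd to old sd
b <- 100 - a * 35                    # shifts the mean to 100
y <- a * x + b                       # rescaled scores

# Should be 100 and 15, up to floating-point rounding
c(mean = mean(y), sd = sd(y))
```

Because a linear transformation acts on the sample mean and standard deviation in the same way as on the population values, the check holds for any starting vector with the stated mean and standard deviation.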

Extra Credit

  1. There is another linear transformation that also rescales the scores to have mean 100 and standard deviation 15. What is it, and why would you not want to use it for this purpose?

Answer
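
One candidate answer, sketched here rather than taken from the source: the variance condition 225 = 100a^{2} is also satisfied by a = -1.5, and the mean condition then gives b = 100 + 1.5 * 35 = 152.5. This transformation reverses the ordering, so the highest raw scores would map to the lowest rescaled scores. The names `a_neg`, `b_neg`, and `raw` below are illustrative.

```r
# Sketch of the second solution; all variable names here are illustrative.
a_neg <- -sqrt(15^2 / 10^2)          # negative root of 225 = 100 * a^2
b_neg <- 100 - a_neg * 35            # keeps the mean at 100
c(a = a_neg, b = b_neg)

raw <- c(0, 35, 50)                  # lowest possible, average, highest possible score
rescaled <- a_neg * raw + b_neg      # 0 -> 152.5, 35 -> 100, 50 -> 77.5
rescaled

# TRUE when rescaled scores fall as raw scores rise
all(diff(rescaled) < 0)
```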

2

The following are the proportions of girl births in Vienna for each month in 1908 and 1909 (out of an average of 3900 births per month):

girls.1908 <- c(.4777, .4875, .4859, .4754, .4874, .4864, .4813, .4787, .4895, .4797, .4876, .4859)
girls.1909 <- c(.4857, .4907, .5010, .4903, .4860, .4911, .4871, .4725, .4822, .4870, .4823, .4973)
     
n <- 3900
     

girls <- c(girls.1908, girls.1909)

Compute the standard deviation of these proportions and compare to the standard deviation that would be expected if the sexes of babies were independently decided with a constant probability over the 24-month period.

Answer

sdobs<-sd(girls)
sdobs
## [1] 0.006409724
pobs<-mean(girls)
pobs
## [1] 0.485675
stderror.expected<- sqrt(pobs*(1-pobs)/n)
stderror.expected
## [1] 0.008003121
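
To make the comparison concrete, one option (an assumption here, not something the exercise prescribes) is a chi-squared check. If the 24 proportions were independent with common standard deviation sqrt(p(1-p)/n), then (24 - 1) s^2 / sigma^2 would approximately follow a chi-squared distribution with 23 degrees of freedom. The sketch repeats the data so that it runs on its own; the names `sd_obs`, `sd_exp`, and `chisq_stat` are illustrative.

```r
# Monthly proportions of girl births, copied from the exercise
girls <- c(.4777, .4875, .4859, .4754, .4874, .4864, .4813, .4787, .4895, .4797, .4876, .4859,
           .4857, .4907, .5010, .4903, .4860, .4911, .4871, .4725, .4822, .4870, .4823, .4973)
n <- 3900                                   # average births per month

sd_obs <- sd(girls)                         # observed sd of the proportions
p_hat  <- mean(girls)                       # pooled estimate of the probability
sd_exp <- sqrt(p_hat * (1 - p_hat) / n)     # sd expected under a constant probability

k <- length(girls)
chisq_stat <- (k - 1) * sd_obs^2 / sd_exp^2       # scaled ratio of variances
interval <- qchisq(c(0.025, 0.975), df = k - 1)   # central 95% range under the null

chisq_stat
interval
# TRUE means the observed spread is consistent with a constant probability
chisq_stat > interval[1] && chisq_stat < interval[2]
```

The observed standard deviation (0.0064) is smaller than the expected one (0.0080). The statistic comes out at roughly 14.8, inside the central 95% range of about 11.7 to 38.1, so under this check the difference is not statistically significant.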

3

Demonstration of the Central Limit Theorem: let x = x1 + ... + x20, the sum of 20 independent Uniform(0,1) random variables. In R, create 1000 simulations of x and plot their histogram. On the histogram, overlay a graph of the normal density function. Comment on any differences between the histogram and the curve.

Answer

sampleN<-20
c<-0
d<-1

single.mean<-(c+d)/2
single.mean
## [1] 0.5
sum.mean1<- sampleN*single.mean
sum.mean1
## [1] 10
# The variance of a single Uniform(c, d) draw is (d - c)^2 / 12. This is a property of
# the distribution itself, so the sample-variance divisor n - 1 does not apply here.
single.var <- (d - c)^2 / 12
single.var
## [1] 0.08333333
# Variances of independent draws add, so Var(sum) = sampleN * single.var.
# The sample size therefore belongs inside sqrt().
sum.sd <- sqrt(sampleN * single.var)
sum.sd
## [1] 1.290994
simulations<-1000
obs1 <- runif(sampleN*simulations)
indv<-matrix(obs1,nrow=simulations, ncol=sampleN, byrow=FALSE)
data1<-data.frame(out=rowSums(indv))
typeof(data1)
## [1] "list"
library(ggplot2)
# Overlay the normal density using the theoretical mean and sd of the sum
ggplot(data1, aes(x=out)) +
  geom_histogram(aes(y=after_stat(density)), binwidth=1,alpha=.1)+
  geom_density()+
  stat_function(fun=dnorm, args=list(mean=sum.mean1, sd=sum.sd))

Comment:
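
A numerical check can support the comment. The sketch below uses base R only and an arbitrary seed (both assumptions for illustration) to compare the simulated mean and standard deviation of the sums with the theoretical values 20 * 0.5 = 10 and sqrt(20/12), about 1.29. With 20 terms the histogram is typically very close to the normal curve, although the sum is bounded between 0 and 20 while the normal distribution is not.

```r
set.seed(123)                        # arbitrary seed, for reproducibility
n_terms <- 20                        # uniforms per sum
n_sims  <- 1000                      # number of simulated sums

# Each row holds one set of 20 Uniform(0,1) draws; row sums are the simulated x values
sums <- rowSums(matrix(runif(n_terms * n_sims), nrow = n_sims, ncol = n_terms))

theory_mean <- n_terms * 0.5         # mean of Uniform(0,1) is 1/2
theory_sd   <- sqrt(n_terms / 12)    # variance of Uniform(0,1) is 1/12

c(sim_mean = mean(sums), theory_mean = theory_mean)
c(sim_sd = sd(sums), theory_sd = theory_sd)

# All simulated sums necessarily fall inside the support [0, 20]
range(sums)
```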

4

Distribution of averages and differences: the heights of men in the United States are approximately normally distributed with mean 69.1 inches and standard deviation 2.9 inches. The heights of women are approximately normally distributed with mean 63.7 inches and standard deviation 2.7 inches. Let x be the average height of 100 randomly sampled men, and y be the average height of 100 randomly sampled women. In R, create 1000 simulations of x − y and plot their histogram. Using the simulations, compute the mean and standard deviation of the distribution of x − y and compare to their exact values.

Answer

samplen<-100
mean_height_men<-69.1
stdev_height_men<-2.9

mean_height_women<-63.7
stdev_height_women<-2.7
simulations2<-1000

# Create a variable for normally distributed data for men and one for women 
obs_men<-rnorm(samplen*simulations2,mean=mean_height_men,sd=stdev_height_men)
obs_women<-rnorm(samplen*simulations2,mean=mean_height_women,sd=stdev_height_women)
# Create a Matrix of the observations data. 
indv_men<-matrix(obs_men,nrow=simulations2,ncol=samplen,byrow=F)
indv_women<-matrix(obs_women,nrow=simulations2,ncol=samplen,byrow=F)
     
# Create a data frame of the simulated sample means and their difference
data2<-data.frame(out_men=rowMeans(indv_men), out_women=rowMeans(indv_women))
data2$diff <- data2$out_men - data2$out_women
data2[1:3,]
##    out_men out_women     diff
## 1 68.95972  63.82221 5.137515
## 2 69.05698  63.93523 5.121752
## 3 69.15701  63.76389 5.393128

Now plot it

ggplot(data2,aes(x=diff))+
  geom_histogram()+
  ggtitle("x-y")+
  xlab("Value")+
  ylab("Counts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

tmean<-mean_height_men-mean_height_women
tmean
## [1] 5.4
meandata<-mean(data2$diff)
meandata
## [1] 5.399659
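# x and y are each averages of samplen heights, so each variance is divided by samplen
tsd<-sqrt(stdev_height_men^2/samplen+stdev_height_women^2/samplen)
tsd
## [1] 0.3962323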
datasd<-sd(data2$diff)
datasd
## [1] 0.3932614

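Comments

The simulated mean (5.3997) and standard deviation (0.3933) of x − y are close to the exact values: the mean is 69.1 − 63.7 = 5.4 and the standard deviation is sqrt(2.9^2/100 + 2.7^2/100), about 0.396. The small gaps are simulation noise from using 1000 draws.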

5

Correlated random variables: suppose that the heights of husbands and wives have a correlation of 0.3. Let x and y be the heights of a married couple chosen at random. What are the mean and standard deviation of the average height, (x + y)/2?

Answer

e<-mean_height_men
f<-mean_height_women
sd_e<-stdev_height_men
sd_f<-stdev_height_women
rhwc<-.3
# plug into formula from chapter 2
tmean<-(e+f)/2
tmean
## [1] 66.4
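# Var(x + y) = Var(x) + Var(y) + 2*cor*sd(x)*sd(y), hence the factor of 2 on the correlation term.
# Averaging divides the sum by 2, which divides the variance by 4; sqrt() converts variance to sd.
tsd1<-sqrt((sd_e^2+sd_f^2+2*rhwc*sd_e*sd_f)/4)
tsd1
## [1] 2.258207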
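
The formula for the standard deviation of (x + y)/2 can be checked by simulation. The sketch below builds correlated normal heights from two independent standard normal draws. Joint normality of the two heights, the seed, and the number of simulated couples are all assumptions made for illustration, not part of the exercise.

```r
set.seed(42)                         # arbitrary seed, for reproducibility
n_couples <- 100000                  # illustrative number of simulated couples
rho <- 0.3                           # correlation between husband and wife heights

z1 <- rnorm(n_couples)
z2 <- rnorm(n_couples)

# rho * z1 + sqrt(1 - rho^2) * z2 has variance 1 and correlation rho with z1
husband <- 69.1 + 2.9 * z1
wife    <- 63.7 + 2.7 * (rho * z1 + sqrt(1 - rho^2) * z2)

avg_height <- (husband + wife) / 2

# Exact sd: sqrt((Var(x) + Var(y) + 2*rho*sd(x)*sd(y)) / 4)
exact_sd <- sqrt((2.9^2 + 2.7^2 + 2 * rho * 2.9 * 2.7) / 4)

c(sim_cor = cor(husband, wife), target_cor = rho)
c(sim_mean = mean(avg_height), exact_mean = (69.1 + 63.7) / 2)
c(sim_sd = sd(avg_height), exact_sd = exact_sd)
```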