Acknowledgment: This work was adapted from the Gelman and Hill (2006) book.
Citation:
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Install the arm package for running regressions.
A test is graded from 0 to 50, with an average score of 35 and a standard deviation of 10. For comparison to other tests, it would be convenient to rescale to a mean of 100 and standard deviation of 15.
old_mean <- 35
old_sd <- 10
old_var<-10^2
new_mean<- 100
new_sd <- 15
new_var<-15^2
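# Transform: Var(y) = a^2 * Var(x), so a = sqrt(Var(y) / Var(x))
a <- sqrt(new_var / old_var)
# E(y) = a * E(x) + b, so b = new_mean - a * old_mean
b <- new_mean - a * old_mean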
a
## [1] 1.5
b
## [1] 47.5
# y = a*x + b, i.e. y = 1.5*x + 47.5
# Equivalent form (left commented out): standardize x, then rescale to the new mean and sd
#linear_transform<- function(x){
#(x-old_mean) * (new_sd / old_sd) + new_mean
#}
#original_scores<-c(10,20,30,40,50)
#new_scores<-linear_transform(original_scores)
#new_scores
A linear transformation can be written using the equation for a line,
\[ Y = aX + b \]
Thus,
\[ E[Y] = a\,E[X] + b \quad \text{and} \quad \mathrm{Var}[Y] = a^{2}\,\mathrm{Var}[X] \]
We want E[Y] = 100 and Var[Y] = 15^{2} = 225.
Using these, we get:
\[ 100 = 35a + b, \qquad 225 = 100a^{2} \]
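Solving for a and b, and taking the positive root so that higher scores stay higher, we get a = 15/10 = 1.5 and b = 100 - 1.5 * 35 = 47.5.
So the linear transformation required is \[Y = 1.5X + 47.5\]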
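As a quick check of this transformation, the sketch below applies Y = 1.5X + 47.5 to three made-up scores chosen to have mean 35 and standard deviation 10. The vector check_scores and the helper rescale_score are illustrative names, not part of the original exercise.

```r
# Hypothetical scores with sample mean 35 and sample sd 10
check_scores <- c(25, 35, 45)

# Linear transformation y = a*x + b with a = 1.5 and b = 47.5
rescale_score <- function(x, a = 1.5, b = 47.5) {
  a * x + b
}

rescaled <- rescale_score(check_scores)
mean(rescaled)  # 100
sd(rescaled)    # 15
```

Under this transformation the endpoints of the original 0 to 50 scale map to 47.5 and 122.5.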
The following are the proportions of girl births in Vienna for each month in 1908 and 1909 (out of an average of 3900 births per month):
girls.1908 <- c(.4777, .4875, .4859, .4754, .4874, .4864, .4813, .4787, .4895, .4797, .4876, .4859)
girls.1909 <- c(.4857, .4907, .5010, .4903, .4860, .4911, .4871, .4725, .4822, .4870, .4823, .4973)
n <- 3900
girls <- c(girls.1908, girls.1909)
Compute the standard deviation of these proportions and compare to the standard deviation that would be expected if the sexes of babies were independently decided with a constant probability over the 24-month period.
sdobs<-sd(girls)
sdobs
## [1] 0.006409724
pobs<-mean(girls)
pobs
## [1] 0.485675
stderror.expected<- sqrt(pobs*(1-pobs)/n)
stderror.expected
## [1] 0.008003121
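One way to make the comparison concrete is sketched below. It assumes that, if births were independent with constant probability, the 24 monthly proportions would be approximately normal, so that (k - 1) * s^2 / sigma^2 would be approximately chi-squared with k - 1 = 23 degrees of freedom. This check is an added illustration rather than part of the original solution, and the variable names are chosen here. The data are re-entered so the block runs on its own.

```r
# Monthly proportions of girl births, re-entered so the block is self-contained
props <- c(.4777, .4875, .4859, .4754, .4874, .4864, .4813, .4787, .4895, .4797, .4876, .4859,
           .4857, .4907, .5010, .4903, .4860, .4911, .4871, .4725, .4822, .4870, .4823, .4973)
births_per_month <- 3900
k <- length(props)

sd_observed <- sd(props)
p_hat <- mean(props)
sd_expected <- sqrt(p_hat * (1 - p_hat) / births_per_month)

# Ratio of observed to expected sd (below 1 means less variation than expected)
sd_ratio <- sd_observed / sd_expected

# Under the binomial model, (k - 1) * s^2 / sigma^2 is approximately chi-squared(k - 1)
chisq_stat <- (k - 1) * sd_observed^2 / sd_expected^2
chisq_interval <- qchisq(c(0.025, 0.975), df = k - 1)

sd_ratio
chisq_stat
chisq_interval
```

The observed standard deviation (about 0.0064) is smaller than the expected 0.0080, but the statistic (about 14.8) falls inside the central 95% interval of the chi-squared distribution with 23 degrees of freedom (about 11.7 to 38.1), so the difference is consistent with chance variation.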
Demonstration of the Central Limit Theorem: let x = x1 + · · · + x20, the sum of 20 independent Uniform(0,1) random variables. In R, create 1000 simulations of x and plot their histogram. On the histogram, overlay a graph of the normal density function. Comment on any differences between the histogram and the curve.
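sampleN <- 20
# Endpoints of the Uniform(lower, upper) distribution
# (named lower/upper so that base R's c() is not masked)
lower <- 0
upper <- 1
single.mean <- (lower + upper) / 2
single.mean
## [1] 0.5
sum.mean1 <- sampleN * single.mean
sum.mean1
## [1] 10
# Theoretical variance of one Uniform(lower, upper) draw is (upper - lower)^2 / 12.
# The 12 comes from the distribution itself; n - 1 = 19 applies only to a sample variance.
single.var <- (upper - lower)^2 / 12
single.var
## [1] 0.08333333
# Variances of independent draws add, so Var(sum) = sampleN * single.var;
# the sample size therefore belongs inside sqrt().
sum.sd <- sqrt(sampleN * single.var)
sum.sd
## [1] 1.290994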
simulations<-1000
# Draw all the uniforms at once, then arrange them with one simulation per row
obs1 <- runif(sampleN * simulations)
indv <- matrix(obs1, nrow = simulations, ncol = sampleN, byrow = FALSE)
data1<-data.frame(out=rowSums(indv))
typeof(data1)
## [1] "list"
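library(ggplot2)
ggplot(data1, aes(x = out)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, alpha = .1) +
  geom_density() +
  # Overlay the normal density implied by the CLT: mean n/2, sd sqrt(n/12)
  stat_function(fun = dnorm, args = list(mean = sampleN / 2, sd = sqrt(sampleN / 12)))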
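To go with the visual comparison, the sketch below compares the simulated mean and standard deviation of the sums with the values the normal approximation uses, n/2 and sqrt(n/12). It reruns the simulation with its own variable names and an arbitrary seed, both chosen here for illustration, so it does not depend on the objects above.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

n_terms <- 20    # uniforms per sum
n_sims <- 1000   # number of simulated sums

# Each row holds the 20 Uniform(0,1) draws for one simulation
draws <- matrix(runif(n_terms * n_sims), nrow = n_sims, ncol = n_terms)
sums <- rowSums(draws)

# Theoretical values: mean n/2 and sd sqrt(n/12)
theory_mean <- n_terms / 2
theory_sd <- sqrt(n_terms / 12)

c(simulated = mean(sums), theory = theory_mean)
c(simulated = sd(sums), theory = theory_sd)

# Share of sums within 1.96 theoretical sds of the mean; a normal curve gives about 0.95
mean(abs(sums - theory_mean) < 1.96 * theory_sd)
```

With 20 terms the agreement is typically close. One difference that must remain is that the sum is bounded between 0 and 20, whereas the normal density has unbounded tails.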
Distribution of averages and differences: the heights of men in the United States are approximately normally distributed with mean 69.1 inches and standard deviation 2.9 inches. The heights of women are approximately normally distributed with mean 63.7 inches and standard deviation 2.7 inches. Let x be the average height of 100 randomly sampled men, and y be the average height of 100 randomly sampled women. In R, create 1000 simulations of x − y and plot their histogram. Using the simulations, compute the mean and standard deviation of the distribution of x − y and compare to their exact values.
samplen<-100
mean_height_men<-69.1
stdev_height_men<-2.9
mean_height_women<-63.7
stdev_height_women<-2.7
simulations2<-1000
# Create a variable for normally distributed data for men and one for women
obs_men<-rnorm(samplen*simulations2,mean=mean_height_men,sd=stdev_height_men)
obs_women<-rnorm(samplen*simulations2,mean=mean_height_women,sd=stdev_height_women)
# Create a matrix of the observations, one simulation per row.
indv_men <- matrix(obs_men, nrow = simulations2, ncol = samplen, byrow = FALSE)
indv_women <- matrix(obs_women, nrow = simulations2, ncol = samplen, byrow = FALSE)
# Create a data frame of the simulated averages.
data2 <- data.frame(out_men = rowMeans(indv_men), out_women = rowMeans(indv_women))
data2$diff <- data2$out_men - data2$out_women
data2[1:3, ]
##    out_men out_women     diff
## 1 68.95972  63.82221 5.137515
## 2 69.05698  63.93523 5.121752
## 3 69.15701  63.76389 5.393128
Now plot the histogram of the simulated differences:
ggplot(data2,aes(x=diff))+
geom_histogram()+
ggtitle("x-y")+
xlab("Value")+
ylab("Counts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
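# Exact mean of x - y
tmean <- mean_height_men - mean_height_women
tmean
## [1] 5.4
# Simulated mean
meandata <- mean(data2$diff)
meandata
## [1] 5.399659
# Exact sd of x - y: each average of samplen heights has variance sd^2 / samplen
tsd <- sqrt((stdev_height_men^2 + stdev_height_women^2) / samplen)
tsd
## [1] 0.3962323
# Simulated sd
datasd <- sd(data2$diff)
datasd
## [1] 0.3932614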
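The exact values for this comparison follow from the independence of the two samples. Writing \bar{x} and \bar{y} for the averages of n = 100 men and 100 women (notation introduced here for the derivation), the variance of each average is the population variance divided by n, and the variances of independent quantities add:
\[ E[\bar{x} - \bar{y}] = 69.1 - 63.7 = 5.4 \]
\[ \mathrm{sd}[\bar{x} - \bar{y}] = \sqrt{\frac{2.9^{2}}{100} + \frac{2.7^{2}}{100}} = \sqrt{0.157} \approx 0.396 \]
The simulated mean of 5.3997 and simulated standard deviation of 0.393 are both close to these exact values.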
Correlated random variables: suppose that the heights of husbands and wives have a correlation of 0.3. Let x and y be the heights of a married couple chosen at random. What are the mean and standard deviation of the average height, (x + y)/2?
e<-mean_height_men
f<-mean_height_women
sd_e<-stdev_height_men
sd_f<-stdev_height_women
rhwc<-.3
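# Plug into the formulas from chapter 2
# Mean of the average: (E[x] + E[y]) / 2
tmean <- (e + f) / 2
tmean
## [1] 66.4
# Var(x + y) = Var(x) + Var(y) + 2 * cor * sd_x * sd_y; the covariance enters twice,
# hence the factor of 2. Taking ^.5 (sqrt) turns the variance into a standard deviation,
# and dividing by 2 converts the sum into the average.
tsd1 <- sqrt(sd_e^2 + sd_f^2 + 2 * rhwc * sd_e * sd_f) / 2
tsd1
## [1] 2.258207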
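The formula for the standard deviation of the average can be checked by simulation. The sketch below assumes the two heights are jointly normal with correlation 0.3, which goes beyond what the problem states, and builds the correlated pair from two independent standard normals. The seed, the number of couples, and all variable names are choices made here for illustration.

```r
set.seed(2006)  # arbitrary seed, only for reproducibility

mu_h <- 69.1; sd_h <- 2.9   # husbands
mu_w <- 63.7; sd_w <- 2.7   # wives
rho <- 0.3
n_couples <- 100000

# Build correlated standard normals from two independent ones
z1 <- rnorm(n_couples)
z2 <- rnorm(n_couples)
husband <- mu_h + sd_h * z1
wife <- mu_w + sd_w * (rho * z1 + sqrt(1 - rho^2) * z2)

avg_height <- (husband + wife) / 2

# Formula values for comparison
formula_mean <- (mu_h + mu_w) / 2
formula_sd <- sqrt(sd_h^2 + sd_w^2 + 2 * rho * sd_h * sd_w) / 2

cor(husband, wife)
c(simulated = mean(avg_height), formula = formula_mean)
c(simulated = sd(avg_height), formula = formula_sd)
```

The simulated correlation should be near 0.3, and the simulated mean and standard deviation of the average should be near 66.4 and 2.26.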