Please put the answers for Part I next to the question number (2pts each):
7a. Describe the two distributions (2pts).
A: right skewed, unimodal
B: normal distribution, unimodal
7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).
The standard deviation (sd) of the observation describes the spread of its values in the sample. The standard deviation of the sample mean is the standard error of the mean (SE). It describes the accuracy as an estimate of the population mean \(\mu\).
\(SE = sd_B = \frac{\sigma}{\sqrt{n}} = \frac{3.22}{\sqrt{30}} =\) 0.5878889
7c. What is the statistical principal that describes this phenomenon (2 pts)?
It’s the Central Limit Theorem.
Consider the four datasets, each with two columns (x and y), provided below.
#options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
data.frame(data=c("data1","data2","data3","data4"), mean_x=c(round(mean(data1$x),2),round(mean(data2$x),2),round(mean(data3$x),2),round(mean(data4$x),2)), mean_y=c(round(mean(data1$y),2),round(mean(data2$y),2),round(mean(data3$y),2),round(mean(data4$y),2)))
data.frame(data=c("data1","data2","data3","data4"), median_x=c(round(median(data1$x),2),round(median(data2$x),2),round(median(data3$x),2),round(median(data4$x),2)), median_y=c(round(median(data1$y),2),round(median(data2$y),2),round(median(data3$y),2),round(median(data4$y),2)))
data.frame(data=c("data1","data2","data3","data4"), sd_x=c(round(sd(data1$x),2),round(sd(data2$x),2),round(sd(data3$x),2),round(sd(data4$x),2)), sd_y=c(round(sd(data1$y),2),round(sd(data2$y),2),round(sd(data3$y),2),round(sd(data4$y),2)))
data.frame(data1 = cor(data1))
data.frame(data2 = cor(data2))
data.frame(data3 = cor(data3))
data.frame(data4 = cor(data4))
lm_data1 <- lm(data1$y ~ data1$x)
lm_data2 <- lm(data2$y ~ data2$x)
lm_data3 <- lm(data3$y ~ data3$x)
lm_data4 <- lm(data4$y ~ data4$x)
Linear regression equation (\(y = {\beta}_0 + {\beta}_1 * x\)) for:
data1: \(y = 3 + 0.5 * x\)
data2: \(y = 3.001 + 0.5 * x\)
data3 :\(y = 3.002 + 0.5 * x\)
data4: \(y = 3.002 + 0.5 * x\)
data.frame(data=c("data1","data2","data3","data4"), r_squared=c(summary(lm_data1)$r.squared,summary(lm_data2)$r.squared,summary(lm_data3)$r.squared,summary(lm_data4)$r.squared))
par(mfrow=c(2,2))
hist(lm_data1$residuals)
qqnorm(lm_data1$residuals)
qqline(lm_data1$residuals)
par(mfrow=c(2,2))
hist(lm_data2$residuals)
qqnorm(lm_data2$residuals)
qqline(lm_data2$residuals)
par(mfrow=c(2,2))
hist(lm_data3$residuals)
qqnorm(lm_data3$residuals)
qqline(lm_data3$residuals)
par(mfrow=c(2,2))
hist(lm_data4$residuals)
qqnorm(lm_data4$residuals)
qqline(lm_data4$residuals)