Yun Mai
May 19, 2017
car. 1 = compact, 2 = standard size, 3 = mini van, 4 = SUV, and 5 = truck
color. red, blue, green, black, white
daysDrive. number of days per week the student drives
gasMonth. the amount of money the student spends on gas per month
a. car
b. daysDrive
c. daysDrive, car
d. daysDrive, gasMonth
e. car, daysDrive, gasMonth
Answer: c
a. mean = 3.3, median = 3.5
b. mean = 3.5, median = 3.3
c. mean = 2.9, median = 3.8
d. mean = 3.8, median = 2.9
e. mean = 2.5, median = 3.8
Answer: c
a. Randomly assign Ebola patients to one of two groups, either the treatment or placebo group, and then compare the fever of the two groups.
b. Identify Ebola patients who received the new treatment and those who did not, and then compare the fever of those two groups.
c. Identify clusters of villages and then stratify them by gender and compare the fevers of male and female groups.
d. Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regards to fever in Ebola patients.
Answer: a
a. there is a difference between average eye color and average hair color.
b. a person's hair color is determined by his or her eye color.
c. there is an association between natural hair color and eye color.
d. eye color and natural hair color are independent.
Answer: c
library(knitr)   # for kable()
a <- data.frame(min = 26, Q1 = 37, median = 45, Q3 = 49.8, max = 65,
                mean = 44.4, sd = 8.4, n = 50)
kable(a)
min | Q1 | median | Q3 | max | mean | sd | n |
---|---|---|---|---|---|---|---|
26 | 37 | 45 | 49.8 | 65 | 44.4 | 8.4 | 50 |
a. 37.0 and 49.8
b. 17.8 and 69.0
c. 36.0 and 52.8
d. 26.0 and 50.0
e. 19.2 and 69.9
Answer: b
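The selected interval corresponds to the 1.5 × IQR outlier fences. A quick sketch of that arithmetic from the summary table above (that the question asks for the fence rule is an assumption, but it is the only rule that reproduces choice b):
q1 <- 37
q3 <- 49.8
iqr <- q3 - q1                       # 12.8
c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)    # fences: 37 - 19.2 and 49.8 + 19.2
## [1] 17.8 69.0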
a. mean and median; standard deviation and interquartile range
b. mean and standard deviation; median and interquartile range
c. standard deviation and interquartile range; mean and median
d. median and interquartile range; mean and standard deviation
e. median and standard deviation; mean and interquartile range
Answer: c
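A quick numeric illustration (the vector is made up purely for demonstration): adding one extreme value shifts the mean and standard deviation substantially while barely moving the median and IQR, which is why the latter pair is called robust.
x <- c(40, 42, 45, 47, 50)                        # hypothetical data
c(mean(x), sd(x), median(x), IQR(x))
x_out <- c(x, 120)                                # add a single extreme value
c(mean(x_out), sd(x_out), median(x_out), IQR(x_out))
# mean and sd change dramatically; median and IQR hardly move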
Figure A: the distribution is unimodal and right skewed.
Figure B: the distribution is unimodal and nearly normal.
The means of these two distributions are similar because sample means tend to fall around the population mean.
The standard deviations of these two distributions are not similar: the standard deviation of the sample mean tells us how far a typical estimate of the population mean falls from the actual population mean, while the standard deviation of an observed variable tells us how far a single observation falls from the population mean.
The statistical principle that describes this phenomenon is the Central Limit Theorem: if a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model.
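A minimal simulation sketch of this principle (the exponential population and sample size here are illustrative assumptions, not from the assignment): means of repeated samples from a strongly right-skewed population still pile up in a roughly normal shape around the population mean.
set.seed(1)
# hypothetical right-skewed population: Exp(1), which has mean 1
sample_means <- replicate(1000, mean(rexp(50, rate = 1)))
hist(sample_means)    # roughly bell-shaped despite the skewed population
mean(sample_means)    # close to the population mean of 1
sd(sample_means)      # close to 1/sqrt(50), the SD of the sample mean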
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
# the four x/y pairs below are the values of Anscombe's quartet
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
data <- cbind(data1, data2, data3, data4)   # combine the four datasets side by side
colnames(data) <- c("data1.x","data1.y","data2.x","data2.y",
                    "data3.x","data3.y","data4.x","data4.y")
kable(data)
data1.x | data1.y | data2.x | data2.y | data3.x | data3.y | data4.x | data4.y |
---|---|---|---|---|---|---|---|
10 | 8.0 | 10 | 9.1 | 10 | 7.5 | 8 | 6.6 |
8 | 7.0 | 8 | 8.1 | 8 | 6.8 | 8 | 5.8 |
13 | 7.6 | 13 | 8.7 | 13 | 12.7 | 8 | 7.7 |
9 | 8.8 | 9 | 8.8 | 9 | 7.1 | 8 | 8.8 |
11 | 8.3 | 11 | 9.3 | 11 | 7.8 | 8 | 8.5 |
14 | 10.0 | 14 | 8.1 | 14 | 8.8 | 8 | 7.0 |
6 | 7.2 | 6 | 6.1 | 6 | 6.1 | 8 | 5.2 |
4 | 4.3 | 4 | 3.1 | 4 | 5.4 | 19 | 12.5 |
12 | 10.8 | 12 | 9.1 | 12 | 8.2 | 8 | 5.6 |
7 | 4.8 | 7 | 7.3 | 7 | 6.4 | 8 | 7.9 |
5 | 5.7 | 5 | 4.7 | 5 | 5.7 | 8 | 6.9 |
For each column, calculate (to two decimal places):
mean_data1.x <- format(round(mean(data$data1.x),2),nsmall=2)
paste("mean_data1.x:",mean_data1.x)
## [1] "mean_data1.x: 9.00"
mean_data1.y <- format(round(mean(data$data1.y),2),nsmall=2)
paste("mean_data1.y:",mean_data1.y)
## [1] "mean_data1.y: 7.50"
mean_data2.x <- format(round(mean(data$data2.x),2),nsmall=2)
paste("mean_data2.x:",mean_data2.x)
## [1] "mean_data2.x: 9.00"
mean_data2.y <- format(round(mean(data$data2.y),2),nsmall=2)
paste("mean_data2.y:",mean_data2.y)
## [1] "mean_data2.y: 7.50"
mean_data3.x <- format(round(mean(data$data3.x),2),nsmall=2)
paste("mean_data3.x:",mean_data3.x)
## [1] "mean_data3.x: 9.00"
mean_data3.y <- format(round(mean(data$data3.y),2),nsmall=2)
paste("mean_data3.y:",mean_data3.y)
## [1] "mean_data3.y: 7.50"
mean_data4.x <- format(round(mean(data$data4.x),2),nsmall=2)
paste("mean_data4.x:",mean_data4.x)
## [1] "mean_data4.x: 9.00"
mean_data4.y <- format(round(mean(data$data4.y),2),nsmall=2)
paste("mean_data4.y:",mean_data4.y)
## [1] "mean_data4.y: 7.50"
median_data1.x <- format(round(median(data$data1.x),1),nsmall=1)
paste("median_data1.x:",median_data1.x)
## [1] "median_data1.x: 9.0"
median_data1.y <- format(round(median(data$data1.y),1),nsmall=1)
paste("median_data1.y:",median_data1.y)
## [1] "median_data1.y: 7.6"
median_data2.x <- format(round(median(data$data2.x),1),nsmall=1)
paste("median_data2.x:",median_data2.x)
## [1] "median_data2.x: 9.0"
median_data2.y <- format(round(median(data$data2.y),1),nsmall=1)
paste("median_data2.y:",median_data2.y)
## [1] "median_data2.y: 8.1"
median_data3.x <- format(round(median(data$data3.x),1),nsmall=1)
paste("median_data3.x:",median_data3.x)
## [1] "median_data3.x: 9.0"
median_data3.y <- format(round(median(data$data3.y),1),nsmall=1)
paste("median_data3.y:",median_data3.y)
## [1] "median_data3.y: 7.1"
median_data4.x <- format(round(median(data$data4.x),1),nsmall=1)
paste("median_data4.x:",median_data4.x)
## [1] "median_data4.x: 8.0"
median_data4.y <- format(round(median(data$data4.y),1),nsmall=1)
paste("median_data4.y:",median_data4.y)
## [1] "median_data4.y: 7.0"
sd_data1.x <- format(round(sd(data$data1.x),2),nsmall=2)
paste("sd_data1.x:",sd_data1.x)
## [1] "sd_data1.x: 3.32"
sd_data1.y <- format(round(sd(data$data1.y),2),nsmall=2)
paste("sd_data1.y:",sd_data1.y)
## [1] "sd_data1.y: 2.03"
sd_data2.x <- format(round(sd(data$data2.x),2),nsmall=2)
paste("sd_data2.x:",sd_data2.x)
## [1] "sd_data2.x: 3.32"
sd_data2.y <- format(round(sd(data$data2.y),2),nsmall=2)
paste("sd_data2.y:",sd_data2.y)
## [1] "sd_data2.y: 2.03"
sd_data3.x <- format(round(sd(data$data3.x),2),nsmall=2)
paste("sd_data3.x:",sd_data3.x)
## [1] "sd_data3.x: 3.32"
sd_data3.y <- format(round(sd(data$data3.y),2),nsmall=2)
paste("sd_data3.y:",sd_data3.y)
## [1] "sd_data3.y: 2.03"
sd_data4.x <- format(round(sd(data$data4.x),2),nsmall=2)
paste("sd_data4.x:",sd_data4.x)
## [1] "sd_data4.x: 3.32"
sd_data4.y <- format(round(sd(data$data4.y),2),nsmall=2)
paste("sd_data4.y:",sd_data4.y)
## [1] "sd_data4.y: 2.03"
For each x and y pair, calculate (also to two decimal places; 1 pt):
cor(data1$x, data1$y)
## [1] 0.82
cor(data2$x, data2$y)
## [1] 0.82
cor(data3$x, data3$y)
## [1] 0.82
cor(data4$x, data4$y)
## [1] 0.82
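These four datasets are Anscombe's quartet, which ships with base R as datasets::anscombe (columns x1–x4 and y1–y4), so the correlations can be cross-checked directly:
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
# all four correlations are approximately 0.816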
m1 <- lm(data1$y ~ data1$x)
summary(m1)
##
## Call:
## lm(formula = data1$y ~ data1$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9213 -0.4558 -0.0414 0.7094 1.8388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000 1.125 2.67 0.0257 *
## data1$x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
m2 <- lm(data2$y ~ data2$x)
summary(m2)
##
## Call:
## lm(formula = data2$y ~ data2$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.901 -0.761 0.129 0.949 1.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.67 0.0258 *
## data2$x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
m3 <- lm(data3$y ~ data3$x)
summary(m3)
##
## Call:
## lm(formula = data3$y ~ data3$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.159 -0.615 -0.230 0.154 3.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## data3$x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
m4 <- lm(data4$y ~ data4$x)
summary(m4)
##
## Call:
## lm(formula = data4$y ~ data4$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## data4$x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
Linear regression equation for data1:
$$\hat{y}= 3+0.5\times x$$
Linear regression equation for data2:
$$\hat{y}= 3+0.5\times x$$
Linear regression equation for data3:
$$\hat{y}= 3+0.5\times x$$
Linear regression equation for data4:
$$\hat{y}= 3+0.5\times x$$
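That all four fitted equations coincide can be confirmed by extracting the coefficients directly; a short sketch:
round(rbind(coef(m1), coef(m2), coef(m3), coef(m4)), 2)
# every row is (Intercept) = 3.00, slope = 0.50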
For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)
plot(data1$y ~ data1$x)
abline(m1)
The plot above for data1 shows an upward linear trend with no obvious curvature or outliers, so it is appropriate to fit a linear regression model; its reliability is checked in the diagnostics below.
plot(data2$y ~ data2$x)
abline(m2)
The plot above shows that x and y in data2 have a very strong relationship, but the trend is curved rather than linear, so a straight line cannot fit the data well. Because the relationship is non-linear, it is not appropriate to fit a linear regression model.
plot(data3$y ~ data3$x)
abline(m3)
The plot above shows that x and y in data3 have a very strong linear relationship except for one outlier that falls far from the line and pulls the least-squares fit toward it. Although the trend is linear, that outlier must be examined before trusting the model; the diagnostics below address this.
plot(data4$y ~ data4$x)
abline(m4)
The plot above shows that x in data4 is constant except for a single observation, so any apparent relationship is driven entirely by that one point. Because there is no genuine linear relationship, it is not appropriate to fit a linear regression model, as the side-by-side panels below also make clear.
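Viewing the four fits side by side makes the contrast immediate; a small sketch using a 2-by-2 panel layout:
par(mfrow = c(2, 2))
plot(data1$y ~ data1$x, main = "data1"); abline(m1)
plot(data2$y ~ data2$x, main = "data2"); abline(m2)
plot(data3$y ~ data3$x, main = "data3"); abline(m3)
plot(data4$y ~ data4$x, main = "data4"); abline(m4)
par(mfrow = c(1, 1))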
Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)
A scatterplot lets us check linearity first, to see whether a straight-line model is appropriate at all. To assess whether a linear regression model is reliable, we should also check, beyond linearity: nearly normal residuals, constant variability, and independent observations.
For data1, we can assess whether the linear regression model is reliable by using visualizations for model diagnostics.
Linearity: is the relationship between x and y linear?
plot(m1$residuals ~ data1$x)
abline(h = 0, lty = 3)
Nearly normal residuals:
qqnorm(m1$residuals)
qqline(m1$residuals)
The normal probability plot of the residuals shows no strong departure from normality; with only 11 observations, the mild deviation in the tails is not a serious concern.
Constant variability:
fitted <- 3 + 0.5 * data1$x   # fitted values from the regression equation
plot(m1$residuals ~ fitted)
abline(h = 0, lty = 3)
The plot above shows that the variability of the residuals is nearly constant. Based on these diagnostics, the conditions are reasonably satisfied and a linear regression model is an appropriate fit for data1.
For data3, we can assess whether the linear regression model is reliable by using visualizations for model diagnostics.
Linearity: is the relationship between x and y linear?
plot(m3$residuals ~ data3$x)
abline(h = 0, lty = 3)
Nearly normal residuals:
qqnorm(m3$residuals)
qqline(m3$residuals)
The normal probability plot of the residuals shows that the residuals are close to normal except for one outlier on the high end that deviates far from the line.
Constant variability:
fitted <- 3 + 0.5 * data3$x   # fitted values from the regression equation
plot(m3$residuals ~ fitted)
abline(h = 0, lty = 3)
The plot above shows that the variability is nearly constant except for the one outlier that falls far from the line. Based on these diagnostics, a linear regression model does not fit data3 well, because the outlier violates the nearly-normal-residuals condition.
When we analyze data, visualizations are essential for checking whether the data satisfy the conditions for fitting a linear regression model; as these four datasets show, summary statistics can be identical while the underlying relationships are completely different.
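Base R also bundles these diagnostics: calling plot() on a fitted lm object draws the residuals-vs-fitted and normal Q-Q plots (among others) in one step, for example:
par(mfrow = c(2, 2))
plot(m1)   # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))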