library(DATA606)
Answers are in bold
b. daysDrive
a. mean = 3.3, median = 3.5
d. Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regards to fever in Ebola patients.
d. eye color and natural hair color are independent
min   Q1   median   Q3     max   mean   sd    n
26    37   45       49.8   65    44.4   8.4   50
b. 17.8 and 69.0
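The two values in answer (b) coincide with the 1.5 × IQR outlier fences implied by the summary table above. Assuming that is what the question asked for (the question text is not reproduced here), the arithmetic can be sketched as follows; the variable names are illustrative.

```r
# Quartiles taken from the summary table above
q1 <- 37
q3 <- 49.8

# Interquartile range and the usual 1.5 * IQR fences (assumed rule)
iqr   <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

c(lower = lower, upper = upper)   # 17.8 and 69.0, matching answer (b)
```

Under this rule the minimum (26) and maximum (65) both fall inside the fences, so the table alone would flag no outliers.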
d. median and interquartile range; mean and standard deviation
Describe the two distributions (2 pts). Both distributions appear roughly normal, with moderate skew in the distribution illustrated by A. The distribution of observations appears to have a tight spread, while the sampling distribution appears to have a wider spread. With a larger sample size, though, the sampling distribution would show a tighter spread, because the standard error shrinks, and would give a closer estimate of the mean.
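Explain why the means of these two distributions are similar but the standard deviations are not (2 pts). The means are similar because the sampling distribution of the mean is centered on the mean of A. The standard deviations differ because the standard deviation of the sampling distribution (the standard error) is the standard deviation of A divided by the square root of the sample size.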
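The relationship between the two standard deviations can be checked with a small simulation. This is only a sketch: the skewed population, the sample size of 30, and the 5000 replicates are all assumptions chosen for illustration, not values from the exam.

```r
set.seed(606)

# Hypothetical skewed population (an assumption, not the exam's data)
population <- rexp(100000, rate = 1 / 10)
n <- 30

# Sampling distribution of the mean: 5000 samples of size n
sample_means <- replicate(5000, mean(sample(population, n)))

pop_mean  <- mean(population)
se_theory <- sd(population) / sqrt(n)   # sd of A divided by sqrt(n)

# The two means should be close; the two spreads should differ by about sqrt(n)
c(pop_mean = pop_mean, mean_of_sample_means = mean(sample_means))
c(se_theory = se_theory, sd_of_sample_means = sd(sample_means))
```

The simulated standard deviation of the sample means should land close to `se_theory`, while the mean of the sample means stays near the population mean.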
What is the statistical principle that describes this phenomenon (2 pts)? Central Limit Theorem
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
summary(data1)
## x y
## Min. : 4.0 Min. : 4.3
## 1st Qu.: 6.5 1st Qu.: 6.3
## Median : 9.0 Median : 7.6
## Mean : 9.0 Mean : 7.5
## 3rd Qu.:11.5 3rd Qu.: 8.6
## Max. :14.0 Max. :10.8
summary(data2)
## x y
## Min. : 4.0 Min. :3.1
## 1st Qu.: 6.5 1st Qu.:6.7
## Median : 9.0 Median :8.1
## Mean : 9.0 Mean :7.5
## 3rd Qu.:11.5 3rd Qu.:8.9
## Max. :14.0 Max. :9.3
summary(data3)
## x y
## Min. : 4.0 Min. : 5.4
## 1st Qu.: 6.5 1st Qu.: 6.2
## Median : 9.0 Median : 7.1
## Mean : 9.0 Mean : 7.5
## 3rd Qu.:11.5 3rd Qu.: 8.0
## Max. :14.0 Max. :12.7
summary(data4)
## x y
## Min. : 8 Min. : 5.2
## 1st Qu.: 8 1st Qu.: 6.2
## Median : 8 Median : 7.0
## Mean : 9 Mean : 7.5
## 3rd Qu.: 8 3rd Qu.: 8.2
## Max. :19 Max. :12.5
For each column, calculate (to two decimal places):
a.1 The mean of data1 x and y:
data_means <- c(format(mean(data1$x), nsmall = 2),format(mean(data1$y), nsmall = 2)); data_means
## [1] "9.00" "7.50"
a.2 The mean of data2 x and y:
data_means <- c(format(mean(data2$x), nsmall = 2),format(mean(data2$y), nsmall = 2)); data_means
## [1] "9.00" "7.50"
a.3 The mean of data3 x and y:
data_means <- c(format(mean(data3$x), nsmall = 2),format(mean(data3$y), nsmall = 2)); data_means
## [1] "9.00" "7.50"
a.4 The mean of data4 x and y:
data_means <- c(format(mean(data4$x), nsmall = 2),format(mean(data4$y), nsmall = 2)); data_means
## [1] "9.00" "7.50"
b.1 The median of data1 x and y:
data_medians <- c(format(median(data1$x), nsmall = 2),format(median(data1$y), nsmall = 2)); data_medians
## [1] "9.00" "7.58"
b.2 The median of data2 x and y:
data_medians <- c(format(median(data2$x), nsmall = 2),format(median(data2$y), nsmall = 2)); data_medians
## [1] "9.00" "8.14"
b.3 The median of data3 x and y:
data_medians <- c(format(median(data3$x), nsmall = 2),format(median(data3$y), nsmall = 2)); data_medians
## [1] "9.00" "7.11"
b.4 The median of data4 x and y:
data_medians <- c(format(median(data4$x), nsmall = 2),format(median(data4$y), nsmall = 2)); data_medians
## [1] "8.00" "7.04"
c.1 The standard deviation of data1 x and y:
data_sd <- c(format(sd(data1$x), nsmall = 2),format(sd(data1$y), nsmall = 2)); data_sd
## [1] "3.32" "2.03"
c.2 The standard deviation of data2 x and y:
data_sd <- c(format(sd(data2$x), nsmall = 2),format(sd(data2$y), nsmall = 2)); data_sd
## [1] "3.32" "2.03"
c.3 The standard deviation of data3 x and y:
data_sd <- c(format(sd(data3$x), nsmall = 2),format(sd(data3$y), nsmall = 2)); data_sd
## [1] "3.32" "2.03"
c.4 The standard deviation of data4 x and y:
data_sd <- c(format(sd(data4$x), nsmall = 2),format(sd(data4$y), nsmall = 2)); data_sd
## [1] "3.32" "2.03"
For each x and y pair, calculate (also to two decimal places; 1 pt):
d.1 The correlation of data1:
cor(data1$x, data1$y)
## [1] 0.82
d.2 The correlation of data2:
cor(data2$x, data2$y)
## [1] 0.82
d.3 The correlation of data3:
cor(data3$x, data3$y)
## [1] 0.82
d.4 The correlation of data4:
cor(data4$x, data4$y)
## [1] 0.82
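The per-dataset calculations above can also be gathered into a single table. The sketch below assumes that data1 through data4 are the same values as R's built-in `anscombe` data frame (columns `x1`..`x4`, `y1`..`y4`), which they appear to be; the `stats` name is illustrative.

```r
# Assumption: the built-in anscombe data frame holds the same four datasets
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    median_x = median(x), median_y = median(y),
    sd_x = sd(x), sd_y = sd(y),
    cor = cor(x, y))
})
colnames(stats) <- paste0("data", 1:4)

# One column per dataset, rounded to two decimal places
round(stats, 2)
```

Laid out this way, only the median rows differ noticeably across the four columns.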
m1 <- lm(y ~ x, data = data1)
summary(m1)
##
## Call:
## lm(formula = y ~ x, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9213 -0.4558 -0.0414 0.7094 1.8388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000 1.125 2.67 0.0257 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
e.1 The equation of data1: y = 3.00 + 0.50x
m2 <- lm(y ~ x, data = data2)
summary(m2)
##
## Call:
## lm(formula = y ~ x, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.901 -0.761 0.129 0.949 1.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.67 0.0258 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
e.2 The equation of data2: y = 3.00 + 0.50x
m3 <- lm(y ~ x, data = data3)
summary(m3)
##
## Call:
## lm(formula = y ~ x, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.159 -0.615 -0.230 0.154 3.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
e.3 The equation of data3: y = 3.00 + 0.50x
m4 <- lm(y ~ x, data = data4)
summary(m4)
##
## Call:
## lm(formula = y ~ x, data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
e.4 The equation of data4: y = 3.00 + 0.50x
f.1 The data1 R-squared: 0.67
f.2 The data2 R-squared: 0.67
f.3 The data3 R-squared: 0.67
f.4 The data4 R-squared: 0.67
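Rather than reading the coefficients and R-squared off each printed summary, they can be extracted programmatically. A minimal sketch, again assuming the built-in `anscombe` data frame matches data1 through data4; `fits` and `fit_table` are illustrative names.

```r
# Assumption: anscombe columns x1..x4 / y1..y4 correspond to data1..data4
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Intercept, slope, and R-squared for each fitted model
fit_table <- t(sapply(fits, function(m) {
  c(intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    r_squared = summary(m)$r.squared)
}))
rownames(fit_table) <- paste0("data", 1:4)

round(fit_table, 2)
```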
For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)
data1
par(mfrow=c(2,2))
plot(data1$x, data1$y)
plot(m1$residuals ~ data1$x)
abline(h = 0, lty = 3)
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals)
data1 failed independence
data2
par(mfrow=c(2,2))
plot(data2$x, data2$y)
plot(m2$residuals ~ data2$x)
abline(h = 0, lty = 3)
hist(m2$residuals)
qqnorm(m2$residuals)
qqline(m2$residuals)
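data2 failed linearity: the scatterplot and the residual plot both show a clear curved pattern rather than random scatter around a line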
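One way to back up the linearity concern numerically is to compare the straight-line fit with a fit that adds a squared term. This is a sketch of a supplementary check, not part of the original answer; the model names are illustrative.

```r
# data2 as defined earlier in this document
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74))

linear_fit    <- lm(y ~ x, data = data2)
quadratic_fit <- lm(y ~ x + I(x^2), data = data2)   # adds a curvature term

# Compare how much variance each model explains
r2 <- c(linear    = summary(linear_fit)$r.squared,
        quadratic = summary(quadratic_fit)$r.squared)
round(r2, 3)

# Formal comparison of the two nested models
anova(linear_fit, quadratic_fit)
```

A large jump in R-squared when the squared term is added suggests the relationship is curved, which is consistent with what the residual plot shows.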
data3
par(mfrow=c(2,2))
plot(data3$x, data3$y)
plot(m3$residuals ~ data3$x)
abline(h = 0, lty = 3)
hist(m3$residuals)
qqnorm(m3$residuals)
qqline(m3$residuals)
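data3 failed constant variability: a single outlier at x = 13 (y = 12.74) sits well above an otherwise tight linear pattern and produces the largest residual (3.241)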
data4
par(mfrow=c(2,2))
plot(data4$x, data4$y)
plot(m4$residuals ~ data4$x)
abline(h = 0, lty = 3)
hist(m4$residuals)
qqnorm(m4$residuals)
qqline(m4$residuals)
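data4 failed normality; in addition, every x value is 8 except for a single point at x = 19, so that one point alone determines the slope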
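The influence of individual points on the data4 fit can be quantified with leverage values. The sketch below is a supplementary check rather than part of the original answer; it uses `hatvalues()` from base R's stats package, and `leverage` is an illustrative name.

```r
# data4 as defined earlier in this document
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
m4 <- lm(y ~ x, data = data4)

# Leverage of each observation (diagonal of the hat matrix)
leverage <- hatvalues(m4)
round(leverage, 2)

# Index of the observation with the highest leverage
which.max(leverage)
```

A leverage close to 1 means the fitted line is forced through that observation, so its residual is essentially zero and the usual residual diagnostics say little about it.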
Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)
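Visualizations help in exploring data and in supporting the conclusions of an analysis. They make trends and abnormalities in the data quick to see. The four datasets above illustrate this: their means, standard deviations, correlations, and fitted regression lines are nearly identical, so only plots reveal how different they are.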
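A side-by-side panel of the four scatterplots, each with its fitted line, is one possible visualization for this question. The sketch assumes the built-in `anscombe` data frame holds the same values as data1 through data4; the axis limits are arbitrary choices made so that the panels share a scale.

```r
# Assumption: anscombe columns x1..x4 / y1..y4 correspond to data1..data4
old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid; keep the old settings

for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  # Shared axis limits make the four panels directly comparable
  plot(x, y, main = paste0("data", i), xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x), lty = 2)      # least-squares line for this pair
}

par(old_par)                      # restore the previous layout
```

With a common scale, the nearly identical dashed lines sit on top of four visibly different point patterns.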