Part 1 and Part 2 are in separate tabs.
Answer 1: Variable \(daysDrive\) is quantitative and discrete.
Answer 2: Option (a): mean = 3.3, median = 3.5
Answer 3: a. Randomly assign Ebola patients to one of two groups, either the treatment or placebo group, and then compare the fever of the two groups.
Answer 4: c. there is an association between natural hair color and eye color.
Answer 5: b. 17.8 and 69.0. \(IQR = Q3 - Q1 = 49.8 - 37.0 = 12.8\)
Outliers generally lie more than \(1.5 \times IQR\) beyond the quartiles, i.e., below \(Q1 - 1.5 \times IQR = 17.8\) or above \(Q3 + 1.5 \times IQR = 69.0\).
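As a quick check, the fence arithmetic in R (using the quartile values given above):
q1 <- 37.0
q3 <- 49.8
iqr <- q3 - q1      # 12.8
q1 - 1.5 * iqr      # lower fence: 17.8
q3 + 1.5 * iqr      # upper fence: 69.0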
Answer 6: d. The median and interquartile range are resistant to outliers, whereas the mean and standard deviation are not.
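A small illustration of this resistance, using made-up numbers rather than anything from the exam:
x <- c(2, 3, 4, 5, 6)
x_out <- c(x, 100)             # same data plus one extreme outlier
c(mean(x), mean(x_out))        # mean jumps from 4 to 20
c(median(x), median(x_out))    # median moves only from 4 to 4.5
c(sd(x), sd(x_out))            # sd inflates dramatically
c(IQR(x), IQR(x_out))          # IQR barely changes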
Answer 7: a. Distribution A is unimodal and left-skewed. Distribution B is unimodal, symmetric, and nearly normal.
Distribution B represents the distribution of sample means drawn from population A. Each sample is random, so its mean should be close to the population mean; the mean of any single sample may differ because of sampling variability, but the means of a large number of samples cluster tightly around the true population mean. Distributions A and B also represent different things: A is a distribution of individual observations, while B is a distribution of sample means across many samples. Accordingly, the standard deviation of A describes the variability of observations, while the standard deviation of B (the standard error) describes the variability of the sample mean. For samples of size \(n\), the standard error is approximately the population standard deviation divided by \(\sqrt{n}\), so the two can differ widely.
This phenomenon is described by the Central Limit Theorem.
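A short simulation sketch of this (hypothetical data, with a left-skewed population standing in for Distribution A):
set.seed(42)
population <- 10 - rexp(1e5)    # left-skewed population (like Distribution A)
n <- 50                         # size of each sample
sample_means <- replicate(1000, mean(sample(population, n)))
sd(population)                  # variability of individual observations
sd(sample_means)                # variability of sample means (standard error)
sd(population) / sqrt(n)        # CLT prediction: close to sd(sample_means)
hist(sample_means)              # unimodal, symmetric, nearly normal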
# Initial data
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
all_dfs <- c("data1", "data2", "data3", "data4")
a. The mean (for x and y separately; 1 pt).
for (i in all_dfs) {
my_df <- get(i)
print(i)
print(sapply(my_df, function(x) round(mean(x), 2)))
}
## [1] "data1"
## x y
## 9.0 7.5
## [1] "data2"
## x y
## 9.0 7.5
## [1] "data3"
## x y
## 9.0 7.5
## [1] "data4"
## x y
## 9.0 7.5
b. The median (for x and y separately; 1 pt).
for (i in all_dfs) {
my_df <- get(i)
print(i)
print(sapply(my_df, function(x) round(median(x), 2)))
}
## [1] "data1"
## x y
## 9.0 7.6
## [1] "data2"
## x y
## 9.0 8.1
## [1] "data3"
## x y
## 9.0 7.1
## [1] "data4"
## x y
## 8 7
c. The standard deviation (for x and y separately; 1 pt).
for (i in all_dfs) {
my_df <- get(i)
print(i)
print(sapply(my_df, function(x) round(sd(x), 2)))
}
## [1] "data1"
## x y
## 3.3 2.0
## [1] "data2"
## x y
## 3.3 2.0
## [1] "data3"
## x y
## 3.3 2.0
## [1] "data4"
## x y
## 3.3 2.0
d. The correlation (1 pt).
for (i in all_dfs) {
my_df <- get(i)
print(i)
print(round(cor(my_df), 2))
}
## [1] "data1"
## x y
## x 1.00 0.82
## y 0.82 1.00
## [1] "data2"
## x y
## x 1.00 0.82
## y 0.82 1.00
## [1] "data3"
## x y
## x 1.00 0.82
## y 0.82 1.00
## [1] "data4"
## x y
## x 1.00 0.82
## y 0.82 1.00
e. Linear regression equation (2 pts).
m_list <- c()
for (i in all_dfs) {
i_df <- get(i)
i_lm <- lm(y ~ x, i_df)
new_var <- paste0("m_", i)
assign(new_var, i_lm)
m_list <- c(m_list, new_var)
intercept <- round(coef(i_lm)[1], 2)
slope <- round(coef(i_lm)[2], 2)
print(i)
print(paste0("y-hat = ", slope, " * x + ", intercept))
}
## [1] "data1"
## [1] "y-hat = 0.5 * x + 3"
## [1] "data2"
## [1] "y-hat = 0.5 * x + 3"
## [1] "data3"
## [1] "y-hat = 0.5 * x + 3"
## [1] "data4"
## [1] "y-hat = 0.5 * x + 3"
f. R-Squared (2 pts).
sapply(m_list, function(x) round(summary(get(x))$r.squared, 2))
## m_data1 m_data2 m_data3 m_data4
## 0.67 0.67 0.67 0.67
For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
i_df_name <- all_dfs[i]
i_df <- get(i_df_name)
i_mod <- get(m_list[i])
print(paste("Check conditions for", i_df_name))
# linearity plot
plot(y ~ x, i_df, main = "Linearity check")
abline(i_mod)
# residual normality check
hist(i_mod$residuals, main = "Histogram of Residuals")
qqnorm(i_mod$residuals, main = "Nearly normal residual check")
qqline(i_mod$residuals)
# Homoscedasticity check
plot(i_mod$residuals ~ i_df$x, main = "Homoscedasticity check")
abline(h = 0, lty = 3)
}
## [1] "Check conditions for data1"
## [1] "Check conditions for data2"
## [1] "Check conditions for data3"
## [1] "Check conditions for data4"
par(op)
Based on the plots: data1 looks linear with roughly constant scatter, so a linear model is reasonable. data2 follows a clear curve (a parabola), so a straight line is not appropriate. data3 is nearly perfectly linear apart from a single outlier that pulls the fit. data4 has all but one observation at a single x value, so the line is determined entirely by one high-leverage point. We also cannot verify independence, since we do not know how the data were collected.
Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)
A linear model is only valid if the necessary conditions have been met. Creating visualizations is one way to check for linearity, nearly normal residuals, homoscedasticity, and sometimes even independence when the data-collection order is provided. The four datasets above illustrate why this matters: they share nearly identical means, standard deviations, correlations, regression equations, and \(R^2\) values, yet their scatterplots reveal completely different structures.
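For instance, plotting all four pairs side by side (reusing the data frames and all_dfs defined above) makes the contrast immediate, since the fitted lines are identical but the point patterns are not:
op <- par(mfrow = c(2, 2))
for (i in all_dfs) {
  my_df <- get(i)
  plot(y ~ x, my_df, main = i)    # scatterplot of each dataset
  abline(lm(y ~ x, my_df))        # near-identical fitted line in every panel
}
par(op)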