Out of all the choices, only B. daysDrive is both quantitative and discrete. Although daysDrive also appears in answers D and E, those choices are incorrect because gasMonth is quantitative and continuous, and car is categorical (not quantitative).
A. mean = 3.3, median = 3.5
This should be a double-blinded, randomized, controlled, prospective study. The principal investigator needs to minimize confounders (to maximize independence) among the patients and randomize them to either the intervention or the placebo. The researcher should also attempt to track all participants over the duration of the study period to minimize bias from loss to follow-up. Other designs, such as observational studies, may establish correlation but cannot establish causation.
The closest answer here is: A. Randomly assign Ebola patients to one of two groups, either the treatment or placebo group, and then compare the fever of the two groups.
A large chi-square statistic (provided the corresponding p-value is < 0.05) suggests that we reject the null hypothesis that hair color and eye color are independent. In this case, the best answer is: C. There is an association between natural hair color and eye color.
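As a rough sketch of how such a test is run (using R's built-in HairEyeColor dataset rather than the actual table from this question):
# Collapse the built-in HairEyeColor table over Sex, then test for independence
hair.eye <- margin.table(HairEyeColor, margin = c(1, 2))
test <- chisq.test(hair.eye)
test$statistic   # a large chi-square statistic ...
test$p.value     # ... with a very small p-value -> reject independence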
If the point is below Quartile1 - 1.5 * IQR or above Quartile3 + 1.5 * IQR, then it is considered an outlier in a boxplot. The answer here is: B. 17.8 and 69.0.
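A minimal sketch of this rule in R, using a made-up vector of values rather than the data from the question:
# Flag boxplot outliers with the 1.5 * IQR rule (values below are hypothetical)
scores <- c(18, 22, 25, 30, 31, 33, 35, 38, 41, 70)
q1 <- quantile(scores, 0.25)
q3 <- quantile(scores, 0.75)
lower.fence <- q1 - 1.5 * IQR(scores)
upper.fence <- q3 + 1.5 * IQR(scores)
scores[scores < lower.fence | scores > upper.fence]   # 70 is flagged as an outlier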
The “median and interquartile range” are resistant to outliers, whereas the “mean and standard deviation” are not. (Answer: D)
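A quick illustration of why, with made-up numbers:
# Adding a single extreme value moves the mean and SD substantially,
# but barely changes the median and IQR (data are hypothetical).
x <- c(10, 11, 12, 13, 14)
x.out <- c(x, 100)
c(mean(x), mean(x.out))       # mean jumps from 12 to ~26.7
c(sd(x), sd(x.out))           # SD jumps from ~1.6 to ~36
c(median(x), median(x.out))   # median moves only from 12 to 12.5
c(IQR(x), IQR(x.out))         # IQR moves only from 2 to 2.5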
Figure A below represents the distribution of an observed variable. Figure B below represents the distribution of the mean from 500 random samples of size 30 from A. The mean of A is 5.05 and the mean of B is 5.04. The standard deviations of A and B are 3.22 and 0.58, respectively.
Figure A shows a distribution with a fairly moderate right skew and lower kurtosis, with a mean of 5.05 and a standard deviation of 3.22. Figure B, by contrast, shows a roughly normal distribution with higher kurtosis; its mean is 5.04 with a standard deviation of 0.58.
Figure A is the distribution of an observed variable, whereas Figure B is the distribution of the mean from 500 random samples of size 30 drawn from A. The standard deviations differ because they are calculated differently. The standard deviation of A comes from the usual SD formula applied to the observed values. In Figure B, however, the spread of the distribution of sample means is the standard error (NOT the standard deviation of the observed variable). The standard error formula is SD/sqrt(30) = 3.22/sqrt(30) ≈ 0.59, consistent with the 0.58 observed in B.
This is the Central Limit Theorem: given a sufficiently large sample size, the distribution of sample means approaches a normal distribution, regardless of the shape of the underlying distribution.
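A minimal sketch of both the calculation and the theorem, assuming a made-up right-skewed variable (an exponential) in place of the original data behind Figure A:
3.22 / sqrt(30)   # standard error of the mean, roughly 0.59 (close to the 0.58 seen in B)
# Simulate the CLT: the observed variable is skewed, but the sample means are not
set.seed(1)
obs <- rexp(10000, rate = 1/5)                               # skewed "observed" variable
sample.means <- replicate(500, mean(sample(obs, size = 30)))
hist(sample.means)                                           # roughly normal and bell-shaped
sd(sample.means)                                             # close to sd(obs) / sqrt(30)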
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate the mean, median, and standard deviation (to two decimal places):
# data1
mean.1.x <- round(mean(data1$x), 2)
mean.1.y <- round(mean(data1$y), 2)
# data2
mean.2.x <- round(mean(data2$x), 2)
mean.2.y <- round(mean(data2$y), 2)
# data3
mean.3.x <- round(mean(data3$x), 2)
mean.3.y <- round(mean(data3$y), 2)
# data4
mean.4.x <- round(mean(data4$x), 2)
mean.4.y <- round(mean(data4$y), 2)
paste0("Data1 - Mean(x):", mean.1.x)
## [1] "Data1 - Mean(x):9"
paste0("Data1 - Mean(y):", mean.1.y)
## [1] "Data1 - Mean(y):7.5"
paste0("Data2 - Mean(x):", mean.2.x)
## [1] "Data2 - Mean(x):9"
paste0("Data2 - Mean(y):", mean.2.y)
## [1] "Data2 - Mean(y):7.5"
paste0("Data3 - Mean(x):", mean.3.x)
## [1] "Data3 - Mean(x):9"
paste0("Data3 - Mean(y):", mean.3.y)
## [1] "Data3 - Mean(y):7.5"
paste0("Data4 - Mean(x):", mean.4.x)
## [1] "Data4 - Mean(x):9"
paste0("Data4 - Mean(y):", mean.4.y)
## [1] "Data4 - Mean(y):7.5"
# data1
median.1.x <- round(median(data1$x), 2)
median.1.y <- round(median(data1$y), 2)
# data2
median.2.x <- round(median(data2$x), 2)
median.2.y <- round(median(data2$y), 2)
# data3
median.3.x <- round(median(data3$x), 2)
median.3.y <- round(median(data3$y), 2)
# data4
median.4.x <- round(median(data4$x), 2)
median.4.y <- round(median(data4$y), 2)
paste0("Data1 - Median(x):", median.1.x)
## [1] "Data1 - Median(x):9"
paste0("Data1 - Median(y):", median.1.y)
## [1] "Data1 - Median(y):7.58"
paste0("Data2 - Median(x):", median.2.x)
## [1] "Data2 - Median(x):9"
paste0("Data2 - Median(y):", median.2.y)
## [1] "Data2 - Median(y):8.14"
paste0("Data3 - Median(x):", median.3.x)
## [1] "Data3 - Median(x):9"
paste0("Data3 - Median(y):", median.3.y)
## [1] "Data3 - Median(y):7.11"
paste0("Data4 - Median(x):", median.4.x)
## [1] "Data4 - Median(x):8"
paste0("Data4 - Median(y):", median.4.y)
## [1] "Data4 - Median(y):7.04"
# data1
sd.1.x <- round(sd(data1$x), 2)
sd.1.y <- round(sd(data1$y), 2)
# data2
sd.2.x <- round(sd(data2$x), 2)
sd.2.y <- round(sd(data2$y), 2)
# data3
sd.3.x <- round(sd(data3$x), 2)
sd.3.y <- round(sd(data3$y), 2)
# data4
sd.4.x <- round(sd(data4$x), 2)
sd.4.y <- round(sd(data4$y), 2)
paste0("Data1 - SD(x):", sd.1.x)
## [1] "Data1 - SD(x):3.32"
paste0("Data1 - SD(y):", sd.1.y)
## [1] "Data1 - SD(y):2.03"
paste0("Data2 - SD(x):", sd.2.x)
## [1] "Data2 - SD(x):3.32"
paste0("Data2 - SD(y):", sd.2.y)
## [1] "Data2 - SD(y):2.03"
paste0("Data3 - SD(x):", sd.3.x)
## [1] "Data3 - SD(x):3.32"
paste0("Data3 - SD(y):", sd.3.y)
## [1] "Data3 - SD(y):2.03"
paste0("Data4 - SD(x):", sd.4.x)
## [1] "Data4 - SD(x):3.32"
paste0("Data4 - SD(y):", sd.4.y)
## [1] "Data4 - SD(y):2.03"
For each x and y pair, calculate the correlation (also to two decimal places) and fit a linear regression model:
# Correlation using the cor() function
data1.cor <- round(cor(x = data1$x, y = data1$y), 2)
data2.cor <- round(cor(x = data2$x, y = data2$y), 2)
data3.cor <- round(cor(x = data3$x, y = data3$y), 2)
data4.cor <- round(cor(x = data4$x, y = data4$y), 2)
paste0("Data1 Correlation: ", data1.cor)
## [1] "Data1 Correlation: 0.82"
paste0("Data2 Correlation: ", data2.cor)
## [1] "Data2 Correlation: 0.82"
paste0("Data3 Correlation: ", data3.cor)
## [1] "Data3 Correlation: 0.82"
paste0("Data4 Correlation: ", data4.cor)
## [1] "Data4 Correlation: 0.82"
lm1 <- lm(y ~ x, data = data1)
summary(lm1)
##
## Call:
## lm(formula = y ~ x, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9213 -0.4558 -0.0414 0.7094 1.8388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000 1.125 2.67 0.0257 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
lm2 <- lm(y ~ x, data = data2)
summary(lm2)
##
## Call:
## lm(formula = y ~ x, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.901 -0.761 0.129 0.949 1.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.67 0.0258 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
lm3 <- lm(y ~ x, data = data3)
summary(lm3)
##
## Call:
## lm(formula = y ~ x, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.159 -0.615 -0.230 0.154 3.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
lm4 <- lm(y ~ x, data = data4)
summary(lm4)
##
## Call:
## lm(formula = y ~ x, data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
Data1: ŷ = 3.00 + 0.50 * x
Data2: ŷ = 3.00 + 0.50 * x
Data3: ŷ = 3.00 + 0.50 * x
Data4: ŷ = 3.00 + 0.50 * x
Data1: multiple R-squared = 0.67, adjusted R-squared = 0.63
Data2: multiple R-squared = 0.67, adjusted R-squared = 0.63
Data3: multiple R-squared = 0.67, adjusted R-squared = 0.63
Data4: multiple R-squared = 0.67, adjusted R-squared = 0.63
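The same values can also be pulled directly from the fitted model objects rather than read off the printed summaries:
# Extract multiple and adjusted R-squared from each fitted model
r2 <- sapply(list(lm1, lm2, lm3, lm4), function(m) summary(m)$r.squared)
adj.r2 <- sapply(list(lm1, lm2, lm3, lm4), function(m) summary(m)$adj.r.squared)
round(r2, 2)       # multiple R-squared: ~0.67 for all four datasets
round(adj.r2, 2)   # adjusted R-squared: ~0.63 for all four datasets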
For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots!
For a pair to be appropriate for a linear regression model, it must satisfy these conditions: (1) linearity, (2) nearly normal residuals, (3) constant variability of the residuals, and (4) independent observations.
We will evaluate each dataset in turn against these criteria.
# Data1
#1. Linearity
plot(x = data1$x, y = data1$y)
# There does appear to be a linear component to this plot.
#2. Nearly normal residuals
hist(lm1$residuals, breaks = 10)
qqnorm(lm1$residuals)
qqline(lm1$residuals)
# This one is questionable given the small sample size. The qqnorm() plot also shows the points deviating from the qqline(). Normality of the residuals might hold with a larger sample, but it is hard to confirm here.
#3 Constant variability
plot(lm1$residuals ~ data1$x)
abline(h = 0, lty = 3)
# On visual inspection, there does appear to be constant variability
#4 Independent Observations
# The observations are assumed to be independent, as the author has not provided any details about how the data were collected, and there is no indication that one observation influenced another.
Data 1 does NOT appear to satisfy the criteria for a linear regression model, due to violation of criterion 2 (nearly normal residuals).
# Data2
#1. Linearity
plot(x = data2$x, y = data2$y)
# There does NOT appear to be a linear relationship here; the points follow a curved pattern.
#2. Nearly normal residuals
hist(lm2$residuals, breaks = 10)
qqnorm(lm2$residuals)
qqline(lm2$residuals)
# This also violates the criterion: the distribution of the residuals is NOT normal.
#3 Constant variability
plot(lm2$residuals ~ data2$x)
abline(h = 0, lty = 3)
# On visual inspection, there does NOT appear to be constant variability; there is a clear pattern in the residuals. It fails criterion 3.
#4 Independent Observations
# The observations are assumed to be independent, as the author has not provided any details about how the data were collected, and there is no indication that one observation influenced another.
Data 2 is NOT appropriate for a linear regression model, as it violates criterion 1 (linearity), criterion 2 (nearly normal residuals), and criterion 3 (constant variability).
# Data3
#1. Linearity
plot(x = data3$x, y = data3$y)
# There does appear to be a linear component to this plot.
#2. Nearly normal residuals
hist(lm3$residuals, breaks = 10)
qqnorm(lm3$residuals)
qqline(lm3$residuals)
# It is questionable whether the residuals are nearly normal. On the qqnorm() plot, all but one outlying residual lie close to the line, and the histogram is hard to read with such a small sample size.
#3 Constant variability
plot(lm3$residuals ~ data3$x)
abline(h = 0, lty = 3)
# On visual inspection there is a problem: an obvious pattern in the residuals. It fails this criterion.
#4 Independent Observations
# The observations are assumed to be independent, as the author has not provided any details about how the data were collected, and there is no indication that one observation influenced another.
Data 3 fails criterion (3) and likely criterion (2) as well, so a linear regression model is inappropriate here.
# Data4
#1. Linearity
plot(x = data4$x, y = data4$y)
# There is NO linear relationship; nearly all x values are identical, with a single point at x = 19. This fails the criterion.
#2. Nearly normal residuals
hist(lm4$residuals, breaks = 10)
qqnorm(lm4$residuals)
qqline(lm4$residuals)
# There appears to be nearly normal residuals.
#3 Constant variability
plot(lm4$residuals ~ data4$x)
abline(h = 0, lty = 3)
# On visual inspection there is a problem: all but one residual sit at the same x value, so constant variability cannot reasonably be assessed. This criterion fails as well.
#4 Independent Observations
# The observations are assumed to be independent, as the author has not provided any details about how the data were collected, and there is no indication that one observation influenced another.
Data 4 fails on criteria (1) and (3).
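As a compact alternative to building each diagnostic plot by hand, R's plot() method for lm objects produces the standard diagnostics (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage), for example:
# Built-in regression diagnostics for data4; its single influential point (x = 19)
# stands out in the residuals-vs-leverage panel.
par(mfrow = c(2, 2))
plot(lm4)
par(mfrow = c(1, 1))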
Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create.
It is important to visualize data because visualization provides insight that summary statistics alone cannot. For example, looking at the residuals can reveal an underlying pattern that is not apparent from the numbers alone. When we calculated the means and standard deviations, they were the same across all four datasets, and the medians were similar to each other. The lm() function also produced essentially the same linear regression model for each. However, each dataset has unique characteristics that only become obvious once it is visualized. Visualization allows us to easily see patterns that were not apparent before. Please see above for the included visualizations.
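As one more illustration, plotting all four datasets side by side with their (nearly identical) fitted lines makes the differences hidden by the summary statistics immediately visible:
# Scatterplots of the four datasets with their fitted regression lines
par(mfrow = c(2, 2))
plot(y ~ x, data = data1, main = "Data1"); abline(lm1)
plot(y ~ x, data = data2, main = "Data2"); abline(lm2)
plot(y ~ x, data = data3, main = "Data3"); abline(lm3)
plot(y ~ x, data = data4, main = "Data4"); abline(lm4)
par(mfrow = c(1, 1))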