Answer:
daysDrive - Since this is indicated to be a count of days, this value can only take on non-negative integer values for the number of days the student drives.
gasMonth - This value is indicated to be a cost, which is only calculated to the nearest $0.01. Therefore, there is not an infinite number of possible values within any finite range. Also, note that if this value were expressed in cents, it would take on only non-negative integers, which is similar to the daysDrive variable.
Answer:
Since the data appear to have a left skew, we can immediately rule out any option that states the mean is greater than the median (b, d).
When we evaluate the histogram and add the percentages from right to left, approximately 41% of the data is above 3.5. Because this is less than 50%, the median must be 3.5 or below. This leaves us with one option, which is:
Answer:
Answer:
To determine possible outliers, the researcher would examine points that fall beyond the whisker limits of the boxplot, which lie 1.5 * IQR below Q1 and 1.5 * IQR above Q3.
Q1 <- 37
Q3 <- 49.8
iqr <- Q3 - Q1  # lower-case name avoids masking the built-in IQR() function
(Lower_whisker <- Q1 - 1.5 * iqr)
## [1] 17.8
(Upper_whisker <- Q3 + 1.5 * iqr)
## [1] 69
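The limit calculation above can also be wrapped in a small helper and used to flag observations. This is only a minimal sketch: the function name `outlier_fences` and the vector `x` are hypothetical and are not part of the original problem.

```r
# Hypothetical helper: returns the Q1 - k * IQR and Q3 + k * IQR limits.
outlier_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

fences <- outlier_fences(37, 49.8)

# Made-up observations, used only to illustrate the flagging step.
x <- c(12, 25, 37, 44, 49.8, 60, 75)
flagged <- x[x < fences[["lower"]] | x > fences[["upper"]]]
flagged
```

Any value in `flagged` lies beyond the limits and would be examined as a possible outlier.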
Answer:
Answer:
The observations (Figure A) are unimodal and have a significant right skew. The sampling distribution (Figure B) is unimodal and appears to have a roughly normal distribution centered around the mean. However, it appears that the overlaid normal distribution may be wider than the data.
Answer:
Figure B is a representation of the sample means taken from Figure A. It is therefore a distribution of point estimates. If a sample consists of enough independent observations (usually at least 30) and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model centered around the mean.
The standard deviations differ between Figures A and B because Figure B is a distribution of point estimates based on samples of 30. Therefore, the standard deviation of Figure B is the standard error, that is, the typical error or uncertainty associated with a point estimate from samples of that size. Figure A is the distribution of the actual variable; therefore the standard deviation associated with this distribution roughly describes how far away the typical observation is from the mean.
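The difference between the two standard deviations can be illustrated with a short simulation. This is a sketch under stated assumptions: the exponential population, the seed, and the number of replications are all hypothetical choices, and "samples of 30" is read here as a sample size of n = 30.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Assumed right-skewed population (exponential with rate 1); not the data in Figure A.
population <- rexp(100000, rate = 1)

n <- 30
sample_means <- replicate(5000, mean(sample(population, n)))

sd(population)            # spread of individual observations (analogous to Figure A)
sd(sample_means)          # spread of the sample means (analogous to Figure B)
sd(population) / sqrt(n)  # standard error suggested by theory
```

The last two values should be close to each other and much smaller than the first.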
Answer:
The statistical principle that describes this phenomenon is the Central Limit Theorem (CLT). It is described as follows:
\[ \lim_{n \to \infty} P\left( \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \le z \right) = \Phi\left( z \right) \]
where \(\Phi\) is the cumulative distribution function (cdf) of the standard normal distribution.
In other words, the distribution of a sample mean taken from a population is well approximated by a normal model:
\[\bar { x } \sim N\left( mean=\mu ,SE=\frac { \sigma }{ \sqrt { n } } \right)\]
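As a rough numerical check of the statement above, the standardized sample mean can be simulated and its empirical probabilities compared with \(\Phi(z)\), which is `pnorm(z)` in R. The exponential population (mean 1, standard deviation 1), the seed, and n = 30 are assumptions made only for this sketch.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

mu <- 1      # mean of an exponential(rate = 1) population
sigma <- 1   # standard deviation of the same population
n <- 30

xbar <- replicate(10000, mean(rexp(n, rate = 1)))
z_scores <- (xbar - mu) / (sigma / sqrt(n))

z <- c(-1, 0, 1)
empirical <- sapply(z, function(q) mean(z_scores <= q))
theoretical <- pnorm(z)
round(cbind(z, empirical, theoretical), 3)
```

At a finite n the two columns only agree approximately, since the theorem is a statement about the limit.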
Consider the four datasets, each with two columns (x and y), provided below.
options(digits = 2)
data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5), y = c(8.04, 6.95, 7.58, 8.81,
8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68))
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5), y = c(9.14, 8.14, 8.74, 8.77,
9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74))
data3 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5), y = c(7.46, 6.77, 12.74, 7.11,
7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73))
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8), y = c(6.58, 5.76, 7.71, 8.84,
8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
For each column, calculate (to two decimal places):
Answer:
options(digits = 2)
format(mean(data1$x), nsmall = 2)
## [1] "9.00"
format(mean(data1$y), nsmall = 2)
## [1] "7.50"
format(mean(data2$x), nsmall = 2)
## [1] "9.00"
format(mean(data2$y), nsmall = 2)
## [1] "7.50"
format(mean(data3$x), nsmall = 2)
## [1] "9.00"
format(mean(data3$y), nsmall = 2)
## [1] "7.50"
format(mean(data4$x), nsmall = 2)
## [1] "9.00"
format(mean(data4$y), nsmall = 2)
## [1] "7.50"
Answer:
options(digits = 2)
format(median(data1$x), nsmall = 2)
## [1] "9.00"
format(median(data1$y), nsmall = 2)
## [1] "7.58"
format(median(data2$x), nsmall = 2)
## [1] "9.00"
format(median(data2$y), nsmall = 2)
## [1] "8.14"
format(median(data3$x), nsmall = 2)
## [1] "9.00"
format(median(data3$y), nsmall = 2)
## [1] "7.11"
format(median(data4$x), nsmall = 2)
## [1] "8.00"
format(median(data4$y), nsmall = 2)
## [1] "7.04"
Answer:
options(digits = 2)
format(sd(data1$x), nsmall = 2)
## [1] "3.32"
format(sd(data1$y), nsmall = 2)
## [1] "2.03"
format(sd(data2$x), nsmall = 2)
## [1] "3.32"
format(sd(data2$y), nsmall = 2)
## [1] "2.03"
format(sd(data3$x), nsmall = 2)
## [1] "3.32"
format(sd(data3$y), nsmall = 2)
## [1] "2.03"
format(sd(data4$x), nsmall = 2)
## [1] "3.32"
format(sd(data4$y), nsmall = 2)
## [1] "2.03"
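The repeated calls above can be condensed into a single table. To keep this sketch self-contained, it assumes that the four datasets match R's built-in `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`); that is an assumption made for the example, not how the answers above were computed.

```r
# Assumption: `anscombe` (datasets package) holds the same values as data1..data4.
stats <- sapply(anscombe, function(col) {
  c(mean = mean(col), median = median(col), sd = sd(col))
})

# One row per column, rounded to two decimal places.
round(t(stats), 2)
```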
For each x and y pair, calculate (also to two decimal places; 1 pt):
Answer:
options(digits = 2)
format(cor(data1$x, data1$y), nsmall = 2)
## [1] "0.82"
format(cor(data2$x, data2$y), nsmall = 2)
## [1] "0.82"
format(cor(data3$x, data3$y), nsmall = 2)
## [1] "0.82"
format(cor(data4$x, data4$y), nsmall = 2)
## [1] "0.82"
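The four correlations can also be computed in one pass. As a sketch, this again assumes the built-in `anscombe` data frame contains the same values as data1 to data4.

```r
# Assumption: `anscombe` (datasets package) holds the same values as data1..data4.
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(r, 2)
```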
Answer:
lm_d1 <- lm(y ~ x, data = data1)
summary(lm_d1)
##
## Call:
## lm(formula = y ~ x, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9213 -0.4558 -0.0414 0.7094 1.8388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000 1.125 2.67 0.0257 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
lm_d2 <- lm(y ~ x, data = data2)
summary(lm_d2)
##
## Call:
## lm(formula = y ~ x, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.901 -0.761 0.129 0.949 1.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.67 0.0258 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
lm_d3 <- lm(y ~ x, data = data3)
summary(lm_d3)
##
## Call:
## lm(formula = y ~ x, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.159 -0.615 -0.230 0.154 3.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
lm_d4 <- lm(y ~ x, data = data4)
summary(lm_d4)
##
## Call:
## lm(formula = y ~ x, data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
The linear equation for all data sets is as follows:
\[\widehat{y} = \beta_0 + \beta_1 x\\ \widehat{y} = 3.00 + 0.50 x\]
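To see the four sets of estimates side by side, the coefficients can be collected into one table. This sketch assumes the built-in `anscombe` data frame holds the same values as data1 to data4; the object names `fits` and `coefs` are introduced here for illustration only.

```r
# Assumption: `anscombe` (datasets package) holds the same values as data1..data4.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# One row per data set: estimated intercept and slope.
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
round(coefs, 2)
```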
Answer:
options(digits = 2)
# Two methods are used to show the r-squared value for each data set:
# the squared correlation and the r.squared element of the model summary.
format(cor(data1$y, data1$x)^2, nsmall = 2)
## [1] "0.67"
format(summary(lm(y ~ x, data = data1))$r.squared, nsmall = 2)
## [1] "0.67"
format(cor(data2$y, data2$x)^2, nsmall = 2)
## [1] "0.67"
format(summary(lm(y ~ x, data = data2))$r.squared, nsmall = 2)
## [1] "0.67"
format(cor(data3$y, data3$x)^2, nsmall = 2)
## [1] "0.67"
format(summary(lm(y ~ x, data = data3))$r.squared, nsmall = 2)
## [1] "0.67"
format(cor(data4$y, data4$x)^2, nsmall = 2)
## [1] "0.67"
format(summary(lm(y ~ x, data = data4))$r.squared, nsmall = 2)
## [1] "0.67"
Answer:
require(ggplot2)
## Loading required package: ggplot2
ggplot(data1, aes(y = y, x = x)) + geom_point() + geom_smooth(method = lm, fullrange = TRUE) +
labs(x = "X Values", y = "Y Values") + ggtitle("Scatterplot of Data") + theme(plot.title = element_text(hjust = 0.5))
ggplot(data2, aes(y = y, x = x)) + geom_point() + geom_smooth(method = lm, fullrange = TRUE) +
labs(x = "X Values", y = "Y Values") + ggtitle("Scatterplot of Data") + theme(plot.title = element_text(hjust = 0.5))
ggplot(data3, aes(y = y, x = x)) + geom_point() + geom_smooth(method = lm, fullrange = TRUE) +
labs(x = "X Values", y = "Y Values") + ggtitle("Scatterplot of Data") + theme(plot.title = element_text(hjust = 0.5))
ggplot(data4, aes(y = y, x = x)) + geom_point() + geom_smooth(method = lm, fullrange = TRUE) +
labs(x = "X Values", y = "Y Values") + ggtitle("Scatterplot of Data") + theme(plot.title = element_text(hjust = 0.5))
# data1
ggplot(data.frame(res = resid(lm_d1)), aes(x = res)) + geom_histogram(binwidth = 0.5, position = "identity",
aes(y = ..density..)) + stat_function(fun = dnorm, color = "black", args = list(mean = mean(resid(lm_d1)),
sd = sd(resid(lm_d1)))) + labs(x = "Residuals", y = "Density of Residuals") +
ggtitle("Histogram of Residuals") + theme(plot.title = element_text(hjust = 0.5))
qqnorm(lm_d1$residuals)
qqline(lm_d1$residuals)
library(car)
## Warning: package 'car' was built under R version 3.3.3
qqPlot(lm_d1$residuals, envelope = 0.95, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
# data2
ggplot(data.frame(res = resid(lm_d2)), aes(x = res)) + geom_histogram(binwidth = 0.5, position = "identity",
aes(y = ..density..)) + stat_function(fun = dnorm, color = "black", args = list(mean = mean(resid(lm_d2)),
sd = sd(resid(lm_d2)))) + labs(x = "Residuals", y = "Density of Residuals") +
ggtitle("Histogram of Residuals") + theme(plot.title = element_text(hjust = 0.5))
qqnorm(lm_d2$residuals)
qqline(lm_d2$residuals)
qqPlot(lm_d2$residuals, envelope = 0.95, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
# data3
ggplot(data.frame(res = resid(lm_d3)), aes(x = res)) + geom_histogram(binwidth = 0.5, position = "identity",
aes(y = ..density..)) + stat_function(fun = dnorm, color = "black", args = list(mean = mean(resid(lm_d3)),
sd = sd(resid(lm_d3)))) + labs(x = "Residuals", y = "Density of Residuals") +
ggtitle("Histogram of Residuals") + theme(plot.title = element_text(hjust = 0.5))
qqnorm(lm_d3$residuals)
qqline(lm_d3$residuals)
qqPlot(lm_d3$residuals, envelope = 0.95, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
# data4
ggplot(data.frame(res = resid(lm_d4)), aes(x = res)) + geom_histogram(binwidth = 0.5, position = "identity",
aes(y = ..density..)) + stat_function(fun = dnorm, color = "black", args = list(mean = mean(resid(lm_d4)),
sd = sd(resid(lm_d4)))) + labs(x = "Residuals", y = "Density of Residuals") +
ggtitle("Histogram of Residuals") + theme(plot.title = element_text(hjust = 0.5))
qqnorm(lm_d4$residuals)
qqline(lm_d4$residuals)
qqPlot(lm_d4$residuals, envelope = 0.95, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
# Absolute residuals against fitted values, one plot per model
ggplot(data.frame(fit = fitted(lm_d1), abs_res = abs(resid(lm_d1))), aes(x = fit, y = abs_res)) + geom_point() + geom_hline(yintercept = 0) +
labs(x = "Fitted Values", y = "Absolute Value of Residuals") + ggtitle("Scatterplot of Residuals") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(data.frame(fit = fitted(lm_d2), abs_res = abs(resid(lm_d2))), aes(x = fit, y = abs_res)) + geom_point() + geom_hline(yintercept = 0) +
labs(x = "Fitted Values", y = "Absolute Value of Residuals") + ggtitle("Scatterplot of Residuals") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(data.frame(fit = fitted(lm_d3), abs_res = abs(resid(lm_d3))), aes(x = fit, y = abs_res)) + geom_point() + geom_hline(yintercept = 0) +
labs(x = "Fitted Values", y = "Absolute Value of Residuals") + ggtitle("Scatterplot of Residuals") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(data.frame(fit = fitted(lm_d4), abs_res = abs(resid(lm_d4))), aes(x = fit, y = abs_res)) + geom_point() + geom_hline(yintercept = 0) +
labs(x = "Fitted Values", y = "Absolute Value of Residuals") + ggtitle("Scatterplot of Residuals") +
theme(plot.title = element_text(hjust = 0.5))
Given the context of the question provided, we can assume there are no issues concerning sequential observations that may have an underlying structure which should be considered.
Summary:
data1 - Yes, the p-values for each test statistic are below \(\alpha = 0.05\), which means that we can reject the null hypotheses that the intercept and slope are equal to zero. Additionally, we see that the model explains 66.65% of the variability in the response variable (y). Pertaining to the condition checks, there are not many data points in any of the sets provided, so it is difficult to determine whether the normal distribution condition, as well as the other model conditions, has been met; therefore, we use the qqPlot function from the car package to draw a confidence band around the residuals to see if their variation is due to more than just natural occurrence. In this case, there is one point that appears to fall on the line of the confidence band, so if this situation were brought up in practice, I would most likely report the correlation along with a note that specifically outlines this issue as a possible shortcoming.
data2 - No, from the initial inspection we see that the data do not follow a linear pattern. It appears a quadratic equation would produce a better fit. The condition checks above provide further support that the data do not follow a linear pattern and that a linear regression model should not be used, despite the fact that this data set has the same summary statistics (\(\mu\), sd, r-squared, \({ \beta }_{ 0 }\), \({ \beta }_{ 1 }\), etc.) as the data1 set.
data3 - Yes. While there appears to be an outlying point that is affecting the model, we can see that if the outlying point were removed, the resulting line would appear to be similar. However, due to this point, the conditions noted above are not being met, since the residuals are not nearly normal and the variability of the residuals is not nearly constant. Therefore, if we really faced this data in practice, further investigation of the outlying point should be performed. Additionally, the issues with the outlying point would be noted, as explained in the data1 summary.
data4 - No, this data set has a point of high leverage and, unlike the data3 set, the linear model is significantly affected. Note that with the outlying point removed, all remaining points share the same x value (x = 8), so they would lie on a vertical line. The conditions checked above provide further support that a linear model is not a good fit for the data.
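Some of the statements in this summary can also be checked numerically rather than only by eye. The following is a minimal sketch, again assuming that the built-in `anscombe` data frame holds the same values as data1 to data4; the object names are introduced here for illustration and the quadratic model is just one possible alternative.

```r
# Assumption: `anscombe` (datasets package) holds the same values as data1..data4.

# data2: compare the linear fit with a quadratic fit.
lin2 <- lm(y2 ~ x2, data = anscombe)
quad2 <- lm(y2 ~ x2 + I(x2^2), data = anscombe)
c(linear = summary(lin2)$r.squared, quadratic = summary(quad2)$r.squared)

# data3: Cook's distance identifies the most influential observation.
fit3 <- lm(y3 ~ x3, data = anscombe)
worst <- unname(which.max(cooks.distance(fit3)))
worst

# data3: refit without that observation to see how much the line moves.
fit3_trimmed <- lm(y3 ~ x3, data = anscombe[-worst, ])
rbind(all_points = coef(fit3), without_outlier = coef(fit3_trimmed))

# data4: leverage (hat values) depends only on the x values.
fit4 <- lm(y4 ~ x4, data = anscombe)
round(hatvalues(fit4), 2)
```

A hat value close to 1 means the fitted line is forced to pass almost exactly through that observation, which is the sense in which a single point can determine the slope.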
Answer:
It is important to include appropriate visualizations when analyzing data because we must use these representations to make judgements about whether the necessary conditions are being met. It is not enough to simply look at the resulting values in order to determine whether a model is a good fit. For example, in the previous problem, we see that all four data sets have the same summary statistics (\(\mu\), sd, r-squared, \({ \beta }_{ 0 }\), \({ \beta }_{ 1 }\), etc.); however, they are all very different from one another. I have created visualizations in the previous problem to show how they should be appropriately used.