Grando - Final

Part I

Question 1 - A student is gathering data on the driving experiences of other college students. A description of the data collected is presented below. Which of the variables are quantitative and discrete?

Answer:

  1. daysDrive, gasMonth

daysDrive - Since this variable is indicated to be a count of days, it can only take on non-negative integer values for the number of days the student drives.

gasMonth - This value is indicated to be a cost, which is only recorded to the nearest $0.01. Therefore, there is not an infinite number of possible values within a given range. Also note that if this value were recorded in cents, it would take on only non-negative integers, just like daysDrive.

Question 2 - A histogram of the GPAs of 132 students from the Fall 2012 class of this course is presented below. Which estimates of the mean and median are most plausible?

Answer:

Since the data appear to have a left skew, we can immediately eliminate any option stating that the mean is greater than the median (b, d).

When we evaluate the histogram, adding the bin percentages from right to left shows that approximately 41% of the data lies above 3.5; therefore the median must be 3.5 or below (a sketch of this cumulative check follows the answer). This leaves one option:

  1. mean = 3.3, median = 3.5
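
That right-to-left accumulation can be sketched in R with hypothetical bin percentages (the heights below are assumptions chosen only for illustration, not values read from the actual histogram):

# Hypothetical percentages for the two bins above a GPA of 3.5; the exact
# values are assumptions chosen only to illustrate the cumulative check.
upper_bins <- c("3.75-4.0" = 20, "3.5-3.75" = 21)
sum(upper_bins)  # ~41% of students sit above 3.5, so the median is at or below 3.5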

Question 4 - A study is designed to test whether there is a relationship between natural hair color (brunette, blond, red) and eye color (blue, green, brown). If a large \(\chi^2\) test statistic is obtained, this suggests that:

Answer:

  1. there is an association between natural hair color and eye color.
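
A sketch of the mechanics behind this conclusion using chisq.test (the counts in the table are purely hypothetical, invented for illustration):

# Hypothetical hair/eye color counts; the numbers are invented for illustration.
tbl <- matrix(c(30, 10, 40,
                20, 25, 10,
                10, 15, 20), nrow = 3, byrow = TRUE,
              dimnames = list(hair = c("brunette", "blond", "red"),
                              eye = c("blue", "green", "brown")))
chisq.test(tbl)  # a large X-squared statistic (small p-value) suggests an association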

Question 5 - A researcher studying how monkeys remember is interested in examining the distribution of the score on a standard memory task. The researcher wants to produce a boxplot to examine this distribution. Below are summary statistics from the memory task. What values should the researcher use to determine if a particular score is a potential outlier in the boxplot?

Answer:

To determine possible outliers, the researcher would examine points that fall beyond the whisker limits of the boxplot, which extend 1.5 * IQR below the first quartile and 1.5 * IQR above the third quartile.

Q1 <- 37
Q3 <- 49.8
IQR <- Q3 - Q1  # interquartile range
(Lower_whisker <- Q1 - 1.5 * IQR)
## [1] 17.8
(Upper_whisker <- Q3 + 1.5 * IQR)
## [1] 69
  1. 17.8 and 69.0

Question 6 - The [blank] are resistant to outliers, whereas the [blank] are not.

Answer:

  1. median and interquartile range; mean and standard deviation
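
A tiny demonstration with arbitrary made-up values:

x <- c(2, 3, 4, 5, 6)
x_out <- c(2, 3, 4, 5, 60)   # the same data with one extreme value
c(mean(x), mean(x_out))      # 4.0 vs 14.8: the mean shifts dramatically
c(sd(x), sd(x_out))          # 1.58 vs 25.29: so does the standard deviation
c(median(x), median(x_out))  # 4 vs 4: the median does not move
c(IQR(x), IQR(x_out))        # 2 vs 2: neither does the interquartile range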

Question 7 - Figure A below represents the distribution of an observed variable. Figure B below represents the distribution of the mean from 500 random samples of size 30 from A. The mean of A is 5.05 and the mean of B is 5.04. The standard deviations of A and B are 3.22 and 0.58, respectively.

a. Describe the two distributions (2 pts).

Answer:

The observations (Figure A) are unimodal with a significant right skew. The sampling distribution (Figure B) is unimodal and appears roughly normal, centered around the mean. However, the overlaid normal curve appears somewhat wider than the data.

b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

Answer:

Figure B is a representation of the sample means taken from Figure A. It is therefore a summary of point estimates; if each sample consists of enough independent observations (usually at least 30) and the data are not strongly skewed, the distribution of the sample mean is well approximated by a normal model centered around the population mean.

The standard deviations differ between Figures A and B because Figure B is a distribution of point estimates, each based on a sample of size 30. Its standard deviation is therefore the standard error: the typical uncertainty associated with a point estimate from a sample of that size. Figure A is the distribution of the actual variable, so its standard deviation roughly describes how far the typical observation falls from the mean.

c. What is the statistical principle that describes this phenomenon (2 pts)?

Answer:

The statistical principle that describes this phenomenon is the Central Limit Theorem (CLT). It can be stated as follows:

\[ \lim_{n \rightarrow \infty} P\left( \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \le z \right) = \Phi(z) \]

where \(\Phi\) is the cumulative distribution function (cdf) of the standard normal distribution.

In other words, the distribution of a sample mean taken from a population is well approximated by a normal model:

\[\bar{x} \sim N\left(\text{mean} = \mu,\ \text{SE} = \frac{\sigma}{\sqrt{n}}\right)\]
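
This can also be sketched with a short simulation. The population below is an assumption: a right-skewed gamma chosen only to mimic Figure A (mean 5, sd about 3.16), since the true underlying variable is not given:

set.seed(606)
# Assumed right-skewed population with mean 5 and sd ~ 3.16 (an assumption
# chosen to mimic Figure A; the actual variable behind Figure A is unknown).
population <- rgamma(1e5, shape = 2.5, rate = 0.5)
sample_means <- replicate(500, mean(sample(population, size = 30)))
mean(sample_means)  # should land near 5, matching the means of A and B
sd(sample_means)    # should land near 3.16 / sqrt(30), i.e. about 0.58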

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits = 2)
data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5), y = c(8.04, 6.95, 7.58, 8.81, 
    8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68))
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5), y = c(9.14, 8.14, 8.74, 8.77, 
    9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74))
data3 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5), y = c(7.46, 6.77, 12.74, 7.11, 
    7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73))
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8), y = c(6.58, 5.76, 7.71, 8.84, 
    8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

Answer:

options(digits = 2)
format(mean(data1$x), nsmall = 2)
## [1] "9.00"
format(mean(data1$y), nsmall = 2)
## [1] "7.50"
format(mean(data2$x), nsmall = 2)
## [1] "9.00"
format(mean(data2$y), nsmall = 2)
## [1] "7.50"
format(mean(data3$x), nsmall = 2)
## [1] "9.00"
format(mean(data3$y), nsmall = 2)
## [1] "7.50"
format(mean(data4$x), nsmall = 2)
## [1] "9.00"
format(mean(data4$y), nsmall = 2)
## [1] "7.50"

b. The median (for x and y separately; 1 pt).

Answer:

options(digits = 2)
format(median(data1$x), nsmall = 2)
## [1] "9.00"
format(median(data1$y), nsmall = 2)
## [1] "7.58"
format(median(data2$x), nsmall = 2)
## [1] "9.00"
format(median(data2$y), nsmall = 2)
## [1] "8.14"
format(median(data3$x), nsmall = 2)
## [1] "9.00"
format(median(data3$y), nsmall = 2)
## [1] "7.11"
format(median(data4$x), nsmall = 2)
## [1] "8.00"
format(median(data4$y), nsmall = 2)
## [1] "7.04"

c. The standard deviation (for x and y separately; 1 pt).

Answer:

options(digits = 2)
format(sd(data1$x), nsmall = 2)
## [1] "3.32"
format(sd(data1$y), nsmall = 2)
## [1] "2.03"
format(sd(data2$x), nsmall = 2)
## [1] "3.32"
format(sd(data2$y), nsmall = 2)
## [1] "2.03"
format(sd(data3$x), nsmall = 2)
## [1] "3.32"
format(sd(data3$y), nsmall = 2)
## [1] "2.03"
format(sd(data4$x), nsmall = 2)
## [1] "3.32"
format(sd(data4$y), nsmall = 2)
## [1] "2.03"

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

Answer:

options(digits = 2)
format(cor(data1$x, data1$y), nsmall = 2)
## [1] "0.82"
format(cor(data2$x, data2$y), nsmall = 2)
## [1] "0.82"
format(cor(data3$x, data3$y), nsmall = 2)
## [1] "0.82"
format(cor(data4$x, data4$y), nsmall = 2)
## [1] "0.82"

e. Linear regression equation (2 pts).

Answer:

lm_d1 <- lm(y ~ x, data = data1)
summary(lm_d1)
## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
lm_d2 <- lm(y ~ x, data = data2)
summary(lm_d2)
## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
lm_d3 <- lm(y ~ x, data = data3)
summary(lm_d3)
## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
lm_d4 <- lm(y ~ x, data = data4)
summary(lm_d4)
## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

The fitted linear equation is the same for all four data sets:

\[\widehat{y} = \beta_0 + \beta_1 x = 3.00 + 0.50\,x\]
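
As a quick sanity check of the shared equation, the fitted value at x = 10 should be 3.00 + 0.50 * 10 = 8.00 for any of the four models:

predict(lm_d1, newdata = data.frame(x = 10))  # 3.00 + 0.50 * 10 = 8.00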

f. R-Squared (2 pts).

Answer:

options(digits = 2)

# two methods are used to show the r-squared value for each data set.
format(cor(data1$y, data1$x)^2, nsmall = 2)
## [1] "0.67"
format(summary(lm(y ~ x, data = data1))$r.squared, nsmall = 2)
## [1] "0.67"
format(cor(data2$y, data2$x)^2, nsmall = 2)
## [1] "0.67"
format(summary(lm(y ~ x, data = data2))$r.squared, nsmall = 2)
## [1] "0.67"
format(cor(data3$y, data3$x)^2, nsmall = 2)
## [1] "0.67"
format(summary(lm(y ~ x, data = data3))$r.squared, nsmall = 2)
## [1] "0.67"
format(cor(data4$y, data4$x)^2, nsmall = 2)
## [1] "0.67"
format(summary(lm(y ~ x, data = data4))$r.squared, nsmall = 2)
## [1] "0.67"

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Answer:

  1. First, we should visually inspect the data to ensure it follows a linear pattern.
require(ggplot2)
## Loading required package: ggplot2
ggplot(data1, aes(y = y, x = x)) + geom_point() + geom_smooth(method = lm, fullrange = TRUE) + 
    labs(x = "X Values", y = "Y Values") + ggtitle("Scatterplot of Data") + theme(plot.title = element_text(hjust = 0.5))

ggplot(data2, aes(y = y, x = x)) + geom_point() + geom_smooth(method = lm, fullrange = TRUE) + 
    labs(x = "X Values", y = "Y Values") + ggtitle("Scatterplot of Data") + theme(plot.title = element_text(hjust = 0.5))

ggplot(data3, aes(y = y, x = x)) + geom_point() + geom_smooth(method = lm, fullrange = TRUE) + 
    labs(x = "X Values", y = "Y Values") + ggtitle("Scatterplot of Data") + theme(plot.title = element_text(hjust = 0.5))

ggplot(data4, aes(y = y, x = x)) + geom_point() + geom_smooth(method = lm, fullrange = TRUE) + 
    labs(x = "X Values", y = "Y Values") + ggtitle("Scatterplot of Data") + theme(plot.title = element_text(hjust = 0.5))

  2. The residuals of the model are nearly normal.
# data1
ggplot(data = lm_d1, aes(x = resid(lm_d1))) + geom_histogram(binwidth = 0.5, position = "identity", 
    aes(y = ..density..)) + stat_function(fun = dnorm, color = "black", args = list(mean = mean(lm_d1$residuals), 
    sd = sd(lm_d1$residuals))) + labs(x = "Residuals", y = "Density of Residuals") + 
    ggtitle("Histogram of Residuals") + theme(plot.title = element_text(hjust = 0.5))

qqnorm(lm_d1$residuals)
qqline(lm_d1$residuals)

library(car)
## Warning: package 'car' was built under R version 3.3.3
qqPlot(lm_d1$residuals, envelope = 0.95, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# data2
ggplot(data = lm_d2, aes(x = resid(lm_d2))) + geom_histogram(binwidth = 0.5, position = "identity", 
    aes(y = ..density..)) + stat_function(fun = dnorm, color = "black", args = list(mean = mean(lm_d2$residuals), 
    sd = sd(lm_d2$residuals))) + labs(x = "Residuals", y = "Density of Residuals") + 
    ggtitle("Histogram of Residuals") + theme(plot.title = element_text(hjust = 0.5))

qqnorm(lm_d2$residuals)
qqline(lm_d2$residuals)

qqPlot(lm_d2$residuals, envelope = 0.95, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# data3
ggplot(data = lm_d3, aes(x = resid(lm_d3))) + geom_histogram(binwidth = 0.5, position = "identity", 
    aes(y = ..density..)) + stat_function(fun = dnorm, color = "black", args = list(mean = mean(lm_d3$residuals), 
    sd = sd(lm_d3$residuals))) + labs(x = "Residuals", y = "Density of Residuals") + 
    ggtitle("Histogram of Residuals") + theme(plot.title = element_text(hjust = 0.5))

qqnorm(lm_d3$residuals)
qqline(lm_d3$residuals)

qqPlot(lm_d3$residuals, envelope = 0.95, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# data4
ggplot(data = lm_d4, aes(x = resid(lm_d4))) + geom_histogram(binwidth = 0.5, position = "identity", 
    aes(y = ..density..)) + stat_function(fun = dnorm, color = "black", args = list(mean = mean(lm_d4$residuals), 
    sd = sd(lm_d4$residuals))) + labs(x = "Residuals", y = "Density of Residuals") + 
    ggtitle("Histogram of Residuals") + theme(plot.title = element_text(hjust = 0.5))

qqnorm(lm_d4$residuals)
qqline(lm_d4$residuals)

qqPlot(lm_d4$residuals, envelope = 0.95, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

  3. The variability of the residuals is nearly constant.
ggplot(lm_d1, aes(y = abs(resid(lm_d1)), x = lm_d1$fitted.values)) + geom_point() + geom_hline(yintercept = 0) + 
    labs(x = "Fitted Values", y = "Absolute Value of Residuals") + ggtitle("Scatterplot of Residuals") + 
    theme(plot.title = element_text(hjust = 0.5))

ggplot(lm_d2, aes(y = abs(resid(lm_d2)), x = lm_d2$fitted.values)) + geom_point() + geom_hline(yintercept = 0) + 
    labs(x = "Fitted Values", y = "Absolute Value of Residuals") + ggtitle("Scatterplot of Residuals") + 
    theme(plot.title = element_text(hjust = 0.5))

ggplot(lm_d3, aes(y = abs(resid(lm_d3)), x = lm_d3$fitted.values)) + geom_point() + geom_hline(yintercept = 0) + 
    labs(x = "Fitted Values", y = "Absolute Value of Residuals") + ggtitle("Scatterplot of Residuals") + 
    theme(plot.title = element_text(hjust = 0.5))

ggplot(lm_d4, aes(y = abs(resid(lm_d4)), x = lm_d4$fitted.values)) + geom_point() + geom_hline(yintercept = 0) + 
    labs(x = "Fitted Values", y = "Absolute Value of Residuals") + ggtitle("Scatterplot of Residuals") + 
    theme(plot.title = element_text(hjust = 0.5))

  4. The residuals are independent.

Given the context of the question provided, we can assume there are no issues concerning sequential observations with an underlying structure that would need to be considered.
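
If a formal check were wanted, the car package (already loaded above) provides a Durbin-Watson test for autocorrelated residuals; note that interpreting it assumes the row order reflects a real sequence, which is likely not the case here:

# Durbin-Watson test from the car package; only meaningful if the row
# order of the data reflects a genuine sequence (an assumption here).
durbinWatsonTest(lm_d1)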

Summary:

data1 - Yes. The p-values for both test statistics fall below \(\alpha = 0.05\), which means we can reject the null hypotheses that the intercept and slope are equal to zero. Additionally, the model explains 66.65% of the variability in the response variable (y). As for the condition checks, none of these data sets contain many points, so it is difficult to determine whether the nearly normal residual condition, as well as the other model conditions, has been met. We therefore use the qqPlot function from the car package to draw a confidence band around the residuals and see whether their variation is due to more than natural occurrence. In this case, one point appears to fall on the edge of the confidence band, so if this situation came up in practice, I would most likely report the correlation along with a note that specifically outlines this issue as a possible shortcoming.

data2 - No. The initial inspection shows that the data do not follow a linear pattern; a quadratic fit appears more appropriate. The condition checks provide further support that the data are not linear and that a linear regression model should not be used, despite this data set having the same summary statistics (\(\mu\), sd, r-squared, \({ \beta }_{ 0 }\), \({ \beta }_{ 1 }\), etc.) as data1.

data3 - Yes, although there is an outlying point that is influencing the model. If that point were removed, the resulting line would look similar. However, because of this point, the conditions noted above are not met: the residuals are not nearly normal and their variability is not nearly constant. If we really faced this data in practice, the outlying point should be investigated further, and the issues it causes would be noted, as explained in the data1 summary.

data4 - No. This data set has a point of high leverage, and unlike data3, the linear model is significantly affected by it. Note that with the outlying point removed, every remaining x value is identical (the points fall on the vertical line x = 8), so a slope could not even be estimated. The conditions checked above provide further support that a linear model is not a good fit for this data.

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

Answer:

It is important to include appropriate visualizations when analyzing data because we must use these representations to judge whether the necessary conditions are met. It is not enough to look at the resulting summary values alone to determine whether a model is a good fit. For example, in the previous problem, all four data sets have the same summary statistics (\(\mu\), sd, r-squared, \({ \beta }_{ 0 }\), \({ \beta }_{ 1 }\), etc.), yet they are all very different from one another. The visualizations created in the previous problem show how they should be appropriately used.