Part 1

  1. A student is gathering data on the driving experiences of other college students. A description of the collected data is presented below. Which of the variables are quantitative and discrete?

Out of all the choices, only B. daysDrive is both quantitative and discrete. Although daysDrive also appears in answers D and E, those options are incorrect because gasMonth is quantitative but continuous, and car is categorical rather than quantitative.

  1. A histogram of the GPA of 132 students in the Fall 2012 class of this course is presented below. Which estimates of the mean and median are most plausible?

A. mean = 3.3, median = 3.5

  1. A researcher wants to determine if a new treatment is effective for reducing Ebola related fever. What type of study should be conducted in order to establish that the treatment does indeed cause improvement in Ebola patients?

This should be a double-blind, randomized, controlled, prospective study. The principal investigator needs to minimize confounders (so that the groups are comparable) and randomize the patients to either the intervention or the placebo. The researcher should attempt to track all participants over the duration of the study period to minimize bias from loss to follow-up. Other designs, such as observational studies, may establish correlation but cannot establish causation.

The closest answer here is: A. Randomly assign Ebola patients to one of two groups, either the treatment or placebo group, and then compare the fever of the two groups.
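
As a minimal sketch of the random-assignment step only (the patient IDs and group sizes below are hypothetical, not from the question):

# Hypothetical example: randomly assign 20 patients to treatment or placebo
set.seed(1)                                       # for reproducibility
patients <- paste0("patient_", 1:20)              # hypothetical patient IDs
groups <- sample(rep(c("treatment", "placebo"), each = 10))  # shuffle group labels
assignment <- data.frame(patient = patients, group = groups)
head(assignment)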

  1. A study is designed to test whether there is a relationship between natural hair color (brunette, blond, red) and eye color (blue, green, brown). If a large χ2 test statistic is obtained, this suggests that:

A large chi-square statistic (provided the corresponding p-value is below 0.05) suggests that we reject the null hypothesis that hair color and eye color are independent. In this case, the best answer is: C. There is an association between natural hair color and eye color.
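
As an illustration only, using R's built-in HairEyeColor dataset rather than the study in the question, a chi-square test of independence between hair and eye color could be run as follows:

# Collapse the built-in HairEyeColor table over Sex to get a Hair x Eye table
hair.eye <- margin.table(HairEyeColor, margin = c(1, 2))
chisq.test(hair.eye)  # a large X-squared with a small p-value suggests an association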

  1. A researcher studying how monkeys remember is interested in examining the distribution of the score on a standard memory task. The researcher wants to produce a boxplot to examine this distribution. Below are summary statistics from the memory task. What values should the researcher use to determine if a particular score is a potential outlier in the boxplot?

If the point is below Quartile1 - 1.5 * IQR or above Quartile3 + 1.5 * IQR, then it is considered an outlier in a boxplot. The answer here is: B. 17.8 and 69.0.
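
A minimal sketch of the fence calculation (the quartile values below are placeholders chosen only so that the arithmetic reproduces the answer's fences; the actual quartiles come from the summary table in the question):

# Placeholder quartiles (hypothetical)
q1 <- 37.0
q3 <- 49.8
iqr <- q3 - q1                 # interquartile range
lower.fence <- q1 - 1.5 * iqr  # scores below this are potential outliers
upper.fence <- q3 + 1.5 * iqr  # scores above this are potential outliers
c(lower.fence, upper.fence)    # 17.8 and 69.0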

  1. The “median and interquartile range” are resistant to outliers, whereas the “mean and standard deviation” are not. (Answer: D)

  2. Figure A below represents the distribution of an observed variable. Figure B below represents the distribution of the mean from 500 random samples of size 30 from A. The mean of A is 5.05 and the mean of B is 5.04. The standard deviations of A and B are 3.22 and 0.58, respectively.

  1. Describe the two distributions.

Figure A shows a distribution with a fairly moderate right skew and low kurtosis, with a mean of 5.05 and a standard deviation of 3.22. Figure B, in contrast, shows an approximately normal distribution with higher kurtosis; its mean is 5.04 and its standard deviation is 0.58.

  1. Explain why the means of these two distributions are similar but the standard deviations are not.

Figure A is a distribution of an observed variable, whereas Figure B is a distribution of the mean from 500 random samples of size 30 from A. The standard deviations differ because they measure different things. The standard deviation of A is computed with the usual SD formula for individual observations. For Figure B, however, the spread is described by the standard error of the mean (not the standard deviation of individual observations), because this is a distribution of sample means. The standard error formula is SD/sqrt(30) = 3.22/sqrt(30) ≈ 0.59, which is consistent with the observed 0.58.
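
The arithmetic, using the values stated above:

sd.A <- 3.22          # standard deviation of the observed variable (Figure A)
n <- 30               # size of each of the 500 random samples
se <- sd.A / sqrt(n)  # standard error of the mean
round(se, 2)          # approximately 0.59, consistent with the observed 0.58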

  1. What is the statistical principle that describes this phenomenon?

This is called the “Central Limit Theorem”. It states that, given a sufficiently large sample size, the sampling distribution of the sample mean will be approximately normal.
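
A small simulation sketch of this idea (the right-skewed parent distribution below is an arbitrary stand-in for Figure A, since its exact form is not given):

set.seed(42)
population <- rgamma(10000, shape = 2.5, rate = 0.5)  # stand-in skewed population, mean 5, sd ~3.2
sample.means <- replicate(500, mean(sample(population, size = 30)))  # 500 sample means of size 30
hist(sample.means)                   # approximately normal, per the Central Limit Theorem
c(sd(population), sd(sample.means))  # the spread of the means is roughly sd(population)/sqrt(30)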

Part 2

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)

data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68)) 

data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74)) 

data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73)) 

data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

  1. The mean (for x and y separately).
# data1

mean.1.x <- round(mean(data1$x), 2)
mean.1.y <- round(mean(data1$y), 2)

# data2

mean.2.x <- round(mean(data2$x), 2)
mean.2.y <- round(mean(data2$y), 2)

# data3

mean.3.x <- round(mean(data3$x), 2)
mean.3.y <- round(mean(data3$y), 2)

# data4

mean.4.x <- round(mean(data4$x), 2)
mean.4.y <- round(mean(data4$y), 2)

paste0("Data1 - Mean(x):", mean.1.x)
## [1] "Data1 - Mean(x):9"
paste0("Data1 - Mean(y):", mean.1.y)
## [1] "Data1 - Mean(y):7.5"
paste0("Data2 - Mean(x):", mean.2.x)
## [1] "Data2 - Mean(x):9"
paste0("Data2 - Mean(y):", mean.2.y)
## [1] "Data2 - Mean(y):7.5"
paste0("Data3 - Mean(x):", mean.3.x)
## [1] "Data3 - Mean(x):9"
paste0("Data3 - Mean(y):", mean.3.y)
## [1] "Data3 - Mean(y):7.5"
paste0("Data4 - Mean(x):", mean.4.x)
## [1] "Data4 - Mean(x):9"
paste0("Data4 - Mean(y):", mean.4.y)
## [1] "Data4 - Mean(y):7.5"
  1. The median (for x and y separately).
# data1

median.1.x <- round(median(data1$x), 2)
median.1.y <- round(median(data1$y), 2)

# data2

median.2.x <- round(median(data2$x), 2)
median.2.y <- round(median(data2$y), 2)

# data3

median.3.x <- round(median(data3$x), 2)
median.3.y <- round(median(data3$y), 2)

# data4

median.4.x <- round(median(data4$x), 2)
median.4.y <- round(median(data4$y), 2)

paste0("Data1 - Median(x):", median.1.x)
## [1] "Data1 - Median(x):9"
paste0("Data1 - Median(y):", median.1.y)
## [1] "Data1 - Median(y):7.58"
paste0("Data2 - Median(x):", median.2.x)
## [1] "Data2 - Median(x):9"
paste0("Data2 - Median(y):", median.2.y)
## [1] "Data2 - Median(y):8.14"
paste0("Data3 - Median(x):", median.3.x)
## [1] "Data3 - Median(x):9"
paste0("Data3 - Median(y):", median.3.y)
## [1] "Data3 - Median(y):7.11"
paste0("Data4 - Median(x):", median.4.x)
## [1] "Data4 - Median(x):8"
paste0("Data4 - Median(y):", median.4.y)
## [1] "Data4 - Median(y):7.04"
  1. The standard deviation (for x and y separately).
# data1
sd.1.x <- round(sd(data1$x), 2)
sd.1.y <- round(sd(data1$y), 2)

# data2
sd.2.x <- round(sd(data2$x), 2)
sd.2.y <- round(sd(data2$y), 2)

# data3
sd.3.x <- round(sd(data3$x), 2)
sd.3.y <- round(sd(data3$y), 2)

# data4
sd.4.x <- round(sd(data4$x), 2)
sd.4.y <- round(sd(data4$y), 2)

paste0("Data1 - SD(x):", sd.1.x)
## [1] "Data1 - SD(x):3.32"
paste0("Data1 - SD(y):", sd.1.y)
## [1] "Data1 - SD(y):2.03"
paste0("Data2 - SD(x):", sd.2.x)
## [1] "Data2 - SD(x):3.32"
paste0("Data2 - SD(y):", sd.2.y)
## [1] "Data2 - SD(y):2.03"
paste0("Data3 - SD(x):", sd.3.x)
## [1] "Data3 - SD(x):3.32"
paste0("Data3 - SD(y):", sd.3.y)
## [1] "Data3 - SD(y):2.03"
paste0("Data4 - SD(x):", sd.4.x)
## [1] "Data4 - SD(x):3.32"
paste0("Data4 - SD(y):", sd.4.y)
## [1] "Data4 - SD(y):2.03"

For each x and y pair, calculate (also to two decimal places):

  1. The correlation.
# Correlation using the cor() function
data1.cor <- round(cor(x = data1$x, y = data1$y), 2)
data2.cor <- round(cor(x = data2$x, y = data2$y), 2)
data3.cor <- round(cor(x = data3$x, y = data3$y), 2)
data4.cor <- round(cor(x = data4$x, y = data4$y), 2)

paste0("Data1 Correlation: ", data1.cor)
## [1] "Data1 Correlation: 0.82"
paste0("Data2 Correlation: ", data2.cor)
## [1] "Data2 Correlation: 0.82"
paste0("Data3 Correlation: ", data3.cor)
## [1] "Data3 Correlation: 0.82"
paste0("Data4 Correlation: ", data4.cor)
## [1] "Data4 Correlation: 0.82"
  1. Linear regression equation
lm1 <- lm(y ~ x, data = data1)
summary(lm1)
## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
lm2 <- lm(y ~ x, data = data2)
summary(lm2)
## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
lm3 <- lm(y ~ x, data = data3)
summary(lm3)
## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
lm4 <- lm(y ~ x, data = data4)
summary(lm4)
## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

Data1: y(hat) = 3.00 + 0.50 * x

Data2: y(hat) = 3.00 + 0.50 * x

Data3: y(hat) = 3.00 + 0.50 * x

Data4: y(hat) = 3.00 + 0.50 * x
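
As a quick sanity check of the fitted equation, the predicted value at x = 10 should be roughly 3.00 + 0.50 * 10 = 8.00:

predict(lm1, newdata = data.frame(x = 10))  # approximately 8; the same holds for lm2, lm3, and lm4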

  1. R-squared. We obtain these values from the summary(lm) output above in question e (the multiple R-squared, with the adjusted R-squared in parentheses).

Data1 R-squared: 0.67 (adjusted: 0.63)

Data2 R-squared: 0.67 (adjusted: 0.63)

Data3 R-squared: 0.67 (adjusted: 0.63)

Data4 R-squared: 0.67 (adjusted: 0.63)
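
These values can also be extracted directly from the summary object rather than read off the printout, for example:

round(summary(lm1)$r.squared, 2)      # multiple R-squared for data1
round(summary(lm1)$adj.r.squared, 2)  # adjusted R-squared for data1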

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots!

In order for a pair to be appropriate for a linear regression model, it must satisfy these conditions.

  1. Linearity
  2. Nearly normal residuals
  3. Constant variability
  4. Independent observations

We will go through each pair of data points in order and evaluate whether a linear regression is applicable.
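
In addition to the individual plots below, base R's plot() method for lm objects produces a standard set of diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage) that cover most of these checks, for example:

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(lm1)             # the same can be done for lm2, lm3, and lm4
par(mfrow = c(1, 1))  # reset the plotting layout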

# Data1

#1. Linearity
plot(x = data1$x, y = data1$y)

# There does appear to be a linear component to this plot.

#2. Nearly normal residuals
hist(lm1$residuals, breaks = 10)

qqnorm(lm1$residuals)
qqline(lm1$residuals)

# This one is questionable given the small sample size. The qqnorm() plot also shows that the points seem to deviate from the qqline(). With a larger sample size this condition could plausibly hold.

#3 Constant variability
plot(lm1$residuals ~ data1$x)
abline(h = 0, lty = 3)

# On visual inspection, there does appear to be constant variability

#4 Independent Observations
# Independence is assumed, since no further details are provided about how the data were collected. It does not appear that choosing one point influenced the choice of another.

Data 1 does NOT appear to satisfy the criteria for a linear regression model due to violation of the second condition (nearly normal residuals).
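
Since the visual call on residual normality is borderline at this sample size, the Q-Q plot could be supplemented with a formal test such as Shapiro-Wilk (not required by the assignment; shown only as a sketch):

shapiro.test(lm1$residuals)  # a large p-value would be consistent with nearly normal residuals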

# Data2

#1. Linearity
plot(x = data2$x, y = data2$y)

# There does NOT appear to be a linear relationship here; the plot shows a clear curved pattern.

#2. Nearly normal residuals
hist(lm2$residuals, breaks = 10)

qqnorm(lm2$residuals)
qqline(lm2$residuals)

# This also violates the criterion; the residual distribution does NOT look normal.

#3 Constant variability
plot(lm2$residuals ~ data2$x)
abline(h = 0, lty = 3)

# On visual inspection, there does NOT appear to be constant variability; there is a clear pattern in the residuals. It fails criterion 3.

#4 Independent Observations
# Independence is assumed, since no further details are provided about how the data were collected. It does not appear that choosing one point influenced the choice of another.

Data 2 is NOT appropriate for a linear regression model, as it violates condition 1 (linearity), condition 2 (nearly normal residuals), and condition 3 (constant variability).

# Data3

#1. Linearity
plot(x = data3$x, y = data3$y)

# There does appear to be a linear component to this plot.

#2. Nearly normal residuals
hist(lm3$residuals, breaks = 10)

qqnorm(lm3$residuals)
qqline(lm3$residuals)

# It is questionable whether the residuals are nearly normal. On the qqnorm() plot, aside from one outlier, the residuals appear to lie on the line. The histogram is hard to read with so few observations, so this check again suffers from the small sample size.

#3 Constant variability
plot(lm3$residuals ~ data3$x)
abline(h = 0, lty = 3)

# On visual inspection there appears to be a problem: there is an obvious pattern in the residuals, so this criterion fails.

#4 Independent Observations
# Independence is assumed, since no further details are provided about how the data were collected. It does not appear that choosing one point influenced the choice of another.

Data 3 fails condition (3) and likely condition (2) as well, so it is inappropriate to fit a linear regression model here.

# Data4

#1. Linearity
plot(x = data4$x, y = data4$y)

# There is NO linear relationship here, so this fails the criterion.

#2. Nearly normal residuals
hist(lm4$residuals, breaks = 10)

qqnorm(lm4$residuals)
qqline(lm4$residuals)

# The residuals appear to be nearly normal.

#3 Constant variability
plot(lm4$residuals ~ data4$x)
abline(h = 0, lty = 3)

# On visual inspection there appears to be a problem: there is an obvious pattern in the residuals, so this criterion fails as well.

#4 Independent Observations
# Independence is assumed, since no further details are provided about how the data were collected. It does not appear that choosing one point influenced the choice of another.

Data 4 fails conditions (1) and (3).

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create.

It is important to visualize data because visualization can provide insight that summary numbers alone cannot. For example, examining the residuals can reveal an underlying pattern that is not apparent from the numbers directly. When we calculated the mean and standard deviation, all of the values were the same across the four datasets, and the medians were similar to each other. The lm() function also suggested the same linear regression model for each dataset. However, each dataset has its own distinct characteristics that only become obvious once it is visualized. Visualization lets us easily see patterns that were not apparent before. Please see above for the included visualizations.
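
As one concrete example, the four scatterplots with their (essentially identical) fitted lines can be drawn side by side; a sketch of such a figure:

# Plot all four datasets with their fitted regression lines in a 2 x 2 panel
par(mfrow = c(2, 2))
plot(y ~ x, data = data1, main = "data1"); abline(lm1)
plot(y ~ x, data = data2, main = "data2"); abline(lm2)
plot(y ~ x, data = data3, main = "data3"); abline(lm3)
plot(y ~ x, data = data4, main = "data4"); abline(lm4)
par(mfrow = c(1, 1))  # reset the plotting layout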