DATA 606 Spring 2017

Part I

A student is gathering data on the driving experiences of other college students. A description of the data is presented below. Which of the variables are quantitative and discrete?

car	1 = compact, 2 = standard size, 3 = minivan, 4 = SUV, and 5 = truck
color	red, blue, green, black, white
daysDrive	number of days per week the student drives
gasMonth	the amount of money the student spends on gas per month

a. car
b. daysDrive
c. daysDrive, car
d. daysDrive, gasMonth
e. car, daysDrive, gasMonth

ANS: B

A histogram of the GPA of 132 students from the Fall 2012 class of this course is presented below. Which estimates of the mean and median are most plausible?

a. mean = 3.3, median = 3.5
b. mean = 3.5, median = 3.3
c. mean = 2.9, median = 3.8
d. mean = 3.8, median = 2.9
e. mean = 2.5, median = 3.8

ANS: C

A researcher wants to determine if a new treatment is effective for reducing Ebola-related fever. What type of study should be conducted in order to establish that the treatment does indeed cause improvement in Ebola patients?

a. Randomly assign Ebola patients to one of two groups, either the treatment or placebo group, and then compare the fever of the two groups.
b. Identify Ebola patients who received the new treatment and those who did not, and then compare the fever of those two groups.
c. Identify clusters of villages and then stratify them by gender and compare the fevers of male and female groups.
d. Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regards to fever in Ebola patients.

ANS: A

A study is designed to test whether there is a relationship between natural hair color (brunette, blond, red) and eye color (blue, green, brown). If a large χ² test statistic is obtained, this suggests that:

a. there is a difference between average eye color and average hair color.
b. a person’s hair color is determined by his or her eye color.
c. there is an association between natural hair color and eye color.
d. eye color and natural hair color are independent.

ANS: C

A researcher studying how monkeys remember is interested in examining the distribution of the score on a standard memory task. The researcher wants to produce a boxplot to examine this distribution. Below are summary statistics from the memory task. What values should the researcher use to determine if a particular score is a potential outlier in the boxplot?

min	Q1	median	Q3	max	mean	sd	n
26	37	45	49.8	65	44.4	8.4	50

a. 37.0 and 49.8
b. 17.8 and 69.0
c. 36.0 and 52.8
d. 26.0 and 50.0
e. 19.2 and 69.9

ANS: B
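The arithmetic behind this answer can be checked with a short sketch, assuming the usual boxplot convention that potential outliers lie more than 1.5 × IQR below Q1 or above Q3. The variable names are illustrative only.

```r
# Quartiles taken from the summary statistics in the question
q1 <- 37.0
q3 <- 49.8

# Interquartile range and the 1.5 * IQR fences used by a boxplot
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

round(c(iqr = iqr, lower = lower_fence, upper = upper_fence), 1)
```

A score below the lower fence or above the upper fence would be flagged as a potential outlier.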

The _________ are resistant to outliers, whereas the _________ are not.

a. mean and median; standard deviation and interquartile range
b. mean and standard deviation; median and interquartile range
c. standard deviation and interquartile range; mean and median
d. median and interquartile range; mean and standard deviation
e. median and standard deviation; mean and interquartile range

ANS: D

Figure A below represents the distribution of an observed variable. Figure B below represents the distribution of the mean from 500 random samples of size 30 from A. The mean of A is 5.05 and the mean of B is 5.04. The standard deviations of A and B are 3.22 and 0.58, respectively.

Describe the two distributions (2 pts).

ANS: The distribution of the observations, shown in Figure A, is unimodal and strongly skewed to the right, whereas the distribution in Figure B is symmetric and nearly normal.
Figure A shows outliers with values in the 20s or above, so its median is expected to be lower than its mean.

Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

ANS: Figure B represents the distribution of the mean from 500 random samples of size 30 from A. Because each sample mean is an estimate of the mean of A, the mean of this sampling distribution is expected to be close to the mean of the original distribution A. The standard deviation of the sample means, however, measures how much a sample mean typically varies around the true mean, not how much individual observations vary. It is called the standard error (SE), and it shrinks as the sample size grows.
SE = standard deviation / sqrt(n) = 3.22 / sqrt(30) ≈ 0.59, which is close to the observed standard deviation of 0.58 for B.
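As a rough illustration of this answer, the sketch below computes the standard error from the reported values and compares it with a small simulation. The original data behind Figure A are not available here, so the simulation draws from a hypothetical right-skewed gamma distribution chosen only so that its mean and standard deviation match those reported for A. The seed is arbitrary.

```r
set.seed(606)  # arbitrary seed, used only for reproducibility

# Values reported for Figure A
mean_a <- 5.05
sd_a <- 3.22
n <- 30

# Standard error predicted by the formula
se <- sd_a / sqrt(n)

# Hypothetical stand-in for A: a gamma distribution whose mean
# (shape * scale) and variance (shape * scale^2) match the reported values
scale_a <- sd_a^2 / mean_a
shape_a <- mean_a / scale_a

# 500 sample means, each computed from a random sample of size 30
sample_means <- replicate(500, mean(rgamma(n, shape = shape_a, scale = scale_a)))

round(c(se = se,
        mean_of_means = mean(sample_means),
        sd_of_means = sd(sample_means)), 2)
```

Under these assumptions, the mean of the simulated sample means should land near 5.05 and their standard deviation near the computed standard error, which is the pattern seen in Figures A and B.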

What is the statistical principle that describes this phenomenon (2 pts)?

ANS: The Central Limit Theorem, which states: “If a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model.”

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)

data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68)) 
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74)) 
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73)) 
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
knitr::opts_chunk$set(echo = TRUE)

For each column, calculate (to two decimal places):

a. The mean (for x and y separately):

# Data Set 1:
mx1 <- mean(data1$x)
my1 <- mean(data1$y)

# Data Set 2:
mx2 <- mean(data2$x)
my2 <- mean(data2$y)

# Data Set 3:
mx3 <- mean(data3$x)
my3 <- mean(data3$y)

# Data Set 4:
mx4 <- mean(data4$x)
my4 <- mean(data4$y)

The mean for the data sets:

Data	variable x	variable y
set 1	9.00	7.50
set 2	9.00	7.50
set 3	9.00	7.50
set 4	9.00	7.50

b. The median (for x and y separately):

# Data Set 1:
mdx1 <- median(data1$x)
mdy1 <- median(data1$y)

# Data Set 2:
mdx2 <- median(data2$x)
mdy2 <- median(data2$y)

# Data Set 3:
mdx3 <- median(data3$x)
mdy3 <- median(data3$y)

# Data Set 4:
mdx4 <- median(data4$x)
mdy4 <- median(data4$y)

The median for the data sets:

Data	variable x	variable y
set 1	9	7.58
set 2	9	8.14
set 3	9	7.11
set 4	8	7.04

c. The standard deviation (for x and y separately):

# Data Set 1:
sdx1 <- sd(data1$x)
sdy1 <- sd(data1$y)

# Data Set 2:
sdx2 <- sd(data2$x)
sdy2 <- sd(data2$y)

# Data Set 3:
sdx3 <- sd(data3$x)
sdy3 <- sd(data3$y)

# Data Set 4:
sdx4 <- sd(data4$x)
sdy4 <- sd(data4$y)

The standard deviation for the data sets:

Data	variable x	variable y
set 1	3.32	2.03
set 2	3.32	2.03
set 3	3.32	2.03
set 4	3.32	2.03

For each x and y pair, calculate (also to two decimal places):

d. The correlation:

cor1 <- cor(data1$x, data1$y)
cor2 <- cor(data2$x, data2$y)
cor3 <- cor(data3$x, data3$y)
cor4 <- cor(data4$x, data4$y)

The correlation for each data set is:

Data	Correlation (x,y)
Set 1	0.82
Set 2	0.82
Set 3	0.82
Set 4	0.82

e. Linear regression equation.

# Fit a simple linear regression of y on x for each data set
lm1 <- lm(y ~ x, data = data1)
lm2 <- lm(y ~ x, data = data2)
lm3 <- lm(y ~ x, data = data3)
lm4 <- lm(y ~ x, data = data4)

# Intercept (b0) and slope (b1) of each fitted line
b0.1 <- coef(lm1)[[1]]
b1.1 <- coef(lm1)[[2]]

b0.2 <- coef(lm2)[[1]]
b1.2 <- coef(lm2)[[2]]

b0.3 <- coef(lm3)[[1]]
b1.3 <- coef(lm3)[[2]]

b0.4 <- coef(lm4)[[1]]
b1.4 <- coef(lm4)[[2]]

Regression line equation:
\(\hat{y} = \beta_0 + \beta_1 \cdot x\)

Linear regression equation for 1st data set: y = 3 + 0.5 x

Linear regression equation for 2nd data set: y = 3 + 0.5 x

Linear regression equation for 3rd data set: y = 3 + 0.5 x

Linear regression equation for 4th data set: y = 3 + 0.5 x

f. R-Squared.

The R-Squared for the 1st data set is 0.67

The R-Squared for the 2nd data set is 0.67

The R-Squared for the 3rd data set is 0.67

The R-Squared for the 4th data set is 0.67
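No code is shown for this step, so the following is only a sketch of one way the R-squared values could be obtained, not necessarily how they were computed here. It redefines the four data sets with the values listed earlier so that it runs on its own. The names `x123`, `datasets`, and `r_squared` are illustrative.

```r
# The four data sets, with the same values as defined earlier in this document
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)  # shared x values of sets 1 to 3
datasets <- list(
  set1 = data.frame(x = x123,
                    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  set2 = data.frame(x = x123,
                    y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  set3 = data.frame(x = x123,
                    y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  set4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

# Fit y ~ x for each set and pull R-squared out of the model summary
r_squared <- sapply(datasets, function(d) summary(lm(y ~ x, data = d))$r.squared)

round(r_squared, 2)
```

For a simple linear regression, R-squared equals the square of the correlation between x and y, so these values can be cross-checked against the correlations in part d.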

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots!

For each data plot, we would like to evaluate whether it is appropriate to estimate a linear regression model. We will therefore check the following for each data set.

Linearity: The data should show a linear trend.
Nearly normal residuals: The residuals should be nearly normal.
Constant variability: The variability of points around the least squares line should remain roughly constant.
Independent observations: The observations in the data set must be independent.

For the first condition, we will draw a scatter plot and observe the pattern.
For the second condition, we will draw a normal Q-Q plot of the residuals and observe the pattern.
For the third condition, we will plot the residuals against x and observe the pattern.
For the fourth condition, we will assume independence unless we have evidence to the contrary.
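The three plot-based checks described above could be bundled into one helper so that every data set is examined in the same way. This is a sketch only: `check_conditions` is a hypothetical function that is not part of the original analysis, and it uses base graphics so that it needs no extra packages.

```r
# Hypothetical helper: fits y ~ x and draws the three diagnostic plots
check_conditions <- function(df) {
  fit <- lm(y ~ x, data = df)

  op <- par(mfrow = c(1, 3))  # three plots side by side
  on.exit(par(op))            # restore the previous layout on exit

  # Linearity: scatter plot with the fitted line
  plot(df$x, df$y, xlab = "x", ylab = "y", main = "Scatter plot")
  abline(fit)

  # Nearly normal residuals: normal Q-Q plot
  qqnorm(resid(fit))
  qqline(resid(fit))

  # Constant variability: residuals against x
  plot(df$x, resid(fit), xlab = "x", ylab = "Residual", main = "Residuals vs. x")
  abline(h = 0, lty = 3)

  invisible(fit)  # return the fitted model without printing it
}

# Example call with the first data set (same values as defined earlier)
data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68))
fit1 <- check_conditions(data1)
```

Independence cannot be judged from these plots, so it still has to be argued from how the data were collected.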

Evaluation for Data Set 1:

library(ggplot2)
ggplot(data1, aes(x = x, y=y)) + geom_point(size = 2, color="red")

From the scatter plot, it appears that there is a positive linear relationship.

Normal Q-Q plot of the residuals for Data Set 1:

qqnorm(lm1$residuals)
qqline(lm1$residuals)

From the Q-Q plot of the residuals, it appears that the points lie close to the line, with the exception of the two points with the lowest values. We will now plot the histogram of the residuals.

hist(lm1$residuals)

The Q-Q plot indicates that the residuals are nearly normal, although the histogram does not show a clearly normal shape.

We will now look at the residual plot to determine whether the points have constant variability.

plot(lm1$residuals ~ data1$x)
abline(h = 0, lty = 3)

From this graph, we can conclude that the variability of the residuals is roughly constant.

We will assume independence of the observations.

In summary for data set 1, it would be appropriate to estimate a linear regression model.

Data Set # 2

We will now perform a similar evaluation on the second data set.

We will first look at the scatter plot.

From the scatter plot, it appears that there is a strong relationship between the variables, but it is not a linear one. It appears to be a parabola.

From this observation, we conclude that it would not be appropriate to estimate a linear regression model.
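The impression of a parabola can be checked numerically by comparing the straight-line fit with a fit that adds a squared term. This goes one step beyond the original analysis and is offered only as a sketch. The names `linear_fit` and `quadratic_fit` are illustrative.

```r
# Second data set, with the same values as defined earlier
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74))

# Straight-line model versus a model with an added x^2 term
linear_fit <- lm(y ~ x, data = data2)
quadratic_fit <- lm(y ~ x + I(x^2), data = data2)

# Share of the variation in y explained by each model
r2 <- c(linear = summary(linear_fit)$r.squared,
        quadratic = summary(quadratic_fit)$r.squared)
round(r2, 3)

# Residuals of the linear fit against x: a curved pattern signals non-linearity
plot(data2$x, resid(linear_fit), xlab = "x", ylab = "Residual")
abline(h = 0, lty = 3)
```

A quadratic R-squared much closer to 1 than the linear one, together with a curved residual pattern, would support the reading that the relationship is strong but not linear.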

Data Set # 3

We will now look at the scatter plot for the third data set.

ggplot(data3, aes(x = x, y=y)) + geom_point(size = 2, color="red")

From the scatter plot, it appears that there is a positive linear relationship between the variables, with one outlier.

We will now look at the Q-Q plot of the residuals.

Except for the outlier, all other points are on the line, which would indicate that the residuals follow a normal distribution.

hist(lm3$residuals)

Apart from the outlier, which would need to be accounted for, the residuals appear to be roughly uniformly distributed. It is difficult to draw a conclusion from the histogram of the residuals, so we will rely on the Q-Q plot to evaluate their distribution.

We will now look at the plot of the residuals against the x variable to observe the variability of the data.

plot(lm3$residuals ~ data3$x)
abline(h = 0, lty = 3)

From this plot, we can observe that, apart from the outlier, the residuals decrease steadily as the x variable increases instead of scattering randomly around zero. The variability is therefore not constant.

We will conclude that for Data Set 3, it would not be appropriate to estimate a linear regression model.

Data Set #4:

ggplot(data4, aes(x = x, y=y)) + geom_point(size = 2, color="red")

The scatter plot shows all but one of the data points in a vertical line at x = 8, with a single outlier at x = 19 far from the rest of the data. The fitted line, y = 3 + 0.5 x, therefore depends entirely on that one point. We would also surmise that the observations may not be independent.

Therefore it would not be appropriate to estimate a linear regression model.
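How strongly the fit depends on the single point at x = 19 can be shown with a short sketch. It measures that point's leverage and then refits the model without it. This check is an addition for illustration, and the names `full_fit`, `without_outlier`, and `fit_without` are hypothetical.

```r
# Fourth data set, with the same values as defined earlier
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))

# Leverage of the 8th observation (x = 19); 1 is the maximum possible value
full_fit <- lm(y ~ x, data = data4)
leverage <- hatvalues(full_fit)[[8]]
leverage

# Without that observation, x has no variation at all
without_outlier <- subset(data4, x != 19)
sd(without_outlier$x)

# so lm() cannot estimate a slope and reports NA for the x coefficient
fit_without <- lm(y ~ x, data = without_outlier)
coef(fit_without)
```

A model whose slope exists only because of one observation gives little information about the relationship between x and y.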

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create.

It is extremely important to include visualizations because they give a fuller picture of the data than summary statistics alone. The above exercise is such a case, where plotting the data shows what is really happening. The mean, standard deviation, correlation, regression line, and R-squared are the same for all four data sets, so based on the summary statistics alone, one could assume that the data sets behave the same. Only by plotting the data sets do we see that, even though those statistics are the same, the data behave quite differently.
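One way to make this comparison visible at a glance is to draw all four data sets side by side on common axes, each with its own fitted line. The sketch below uses base graphics rather than the ggplot2 calls used earlier, only to stay free of package dependencies. The axis limits and the name `datasets` are arbitrary choices for illustration.

```r
# The four data sets, with the same values as defined earlier in this document
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)  # shared x values of sets 1 to 3
datasets <- list(
  "Data Set 1" = data.frame(x = x123,
                            y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  "Data Set 2" = data.frame(x = x123,
                            y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  "Data Set 3" = data.frame(x = x123,
                            y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  "Data Set 4" = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                            y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

op <- par(mfrow = c(2, 2))  # 2 x 2 grid of panels
for (name in names(datasets)) {
  d <- datasets[[name]]
  # Same axis limits in every panel so the shapes are directly comparable
  plot(d$x, d$y, main = name, xlab = "x", ylab = "y",
       xlim = c(3, 20), ylim = c(2, 14), pch = 19, col = "red")
  abline(lm(y ~ x, data = d), lty = 2)  # least squares line for this set
}
par(op)  # restore the previous layout
```

Because the four fitted lines are nearly identical, any visible difference between the panels comes from the data themselves.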

DATA 606 Spring 2017 - Final Exam

Raghu

5/19/2017