DATA 606 Spring 2017

library(DATA606)

Part I

Answers are in bold

A student is gathering data on the driving experiences of other college students. A description of the data is presented below. Which of the variables are quantitative and discrete?

car: 1 = compact, 2 = standard size, 3 = mini van, 4 = SUV, and 5 = truck
color: red, blue, green, black, white
daysDrive: number of days per week the student drives
gasMonth: the amount of money the student spends on gas per month

b. daysDrive

daysDrive, car
daysDrive, gasMonth
car, daysDrive, gasMonth

A histogram of the GPAs of 132 students from the Fall 2012 class of this course is presented below. Which estimates of the mean and median are most plausible?

a. mean = 3.3, median = 3.5

mean = 3.5, median = 3.3
mean = 2.9, median = 3.8
mean = 3.8, median = 2.9
mean = 2.5, median = 3.8

A researcher wants to determine if a new treatment is effective for reducing Ebola related fever. What type of study should be conducted in order to establish that the treatment does indeed cause improvement in Ebola patients?

a. Randomly assign Ebola patients to one of two groups, either the treatment or placebo group, and then compare the fever of the two groups.

Identify Ebola patients who received the new treatment and those who did not, and then compare the fever of those two groups.
Identify clusters of villages and then stratify them by gender and compare the fevers of male and female groups.
Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regards to fever in Ebola patients.

A study is designed to test whether there is a relationship between natural hair color (brunette, blond, red) and eye color (blue, green, brown). If a large χ² test statistic is obtained, this suggests that:

there is a difference between average eye color and average hair color.
a person’s hair color is determined by his or her eye color.

c. there is an association between natural hair color and eye color.

eye color and natural hair color are independent.

A researcher studying how monkeys remember is interested in examining the distribution of the score on a standard memory task. The researcher wants to produce a boxplot to examine this distribution. Below are summary statistics from the memory task. What values should the researcher use to determine if a particular score is a potential outlier in the boxplot?

min   Q1   median   Q3     max   mean   sd    n
26    37   45       49.8   65    44.4   8.4   50

37.0 and 49.8

b. 17.8 and 69.0

36.0 and 52.8
26.0 and 50.0
19.2 and 69.9
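The outlier cutoffs follow from the usual 1.5 × IQR boxplot rule. A minimal sketch of that arithmetic, using only the Q1 and Q3 values from the table above (the variable names are illustrative):

```r
# Quartiles quoted in the summary statistics for the memory task
q1 <- 37
q3 <- 49.8

iqr <- q3 - q1                   # interquartile range
lower_fence <- q1 - 1.5 * iqr    # scores below this are potential outliers
upper_fence <- q3 + 1.5 * iqr    # scores above this are potential outliers

round(c(lower = lower_fence, upper = upper_fence), 1)
```

This gives 17.8 and 69.0. The reported minimum (26) and maximum (65) both fall inside those limits, so no score would be flagged.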

The ______ are resistant to outliers, whereas the ______ are not.

mean and median; standard deviation and interquartile range
mean and standard deviation; median and interquartile range
standard deviation and interquartile range; mean and median

d. median and interquartile range; mean and standard deviation

median and standard deviation; mean and interquartile range

Figure A below represents the distribution of an observed variable. Figure B below represents the distribution of the mean from 500 random samples of size 30 from A. The mean of A is 5.05 and the mean of B is 5.04. The standard deviations of A and B are 3.22 and 0.58, respectively.

Describe the two distributions (2 pts). Both distributions appear roughly normal, with moderate skew in the observed distribution (A). Both are centered near 5, but the observed distribution has the wider spread (standard deviation 3.22) and the sampling distribution (B) the much tighter spread (0.58). With a larger sample size, the sampling distribution would be tighter still, because the standard error shrinks and the sample mean becomes a closer estimate of the population mean.
Explain why the means of these two distributions are similar but the standard deviations are not (2 pts). The sample mean is an unbiased estimate of the population mean, so the sampling distribution is centered at about the same value as A. Its standard deviation (the standard error) is the standard deviation of A divided by the square root of the sample size, so it is much smaller.
What is the statistical principle that describes this phenomenon (2 pts)? Central Limit Theorem
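As a quick check of the relationship between the two standard deviations, the standard error can be computed from the figures quoted in the question. This is only a sketch of the arithmetic; the variable names are illustrative.

```r
sd_A <- 3.22   # standard deviation of the observed variable (Figure A)
n    <- 30     # size of each random sample

# Predicted standard deviation of the sample means (standard error)
se <- sd_A / sqrt(n)
round(se, 2)
```

The result is about 0.59, close to the 0.58 reported for distribution B.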

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)

data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))

data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))

data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))

data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

summary(data1)

##        x              y       
##  Min.   : 4.0   Min.   : 4.3  
##  1st Qu.: 6.5   1st Qu.: 6.3  
##  Median : 9.0   Median : 7.6  
##  Mean   : 9.0   Mean   : 7.5  
##  3rd Qu.:11.5   3rd Qu.: 8.6  
##  Max.   :14.0   Max.   :10.8

summary(data2)

##        x              y      
##  Min.   : 4.0   Min.   :3.1  
##  1st Qu.: 6.5   1st Qu.:6.7  
##  Median : 9.0   Median :8.1  
##  Mean   : 9.0   Mean   :7.5  
##  3rd Qu.:11.5   3rd Qu.:8.9  
##  Max.   :14.0   Max.   :9.3

summary(data3)

##        x              y       
##  Min.   : 4.0   Min.   : 5.4  
##  1st Qu.: 6.5   1st Qu.: 6.2  
##  Median : 9.0   Median : 7.1  
##  Mean   : 9.0   Mean   : 7.5  
##  3rd Qu.:11.5   3rd Qu.: 8.0  
##  Max.   :14.0   Max.   :12.7

summary(data4)

##        x            y       
##  Min.   : 8   Min.   : 5.2  
##  1st Qu.: 8   1st Qu.: 6.2  
##  Median : 8   Median : 7.0  
##  Mean   : 9   Mean   : 7.5  
##  3rd Qu.: 8   3rd Qu.: 8.2  
##  Max.   :19   Max.   :12.5

For each column, calculate (to two decimal places):

The mean (for x and y separately; 1 pt).

a.1 The mean of data1 x and y:

data_means <- c(format(mean(data1$x), nsmall = 2),format(mean(data1$y), nsmall = 2)); data_means

## [1] "9.00" "7.50"

a.2 The mean of data2 x and y:

data_means <- c(format(mean(data2$x), nsmall = 2),format(mean(data2$y), nsmall = 2)); data_means

## [1] "9.00" "7.50"

a.3 The mean of data3 x and y:

data_means <- c(format(mean(data3$x), nsmall = 2),format(mean(data3$y), nsmall = 2)); data_means

## [1] "9.00" "7.50"

a.4 The mean of data4 x and y:

data_means <- c(format(mean(data4$x), nsmall = 2),format(mean(data4$y), nsmall = 2)); data_means

## [1] "9.00" "7.50"

The median (for x and y separately; 1 pt).

b.1 The median of data1 x and y:

data_medians <- c(format(median(data1$x), nsmall = 2),format(median(data1$y), nsmall = 2)); data_medians

## [1] "9.00" "7.58"

b.2 The median of data2 x and y:

data_medians <- c(format(median(data2$x), nsmall = 2),format(median(data2$y), nsmall = 2)); data_medians

## [1] "9.00" "8.14"

b.3 The median of data3 x and y:

data_medians <- c(format(median(data3$x), nsmall = 2),format(median(data3$y), nsmall = 2)); data_medians

## [1] "9.00" "7.11"

b.4 The median of data4 x and y:

data_medians <- c(format(median(data4$x), nsmall = 2),format(median(data4$y), nsmall = 2)); data_medians

## [1] "8.00" "7.04"

The standard deviation (for x and y separately; 1 pt).

c.1 The standard deviation of data1 x and y:

data_sd <- c(format(sd(data1$x), nsmall = 2),format(sd(data1$y), nsmall = 2)); data_sd

## [1] "3.32" "2.03"

c.2 The standard deviation of data2 x and y:

data_sd <- c(format(sd(data2$x), nsmall = 2),format(sd(data2$y), nsmall = 2)); data_sd

## [1] "3.32" "2.03"

c.3 The standard deviation of data3 x and y:

data_sd <- c(format(sd(data3$x), nsmall = 2),format(sd(data3$y), nsmall = 2)); data_sd

## [1] "3.32" "2.03"

c.4 The standard deviation of data4 x and y:

data_sd <- c(format(sd(data4$x), nsmall = 2),format(sd(data4$y), nsmall = 2)); data_sd

## [1] "3.32" "2.03"
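The twelve separate calculations above can also be collected into one table. The sketch below is an alternative, not the method used in this exam: it assumes that R's built-in `anscombe` data set holds the same values as data1 through data4 (the numbers listed above match Anscombe's quartet), and the names `quartet` and `stats` are illustrative.

```r
# Rebuild the four data frames from the built-in `anscombe` data set
# (assumption: columns x1..x4 / y1..y4 match data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# One row per data set: mean, median and standard deviation of x and y
stats <- t(sapply(quartet, function(d) {
  c(mean_x   = mean(d$x),   mean_y   = mean(d$y),
    median_x = median(d$x), median_y = median(d$y),
    sd_x     = sd(d$x),     sd_y     = sd(d$y))
}))

round(stats, 2)
```

Reading down a column makes the pattern easy to see: the means and standard deviations agree across all four data sets, while the medians differ.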

For each x and y pair, calculate (also to two decimal places; 1 pt):

The correlation (1 pt).

d.1 The correlation of data1:

cor(data1$x, data1$y)

## [1] 0.82

d.2 The correlation of data2:

cor(data2$x, data2$y)

## [1] 0.82

d.3 The correlation of data3:

cor(data3$x, data3$y)

## [1] 0.82

d.4 The correlation of data4:

cor(data4$x, data4$y)

## [1] 0.82

Linear regression equation (2 pts).

m1 <- lm(y ~ x, data = data1)
summary(m1)

## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217

e.1 The equation of data1: y = 3.00 + 0.50x

m2 <- lm(y ~ x, data = data2)
summary(m2)

## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

e.2 The equation of data2: y = 3.00 + 0.50x

m3 <- lm(y ~ x, data = data3)
summary(m3)

## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

e.3 The equation of data3: y = 3.00 + 0.50x

m4 <- lm(y ~ x, data = data4)
summary(m4)

## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

e.4 The equation of data4: y = 3.00 + 0.50x

R-Squared (2 pts).

f.1 The data1 R-squared: 0.67

f.2 The data2 R-squared: 0.67

f.3 The data3 R-squared: 0.67

f.4 The data4 R-squared: 0.67
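Rather than reading the coefficients and R-squared off each `summary()` printout, they can be extracted programmatically. A minimal sketch, again assuming the built-in `anscombe` data set matches data1 through data4; `quartet` and `fit_summary` are illustrative names.

```r
# Rebuild the four data frames from the built-in `anscombe` data set
# (assumption: columns x1..x4 / y1..y4 match data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# Fit y ~ x for each data set and keep the key quantities
fit_summary <- t(sapply(quartet, function(d) {
  m <- lm(y ~ x, data = d)
  c(intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    r         = cor(d$x, d$y),
    r_squared = summary(m)$r.squared)
}))

round(fit_summary, 2)
```

For a simple linear regression, R-squared is the square of the correlation, which is why a correlation of 0.82 pairs with an R-squared of about 0.67 in every row.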

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Plots for `data1`

par(mfrow=c(2,2))
plot(data1$x, data1$y)
plot(m1$residuals ~ data1$x)
abline(h = 0, lty = 3)
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals)

data1 failed independence

Plots for `data2`

par(mfrow=c(2,2))
plot(data2$x, data2$y)
plot(m2$residuals ~ data2$x)
abline(h = 0, lty = 3)
hist(m2$residuals)
qqnorm(m2$residuals)
qqline(m2$residuals)

data2 failed linearity

Plots for `data3`

par(mfrow=c(2,2))
plot(data3$x, data3$y)
plot(m3$residuals ~ data3$x)
abline(h = 0, lty = 3)
hist(m3$residuals)
qqnorm(m3$residuals)
qqline(m3$residuals)

data3 failed constant variability

Plots for `data4`

par(mfrow=c(2,2))
plot(data4$x, data4$y)
plot(m4$residuals ~ data4$x)
abline(h = 0, lty = 3)
hist(m4$residuals)
qqnorm(m4$residuals)
qqline(m4$residuals)

data4 failed normality
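The four plotting chunks above repeat the same steps, so they could be wrapped in a small helper. The function below is a hypothetical convenience (`plot_diagnostics` is not part of the DATA606 package or base R) that draws the same four panels and also adds the fitted line to the scatterplot.

```r
# Hypothetical helper: draws four diagnostic panels for any data frame
# with columns x and y, and returns the fitted model invisibly.
plot_diagnostics <- function(d, label = "") {
  m <- lm(y ~ x, data = d)
  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))            # restore the previous plot layout

  plot(d$x, d$y, main = paste("Scatterplot", label))
  abline(m)                        # fitted least-squares line
  plot(resid(m) ~ d$x, main = "Residuals vs. x")
  abline(h = 0, lty = 3)
  hist(resid(m), main = "Histogram of residuals")
  qqnorm(resid(m))
  qqline(resid(m))

  invisible(m)
}

# Example with the first data set (values copied from the definition above)
data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                          7.24, 4.26, 10.84, 4.82, 5.68))
m1 <- plot_diagnostics(data1, "data1")
```

Calling the helper once per data set keeps the panels consistent and makes it easier to compare the four sets of residuals.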

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

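Visualizations reveal structure that summary statistics can hide. The four datasets above share nearly identical means, standard deviations, correlations, and regression equations, yet they do not have the same shape. Plotting the data makes trends, nonlinearity, and unusual points visible quickly and helps confirm that the conclusions of an analysis are supported.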
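One way to make this point visually is to draw all four scatterplots on common axes with their fitted lines. This is a sketch, assuming the built-in `anscombe` data set matches data1 through data4; the axis limits are chosen by hand to cover all four data sets.

```r
par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  # Shared axis limits so the four panels are directly comparable
  plot(x, y, main = paste0("data", i), xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x), col = "red")   # fitted line, nearly identical in every panel
}
```

Because the fitted line is essentially the same in each panel, any difference between the panels comes from how the points are arranged around it.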

DATA 606 Spring 2017 - Final Exam

Luisa Velasco

May 25, 2017
