Ans: b) daysDrive
daysDrive is the only variable that is both quantitative and discrete. car and color are categorical (not quantitative), and while gasMonth is quantitative, it is continuous.
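As a quick check of the variable types, a hypothetical data frame with these four columns could be inspected with str(); the values below are invented purely for illustration.
# Hypothetical example; the values are made up for illustration only
cars <- data.frame(
  car       = c("Civic", "Corolla", "F-150"),  # categorical
  color     = c("red", "blue", "white"),       # categorical
  daysDrive = c(5L, 7L, 3L),                   # quantitative, discrete (count of days)
  gasMonth  = c(42.5, 61.2, 88.9)              # quantitative, continuous
)
str(cars)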
Ans: a) mean = 3.3, median = 3.5
gpa <- c((1.9*3.3),(2.1*3.3),(2.5*6.6),(2.7*6.6),(2.9*19.8),(3.1*6.6),(3.4*18.48),(3.5*18.48),(3.7*27.72),(3.0*26.4))
sum(gpa)/132
## [1] 3.293
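In equation form, the calculation above multiplies each GPA value by its weight, sums the products, and divides by 132:
\[\bar{x} = \frac{\sum_i \text{GPA}_i \times w_i}{132} \approx 3.29\]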
Ans: d) Both a) and c)
Using random selection for the trial and examining how the treatment affects one group relative to the other will both help determine whether the treatment causes improvement in Ebola patients.
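As a minimal sketch of the random-assignment idea (the patient IDs and group sizes here are assumptions made up for the example):
# Minimal sketch of random assignment; patient IDs and group sizes are invented
set.seed(1)
patients  <- paste0("patient_", 1:20)      # hypothetical patient IDs
treatment <- sample(patients, size = 10)   # randomly assign half to treatment
control   <- setdiff(patients, treatment)  # the remaining patients form the control group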
Ans: a) There’s a difference between average eye color and average hair color
Having a large chi-square statistic means that we will reject the null hypothesis that there is no difference in the averages.
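For illustration, a chi-square test on a hair color by eye color table can be run in R; this uses R's built-in HairEyeColor dataset, not the data from the question.
# Illustration only: R's built-in HairEyeColor data, not the data from the question
tab <- margin.table(HairEyeColor, margin = c(1, 2))  # collapse over Sex: Hair x Eye counts
chisq.test(tab)  # a large X-squared statistic gives a small p-value, so we reject the null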
Ans: b) 17.8 and 69.0
\[IQR = 49.8 - 37 = 12.8\]
\[\text{lower limit} = 37 - (1.5 \times 12.8) = 17.8\]
\[\text{upper limit} = 49.8 + (1.5 \times 12.8) = 69\]
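The same limits can be verified in R, using the Q1 and Q3 values given in the problem:
# Q1 and Q3 come from the five-number summary given in the problem
q1  <- 37
q3  <- 49.8
iqr <- q3 - q1    # 12.8
q1 - 1.5 * iqr    # lower limit: 17.8
q3 + 1.5 * iqr    # upper limit: 69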
Ans: d) The median and IQR are resistant to outliers, whereas the mean and SD are not.
Distribution A is unimodal and skewed to the right. It has a mean around 5 and a small spread.
Distribution B is unimodal with no skew. Its spread is wide and the sample size is 30.
The means of the two distributions are similar because distribution B is a sample from A. The standard deviations differ because distribution B has a wider spread and a smaller sample size than A.
The Central Limit Theorem
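A small simulation illustrates the Central Limit Theorem; the right-skewed population used here (an exponential distribution with mean 5) is an assumption chosen only for the example.
# CLT illustration with an assumed right-skewed population (exponential, mean = 5)
set.seed(42)
pop_mean <- 5
sample_means <- replicate(1000, mean(rexp(30, rate = 1 / pop_mean)))  # 1000 samples of size 30
hist(sample_means, main = "Sampling distribution of the mean (n = 30)")
abline(v = pop_mean, lwd = 2)  # the sample means center on the population mean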
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
mean1 <- data.frame(c(meanx= mean(data1$x), meany=mean(data1$y)))
mean1
## c.meanx...mean.data1.x...meany...mean.data1.y..
## meanx 9.0
## meany 7.5
mean2 <- data.frame(c(meanx= mean(data2$x), meany=mean(data2$y)))
mean2
## c.meanx...mean.data2.x...meany...mean.data2.y..
## meanx 9.0
## meany 7.5
mean3 <- data.frame(c(meanx= mean(data3$x), meany=mean(data3$y)))
mean3
## c.meanx...mean.data3.x...meany...mean.data3.y..
## meanx 9.0
## meany 7.5
mean4 <- data.frame(c(meanx= mean(data4$x), meany=mean(data4$y)))
mean4
## c.meanx...mean.data4.x...meany...mean.data4.y..
## meanx 9.0
## meany 7.5
median1 <- data.frame(c(medianx= median(data1$x), mediany=median(data1$y)))
median1
## c.medianx...median.data1.x...mediany...median.data1.y..
## medianx 9.0
## mediany 7.6
median2 <- data.frame(c(medianx= median(data2$x), mediany=median(data2$y)))
median2
## c.medianx...median.data2.x...mediany...median.data2.y..
## medianx 9.0
## mediany 8.1
median3 <- data.frame(c(medianx= median(data3$x), mediany=median(data3$y)))
median3
## c.medianx...median.data3.x...mediany...median.data3.y..
## medianx 9.0
## mediany 7.1
median4 <- data.frame(c(medianx= median(data4$x), mediany=median(data4$y)))
median4
## c.medianx...median.data4.x...mediany...median.data4.y..
## medianx 8
## mediany 7
sd1 <- data.frame(c(sdx= sd(data1$x), sdy=sd(data1$y)))
sd1
## c.sdx...sd.data1.x...sdy...sd.data1.y..
## sdx 3.3
## sdy 2.0
sd2 <- data.frame(c(sdx= sd(data2$x), sdy=sd(data2$y)))
sd2
## c.sdx...sd.data2.x...sdy...sd.data2.y..
## sdx 3.3
## sdy 2.0
sd3 <- data.frame(c(sdx= sd(data3$x), sdy=sd(data3$y)))
sd3
## c.sdx...sd.data3.x...sdy...sd.data3.y..
## sdx 3.3
## sdy 2.0
sd4 <- data.frame(c(sdx= sd(data4$x), sdy=sd(data4$y)))
sd4
## c.sdx...sd.data4.x...sdy...sd.data4.y..
## sdx 3.3
## sdy 2.0
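Equivalently, the same means, medians, and standard deviations can be computed for all four datasets in one pass; this is just a more compact version of the calculations above.
# Compact alternative: the same summaries for all four datasets at once
datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)
sapply(datasets, function(d) c(meanx = mean(d$x), meany = mean(d$y),
                               medx = median(d$x), medy = median(d$y),
                               sdx = sd(d$x), sdy = sd(d$y)))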
cor(data1)
## x y
## x 1.00 0.82
## y 0.82 1.00
cor(data2)
## x y
## x 1.00 0.82
## y 0.82 1.00
cor(data3)
## x y
## x 1.00 0.82
## y 0.82 1.00
cor(data4)
## x y
## x 1.00 0.82
## y 0.82 1.00
eq1<- lm(data1$y ~ data1$x)
eq1
##
## Call:
## lm(formula = data1$y ~ data1$x)
##
## Coefficients:
## (Intercept) data1$x
## 3.0 0.5
eq2<- lm(data2$y ~ data2$x)
eq2
##
## Call:
## lm(formula = data2$y ~ data2$x)
##
## Coefficients:
## (Intercept) data2$x
## 3.0 0.5
eq3<- lm(data3$y ~ data3$x)
eq3
##
## Call:
## lm(formula = data3$y ~ data3$x)
##
## Coefficients:
## (Intercept) data3$x
## 3.0 0.5
eq4<- lm(data4$y ~ data4$x)
eq4
##
## Call:
## lm(formula = data4$y ~ data4$x)
##
## Coefficients:
## (Intercept) data4$x
## 3.0 0.5
\[\text{Equation: } y = 0.5x + 3\]
summary(eq1)$r.squared
## [1] 0.67
summary(eq2)$r.squared
## [1] 0.67
summary(eq3)$r.squared
## [1] 0.67
summary(eq4)$r.squared
## [1] 0.67
\[R^2 = 0.67\]
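For reference, R-squared is the proportion of the variation in y explained by the regression line:
\[R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\]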
#Data 1
par(mfrow=c(2,2))
plot(data1)
plot(eq1$residuals)
hist(eq1$residuals)
qqnorm(eq1$residuals)
qqline(eq1$residuals)
Although the data plot looks linear, Data 1 does not have residuals that follow a normal distribution.
#Data 2
par(mfrow=c(2,2))
plot(data2)
plot(eq2$residuals)
hist(eq2$residuals)
qqnorm(eq2$residuals)
qqline(eq2$residuals)
Data 2's plot does not show linearity, and its residuals do not follow a normal distribution.
#Data 3
par(mfrow=c(2,2))
plot(data3)
plot(eq3$residuals)
hist(eq3$residuals)
qqnorm(eq3$residuals)
qqline(eq3$residuals)
There is an outlier in Data 3, but the plot appears linear and the residuals look approximately normal.
#Data 4
par(mfrow=c(2,2))
plot(data4)
plot(eq4$residuals)
hist(eq4$residuals)
qqnorm(eq4$residuals)
qqline(eq4$residuals)
Data 4's plot shows no linearity, and its residuals do not follow a normal distribution.
Visualizations support the statements we make when analyzing data. They help reveal trends and insights that cannot be seen by looking at the numbers alone. For example, the outlier in Data 3 is hard to spot in the raw numbers, but it stands out clearly in a plot. Visualizations make problems like this easy to see.
plot(data3)
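To compare all four datasets at a glance, the scatterplots and their nearly identical fitted lines can be drawn in a 2-by-2 grid using the objects defined above:
# Plot all four datasets with their fitted regression lines for comparison
par(mfrow = c(2, 2))
plot(data1, main = "Data 1"); abline(eq1)
plot(data2, main = "Data 2"); abline(eq2)
plot(data3, main = "Data 3"); abline(eq3)
plot(data4, main = "Data 4"); abline(eq4)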