Part I

Please put the answers for Part I next to the question number (2pts each):

  1. b - just DaysDrive
  2. a
  3. a - but in practice this brings moral complications and regulations
  4. c
  5. b
  6. d

7a. Describe the two distributions (2pts).

A is right-skewed; it looks like it may fit a lognormal distribution. B is centered around 5, appears platykurtic, and looks fairly close to a normal distribution.

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

If we take the means of repeated, independent samples from a distribution, those sample means will have roughly the same mean as the original distribution, but their standard deviation (the standard error) is approximately \(\frac{\sigma}{\sqrt{n}}\). With a larger sample size, the standard deviation of the sample means goes down and their distribution becomes more normal.

7c. What is the statistical principle that describes this phenomenon (2 pts)?

The Central Limit Theorem
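
A quick simulation can illustrate the principle (a sketch, not part of the original answer; it assumes a right-skewed exponential parent distribution with mean 5 and samples of size 30):

set.seed(1)                                        # for reproducibility
parent <- rexp(1e5, rate = 1/5)                    # assumed right-skewed parent: mean 5, sd 5
samp.means <- replicate(1e4, mean(sample(parent, 30)))
mean(samp.means)                                   # close to the parent mean of 5
sd(samp.means)                                     # close to sd(parent)/sqrt(30), much smaller
hist(samp.means)                                   # roughly normal, as the CLT predicts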

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=3)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

\(\Large\bar{x}=\frac{\sum x_{i}}{n}\)

round(mean(data1$x),2)
## [1] 9
round(mean(data1$y),2)
## [1] 7.5
round(mean(data2$x),2)
## [1] 9
round(mean(data2$y),2)
## [1] 7.5
round(mean(data3$x),2)
## [1] 9
round(mean(data3$y),2)
## [1] 7.5
round(mean(data4$x),2)
## [1] 9
round(mean(data4$y),2)
## [1] 7.5

b. The median (for x and y separately; 1 pt).

The median is the central point of the data, i.e., the smallest value at which the CDF is \(\geq 0.5\).

round(median(data1$x),2)
## [1] 9
round(median(data1$y),2)
## [1] 7.58
round(median(data2$x),2)
## [1] 9
round(median(data2$y),2)
## [1] 8.14
round(median(data3$x),2)
## [1] 9
round(median(data3$y),2)
## [1] 7.11
round(median(data4$x),2)
## [1] 8
round(median(data4$y),2)
## [1] 7.04

c. The standard deviation (for x and y separately; 1 pt).

\(\Large\sqrt{\frac{\sum (x_{i}-\overline{x})^{2}}{n-1}}\)

round(sd(data1$x),2)
## [1] 3.32
round(sd(data1$y),2)
## [1] 2.03
round(sd(data2$x),2)
## [1] 3.32
round(sd(data2$y),2)
## [1] 2.03
round(sd(data3$x),2)
## [1] 3.32
round(sd(data3$y),2)
## [1] 2.03
round(sd(data4$x),2)
## [1] 3.32
round(sd(data4$y),2)
## [1] 2.03

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

sample correlation \(r=\frac{\mathrm{Cov}(x,y)}{s_{x}s_{y}}\)

round(cor(data1)[1,2],2)
## [1] 0.82
round(cor(data2)[1,2],2)
## [1] 0.82
round(cor(data3)[1,2],2)
## [1] 0.82
round(cor(data4)[1,2],2)
## [1] 0.82
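
As a check on the formula, the correlation can be rebuilt from the sample covariance and standard deviations; a sketch for the first dataset:

cov(data1$x, data1$y) / (sd(data1$x) * sd(data1$y))   # matches cor(data1)[1,2]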

e. Linear regression equation (2 pts).

lm.1<-lm(data1$y~data1$x)
summary(lm.1)
## 
## Call:
## lm(formula = data1$y ~ data1$x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## data1$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
a.1<-summary(lm.1)[["coefficients"]][1,1]
b.1<-summary(lm.1)[["coefficients"]][2,1]
lm.2<-lm(data2$y~data2$x)
summary(lm.2)
## 
## Call:
## lm(formula = data2$y ~ data2$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## data2$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
a.2<-summary(lm.2)[["coefficients"]][1,1]
b.2<-summary(lm.2)[["coefficients"]][2,1]
lm.3<-lm(data3$y~data3$x)
summary(lm.3)
## 
## Call:
## lm(formula = data3$y ~ data3$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## data3$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
a.3<-summary(lm.3)[["coefficients"]][1,1]
b.3<-summary(lm.3)[["coefficients"]][2,1]
lm.4<-lm(data4$y~data4$x)
summary(lm.4)
## 
## Call:
## lm(formula = data4$y ~ data4$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## data4$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216
a.4<-summary(lm.4)[["coefficients"]][1,1]
b.4<-summary(lm.4)[["coefficients"]][2,1]

data1: y = 3 + 0.5 * x

data2: y = 3 + 0.5 * x

data3: y = 3 + 0.5 * x

data4: y = 3 + 0.5 * x
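
Since the intercepts and slopes were stored above, the fitted equations can also be printed directly from those objects (a sketch using the a.* and b.* values defined earlier):

sprintf("data1: y = %.2f + %.2f * x", a.1, b.1)
sprintf("data2: y = %.2f + %.2f * x", a.2, b.2)
sprintf("data3: y = %.2f + %.2f * x", a.3, b.3)
sprintf("data4: y = %.2f + %.2f * x", a.4, b.4)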

f. R-Squared (2 pts).

summary(lm.1)$r.squared
## [1] 0.667
summary(lm.2)$r.squared
## [1] 0.666
summary(lm.3)$r.squared
## [1] 0.666
summary(lm.4)$r.squared
## [1] 0.667

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

plot(data1$x, data1$y)
abline(lm.1)

qqplot(data1$x,data1$y)

boxplot(data1$x,data1$y)

plot(resid(lm.1))

Model 1 appears to be appropriate for a linear regression. From the boxplots, both x and y appear fairly normal, their relationship appears to be linear, and the residuals appear random and fairly stable.

—————————————————————————————————

plot(data2$x, data2$y)
abline(lm.2)

qqplot(data2$x,data2$y)

boxplot(data2$x,data2$y)

plot(resid(lm.2))

Model 2 is not appropriate for a linear regression. From the scatterplot and the changing slope in the qq plot, it appears to be a strong candidate for a quadratic regression.
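
One way to check this (not part of the original answer) is to actually fit a quadratic term and compare the fit; a sketch:

lm.2q <- lm(y ~ x + I(x^2), data = data2)   # hypothetical quadratic model object
summary(lm.2q)$r.squared                    # should be far higher than the linear fit's 0.666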

—————————————————————————————————

plot(data3$x, data3$y)
abline(lm.3)

qqplot(data3$x,data3$y)

boxplot(data3$x,data3$y)

plot(resid(lm.3))

Model 3 has one outlier that pulls the regression line away from what is otherwise nearly a perfect line. Influence measures such as Cook's distance and DFFITS would quantify how much influence that outlier has on the regression line; in this case it is pretty clear the influence is substantial. The point should be investigated for removal from the model.
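
Both measures are available in base R; a sketch of how the outlier's influence could be checked (the cutoff is a common rule of thumb, not from the original answer):

cooks.distance(lm.3)                # a large value flags the influential point
dffits(lm.3)                        # the same idea on the DFFITS scale
which(cooks.distance(lm.3) > 0.5)   # rough flagging rule for influential points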

—————————————————————————————————

plot(data4$x, data4$y)
abline(lm.4)

qqplot(data4$x,data4$y)

boxplot(data4$x,data4$y)

plot(resid(lm.4))

The x values for model 4 are all 8 except one. If that single point were removed, x would take only one value, so there would be no variation in x and a slope could not even be estimated. With real data it would be obvious that all but one observation share the same x value, and that would be an extremely rare event to happen without a known cause. A linear regression is unlikely to work well for this dataset, even with the point removed.
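
This is easy to verify (a sketch, not part of the original answer): dropping the single x = 19 observation leaves x with no variation, so the slope cannot be estimated:

data4.trim <- data4[data4$x != 19, ]   # hypothetical trimmed copy without the unusual point
lm(y ~ x, data = data4.trim)           # the slope is reported as NA because x is constant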

—————————————————————————————————

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

The plot of residuals for model 4 appears to show a fairly well-behaved model, with random residuals.

plot(resid(lm.4))

The scatterplot or qq plot of the same data tells a very different story: a model that does not work well with a linear fit.

plot(data4$x, data4$y)
abline(lm.4)

In model 2, the boxplots and residuals plot suggest a model that may have some heteroskedasticity and a left-skewed set of y values.

boxplot(data2$x,data2$y)

plot(resid(lm.2))
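
A more direct check for heteroskedasticity (not among the plots above) is to plot the residuals against the fitted values; a sketch:

plot(fitted(lm.2), resid(lm.2))   # a funnel or curved pattern suggests problems
abline(h = 0, lty = 2)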

But the true nature of the quadratic relationship is only clearly seen in the scatterplot, and in the qq plot if you look very carefully.

plot(data2$x, data2$y)
abline(lm.2)

qqplot(data2$x,data2$y)

In model 3, the outlier shows up in all of the plots, but only the scatterplot shows which x value it is paired with, and only the residuals plot shows where it falls in the order of the observations.

Some data visualizations are inappropriate for certain types of data. Pie charts, for example, often make it hard to see differences: 0.3 looks similar to 0.25 but may be meaningfully different in context. Other visualizations work well for relationships that vary by factor (through coloring or facet wrapping). Used incorrectly, a visualization can give the wrong picture of the data (models 2 and 4 above). Sometimes a transformation of the data, such as a log transformation, can reveal relationships in a visualization that a plot of the raw data would not show. Different types of data also call for different types of visualizations; categorical data, for instance, often works better with bar charts than with scatterplots.

Zooming in to the wrong place may also give a false picture. We could create a plot of model 3 that only shows y up to 10, and in doing so we would completely miss the large outlier:

library(ggplot2)
ggplot()+geom_point(aes(x=data3$x,y=data3$y))+ theme(panel.background = element_rect(fill = '#e6eabb'))

ggplot()+geom_point(aes(x=data3$x,y=data3$y))+ylim(5,10)+ theme(panel.background = element_rect(fill = '#e6eabb'))+ggtitle('with bad y range')
## Warning: Removed 1 rows containing missing values (geom_point).