Please put the answers for Part I next to the question number (2pts each):
7a. Describe the two distributions (2 pts). Distribution a is unimodal and strongly right skewed, with a center around five. Distribution b is approximately normal with a much smaller spread.
7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts). Distribution b is a sampling distribution: each observation is the mean of a random sample. By the central limit theorem, since each sample size is 30 or greater, this sampling distribution is approximately normal and centered at the population mean, so the two means are similar. Its standard deviation is much smaller because averaging within each sample cancels out much of the variability of individual observations; the mean of a random sample varies far less than a single value does.
7c. What is the statistical principle that describes this phenomenon (2 pts)? The central limit theorem.
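A quick simulation illustrates the point (a hypothetical sketch, not part of the assignment data; the population shape and parameters here are assumptions):
# Hypothetical sketch: a right-skewed population with mean near 5,
# versus the distribution of means of samples of size 30 drawn from it
set.seed(1)
population <- rexp(1e4, rate = 1/5)   # skewed, mean about 5
sample_means <- replicate(1000, mean(sample(population, 30)))
mean(population); mean(sample_means)  # similar centers
sd(population); sd(sample_means)      # sd of means is about sd(population)/sqrt(30)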
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
options(digits=3)
mean(data1$x)
## [1] 9
mean(data1$y)
## [1] 7.5
mean(data2$x)
## [1] 9
mean(data2$y)
## [1] 7.5
mean(data3$x)
## [1] 9
mean(data3$y)
## [1] 7.5
mean(data4$x)
## [1] 9
mean(data4$y)
## [1] 7.5
Note: I show two decimal places only where they are meaningful (trailing zeros are dropped).
options(digits=3)
median(data1$x)
## [1] 9
median(data1$y)
## [1] 7.58
median(data2$x)
## [1] 9
median(data2$y)
## [1] 8.14
median(data3$x)
## [1] 9
median(data3$y)
## [1] 7.11
median(data4$x)
## [1] 8
median(data4$y)
## [1] 7.04
options(digits=3)
sd(data1$x)
## [1] 3.32
sd(data1$y)
## [1] 2.03
sd(data2$x)
## [1] 3.32
sd(data2$y)
## [1] 2.03
sd(data3$x)
## [1] 3.32
sd(data3$y)
## [1] 2.03
sd(data4$x)
## [1] 3.32
sd(data4$y)
## [1] 2.03
options(digits=2)
cor(data1$x, data1$y)
## [1] 0.82
cor(data2$x, data2$y)
## [1] 0.82
cor(data3$x, data3$y)
## [1] 0.82
cor(data4$x, data4$y)
## [1] 0.82
lmdata1 <- lm(x ~ y, data = data1)
lmdata2 <- lm(x ~ y, data = data2)
lmdata3 <- lm(x ~ y, data = data3)
lmdata4 <- lm(x ~ y, data = data4)
summary(lmdata1)
##
## Call:
## lm(formula = x ~ y, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.652 -1.512 -0.266 1.234 3.895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.998 2.434 -0.41 0.6916
## y 1.333 0.314 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
summary(lmdata2)
##
## Call:
## lm(formula = x ~ y, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.852 -1.432 -0.344 0.847 4.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.995 2.435 -0.41 0.6925
## y 1.332 0.314 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
summary(lmdata3)
##
## Call:
## lm(formula = x ~ y, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.987 -1.373 -0.027 1.320 3.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.000 2.436 -0.41 0.6910
## y 1.333 0.315 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
summary(lmdata4)
##
## Call:
## lm(formula = x ~ y, data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.786 -1.412 -0.185 1.455 3.333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004 2.435 -0.41 0.6898
## y 1.334 0.314 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
Data 1: Expected x = -0.998 + 1.333 * y
Data 2: Expected x = -0.995 + 1.332 * y
Data 3: Expected x = -1.000 + 1.333 * y
Data 4: Expected x = -1.004 + 1.334 * y
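The same coefficients can also be pulled directly from the fitted objects rather than read off the summary tables:
# Intercept and slope for each fitted model, side by side
sapply(list(lmdata1, lmdata2, lmdata3, lmdata4), coef)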
From the model summaries above:
Data 1: R-squared = 0.667
Data 2: R-squared = 0.666
Data 3: R-squared = 0.666
Data 4: R-squared = 0.667
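These values can likewise be extracted programmatically:
# R-squared for each model, pulled from the summary objects
sapply(list(lmdata1, lmdata2, lmdata3, lmdata4),
       function(m) summary(m)$r.squared)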
Data 1
# Linearity
plot(x ~ y, data = data1)
# Nearly normal residuals
hist(lmdata1$residuals)
# Equal variance
plot(x ~ y, data = data1)
abline(lmdata1)
1. Linearity: yes
2. Nearly normal residuals: yes, nearly normal
3. Equal variance: yes
Yes, a linear model is appropriate for data 1.
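Base R can also produce these diagnostics directly from the fitted model; a minimal sketch:
# Residuals vs. fitted (equal variance) and normal Q-Q (normality) for data 1
par(mfrow = c(1, 2))
plot(lmdata1, which = 1:2)
par(mfrow = c(1, 1))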
Data 2
# Linearity
plot(x ~ y, data = data2)
# Nearly normal residuals
hist(lmdata2$residuals)
# Equal variance
plot(x ~ y, data = data2)
abline(lmdata2)
1. Linearity: no
2. Nearly normal residuals: no
3. Equal variance: no
No, it is not appropriate to use a linear model for data 2.
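For data 2 specifically, the residuals-vs-fitted plot makes the failure easy to see: the residuals form a systematic pattern rather than random scatter, which signals a violated linearity assumption:
# Systematic (non-random) residual pattern indicates curvature a line misses
plot(lmdata2, which = 1)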
Data 3
# Linearity
plot(x ~ y, data = data3)
# Nearly normal residuals
hist(lmdata3$residuals)
# Equal variance
plot(x ~ y, data = data3)
abline(lmdata3)
1. Linearity: no
2. Nearly normal residuals: no, not really normal
3. Equal variance: no
All of these checks would hold if we took out the outlier. However, because there are so few observations, it may make more sense to keep the outlier and therefore not use a linear model. No, a linear model is not appropriate for data 3 without removing the outlier; a sketch of that refit follows below.
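As a sketch of that refit (assuming the point with the largest y, (13, 12.74), is the outlier in question):
# Hypothetical: drop the apparent outlier in data3 and refit
data3_trim <- data3[-which.max(data3$y), ]
lmdata3_trim <- lm(x ~ y, data = data3_trim)
coef(lmdata3_trim)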
Data 4
# Linearity
plot(x ~ y, data = data4)
# Nearly normal residuals
hist(lmdata4$residuals)
# Equal variance
plot(x ~ y, data = data4)
abline(lmdata4)
1. Linearity: no
2. Nearly normal residuals: no, not really normal
3. Equal variance: no
No, a linear model is not appropriate for data 4 without removing the large outlier, and as the sketch below shows, removing it leaves nothing to model.
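A hypothetical check shows that dropping the outlier is no fix here: without the x = 19 point, x is constant, so there is no relationship left to regress:
# Hypothetical: drop the high-leverage point in data4
data4_trim <- data4[data4$x != 19, ]
unique(data4_trim$x)  # all remaining x values are 8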
It is important to create visualizations because datasets with nearly identical summary statistics can be shaped in very different ways, which distorts any model fit to them. Comparing data 1 and data 2 makes this particularly clear, as does the side-by-side view below.
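A side-by-side plot of all four datasets makes the contrast immediate (a small sketch in base graphics):
# All four datasets on one screen: same summary statistics, different shapes
par(mfrow = c(2, 2))
plot(x ~ y, data = data1, main = "data1")
plot(x ~ y, data = data2, main = "data2")
plot(x ~ y, data = data3, main = "data3")
plot(x ~ y, data = data4, main = "data4")
par(mfrow = c(1, 1))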