Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

mean_x1

## [1] 9

mean_y1

## [1] 7.5

mean_x2

## [1] 9

mean_y2

## [1] 7.5

mean_x3

## [1] 9

mean_y3

## [1] 7.5

mean_x4

## [1] 9

mean_y4

## [1] 7.5
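The chunk that defines `mean_x1` and the other `mean_*` variables is not shown, and `options(digits=2)` limits printing to two significant digits, not two decimal places, which is why 7.50 appears as 7.5. The following is a minimal sketch of one way to compute and display the means to two decimal places. The names `quartet`, `means` and `means_2dp` are hypothetical, and the block rebuilds the data from R's built-in `anscombe` data set, whose values are the same as `data1` to `data4` above.

```r
# Rebuild the four data frames so this block runs on its own. The values are
# the same as data1..data4 above and match R's built-in `anscombe` data set.
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# Column means for every data set: one row per data set, columns x and y.
means <- t(sapply(quartet, colMeans))

# round() fixes the value at two decimals; format(nsmall = 2) forces both
# decimals to be printed even when options(digits = 2) is in effect.
means_2dp <- format(round(means, 2), nsmall = 2)
print(means_2dp, quote = FALSE)
```

Keeping the numeric matrix `means` separate from the formatted `means_2dp` leaves the unrounded values available for later calculations.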

b. The median (for x and y separately; 1 pt).

median_x1

## [1] 9

median_y1

## [1] 7.6

median_x2

## [1] 9

median_y2

## [1] 8.1

median_x3

## [1] 9

median_y3

## [1] 7.1

median_x4

## [1] 8

median_y4

## [1] 7

c. The standard deviation (for x and y separately; 1 pt).

sd_x1

## [1] 3.3

sd_y1

## [1] 2

sd_x2

## [1] 3.3

sd_y2

## [1] 2

sd_x3

## [1] 3.3

sd_y3

## [1] 2

sd_x4

## [1] 3.3

sd_y4

## [1] 2
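Parts a to c can also be collected in one table, which makes the near-identical summaries easier to compare. This is a sketch under the same assumptions as the data above: the helper `describe` and the objects `quartet` and `stats` are hypothetical names, and the data are rebuilt from the built-in `anscombe` data set so the block is self-contained.

```r
# Rebuild the four data frames (same values as data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# Mean, median and standard deviation of one column, rounded to two decimals.
describe <- function(v) {
  round(c(mean = mean(v), median = median(v), sd = sd(v)), 2)
}

# One row per data set and column, labelled e.g. "data1.x".
stats <- do.call(rbind, lapply(names(quartet), function(nm) {
  out <- t(sapply(quartet[[nm]], describe))
  rownames(out) <- paste(nm, rownames(out), sep = ".")
  out
}))

# nsmall = 2 prints two decimal places regardless of options(digits).
print(format(stats, nsmall = 2), quote = FALSE)
```

The medians are the only summaries in this table that differ noticeably between the data sets.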

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

cor(data1$x, data1$y)

## [1] 0.82

cor(data2$x, data2$y)

## [1] 0.82

cor(data3$x, data3$y)

## [1] 0.82

cor(data4$x, data4$y)

## [1] 0.82
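The four `cor()` calls can be written as a single loop over the data sets. A minimal sketch, assuming the same data as above (rebuilt here from the built-in `anscombe` data set); `quartet` and `cors` are hypothetical names.

```r
# Rebuild the four data frames (same values as data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# Pearson correlation between x and y within each data set.
cors <- sapply(quartet, function(d) cor(d$x, d$y))

# Display to two decimal places.
print(format(round(cors, 2), nsmall = 2), quote = FALSE)
```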

e. Linear regression equation (2 pts).

lm_1 <- lm(y ~ x, data=data1)
summary(lm_1)

## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217

lm_2 <- lm(y ~ x, data=data2)
summary(lm_2)

## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

lm_3 <- lm(y ~ x, data=data3)
summary(lm_3)

## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

lm_4 <- lm(y ~ x, data=data4)
summary(lm_4)

## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

To two decimal places, the linear regression equation is the same for all four data sets: y = 3.00 + 0.50x
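The coefficients can be pulled out of the fitted models directly instead of being read off the `summary()` output. This is a sketch, not necessarily how the values above were obtained; `quartet`, `fits` and `coefs` are hypothetical names, and the data are rebuilt from the built-in `anscombe` data set.

```r
# Rebuild the four data frames (same values as data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# Fit y ~ x separately for each data set.
fits <- lapply(quartet, function(d) lm(y ~ x, data = d))

# coef() returns c(intercept, slope); stack them into one row per data set.
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")

print(format(round(coefs, 2), nsmall = 2), quote = FALSE)
```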

f. R-Squared (2 pts).

Data1 R-Squared = 0.667

Data2 R-Squared = 0.666

Data3 R-Squared = 0.666

Data4 R-Squared = 0.667
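In a simple linear regression with a single predictor, R-squared equals the square of the correlation between x and y, so part f can be cross-checked against part d. A minimal sketch of that check, with hypothetical names `quartet`, `r2` and `cor_sq` and the data rebuilt from the built-in `anscombe` data set:

```r
# Rebuild the four data frames (same values as data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# R-squared as reported by summary.lm().
r2 <- sapply(quartet, function(d) summary(lm(y ~ x, data = d))$r.squared)

# Squared Pearson correlation, computed independently of the model fit.
cor_sq <- sapply(quartet, function(d) cor(d$x, d$y)^2)

# The two columns should agree up to floating-point error.
print(round(cbind(r2, cor_sq), 3))
```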

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Data1

Conditions:

1-Linearity CHECK

2-Nearly normal residuals CHECK (approximately; hard to judge with only 11 points)

3-Constant variability CHECK

4-Independent observations UNKNOWN

Linear regression model is appropriate, provided the observations are independent.

par(mfrow=c(2,2))
plot(data1)
hist(lm_1$residuals)
qqnorm(lm_1$residuals)
qqline(lm_1$residuals)

Data2

Conditions:

1-Linearity X

2-Nearly normal residuals X

3-Constant variability X

4-Independent observations UNKNOWN

Linear regression model is not appropriate.

par(mfrow=c(2,2))
plot(data2)
hist(lm_2$residuals)
qqnorm(lm_2$residuals)
qqline(lm_2$residuals)

Data3

Conditions:

1-Linearity CHECK

2-Nearly normal residuals CHECK

3-Constant variability X

4-Independent observations UNKNOWN

Linear regression model is not appropriate.

par(mfrow=c(2,2))
plot(data3)
hist(lm_3$residuals)
qqnorm(lm_3$residuals)
qqline(lm_3$residuals)

Data4

Conditions:

1-Linearity X (very extreme outlier)

2-Nearly normal residuals X

3-Constant variability X

4-Independent observations UNKNOWN

Linear regression model is not appropriate.

par(mfrow=c(2,2))
plot(data4)
hist(lm_4$residuals)
qqnorm(lm_4$residuals)
qqline(lm_4$residuals)
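The histograms and Q-Q plots above address the normality condition. Linearity and constant variability are usually easier to judge from a plot of residuals against fitted values, which the chunks above do not draw. The following sketch adds that plot for all four models; `quartet` and `fits` are hypothetical names, and the data are rebuilt from the built-in `anscombe` data set so the block runs on its own.

```r
# Rebuild the four data frames (same values as data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# Fit y ~ x separately for each data set.
fits <- lapply(quartet, function(d) lm(y ~ x, data = d))

# 2 x 2 grid: residuals versus fitted values, one panel per data set.
op <- par(mfrow = c(2, 2))
for (nm in names(fits)) {
  plot(fitted(fits[[nm]]), resid(fits[[nm]]),
       main = nm, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)  # reference line at zero residual
}
par(op)  # restore the previous graphics settings
```

A curved pattern in a panel points to a violation of linearity, while a single point far from the rest flags an outlier or a high-leverage observation.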

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

It is critical to visualize the data and check every model condition before fitting a model. These four data sets have nearly identical means, standard deviations, correlations, R-squared values and regression equations, yet plotting them shows that a linear regression model is clearly inappropriate for some of them.
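One way to make this point visually is to draw the four scatterplots side by side with their fitted lines. This is a sketch rather than the plot from the original submission; `quartet` is a hypothetical name, the axis limits are chosen only so that all panels share one scale, and the data are rebuilt from the built-in `anscombe` data set.

```r
# Rebuild the four data frames (same values as data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# 2 x 2 grid of scatterplots on a common scale, each with its fitted line.
op <- par(mfrow = c(2, 2))
for (nm in names(quartet)) {
  d <- quartet[[nm]]
  plot(d$x, d$y, main = nm, xlab = "x", ylab = "y",
       xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x, data = d), col = "red")  # least-squares line
}
par(op)  # restore the previous graphics settings
```

The fitted lines are practically identical across the panels even though the point patterns differ.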

DATA 606 Fall 2017 - Final Exam

Lidiia Tronina
