Part II

Consider the four data sets, each with two columns (x and y), provided below.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

options(digits = 3) # for 3 digit answers (2 decimal places)
mean(data1$x) #mean of x in data 1

## [1] 9

mean(data1$y) #mean of y in data 1

## [1] 7.5

mean(data2$x) #mean of x in data 2

## [1] 9

mean(data2$y) #mean of y in data 2

## [1] 7.5

mean(data3$x) #mean of x in data 3

## [1] 9

mean(data3$y) #mean of y in data 3

## [1] 7.5

mean(data4$x) #mean of x in data 4

## [1] 9

mean(data4$y) #mean of y in data 4

## [1] 7.5

b. The median (for x and y separately; 1 pt).

median(data1$x) #median of x in data 1

## [1] 9

median(data1$y) #median of y in data 1

## [1] 7.58

median(data2$x) #median of x in data 2

## [1] 9

median(data2$y) #median of y in data 2

## [1] 8.14

median(data3$x) #median of x in data 3

## [1] 9

median(data3$y) #median of y in data 3

## [1] 7.11

median(data4$x) #median of x in data 4

## [1] 8

median(data4$y) #median of y in data 4

## [1] 7.04

c. The standard deviation (for x and y separately; 1 pt).

sd(data1$x) #standard deviation of x in data 1

## [1] 3.32

sd(data1$y) #standard deviation of y in data 1

## [1] 2.03

sd(data2$x) #standard deviation of x in data 2

## [1] 3.32

sd(data2$y) #standard deviation of y in data 2

## [1] 2.03

sd(data3$x) #standard deviation of x in data 3

## [1] 3.32

sd(data3$y) #standard deviation of y in data 3

## [1] 2.03

sd(data4$x) #standard deviation of x in data 4

## [1] 3.32

sd(data4$y) #standard deviation of y in data 4

## [1] 2.03

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

options(digits = 2) # for 2 digit answers (2 decimal places)
cor(data1$x, data1$y) #correlation of x and y in data 1

## [1] 0.82

cor(data2$y, data2$x) #correlation of x and y in data 2

## [1] 0.82

cor(data3$y, data3$x) #correlation of x and y in data 3

## [1] 0.82

cor(data4$y, data4$x) #correlation of x and y in data 4

## [1] 0.82

e. Linear regression equation (2 pts).

data_1 <- lm(y~x, data1)
data_2 <- lm(y~x, data2)
data_3 <- lm(y~x, data3)
data_4 <- lm(y~x, data4)
data_1 # y =3.0 + .05x

## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Coefficients:
## (Intercept)            x  
##         3.0          0.5

data_2 # y =3.0 + .05x

## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Coefficients:
## (Intercept)            x  
##         3.0          0.5

data_3 # y =3.0 + .05x

## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Coefficients:
## (Intercept)            x  
##         3.0          0.5

data_4 # y =3.0 + .05x

## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Coefficients:
## (Intercept)            x  
##         3.0          0.5

f. R-Squared (2 pts).

summary(data_1) #R-Squared = 0.67

## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217

summary(data_2) #R-Squared = 0.67

## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

summary(data_3) #R-Squared = 0.67

## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

summary(data_4) #R-Squared = 0.67

## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Data 1

par(mfrow = c(2,2))

plot(x = data1$x, y = data1$y)

hist(data_1$residuals)

qqnorm(data_1$residuals)
qqline(data_1$residuals)

plot(data_1$residuals ~ data1$x)
abline(h = 0)

There is a linear upward trend. The normal residual plot looks good and the histogram of the residuals for the most part are centered around 0. The Residual vs Fitted looks random. Yes, this data passes the conditions for Linear Regression.

Data 2

par(mfrow = c(2,2))

plot(x = data2$x, y = data2$y)

hist(data_2$residuals)

qqnorm(data_2$residuals)
qqline(data_2$residuals)

plot(data_2$residuals ~ data2$x)
abline(h = 0)

Data 2 does not pass the linear regression conditions as it is curved and the Residuals vs Fitted model does not look random.

Data 3

par(mfrow = c(2,2))

plot(x = data3$x, y = data3$y)

hist(data_3$residuals)

qqnorm(data_3$residuals)
qqline(data_3$residuals)

plot(data_3$residuals ~ data3$x)
abline(h = 0)

Data3 looks linear except for an outlier, which is giving it a right skew. without the outlier, the data might pass the conditions for linear regression, but the Risiduals vs Fitted graph does not look random.

Data 4

par(mfrow = c(2,2))

plot(x = data4$x, y = data4$y)

hist(data_4$residuals)

qqnorm(data_4$residuals)
qqline(data_4$residuals)

plot(data_4$residuals ~ data4$x)
abline(h = 0)

Data 4, the x is constant except for 1 point, which gives it a strange plot. The residuals look good but it doesn’t look random.

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

Even though datas’ 1, 2, 3 and 4 had identical mean, median, sd, correlation, linear regression equation, and R-Squared, they were all different when we visualized each data. visualizing the data helped determine if the data was fit for a linear regression model. From our visuals above, only data 1 passed the linear regression conditions, while the rest failed. You are not fully able to see the story the data tell you, without visualizing it.

DATA 606 Fall 2017 - Final Exam

Munkhnaran Gankhuyag

Part I