Please put the answers for Part I next to the question number (2pts each):
7a. Describe the two distributions (2pts).
Answer: Distribution A is approximately normal with a mean around 5 and a long, thin right tail (mild right skew). Its median is slightly less than its mean. Distribution B is normal with its mean at 5 and no skew, but it has two fat tails. Its median is very close or equal to its mean.
7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).
Answer: The mean is the average of all values. The standard deviation quantifies the amount of variation or dispersion in a set of data values.
The data in Set A has high density at values 0-6, and Set B is a random sample from Set A, so values around 0-6 have a high probability of being drawn into Set B. What's more, an unbiased sample reflects the population: the mean of any random sample (Set B) should be approximately equal to the mean of the population (Set A).
The standard deviation (SD), however, reflects the spread of the data around the mean. If the sampled points happen to spread widely, the SD of the sample (B) will be larger than that of the population (A). The sample SD is sqrt(sum((x_i - mean(x))^2) / (n - 1)); because it squares every deviation, a few extreme sampled values inflate the SD far more than they shift the mean, so the two SDs can differ noticeably even while the means agree.
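As a quick sanity check of that formula, here is a minimal sketch (the seed, population, and sample size are arbitrary illustrations, not from the assignment):

set.seed(1)                                   # arbitrary seed for reproducibility
a <- rnorm(1000, mean = 5)                    # stand-in "population" (Set A)
b <- sample(a, 50)                            # random sample from it (Set B)
mean(a); mean(b)                              # the sample mean tracks the population mean
sqrt(sum((b - mean(b))^2) / (length(b) - 1))  # sample SD from the definition...
sd(b)                                         # ...identical to sd(), which uses the n-1 denominator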
7c. What is the statistical principle that describes this phenomenon (2 pts)?
Answer: The statistical principle is the Central Limit Theorem.
“The central limit theorem (CLT) is a statistical theory that states that given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population. Furthermore, all of the samples will follow an approximate normal distribution pattern, with all variances being approximately equal to the variance of the population divided by each sample’s size.”
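A small simulation illustrates the theorem (this is an illustrative sketch, not part of the assignment; the population, sample size, and number of replicates are arbitrary choices):

set.seed(42)                                 # arbitrary seed
population <- rexp(1e5, rate = 1/5)          # skewed population with mean 5
sample_means <- replicate(2000, mean(sample(population, 30)))
mean(sample_means)                           # approximately the population mean (5)
var(sample_means)                            # approximately var(population) / 30
hist(sample_means)                           # roughly normal despite the skewed population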
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
sprintf("%.2f", mean(data1$x))
## [1] "9.00"
sprintf("%.2f", mean(data1$y))
## [1] "7.50"
sprintf("%.2f", mean(data2$x))
## [1] "9.00"
sprintf("%.2f", mean(data2$y))
## [1] "7.50"
sprintf("%.2f", mean(data3$x))
## [1] "9.00"
sprintf("%.2f", mean(data3$y))
## [1] "7.50"
sprintf("%.2f", mean(data4$x))
## [1] "9.00"
sprintf("%.2f", mean(data4$y))
## [1] "7.50"
sprintf("%.2f", median(data1$x))
## [1] "9.00"
sprintf("%.2f", median(data1$y))
## [1] "7.58"
sprintf("%.2f", median(data2$x))
## [1] "9.00"
sprintf("%.2f", median(data2$y))
## [1] "8.14"
sprintf("%.2f", median(data3$x))
## [1] "9.00"
sprintf("%.2f", median(data3$y))
## [1] "7.11"
sprintf("%.2f", median(data4$x))
## [1] "8.00"
sprintf("%.2f", median(data4$y))
## [1] "7.04"
sprintf("%.2f", sd(data1$x))
## [1] "3.32"
sprintf("%.2f", sd(data1$y))
## [1] "2.03"
sprintf("%.2f", sd(data2$x))
## [1] "3.32"
sprintf("%.2f", sd(data2$y))
## [1] "2.03"
sprintf("%.2f", sd(data3$x))
## [1] "3.32"
sprintf("%.2f", sd(data3$y))
## [1] "2.03"
sprintf("%.2f", sd(data4$x))
## [1] "3.32"
sprintf("%.2f", sd(data4$y))
## [1] "2.03"
sprintf("%.2f", cor(data1$x,data1$y))
## [1] "0.82"
sprintf("%.2f", cor(data2$x,data2$y))
## [1] "0.82"
sprintf("%.2f", cor(data3$x,data3$y))
## [1] "0.82"
sprintf("%.2f", cor(data4$x,data4$y))
## [1] "0.82"
print(l1 <- lm(y~x, data=data1)) # y = 3.00 + 0.50x
##
## Call:
## lm(formula = y ~ x, data = data1)
##
## Coefficients:
## (Intercept) x
## 3.0 0.5
print(l2 <- lm(y~x, data=data2)) # y = 3.00 + 0.50x
##
## Call:
## lm(formula = y ~ x, data = data2)
##
## Coefficients:
## (Intercept) x
## 3.0 0.5
print(l3 <- lm(y~x, data=data3)) # y = 3.00 + 0.50x
##
## Call:
## lm(formula = y ~ x, data = data3)
##
## Coefficients:
## (Intercept) x
## 3.0 0.5
print(l4 <- lm(y~x, data=data4)) # y = 3.00 + 0.50x
##
## Call:
## lm(formula = y ~ x, data = data4)
##
## Coefficients:
## (Intercept) x
## 3.0 0.5
R-squared is a statistical measure of how close the data are to the fitted regression line. The R-squared of each data set is 0.67, which means X = {x1, ..., x11} explains only 67% of the variance in Y = {y1, ..., y11} under the linear regression model.
rsq <- function(x, y) summary(lm(y~x))$r.squared
rsq(data1$x, data1$y)
## [1] 0.67
rsq(data2$x, data2$y)
## [1] 0.67
rsq(data3$x, data3$y)
## [1] 0.67
rsq(data4$x, data4$y)
## [1] 0.67
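This matches the correlations computed earlier: in simple linear regression, R-squared is the square of the correlation coefficient, so 0.82^2 ≈ 0.67 for all four sets. A one-line check using the data frames above:

cor(data1$x, data1$y)^2  # ~0.67, the same value rsq() returns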
Answer: The data1 model isn't an appropriate linear regression fit. R-squared is only 0.67, and the residuals in the Q-Q plot below do not fall close to the line.
qqnorm(l1$residuals)
qqline(l1$residuals)
The data2 model isn't an appropriate linear regression fit either. R-squared is only 0.67, and the residuals in the Q-Q plot lie around the line but not very close to it.
qqnorm(l2$residuals)
qqline(l2$residuals)
The data3 model is an appropriate linear regression fit. R-squared is only 0.67, but the residuals in the Q-Q plot fall on the line.
qqnorm(l3$residuals)
qqline(l3$residuals)
The data4 model is an appropriate linear regression fit. R-squared is only 0.67, but the residuals in the Q-Q plot fall very close to the line.
qqnorm(l4$residuals)
qqline(l4$residuals)
Answer: The example above shows why visualizations are important. These four small data sets have the same linear regression equation, the same R-squared, the same means, the same standard deviations, and the same correlation. Their medians differ slightly, but that alone cannot tell us how well the data points fit a line. The plots are what revealed that data3 and data4 are appropriate for a linear regression model while data1 and data2 are not.
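For reference, the scatter plots that reveal these differences can be drawn as follows (a sketch, assuming the fitted models l1-l4 from above):

par(mfrow = c(2, 2))                           # 2x2 grid of panels
plot(y ~ x, data = data1, main = "data1"); abline(l1)
plot(y ~ x, data = data2, main = "data2"); abline(l2)
plot(y ~ x, data = data3, main = "data3"); abline(l3)
plot(y ~ x, data = data4, main = "data4"); abline(l4)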