Part I

Please put the answers for Part I next to the question number (2 pts each):

  1. daysDrive
  2. mean = 3.5, median = 3.3
  3. D - Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regards to fever in Ebola patients
  4. C - there is an association between natural hair color and eye color
  5. B - 17.8 and 69.0
  6. D - median and interquartile range; mean and standard deviation

7a. Describe the two distributions (2 pts).

Observations (A) is right skewed, and its standard deviation of 3.22 shows that the variation in the data is high. The sampling distribution built from Observations, however, appears nearly normal, with a much smaller standard deviation.

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

The means are similar because the mean of a sampling distribution equals the mean of the underlying observations. The standard deviations differ because the individual data points in Observations (A) are spread widely, while sample means cluster much more tightly around the population mean; this smaller spread is the standard error, sd/sqrt(n).

7c. What is the statistical principle that describes this phenomenon (2 pts)?

The principle at work here is the Central Limit Theorem: the distribution of sample means will be nearly normal, centered at the population mean, with a standard deviation equal to the standard error.
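As a quick illustration, a minimal simulation sketch (the exponential population and the sample size of 30 are stand-in assumptions, since the original observations are not reproduced here):

set.seed(1)
population <- rexp(10000, rate = 1/3.5)                 # right-skewed stand-in population
sample_means <- replicate(1000, mean(sample(population, 30)))
sd(population)                                          # large: spread of the raw observations
sd(sample_means)                                        # close to sd(population)/sqrt(30)
hist(sample_means)                                      # roughly bell-shaped, per the CLT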

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

# Apply the given summary function to every column of a data frame.
executeFunction <- function(dataSet, funcName) {
  lapply(dataSet, funcName)
}
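Note that options(digits=2) controls significant digits, not decimal places, so 7.50 prints as 7.5. If exactly two decimal places are required, the summary function could be wrapped with round() and format(); a small sketch (meanTwoDp is a hypothetical helper name):

meanTwoDp <- function(v) format(round(mean(v), 2), nsmall = 2)   # e.g. "7.50" rather than "7.5"
executeFunction(data1, meanTwoDp)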

executeFunction(data1, mean)
## $x
## [1] 9
## 
## $y
## [1] 7.5
executeFunction(data2, mean)
## $x
## [1] 9
## 
## $y
## [1] 7.5
executeFunction(data3, mean)
## $x
## [1] 9
## 
## $y
## [1] 7.5
executeFunction(data4, mean)
## $x
## [1] 9
## 
## $y
## [1] 7.5

b. The median (for x and y separately; 1 pt).

executeFunction(data1, median)
## $x
## [1] 9
## 
## $y
## [1] 7.6
executeFunction(data2, median)
## $x
## [1] 9
## 
## $y
## [1] 8.1
executeFunction(data3, median)
## $x
## [1] 9
## 
## $y
## [1] 7.1
executeFunction(data4, median)
## $x
## [1] 8
## 
## $y
## [1] 7

c. The standard deviation (for x and y separately; 1 pt).

executeFunction(data1, sd)
## $x
## [1] 3.3
## 
## $y
## [1] 2
executeFunction(data2, sd)
## $x
## [1] 3.3
## 
## $y
## [1] 2
executeFunction(data3, sd)
## $x
## [1] 3.3
## 
## $y
## [1] 2
executeFunction(data4, sd)
## $x
## [1] 3.3
## 
## $y
## [1] 2
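The pattern is easier to see when the same statistics are tabulated side by side; a minimal sketch (the datasets list is a hypothetical helper, not part of the assignment code):

datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)
sapply(datasets, function(d) c(mean_x = mean(d$x), mean_y = mean(d$y),
                               sd_x   = sd(d$x),   sd_y   = sd(d$y)))

All four datasets agree to two significant digits on every one of these statistics.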

For each x and y pair, calculate (also to two decimal places):

d. The correlation (1 pt).

plot(data1)

cor(data1)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
plot(data2)

cor(data2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
plot(data3)

cor(data3)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
plot(data4)

cor(data4)
##      x    y
## x 1.00 0.82
## y 0.82 1.00

e. Linear regression equation (2 pts).

lm1 <- lm(y~x, data1)
summary(lm1)
## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
plot(lm1)

lm2 <- lm(y~x, data2)
summary(lm2)
## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
plot(lm2)

lm3 <- lm(y~x, data3)
summary(lm3)
## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
plot(lm3)

lm4 <- lm(y~x, data4)
summary(lm4)
## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216
plot(lm4)
## Warning: not plotting observations with leverage one:
##   8

## Warning: not plotting observations with leverage one:
##   8

f. R-Squared (2 pts).

summary(lm1)$r.squared
## [1] 0.67
summary(lm2)$r.squared
## [1] 0.67
summary(lm3)$r.squared
## [1] 0.67
summary(lm4)$r.squared
## [1] 0.67
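The same comparison can be made in one step; a small sketch using a hypothetical models list:

models <- list(lm1 = lm1, lm2 = lm2, lm3 = lm3, lm4 = lm4)
sapply(models, function(m) summary(m)$r.squared)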

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

plot(lm1)

hist(lm1$residuals)

qqnorm(lm1$residuals)
qqline(lm1$residuals)

For data1, the scatterplot shows a clear upward linear trend and the residuals look roughly normal, so it is appropriate to fit a linear regression model to data1.

plot(lm2)

hist(lm2$residuals)

qqnorm(lm2$residuals)
qqline(lm2$residuals)

For data2, the relationship between x and y is clearly curved, so a straight line cannot describe the data and a linear regression model is NOT appropriate.
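To illustrate the curvature, adding a squared term fits data2 almost perfectly; a quick sketch beyond the required answer:

lm2_quad <- lm(y ~ x + I(x^2), data2)   # quadratic fit capturing the curve
summary(lm2_quad)$r.squared             # very close to 1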

plot(lm3)

hist(lm3$residuals)

qqnorm(lm3$residuals)
qqline(lm3$residuals)

For data3, the plot shows a strong linear relationship except for a single outlier, which pulls the fitted line away from the rest of the points and makes the residuals non-normal. Aside from that one point, a linear regression model is appropriate for data3.
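The outlier can be located directly from the residuals; a one-line sketch:

which.max(abs(resid(lm3)))   # observation 3 (x = 13, y = 12.74) has the largest residual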

plot(lm4)
## Warning: not plotting observations with leverage one:
##   8

## Warning: not plotting observations with leverage one:
##   8

hist(lm4$residuals)

qqnorm(lm4$residuals)
qqline(lm4$residuals)

For data4, all but one observation share the same x value, so the apparent relationship is created entirely by a single high-leverage point (hence the leverage-one warnings above). A linear regression model is NOT reliable for data4.
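The leverage can be confirmed directly; a one-line sketch:

hatvalues(lm4)   # observation 8 (x = 19) has leverage exactly 1, matching the warning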

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

Summary statistics alone can be misleading. All four datasets here (the classic Anscombe's quartet) share nearly identical means, standard deviations, correlations, regression coefficients, and R-squared values, yet their scatterplots reveal completely different structures: linear, curved, outlier-driven, and leverage-driven. Visualizations make trends, curvature, outliers, and high-leverage points immediately visible, so the data can be interpreted more easily and quickly.
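One compact way to produce the supporting visualization is a 2-by-2 panel of the four scatterplots with their fitted lines; a minimal sketch:

par(mfrow = c(2, 2))                    # 2 x 2 grid of plots
for (i in 1:4) {
  d <- get(paste0("data", i))           # data1 ... data4
  plot(d$x, d$y, main = paste0("data", i), xlab = "x", ylab = "y")
  abline(lm(y ~ x, d))                  # nearly identical fitted line in every panel
}
par(mfrow = c(1, 1))                    # reset the plotting layout

Identical summary statistics, four visibly different datasets: exactly why the plots matter.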