Part I

Please put the answers for Part I next to the question number (2 pts each):

    1. daysDrive is both quantitative and discrete
    2. mean = 3.3, median = 3.5
    3. Both studies (a) and (b) could be conducted to establish that the treatment does indeed cause improvement with regard to fever in Ebola patients.
    4. There is an association between natural hair color and eye color
    5. 17.8 and 69.0
    6. median and interquartile range; mean and SD

7a. Describe the two distributions (2 pts).

Ans: Both Figure A and Figure B appear roughly normally distributed, but the spread of the sampling distribution in Figure B is much smaller than the spread of the distribution in Figure A.

More precisely, the distribution in Figure A has a moderate right skew and lower kurtosis, while the distribution in Figure B is approximately normal with higher kurtosis.

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

Ans: Figure A is the distribution of an observed variable, whereas Figure B is the distribution of the means of 500 random samples of size 30 drawn from A. Each sample mean estimates the same population mean, so the two distributions are centered in roughly the same place; the standard deviation of the sampling distribution, however, is the standard error, approximately sd(A)/sqrt(30), which is much smaller than the standard deviation of the original variable.

7c. What is the statistical principle that describes this phenomenon (2 pts)?

Ans: The Central Limit Theorem.
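
A quick simulation illustrates the principle (a minimal sketch in R; the exponential population below is an assumed stand-in for the skewed variable in Figure A, not the actual exam data):

set.seed(1)                                      # reproducibility
pop <- rexp(10000, rate = 1/5)                   # a right-skewed "observed" variable
sample_means <- replicate(500, mean(sample(pop, 30)))
c(mean(pop), mean(sample_means))                 # the centers nearly coincide
c(sd(pop), sd(sample_means), sd(pop)/sqrt(30))   # sd of the means is close to sd(pop)/sqrt(30)

The sample means cluster around the population mean with a spread close to sd(pop)/sqrt(30), which is exactly the pattern seen in Figures A and B.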

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2) # note: this sets *significant* digits for printing, so some values below display with fewer than two decimal places (e.g. 7.58 prints as 7.6)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

round(mean(data1$x),2)
## [1] 9
round(mean(data1$y),2)
## [1] 7.5
round(mean(data2$x),2)
## [1] 9
round(mean(data2$y),2)
## [1] 7.5
round(mean(data3$x),2)
## [1] 9
round(mean(data3$y),2)
## [1] 7.5
round(mean(data4$x),2)
## [1] 9
round(mean(data4$y),2)
## [1] 7.5

b. The median (for x and y separately; 1 pt).

round(median(data1$x),2)
## [1] 9
round(median(data1$y),2)
## [1] 7.6
round(median(data2$x),2)
## [1] 9
round(median(data2$y),2)
## [1] 8.1
round(median(data3$x),2)
## [1] 9
round(median(data3$y),2)
## [1] 7.1
round(median(data4$x),2)
## [1] 8
round(median(data4$y),2)
## [1] 7

c. The standard deviation (for x and y separately; 1 pt).

round(sd(data1$x),2)
## [1] 3.3
round(sd(data1$y),2)
## [1] 2
round(sd(data2$x),2)
## [1] 3.3
round(sd(data2$y),2)
## [1] 2
round(sd(data3$x),2)
## [1] 3.3
round(sd(data3$y),2)
## [1] 2
round(sd(data4$x),2)
## [1] 3.3
round(sd(data4$y),2)
## [1] 2
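
The same statistics can also be computed in one pass over all four datasets (a compact alternative, not required by the assignment):

datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)
round(sapply(datasets, function(d)
  c(mean_x = mean(d$x),     mean_y = mean(d$y),
    median_x = median(d$x), median_y = median(d$y),
    sd_x = sd(d$x),         sd_y = sd(d$y))), 2)

The columns are strikingly similar: every dataset has mean x = 9, mean y = 7.5, sd x = 3.32, and sd y = 2.03, with only the medians differing slightly.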

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

round(cor(data1),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
round(cor(data2),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
round(cor(data3),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
round(cor(data4),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
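
Since each matrix above is symmetric with ones on the diagonal, one pairwise correlation per dataset carries all the information (a compact equivalent):

sapply(list(data1, data2, data3, data4),
       function(d) round(cor(d$x, d$y), 2))   # approximately 0.82 for every pair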

e. Linear regression equation (2 pts).

lm1 <- lm(y ~ x, data = data1)
summary(lm1)
## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
lm2 <- lm(y ~ x, data = data2)
summary(lm2)
## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
lm3 <- lm(y ~ x, data = data3)
summary(lm3)
## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
lm4 <- lm(y ~ x, data = data4)
summary(lm4)
## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216
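
All four fits give essentially the same equation, y-hat ≈ 3.00 + 0.50x. The coefficients can be pulled straight out of the model objects:

round(sapply(list(lm1, lm2, lm3, lm4), coef), 2)   # intercepts ~3.00, slopes ~0.50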

f. R-Squared (2 pts).

summary(lm1)$r.squared
## [1] 0.67
summary(lm2)$r.squared
## [1] 0.67
summary(lm3)$r.squared
## [1] 0.67
summary(lm4)$r.squared
## [1] 0.67

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Conditions for a pair to be appropriate for a linear regression model:

  • Linearity
  • Nearly normal residuals
  • Constant variability
  • Independent observations

par(mfrow=c(2,2))
plot(data1)
hist(lm1$residuals)
qqnorm(lm1$residuals)
qqline(lm1$residuals)
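
A residuals-versus-fitted plot (a small addition to the plots above, shown here only for data1) directly exposes non-linearity and non-constant variance:

plot(lm1$fitted.values, lm1$residuals,
     xlab = "Fitted values", ylab = "Residuals")   # look for random scatter around 0
abline(h = 0, lty = 2)

The same two lines can be repeated with lm2, lm3, and lm4 for the other datasets.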

Data 1 IS appropriate for a linear regression model: the scatterplot is roughly linear, the residuals are nearly normal, and the variability around the line is roughly constant.

par(mfrow=c(2,2))
plot(data2)
hist(lm2$residuals)
qqnorm(lm2$residuals)
qqline(lm2$residuals)

Data 2 is NOT appropriate for a linear regression model, as the clearly curved scatterplot violates the conditions of linearity, nearly normal residuals, and constant variability.

par(mfrow=c(2,2))
plot(data3)
hist(lm3$residuals)
qqnorm(lm3$residuals)
qqline(lm3$residuals)

Data 3 is NOT appropriate for a linear regression model, as a single extreme outlier violates the conditions of nearly normal residuals and constant variability.

par(mfrow=c(2,2))
plot(data4)
hist(lm4$residuals)
qqnorm(lm4$residuals)
qqline(lm4$residuals)

Data 4 is NOT appropriate for a linear regression model, as the fit is driven entirely by one high-leverage point (x = 19), violating the conditions of linearity and constant variability.

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

Data visualization is an important step when analyzing data because it exposes patterns that summary statistics alone cannot. These four datasets (Anscombe's quartet) have nearly identical means, standard deviations, correlations, regression equations, and R-squared values, yet plotting them reveals four completely different relationships. Below is an example of the diagnostic plots for data1.

par(mfrow=c(2,2))
plot(lm1)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
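
A side-by-side scatterplot of all four pairs (a quick sketch using only base R) makes the same point even more directly:

# Plot each x-y pair with its (essentially identical) fitted line
datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)
par(mfrow = c(2, 2))
for (nm in names(datasets)) {
  d <- datasets[[nm]]
  plot(d$x, d$y, main = nm, xlab = "x", ylab = "y",
       xlim = c(3, 20), ylim = c(2, 14), pch = 19)
  abline(lm(y ~ x, data = d), col = "red")   # nearly the same line in every panel
}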