Part I

Please put the answers for Part I next to the question number (2 pts each):

  1. B
  2. A
  3. A
  4. D
  5. B
  6. D

7a. Describe the two distributions (2 pts). Distribution A is right-skewed and unimodal. Distribution B is approximately normal: symmetric and unimodal.

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts). Because the observations are independent, the sampling distribution of the sample mean is approximately normal and centered at the same value as the mean of the observed data, so the two means are similar. The standard deviations differ because one plot shows individual observations while the other shows sample means: the standard deviation of the sample mean (the standard error) shrinks by a factor of the square root of the sample size.

7c. What is the statistical principle that describes this phenomenon (2 pts)? The Central Limit Theorem.
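As an illustration of the Central Limit Theorem (a sketch with simulated data, not part of the exam): individual draws from a right-skewed distribution stay skewed, but their sample means are approximately normal, with a standard deviation smaller by a factor of sqrt(n).

set.seed(1)                                           # for reproducibility
obs   <- rexp(10000, rate = 1)                        # right-skewed observations
means <- replicate(10000, mean(rexp(30, rate = 1)))   # means of samples of n = 30
c(sd_obs = sd(obs), sd_means = sd(means))             # sd_means is roughly sd_obs / sqrt(30)
hist(obs,   main = "Observed values (right skewed)")
hist(means, main = "Sample means (approximately normal)")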

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

Means

meand1x <- round(mean(data1$x), 2)
meand1y <- round(mean(data1$y), 2)
meand2x <- round(mean(data2$x), 2)
meand2y <- round(mean(data2$y), 2)
meand3x <- round(mean(data3$x), 2)
meand3y <- round(mean(data3$y), 2)
meand4x <- round(mean(data4$x), 2)
meand4y <- round(mean(data4$y), 2)

b. The median (for x and y separately; 1 pt).

Medians

medd1x <- round(median(data1$x), 2)
medd1y <- round(median(data1$y), 2)
medd2x <- round(median(data2$x), 2)
medd2y <- round(median(data2$y), 2)
medd3x <- round(median(data3$x), 2)
medd3y <- round(median(data3$y), 2)
medd4x <- round(median(data4$x), 2)
medd4y <- round(median(data4$y), 2)

c. The standard deviation (for x and y separately; 1 pt).

Standard deviations

sdd1x <- round(sd(data1$x), 2)
sdd1y <- round(sd(data1$y), 2)
sdd2x <- round(sd(data2$x), 2)
sdd2y <- round(sd(data2$y), 2)
sdd3x <- round(sd(data3$x), 2)
sdd3y <- round(sd(data3$y), 2)
sdd4x <- round(sd(data4$x), 2)
sdd4y <- round(sd(data4$y), 2)
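The repeated calls above can also be collapsed into a single pass over the four data frames; a compact alternative sketch (the list name datasets is my own):

datasets <- list(Data1 = data1, Data2 = data2, Data3 = data3, Data4 = data4)
sapply(datasets, function(d) c(Meanx   = round(mean(d$x), 2),
                               Meany   = round(mean(d$y), 2),
                               Medianx = round(median(d$x), 2),
                               Mediany = round(median(d$y), 2),
                               SDx     = round(sd(d$x), 2),
                               SDy     = round(sd(d$y), 2)))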

For each x and y pair, calculate (also to two decimal places):

d. The correlation (1 pt).

Correlations

cord1 <- round(cor(data1$x, data1$y), 2)
cord2 <- round(cor(data2$x, data2$y), 2)
cord3 <- round(cor(data3$x, data3$y), 2)
cord4 <- round(cor(data4$x, data4$y), 2)
data.frame(data = c("Data1", "Data2", "Data3", "Data4"),
           Meanx = c(meand1x, meand2x, meand3x, meand4x),
           Meany = c(meand1y, meand2y, meand3y, meand4y),
           Medianx = c(medd1x, medd2x, medd3x, medd4x),
           Mediany = c(medd1y, medd2y, medd3y, medd4y),
           SDx = c(sdd1x, sdd2x, sdd3x, sdd4x),
           SDy = c(sdd1y, sdd2y, sdd3y, sdd4y),
           Correlation = c(cord1, cord2, cord3, cord4))
##    data Meanx Meany Medianx Mediany SDx SDy Correlation
## 1 Data1     9   7.5       9     7.6 3.3   2        0.82
## 2 Data2     9   7.5       9     8.1 3.3   2        0.82
## 3 Data3     9   7.5       9     7.1 3.3   2        0.82
## 4 Data4     9   7.5       8     7.0 3.3   2        0.82

e. Linear regression equation (2 pts).

Rd1 <- lm(data1$y~data1$x)
Rd2 <- lm(data2$y~data2$x)
Rd3 <- lm(data3$y~data3$x)
Rd4 <- lm(data4$y~data4$x)
Rd1
## 
## Call:
## lm(formula = data1$y ~ data1$x)
## 
## Coefficients:
## (Intercept)      data1$x  
##         3.0          0.5
Rd2
## 
## Call:
## lm(formula = data2$y ~ data2$x)
## 
## Coefficients:
## (Intercept)      data2$x  
##         3.0          0.5
Rd3
## 
## Call:
## lm(formula = data3$y ~ data3$x)
## 
## Coefficients:
## (Intercept)      data3$x  
##         3.0          0.5
Rd4
## 
## Call:
## lm(formula = data4$y ~ data4$x)
## 
## Coefficients:
## (Intercept)      data4$x  
##         3.0          0.5

The linear regression equation for all four data sets is y = 3.00 + 0.50x.
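The same coefficients can be read off programmatically with coef(); a quick check across all four models:

# Intercept and slope for each fitted model, rounded to two decimals
sapply(list(Rd1, Rd2, Rd3, Rd4), function(m) round(coef(m), 2))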

f. R-Squared (2 pts).

summary(Rd1)
## 
## Call:
## lm(formula = data1$y ~ data1$x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## data1$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
summary(Rd2)
## 
## Call:
## lm(formula = data2$y ~ data2$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## data2$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
summary(Rd3)
## 
## Call:
## lm(formula = data3$y ~ data3$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## data3$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
summary(Rd4)
## 
## Call:
## lm(formula = data4$y ~ data4$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## data4$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

The R-squared for all four data sets is approximately 0.67.
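Instead of reading the value off each printed summary, R-squared can be extracted directly from the summary object:

# Multiple R-squared for each model; each rounds to 0.67, matching the summaries above
sapply(list(Rd1, Rd2, Rd3, Rd4), function(m) round(summary(m)$r.squared, 2))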

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Requirements for calculating linear regression:

1. Linearity

2. Nearly normal residuals

3. Constant variability

4. Independent observations

Linearity and constant variability, checked with scatterplots:

plot(data1$x, data1$y, main = "Data1")
abline(Rd1)

plot(data2$x, data2$y, main = "Data2")
abline(Rd2)

plot(data3$x, data3$y, main = "Data3")
abline(Rd3)

plot(data4$x, data4$y, main = "Data4")
abline(Rd4)

Data set 1 is linear and has constant variability.

Data set 2 shows a clear curved pattern, so the relationship is not linear.

Data set 3 is linear apart from a single outlier.

Data set 4 has no variability in x (every x is 8 except a single point at x = 19), so the fitted line is determined entirely by that one high-leverage point.
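The outlier in data set 3 can be confirmed numerically by locating the largest residual from the Rd3 fit:

i <- which.max(abs(resid(Rd3)))   # index of the largest absolute residual
data3[i, ]                        # the outlying point (x = 13, y = 12.74)
resid(Rd3)[i]                     # about 3.24, the Max residual in summary(Rd3)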

Linearity, checked with residual plots:

plot(data1$x, Rd1$residuals, main = "Data1")
abline(h=0,lty=2)

plot(data2$x, Rd2$residuals, main = "Data2")
abline(h=0,lty=2)

plot(data3$x, Rd3$residuals, main = "Data3")
abline(h=0,lty=2)

plot(data4$x, Rd4$residuals, main = "Data4")
abline(h=0,lty=2)

The residual plots above show that only data set 1 has residuals scattered randomly around zero; the other data sets show a curved pattern, a large outlier, or a single influential point.

Nearly normal residuals, checked with normal Q-Q plots:

qqnorm(Rd1$residuals)
qqline(Rd1$residuals)

qqnorm(Rd2$residuals)
qqline(Rd2$residuals)

qqnorm(Rd3$residuals)
qqline(Rd3$residuals)

qqnorm(Rd4$residuals)
qqline(Rd4$residuals)

The Q-Q plots show residuals that are roughly normal for all four data sets, although data set 3's outlier (residual of about 3.24) departs clearly from the line.

It is therefore only appropriate to estimate a linear regression model for data set 1, because it alone meets all of the required conditions (linearity, nearly normal residuals, constant variability), as shown by the scatterplots, residual plots, and normal Q-Q plots. Data set 2 fails linearity, data set 3 is distorted by an outlier, and data set 4's fit hinges on a single high-leverage point.

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

It is important to include visualizations when analyzing data in order to reveal trends and to verify that all necessary conditions are met. Many problems cannot be detected from summary statistics alone: the means, medians, standard deviations, correlations, regression equations, and R-squared values were essentially identical across all four data sets. Only through the visualizations could we see that a linear regression model is appropriate for just one of them.
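A compact way to produce such a visualization is to draw all four scatterplots with their fitted lines in a single 2 x 2 panel, reusing the objects defined earlier:

op <- par(mfrow = c(2, 2))        # arrange the four plots in a 2 x 2 grid
plot(data1$x, data1$y, main = "Data1"); abline(Rd1)
plot(data2$x, data2$y, main = "Data2"); abline(Rd2)
plot(data3$x, data3$y, main = "Data3"); abline(Rd3)
plot(data4$x, data4$y, main = "Data4"); abline(Rd4)
par(op)                           # restore the previous plotting layout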