Part I

Please put the answers for Part I next to the question number (2 pts each):

  1. B
  2. A
  3. A
  4. D
  5. B
  6. D

7a. Describe the two distributions (2 pts). Distribution A is right-skewed and unimodal. Distribution B is approximately normal: symmetric and unimodal.

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts). Because the observations are independent, the sampling distribution of the sample mean is approximately normal and centered at the same value as the mean of the observed data, so the two means are similar. The standard deviations differ because one plot shows individual observations while the other shows sample means: the standard deviation of the sample mean (the standard error) shrinks by a factor of the square root of the sample size.

7c. What is the statistical principle that describes this phenomenon (2 pts)? The Central Limit Theorem.
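As an illustration of the Central Limit Theorem (a sketch with simulated data, not part of the exam): individual draws from a right-skewed distribution stay skewed, but their sample means are approximately normal, with a standard deviation smaller by a factor of sqrt(n).

set.seed(1)                                           # for reproducibility
obs   <- rexp(10000, rate = 1)                        # right-skewed observations
means <- replicate(10000, mean(rexp(30, rate = 1)))   # means of samples of n = 30
c(sd_obs = sd(obs), sd_means = sd(means))             # sd_means is roughly sd_obs / sqrt(30)
hist(obs,   main = "Observed values (right skewed)")
hist(means, main = "Sample means (approximately normal)")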

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

Means

meand1x <- round(mean(data1$x), 2)
meand1y <- round(mean(data1$y), 2)
meand2x <- round(mean(data2$x), 2)
meand2y <- round(mean(data2$y), 2)
meand3x <- round(mean(data3$x), 2)
meand3y <- round(mean(data3$y), 2)
meand4x <- round(mean(data4$x), 2)
meand4y <- round(mean(data4$y), 2)

b. The median (for x and y separately; 1 pt).

Medians

medd1x <- round(median(data1$x), 2)
medd1y <- round(median(data1$y), 2)
medd2x <- round(median(data2$x), 2)
medd2y <- round(median(data2$y), 2)
medd3x <- round(median(data3$x), 2)
medd3y <- round(median(data3$y), 2)
medd4x <- round(median(data4$x), 2)
medd4y <- round(median(data4$y), 2)

c. The standard deviation (for x and y separately; 1 pt).

Standard deviations

sdd1x <- round(sd(data1$x), 2)
sdd1y <- round(sd(data1$y), 2)
sdd2x <- round(sd(data2$x), 2)
sdd2y <- round(sd(data2$y), 2)
sdd3x <- round(sd(data3$x), 2)
sdd3y <- round(sd(data3$y), 2)
sdd4x <- round(sd(data4$x), 2)
sdd4y <- round(sd(data4$y), 2)
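The repeated calls above can also be collapsed into a single pass over the four data frames; a compact alternative sketch (the list name datasets is my own):

datasets <- list(Data1 = data1, Data2 = data2, Data3 = data3, Data4 = data4)
sapply(datasets, function(d) c(Meanx   = round(mean(d$x), 2),
                               Meany   = round(mean(d$y), 2),
                               Medianx = round(median(d$x), 2),
                               Mediany = round(median(d$y), 2),
                               SDx     = round(sd(d$x), 2),
                               SDy     = round(sd(d$y), 2)))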

For each x and y pair, calculate (also to two decimal places):

d. The correlation (1 pt).

Correlations

cord1 <- round(cor(data1$x, data1$y), 2)
cord2 <- round(cor(data2$x, data2$y), 2)
cord3 <- round(cor(data3$x, data3$y), 2)
cord4 <- round(cor(data4$x, data4$y), 2)
data.frame(data = c("Data1", "Data2", "Data3", "Data4"),
           Meanx = c(meand1x, meand2x, meand3x, meand4x),
           Meany = c(meand1y, meand2y, meand3y, meand4y),
           Medianx = c(medd1x, medd2x, medd3x, medd4x),
           Mediany = c(medd1y, medd2y, medd3y, medd4y),
           SDx = c(sdd1x, sdd2x, sdd3x, sdd4x),
           SDy = c(sdd1y, sdd2y, sdd3y, sdd4y),
           Correlation = c(cord1, cord2, cord3, cord4))
##    data Meanx Meany Medianx Mediany SDx SDy Correlation
## 1 Data1     9   7.5       9     7.6 3.3   2        0.82
## 2 Data2     9   7.5       9     8.1 3.3   2        0.82
## 3 Data3     9   7.5       9     7.1 3.3   2        0.82
## 4 Data4     9   7.5       8     7.0 3.3   2        0.82

e. Linear regression equation (2 pts).

Rd1 <- lm(data1$y~data1$x)
Rd2 <- lm(data2$y~data2$x)
Rd3 <- lm(data3$y~data3$x)
Rd4 <- lm(data4$y~data4$x)
Rd1
## 
## Call:
## lm(formula = data1$y ~ data1$x)
## 
## Coefficients:
## (Intercept)      data1$x  
##         3.0          0.5
Rd2
## 
## Call:
## lm(formula = data2$y ~ data2$x)
## 
## Coefficients:
## (Intercept)      data2$x  
##         3.0          0.5
Rd3
## 
## Call:
## lm(formula = data3$y ~ data3$x)
## 
## Coefficients:
## (Intercept)      data3$x  
##         3.0          0.5
Rd4
## 
## Call:
## lm(formula = data4$y ~ data4$x)
## 
## Coefficients:
## (Intercept)      data4$x  
##         3.0          0.5

The linear regression equation for all four data sets is y = 3.00 + 0.50x.
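The same coefficients can be read off programmatically with coef(); a quick check across all four models:

# Intercept and slope for each fitted model, rounded to two decimals
sapply(list(Rd1, Rd2, Rd3, Rd4), function(m) round(coef(m), 2))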

f. R-Squared (2 pts).

summary(Rd1)
## 
## Call:
## lm(formula = data1$y ~ data1$x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## data1$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
summary(Rd2)
## 
## Call:
## lm(formula = data2$y ~ data2$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## data2$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
summary(Rd3)
## 
## Call:
## lm(formula = data3$y ~ data3$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## data3$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
summary(Rd4)
## 
## Call:
## lm(formula = data4$y ~ data4$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## data4$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

The R-squared for all four data sets is approximately 0.67.
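Instead of reading the value off each printed summary, R-squared can be extracted directly from the summary object:

# Multiple R-squared for each model; each rounds to 0.67, matching the summaries above
sapply(list(Rd1, Rd2, Rd3, Rd4), function(m) round(summary(m)$r.squared, 2))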

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Requirements for calculating linear regression:

1. Linearity

2. Nearly normal residuals

3. Constant variability

4. Independent observations

Linearity and constant variability, checked with scatterplots:

plot(data1$x, data1$y, main = "Data1")
abline(Rd1)

plot(data2$x, data2$y, main = "Data2")
abline(Rd2)

plot(data3$x, data3$y, main = "Data3")
abline(Rd3)

plot(data4$x, data4$y, main = "Data4")
abline(Rd4)

Data set 1 is linear and has constant variability.

Data set 2 shows a clear curved pattern, so the relationship is not linear.

Data set 3 is linear apart from a single outlier.

Data set 4 has no variability in x (every x is 8 except a single point at x = 19), so the fitted line is determined entirely by that one high-leverage point.
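The outlier in data set 3 can be confirmed numerically by locating the largest residual from the Rd3 fit:

i <- which.max(abs(resid(Rd3)))   # index of the largest absolute residual
data3[i, ]                        # the outlying point (x = 13, y = 12.74)
resid(Rd3)[i]                     # about 3.24, the Max residual in summary(Rd3)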

Linearity, checked with residual plots:

plot(data1$x, Rd1$residuals, main = "Data1")
abline(h=0,lty=2)

plot(data2$x, Rd2$residuals, main = "Data2")
abline(h=0,lty=2)

plot(data3$x, Rd3$residuals, main = "Data3")
abline(h=0,lty=2)

plot(data4$x, Rd4$residuals, main = "Data4")
abline(h=0,lty=2)

The residual plots above show that only data set 1 has residuals scattered randomly around zero; the other data sets show a curved pattern, a large outlier, or a single influential point.

Nearly normal residuals, checked with normal Q-Q plots:

qqnorm(Rd1$residuals)
qqline(Rd1$residuals)

qqnorm(Rd2$residuals)
qqline(Rd2$residuals)

qqnorm(Rd3$residuals)
qqline(Rd3$residuals)

qqnorm(Rd4$residuals)
qqline(Rd4$residuals)

The Q-Q plots show residuals that are roughly normal for all four data sets, although data set 3's outlier (residual of about 3.24) departs clearly from the line.

It is therefore only appropriate to estimate a linear regression model for data set 1, because it alone meets all of the required conditions (linearity, nearly normal residuals, constant variability), as shown by the scatterplots, residual plots, and normal Q-Q plots. Data set 2 fails linearity, data set 3 is distorted by an outlier, and data set 4's fit hinges on a single high-leverage point.

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

It is important to include visualizations when analyzing data in order to reveal trends and to verify that all necessary conditions are met. Many problems cannot be detected from summary statistics alone: the means, medians, standard deviations, correlations, regression equations, and R-squared values were essentially identical across all four data sets. Only through the visualizations could we see that a linear regression model is appropriate for just one of them.
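A compact way to produce such a visualization is to draw all four scatterplots with their fitted lines in a single 2 x 2 panel, reusing the objects defined earlier:

op <- par(mfrow = c(2, 2))        # arrange the four plots in a 2 x 2 grid
plot(data1$x, data1$y, main = "Data1"); abline(Rd1)
plot(data2$x, data2$y, main = "Data2"); abline(Rd2)
plot(data3$x, data3$y, main = "Data3"); abline(Rd3)
plot(data4$x, data4$y, main = "Data4"); abline(Rd4)
par(op)                           # restore the previous plotting layout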