Part I

Please put the answers for Part I next to the question number (2pts each):

1. daysDrive
2. mean = 3.3, median = 3.5
3. Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regard to fever in Ebola patients.
4. eye color and natural hair color are independent
5. 17.8 and 69.0
6. median and interquartile range; mean and standard deviation

7a. Describe the two distributions (2pts).

Figure A is a unimodal, right-skewed distribution, while Figure B is a unimodal, symmetric, nearly normal distribution.

Figure A: \[ \mu = 5.05, \sigma = 3.22 \]

Figure B: \[ N ( \mu = 5.04, \sigma = 0.58 ) \]

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

The means are similar because the central limit theorem applies here. When 500 samples of size n = 30 are taken from a skewed distribution like Figure A, the sample means form a nearly normal distribution whose mean is close to the population mean. The standard deviation of the sample means, however, is approximately \[ \frac{\sigma}{\sqrt{n}} \], which for \[ \sigma = 3.22 \] and \[ n = 30 \] is about 0.59, close to the 0.58 seen in Figure B.
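The relationship between the population standard deviation and the spread of the sample means can be checked with a small simulation. This is only a sketch: the exponential population (mean 3, standard deviation 3) and the seed are assumptions chosen for illustration, not the data behind Figure A.

```r
# Illustrative simulation; the exponential population is an assumption,
# not the distribution that produced Figure A.
set.seed(606)
population_mean <- 3
population_sd <- 3          # an Exponential(rate = 1/3) has mean = sd = 3
n <- 30

# 500 samples of size 30, keeping the mean of each sample
sample_means <- replicate(500, mean(rexp(n, rate = 1 / population_mean)))

mean(sample_means)          # close to the population mean
sd(sample_means)            # close to sigma / sqrt(n)
population_sd / sqrt(n)     # theoretical standard error
population_sd / n           # sigma / n, shown for contrast: far too small

# The same formula with the values reported for Figure A
3.22 / sqrt(30)
```

Comparing `sd(sample_means)` with both `population_sd / sqrt(n)` and `population_sd / n` shows why the square root belongs in the denominator.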

7c. What is the statistical principle that describes this phenomenon (2 pts)?

The statistical principle here is the Central Limit Theorem.

Part II

Consider the four datasets, each with two columns (x and y), provided below.

library(psych)
options(digits=4)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

summary(data1)

##        x              y        
##  Min.   : 4.0   Min.   : 4.26  
##  1st Qu.: 6.5   1st Qu.: 6.32  
##  Median : 9.0   Median : 7.58  
##  Mean   : 9.0   Mean   : 7.50  
##  3rd Qu.:11.5   3rd Qu.: 8.57  
##  Max.   :14.0   Max.   :10.84

summary(data2)

##        x              y       
##  Min.   : 4.0   Min.   :3.10  
##  1st Qu.: 6.5   1st Qu.:6.70  
##  Median : 9.0   Median :8.14  
##  Mean   : 9.0   Mean   :7.50  
##  3rd Qu.:11.5   3rd Qu.:8.95  
##  Max.   :14.0   Max.   :9.26

summary(data3)

##        x              y        
##  Min.   : 4.0   Min.   : 5.39  
##  1st Qu.: 6.5   1st Qu.: 6.25  
##  Median : 9.0   Median : 7.11  
##  Mean   : 9.0   Mean   : 7.50  
##  3rd Qu.:11.5   3rd Qu.: 7.98  
##  Max.   :14.0   Max.   :12.74

summary(data4)

##        x            y        
##  Min.   : 8   Min.   : 5.25  
##  1st Qu.: 8   1st Qu.: 6.17  
##  Median : 8   Median : 7.04  
##  Mean   : 9   Mean   : 7.50  
##  3rd Qu.: 8   3rd Qu.: 8.19  
##  Max.   :19   Max.   :12.50

The mean of data1 x is 9.00, the mean of data1 y is 7.50. The mean of data2 x is 9.00, the mean of data2 y is 7.50. The mean of data3 x is 9.00, the mean of data3 y is 7.50. The mean of data4 x is 9.00, the mean of data4 y is 7.50.

b. The median (for x and y separately; 1 pt).

summary(data1)

##        x              y        
##  Min.   : 4.0   Min.   : 4.26  
##  1st Qu.: 6.5   1st Qu.: 6.32  
##  Median : 9.0   Median : 7.58  
##  Mean   : 9.0   Mean   : 7.50  
##  3rd Qu.:11.5   3rd Qu.: 8.57  
##  Max.   :14.0   Max.   :10.84

summary(data2)

##        x              y       
##  Min.   : 4.0   Min.   :3.10  
##  1st Qu.: 6.5   1st Qu.:6.70  
##  Median : 9.0   Median :8.14  
##  Mean   : 9.0   Mean   :7.50  
##  3rd Qu.:11.5   3rd Qu.:8.95  
##  Max.   :14.0   Max.   :9.26

summary(data3)

##        x              y        
##  Min.   : 4.0   Min.   : 5.39  
##  1st Qu.: 6.5   1st Qu.: 6.25  
##  Median : 9.0   Median : 7.11  
##  Mean   : 9.0   Mean   : 7.50  
##  3rd Qu.:11.5   3rd Qu.: 7.98  
##  Max.   :14.0   Max.   :12.74

summary(data4)

##        x            y        
##  Min.   : 8   Min.   : 5.25  
##  1st Qu.: 8   1st Qu.: 6.17  
##  Median : 8   Median : 7.04  
##  Mean   : 9   Mean   : 7.50  
##  3rd Qu.: 8   3rd Qu.: 8.19  
##  Max.   :19   Max.   :12.50

The median of data1 x is 9.00, the median of data1 y is 7.58. The median of data2 x is 9.00, the median of data2 y is 8.14. The median of data3 x is 9.00, the median of data3 y is 7.11. The median of data4 x is 8.00, the median of data4 y is 7.04.

c. The standard deviation (for x and y separately; 1 pt).

describe(data1)

##   vars  n mean   sd median trimmed  mad  min   max range  skew kurtosis
## x    1 11  9.0 3.32   9.00    9.00 4.45 4.00 14.00 10.00  0.00    -1.53
## y    2 11  7.5 2.03   7.58    7.49 1.82 4.26 10.84  6.58 -0.05    -1.20
##     se
## x 1.00
## y 0.61

describe(data2)

##   vars  n mean   sd median trimmed  mad min   max range  skew kurtosis
## x    1 11  9.0 3.32   9.00    9.00 4.45 4.0 14.00 10.00  0.00    -1.53
## y    2 11  7.5 2.03   8.14    7.79 1.47 3.1  9.26  6.16 -0.98    -0.51
##     se
## x 1.00
## y 0.61

describe(data3)

##   vars  n mean   sd median trimmed  mad  min   max range skew kurtosis
## x    1 11  9.0 3.32   9.00    9.00 4.45 4.00 14.00 10.00 0.00    -1.53
## y    2 11  7.5 2.03   7.11    7.15 1.53 5.39 12.74  7.35 1.38     1.24
##     se
## x 1.00
## y 0.61

describe(data4)

##   vars  n mean   sd median trimmed mad  min  max range skew kurtosis   se
## x    1 11  9.0 3.32   8.00     8.0 0.0 8.00 19.0 11.00 2.47     4.52 1.00
## y    2 11  7.5 2.03   7.04     7.2 1.9 5.25 12.5  7.25 1.12     0.63 0.61

The standard deviation of data1 x is 3.32, the standard deviation of data1 y is 2.03. The standard deviation of data2 x is 3.32, the standard deviation of data2 y is 2.03. The standard deviation of data3 x is 3.32, the standard deviation of data3 y is 2.03. The standard deviation of data4 x is 3.32, the standard deviation of data4 y is 2.03.
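As an alternative to reading the values off four separate `summary()` and `describe()` printouts, parts a to c can be collected into one table with base R. This is a sketch, not the method used above; the names `datasets` and `stat_table` are introduced here for illustration.

```r
# The four datasets from the exam, gathered in a named list
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
datasets <- list(
  data1 = data.frame(x = x123,
                     y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  data2 = data.frame(x = x123,
                     y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  data3 = data.frame(x = x123,
                     y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  data4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                     y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

# One row per dataset, one column per statistic and variable
stat_table <- t(sapply(datasets, function(d) {
  c(mean_x = mean(d$x),     mean_y = mean(d$y),
    median_x = median(d$x), median_y = median(d$y),
    sd_x = sd(d$x),         sd_y = sd(d$y))
}))

round(stat_table, 2)
```

Putting the statistics side by side makes it easier to see that the means and standard deviations agree across all four datasets while the medians of y do not.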

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

cor_data1 <- cor(data1$x, data1$y)
cor_data2 <- cor(data2$x, data2$y)
cor_data3 <- cor(data3$x, data3$y)
cor_data4 <- cor(data4$x, data4$y)

The correlation of data1 is 0.82. The correlation of data2 is 0.82. The correlation of data3 is 0.82. The correlation of data4 is 0.82.
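The code above stores the correlations without printing them. A short sketch of how the rounded values could be displayed together follows; the `datasets` list and the `correlations` vector are names introduced here for illustration.

```r
# The four datasets from the exam, gathered in a named list
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
datasets <- list(
  data1 = data.frame(x = x123,
                     y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  data2 = data.frame(x = x123,
                     y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  data3 = data.frame(x = x123,
                     y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  data4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                     y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

# Pearson correlation between x and y for each dataset
correlations <- sapply(datasets, function(d) cor(d$x, d$y))

# Rounded to two decimal places, as the question asks
round(correlations, 2)
```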

e. Linear regression equation (2 pts).

lm_data1 <- lm(y ~ x, data=data1)
lm_data2 <- lm(y ~ x, data=data2)
lm_data3 <- lm(y ~ x, data=data3)
lm_data4 <- lm(y ~ x, data=data4)

Linear regression equation for data1: y = 0.5x + 3. Linear regression equation for data2: y = 0.5x + 3. Linear regression equation for data3: y = 0.5x + 3. Linear regression equation for data4: y = 0.5x + 3.
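The fitted coefficients behind these equations can be pulled out of the models with `coef()`. The sketch below refits the four models from a list so it runs on its own; `datasets`, `coefs`, and `equations` are names introduced here for illustration.

```r
# The four datasets from the exam, gathered in a named list
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
datasets <- list(
  data1 = data.frame(x = x123,
                     y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  data2 = data.frame(x = x123,
                     y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  data3 = data.frame(x = x123,
                     y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  data4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                     y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

# Intercept and slope of y ~ x for each dataset (one row per dataset)
coefs <- t(sapply(datasets, function(d) coef(lm(y ~ x, data = d))))
round(coefs, 2)

# The same coefficients written out as equations
equations <- sprintf("%s: y = %.2fx + %.2f",
                     rownames(coefs), coefs[, "x"], coefs[, "(Intercept)"])
equations
```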

f. R-Squared (2 pts).

summary(lm_data1)

## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217

summary(lm_data2)

## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

summary(lm_data3)

## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

summary(lm_data4)

## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

R-squared for data1 is 0.667. R-squared for data2 is 0.666. R-squared for data3 is 0.666. R-squared for data4 is 0.667.
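In a simple linear regression with one predictor, R-squared equals the square of the correlation between x and y, which ties part f back to part d. A sketch of that check follows; `datasets`, `r_squared`, and `r` are names introduced here for illustration.

```r
# The four datasets from the exam, gathered in a named list
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
datasets <- list(
  data1 = data.frame(x = x123,
                     y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  data2 = data.frame(x = x123,
                     y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  data3 = data.frame(x = x123,
                     y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  data4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                     y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

# R-squared taken directly from each model summary
r_squared <- sapply(datasets, function(d) summary(lm(y ~ x, data = d))$r.squared)

# Correlation for each dataset, to compare r^2 with R-squared
r <- sapply(datasets, function(d) cor(d$x, d$y))

round(r_squared, 2)
round(r^2, 2)
all.equal(r_squared, r^2)   # the two agree up to floating-point error
```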

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

To decide whether it is appropriate to estimate a linear regression model, we check each dataset against the following regression conditions: linearity, nearly normal residuals, and constant variability.

Data1

Linearity: We need to check whether the scatter plot displays a roughly linear relationship between x and y. We should also check that the points in the plot of the residuals vs. x are scattered randomly around the y = 0 line, with no pattern.

#scatter plot
plot(data1$x,data1$y, main ='Data1 plot')
abline(lm_data1, col='red')

#Residuals plot
plot(lm_data1$residuals ~ data1$x)
abline(h = 0, lty = 3)

We can see that Data1 shows a linear relationship between x and y: in the residual plot the points are scattered around zero with no pattern, and in the scatter plot the points look fairly linear.

Nearly Normal Residuals: The histogram of the residuals should be nearly normal, or the points on the Q-Q plot should lie along the line.

#histogram of the residuals
hist(lm_data1$residuals)

#normal probability plot of the residuals (Q-Q plot)
qqnorm(lm_data1$residuals)
qqline(lm_data1$residuals)

Based on the residual histogram and the normal Q-Q plot, the histogram is nearly normal and the points on the Q-Q plot lie along the line. The nearly normal residuals condition is therefore satisfied.

Constant variability: The variability of points around the least squares line should be roughly constant. The variability of the residuals around the 0 line should be roughly constant as well.

Looking at the plots under the linearity condition, the scatter plot is roughly tube shaped and the residuals are spread evenly around the zero line, with no pattern or sign that they influence each other.

Conclusion: It is appropriate to estimate a linear regression model for Data1
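The same four diagnostic plots are drawn for every dataset, so they could be wrapped in a small helper. The function below is a hypothetical convenience (the name `check_conditions` and its arguments are not part of the original analysis) that draws the scatter plot, residual plot, residual histogram, and Q-Q plot on one page.

```r
# Hypothetical helper: fits y ~ x and draws the four diagnostic plots
check_conditions <- function(d, label) {
  fit <- lm(y ~ x, data = d)

  old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid of plots
  on.exit(par(old_par))             # restore the layout afterwards

  # Linearity: scatter plot with the least squares line
  plot(d$x, d$y, main = paste(label, "scatter plot"), xlab = "x", ylab = "y")
  abline(fit, col = "red")

  # Linearity and constant variability: residuals vs. x
  plot(d$x, resid(fit), main = paste(label, "residuals vs. x"),
       xlab = "x", ylab = "residual")
  abline(h = 0, lty = 3)

  # Nearly normal residuals: histogram and normal Q-Q plot
  hist(resid(fit), main = paste(label, "residual histogram"), xlab = "residual")
  qqnorm(resid(fit))
  qqline(resid(fit))

  invisible(fit)                    # return the model without printing it
}

data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68))
fit1 <- check_conditions(data1, "Data1")
```

Calling the helper once per dataset would keep the four checks consistent and make the plots easier to compare.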

Data2

Linearity: We need to check whether the scatter plot displays a roughly linear relationship between x and y. We should also check that the points in the plot of the residuals vs. x are scattered randomly around the y = 0 line, with no pattern.

#scatter plot
plot(data2$x,data2$y, main ='Data2 plot')
abline(lm_data2, col='red')

#Residuals plot
plot(lm_data2$residuals ~ data2$x)
abline(h = 0, lty = 3)

We can see that Data2 does not show a linear relationship between x and y: in the residual plot the points are not scattered randomly around zero but follow a clear pattern, and in the scatter plot the points follow a curved, nonlinear pattern.

Nearly Normal Residuals: The histogram of the residuals should be nearly normal, or the points on the Q-Q plot should lie along the line.

#histogram of the residuals
hist(lm_data2$residuals)

#normal probability plot of the residuals (Q-Q plot)
qqnorm(lm_data2$residuals)
qqline(lm_data2$residuals)

Based on the residual histogram and the normal Q-Q plot for Data2, the histogram is not normally distributed and the points on the Q-Q plot do not lie consistently along the line. The nearly normal residuals condition is therefore not satisfied.

Constant variability: The variability of points around the least squares line should be roughly constant. Also variability of residuals around the 0 line should be roughly constant as well.

Looking at the plots under the linearity condition, the points in the scatter plot bend away from the least squares line, and the residuals trace an arch (bell-like) shape across the zero line instead of scattering evenly, a sign that they are not independent of x.

Conclusion: It is NOT appropriate to estimate a linear regression model for Data2
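One way to back up the visual impression of curvature in Data2 is to compare the straight-line fit with a fit that adds a squared term. This goes slightly beyond what the exam asks and is offered only as a sketch; `linear_fit` and `quadratic_fit` are names introduced here.

```r
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74))

# Straight-line model used in the exam
linear_fit <- lm(y ~ x, data = data2)

# Same model with an added squared term to capture curvature
quadratic_fit <- lm(y ~ x + I(x^2), data = data2)

# Share of variability explained by each model
summary(linear_fit)$r.squared
summary(quadratic_fit)$r.squared

# F-test for whether the squared term improves the fit
anova(linear_fit, quadratic_fit)
```

If the quadratic model explains nearly all of the variability while the linear model does not, that supports the conclusion that a straight line is the wrong form for Data2.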

Data3

Linearity: We need to check whether the scatter plot displays a roughly linear relationship between x and y. We should also check that the points in the plot of the residuals vs. x are scattered randomly around the y = 0 line, with no pattern.

#scatter plot
plot(data3$x,data3$y, main ='Data3 plot')
abline(lm_data3, col='red')

#Residuals plot
plot(lm_data3$residuals ~ data3$x)
abline(h = 0, lty = 3)

We can see that Data3 does not fully satisfy the linearity condition: in the residual plot the points are not scattered randomly around zero but follow a pattern, and in the scatter plot the points look fairly linear apart from one large outlier.

Nearly Normal Residuals: The histogram of the residuals should be nearly normal, or the points on the Q-Q plot should lie along the line.

#histogram of the residuals
hist(lm_data3$residuals)

#normal probability plot of the residuals (Q-Q plot)
qqnorm(lm_data3$residuals)
qqline(lm_data3$residuals)

Based on the residual histogram and the normal Q-Q plot for Data3, the histogram is right skewed rather than nearly normal, and the points on the Q-Q plot lie along the line except for one outlier. The nearly normal residuals condition is therefore not satisfied.

Constant variability: The variability of points around the least squares line should be roughly constant. Also variability of residuals around the 0 line should be roughly constant as well.

Looking at the plots under the linearity condition, the scatter plot looks cone shaped because of the outlier, and the residuals fall along a diagonal across the zero line rather than scattering evenly, a sign that they are not independent of x.

Conclusion: It is NOT appropriate to estimate a linear regression model for Data3 unless the outlier is excluded, but it is much better than Data2.
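The effect of the outlier on Data3 can be illustrated by refitting the model without the observation that has the largest residual. This is a sketch for comparison only; whether dropping a point is justified depends on why it is unusual, which the data alone cannot tell us. The names `fit_all`, `outlier_row`, and `fit_trimmed` are introduced here.

```r
data3 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73))

# Fit on all 11 observations, as in the exam
fit_all <- lm(y ~ x, data = data3)

# Row with the largest absolute residual
outlier_row <- which.max(abs(resid(fit_all)))
data3[outlier_row, ]        # the observation at x = 13, y = 12.74

# Refit without that observation
fit_trimmed <- lm(y ~ x, data = data3[-outlier_row, ])

# Compare coefficients and R-squared with and without the outlier
round(coef(fit_all), 2)
round(coef(fit_trimmed), 2)
summary(fit_all)$r.squared
summary(fit_trimmed)$r.squared
```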

Data4

Linearity: We need to check whether the scatter plot displays a roughly linear relationship between x and y. We should also check that the points in the plot of the residuals vs. x are scattered randomly around the y = 0 line, with no pattern.

#scatter plot
plot(data4$x,data4$y, main ='Data4 plot')
abline(lm_data4, col='red')

#Residuals plot
plot(lm_data4$residuals ~ data4$x)
abline(h = 0, lty = 3)

We can see that Data4 does not show a linear relationship between x and y: in the residual plot the points are not scattered randomly around zero but stack in a vertical line at x = 8, and in the scatter plot the points do not look linear because all but one of them share the same x value.

Nearly Normal Residuals: The histogram of the residuals should be nearly normal, or the points on the Q-Q plot should lie along the line.

#histogram of the residuals
hist(lm_data4$residuals)

#normal probability plot of the residuals (Q-Q plot)
qqnorm(lm_data4$residuals)
qqline(lm_data4$residuals)

Based on the residual histogram and the normal Q-Q plot for Data4, the histogram does not look normally distributed, even though the points on the Q-Q plot lie roughly along the line. We therefore treat the nearly normal residuals condition as not satisfied.

Constant variability: The variability of points around the least squares line should be roughly constant. Also variability of residuals around the 0 line should be roughly constant as well.

Looking at the plots under the linearity condition, the points are not spread evenly along the least squares line, and the residuals stack in a single vertical line across the zero line instead of scattering across the range of x.

Conclusion: It is NOT appropriate to estimate a linear regression model for Data4
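The problem with Data4 can also be shown numerically: x takes only two distinct values, so the slope is determined entirely by the single observation at x = 19. The sketch below uses leverage (hat values) to make that visible; `fit4`, `leverage`, and `slope_by_hand` are names introduced here.

```r
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))

# How often each x value occurs
table(data4$x)

fit4 <- lm(y ~ x, data = data4)

# Leverage of each observation; a value of 1 means the fitted line
# is forced to pass through that point
leverage <- hatvalues(fit4)
round(leverage, 2)

# The slope is just the rise from the mean of y at x = 8 to the
# single point at x = 19, divided by the run of 11
slope_by_hand <- (data4$y[data4$x == 19] - mean(data4$y[data4$x == 8])) / (19 - 8)
slope_by_hand
coef(fit4)["x"]
```

Because one observation fixes the slope on its own, the model says nothing reliable about how y changes with x.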

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

It is important to plot the data and check the conditions visually because, according to the summary statistics, the correlation coefficient, and the regression equation, all four datasets look the same and would predict the same y value for a new x value. However, after analyzing the scatter plots and residual plots, we see clearly why it is dangerous to rely on the numbers alone without any visualization. The visualizations helped us determine that only Data1 is appropriate for a linear regression model.
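Plotting the four datasets side by side, each with its least squares line, makes the contrast between identical statistics and very different shapes easy to see. The layout, axis limits, and the names `datasets` and `fits` below are choices made for this sketch.

```r
# The four datasets from the exam, gathered in a named list
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
datasets <- list(
  data1 = data.frame(x = x123,
                     y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  data2 = data.frame(x = x123,
                     y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  data3 = data.frame(x = x123,
                     y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  data4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                     y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

old_par <- par(mfrow = c(2, 2))     # 2 x 2 grid, one panel per dataset

fits <- lapply(names(datasets), function(name) {
  d <- datasets[[name]]
  fit <- lm(y ~ x, data = d)
  # Shared axis limits so the panels are directly comparable
  plot(d$x, d$y, main = name, xlab = "x", ylab = "y",
       xlim = c(3, 20), ylim = c(2, 14))
  abline(fit, col = "red")
  fit
})

par(old_par)                        # restore the previous layout
```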

DATA 606 Fall 2017 - Final Exam