MSDS Spring 2018

DATA 606 Statistics and Probability for Data Analytics

Jiadi Li

Final Exam

Part I

Please put the answers for Part I next to the question number (2pts each):

  1. B. daysDrive
  2. A. mean = 3.3, median = 3.5
  3. D. both studies (a) and (b)
  4. A. there is a difference between average eye color and average hair color
  5. B. 17.8 and 69.0
  6. D. The median and interquartile range are resistant to outliers, whereas the mean and standard deviation are not


7a. Describe the two distributions (2pts).

Figure A: relatively larger spread, unimodal, skewed to the right
Figure B: relatively smaller spread (a sampling distribution), unimodal, symmetrical


7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

The means are similar because each sample is drawn at random from the observed data, so the sample means are centered at the mean of the observations. The standard deviations differ because the spread of a sampling distribution is the standard error of the mean, not the standard deviation of the individual observations; averaging within each sample dampens the influence of extreme values, so the sampling distribution is much narrower.
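
In symbols (a standard result, assuming the observations have mean \(\mu\), standard deviation \(\sigma\), and each sample has size \(n\)): the sampling distribution of the mean is centered at \(\mu\) with standard deviation \(SE = \sigma/\sqrt{n}\), which is much smaller than \(\sigma\) when \(n\) is large.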



7c. What is the statistical principle that describes this phenomenon (2 pts)?

The phenomenon can be described by the Central Limit Theorem.


Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))



For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).
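
A note on formatting (editorial aside): options(digits=2) sets the number of significant digits shown in printed output, not the number of decimal places, so wrapping each statistic in round() is one way to report exactly two decimal places, for example:

# round to two decimal places instead of relying on printed significant digits
round(mean(data1$y), 2)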

#data1
mean(data1$x)
## [1] 9
mean(data1$y)
## [1] 7.5
#data2
mean(data2$x)
## [1] 9
mean(data2$y)
## [1] 7.5
#data3
mean(data3$x)
## [1] 9
mean(data3$y)
## [1] 7.5
#data4
mean(data4$x)
## [1] 9
mean(data4$y)
## [1] 7.5



b. The median (for x and y separately; 1 pt).

#data1
median(data1$x)
## [1] 9
median(data1$y)
## [1] 7.6
#data2
median(data2$x)
## [1] 9
median(data2$y)
## [1] 8.1
#data3
median(data3$x)
## [1] 9
median(data3$y)
## [1] 7.1
#data4
median(data4$x)
## [1] 8
median(data4$y)
## [1] 7



c. The standard deviation (for x and y separately; 1 pt).

#data1
sd(data1$x)
## [1] 3.3
sd(data1$y)
## [1] 2
#data2
sd(data2$x)
## [1] 3.3
sd(data2$y)
## [1] 2
#data3
sd(data3$x)
## [1] 3.3
sd(data3$y)
## [1] 2
#data4
sd(data4$x)
## [1] 3.3
sd(data4$y)
## [1] 2



For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

#data1
cor(data1$x,data1$y)
## [1] 0.82
#data2
cor(data2$x,data2$y)
## [1] 0.82
#data3
cor(data3$x,data3$y)
## [1] 0.82
#data4
cor(data4$x,data4$y)
## [1] 0.82



e. Linear regression equation (2 pts).

#data1
eq1 <- lm(data1$y ~ data1$x)
summary(eq1)
## 
## Call:
## lm(formula = data1$y ~ data1$x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## data1$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
#data2
eq2 <- lm(data2$y ~ data2$x)
summary(eq2)
## 
## Call:
## lm(formula = data2$y ~ data2$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## data2$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
#data3
eq3 <- lm(data3$y ~ data3$x)
summary(eq3)
## 
## Call:
## lm(formula = data3$y ~ data3$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## data3$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
#data4
eq4 <- lm(data4$y ~ data4$x)
summary(eq4)
## 
## Call:
## lm(formula = data4$y ~ data4$x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## data4$x        0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

All four fitted equations are essentially identical: \(\hat{y} = 3 + 0.5x\).
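
A compact way to confirm this (a sketch reusing the models fit above) is to extract the coefficients from all four fits at once:

# collect the fitted models and show their coefficients side by side;
# each column should be approximately (3.0, 0.5)
sapply(list(eq1, eq2, eq3, eq4), coef)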



f. R-Squared (2 pts).

#data1
summary(eq1)$r.squared
## [1] 0.67
#data2
summary(eq2)$r.squared
## [1] 0.67
#data3
summary(eq3)$r.squared
## [1] 0.67
#data4
summary(eq4)$r.squared
## [1] 0.67



For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

#data1
par(mfrow = c(2,3))
plot(data1)
hist(eq1$residuals,main = "Histogram: residuals of eq 1")
qqnorm(eq1$residuals)
qqline(eq1$residuals)

#data2
plot(data2)
hist(eq2$residuals,main = "Histogram: residuals of eq 2")
qqnorm(eq2$residuals)
qqline(eq2$residuals)

Data1: although the scatterplot shows a roughly linear trend, the residuals do not appear normally distributed based on the histogram and normal Q-Q plot.

Data2: the scatterplot has a clear parabolic shape, and the residuals are not normally distributed, so a linear model is not appropriate.

#data3
par(mfrow = c(2,3))
plot(data3)
hist(eq3$residuals,main = "Histogram: residuals of eq 3")
qqnorm(eq3$residuals)
qqline(eq3$residuals)

#data4
plot(data4)
hist(eq4$residuals,main = "Histogram: residuals of eq 4")
qqnorm(eq4$residuals)
qqline(eq4$residuals)

Data3: apart from one outlier, the points fall almost exactly on a single line, and the residuals appear roughly normally distributed, as suggested by the histogram and normal Q-Q plot.

Data4: apart from one outlier, every point has an x value of 8 regardless of its y value, so x is essentially constant and the fitted line is driven by that single point. Even if x were treated as a two-level variable, the residuals are not normally distributed, so a linear regression model should not be applied to this dataset.
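
As an optional numeric complement to these visual checks (a sketch, not required by the question), a Shapiro-Wilk test can be applied to each set of residuals; small p-values indicate departures from normality:

# Shapiro-Wilk normality test on the residuals of each fit
lapply(list(eq1, eq2, eq3, eq4), function(m) shapiro.test(m$residuals))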


Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

The analysis above shows that all four datasets have the same (or nearly the same) means, medians, standard deviations, correlations, regression equations, and R-squared values for their linear models, yet three of the four datasets (datasets 2 and 4 in particular) do not appear to have a linear relationship between the two variables. Summary statistics alone can therefore be misleading; visualization is the most direct way to inspect the data and judge the relationships between variables.
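
One possible combined visualization (a sketch reusing the datasets and models above) is a 2-by-2 grid of scatterplots with each fitted regression line overlaid; nearly identical lines drawn through visibly different point patterns make the limitation of the summary statistics obvious:

# scatterplots of all four datasets with their (nearly identical) fitted lines
par(mfrow = c(2, 2))
plot(data1, main = "data1"); abline(eq1)
plot(data2, main = "data2"); abline(eq2)
plot(data3, main = "data3"); abline(eq3)
plot(data4, main = "data4"); abline(eq4)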