CUNY DATA606 - Final Exam


Q1

Answer -> b. daysDrive (a quantitative, discrete variable)

  • Car is a qualitative (categorical) variable with numerical labels -> rejected
  • Color is a qualitative (categorical) variable -> rejected
  • GasMonth is a quantitative, continuous variable -> rejected

Q2

A histogram of the GPA of 132 students from the Fall 2012 class of this course is presented below. Which estimates of the mean and median are most plausible?

Answer -> a. mean = 3.3, median = 3.5

The distribution above is left skewed, so the mean is smaller than the median. The middle value is around the 66th ordered observation, since (132 + 1)/2 = 66.5. Judging from the density in the histogram, this middle observation falls close to a GPA of 3.5, which makes 3.5 the plausible median; a median of 3.8 would be too high.
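This can be illustrated with a quick simulation; a minimal sketch, assuming an arbitrary left-skewed distribution (not the actual GPA data):

set.seed(1)
gpa <- 4 - rexp(132, rate = 4)   # left-skewed values bounded above by 4.0
mean(gpa) < median(gpa)          # TRUE: the mean is pulled below the median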


Q3

Answer -> d. Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regards to fever in Ebola patients.


Q4

Answer -> a. there is a difference between average eye color and average hair color.

The study above involves two categorical variables measured on a large population. The counts for these two variables can be laid out in the following contingency table:

##        brunette  blond  red
## blue      n1       n2    n3
## green     n4       n5    n6
## brown     n7       n8    n9

The hypothesis test can be set up as follows:

\(H_0:\) There is no association between natural hair color and eye color.

\(H_A:\) There is an association between natural hair color and eye color.

A large chi-square statistic would provide strong evidence in favor of the alternative hypothesis. Therefore the answer is -> a. there is a difference between average eye color and average hair color.
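Since the counts \(n_1, \dots, n_9\) are not given, here is a minimal sketch of the corresponding test in R, with hypothetical counts:

# Hypothetical counts standing in for n1..n9 (not given in the problem)
hair_eye <- matrix(c(20, 30, 15,
                     15, 10, 10,
                     40, 10, 12),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(eye  = c("blue", "green", "brown"),
                                   hair = c("brunette", "blond", "red")))
chisq.test(hair_eye)  # a large X-squared (small p-value) favors H_A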


Q5

Answer -> b. 17.8 and 69.0

\(IQR = Q_3 - Q_1 = 49.8 - 37 = 12.8\)

\(\text{Lower fence} = Q_1 - 1.5 \times IQR = 37 - 1.5 \times 12.8 = 17.8\)
\(\text{Upper fence} = Q_3 + 1.5 \times IQR = 49.8 + 1.5 \times 12.8 = 69.0\)
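These fences can be verified with a quick R sketch, using the quartiles given in the problem:

q1 <- 37
q3 <- 49.8
iqr <- q3 - q1               # 12.8
c(lower = q1 - 1.5 * iqr,    # 17.8
  upper = q3 + 1.5 * iqr)    # 69.0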


Q6

Answer -> d. The median and interquartile range are resistant to outliers, whereas the mean and standard deviation are not.
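This is easy to demonstrate with made-up data; a minimal sketch:

x     <- 1:10                  # baseline data
x_out <- c(1:9, 100)           # same data with one extreme outlier
mean(x);   mean(x_out)         # 5.5 vs 14.5: the mean is pulled by the outlier
median(x); median(x_out)       # 5.5 vs 5.5: the median is unchanged
sd(x);     sd(x_out)           # ~3.03 vs ~30: the sd is inflated
IQR(x);    IQR(x_out)          # 4.5 vs 4.5: the IQR is unchanged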


Q7

Part I

A. Observations

\(\mu = 5.05\), \(\sigma = 3.22\), \(n = 500\)

B. Sampling Distribution

\(\mu_{\bar{x}} = 5.04\), \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = 0.58\), \(n = 30\)

a. Describe the two distributions (2 pts).

  • Both A. Observations and B. Sampling Distribution are unimodal.
  • A. Observations is skewed to the right.
  • B. Sampling Distribution looks normally distributed.
  • In A. Observations the median is lower than the mean, as expected for a right-skewed distribution.

b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts)

  • B. Sampling Distribution holds the means of 500 random samples of size 30 drawn from A, so by the Central Limit Theorem we expect its mean to be similar to the mean of the original population A. Its standard deviation, however, describes the typical deviation of a sample mean from the true population mean. This quantity is called the standard error and is given by \(SE = \frac{\sigma}{\sqrt{n}} = 0.58\) (see the simulation sketch below).
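A minimal simulation sketch of this phenomenon, assuming a right-skewed gamma population as a stand-in (the true population in A is not specified) with the parameters given above:

set.seed(606)
shape <- (5.05 / 3.22)^2                 # gamma parameters chosen so that
rate  <- 5.05 / 3.22^2                   # mu = 5.05 and sigma = 3.22
xbars <- replicate(500, mean(rgamma(30, shape = shape, rate = rate)))
mean(xbars)   # close to the population mean 5.05
sd(xbars)     # close to sigma / sqrt(30), roughly 0.58-0.59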

c. What is the statistical principle that describes this phenomenon (2 pts)?

  • The Central Limit Theorem.

Part II

Data Sets:

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))

data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))

data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))

data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

d1mean.x <- round(mean(data1$x),2)
d1mean.y <- round(mean(data1$y),2)

d2mean.x <- round(mean(data2$x),2)
d2mean.y <- round(mean(data2$y),2)

d3mean.x<- round(mean(data3$x),2)
d3mean.y <- round(mean(data3$y),2)

d4mean.x <- round(mean(data4$x),2)
d4mean.y <- round(mean(data4$y),2)
Data     Mean x   Mean y
data1       9      7.5
data2       9      7.5
data3       9      7.5
data4       9      7.5
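The same table can be produced more compactly; a sketch using the data frames defined above:

datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)
sapply(datasets, function(d) round(colMeans(d), 2))  # means of x and y per data set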

b. The median (for x and y separately; 1 pt).

d1med.x <- round(median(data1$x),2)
d1med.y <- round(median(data1$y),2)

d2med.x <- round(median(data2$x),2)
d2med.y <- round(median(data2$y),2)

d3med.x<- round(median(data3$x),2)
d3med.y <- round(median(data3$y),2)

d4med.x <- round(median(data4$x),2)
d4med.y <- round(median(data4$y),2)
Data     Median x   Median y
data1       9         7.58
data2       9         8.14
data3       9         7.11
data4       8         7.04

c. The standard deviation (for x and y separately; 1 pt).

d1sd.x <- round(sd(data1$x),2)
d1sd.y <- round(sd(data1$y),2)

d2sd.x <- round(sd(data2$x),2)
d2sd.y <- round(sd(data2$y),2)

d3sd.x <- round(sd(data3$x),2)
d3sd.y <- round(sd(data3$y),2)

d4sd.x <- round(sd(data4$x),2)
d4sd.y <- round(sd(data4$y),2)
Data     SD x    SD y
data1    3.32    2.03
data2    3.32    2.03
data3    3.32    2.03
data4    3.32    2.03

summary(data1)
##        x              y       
##  Min.   : 4.0   Min.   : 4.3  
##  1st Qu.: 6.5   1st Qu.: 6.3  
##  Median : 9.0   Median : 7.6  
##  Mean   : 9.0   Mean   : 7.5  
##  3rd Qu.:11.5   3rd Qu.: 8.6  
##  Max.   :14.0   Max.   :10.8
summary(data2)
##        x              y      
##  Min.   : 4.0   Min.   :3.1  
##  1st Qu.: 6.5   1st Qu.:6.7  
##  Median : 9.0   Median :8.1  
##  Mean   : 9.0   Mean   :7.5  
##  3rd Qu.:11.5   3rd Qu.:8.9  
##  Max.   :14.0   Max.   :9.3
summary(data3)
##        x              y       
##  Min.   : 4.0   Min.   : 5.4  
##  1st Qu.: 6.5   1st Qu.: 6.2  
##  Median : 9.0   Median : 7.1  
##  Mean   : 9.0   Mean   : 7.5  
##  3rd Qu.:11.5   3rd Qu.: 8.0  
##  Max.   :14.0   Max.   :12.7
summary(data4)
##        x            y       
##  Min.   : 8   Min.   : 5.2  
##  1st Qu.: 8   1st Qu.: 6.2  
##  Median : 8   Median : 7.0  
##  Mean   : 9   Mean   : 7.5  
##  3rd Qu.: 8   3rd Qu.: 8.2  
##  Max.   :19   Max.   :12.5

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

par(mfrow=c(2,2))
plot(data1,main = "data1")
plot(data2,main = "data2")
plot(data3,main = "data3")
plot(data4,main = "data4")

Data1

#plot(data1)
round(cor(data1),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00

Data2

#plot(data2)
round(cor(data2),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00

Data3

#plot(data3)
round(cor(data3),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00

Data4

#plot(data4)
round(cor(data4),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00

e. Linear regression equation (2 pts).

Data1

lm1<-lm(y~x,data=data1)
summary(lm1)
## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
par(mfrow=c(2,2))
plot(lm1)

Data2

lm2<-lm(y~x,data=data2)
summary(lm2)
## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
par(mfrow=c(2,2))
plot(lm2)

Data3

lm3<-lm(y~x,data=data3)
summary(lm3)
## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
par(mfrow=c(2,2))
plot(lm3)

Data4

lm4<-lm(y~x,data=data4)
summary(lm4)
## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216
par(mfrow=c(2,2))
plot(lm4)
## Warning: not plotting observations with leverage one:
##   8

## Warning: not plotting observations with leverage one:
##   8

The linear regression equations are:

\(\hat{y}_1 = 3 + 0.5x\)
\(\hat{y}_2 = 3 + 0.5x\)
\(\hat{y}_3 = 3 + 0.5x\)
\(\hat{y}_4 = 3 + 0.5x\)
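These equations can also be read directly from the fitted models; as a quick sanity check, the rounded coefficients are identical across all four fits:

round(coef(lm1), 2)  # (Intercept) = 3.0, x = 0.5; lm2, lm3, and lm4 match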

par(mfrow=c(2,2))
plot(data1$y ~ data1$x )
abline(lm1)
plot(data2$y ~ data2$x)
abline(lm2)
plot(data3$y ~ data3$x)
abline(lm3)
plot(data4$y ~ data4$x)
abline(lm4)

f. R-Squared (2 pts).

round(summary(lm1)$r.squared,2)
## [1] 0.67
round(summary(lm2)$r.squared,2)
## [1] 0.67
round(summary(lm3)$r.squared,2)
## [1] 0.67
round(summary(lm4)$r.squared,2)
## [1] 0.67

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Data1

par(mfrow=c(2,2))
plot(data1, main = "Data1") 
hist(lm1$residuals)
qqnorm(lm1$residuals)
qqline(lm1$residuals)

  • data1 shows a linear trend, although the data themselves do not follow a normal distribution.
  • For the most part the distribution of the residuals is nearly normal, so a linear regression model is appropriate here.

Data2

par(mfrow=c(2,2))
plot(data2, main = "Data2") 
hist(lm2$residuals)
qqnorm(lm2$residuals)
qqline(lm2$residuals)

  • data2 shows a clear relationship between x and y, but it is curved rather than linear.
  • The distribution of the residuals is not nearly normal, so a linear regression model is not appropriate.

Data3

par(mfrow=c(2,2))
plot(data3, main = "Data3") 
hist(lm3$residuals)
qqnorm(lm3$residuals)
qqline(lm3$residuals)

  • data3 has a very linear relationship, but contains one outlier that pulls the fitted line away from the otherwise tight trend.
  • The residuals are nearly normal apart from that single outlier, so a linear model is reasonable once the outlier is examined.

Data4

par(mfrow=c(2,2))
plot(data4, main = "Data4") 
hist(lm4$residuals)
qqnorm(lm4$residuals)
qqline(lm4$residuals)

  • In data4, x is constant (x = 8) for every observation except one extreme, high-leverage outlier (x = 19), and that single point completely determines the fitted line, so a linear regression model is not appropriate (see the leverage check below).
  • Apart from that point, the distribution of the residuals is for the most part nearly normal.
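The leverage values make this explicit; as the warnings above already indicate, observation 8 has leverage one:

round(hatvalues(lm4), 2)  # observation 8 has leverage 1; every other point has 0.1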

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (4 pts)

  • Visualizations are very important because they help identify outliers and trends and make it easy to compare the behavior of different data sets. These four data sets (the classic Anscombe quartet) have nearly identical means, medians, standard deviations, correlations, R-squared values, and regression equations, yet their scatterplots reveal completely different structures. Summary statistics alone cannot distinguish them, so plotting is a must here; the relevant scatterplots and diagnostic plots are included above.