DATA 606 Spring 2017

Answer :

b. daysDrive

Answer :

a. mean = 3.3, median = 3.5

Answer :

d. Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regards to fever in Ebola patients.

Answer :

a. there is a difference between average eye color and average hair color.

Answer :

a. 37.0 and 49.8

(Interestingly, the question is not to find the outliers. It asks for the values the researchers should use to determine the outliers. The outlier cutoffs are calculated from Q1 and Q3 as Q1 - 1.5 IQR and Q3 + 1.5 IQR. With IQR = 49.8 - 37.0 = 12.8, the cutoffs are 17.8 and 69.0.)
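The cutoff arithmetic can be checked with a short sketch. It assumes only the two quartiles quoted in the answer; the variable names are illustrative.

```r
# Quartiles taken from the answer above (37.0 and 49.8).
q1 <- 37.0
q3 <- 49.8

iqr <- q3 - q1                 # interquartile range
lower_fence <- q1 - 1.5 * iqr  # values below this are flagged as outliers
upper_fence <- q3 + 1.5 * iqr  # values above this are flagged as outliers

round(c(lower = lower_fence, upper = upper_fence), 1)
```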

Answer :

d. median and interquartile range; mean and standard deviation

a. Describe the two distributions (2 pts).

Answer :

Distribution A is moderately right-skewed and unimodal, with mean 5.05 and standard deviation 3.22.

The Central Limit Theorem says that if a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model.

So, per the Central Limit Theorem, distribution B is approximately normal, with mean 5.04 and standard deviation 0.58. It is therefore also unimodal.

b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

Answer :

The two distributions have the same central tendency, as expressed by the mean, because both are built from the same underlying data: averaging samples does not shift the center. As stated in (a), distribution B is approximately normal.

They differ in dispersion, as expressed by the standard deviation. Distribution A shows the spread of individual observations, so its skew and extreme values inflate the standard deviation. Distribution B is made of sample means, and averaging dampens extreme values, so its values cluster more tightly around the mean. Its spread is the standard error, σ/√n. Assuming samples of size n = 30, that is 3.22/√30 ≈ 0.59, close to the observed 0.58.

c. What is the statistical principle that describes this phenomenon (2 pts)?

Answer :

The Central Limit Theorem. It says that if a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model.
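A small simulation can illustrate the theorem. This is only a sketch: the gamma population, the seed, the sample size of 30, and the number of replications are all assumptions chosen for illustration, not the exam's data.

```r
set.seed(606)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed population (NOT the exam's distribution A).
population <- rgamma(100000, shape = 2.5, scale = 2)

n <- 30  # assumed sample size
# Draw 5000 samples of size n and keep the mean of each one.
sample_means <- replicate(5000, mean(sample(population, n)))

pop_sd <- sd(population)
round(c(population_mean      = mean(population),
        mean_of_sample_means = mean(sample_means),
        population_sd        = pop_sd,
        sd_of_sample_means   = sd(sample_means),
        predicted_se         = pop_sd / sqrt(n)), 2)
```

The means of the two distributions should agree closely, while the standard deviation of the sample means should be near the population standard deviation divided by √n.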

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

Answer :

mean_data1_x <-format(round(mean(data1$x),2), nsmall = 2)
mean_data1_y <-format(round(mean(data1$y),2), nsmall = 2)
mean_data2_x <-format(round(mean(data2$x),2), nsmall = 2)
mean_data2_y <-format(round(mean(data2$y),2), nsmall = 2)
mean_data3_x <-format(round(mean(data3$x),2), nsmall = 2)
mean_data3_y <-format(round(mean(data3$y),2), nsmall = 2)
mean_data4_x <-format(round(mean(data4$x),2), nsmall = 2)
mean_data4_y <-format(round(mean(data4$y),2), nsmall = 2)

#paste("Mean of data1$x : ", mean_data1_x)
#paste("Mean of data1$y : ", mean_data1_y)
#paste("Mean of data2$x : ", mean_data2_x)
#paste("Mean of data2$y : ", mean_data2_y)
#paste("Mean of data3$x : ", mean_data3_x)
#paste("Mean of data3$y : ", mean_data3_y)
#paste("Mean of data4$x : ", mean_data4_x)
#paste("Mean of data4$y : ", mean_data4_y)

“Mean of data1$x : 9.00”

“Mean of data1$y : 7.50”

“Mean of data2$x : 9.00”

“Mean of data2$y : 7.50”

“Mean of data3$x : 9.00”

“Mean of data3$y : 7.50”

“Mean of data4$x : 9.00”

“Mean of data4$y : 7.50”

b. The median (for x and y separately; 1 pt).

Answer :

med_data1_x <-format(round(median(data1$x),2), nsmall = 2)
med_data1_y <-format(round(median(data1$y),2), nsmall = 2)
med_data2_x <-format(round(median(data2$x),2), nsmall = 2)
med_data2_y <-format(round(median(data2$y),2), nsmall = 2)
med_data3_x <-format(round(median(data3$x),2), nsmall = 2)
med_data3_y <-format(round(median(data3$y),2), nsmall = 2)
med_data4_x <-format(round(median(data4$x),2), nsmall = 2)
med_data4_y <-format(round(median(data4$y),2), nsmall = 2)

#paste("Median of data1$x : ", med_data1_x)
#paste("Median of data1$y : ", med_data1_y)
#paste("Median of data2$x : ", med_data2_x)
#paste("Median of data2$y : ", med_data2_y)
#paste("Median of data3$x : ", med_data3_x)
#paste("Median of data3$y : ", med_data3_y)
#paste("Median of data4$x : ", med_data4_x)
#paste("Median of data4$y : ", med_data4_y)

“Median of data1$x : 9.00”

“Median of data1$y : 7.58”

“Median of data2$x : 9.00”

“Median of data2$y : 8.14”

“Median of data3$x : 9.00”

“Median of data3$y : 7.11”

“Median of data4$x : 8.00”

“Median of data4$y : 7.04”

c. The standard deviation (for x and y separately; 1 pt).

Answer :

sd_data1_x <-format(round(sd(data1$x),2), nsmall = 2)
sd_data1_y <-format(round(sd(data1$y),2), nsmall = 2)
sd_data2_x <-format(round(sd(data2$x),2), nsmall = 2)
sd_data2_y <-format(round(sd(data2$y),2), nsmall = 2)
sd_data3_x <-format(round(sd(data3$x),2), nsmall = 2)
sd_data3_y <-format(round(sd(data3$y),2), nsmall = 2)
sd_data4_x <-format(round(sd(data4$x),2), nsmall = 2)
sd_data4_y <-format(round(sd(data4$y),2), nsmall = 2)

#paste("Standard deviation of data1$x : ", sd_data1_x)
#paste("Standard deviation of data1$y : ", sd_data1_y)
#paste("Standard deviation of data2$x : ", sd_data2_x)
#paste("Standard deviation of data2$y : ", sd_data2_y)
#paste("Standard deviation of data3$x : ", sd_data3_x)
#paste("Standard deviation of data3$y : ", sd_data3_y)
#paste("Standard deviation of data4$x : ", sd_data4_x)
#paste("Standard deviation of data4$y : ", sd_data4_y)

“Standard deviation of data1$x : 3.32”

“Standard deviation of data1$y : 2.03”

“Standard deviation of data2$x : 3.32”

“Standard deviation of data2$y : 2.03”

“Standard deviation of data3$x : 3.32”

“Standard deviation of data3$y : 2.03”

“Standard deviation of data4$x : 3.32”

“Standard deviation of data4$y : 2.03”
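The same three statistics can be computed for every column in one pass. The sketch below assumes that R's built-in anscombe data set (columns x1..x4 and y1..y4) holds the same values as data1 to data4. The values listed above appear to match it, but that is an observation, not something the exam states.

```r
# Assumption: the built-in anscombe data set matches data1..data4
# (x1/y1 = data1, ..., x4/y4 = data4).
data(anscombe)

# One column of results per variable; rows are mean, median, sd.
summary_stats <- sapply(anscombe, function(column) {
  c(mean = mean(column), median = median(column), sd = sd(column))
})

round(summary_stats, 2)
```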

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

Answer :

# Pair each x with the y from the same data set.
cor_data1 <- format(round(cor(data1$x, data1$y), 2), nsmall = 2)
cor_data2 <- format(round(cor(data2$x, data2$y), 2), nsmall = 2)
cor_data3 <- format(round(cor(data3$x, data3$y), 2), nsmall = 2)
cor_data4 <- format(round(cor(data4$x, data4$y), 2), nsmall = 2)

#paste("Correlation of x and y pair in data set data1 : ",cor_data1)
#paste("Correlation of x and y pair in data set data2 : ",cor_data2)
#paste("Correlation of x and y pair in data set data3 : ",cor_data3)
#paste("Correlation of x and y pair in data set data4 : ",cor_data4)

“Correlation of x and y pair in data set data1 : 0.82”

“Correlation of x and y pair in data set data2 : 0.82”

“Correlation of x and y pair in data set data3 : 0.82”

“Correlation of x and y pair in data set data4 : 0.82”
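Computing the correlations in a loop makes it harder to pair an x column with the wrong y column. This is a sketch that again assumes the built-in anscombe data set matches data1 to data4; the names are illustrative.

```r
data(anscombe)  # assumed to match data1..data4 (x1..x4, y1..y4)

# Pair each x column with the y column of the same data set.
correlations <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(correlations) <- paste0("data", 1:4)

round(correlations, 2)
```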

e. Linear regression equation (2 pts).

lm_data1 <- lm(data1$x~data1$y)
lm_data1

## 
## Call:
## lm(formula = data1$x ~ data1$y)
## 
## Coefficients:
## (Intercept)      data1$y  
##      -0.998        1.333

The fitted linear regression equation for data1 (the call above regresses x on y, so x is the response):

x = -0.998 + 1.333y

lm_data2 <- lm(data2$x~data2$y)
lm_data2

## 
## Call:
## lm(formula = data2$x ~ data2$y)
## 
## Coefficients:
## (Intercept)      data2$y  
##      -0.995        1.332

The fitted linear regression equation for data2 (the call above regresses x on y, so x is the response):

x = -0.995 + 1.332y

lm_data3 <- lm(data3$x~data3$y)
lm_data3

## 
## Call:
## lm(formula = data3$x ~ data3$y)
## 
## Coefficients:
## (Intercept)      data3$y  
##       -1.00         1.33

The fitted linear regression equation for data3 (the call above regresses x on y, so x is the response):

x = -1.00 + 1.33y

lm_data4 <- lm(data4$x~data4$y)
lm_data4

## 
## Call:
## lm(formula = data4$x ~ data4$y)
## 
## Coefficients:
## (Intercept)      data4$y  
##       -1.00         1.33

The fitted linear regression equation for data4 (the call above regresses x on y, so x is the response):

x = -1.00 + 1.33y
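The calls above put x on the left of the formula, so they model x as a function of y. If the intended model is the more conventional y as a function of x, the fits might look like the following sketch. It assumes the built-in anscombe data set matches data1 to data4, and the helper names are illustrative.

```r
data(anscombe)  # assumed to match data1..data4 (x1..x4, y1..y4)

# Fit y ~ x separately for each of the four pairs.
fits <- lapply(1:4, function(i) {
  lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
})
names(fits) <- paste0("data", 1:4)

# One row per data set: intercept and slope of y on x.
coefficient_table <- t(sapply(fits, coef))
colnames(coefficient_table) <- c("intercept", "slope")

round(coefficient_table, 2)
```

In this orientation all four pairs should give roughly the same line, with an intercept near 3 and a slope near 0.5.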

f. R-Squared (2 pts).

#str(summary(lm_data1)) 
## adjusted R²
#summary(lm_data4)$adj.r.squared
# R²
r_squared_data1 <- format(round(summary(lm_data1)$r.squared,2), nsmall = 2)
r_squared_data2 <- format(round(summary(lm_data2)$r.squared,2), nsmall = 2)
r_squared_data3 <- format(round(summary(lm_data3)$r.squared,2), nsmall = 2)
r_squared_data4 <- format(round(summary(lm_data4)$r.squared,2), nsmall = 2)

#paste("R-squared of x and y pair in data1" , r_squared_data1 )
#paste("R-squared of x and y pair in data2" , r_squared_data2 )
#paste("R-squared of x and y pair in data3" , r_squared_data3 )
#paste("R-squared of x and y pair in data4" , r_squared_data4 )

“R-squared of x and y pair in data1 0.67”

“R-squared of x and y pair in data2 0.67”

“R-squared of x and y pair in data3 0.67”

“R-squared of x and y pair in data4 0.67”
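For a simple regression with one predictor, R-squared equals the squared correlation, so it is the same whichever variable is treated as the response. A quick check on the first pair, assuming the built-in anscombe data set matches data1:

```r
data(anscombe)  # x1/y1 assumed to match data1

r <- cor(anscombe$x1, anscombe$y1)

# R-squared from both orientations of the simple regression.
r_squared_y_on_x <- summary(lm(y1 ~ x1, data = anscombe))$r.squared
r_squared_x_on_y <- summary(lm(x1 ~ y1, data = anscombe))$r.squared

round(c(r_squared_from_cor = r^2,
        y_on_x = r_squared_y_on_x,
        x_on_y = r_squared_x_on_y), 2)
```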

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Answer :

If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

Data1

require(ggplot2)

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.3.3

ggplot(data1, aes(x = x, y = y)) + geom_point(size = 4, color = "red") + geom_smooth(method = "lm") + labs(x = "X-Values", y = "Y-Values")

# Same orientation as part (e): x regressed on y, so y is the predictor on the horizontal axis.
data1_lm <- lm(x ~ y, data = data1)
data1_res <- resid(data1_lm)
plot(data1$y, data1_res, ylab = "Residuals", xlab = "Y-Value", main = "Residual Plot - Data1")
abline(0, 0)

There is a moderately strong linear relationship. The residuals are randomly dispersed around the horizontal axis, so a linear regression model is appropriate in this case.

Data2

ggplot(data2, aes(x = x, y = y)) + geom_point(size = 4, color = "red") + geom_smooth(method = "lm") + labs(x = "X-Values", y = "Y-Values")

# Same orientation as part (e): x regressed on y, so y is the predictor on the horizontal axis.
data2_lm <- lm(x ~ y, data = data2)
data2_res <- resid(data2_lm)
plot(data2$y, data2_res, ylab = "Residuals", xlab = "Y-Value", main = "Residual Plot - Data2")
abline(0, 0)

There is a strong relationship, but the residuals are NOT randomly dispersed around the horizontal axis: the points form a curve. So a linear regression model is NOT appropriate in this case.

Data3

ggplot(data3, aes(x = x, y = y)) + geom_point(size = 4, color = "red") + geom_smooth(method = "lm") + labs(x = "X-Values", y = "Y-Values")

# Same orientation as part (e): x regressed on y, so y is the predictor on the horizontal axis.
data3_lm <- lm(x ~ y, data = data3)
data3_res <- resid(data3_lm)
plot(data3$y, data3_res, ylab = "Residuals", xlab = "Y-Value", main = "Residual Plot - Data3")
abline(0, 0)

There is a strong relationship, but the residuals are NOT randomly dispersed around the horizontal axis: the points fall along a straight line, because the single outlier at (13, 12.74) pulls the fit away from the other ten points. So a linear regression model is NOT appropriate in this case.

Data4

ggplot(data4, aes(x = x, y = y)) + geom_point(size = 4, color = "red") + geom_smooth(method = "lm") + labs(x = "X-Values", y = "Y-Values")

# Same orientation as part (e): x regressed on y, so y is the predictor on the horizontal axis.
data4_lm <- lm(x ~ y, data = data4)
data4_res <- resid(data4_lm)
plot(data4$y, data4_res, ylab = "Residuals", xlab = "Y-Value", main = "Residual Plot - Data4")
abline(0, 0)

The summary statistics suggest a strong relationship, but the residuals are NOT randomly dispersed around the horizontal axis: the points fall along a straight line. Ten of the eleven points share x = 8, so the fit is driven by the single point at (19, 12.5). So a linear regression model is NOT appropriate in this case.
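For a side-by-side comparison, the four residual plots can be drawn in one grid. The sketch below uses the conventional y on x orientation rather than the x on y fits above, and it assumes the built-in anscombe data set matches data1 to data4. The names are illustrative.

```r
data(anscombe)  # assumed to match data1..data4 (x1..x4, y1..y4)

old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid of panels

residual_list <- lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  res <- resid(lm(y ~ x))        # residuals of y regressed on x
  plot(x, res, xlab = "x", ylab = "Residuals",
       main = paste0("Residual Plot - Data", i))
  abline(h = 0, lty = 2)         # reference line at zero
  res
})

par(old_par)                     # restore the previous layout
```

Random scatter around the dashed line supports a linear model. A curve, a tilted line, or a single column of points does not.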

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

Answer :

With visualizations:

Patterns emerge quickly
Exceptions and outliers are made obvious
Data can be analyzed more quickly over time

Because of the way the human brain processes information, charts and graphs make large amounts of complex data easier to take in than spreadsheets or reports. Data visualization is a quick, easy way to convey concepts in a universal manner. It can also identify areas that need attention or improvement, clarify which factors influence customer behavior, and help you understand which products to place where.

Example: the following visualization of data set data1 shows the spread of the data and the fitted trend, which gives much more clarity when predicting.

ggplot(data1, aes(x = x, y = y)) + geom_point(size = 4, color = "red") + geom_smooth(method = "lm") + labs(x = "X-Values", y = "Y-Values")

DATA 606 Spring 2017 - Final Exam

James Kuruvilla

May 19, 2017
