Part I

Please put the answers for Part I next to the question number (2pts each):

  1. daysDrive is the only variable that is both quantitative and discrete: it is measured numerically, and its possible values are limited to distinct points on a scale.

  2. mean = 3.3, median = 3.5; since the distribution is left skewed, the mean is smaller than the median. Inspecting the histogram also supports a median of 3.5, as 3.8 would be a bit too high.

  3. Answer: d. Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regard to fever in Ebola patients.

  4. Answer: d. Eye color and natural hair color are independent. A larger chi-square value indicates stronger evidence against the null hypothesis, because the observed and expected frequencies are far apart. Therefore answer d is the right choice.
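
As an illustration of how the test behaves (hypothetical counts, not the table from the exam):

# Hypothetical 2x2 table of eye color vs. natural hair color (illustrative only)
tbl <- matrix(c(30, 20, 10, 40), nrow = 2,
              dimnames = list(eye = c("blue", "brown"),
                              hair = c("light", "dark")))
chisq.test(tbl)  # a larger X-squared (smaller p-value) is stronger evidence against independence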

  5. Answer:

min <- 26
Q1 <- 37
median <- 45
Q3 <- 49.8
max <- 65
mean <- 44.4
sd <- 8.4
n <- 50

IQR <- Q3 - Q1

Upper_limit <- Q3 + 1.5 * IQR
Upper_limit
## [1] 69
Lower_Limit <- Q1 - 1.5 * IQR
Lower_Limit
## [1] 18
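
A quick check using the variables above confirms that no observations fall outside these limits:

# 26 >= 17.8 and 65 <= 69, so no observations would be flagged as outliers
min >= Lower_Limit
## [1] TRUE
max <= Upper_limit
## [1] TRUE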

b. The lower and upper limits are 17.8 and 69.0.

  6. Answer: d. median and interquartile range; mean and standard deviation

7a. Describe the two distributions (2pts).

Answer:

Figure A (observations): The distribution is unimodal and right skewed, so the mean is greater than the median. Its spread is wider than that of Figure B.
Figure B (sampling distribution): The distribution is unimodal and fairly symmetric. Its spread is narrower because each value is the mean of a sample of size 30.

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

The means of the two distributions are similar because the sampling distribution of the sample mean is centered at the population mean. The standard deviations differ because the standard deviation of the sampling distribution (the standard error) is the population standard deviation divided by the square root of the sample size, making it much smaller than the spread of the individual observations.

7c. What is the statistical principle that describes this phenomenon (2 pts)?

The Central Limit Theorem. It states that, for a sufficiently large sample size, the sampling distribution of the sample mean is approximately normal, centered at the population mean, with standard deviation equal to the population standard deviation divided by \(\sqrt{n}\).
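
A small simulation illustrates this; a sketch, with an arbitrary right-skewed population and seed:

# Draw many samples of size 30 and compare the population to the distribution of sample means
set.seed(42)
population <- rexp(100000, rate = 1)   # right-skewed population with mean = 1 and sd = 1
sample.means <- replicate(5000, mean(sample(population, 30)))

mean(population); mean(sample.means)   # the two means are very close
sd(population); sd(sample.means)       # the sd of the sample means is much smaller
sd(population) / sqrt(30)              # and is approximately sigma / sqrt(n)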

Part II

Consider the four datasets, each with two columns (x and y), provided below. Be sure to replace the NA with your answer for each part (e.g. assign the mean of x for data1 to the data1.x.mean variable). When you Knit your answer document, a table will be generated with all the answers.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

data1.x.mean <- round(mean(data1$x),2)
data1.y.mean <- round(mean(data1$y),2)
data2.x.mean <- round(mean(data2$x),2)
data2.y.mean <- round(mean(data2$y),2)
data3.x.mean <- round(mean(data3$x),2)
data3.y.mean <- round(mean(data3$y),2)
data4.x.mean <- round(mean(data4$x),2)
data4.y.mean <- round(mean(data4$y),2)

data1.x.mean; data1.y.mean; data2.x.mean; data2.y.mean;
## [1] 9
## [1] 7.5
## [1] 9
## [1] 7.5
data3.x.mean; data3.y.mean; data4.x.mean; data4.y.mean
## [1] 9
## [1] 7.5
## [1] 9
## [1] 7.5

b. The median (for x and y separately; 1 pt).

data1.x.median <- median(data1$x)
data1.y.median <- median(data1$y)
data2.x.median <- median(data2$x)
data2.y.median <- median(data2$y)
data3.x.median <- median(data3$x)
data3.y.median <- median(data3$y)
data4.x.median <- median(data4$x)
data4.y.median <- median(data4$y)

data1.x.median; data1.y.median; data2.x.median; data2.y.median;
## [1] 9
## [1] 7.6
## [1] 9
## [1] 8.1
data3.x.median; data3.y.median; data4.x.median; data4.y.median
## [1] 9
## [1] 7.1
## [1] 8
## [1] 7

c. The standard deviation (for x and y separately; 1 pt).

data1.x.sd <- sd(data1$x)
data1.y.sd <- sd(data1$y)
data2.x.sd <- sd(data2$x)
data2.y.sd <- sd(data2$y)
data3.x.sd <- sd(data3$x)
data3.y.sd <- sd(data3$y)
data4.x.sd <- sd(data4$x)
data4.y.sd <- sd(data4$y)

data1.x.sd; data1.y.sd; data2.x.sd; data2.y.sd; data3.x.sd;
## [1] 3.3
## [1] 2
## [1] 3.3
## [1] 2
## [1] 3.3
data3.y.sd; data4.x.sd; data4.y.sd
## [1] 2
## [1] 3.3
## [1] 2
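
The repeated calls above can also be written more compactly; a sketch using the four data frames already defined:

# Compute all six summary statistics for each dataset in one pass
datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)
sapply(datasets, function(d) {
  c(x.mean = mean(d$x), y.mean = mean(d$y),
    x.median = median(d$x), y.median = median(d$y),
    x.sd = sd(d$x), y.sd = sd(d$y))
})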

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

# cor() on the two columns returns the single correlation coefficient;
# cor(data1) would return a 2x2 correlation matrix instead.
data1.correlation <- cor(data1$x, data1$y)
data2.correlation <- cor(data2$x, data2$y)
data3.correlation <- cor(data3$x, data3$y)
data4.correlation <- cor(data4$x, data4$y)

data1.correlation; data2.correlation; data3.correlation; data4.correlation
## [1] 0.82
## [1] 0.82
## [1] 0.82
## [1] 0.82
par(mfrow=c(2,2))
plot(data1)
title(main = "data1")
plot(data2)
title(main = "data2")
plot(data3)
title(main = "data3")
plot(data4)
title(main = "data4")

e. Linear regression equation (2 pts).

\(Slope = r\frac{s_y}{s_x}\)

\(r\) = correlation coefficient between x and y

\(s_y\) = standard deviation of y

\(s_x\) = standard deviation of x

\(Intercept = \overline{y}-b_1\overline{x}\)

\(\overline{y}\) = mean of \(y\)

\(\overline{x}\) = mean of \(x\)

\(b_1\) = slope

data1.slope <- data1.correlation*((data1.y.sd)/(data1.x.sd))
data2.slope <- data2.correlation*((data2.y.sd)/(data2.x.sd))
data3.slope <- data3.correlation*((data3.y.sd)/(data3.x.sd))
data4.slope <- data4.correlation*((data4.y.sd)/(data4.x.sd))


data1.intercept <- (data1.y.mean)-(data1.slope*data1.x.mean)
data2.intercept <- (data2.y.mean)-(data2.slope*data2.x.mean)
data3.intercept <- (data3.y.mean)-(data3.slope*data3.x.mean)
data4.intercept <- (data4.y.mean)-(data4.slope*data4.x.mean)
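
As a sanity check, the hand-computed coefficients for data1 can be compared against the lm() fit below:

# Should be approximately 0.50 (slope) and 3.00 (intercept), matching summary(data1.lm)
data1.slope; data1.intercept
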
#data 1 slope and intercept
data1.lm <- lm(y~x, data=data1)
summary(data1.lm)
## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217

\(\hat{y}_1 = 3.000 + 0.500x\)

par(mfrow=c(2,2))
plot(data1.lm)

#data2 slope and intercept
data2.lm <- lm(y~x, data=data2)
summary(data2.lm)
## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

\(\hat{y}_2 = 3.001 + 0.500x\)

par(mfrow=c(2,2))
plot(data2.lm)

#data3 slope and intercept
data3.lm <- lm(y~x, data=data3)
summary(data3.lm)
## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

\(\hat{y}_3 = 3.002 + 0.500x\)

par(mfrow=c(2,2))
plot(data3.lm)

#data4 slope and intercept
data4.lm <- lm(y~x, data=data4)
summary(data4.lm)
## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

\(\hat{y}_4 = 3.002 + 0.500x\)

par(mfrow=c(2,2))
plot(data4.lm)
## Warning: not plotting observations with leverage one:
##   8

## Warning: not plotting observations with leverage one:
##   8

f. R-Squared (2 pts).

data1.rsquared <- summary(data1.lm)$r.squared
data2.rsquared <- summary(data2.lm)$r.squared
data3.rsquared <- summary(data3.lm)$r.squared
data4.rsquared <- summary(data4.lm)$r.squared

data1.rsquared;
## [1] 0.67
data2.rsquared;
## [1] 0.67
data3.rsquared;
## [1] 0.67
data4.rsquared
## [1] 0.67

\(R^2 \approx 0.67\) for all four datasets.
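
For a simple linear regression, \(R^2\) is just the square of the correlation coefficient, which gives a quick way to verify these values:

# About 0.67 (0.82^2), equal to summary(data1.lm)$r.squared
data1.correlation^2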

g. For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Conditions to check for a linear regression model:

- Linearity
- Nearly normal residuals
- Constant variability
- Independent observations (an assumption we can reasonably make here)

Plots for data1

par(mfrow=c(2,2))
plot(data1$x, data1$y)
hist(data1.lm$residuals)
qqnorm(data1.lm$residuals)
qqline(data1.lm$residuals)

The scatterplot for data1 shows a reasonably linear relationship, the Q-Q plot indicates fairly normal residuals, and the histogram of residuals is roughly symmetric; therefore a linear regression model is appropriate for data1.

Plots for data2

par(mfrow=c(2,2))
plot(data2$x, data2$y)
hist(data2.lm$residuals)
qqnorm(data2.lm$residuals)
qqline(data2.lm$residuals)

From the plots above we can see that data2 follows a curve rather than a straight line, and the residuals do not appear to follow a normal distribution, so a linear model is not appropriate.

Plots for data3

par(mfrow=c(2,2))
plot(data3$x, data3$y)
hist(data3.lm$residuals)
qqnorm(data3.lm$residuals)
qqline(data3.lm$residuals)

For data3, the data follow a clearly linear trend except for a single outlier. The residuals otherwise look roughly normal, but the influential outlier pulls the fitted line away from the rest of the points, so it should be investigated before relying on a linear model.

Plots for data4

par(mfrow=c(2,2))
plot(data4$x, data4$y)
hist(data4.lm$residuals)
qqnorm(data4.lm$residuals)
qqline(data4.lm$residuals)

For data4 there is a single high-leverage point (observation 8, the only one with a different x value), which by itself determines the slope; this is why R warned above that it was not plotting the observation with leverage one. The residuals also do not appear normal, so a linear model is not appropriate.

h. Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

Visualization plays a very important role when analyzing data. Plots help us spot outliers, check model assumptions, and draw sound conclusions and predictions from a dataset.

The plots from the models above are a good example: all four datasets have nearly identical means, standard deviations, correlations, regression equations, and \(R^2\) values, yet their scatterplots reveal four very different structures. Only the visualizations make it clear which datasets are actually suitable for a linear model.
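
Overlaying the fitted regression line on each scatterplot makes the contrast explicit; a sketch reusing the models fit above:

# Four nearly identical fitted lines over four very different point patterns
par(mfrow = c(2, 2))
models <- list(data1.lm, data2.lm, data3.lm, data4.lm)
datasets <- list(data1, data2, data3, data4)
for (i in 1:4) {
  plot(datasets[[i]]$x, datasets[[i]]$y, xlab = "x", ylab = "y",
       main = paste0("data", i))
  abline(models[[i]], col = "red")
}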