Please put the answers for Part I next to the question number (2pts each):
daysDrive is the only variable that is both quantitative and discrete, since it can be measured numerically but takes only distinct values on a scale.
mean = 3.3, median = 3.5. Since the distribution is left skewed, the mean is smaller than the median. From the histogram we can also conclude that the median is 3.5, since 3.8 would be a bit too high.
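As a quick sanity check (hypothetical ratings, not the data behind the histogram in the question), a left-skewed sample has its mean pulled below its median by the long left tail:
ratings <- c(1.0, 2.0, 3.0, 3.5, 3.5, 4.0, 4.0, 4.0)  # hypothetical left-skewed sample
mean(ratings)    # 3.125 -- dragged down by the left tail
median(ratings)  # 3.5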
Answer: d. Both studies (a) and (b) can be conducted in order to establish that the treatment does indeed cause improvement with regard to fever in Ebola patients.
Answer: d. eye color and natural hair color are independent. A larger chi-square value indicates stronger evidence against the null hypothesis, because the observed and expected frequencies are far apart. Therefore answer d is the right choice.
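As an illustration (hypothetical counts, not the table from the question), the chi-square statistic grows as the observed counts move away from the counts expected under independence:
tbl <- matrix(c(30, 10,
                10, 30), nrow = 2, byrow = TRUE)  # hypothetical eye/hair counts
chisq.test(tbl)$statistic  # large value -> observed far from expected under independence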
Answer:
# summary statistics given in the problem
min <- 26
Q1 <- 37
median <- 45
Q3 <- 49.8
max <- 65
mean <- 44.4
sd <- 8.4
n <- 50
# 1.5 * IQR rule for suspected outliers
IQR <- Q3 - Q1  # 49.8 - 37 = 12.8
Upper_limit <- Q3 + 1.5 * IQR
Upper_limit
## [1] 69
Lower_Limit <- Q1 - 1.5 * IQR
Lower_Limit
## [1] 18
b. 17.8 and 69.0
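Since the reported minimum (26) and maximum (65) both lie inside these fences, a quick check with the variables defined above confirms that the 1.5 * IQR rule flags no suspected outliers:
min < Lower_Limit  # FALSE: 26 > 17.8
max > Upper_limit  # FALSE: 65 < 69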
7a. Describe the two distributions (2pts).
Answer:
Figure A: The distribution is unimodal and right skewed, so the mean is greater than the median. Its spread is wider than the spread of Figure B.
Figure B (sampling distribution): The distribution is unimodal and fairly symmetric. With a sample size of 30, its spread is narrower than that of Figure A.
7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).
The means of the two distributions are similar because distribution B is built from sample means of draws from A, and the mean of the sample means equals the population mean. The standard deviations differ because the standard deviation of a sampling distribution (the standard error) is the population standard deviation divided by \(\sqrt{n}\), so distribution B is much narrower than A.
7c. What is the statistical principle that describes this phenomenon (2 pts)?
The Central Limit Theorem. It states that, for a sufficiently large sample size, the distribution of sample means is approximately normal and centered at the population mean, with standard deviation equal to the population standard deviation divided by \(\sqrt{n}\).
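A minimal simulation sketch of this (the population mean of 50 and standard deviation of 10 are assumed values for illustration; n = 30 as in Figure B): the sample means center on the population mean, while their standard deviation shrinks to \(\sigma/\sqrt{n}\).
set.seed(1)
pop_mean <- 50; pop_sd <- 10; n_samp <- 30  # assumed values for illustration
sample_means <- replicate(10000, mean(rnorm(n_samp, pop_mean, pop_sd)))
mean(sample_means)   # close to 50, the population mean
sd(sample_means)     # close to 10 / sqrt(30) = 1.83, much smaller than 10
pop_sd / sqrt(n_samp)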
Consider the four datasets, each with two columns (x and y), provided below. Be sure to replace the NA with your answer for each part (e.g. assign the mean of x for data1 to the data1.x.mean variable). When you Knit your answer document, a table will be generated with all the answers.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
data1.x.mean <- round(mean(data1$x),2)
data1.y.mean <- round(mean(data1$y),2)
data2.x.mean <- round(mean(data2$x),2)
data2.y.mean <- round(mean(data2$y),2)
data3.x.mean <- round(mean(data3$x),2)
data3.y.mean <- round(mean(data3$y),2)
data4.x.mean <- round(mean(data4$x),2)
data4.y.mean <- round(mean(data4$y),2)
data1.x.mean; data1.y.mean; data2.x.mean; data2.y.mean;
## [1] 9
## [1] 7.5
## [1] 9
## [1] 7.5
data3.x.mean; data3.y.mean; data4.x.mean; data4.y.mean
## [1] 9
## [1] 7.5
## [1] 9
## [1] 7.5
data1.x.median <- median(data1$x)
data1.y.median <- median(data1$y)
data2.x.median <- median(data2$x)
data2.y.median <- median(data2$y)
data3.x.median <- median(data3$x)
data3.y.median <- median(data3$y)
data4.x.median <- median(data4$x)
data4.y.median <- median(data4$y)
data1.x.median; data1.y.median; data2.x.median; data2.y.median;
## [1] 9
## [1] 7.6
## [1] 9
## [1] 8.1
data3.x.median; data3.y.median; data4.x.median; data4.y.median
## [1] 9
## [1] 7.1
## [1] 8
## [1] 7
data1.x.sd <- sd(data1$x)
data1.y.sd <- sd(data1$y)
data2.x.sd <- sd(data2$x)
data2.y.sd <- sd(data2$y)
data3.x.sd <- sd(data3$x)
data3.y.sd <- sd(data3$y)
data4.x.sd <- sd(data4$x)
data4.y.sd <- sd(data4$y)
data1.x.sd; data1.y.sd; data2.x.sd; data2.y.sd; data3.x.sd;
## [1] 3.3
## [1] 2
## [1] 3.3
## [1] 2
## [1] 3.3
data3.y.sd; data4.x.sd; data4.y.sd
## [1] 2
## [1] 3.3
## [1] 2
data1.correlation <- cor(data1)
data2.correlation <- cor(data2)
data3.correlation <- cor(data3)
data4.correlation <- cor(data4)
data1.correlation; data2.correlation; data3.correlation; data4.correlation
## x y
## x 1.00 0.82
## y 0.82 1.00
## x y
## x 1.00 0.82
## y 0.82 1.00
## x y
## x 1.00 0.82
## y 0.82 1.00
## x y
## x 1.00 0.82
## y 0.82 1.00
par(mfrow=c(2,2))
plot(data1)
title(main = "data1")
plot(data2)
title(main = "data2")
plot(data3)
title(main = "data3")
plot(data4)
title(main = "data4")
\(Slope = b_1 = r\frac{s_y}{s_x}\)
\(r\) = correlation coefficient between x and y
\(s_y\) = standard deviation of y
\(s_x\) = standard deviation of x
\(Intercept = \overline{y}-b_1\overline{x}\)
\(\overline{y}\) = mean of \(y\)
\(\overline{x}\) = mean of \(x\)
\(b_1\) = slope
# cor() above returned 2x2 correlation matrices; take the x-y element [1, 2]
# so the slope is a scalar rather than a matrix
data1.slope <- data1.correlation[1, 2] * (data1.y.sd / data1.x.sd)
data2.slope <- data2.correlation[1, 2] * (data2.y.sd / data2.x.sd)
data3.slope <- data3.correlation[1, 2] * (data3.y.sd / data3.x.sd)
data4.slope <- data4.correlation[1, 2] * (data4.y.sd / data4.x.sd)
data1.intercept <- (data1.y.mean)-(data1.slope*data1.x.mean)
data2.intercept <- (data2.y.mean)-(data2.slope*data2.x.mean)
data3.intercept <- (data3.y.mean)-(data3.slope*data3.x.mean)
data4.intercept <- (data4.y.mean)-(data4.slope*data4.x.mean)
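As a consistency check (a quick sketch, not part of the assignment's required output), the hand-computed slope and intercept should agree with the coefficients from lm():
c(intercept = data1.intercept, slope = data1.slope)  # approximately 3.0 and 0.5
coef(lm(y ~ x, data = data1))                        # same values from lm()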
#data1 slope and intercept
data1.lm <- lm(y~x, data=data1)
summary(data1.lm)
##
## Call:
## lm(formula = y ~ x, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9213 -0.4558 -0.0414 0.7094 1.8388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000 1.125 2.67 0.0257 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
\(\hat{y}_1 = 3.000 + 0.500x\)
par(mfrow=c(2,2))
plot(data1.lm)
#data2 slope and intercept
data2.lm <- lm(y~x, data=data2)
summary(data2.lm)
##
## Call:
## lm(formula = y ~ x, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.901 -0.761 0.129 0.949 1.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.67 0.0258 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
\(\hat{y}_2 = 3.001 + 0.500x\)
par(mfrow=c(2,2))
plot(data2.lm)
#data3 slope and intercept
data3.lm <- lm(y~x, data=data3)
summary(data3.lm)
##
## Call:
## lm(formula = y ~ x, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.159 -0.615 -0.230 0.154 3.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
\(\hat{y}_3 = 3.002 + 0.500x\)
par(mfrow=c(2,2))
plot(data3.lm)
#data4 slope and intercept
data4.lm <- lm(y~x, data=data4)
summary(data4.lm)
##
## Call:
## lm(formula = y ~ x, data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
\(\hat{y}_4 = 3.002 + 0.500x\)
par(mfrow=c(2,2))
plot(data4.lm)
## Warning: not plotting observations with leverage one:
## 8
## Warning: not plotting observations with leverage one:
## 8
data1.rsquared <- summary(data1.lm)$r.squared
data2.rsquared <- summary(data2.lm)$r.squared
data3.rsquared <- summary(data3.lm)$r.squared
data4.rsquared <- summary(data4.lm)$r.squared
data1.rsquared;
## [1] 0.67
data2.rsquared;
## [1] 0.67
data3.rsquared;
## [1] 0.67
data4.rsquared
## [1] 0.67
\(R^2 = 0.67\)
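For simple linear regression, \(R^2\) is just the squared correlation between x and y, which gives a quick consistency check:
all.equal(summary(data1.lm)$r.squared, cor(data1$x, data1$y)^2)  # TRUE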
Conditions to check for a linear regression model:
- Linearity
- Nearly normal residuals
- Constant variability
- Independent observations (we can make an assumption here)
Plots for data1
par(mfrow=c(2,2))
plot(data1$x, data1$y)
hist(data1.lm$residuals)
qqnorm(data1.lm$residuals)
qqline(data1.lm$residuals)
The main plot for data1 shows a reasonably linear relationship, the Q-Q plot is fairly normal with a few stragglers, and the histogram of residuals, though coarse with only 11 observations, does not contradict normality; therefore a linear regression model is appropriate for data1.
Plots for data2
par(mfrow=c(2,2))
plot(data2$x, data2$y)
hist(data2.lm$residuals)
qqnorm(data2.lm$residuals)
qqline(data2.lm$residuals)
From the graphs above we can see that the data follow a curve rather than a straight line, and the residuals do not seem to follow a normal distribution either; a linear model is not appropriate for data2.
Plots for data3
par(mfrow=c(2,2))
plot(data3$x, data3$y)
hist(data3.lm$residuals)
qqnorm(data3.lm$residuals)
qqline(data3.lm$residuals)
For data3, the data follow a clear linear trend except for one outlier that exerts leverage on the fit; the residuals look roughly normal apart from that point, but the outlier pulls the fitted line away from the trend of the remaining observations.
Plots for data4
par(mfrow=c(2,2))
plot(data4$x, data4$y)
hist(data4.lm$residuals)
qqnorm(data4.lm$residuals)
qqline(data4.lm$residuals)
In this case there is a single high-leverage outlier (every other observation has the same x value), and the residual distribution does not seem normal; a linear model is not appropriate for data4.
Visualization plays a very important role in analyzing data: it can reveal outliers, expose problems with a model, and help us draw conclusions and make predictions from a dataset. The plots above are a case in point: the four datasets share nearly identical means, standard deviations, correlations, and fitted regression lines, yet the graphs show that they are strikingly different, and only the plots make that clear.