Please put the answers for Part I next to the question number (2pts each):
7a. Describe the two distributions (2pts). Distribution A is right-skewed and unimodal. Distribution B is approximately normal: symmetric and unimodal.
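7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts). One chart shows the observed values and the other shows the sampling distribution of the sample mean. Sample means are centered on the mean of the observations, so the two means are similar. Because the observations are independent, averaging them reduces variability, so the standard deviation of the sample means (the standard error) is smaller than the standard deviation of the individual observations, and the distribution of the sample mean is approximately normal.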
7c. What is the statistical principle that describes this phenomenon (2 pts)? Central Limit Theorem
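A small simulation can illustrate the answers to 7b and 7c. This is only a sketch under assumptions that are not part of the question: the population is exponential (chosen simply because it is right-skewed), the sample size is 30, and the names `population`, `n`, and `sample_means` are made up for the example.

```r
set.seed(1)

# Hypothetical right-skewed population (assumption: exponential, rate 1)
population <- rexp(100000, rate = 1)

# Draw many samples of size n and keep the mean of each one
n <- 30
sample_means <- replicate(5000, mean(sample(population, n)))

# The two centers are close to each other
print(c(population_mean = mean(population),
        mean_of_sample_means = mean(sample_means)))

# The spreads differ: sample means vary by roughly sd / sqrt(n)
print(c(population_sd = sd(population),
        sd_of_sample_means = sd(sample_means),
        sd_over_sqrt_n = sd(population) / sqrt(n)))
```

A histogram of `population` next to a histogram of `sample_means` should resemble Distributions A and B respectively.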
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
Means
meand1x <- round(mean(data1$x), 2)
meand1y <- round(mean(data1$y), 2)
meand2x <- round(mean(data2$x), 2)
meand2y <- round(mean(data2$y), 2)
meand3x <- round(mean(data3$x), 2)
meand3y <- round(mean(data3$y), 2)
meand4x <- round(mean(data4$x), 2)
meand4y <- round(mean(data4$y), 2)
Medians
medd1x <- round(median(data1$x), 2)
medd1y <- round(median(data1$y), 2)
medd2x <- round(median(data2$x), 2)
medd2y <- round(median(data2$y), 2)
medd3x <- round(median(data3$x), 2)
medd3y <- round(median(data3$y), 2)
medd4x <- round(median(data4$x), 2)
medd4y <- round(median(data4$y), 2)
Standard deviations
sdd1x <- round(sd(data1$x), 2)
sdd1y <- round(sd(data1$y), 2)
sdd2x <- round(sd(data2$x), 2)
sdd2y <- round(sd(data2$y), 2)
sdd3x <- round(sd(data3$x), 2)
sdd3y <- round(sd(data3$y), 2)
sdd4x <- round(sd(data4$x), 2)
sdd4y <- round(sd(data4$y), 2)
Correlations
cord1 <- cor(data1$x, data1$y)
cord2 <- cor(data2$x, data2$y)
cord3 <- cor(data3$x, data3$y)
cord4 <- cor(data4$x, data4$y)
# Collect the summary statistics in one table (each statistic listed once)
data.frame(data = c("Data1", "Data2", "Data3", "Data4"),
           Meanx = c(meand1x, meand2x, meand3x, meand4x),
           Meany = c(meand1y, meand2y, meand3y, meand4y),
           Medianx = c(medd1x, medd2x, medd3x, medd4x),
           Mediany = c(medd1y, medd2y, medd3y, medd4y),
           SDx = c(sdd1x, sdd2x, sdd3x, sdd4x),
           SDy = c(sdd1y, sdd2y, sdd3y, sdd4y),
           Correlation = c(cord1, cord2, cord3, cord4))
##    data Meanx Meany Medianx Mediany SDx SDy Correlation
## 1 Data1     9   7.5       9     7.6 3.3   2        0.82
## 2 Data2     9   7.5       9     8.1 3.3   2        0.82
## 3 Data3     9   7.5       9     7.1 3.3   2        0.82
## 4 Data4     9   7.5       8     7.0 3.3   2        0.82
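The same table can be built with less repetition by looping over the data sets. The sketch below assumes that R's built-in `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`) holds the same values as `data1` to `data4` above; the names `sets` and `summary_table` are introduced only for this example.

```r
# Rebuild the four data sets from the built-in `anscombe` data frame
# (assumption: its values match data1..data4 above)
sets <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(sets) <- paste0("Data", 1:4)

# One row of summary statistics per data set
summary_table <- do.call(rbind, lapply(sets, function(d) {
  data.frame(Meanx = mean(d$x), Meany = mean(d$y),
             Medianx = median(d$x), Mediany = median(d$y),
             SDx = sd(d$x), SDy = sd(d$y),
             Correlation = cor(d$x, d$y))
}))

# round() gives two decimal places regardless of options(digits)
print(round(summary_table, 2))
```

Rounding once at the end, rather than in every assignment, keeps full precision in the stored values.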
Rd1 <- lm(data1$y~data1$x)
Rd2 <- lm(data2$y~data2$x)
Rd3 <- lm(data3$y~data3$x)
Rd4 <- lm(data4$y~data4$x)
Rd1
##
## Call:
## lm(formula = data1$y ~ data1$x)
##
## Coefficients:
## (Intercept) data1$x
## 3.0 0.5
Rd2
##
## Call:
## lm(formula = data2$y ~ data2$x)
##
## Coefficients:
## (Intercept) data2$x
## 3.0 0.5
Rd3
##
## Call:
## lm(formula = data3$y ~ data3$x)
##
## Coefficients:
## (Intercept) data3$x
## 3.0 0.5
Rd4
##
## Call:
## lm(formula = data4$y ~ data4$x)
##
## Coefficients:
## (Intercept) data4$x
## 3.0 0.5
The linear regression equation for all data sets: y = 3.00 + 0.50x
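As a cross-check on the shared equation, the coefficients can be pulled out of each fit and compared side by side. This sketch again assumes the built-in `anscombe` data frame matches the four data sets; `sets`, `fits`, and `coefs` are illustrative names. It uses the `data =` argument of `lm()` instead of `data1$y ~ data1$x`, which gives shorter coefficient labels.

```r
# Rebuild the four data sets (assumption: `anscombe` matches data1..data4)
sets <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(sets) <- paste0("Data", 1:4)

# Fit y ~ x for each data set and stack the coefficients into a matrix
fits <- lapply(sets, function(d) lm(y ~ x, data = d))
coefs <- t(sapply(fits, coef))

print(round(coefs, 2))
```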
summary(Rd1)
##
## Call:
## lm(formula = data1$y ~ data1$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9213 -0.4558 -0.0414 0.7094 1.8388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000 1.125 2.67 0.0257 *
## data1$x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
summary(Rd2)
##
## Call:
## lm(formula = data2$y ~ data2$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.901 -0.761 0.129 0.949 1.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.67 0.0258 *
## data2$x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
summary(Rd3)
##
## Call:
## lm(formula = data3$y ~ data3$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.159 -0.615 -0.230 0.154 3.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## data3$x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
summary(Rd4)
##
## Call:
## lm(formula = data4$y ~ data4$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## data4$x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
The R-squared for all data sets is 0.67
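For a simple linear regression with one predictor, R-squared equals the squared correlation between x and y, which is why four data sets with the same correlation also share the same R-squared. A minimal sketch of that check, assuming the built-in `anscombe` data frame matches the data above (`sets`, `r_squared`, and `cor_squared` are illustrative names):

```r
# Rebuild the four data sets (assumption: `anscombe` matches data1..data4)
sets <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(sets) <- paste0("Data", 1:4)

# R-squared from the fitted models, and the squared correlation
r_squared <- sapply(sets, function(d) summary(lm(y ~ x, data = d))$r.squared)
cor_squared <- sapply(sets, function(d) cor(d$x, d$y)^2)

print(round(rbind(r_squared, cor_squared), 3))
```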
Requirements for calculating linear regression:
1. Linearity
2. Nearly Normal Residuals
3. Constant Variability
Linearity and Constant Variability with Scatter Plot
plot(data1$x, data1$y, main = "Data1")
abline(Rd1)
plot(data2$x, data2$y, main = "Data2")
abline(Rd2)
plot(data3$x, data3$y, main = "Data3")
abline(Rd3)
plot(data4$x, data4$y, main = "Data4")
abline(Rd4)
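Data set 1 is roughly linear and has constant variability.
Data set 2 is not linear; the points follow a curve.
Data set 3 is linear apart from one outlier, which pulls the fitted line away from the other points.
Data set 4 has no variability in x apart from a single point (x = 19), and that one high-leverage point determines the slope.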
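The influence of the unusual points in Data sets 3 and 4 can be checked numerically by refitting without them. This is a sketch for illustration, not part of the required analysis; it assumes the built-in `anscombe` data frame matches the data above, and the names `d3`, `d4`, `fit3_trimmed`, and `fit4_trimmed` are made up here.

```r
# Assumption: `anscombe` holds the same values as data3 and data4
d3 <- data.frame(x = anscombe$x3, y = anscombe$y3)
d4 <- data.frame(x = anscombe$x4, y = anscombe$y4)

# Data set 3: drop the point with the largest absolute residual and refit
fit3 <- lm(y ~ x, data = d3)
outlier <- unname(which.max(abs(residuals(fit3))))
fit3_trimmed <- lm(y ~ x, data = d3[-outlier, ])
print(rbind(all_points = coef(fit3), without_outlier = coef(fit3_trimmed)))

# Data set 4: drop the single point with x = 19; x is then constant
d4_trimmed <- d4[d4$x != 19, ]
print(unique(d4_trimmed$x))
fit4_trimmed <- lm(y ~ x, data = d4_trimmed)
print(coef(fit4_trimmed))  # slope is NA because x no longer varies
```

If the slope changes noticeably in the first case and cannot be estimated in the second, the original fits depend heavily on single observations.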
Linearity and Constant Variability with Residual Plots
plot(data1$x, Rd1$residuals, main = "Data1")
abline(h=0,lty=2)
plot(data2$x, Rd2$residuals, main = "Data2")
abline(h=0,lty=2)
plot(data3$x, Rd3$residuals, main = "Data3")
abline(h=0,lty=2)
plot(data4$x, Rd4$residuals, main = "Data4")
abline(h=0,lty=2)
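The residual plots above show that only Data set 1 has residuals scattered around zero with no clear pattern, so only Data set 1 meets the linearity and constant variability conditions.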
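One way to confirm the curvature in Data set 2 is to compare the straight-line fit with a fit that adds a squared term. The quadratic model is not part of the assignment; it is used here only as a diagnostic, and the sketch assumes the built-in `anscombe` data frame matches `data2` (`d2`, `fit_linear`, and `fit_quadratic` are illustrative names).

```r
# Assumption: `anscombe` holds the same values as data2
d2 <- data.frame(x = anscombe$x2, y = anscombe$y2)

fit_linear <- lm(y ~ x, data = d2)
fit_quadratic <- lm(y ~ x + I(x^2), data = d2)  # adds a curvature term

print(c(linear = summary(fit_linear)$r.squared,
        quadratic = summary(fit_quadratic)$r.squared))
```

A large jump in R-squared when the squared term is added suggests that a straight line is the wrong shape for these data.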
Nearly normal Residuals:
qqnorm(Rd1$residuals)
qqline(Rd1$residuals)
qqnorm(Rd2$residuals)
qqline(Rd2$residuals)
qqnorm(Rd3$residuals)
qqline(Rd3$residuals)
qqnorm(Rd4$residuals)
qqline(Rd4$residuals)
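In the Q-Q plots, the residuals for all four data sets fall reasonably close to the line, so the nearly normal residuals condition looks roughly satisfied, although the outlier in Data set 3 stands apart from the rest of its residuals.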
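A formal test can supplement the Q-Q plots, with the caveat that it has little power with only 11 observations per data set. The sketch below swaps in the Shapiro-Wilk test (`shapiro.test()` from base R's stats package), which the analysis above does not use; it assumes the built-in `anscombe` data frame matches the data above, and `p_values` is an illustrative name.

```r
# Shapiro-Wilk p-value for the residuals of each fit
# (assumption: `anscombe` matches data1..data4)
p_values <- sapply(1:4, function(i) {
  d <- data.frame(x = anscombe[[paste0("x", i)]],
                  y = anscombe[[paste0("y", i)]])
  fit <- lm(y ~ x, data = d)
  shapiro.test(residuals(fit))$p.value
})
names(p_values) <- paste0("Data", 1:4)

# Small p-values indicate evidence against normal residuals
print(round(p_values, 3))
```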
A linear regression model is appropriate only for Data set 1, because it is the only data set that meets all of the conditions (Linearity, Nearly Normal Residuals, Constant Variability), as shown by the scatter plot, the residual plot, and the Normal Q-Q plot.
It is important to include visualizations when analyzing data, both to see any trends and to ensure that all necessary conditions are met. Many problems cannot be detected from summary statistics alone: the means, standard deviations, correlations, and fitted regression lines were nearly identical for all four data sets. Only the visualizations showed that linear regression was not appropriate for every data set.