Please put the answers for Part I next to the question number (2pts each):
IQR = 49.8-37
37-(1.5*IQR)
## [1] 17.8
49.8+(1.5*IQR)
## [1] 69
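The two values above are the lower and upper fences of the 1.5 × IQR rule for flagging outliers. A minimal sketch of the same arithmetic with named steps follows; the quartiles 37 and 49.8 come from the calculation above, and the variable names are illustrative (`iqr_value` is used so the name of R's built-in `IQR()` function is not reused).

```r
q1 <- 37     # first quartile, taken from the calculation above
q3 <- 49.8   # third quartile, taken from the calculation above

iqr_value   <- q3 - q1               # interquartile range
lower_fence <- q1 - 1.5 * iqr_value  # values below this are flagged as outliers
upper_fence <- q3 + 1.5 * iqr_value  # values above this are flagged as outliers

c(lower = lower_fence, upper = upper_fence)
```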
7a. Describe the two distributions (2pts). Both have one peak, so they are unimodal. A is skewed to the right, while the sampling distribution (B) looks approximately normal.
7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).
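Since A is skewed to the right, its data are more spread out. B is a distribution of sample means, so averaging pulls its values close to the same center as A: the means agree, but the standard deviation of B is smaller, roughly the standard deviation of A divided by the square root of the sample size.
7c. What is the statistical principle that describes this phenomenon (2 pts)? The central limit theorem.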
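The behavior described in 7b and 7c can be illustrated with a small simulation. This is only a sketch under stated assumptions: the exponential population, the sample size of 30, and the seed are stand-ins chosen for illustration, not the actual distributions A and B from the question.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Assumed right-skewed population, standing in for distribution A
population <- rexp(10000, rate = 1)

# Assumed sample size; each replicate is the mean of one random sample
n <- 30
sample_means <- replicate(2000, mean(sample(population, n)))  # stands in for B

# The centers agree, but the spread of the sample means is much smaller
c(mean_A = mean(population), mean_B = mean(sample_means))
c(sd_A = sd(population),
  sd_B = sd(sample_means),
  sd_A_over_sqrt_n = sd(population) / sqrt(n))
```

A histogram of `sample_means` should look far more symmetric than a histogram of `population`, which is the pattern the central limit theorem predicts.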
Consider the four data sets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
options(digits = 3)
mean(data1$x)
## [1] 9
mean(data1$y)
## [1] 7.5
mean(data2$x)
## [1] 9
mean(data2$y)
## [1] 7.5
mean(data3$x)
## [1] 9
mean(data3$y)
## [1] 7.5
mean(data4$x)
## [1] 9
mean(data4$y)
## [1] 7.5
median(data1$x)
## [1] 9
median(data1$y)
## [1] 7.58
median(data2$x)
## [1] 9
median(data2$y)
## [1] 8.14
median(data3$x)
## [1] 9
median(data3$y)
## [1] 7.11
median(data4$x)
## [1] 8
median(data4$y)
## [1] 7.04
sd(data1$x)
## [1] 3.32
sd(data1$y)
## [1] 2.03
sd(data2$x)
## [1] 3.32
sd(data2$y)
## [1] 2.03
sd(data3$x)
## [1] 3.32
sd(data3$y)
## [1] 2.03
sd(data4$x)
## [1] 3.32
sd(data4$y)
## [1] 2.03
cor(data1$x,data1$y)
## [1] 0.816
cor(data2$y, data2$x)
## [1] 0.816
cor(data3$y, data3$x)
## [1] 0.816
cor(data4$y, data4$x)
## [1] 0.817
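The column-by-column calls above can also be collected into a single table. The sketch below assumes R's built-in `anscombe` data set, whose values match the four data sets defined above; the names `quartet` and `stats` are illustrative.

```r
# Rebuild the four data sets from R's built-in anscombe data frame
quartet <- list(
  data1 = data.frame(x = anscombe$x1, y = anscombe$y1),
  data2 = data.frame(x = anscombe$x2, y = anscombe$y2),
  data3 = data.frame(x = anscombe$x3, y = anscombe$y3),
  data4 = data.frame(x = anscombe$x4, y = anscombe$y4)
)

# One column of summary statistics per data set
stats <- sapply(quartet, function(d) {
  c(mean_x   = mean(d$x),   mean_y   = mean(d$y),
    median_x = median(d$x), median_y = median(d$y),
    sd_x     = sd(d$x),     sd_y     = sd(d$y),
    cor      = cor(d$x, d$y))
})

round(stats, 2)
```

Laying the statistics side by side makes it easier to see which rows agree across the four data sets and which (the medians) do not.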
data_1 <- lm(y~x, data1)
data_2 <- lm(y~x, data2)
data_3 <- lm(y~x, data3)
data_4 <- lm(y~x, data4)
summary(data_1)
##
## Call:
## lm(formula = y ~ x, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.9213 -0.4558 -0.0414  0.7094  1.8388
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.000      1.125    2.67   0.0257 *
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
summary(data_2)
##
## Call:
## lm(formula = y ~ x, data = data2)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.901 -0.761  0.129  0.949  1.269
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.001      1.125    2.67   0.0258 *
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
summary(data_3)
##
## Call:
## lm(formula = y ~ x, data = data3)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.159 -0.615 -0.230  0.154  3.241
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.002      1.124    2.67   0.0256 *
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
summary(data_4)
##
## Call:
## lm(formula = y ~ x, data = data4)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.751 -0.831  0.000  0.809  1.839
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.002      1.124    2.67   0.0256 *
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
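The four summaries above report nearly the same intercept, slope, and R-squared. A compact way to compare them is sketched below, again assuming R's built-in `anscombe` data set as the source of the four data sets; `fits` and `coef_table` are illustrative names.

```r
# Fit y ~ x to each of the four data sets
fits <- lapply(1:4, function(i) {
  d <- data.frame(x = anscombe[[paste0("x", i)]],
                  y = anscombe[[paste0("y", i)]])
  lm(y ~ x, data = d)
})
names(fits) <- paste0("data_", 1:4)

# One column per model: intercept, slope, and R-squared
coef_table <- sapply(fits, function(m) {
  c(coef(m), r_squared = summary(m)$r.squared)
})

round(coef_table, 3)
```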
par(mfrow = c(2,2))
plot(x = data1$x, y = data1$y)
hist(data_1$residuals)
qqnorm(data_1$residuals)
qqline(data_1$residuals)
plot(data_1$residuals ~ data1$x)
abline(h = 0)
The scatterplot shows a linear upward trend. The normal Q-Q plot of the residuals looks good, and the histogram of the residuals is, for the most part, centered around 0. The plot of residuals against x looks random. Yes, this data set meets the conditions for linear regression.
par(mfrow = c(2,2))
plot(x = data2$x, y = data2$y)
hist(data_2$residuals)
qqnorm(data_2$residuals)
qqline(data_2$residuals)
plot(data_2$residuals ~ data2$x)
abline(h = 0)
Data 2 does not meet the conditions for linear regression: the scatterplot is curved, and the plot of residuals against x shows a clear pattern rather than random scatter.
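One way to check the curvature numerically is to compare the straight-line fit with a fit that adds a squared term. This is a sketch rather than part of the original analysis; it assumes the built-in `anscombe` data set (columns `x2`, `y2`) holds the same values as data2, and the model names are illustrative.

```r
d2 <- data.frame(x = anscombe$x2, y = anscombe$y2)

linear_fit    <- lm(y ~ x, data = d2)           # the model used above
quadratic_fit <- lm(y ~ x + I(x^2), data = d2)  # adds a curvature term

# Share of variance explained by each model
c(linear    = summary(linear_fit)$r.squared,
  quadratic = summary(quadratic_fit)$r.squared)

# F-test for whether the squared term improves the fit
anova(linear_fit, quadratic_fit)
```

If the quadratic model explains nearly all of the variance while the linear model does not, that supports the conclusion that a straight line is the wrong shape for this data set.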
par(mfrow = c(2,2))
plot(x = data3$x, y = data3$y)
hist(data_3$residuals)
qqnorm(data_3$residuals)
qqline(data_3$residuals)
plot(data_3$residuals ~ data3$x)
abline(h = 0)
Data 3 looks linear except for one outlier, which gives the residuals a right skew. Without the outlier, the data might meet the conditions for linear regression, but as it stands the plot of residuals against x does not look random.
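The claim that the data might pass without the outlier can be checked by refitting the model with that point removed. The sketch below assumes the built-in `anscombe` data set (columns `x3`, `y3`) matches data3, and it treats the observation with the largest absolute residual as the outlier, which is a simplifying assumption.

```r
d3 <- data.frame(x = anscombe$x3, y = anscombe$y3)

full_fit <- lm(y ~ x, data = d3)

# Treat the observation with the largest absolute residual as the outlier
outlier_row <- which.max(abs(residuals(full_fit)))
d3[outlier_row, ]

# Refit without that observation and compare the coefficients
trimmed_fit <- lm(y ~ x, data = d3[-outlier_row, ])
rbind(full = coef(full_fit), without_outlier = coef(trimmed_fit))

# R-squared with and without the outlier
c(full            = summary(full_fit)$r.squared,
  without_outlier = summary(trimmed_fit)$r.squared)
```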
par(mfrow = c(2,2))
plot(x = data4$x, y = data4$y)
hist(data_4$residuals)
qqnorm(data_4$residuals)
qqline(data_4$residuals)
plot(data_4$residuals ~ data4$x)
abline(h = 0)
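In Data 4, x is constant except for one point, which gives it a strange plot. The residuals look reasonable on their own, but the plot of residuals against x does not look random, because all but one of the points share the same x value.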
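The influence of the single point with a different x value can be measured with its leverage (hat value). This sketch assumes the built-in `anscombe` data set (columns `x4`, `y4`) matches data4; the names `d4`, `fit4`, and `leverage` are illustrative.

```r
d4 <- data.frame(x = anscombe$x4, y = anscombe$y4)
fit4 <- lm(y ~ x, data = d4)

# Leverage of each observation; values near 1 mean the fitted line is
# forced through that point
leverage <- hatvalues(fit4)
round(leverage, 3)

# The observation with the highest leverage
d4[which.max(leverage), ]
```

A leverage close to 1 for one observation means the slope is determined almost entirely by that point, so the regression says little about the other ten observations.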
Even though data sets 1, 2, 3, and 4 have similar means, medians, standard deviations, and correlations, plotting them shows whether each one actually meets the conditions for linear regression. You cannot fully see the story the data tell without visualizing them.
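Because the same four diagnostic plots were drawn for every data set, the repeated code could be wrapped in a helper. The function below is a hypothetical sketch, not part of the original analysis: `diagnose` and its arguments are illustrative names, and it plots residuals against fitted values instead of against x (for a one-predictor model the two views show the same pattern).

```r
# Hypothetical helper: draws four diagnostic panels for a data frame
# with columns x and y, and returns the fitted model invisibly
diagnose <- function(d, label) {
  fit <- lm(y ~ x, data = d)

  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))  # restore the plotting layout afterwards

  plot(d$x, d$y, main = paste(label, "scatterplot"), xlab = "x", ylab = "y")
  abline(fit)

  hist(residuals(fit), main = "Histogram of residuals", xlab = "residual")

  qqnorm(residuals(fit))
  qqline(residuals(fit))

  plot(fitted(fit), residuals(fit), main = "Residuals vs fitted",
       xlab = "fitted value", ylab = "residual")
  abline(h = 0)

  invisible(fit)
}
```

Calling `diagnose(data.frame(x = anscombe$x1, y = anscombe$y1), "Data 1")` would redraw the first set of panels, assuming the built-in `anscombe` data set matches the data above.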