Please put the answers for Part I next to the question number:
7a. Describe the two distributions.
Distribution A is unimodal and right-skewed. Distribution B is unimodal, symmetrical and nearly normal.
7b. Explain why the means of these two distributions are similar but the standard deviations are not.
The means are similar because the Sampling Distribution is built from the means of 500 random samples of size 30 drawn from the Observations distribution, so it is centered on the mean of the observations. Because the samples are random and the sample size is large enough, it is also nearly normal. The standard deviations are not similar because individual observations fall much farther from the mean than averages of 30 observations do: the standard deviation of the Sampling Distribution is approximately the standard deviation of the observations divided by the square root of 30.
7c. What is the statistical principle that describes this phenomenon (2 pts)?
It’s described by the Central Limit Theorem.
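The answers to 7b and 7c can be illustrated with a small simulation. This is only a sketch: the original Observations distribution is not given, so a right-skewed exponential population is assumed here, and the names `observations` and `sample_means` are made up for the example.

```r
set.seed(42)  # arbitrary seed, used only to make the run reproducible

# Hypothetical right-skewed "Observations" population (an assumption;
# the distribution from the question is not available).
observations <- rexp(10000, rate = 1 / 5)

# Draw 500 random samples of size 30 and keep the mean of each one.
sample_means <- replicate(500, mean(sample(observations, size = 30)))

# The two means should be close to each other.
c(mean_obs = mean(observations), mean_sampling = mean(sample_means))

# The spread of the sample means should be close to sd / sqrt(n).
c(sd_obs = sd(observations),
  sd_sampling = sd(sample_means),
  sd_obs_over_sqrt_n = sd(observations) / sqrt(30))
```

Calling `hist(observations)` and `hist(sample_means)` afterwards should show a skewed histogram next to a roughly bell-shaped one.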
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
mean_x1
## [1] 9
mean_y1
## [1] 7.5
mean_x2
## [1] 9
mean_y2
## [1] 7.5
mean_x3
## [1] 9
mean_y3
## [1] 7.5
mean_x4
## [1] 9
mean_y4
## [1] 7.5
median_x1
## [1] 9
median_y1
## [1] 7.6
median_x2
## [1] 9
median_y2
## [1] 8.1
median_x3
## [1] 9
median_y3
## [1] 7.1
median_x4
## [1] 8
median_y4
## [1] 7
sd_x1
## [1] 3.3
sd_y1
## [1] 2
sd_x2
## [1] 3.3
sd_y2
## [1] 2
sd_x3
## [1] 3.3
sd_y3
## [1] 2
sd_x4
## [1] 3.3
sd_y4
## [1] 2
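The code that produced the values above is not shown, so the following is one possible way to compute them, not necessarily the method originally used. Note that `options(digits=2)` limits printing to two significant digits (which is why values such as 7.5 and 2 appear above), whereas `round(x, 2)` keeps two decimal places. The names `all_data` and `summary_stats` are introduced only for this sketch.

```r
data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68))
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74))
data3 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73))
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))

# Collect the four data sets in a named list (hypothetical helper name).
all_data <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)

# One row per data set, one column per statistic.
summary_stats <- t(sapply(all_data, function(d) {
  c(mean_x = mean(d$x), mean_y = mean(d$y),
    median_x = median(d$x), median_y = median(d$y),
    sd_x = sd(d$x), sd_y = sd(d$y))
}))

# Round explicitly to two decimal places instead of relying on options(digits).
summary_stats <- round(summary_stats, 2)
print(summary_stats, digits = 4)
```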
cor(data1$x, data1$y)
## [1] 0.82
cor(data2$x, data2$y)
## [1] 0.82
cor(data3$x, data3$y)
## [1] 0.82
cor(data4$x, data4$y)
## [1] 0.82
lm_1 <- lm(y ~ x, data=data1)
summary(lm_1)
##
## Call:
## lm(formula = y ~ x, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.9213 -0.4558 -0.0414  0.7094  1.8388
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.000      1.125    2.67   0.0257 *
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
lm_2 <- lm(y ~ x, data=data2)
summary(lm_2)
##
## Call:
## lm(formula = y ~ x, data = data2)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.901 -0.761  0.129  0.949  1.269
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.001      1.125    2.67   0.0258 *
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
lm_3 <- lm(y ~ x, data=data3)
summary(lm_3)
##
## Call:
## lm(formula = y ~ x, data = data3)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.159 -0.615 -0.230  0.154  3.241
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.002      1.124    2.67   0.0256 *
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
lm_4 <- lm(y ~ x, data=data4)
summary(lm_4)
##
## Call:
## lm(formula = y ~ x, data = data4)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.751 -0.831  0.000  0.809  1.839
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.002      1.124    2.67   0.0256 *
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
The fitted linear regression equation is essentially the same for all four data sets: y = 3 + 0.5x
Data1 R-squared = 0.667
Data2 R-squared = 0.666
Data3 R-squared = 0.666
Data4 R-squared = 0.667
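The coefficients and R-squared values above were read from the four `summary()` outputs. As a sketch, they can also be pulled into a single table so they are easier to compare. The names `models` and `fit_table` are introduced here for illustration.

```r
data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68))
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74))
data3 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73))
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))

# Fit y ~ x separately for each data set.
models <- lapply(list(data1 = data1, data2 = data2, data3 = data3, data4 = data4),
                 function(d) lm(y ~ x, data = d))

# One row per model: intercept, slope, and multiple R-squared.
fit_table <- t(sapply(models, function(m) {
  c(intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    r_squared = summary(m)$r.squared)
}))

print(round(fit_table, 3), digits = 4)
```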
Data1
Conditions:
1-Linearity CHECK
2-Nearly normal residuals X
3-Constant variability CHECK
4-Independent observations UNKNOWN
Linear regression model is not appropriate.
par(mfrow=c(2,2))
plot(data1)
hist(lm_1$residuals)
qqnorm(lm_1$residuals)
qqline(lm_1$residuals)
Data2
Conditions:
1-Linearity X
2-Nearly normal residuals X
3-Constant variability X
4-Independent observations UNKNOWN
Linear regression model is not appropriate.
par(mfrow=c(2,2))
plot(data2)
hist(lm_2$residuals)
qqnorm(lm_2$residuals)
qqline(lm_2$residuals)
Data3
Conditions:
1-Linearity CHECK
2-Nearly normal residuals CHECK
3-Constant variability X
4-Independent observations UNKNOWN
Linear regression model is not appropriate.
par(mfrow=c(2,2))
plot(data3)
hist(lm_3$residuals)
qqnorm(lm_3$residuals)
qqline(lm_3$residuals)
Data4
Conditions:
1-Linearity X (very extreme outlier)
2-Nearly normal residuals X
3-Constant variability X
4-Independent observations UNKNOWN
Linear regression model is not appropriate.
par(mfrow=c(2,2))
plot(data4)
hist(lm_4$residuals)
qqnorm(lm_4$residuals)
qqline(lm_4$residuals)
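The checks above rely on a scatterplot, a histogram of residuals, and a normal Q-Q plot. A residuals-versus-fitted plot is a common additional way to judge linearity and constant variability. The sketch below draws one panel per data set, and the `models` list is a name assumed for this example.

```r
data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68))
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74))
data3 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73))
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))

models <- lapply(list(data1 = data1, data2 = data2, data3 = data3, data4 = data4),
                 function(d) lm(y ~ x, data = d))

# One residuals-vs-fitted panel per model, with a reference line at zero.
par(mfrow = c(2, 2))
for (name in names(models)) {
  m <- models[[name]]
  plot(fitted(m), resid(m), main = name,
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
}
```

A curved pattern in a panel suggests non-linearity, and a single point far from the zero line suggests an outlier.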
It is critical to visualize the data and check all conditions when building a model. These data sets have nearly identical means, standard deviations, correlations, R-squared values, and linear regression equations; however, the plots show that a linear regression model is clearly inappropriate for some of them.
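To see the four data sets side by side, each scatterplot can be drawn with its fitted line on common axes. The values used in this exercise appear to match Anscombe's quartet, which ships with R as the built-in `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`). The sketch below uses that data frame, on the assumption that it holds the same values. The axis limits are arbitrary choices for the example.

```r
# Built-in data set from the 'datasets' package, assumed to hold the same
# values as data1..data4 above.
par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  # Same axis limits in every panel so the shapes are directly comparable.
  plot(x, y, main = paste("Data", i), xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x), col = "red")  # fitted regression line
}
```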