A is positively skewed, with a center around 5 and a small spread.
B is approximately normally distributed, with a center that is still around 5 but a much wider spread.
As the sample size decreases, the spread of the sampling distribution increases; this is why the two distributions have different standard deviations but the same mean. The larger the sample, the closer the sample mean tends to fall to the population mean, so the spread, and with it the standard deviation, shrinks.
The statistical theory that describes this phenomenon is the central limit theorem (CLT). The CLT states that, given a sufficiently large sample size, the sampling distribution of the sample mean is approximately normal, centered at the population mean, with a spread that shrinks as the sample size grows.
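A quick simulation makes the link between sample size and spread concrete. The sketch below is illustrative only; it uses a made-up skewed (exponential) population with mean 5 rather than the assignment data, and the sample sizes are arbitrary.
#sketch: sampling distributions of the mean for two sample sizes
set.seed(123)
draw_means <- function(n, reps = 1000) {
  replicate(reps, mean(rexp(n, rate = 1/5)))  #skewed population with mean 5
}
means_small <- draw_means(10)    #small samples
means_large <- draw_means(200)   #large samples
c(mean = mean(means_small), sd = sd(means_small))
c(mean = mean(means_large), sd = sd(means_large))
#both sets of sample means center near 5, but the large-sample means have a much
#smaller standard deviation and a less skewed shape, as the CLT predicts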
# Part II
#load the four data sets
options(digits=2)  #print numeric output with 2 significant digits
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
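These values match Anscombe's quartet, which also ships with base R as the anscombe data set; as an optional cross-check (not part of the original analysis), the first frame could be compared against the built-in copy:
#optional cross-check against R's built-in anscombe data
#(assumes the built-in ordering matches the hand-typed values)
all.equal(data1, data.frame(x = anscombe$x1, y = anscombe$y1))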
#mean and median for 1
summary(data1$x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
summary(data1$y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.3 6.3 7.6 7.5 8.6 10.8
#standard deviation for 1
sd(data1$x)
## [1] 3.3
sd(data1$y)
## [1] 2
#mean and median for 2
summary(data2$x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
summary(data2$y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.1 6.7 8.1 7.5 8.9 9.3
#standard deviation for 2
sd(data2$x)
## [1] 3.3
sd(data2$y)
## [1] 2
#mean and median for 3
summary(data3$x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
summary(data3$y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.4 6.2 7.1 7.5 8.0 12.7
#standard deviation for 3
sd(data3$x)
## [1] 3.3
sd(data3$y)
## [1] 2
#mean and median for 4
summary(data4$x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8 8 8 9 8 19
summary(data4$y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.2 6.2 7.0 7.5 8.2 12.5
#standard deviation for 4
sd(data4$x)
## [1] 3.3
sd(data4$y)
## [1] 2
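The same statistics (plus the correlations computed next) can be collected for all four data sets in one pass; a minimal sketch, whose values should agree with the individual summary(), sd(), and cor() calls:
#sketch: mean, median, sd, and correlation for all four data sets at once
quartet <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)
t(sapply(quartet, function(d) {
  c(mean_x = mean(d$x), median_x = median(d$x), sd_x = sd(d$x),
    mean_y = mean(d$y), median_y = median(d$y), sd_y = sd(d$y),
    cor_xy = cor(d$x, d$y))
}))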
#correlation for 1
cor(data1$x, data1$y)
## [1] 0.82
#correlation for 2
cor(data2$x, data2$y)
## [1] 0.82
#correlation for 3
cor(data3$x, data3$y)
## [1] 0.82
#correlation for 4
cor(data4$x, data4$y)
## [1] 0.82
#linear regression equation for 1
data1c <- lm(x ~ y, data = data1)
summary(data1c)
##
## Call:
## lm(formula = x ~ y, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.652 -1.512 -0.266 1.234 3.895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.998 2.434 -0.41 0.6916
## y 1.333 0.314 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
#linear regression equation for 2
data2c <- lm(x ~ y, data = data2)
summary(data2c)
##
## Call:
## lm(formula = x ~ y, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.852 -1.432 -0.344 0.847 4.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.995 2.435 -0.41 0.6925
## y 1.332 0.314 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
#linear regression equation for 3
data3c <- lm(x ~ y, data = data3)
summary(data3c)
##
## Call:
## lm(formula = x ~ y, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.987 -1.373 -0.027 1.320 3.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.000 2.436 -0.41 0.6910
## y 1.333 0.315 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
#linear regression equation for 4
data4c <- lm(x ~ y, data = data4)
summary(data4c)
##
## Call:
## lm(formula = x ~ y, data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.786 -1.412 -0.185 1.455 3.333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004 2.435 -0.41 0.6898
## y 1.334 0.314 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
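The coefficients and R-squared values quoted in the equations below can also be pulled out of the fitted models programmatically; a small sketch added for convenience, whose values should match the printed summaries above:
#sketch: extract intercept, slope, and R-squared from each fitted model
models <- list(data1c = data1c, data2c = data2c, data3c = data3c, data4c = data4c)
t(sapply(models, function(m) {
  c(intercept = unname(coef(m)[1]),
    slope = unname(coef(m)[2]),
    r_squared = summary(m)$r.squared)
}))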
Data 1:
\[ \mu_x = 9.0 \]
\[ \mu_y = 7.5 \]
\[ M_x = 9.0 \]
\[ M_y = 7.6 \]
\[ \sigma_x = 3.3 \]
\[ \sigma_y = 2.0 \]
\[ r_{xy} = 0.82 \]
\[ \hat{x}_{(1)} = -0.998 + 1.333\,y \]
\[ R^2 = 0.667 \]
Data 2:
\[ \mu_x = 9.0 \]
\[ \mu_y = 7.5 \]
\[ M_x = 9.0 \]
\[ M_y = 8.1 \]
\[ \sigma_x = 3.3 \]
\[ \sigma_y = 2.0 \]
\[ r_{xy} = 0.82 \]
\[ \hat{x}_{(2)} = -0.995 + 1.332\,y \]
\[ R^2 = 0.666 \]
Data 3:
\[ \mu_x = 9.0 \]
\[ \mu_y = 7.5 \]
\[ M_x = 9.0 \]
\[ M_y = 7.1 \]
\[ \sigma_x = 3.3 \]
\[ \sigma_y = 2.0 \]
\[ r_{xy} = 0.82 \]
\[ \hat{x}_{(3)} = -1.000 + 1.333\,y \]
\[ R^2 = 0.666 \]
Data 4:
\[ \mu_x = 9.0 \]
\[ \mu_y = 7.5 \]
\[ M_x = 8.0 \]
\[ M_y = 7.0 \]
\[ \sigma_x = 3.3 \]
\[ \sigma_y = 2.0 \]
\[ r_{xy} = 0.82 \]
\[ \hat{x}_{(4)} = -1.004 + 1.334\,y \]
\[ R^2 = 0.667 \]
plot(data1$x ~ data1$y, main = "Data 1")
abline(data1c)
plot(data2$x ~ data2$y, main = "Data 2")
abline(data2c)
plot(data3$x ~ data3$y, main = "Data 3")
abline(data3c)
plot(data4$x ~ data4$y, main = "Data 4")
abline(data4c)
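The four panels are easier to compare when drawn in a single grid; this is a minor variation of the plotting code above (the 2-by-2 layout is my choice, not part of the original):
#sketch: draw all four plots in one 2x2 grid for side-by-side comparison
par(mfrow = c(2, 2))
plot(data1$x ~ data1$y, main = "Data 1"); abline(data1c)
plot(data2$x ~ data2$y, main = "Data 2"); abline(data2c)
plot(data3$x ~ data3$y, main = "Data 3"); abline(data3c)
plot(data4$x ~ data4$y, main = "Data 4"); abline(data4c)
par(mfrow = c(1, 1))  #restore the default single-panel layout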
Data 1: It is appropriate to estimate a linear regression model for this data set; the points follow a roughly linear pattern in the visualization.
Data 2: It is not appropriate to estimate a linear regression model for this data set; the relationship is clearly curved (parabolic) rather than linear.
Data 3: It is not appropriate to estimate a linear regression model for this data set as it stands; the points are otherwise tightly linear, but a single outlier pulls the fitted line away from that trend.
Data 4: It is not appropriate to estimate a linear regression model for this data set; x is constant at 8 for every observation except one, so the apparent relationship is driven entirely by that single outlier.
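The outlier explanation for Data 3 and Data 4 can also be checked numerically; the sketch below uses Cook's distance, an influence measure that is not part of the original analysis, where one value much larger than the rest indicates a single point dominating the fit.
#sketch (added check): Cook's distance for each observation; a single large value
#indicates one point is pulling the fitted line
round(cooks.distance(data3c), 2)
round(cooks.distance(data4c), 2)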
It’s important to include appropriate visualizations when analyzing data because they can reveal relationships you might not otherwise notice. From the summary statistics alone, I would never have known that Data 1, 2, 3, and 4 were so drastically different, since their means, standard deviations, correlations, and regression lines are nearly identical. The visualizations make those differences immediately clear and help the audience understand the data more efficiently and effectively.