Please put the answers for Part I next to the question number (2pts each):
7a. Describe the two distributions (2pts).
Observations (A) is right-skewed, and its standard deviation of 3.22 shows that the variation in the data is very high. However, the sampling distribution built from Observations appears to be normally distributed, with a much smaller standard deviation.
7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).
The means are similar because every sample mean estimates the same underlying population mean. The standard deviation is higher for Observations (A) because the individual data points are spread widely; when samples are taken, averaging washes out much of that spread, so the sample means sit much closer together and the sampling distribution has a much lower standard deviation.
7c. What is the statistical principle that describes this phenomenon (2 pts)?
The statistical principle demonstrated here is the Central Limit Theorem: the distribution of sample means will be nearly normal, regardless of the shape of the underlying population.
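A quick simulation illustrates this (a minimal sketch, assuming a hypothetical right-skewed exponential population rather than the actual Observations data): the raw draws stay skewed, but the means of repeated samples cluster tightly and symmetrically around the population mean.
set.seed(42)
# Hypothetical right-skewed population (exponential, mean = 5, sd = 5)
population <- rexp(10000, rate = 1/5)
# Means of 1000 repeated samples of size 30 each
sample_means <- replicate(1000, mean(sample(population, 30)))
sd(population)     # large, close to 5
sd(sample_means)   # much smaller, roughly 5 / sqrt(30)
hist(sample_means) # approximately normal, centered near the population mean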
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
# Apply funcName to each column of dataSet; returns a named list of results.
executeFunction <- function(dataSet, funcName) {
  return(lapply(dataSet, funcName))
}
executeFunction(data1, mean)
## $x
## [1] 9
##
## $y
## [1] 7.5
executeFunction(data2, mean)
## $x
## [1] 9
##
## $y
## [1] 7.5
executeFunction(data3, mean)
## $x
## [1] 9
##
## $y
## [1] 7.5
executeFunction(data4, mean)
## $x
## [1] 9
##
## $y
## [1] 7.5
executeFunction(data1, median)
## $x
## [1] 9
##
## $y
## [1] 7.6
executeFunction(data2, median)
## $x
## [1] 9
##
## $y
## [1] 8.1
executeFunction(data3, median)
## $x
## [1] 9
##
## $y
## [1] 7.1
executeFunction(data4, median)
## $x
## [1] 8
##
## $y
## [1] 7
executeFunction(data1, sd)
## $x
## [1] 3.3
##
## $y
## [1] 2
executeFunction(data2, sd)
## $x
## [1] 3.3
##
## $y
## [1] 2
executeFunction(data3, sd)
## $x
## [1] 3.3
##
## $y
## [1] 2
executeFunction(data4, sd)
## $x
## [1] 3.3
##
## $y
## [1] 2
plot(data1)
cor(data1)
## x y
## x 1.00 0.82
## y 0.82 1.00
plot(data2)
cor(data2)
## x y
## x 1.00 0.82
## y 0.82 1.00
plot(data3)
cor(data3)
## x y
## x 1.00 0.82
## y 0.82 1.00
plot(data4)
cor(data4)
## x y
## x 1.00 0.82
## y 0.82 1.00
lm1 <- lm(y~x, data1)
summary(lm1)
##
## Call:
## lm(formula = y ~ x, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9213 -0.4558 -0.0414 0.7094 1.8388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000 1.125 2.67 0.0257 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
plot(lm1)
lm2 <- lm(y~x, data2)
summary(lm2)
##
## Call:
## lm(formula = y ~ x, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.901 -0.761 0.129 0.949 1.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.67 0.0258 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
plot(lm2)
lm3 <- lm(y~x, data3)
summary(lm3)
##
## Call:
## lm(formula = y ~ x, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.159 -0.615 -0.230 0.154 3.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
plot(lm3)
lm4 <- lm(y~x, data4)
summary(lm4)
##
## Call:
## lm(formula = y ~ x, data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
plot(lm4)
## Warning: not plotting observations with leverage one:
## 8
## Warning: not plotting observations with leverage one:
## 8
summary(lm1)$r.squared
## [1] 0.67
summary(lm2)$r.squared
## [1] 0.67
summary(lm3)$r.squared
## [1] 0.67
summary(lm4)$r.squared
## [1] 0.67
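The fitted coefficients are just as indistinguishable as the R-squared values; one way to see all four side by side (a short sketch reusing the models fitted above):
sapply(list(lm1, lm2, lm3, lm4), coef)
# Each column shows an intercept near 3.0 and a slope near 0.5,
# matching the summaries above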
plot(lm1)
hist(lm1$residuals)
qqnorm(lm1$residuals)
qqline(lm1$residuals)
For data1, the plot shows a linear relationship: the data follow a clear upward linear trend. It is appropriate to conclude that the linear regression model is reliable for data1.
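As a complementary numeric check on the residual plots above (a sketch; shapiro.test() is the base-R Shapiro-Wilk normality test):
shapiro.test(lm1$residuals) # a large p-value is consistent with normal residuals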
plot(lm2)
hist(lm2$residuals)
qqnorm(lm2$residuals)
qqline(lm2$residuals)
For data2, a straight line cannot capture the data; the relationship is clearly curved, so it is NOT appropriate to fit a linear regression model.
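In fact, data2 is well described by a quadratic; a quick check (a sketch; lm2q is an illustrative name for the hypothetical quadratic fit):
lm2q <- lm(y ~ x + I(x^2), data2)
summary(lm2q)$r.squared # close to 1, confirming the curved relationship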
plot(lm3)
hist(lm3$residuals)
qqnorm(lm3$residuals)
qqline(lm3$residuals)
For data3, the plot shows a strong linear relationship, but the residuals are non-normal: a single outlier sits far from the line and pulls the fit toward itself. Setting that outlier aside, it is appropriate to conclude that the linear regression model is reliable for data3.
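The outlying point can be located directly from the fitted model (a quick sketch):
which.max(abs(lm3$residuals))          # index of the largest residual
data3[which.max(abs(lm3$residuals)), ] # the outlying (x, y) pair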
plot(lm4)
## Warning: not plotting observations with leverage one:
## 8
## Warning: not plotting observations with leverage one:
## 8
hist(lm4$residuals)
qqnorm(lm4$residuals)
qqline(lm4$residuals)
For data4, the plot shows NO linear relationship: x is constant at 8 for every observation except one at x = 19, so the linear regression model is NOT reliable for data4.
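The plot warnings above say the same thing: observation 8 (the single point with x = 19) has leverage one, so the slope of the fitted line is determined entirely by that one point. This can be verified directly:
hatvalues(lm4) # observation 8 has a hat value of 1, the maximum possible leverage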
Visualization makes these results far easier and quicker to interpret. All four datasets share nearly identical means, standard deviations, correlations, and regression fits, yet the plots reveal completely different relationships. We can recognize the character of a trend (linear vs. non-linear, outlier-driven, or leverage-driven) much more easily from the plots than from the summary statistics.
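A single panel of all four scatterplots with their fitted lines makes the contrast immediate (a sketch using base graphics):
par(mfrow = c(2, 2))
for (d in list(data1, data2, data3, data4)) {
  plot(d$x, d$y, xlab = "x", ylab = "y")
  abline(lm(y ~ x, d)) # essentially the same line, y = 3 + 0.5x, in every panel
}
par(mfrow = c(1, 1))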