Please put the answers for Part I next to the question number (2pts each):
7a. Describe the two distributions (2pts).
Ans Both figures A and B are unimodal and roughly bell-shaped. The spread of the sampling distribution in Figure B is much smaller than the spread of the distribution in Figure A.
In Figure A, the distribution is moderately right-skewed with lower kurtosis (flatter and wider). In Figure B, the distribution is approximately normal and more sharply peaked around its mean.
7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).
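Ans Figure A is the distribution of an observed variable, whereas Figure B is the sampling distribution of the mean from 500 random samples of size 30 drawn from A. The sample means center on the same value as A, so the two means are similar. Their standard deviation, however, is the standard error of the mean, roughly sd(A)/sqrt(30), so Figure B is much narrower than Figure A.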
7c. What is the statistical principle that describes this phenomenon (2 pts)?
Ans Central Limit Theorem
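The answer to 7b and 7c can be illustrated with a small simulation. This is only a sketch: the population below is a hypothetical right-skewed gamma variable (the actual data behind Figure A are not given), and the names `population`, `sample_means`, and `clt_summary` are made up for illustration.

```r
# Hypothetical right-skewed population (an assumption, not the Figure A data).
set.seed(123)
population <- rgamma(100000, shape = 2, rate = 0.4)

# Draw 500 random samples of size 30 and keep each sample mean.
sample_means <- replicate(500, mean(sample(population, size = 30)))

# The two centers should be close; the spreads should differ by about sqrt(30).
clt_summary <- c(
  pop_mean      = mean(population),
  mean_of_means = mean(sample_means),
  pop_sd        = sd(population),
  sd_of_means   = sd(sample_means),
  predicted_se  = sd(population) / sqrt(30)
)
print(round(clt_summary, 2))

# Side-by-side histograms, analogous to Figures A and B.
par(mfrow = c(1, 2))
hist(population, main = "Observed variable", xlab = "value")
hist(sample_means, main = "Means of 500 samples (n = 30)", xlab = "sample mean")
```

If the simulation behaves as expected, `sd_of_means` should land close to `predicted_se`, while `pop_mean` and `mean_of_means` should nearly coincide.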
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x = c(10,8,13,9,11,14,6,4,12,7,5),
                    y = c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x = c(8,8,8,8,8,8,8,19,8,8,8),
                    y = c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
round(mean(data1$x),2)
## [1] 9
round(mean(data1$y),2)
## [1] 7.5
round(mean(data2$x),2)
## [1] 9
round(mean(data2$y),2)
## [1] 7.5
round(mean(data3$x),2)
## [1] 9
round(mean(data3$y),2)
## [1] 7.5
round(mean(data4$x),2)
## [1] 9
round(mean(data4$y),2)
## [1] 7.5
round(median(data1$x),2)
## [1] 9
round(median(data1$y),2)
## [1] 7.6
round(median(data2$x),2)
## [1] 9
round(median(data2$y),2)
## [1] 8.1
round(median(data3$x),2)
## [1] 9
round(median(data3$y),2)
## [1] 7.1
round(median(data4$x),2)
## [1] 8
round(median(data4$y),2)
## [1] 7
round(sd(data1$x),2)
## [1] 3.3
round(sd(data1$y),2)
## [1] 2
round(sd(data2$x),2)
## [1] 3.3
round(sd(data2$y),2)
## [1] 2
round(sd(data3$x),2)
## [1] 3.3
round(sd(data3$y),2)
## [1] 2
round(sd(data4$x),2)
## [1] 3.3
round(sd(data4$y),2)
## [1] 2
round(cor(data1),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
round(cor(data2),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
round(cor(data3),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
round(cor(data4),2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
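Because `options(digits=2)` limits printing to two significant digits, the values above do not always show two decimal places. A compact alternative is sketched below. It assumes the four data sets can be rebuilt from R's built-in `anscombe` data frame (their values match the ones listed above); the names `datasets`, `stats`, and `stats_2dp` are illustrative only.

```r
# Assumption: rebuild data1..data4 from the built-in `anscombe` data frame
# so that this block runs on its own.
datasets <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(datasets) <- paste0("data", 1:4)

# One row per data set: mean, median, and sd of each column, plus cor(x, y).
stats <- t(sapply(datasets, function(d) {
  c(mean_x   = mean(d$x),   mean_y   = mean(d$y),
    median_x = median(d$x), median_y = median(d$y),
    sd_x     = sd(d$x),     sd_y     = sd(d$y),
    cor_xy   = cor(d$x, d$y))
}))

# formatC(..., format = "f", digits = 2) always shows two decimal places,
# independent of options(digits = ...).
stats_2dp <- formatC(stats, format = "f", digits = 2)
print(stats_2dp, quote = FALSE)
```

Putting all four data sets in one table makes it easy to see that the means, standard deviations, and correlations are nearly identical, while the medians differ.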
lm1 <- lm(y ~ x, data = data1)
summary(lm1)
##
## Call:
## lm(formula = y ~ x, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.9213 -0.4558 -0.0414  0.7094  1.8388
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.000      1.125    2.67   0.0257 *
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
lm2 <- lm(y ~ x, data = data2)
summary(lm2)
##
## Call:
## lm(formula = y ~ x, data = data2)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.901 -0.761  0.129  0.949  1.269
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.001      1.125    2.67   0.0258 *
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
lm3 <- lm(y ~ x, data = data3)
summary(lm3)
##
## Call:
## lm(formula = y ~ x, data = data3)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.159 -0.615 -0.230  0.154  3.241
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.002      1.124    2.67   0.0256 *
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
lm4 <- lm(y ~ x, data = data4)
summary(lm4)
##
## Call:
## lm(formula = y ~ x, data = data4)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.751 -0.831  0.000  0.809  1.839
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.002      1.124    2.67   0.0256 *
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
summary(lm1)$r.squared
## [1] 0.67
summary(lm2)$r.squared
## [1] 0.67
summary(lm3)$r.squared
## [1] 0.67
summary(lm4)$r.squared
## [1] 0.67
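The four fits can also be compared in a single table rather than by reading four separate summaries. The sketch below assumes the data sets are rebuilt from R's built-in `anscombe` data frame (matching the values listed earlier); `datasets`, `fits`, and `fit_table` are illustrative names.

```r
# Assumption: rebuild data1..data4 from the built-in `anscombe` data frame.
datasets <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(datasets) <- paste0("data", 1:4)

# Fit y ~ x separately for each data set.
fits <- lapply(datasets, function(d) lm(y ~ x, data = d))

# Collect intercept, slope, and R-squared into one row per data set.
fit_table <- t(sapply(fits, function(f) {
  c(intercept = coef(f)[["(Intercept)"]],
    slope     = coef(f)[["x"]],
    r_squared = summary(f)$r.squared)
}))
print(round(fit_table, 3))
```

The rows should agree with the summaries above: an intercept near 3, a slope near 0.5, and an R-squared near 0.67 for every data set.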
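Conditions for a pair to be appropriate for a linear regression model: linearity, nearly normal residuals, and constant variability. For each data set, these are checked below with a scatterplot, a histogram of the residuals, and a normal Q-Q plot of the residuals.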
par(mfrow=c(2,2))
plot(data1)
hist(lm1$residuals)
qqnorm(lm1$residuals)
qqline(lm1$residuals)
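Data 1 is appropriate for the linear regression model: the scatterplot shows a roughly linear trend with fairly constant variability, and with only 11 observations the residuals show no clear departure from normality.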
par(mfrow=c(2,2))
plot(data2)
hist(lm2$residuals)
qqnorm(lm2$residuals)
qqline(lm2$residuals)
Data 2 is NOT appropriate for the linear regression model: the scatterplot is clearly curved, so it violates the criteria for linearity, nearly normal residuals, and constant variability.
par(mfrow=c(2,2))
plot(data3)
hist(lm3$residuals)
qqnorm(lm3$residuals)
qqline(lm3$residuals)
Data 3 is NOT appropriate for the linear regression model: a single outlier (maximum residual 3.24) violates the criteria for nearly normal residuals and constant variability.
par(mfrow=c(2,2))
plot(data4)
hist(lm4$residuals)
qqnorm(lm4$residuals)
qqline(lm4$residuals)
Data 4 is NOT appropriate for the linear regression model: every x value is 8 except a single point at x = 19, so it violates the criteria for linearity and constant variability.
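The same checks can be wrapped in one helper so that each data set gets identical treatment. This is a sketch, not the method used above: `check_lm` is a hypothetical function, the data sets are rebuilt from R's built-in `anscombe` data frame, and a residuals-versus-fitted panel is added as one common way to judge constant variability.

```r
# Assumption: rebuild data1..data4 from the built-in `anscombe` data frame.
datasets <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(datasets) <- paste0("data", 1:4)

# Hypothetical helper: fit y ~ x and draw four diagnostic panels.
check_lm <- function(d, label) {
  fit <- lm(y ~ x, data = d)
  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))            # restore the previous layout on exit

  # Linearity: scatterplot with the fitted line.
  plot(d$x, d$y, main = paste(label, "- y vs x"), xlab = "x", ylab = "y")
  abline(fit)

  # Constant variability: residuals should scatter evenly around zero.
  plot(fitted(fit), resid(fit), main = "Residuals vs fitted",
       xlab = "fitted", ylab = "residual")
  abline(h = 0, lty = 2)

  # Nearly normal residuals: histogram and normal Q-Q plot.
  hist(resid(fit), main = "Histogram of residuals", xlab = "residual")
  qqnorm(resid(fit))
  qqline(resid(fit))

  invisible(fit)                   # return the model for further use
}

# One page of four panels per data set.
fits <- Map(check_lm, datasets, names(datasets))
```

Returning the fitted model invisibly keeps the console quiet while still allowing follow-up calls such as `summary(fits$data3)`.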
Data visualization is an important step when analyzing data because it exposes patterns in the underlying data. Statistical summaries such as the mean, median, and standard deviation are important, but the four data sets above have nearly identical means, standard deviations, correlations, and regression fits while behaving very differently. Plotting the data either confirms what the summaries suggest or reveals characteristics they hide. Below is an example for data1.
plot(lm1)
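To see why the summaries alone are not enough, the four data sets can be drawn on a common scale with their fitted lines. The sketch below assumes the data match R's built-in `anscombe` data frame; the axis limits and the `slopes` vector are illustrative choices.

```r
# Assumption: the four data sets match the built-in `anscombe` data frame.
par(mfrow = c(2, 2))
slopes <- numeric(4)

for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  slopes[i] <- coef(fit)[["x"]]

  # Same axis limits in every panel so the shapes are directly comparable.
  plot(x, y, main = paste0("data", i),
       xlim = c(3, 20), ylim = c(2, 14), pch = 19)
  abline(fit, col = "red")
}

# The fitted slopes are essentially the same in all four panels.
print(round(slopes, 2))
```

The fitted lines are nearly identical in every panel, yet only one scatterplot looks like a linear relationship with random scatter.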