Final Exam

1 a. In Figure A, the distribution is skewed right. In Figure B, the distribution is almost normal or bell-shaped.

1 b. Figure A is the distribution of an observed variable. Figure B is the sampling distribution of the mean. As the sample size increases, the sampling distribution of the mean becomes closer to normal, even when the observed variable itself is skewed. Sampling does tend to provide accurate estimates because sample means vary less than individual observations and cluster around the population mean.

1 c. This phenomenon is best described by the Central Limit Theorem, which states that if a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model.
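A small simulation can make this concrete. The sketch below is an illustration only: the exponential population, the sample sizes 5 and 50, and the helper names `sample_means` and `skewness` are assumptions chosen for the example, not taken from Figures A and B.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Assumed right-skewed "observed variable": exponential with mean 1.
population <- rexp(100000, rate = 1)

# Draw `reps` random samples of size n and return the mean of each one.
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, n)))
}

# Moment-based skewness: close to 0 for a symmetric distribution.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

means_5  <- sample_means(5)
means_50 <- sample_means(50)

skew <- c(population = skewness(population),
          n5  = skewness(means_5),
          n50 = skewness(means_50))
print(round(skew, 2))
```

The skewness should shrink toward 0 as the sample size grows, which is the pattern the Central Limit Theorem describes.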

2 a.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

options(digits = 2)
mean(data1$x)

## [1] 9

mean(data2$x)

## [1] 9

mean(data3$x)

## [1] 9

mean(data4$x)

## [1] 9

options(digits = 2)
mean(data1$y)

## [1] 7.5

mean(data2$y)

## [1] 7.5

mean(data3$y)

## [1] 7.5

mean(data4$y)

## [1] 7.5

2 b.

options(digits = 2)
median(data1$x)

## [1] 9

median(data2$x)

## [1] 9

median(data3$x)

## [1] 9

median(data4$x)

## [1] 8

options(digits=3)
median(data1$y)

## [1] 7.58

median(data2$y)

## [1] 8.14

median(data3$y)

## [1] 7.11

median(data4$y)

## [1] 7.04

2 c.

options(digits=3)
sd(data1$x)

## [1] 3.32

sd(data2$x)

## [1] 3.32

sd(data3$x)

## [1] 3.32

sd(data4$x)

## [1] 3.32

options(digits=3)
sd(data1$y)

## [1] 2.03

sd(data2$y)

## [1] 2.03

sd(data3$y)

## [1] 2.03

sd(data4$y)

## [1] 2.03

2 d.

options(digits=2)
cor(data1)

##      x    y
## x 1.00 0.82
## y 0.82 1.00

cor(data2)

##      x    y
## x 1.00 0.82
## y 0.82 1.00

cor(data3)

##      x    y
## x 1.00 0.82
## y 0.82 1.00

cor(data4)

##      x    y
## x 1.00 0.82
## y 0.82 1.00

2 e.

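options(digits = 2)
# lm() given a bare data frame uses the first column (x) as the response,
# so the formula is written out to regress y on x.
my1 <- lm(y ~ x, data = data1)
my2 <- lm(y ~ x, data = data2)
my3 <- lm(y ~ x, data = data3)
my4 <- lm(y ~ x, data = data4)

For data 1 we get y = 0.50x + 3.00.

For data 2 we get y = 0.50x + 3.00.

For data 3 we get y = 0.50x + 3.00.

For data 4 we get y = 0.50x + 3.00.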
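The slope and intercept for the regression of y on x can be cross-checked against the summary statistics from 2 a to 2 d, since slope = r * s_y / s_x and intercept = mean(y) - slope * mean(x). The sketch below does this for data1 (redefined here so the block runs on its own); the names `fit`, `slope_check`, and `intercept_check` are introduced only for this example. Note that `lm()` given a bare data frame treats the first column as the response, so the formula `y ~ x` is written out explicitly.

```r
data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                          7.24, 4.26, 10.84, 4.82, 5.68))

# Regress y on x with an explicit formula.
fit   <- lm(y ~ x, data = data1)
coefs <- unname(coef(fit))  # c(intercept, slope)

# Same quantities from the summary statistics alone.
slope_check     <- cor(data1$x, data1$y) * sd(data1$y) / sd(data1$x)
intercept_check <- mean(data1$y) - slope_check * mean(data1$x)

print(round(c(intercept = coefs[1], slope = coefs[2]), 2))
print(round(c(intercept = intercept_check, slope = slope_check), 2))

# For comparison: the formula R builds from the bare data frame
# puts the first column (x) on the left-hand side.
print(formula(data1))
```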

2 f.

summary(my1)$r.squared

## [1] 0.67

summary(my2)$r.squared

## [1] 0.67

summary(my3)$r.squared

## [1] 0.67

summary(my4)$r.squared

## [1] 0.67

plot(data1)
abline(my1)
lines(lowess(data1$y ~ data1$x))

The scatter plot shows a moderately strong, positive linear trend (r = 0.82). The points scatter on both sides of the line of best fit for y on x without any systematic curvature.

hist(my1$residuals)

qqnorm(my1$residuals)
qqline(my1$residuals)

For the regression of y on x, the residuals appear to be nearly normal: the histogram is roughly symmetric, and the points in the normal probability plot fall reasonably close to the line with no extreme outliers.

plot(my1$residuals ~ data1$x, xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 3)
grid()

According to this plot, the variability is roughly constant. The residuals scatter around zero with no clear pattern as x increases.

Therefore it is appropriate to estimate a linear regression model for this pair.

plot(data2)
abline(my2)
lines(lowess(data2$y ~ data2$x))

There is no linear trend shown in this data. The points follow a smooth curve, which the Lowess curve tracks closely while the line of best fit does not.

hist(my2$residuals)

qqnorm(my2$residuals)
qqline(my2$residuals)

For the regression of y on x, the residuals do not appear to be nearly normal: the residuals histogram is skewed left, and the points in the normal probability plot curve away from the line instead of following it.

plot(my2$residuals ~ data2$x, xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 3)
grid()

According to this plot, the residuals follow a clear curved pattern instead of scattering randomly around zero: they are negative at both ends of the x range and positive in the middle. This shows that the relationship between x and y is not linear.

Therefore it is not appropriate to estimate a linear regression model for this pair.
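One way to confirm that the pattern in this pair is curvature rather than noise is to add a squared term and compare the fits. The sketch below is an illustrative check, not part of the original answer; data2 is redefined so the block runs on its own, and the names `linear` and `quadratic` are introduced only for this example.

```r
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1,
                          6.13, 3.1, 9.13, 7.26, 4.74))

# Straight-line fit versus a fit with an added squared term.
linear    <- lm(y ~ x, data = data2)
quadratic <- lm(y ~ x + I(x^2), data = data2)

r2_linear    <- summary(linear)$r.squared
r2_quadratic <- summary(quadratic)$r.squared

print(c(linear = r2_linear, quadratic = r2_quadratic))
```

The quadratic model explains almost all of the variation in y, which supports the conclusion that a straight line is the wrong form for this pair.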

plot(data3)
abline(my3)
lines(lowess(data3$y ~ data3$x))

There is a clear linear trend shown in this data: all but one of the points fall on a straight line. A single outlier at x = 13 pulls the line of best fit away from the other points.

hist(my3$residuals)

qqnorm(my3$residuals)
qqline(my3$residuals)

For the regression of y on x, the residuals do not appear to be nearly normal: the outlier produces one residual far above all the others, which skews the histogram to the right and sits well off the line in the normal probability plot. The remaining points lie close to the line.

plot(my3$residuals ~ data3$x, xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 3)
grid()

This plot does not show random scatter. Apart from the outlier, the residuals decrease steadily from positive to negative as x increases, because the outlier tilts the fitted line.

Although the outlier distorts the fit, we can still be lenient and estimate a linear regression model for this pair because the remaining points show a clear linear trend in the scatter plot. The outlier should be investigated, since it changes the slope.
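To see how much a single point can move the fit for this pair, the model can be refit without the observation that has the largest absolute residual. The sketch below is an illustrative check, not part of the original answer; data3 is redefined so the block runs on its own, and the names `full`, `outlier`, and `trimmed` are introduced only for this example.

```r
data3 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                          6.08, 5.39, 8.15, 6.42, 5.73))

# Fit on all points, then find the row with the largest absolute residual.
full    <- lm(y ~ x, data = data3)
outlier <- unname(which.max(abs(residuals(full))))

# Refit without that row and compare slopes and R-squared.
trimmed <- lm(y ~ x, data = data3[-outlier, ])

print(data3[outlier, ])
print(rbind(full = coef(full), trimmed = coef(trimmed)))
print(c(full = summary(full)$r.squared,
        trimmed = summary(trimmed)$r.squared))
```

Dropping a point is only a diagnostic here; whether the observation should actually be excluded depends on why it is unusual.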

plot(data4)
abline(my4)
lines(lowess(data4$y ~ data4$x))

In this data, there is no relationship that can be estimated from the bulk of the points. All but one of the x-values are 8, so the slope of the fitted line is determined entirely by the single point at x = 19. We do not have to check for any of the other conditions as this dataset is already not appropriate for linear regression analysis.
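The dependence on one observation can be quantified with leverage (hat values), which measure how strongly each point pulls the fitted line toward itself. The sketch below is an illustrative check, not part of the original answer; data4 is redefined so the block runs on its own, and the names `fit4` and `leverage` are introduced only for this example.

```r
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                          5.25, 12.5, 5.56, 7.91, 6.89))

fit4 <- lm(y ~ x, data = data4)

# Leverage of each observation; values lie between 0 and 1.
leverage <- hatvalues(fit4)
print(round(leverage, 2))

# Without row 8 the x-values do not vary, so no slope could be estimated.
print(unique(data4$x[-8]))
```

The point at x = 19 has a leverage of 1, the maximum possible, so the fitted line passes exactly through it.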

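It is important to include appropriate visualizations when analyzing data because they show how the data are distributed and reveal patterns more quickly and easily than summary statistics alone. Here, all four datasets share nearly identical means, standard deviations, correlations, and R-squared values, yet their plots show very different relationships. Plots also make it possible to spot curvature, outliers, and other recurring trends.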

Yadu

December 14, 2015