Problem 1

Part a

Distribution A is centered around 5.0, is strongly right-skewed, and ranges from roughly 0 to 20.

Distribution B is also centered around 5.0, appears nearly normal, and ranges from roughly 3 to 6.5.

Part b

Distribution B has roughly the same mean as distribution A because it is a distribution of sample means \(\bar{x}\), and the sample mean is an unbiased estimator of the population mean \(\mu\); for sufficiently large samples (\(n \gtrsim 30\)), the sample means cluster tightly around \(\mu\). The standard deviation of distribution B is smaller because the standard error of a sample mean scales inversely with the square root of the sample size: \(SE = \frac{\sigma}{\sqrt{n}}\).
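A short simulation can make both effects concrete. The sketch below is a minimal illustration, assuming a right-skewed population resembling distribution A (here an exponential distribution with \(\mu = \sigma = 5\)); it draws repeated samples of size \(n = 30\) and compares the spread of the sample means to \(\frac{\sigma}{\sqrt{n}}\).

set.seed(42)
n <- 30        # sample size
reps <- 10000  # number of repeated samples

# Draw `reps` samples of size n from a right-skewed population
# (exponential with mean 5) and record each sample mean
xbars <- replicate(reps, mean(rexp(n, rate = 1 / 5)))

mean(xbars)  # close to the population mean, 5
sd(xbars)    # close to the theoretical SE, 5 / sqrt(30) ~ 0.91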

Part c

The differences between the distributions described in Part a and explained in Part b are consequences of the Central Limit Theorem.

Problem 2

Parts a-c

# Summary statistics (mean, median, and sd) for each of the four datasets
df1 <- data.frame(x1 = c(mean(data1$x), median(data1$x), sd(data1$x)),
                  y1 = c(mean(data1$y), median(data1$y), sd(data1$y)),
                  row.names = c("mean", "median", "sd"))

df2 <- data.frame(x2 = c(mean(data2$x), median(data2$x), sd(data2$x)),
                  y2 = c(mean(data2$y), median(data2$y), sd(data2$y)),
                  row.names = c("mean", "median", "sd"))

df3 <- data.frame(x3 = c(mean(data3$x), median(data3$x), sd(data3$x)),
                  y3 = c(mean(data3$y), median(data3$y), sd(data3$y)),
                  row.names = c("mean", "median", "sd"))

df4 <- data.frame(x4 = c(mean(data4$x), median(data4$x), sd(data4$x)),
                  y4 = c(mean(data4$y), median(data4$y), sd(data4$y)),
                  row.names = c("mean", "median", "sd"))

print(format(df1, nsmall = 2))
         x1   y1
mean   9.00 7.50
median 9.00 7.58
sd     3.32 2.03
print(format(df2, nsmall = 2))
         x2   y2
mean   9.00 7.50
median 9.00 8.14
sd     3.32 2.03
print(format(df3, nsmall = 2))
         x3   y3
mean   9.00 7.50
median 9.00 7.11
sd     3.32 2.03
print(format(df4, nsmall = 2))
         x4   y4
mean   9.00 7.50
median 8.00 7.04
sd     3.32 2.03

Part d

df_cor <- data.frame(data1 = cor(data1$x, data1$y),
                     data2 = cor(data2$x, data2$y),
                     data3 = cor(data3$x, data3$y),
                     data4 = cor(data4$x, data4$y))

print(format(df_cor, nsmall = 2))
  data1 data2 data3 data4
1  0.82  0.82  0.82  0.82

Parts e & f

fit_1 <- lm(y ~ x, data1)
summary(fit_1)

Call:
lm(formula = y ~ x, data = data1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9213 -0.4558 -0.0414  0.7094  1.8388 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    3.000      1.125    2.67   0.0257 * 
x              0.500      0.118    4.24   0.0022 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 9 degrees of freedom
Multiple R-squared:  0.667, Adjusted R-squared:  0.629 
F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
fit_2 <- lm(y ~ x, data2)
summary(fit_2)

Call:
lm(formula = y ~ x, data = data2)

Residuals:
   Min     1Q Median     3Q    Max 
-1.901 -0.761  0.129  0.949  1.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    3.001      1.125    2.67   0.0258 * 
x              0.500      0.118    4.24   0.0022 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 9 degrees of freedom
Multiple R-squared:  0.666, Adjusted R-squared:  0.629 
F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
fit_3 <- lm(y ~ x, data3)
summary(fit_3)

Call:
lm(formula = y ~ x, data = data3)

Residuals:
   Min     1Q Median     3Q    Max 
-1.159 -0.615 -0.230  0.154  3.241 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    3.002      1.124    2.67   0.0256 * 
x              0.500      0.118    4.24   0.0022 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 9 degrees of freedom
Multiple R-squared:  0.666, Adjusted R-squared:  0.629 
F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
fit_4 <- lm(y ~ x, data4)
summary(fit_4)

Call:
lm(formula = y ~ x, data = data4)

Residuals:
   Min     1Q Median     3Q    Max 
-1.751 -0.831  0.000  0.809  1.839 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    3.002      1.124    2.67   0.0256 * 
x              0.500      0.118    4.24   0.0022 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2 on 9 degrees of freedom
Multiple R-squared:  0.667, Adjusted R-squared:  0.63 
F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

Pair 1

The equation is \(\hat{y} = 3.000 + 0.500 \times x\). The R-squared value is 0.667.

Pair 2

The equation is \(\hat{y} = 3.001 + 0.500 \times x\). The R-squared value is 0.666.

Pair 3

The equation is \(\hat{y} = 3.002 + 0.500 \times x\). The R-squared value is 0.666.

Pair 4

The equation is \(\hat{y} = 3.002 + 0.500 \times x\). The R-squared value is 0.667.
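Rather than transcribing these values from the summaries, the intercepts, slopes, and R-squared values can be pulled directly from the fitted objects. A minimal sketch, assuming the four fits from Parts e & f are still in the workspace:

fits <- list(fit_1, fit_2, fit_3, fit_4)

# Collect the intercept, slope, and R-squared of each fit into one table
t(sapply(fits, function(fit) {
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r.squared = summary(fit)$r.squared)
}))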

Conditions for Inference

Pair 1

There does not appear to be any pattern in the residuals in the scatterplot, so the condition of linearity is satisfied. The histogram does not clearly show that the residuals are normally distributed, but there is no substantial departure from normality. Finally, the scatterplot and Q-Q plot indicate that the residuals have near-constant variability. For these reasons, linear regression is reasonable.

Pair 2

There appears to be a pattern in the residuals in the scatterplot (the residuals start positive, go negative, and then return to positive), so the condition of linearity cannot be accepted. The histogram also does not indicate that the residuals are normally distributed. For these reasons, linear regression is likely not reasonable.

Pair 3

One point in this dataset appears to be an extreme outlier, and excluding it might allow the conditions for linear regression to be satisfied. Before proceeding with the linear model, the cause of this outlier should be investigated further; if that investigation concludes that excluding the point is reasonable, then linear regression is likely reasonable for this dataset.

Pair 4

There does not appear to be any pattern in the residuals in the scatterplot, so the condition of linearity is satisfied. The histogram indicates that the residuals are nearly normally distributed. Finally, the scatterplot and Q-Q plot indicate that the residuals have near-constant variability. For these reasons, linear regression is reasonable.
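The diagnostic plots referenced for each pair can be reproduced with base R graphics. The sketch below defines diagnose(), a hypothetical helper that is not part of the original assignment, to draw the residual scatterplot, histogram, and Q-Q plot for a given fit:

# Residuals vs. fitted, histogram of residuals, and normal Q-Q plot
diagnose <- function(fit) {
  op <- par(mfrow = c(1, 3))  # three panels side by side
  plot(fitted(fit), resid(fit),
       xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs. fitted")
  abline(h = 0, lty = 2)
  hist(resid(fit), xlab = "Residuals", main = "Histogram of residuals")
  qqnorm(resid(fit))  # points near the line suggest near-normal residuals
  qqline(resid(fit))
  par(op)
}

diagnose(fit_1)  # repeat for fit_2 through fit_4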

Data Visualization

Data visualizations are important to the analysis of data: they help describe distributions, identify relationships between variables, and reveal abnormalities in the data. In addition to the histograms and density plots provided in Part a of the assignment, scatterplots can be useful, especially when overlaid with linear regression estimates. An example of each of these plots is provided for each of the four datasets:
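As a minimal sketch, the scatterplot-with-regression overlay can be generated as follows for data1, assuming the ggplot2 package is available (the same pattern applies to data2 through data4):

library(ggplot2)

# Scatterplot of data1 with the least-squares line overlaid
ggplot(data1, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)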

The final visualization in this set identifies one point that, while still fitting the conditions for linear regression as outlined in the previous section, lies well above the values of the other points. This point was not identified in the investigation of conditions, which illustrates a benefit of creating visualizations; the point may warrant further investigation.