Part I Describe the two distributions (2 pts).

Distribution A is the distribution of a randomly sampled set of observations from a survey of the 
larger population. 

It has the following characteristics:

  -  right (positive) skew
  -  unimodal
  -  truncated lower bound at 0
  -  outliers greater than 6.44 (2 standard deviations from the mean)

Distribution B is a sampling distribution.  It shows the variability of the sample means when many 
samples are randomly selected from the population.  Because the sample mean is an unbiased estimator 
computed from independent random samples, the sampling distribution is centered at the true mean of 
the population, and its spread indicates how much the sample means vary from sample to sample (a small 
simulation sketch follows the list below).

It has the following characteristics:

  - near normal (slight left skew)
  - centered at the population mean
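
A minimal simulation sketch of how a distribution like B arises; the population shape, sample size, 
and number of replications below are hypothetical stand-ins rather than the survey data itself:

set.seed(1)
# hypothetical right-skewed population, truncated at 0, standing in for the survey variable
population <- rexp(1e5, rate = 1/3)

# draw many independent random samples of 500 and keep each sample mean
sample_means <- replicate(1000, mean(sample(population, 500)))

hist(sample_means, main = "simulated sampling distribution", xlab = "sample mean")
abline(v = mean(population), lty = 2)  # centered near the population mean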
  

Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

Why the means are similar:
Distribution A is a single random sample of 500 observations from the population, so its mean 
approximates the population mean.  Distribution B is a sampling distribution built by repeatedly 
drawing random samples of 500 observations from the same population, so its mean converges to that 
same population mean.

Why the standard deviations are not:
Distribution A shows the variability of individual observations within a single sample, and a single 
observation can fall far from the population mean, so its spread is wide.  Distribution B shows the 
variability of many independently computed sample means; averaging over 500 observations cancels out 
most of that observation-level variability, so the sample means cluster tightly in a near normal 
distribution around the population mean even though Distribution A indicates the population itself 
is skewed.
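
The size of the gap follows from the standard error of the mean; with a sample size of n = 500, the 
spread of Distribution B is only about 1/22 of the spread of the individual observations:

\[ SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\sigma}{\sqrt{500}} \approx \frac{\sigma}{22.4} \]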

What is the statistical principle that describes this phenomenon (2 pts)?

The Central Limit Theorem

Per the informal description from "OpenIntro Statistics":
If a sample consists of at least 30 independent observations and the data are not strongly skewed, 
then the distribution of the sample mean is well approximated by a normal model.

This is the case for Distribution B, the sampling distribution.

Part II Consider the four datasets, each with two columns (x and y), provided below.

#a) b) & c) - mean, median and std dev for combined xy columns "data0"
sprintf("%.2f", meta[1,]$meanx)
## [1] "9.00"
sprintf("%.2f", meta[1,]$meany)
## [1] "7.50"
sprintf("%.2f", meta[1,]$medianx)
## [1] "8.00"
sprintf("%.2f", meta[1,]$mediany)
## [1] "7.52"
sprintf("%.2f", meta[1,]$sdx)
## [1] "3.20"
sprintf("%.2f", meta[1,]$sdy)
## [1] "1.96"
#d) correlation for each x,y pair
cor(data1$x,data1$y)
## [1] 0.82
cor(data2$x,data2$y)
## [1] 0.82
cor(data3$x,data3$y)
## [1] 0.82
cor(data4$x,data4$y)
## [1] 0.82
#e) & f) linear regression equation and R-squared for each x,y pair
xy1 <- lm(x ~ y, data = data1)
summary(xy1)
## 
## Call:
## lm(formula = x ~ y, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.652 -1.512 -0.266  1.234  3.895 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -0.998      2.434   -0.41   0.6916   
## y              1.333      0.314    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217

Regression Equation for xy1 (the model regresses x on y) \[ \hat{x} = -0.998 + 1.333 * y \] R-squared: 0.667

xy2 <- lm(x ~ y, data = data2)
summary(xy2)
## 
## Call:
## lm(formula = x ~ y, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.852 -1.432 -0.344  0.847  4.202 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -0.995      2.435   -0.41   0.6925   
## y              1.332      0.314    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

Regression Equation for xy2 \[ \hat{x} = -0.995 + 1.332 * y \] R-squared: 0.666

xy3 <- lm(x ~ y, data = data3)
summary(xy3)
## 
## Call:
## lm(formula = x ~ y, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.987 -1.373 -0.027  1.320  3.213 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -1.000      2.436   -0.41   0.6910   
## y              1.333      0.315    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

Regression Equation for xy3 \[ \hat{x} = -1.000 + 1.333 * y \] R-squared: 0.666

xy4 <- lm(x ~ y, data = data4)
summary(xy4)
## 
## Call:
## lm(formula = x ~ y, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.786 -1.412 -0.185  1.455  3.333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -1.004      2.435   -0.41   0.6898   
## y              1.334      0.314    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

Regression Equation for xy4 \[ \hat{x} = -1.004 + 1.334 * y \] R-squared: 0.667
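
Note that in simple linear regression R-squared is the square of the correlation from part d), which is why all four pairs report essentially the same value:

\[ R^2 = r^2 \approx 0.82^2 \approx 0.67 \]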

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

It is appropriate to estimate a linear regression model only for the first pair, data1. Sets 2 and 3 fail the linearity condition (set 2 has a curved relationship and set 3 has an influential outlier), and neither has constant variability in its residuals: the residual pattern is curved for set 2 and grows for set 3. Finally, set 4 appears to involve a dichotomous predictor whose points do not follow the regression line at all; it fails the linearity, nearly normal residuals, and constant variability conditions. The scatterplots below, and the residual plots that follow them, illustrate these patterns.

plot(data1$x ~ data1$y)
abline(xy1)

plot(data2$x ~ data2$y)
abline(xy2)

plot(data3$x ~ data3$y)
abline(xy3)

plot(data4$x ~ data4$y)
abline(xy4)
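
To support the claims about residual patterns, the residuals can also be plotted against the fitted values for each model; a sketch reusing the fitted objects from above (the layout choices are my own):

# residuals vs. fitted values; non-random patterns signal violations of the
# linearity or constant variability conditions
par(mfrow = c(2, 2))
models <- list(xy1 = xy1, xy2 = xy2, xy3 = xy3, xy4 = xy4)
for (nm in names(models)) {
  plot(fitted(models[[nm]]), resid(models[[nm]]),
       main = nm, xlab = "fitted values", ylab = "residuals")
  abline(h = 0, lty = 2)
}
par(mfrow = c(1, 1))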

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

The regression models for all four pairs show nearly identical summary statistics; the differences only become apparent in the shape of the points and residuals around each regression line. Without visualizing this, sets 2, 3 & 4 might be treated as good fits for linear regression when they are not, for the reasons outlined in the previous question.

More generally, visualization helps identify significant outliers, assess how closely the data follow a linear model, and check whether the residuals have constant variability around the regression line.
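
One such visualization is the four scatterplots placed side by side, which makes the contrast between identical summary statistics and very different shapes immediate (a sketch reusing the objects from above):

# side-by-side comparison of the four x,y pairs with their fitted lines
par(mfrow = c(2, 2))
plot(data1$x ~ data1$y, main = "data1"); abline(xy1)
plot(data2$x ~ data2$y, main = "data2"); abline(xy2)
plot(data3$x ~ data3$y, main = "data3"); abline(xy3)
plot(data4$x ~ data4$y, main = "data4"); abline(xy4)
par(mfrow = c(1, 1))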