Please put the answers for Part I next to the question number (2pts each):
daysDrive is both quantitative and discrete because it takes an integer value that can be interpreted (e.g., 2 days driven is twice as large as one day driven).
a) and the quasi-experiment b) would give results that could be interpreted causally.
a) The large \(\chi^2\) implies that there is a difference between the means of the groups, implying an association.
moments <- c(26, 37, 45, 49.8, 65, 44.4, 8.4, 50)
lab <- c('min', 'Q1', 'median', 'Q3', 'max', 'mean', 'sd', 'n')
names(moments) <- lab
moments
## min Q1 median Q3 max mean sd n
## 26.0 37.0 45.0 49.8 65.0 44.4 8.4 50.0
highSD <- moments[['mean']] + moments[['sd']]* 3
lowSD <- moments[['mean']] - moments[['sd']]* 3
highSD
## [1] 69.6
lowSD
## [1] 19.2
i_q_r <- moments['Q3'] - moments['Q1']
iqrLower <- moments['Q1'] - 1.5 * i_q_r
iqrHigher <- moments['Q3'] + 1.5 * i_q_r
iqrHigher
## Q3
## 69
iqrLower
## Q1
## 17.8
The data is roughly symmetric (balanced), so the SD and IQR methods give similar cutoffs here. In general the IQR method is preferable because it is resistant to skewed data. Thus, the answer is b).
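To illustrate only (with made-up numbers, not the exam data), a minimal sketch of the 1.5 * IQR rule applied to a hypothetical vector commuteTimes:
# Sketch (hypothetical data): flag values outside the 1.5 * IQR fences.
iqrOutliers <- function(v){
  q <- quantile(v, c(0.25, 0.75))
  fence <- 1.5 * (q[[2]] - q[[1]])
  v[v < q[[1]] - fence | v > q[[2]] + fence]
}
commuteTimes <- c(26, 30, 37, 41, 45, 48, 50, 52, 65, 110)  # 110 is a planted outlier
iqrOutliers(commuteTimes)  # returns 110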
d) The median and interquartile range are resistant to outliers, whereas the mean and standard deviation are not.
7a. Describe the two distributions (2pts).
* Distribution A is skewed to the right.
* Distribution B is approximately normal.
* Both are unimodal.
7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).
The means are similar because the expected value \(E(\bar{x})\) of each sample of \(n = 30\) equals the population mean, so the sampling distribution (B) is centred at the same place as the population (A).
The standard deviations differ because averaging 30 observations pulls each sample mean toward the population mean, which shrinks the spread of distribution B relative to A:
\[ SE = \frac{\sigma}{\sqrt{n}} = \frac{3.22}{\sqrt{30}} \approx 0.588 \]
The standard deviation of the sampling distribution of the mean equals this standard error. Therefore, since \(\sqrt{30} > 1\), the SD of the sampling distribution is smaller than the population SD.
7c. What is the statistical principle that describes this phenomenon (2 pts)?
The statistical principle is the central limit theorem. It states that, given a large enough sample size and finite variance, the distribution of sample means is approximately normal, centred on the population mean with standard deviation \(\sigma/\sqrt{n}\).
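A quick simulation sketch (illustrative only; the skewed population and its sd of roughly 3.22 are made up to mirror the problem, not taken from the assignment data) shows both parts of this: the sample means come out roughly normal, and their SD is close to \(\sigma/\sqrt{30}\).
# Sketch: simulate many samples of n = 30 from a skewed population.
set.seed(42)
population <- rexp(1e5, rate = 1/3.22)                  # right-skewed population, sd ~ 3.22
sampleMeans <- replicate(5000, mean(sample(population, 30)))
sd(population)                                          # close to 3.22
sd(sampleMeans)                                         # close to 3.22 / sqrt(30), about 0.59
hist(sampleMeans)                                       # roughly normal, centred on mean(population)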
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
library(dplyr)
#dfNames <- c('data1', 'data2', 'data3', 'data4')
dfNames <- list(data1, data2, data3, data4)

# Apply a summary function to the x and y columns of one data frame
dfFunc <- function(df, dfOpp){
  df %>% summarise(xStat = dfOpp(x),
                   yStat = dfOpp(y))
}

# Print the chosen summary statistic for each of the four datasets
loopFunc <- function(opper){
  for (i in dfNames){
    x <- dfFunc(i, opper)
    print(x)
  }
}
loopFunc(mean)
## xStat yStat
## 1 9 7.5
## xStat yStat
## 1 9 7.5
## xStat yStat
## 1 9 7.5
## xStat yStat
## 1 9 7.5
loopFunc(median)
## xStat yStat
## 1 9 7.6
## xStat yStat
## 1 9 8.1
## xStat yStat
## 1 9 7.1
## xStat yStat
## 1 8 7
loopFunc(sd)
## xStat yStat
## 1 3.3 2
## xStat yStat
## 1 3.3 2
## xStat yStat
## 1 3.3 2
## xStat yStat
## 1 3.3 2
for (i in dfNames){
  print(cor(i$x, i$y))
}
## [1] 0.82
## [1] 0.82
## [1] 0.82
## [1] 0.82
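The same summaries could also be collected into one table; a minimal sketch using base R's sapply over the same dfNames list (an equivalent alternative to the loops above, not part of the original code):
# Sketch: one matrix of per-dataset summaries, one column per dataset.
sapply(dfNames, function(d) c(xMean = mean(d$x), yMean = mean(d$y),
                              xSD   = sd(d$x),   ySD   = sd(d$y),
                              r     = cor(d$x, d$y)))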
lm1 <- lm(y~x, data1)
lm2 <- lm(y~x, data2)
lm3 <- lm(y~x, data3)
lm4 <- lm(y~x, data4)
lmList <- list(lm1, lm2, lm3, lm4)
j <- 1
for (i in lmList){
  print(paste('The R-squared for model', j, 'is:', round(summary(i)$r.squared, 2), sep = " "))
  j <- j + 1
}
## [1] "The R-squared for model 1 is: 0.67"
## [1] "The R-squared for model 2 is: 0.67"
## [1] "The R-squared for model 3 is: 0.67"
## [1] "The R-squared for model 4 is: 0.67"
library(ggplot2)
for (i in dfNames){
  plot <- ggplot(i, aes(x, y)) +
    geom_point()
  print(plot)
}
for (i in lmList){
  par(mfrow = c(1, 2))
  plot(i$residuals)
  qqnorm(i$residuals)
  qqline(i$residuals)
}
For each dataset:
Data 1: A linear model looks like it would be a good approximation because there are no trends in the residuals.
Data 2: A linear model is not appropriate because the data looks like it follows a square-root curve:
data2Test <- data2
data2Test$x <- sqrt(data2Test$x)
ggplot(data2Test, aes(x, y)) + geom_point()
The transformed data could then be used for linear inference.
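As a rough check (a sketch only; lm2Trans is a new name introduced here, and lm2 is the earlier fit on the raw data2), the fit on the transformed predictor can be compared with the original fit:
lm2Trans <- lm(y ~ x, data2Test)      # fit on the sqrt-transformed x
summary(lm2)$r.squared                # R-squared on the raw x
summary(lm2Trans)$r.squared           # R-squared on sqrt(x)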
Data 3: This is hard to judge. The data looks very linear except for one outlier. If there were more data, I would have no problem using a linear model.
Data 4: This data is not linear; since x takes essentially only two values, it looks more like a classification problem, so a logistic regression would be more appropriate than a linear model.
Visualization is very important when analyzing data because datasets that look very different, and behave very differently, can still have nearly identical summary statistics.