Data about students' cars (car, color, daysDrive, gasMonth) is presented below. Which of the variables are quantitative and discrete?
car: 1 = compact, 2 = standard size, 3 = mini van, 4 = SUV, 5 = truck
color: red, blue, green, black, white
daysDrive: number of days per week the student drives
gasMonth: the amount of money the student spends on gas per month
Only daysDrive is quantitative and discrete; gasMonth is quantitative but continuous, while car and color are categorical.
Given the histogram of the data, which estimates of the mean and median are most plausible?
# Mean estimated from the histogram: for each bin, (percent of the 132 cases) * (bin midpoint)
((2.5/100)*132*1.9 + (2.5/100)*132*2.1 + (5/100)*132*2.5 + (5/100)*132*2.7 + (14/100)*132*2.9 + (5/100)*132*3.1 + (13/100)*132*3.3 + (13/100)*132*3.5 + (21/100)*132*3.7 + (20/100)*132*3.9)/132
## [1] 3.362
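The same calculation can be written more compactly in vectorized form; a minimal sketch using the bin percentages and midpoints from the chunk above (the factor of 132 cancels, leaving a weighted mean of the bin midpoints):
pct <- c(2.5, 2.5, 5, 5, 14, 5, 13, 13, 21, 20)              # percent of cases per bin
mids <- c(1.9, 2.1, 2.5, 2.7, 2.9, 3.1, 3.3, 3.5, 3.7, 3.9)  # bin midpoints
sum(pct / 100 * mids)                                        # weighted mean
## [1] 3.362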
The question concerns the association between hair color (blond, brown, red) and eye color (blue, green, brown). If a large \({ \chi }^{ 2 }\) test statistic is obtained, this suggests that the observed counts differ from the counts expected under independence, i.e., that the two variables are associated.
Participants completed a standard memory task, and the researcher wants to produce a boxplot to examine the distribution of their scores. Below are summary statistics from the memory task. What values should the researcher use to determine whether a particular score is a potential outlier in the boxplot?
# A score is a potential outlier if it falls below Q1 - 1.5*IQR or above Q3 + 1.5*IQR; here Q1 = 37 and Q3 = 49.8
37 - 1.5*(49.8 - 37)    # lower fence: Q1 - 1.5*IQR
## [1] 17.8
49.8 + 1.5*(49.8 - 37)  # upper fence: Q3 + 1.5*IQR
## [1] 69
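If the raw scores were available, the same fences could be computed directly; a minimal sketch, where the vector name scores is hypothetical:
# Potential-outlier fences for any numeric vector of scores
outlier_fences <- function(scores) {
  q1 <- unname(quantile(scores, 0.25))
  q3 <- unname(quantile(scores, 0.75))
  iqr <- q3 - q1                        # same as IQR(scores)
  c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
}
With Q1 = 37 and Q3 = 49.8 this reproduces the fences 17.8 and 69 computed above.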
Figure A below represents the distribution of an observed variable. Figure B below represents the distribution of the mean from 500 random samples of size 30 from A. The mean of A is 5.05 and the mean of B is 5.04. The standard deviations of A and B are 3.22 and 0.58, respectively.
Figure A shows that the population distribution is unimodal and right skewed: the observations are concentrated at the lower end, which pulls the mean to the right of the median.
Figure B shows that the distribution of the sample means is approximately symmetric and much narrower, as expected for samples of size 30.
Figure B represents the distribution of the mean from 500 random samples of size 30 from A. Because each sample consists of 30 independent observations and the data are not strongly skewed, the sample means follow an approximately normal distribution centered near the mean of A. The standard deviation of B corresponds to the standard error of the mean estimated from the data:
sd <- 3.22          # standard deviation of the observed variable (Figure A)
n <- 30             # sample size
SE <- sd / sqrt(n)  # standard error of the mean
SE
## [1] 0.5878889
The standard error, SE, is 0.5879, which closely matches the standard deviation of B (0.58).
This is the principle behind the Central Limit Theorem: if a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model.
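As a quick check of this principle, here is a minimal simulation sketch. The gamma population below is a hypothetical stand-in for A, chosen to be unimodal and right skewed with mean near 5.05 and standard deviation near 3.22, since the raw data are not reproduced here:
set.seed(1)
population <- rgamma(100000, shape = 2.46, scale = 2.05)      # right-skewed stand-in for A
sample_means <- replicate(500, mean(sample(population, 30)))  # 500 samples of size 30
mean(sample_means)   # close to the population mean, about 5.04
sd(sample_means)     # close to sd(population) / sqrt(30), about 0.59
hist(sample_means)   # approximately normal, like Figure B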
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
#mean of data1
mean(data1$x)
## [1] 9
mean(data1$y)
## [1] 7.5
#mean of data2
mean(data2$x)
## [1] 9
mean(data2$y)
## [1] 7.5
#mean of data3
mean(data3$x)
## [1] 9
mean(data3$y)
## [1] 7.5
#mean of data4
mean(data4$x)
## [1] 9
mean(data4$y)
## [1] 7.5
#median of data1
median(data1$x)
## [1] 9
median(data1$y)
## [1] 7.6
#median of data2
median(data2$x)
## [1] 9
median(data2$y)
## [1] 8.1
#median of data3
median(data3$x)
## [1] 9
median(data3$y)
## [1] 7.1
#median of data4
median(data4$x)
## [1] 8
median(data4$y)
## [1] 7
#Standard deviation of data1
sd(data1$x)
## [1] 3.3
sd(data1$y)
## [1] 2
#Standard deviation of data2
sd(data2$x)
## [1] 3.3
sd(data2$y)
## [1] 2
#Standard deviation of data3
sd(data3$x)
## [1] 3.3
sd(data3$y)
## [1] 2
#Standard deviation of data4
sd(data4$x)
## [1] 3.3
sd(data4$y)
## [1] 2
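The twelve chunks above can also be produced in a single call; a compact sketch using the data frames already defined:
datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)
round(sapply(datasets, function(d)
  c(mean_x = mean(d$x), mean_y = mean(d$y),
    median_x = median(d$x), median_y = median(d$y),
    sd_x = sd(d$x), sd_y = sd(d$y))), 2)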
For each x and y pair, calculate (also to two decimal places; 1 pt):
cor(data1)
## x y
## x 1.00 0.82
## y 0.82 1.00
cor(data2)
## x y
## x 1.00 0.82
## y 0.82 1.00
cor(data3)
## x y
## x 1.00 0.82
## y 0.82 1.00
cor(data4)
## x y
## x 1.00 0.82
## y 0.82 1.00
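The single x-y correlation for each dataset can also be pulled out directly rather than read off the matrices; a one-line sketch reusing the datasets list defined above:
sapply(datasets, function(d) cor(d$x, d$y))  # all four are approximately 0.82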
data1_regression <- lm(x ~ y, data = data1)
summary(data1_regression)
##
## Call:
## lm(formula = x ~ y, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.652 -1.512 -0.266 1.234 3.895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.998 2.434 -0.41 0.6916
## y 1.333 0.314 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
data1 equation (regression of x on y, as fit above): x = -0.9975 + 1.3328 * y
data2_regression <- lm(x ~ y, data = data2)
summary(data2_regression)
##
## Call:
## lm(formula = x ~ y, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.852 -1.432 -0.344 0.847 4.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.995 2.435 -0.41 0.6925
## y 1.332 0.314 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
data2 equation (regression of x on y): x = -0.9948 + 1.3325 * y
data3_regression <- lm(x ~ y, data = data3)
summary(data3_regression)
##
## Call:
## lm(formula = x ~ y, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.987 -1.373 -0.027 1.320 3.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.000 2.436 -0.41 0.6910
## y 1.333 0.315 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
data3 equation (regression of x on y): x = -1.0003 + 1.3334 * y
data4_regression <- lm(x ~ y, data = data4)
summary(data4_regression)
##
## Call:
## lm(formula = x ~ y, data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.786 -1.412 -0.185 1.455 3.333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004 2.435 -0.41 0.6898
## y 1.334 0.314 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
data4 equation (regression of x on y): x = -1.0036 + 1.3337 * y
summary(data1_regression)$r.squared
## [1] 0.67
summary(data2_regression)$r.squared
## [1] 0.67
summary(data3_regression)$r.squared
## [1] 0.67
summary(data4_regression)$r.squared
## [1] 0.67
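To confirm at a glance that the four fits are nearly identical, the coefficients can be tabulated straight from the model objects; a minimal sketch:
fits <- list(data1_regression, data2_regression, data3_regression, data4_regression)
sapply(fits, coef)  # intercepts all near -1.00, slopes all near 1.33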
plot(x ~ y, data1)
abline(data1_regression)
hist(data1_regression$residuals)
qqnorm(data1_regression$residuals)
qqline(data1_regression$residuals)
Data1 is appropriate for simple linear regression: the relationship is linear, the variability of the residuals is relatively constant, and the residuals are nearly normally distributed.
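Base R can draw equivalent diagnostics directly from the fitted model object, giving the residuals-vs-fitted and normal Q-Q views in one step:
plot(data1_regression, which = 1:2)  # residuals vs fitted, then normal Q-Q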
plot(x ~ y, data2)
abline(data2_regression)
hist(data2_regression$residuals)
qqnorm(data2_regression$residuals)
qqline(data2_regression$residuals)
Data2 is not well described by a linear regression: the relationship is curved, the variability of the residuals is not constant, and the residual distribution is strongly skewed.
plot(x ~ y, data3)
abline(data3_regression)
hist(data3_regression$residuals)
qqnorm(data3_regression$residuals)
qqline(data3_regression$residuals)
Data3 is not well described by this linear fit: the residual distribution is bimodal rather than nearly normal, the residual variability is not constant, and a single outlier pulls the fitted line away from the otherwise tight linear trend.
plot(x ~ y, data4)
abline(data4_regression)
hist(data4_regression$residuals)
Data4 is not well described by a linear regression: a single extreme point determines the slope, and the variability of the residuals is not constant.
It is very important to use graphs and visualization when analyzing data. As data1 through data4 above show, all four datasets have nearly identical means, medians, standard deviations, correlations, and regression coefficients, so comparing only the summary statistics makes it hard to identify the differences in their distributions and relationships. Visualization makes it far easier to compare the range, trend, and form of each relationship, for example revealing that only the residuals of data1 are nearly normal with constant variability.
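A compact way to make this four-way visual comparison is a single 2-by-2 panel of scatterplots with fitted lines; a minimal sketch consistent with the x ~ y fits used throughout:
par(mfrow = c(2, 2))                  # 2 x 2 grid of panels
for (d in list(data1, data2, data3, data4)) {
  plot(x ~ y, data = d)               # scatterplot for each dataset
  abline(lm(x ~ y, data = d))         # identical-looking fitted line in each panel
}
par(mfrow = c(1, 1))                  # restore the default single panel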