Please put the answers for Part I next to the question number (2pts each):
car: 1 = compact, 2 = standard size, 3 = minivan, 4 = SUV, 5 = truck
color: red, blue, green, black, white
daysDrive: number of days per week the student drives
gasMonth: the amount of money the student spends on gas per month
Answer: B
Explanation: A quantitative variable whose possible values are only specific points on a scale is a discrete variable. By this definition, daysDrive (the number of days per week the student drives) is the best choice.
(Figure referenced in the question; image not reproduced here.)
Answer: A) mean = 3.3, median = 3.5
Explanation: By elimination. Options B and D are ruled out because the graph is left-skewed, so the mean must be less than the median. Options C and E meet that criterion, but from the graph the median cannot be as high as 3.8. That leaves option A.
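A small illustrative check of the left-skew reasoning (the values below are invented for this sketch, not taken from the exam's figure): a long left tail pulls the mean below the median.
# Illustrative only: a left-skewed sample (made-up values)
skewed_left <- c(1, 2, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4, 4, 4)
mean(skewed_left)    # dragged down by the low values in the left tail
median(skewed_left)  # larger than the mean, as expected for left skew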
there is a difference between average eye color and average hair color.
a person’s hair color is determined by his or her eye color.
there is an association between natural hair color and eye color.
eye color and natural hair color are independent.
Answer: C, there is an association between natural hair color and eye color.
Explanation: χ² (chi-square) measures the discrepancy between the observed frequencies and the frequencies expected under independence. Since χ² is large, the observed and expected frequencies are far apart, so independence is rejected and the data suggest an association between natural hair color and eye color.
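A minimal sketch of that logic in R (the counts below are invented for illustration and are not the exam's table): chisq.test() computes the expected counts under independence, and a large X-squared statistic goes with a small p-value, i.e., evidence of an association.
# Hypothetical hair-by-eye-color counts, invented purely for illustration
tbl <- matrix(c(40, 10, 15,
                 5, 30, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(eye = c("blue", "brown"),
                              hair = c("blonde", "black", "red")))
chisq.test(tbl)  # large X-squared and tiny p-value -> reject independence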
min   Q1   median   Q3     max   mean   sd    n
26    37   45       49.8   65    44.4   8.4   50
min <- 26
Q1 <- 37
median <- 45
Q3 <- 49.8
max <- 65
mean <- 44.4
sd <- 8.4
n <- 50
IQR <- Q3 - Q1   # interquartile range: 49.8 - 37 = 12.8
Upper_limit <- Q3 + 1.5 * IQR
Upper_limit
## [1] 69
Lower_Limit <- Q1 - 1.5 * IQR
Lower_Limit
## [1] 17.8
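As a quick follow-up sketch (not explicitly asked for), the reported extremes can be compared against the 1.5 * IQR fences computed above; min and max here are the summary values assigned earlier, not the base R functions.
max > Upper_limit   # FALSE: 65 lies inside the upper fence of 69
min < Lower_Limit   # FALSE: 26 lies inside the lower fence of 17.8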
Answer: B
Answer: D
The median and IQR are resistant to outliers, whereas the mean and standard deviation are not.
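A small sketch of that point (the numbers are invented for illustration): adding a single extreme value barely moves the median and IQR but shifts the mean and inflates the standard deviation.
# Illustrative only: the same data with and without one extreme value
vals         <- c(26, 37, 40, 45, 48, 50, 55, 65)
vals_outlier <- c(vals, 200)
c(median(vals), median(vals_outlier))  # medians stay close
c(IQR(vals),    IQR(vals_outlier))     # IQRs stay close
c(mean(vals),   mean(vals_outlier))    # mean jumps toward the outlier
c(sd(vals),     sd(vals_outlier))      # standard deviation inflates sharply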
(Figures A and B referenced in the question; images not reproduced here.)
Answer:
Figure A (observations): the distribution is unimodal and skewed to the right, so the mean is greater than the median. Its spread is wider than that of Figure B.
Figure B (sampling distribution): the distribution is unimodal and fairly symmetric, and its spread is much narrower than that of Figure A.
Answer: Figure B shows the distribution of sample means, each computed from a random sample of size 30 drawn from the population in Figure A (500 samples in total). Because the samples are random and independent, the center of the sampling distribution stays close to the population mean, while its standard deviation (the standard error) shrinks to SD / sqrt(n).
Standard_error <- 3.22/sqrt(30)
Standard_error
## [1] 0.5878889
The statistical principle at work is the Central Limit Theorem (CLT), since its conditions are satisfied: (1) the samples are random and independent, and (2) the data are not strongly skewed and the sample size (30) is reasonably large. As a result, the sampling distribution of the mean is approximately normal.
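A short simulation sketch of the CLT argument (the exponential population below is just an illustrative stand-in for a skewed distribution, not the exam's data): sample means of size 30 pile up around the population mean with spread close to SD / sqrt(n).
# Illustrative CLT simulation: 500 sample means of size 30 from a skewed population
set.seed(1)
pop_mean <- 5                     # mean and SD of this stand-in exponential population
n_obs    <- 30
sample_means <- replicate(500, mean(rexp(n_obs, rate = 1 / pop_mean)))
mean(sample_means)                # close to the population mean of 5
sd(sample_means)                  # close to SD / sqrt(n) = 5 / sqrt(30), about 0.91
hist(sample_means)                # roughly symmetric and bell-shaped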
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
For data1
x1 <- mean(data1$x)
y1 <- mean(data1$y)
summary(data1)
## x y
## Min. : 4.0 Min. : 4.3
## 1st Qu.: 6.5 1st Qu.: 6.3
## Median : 9.0 Median : 7.6
## Mean : 9.0 Mean : 7.5
## 3rd Qu.:11.5 3rd Qu.: 8.6
## Max. :14.0 Max. :10.8
x1
## [1] 9
y1
## [1] 7.5
For data2
x2 <- mean(data2$x)
y2 <- mean(data2$y)
summary(data2)
## x y
## Min. : 4.0 Min. :3.1
## 1st Qu.: 6.5 1st Qu.:6.7
## Median : 9.0 Median :8.1
## Mean : 9.0 Mean :7.5
## 3rd Qu.:11.5 3rd Qu.:8.9
## Max. :14.0 Max. :9.3
x2
## [1] 9
y2
## [1] 7.5
For data3
x3 <- mean(data3$x)
y3 <- mean(data3$y)
summary(data3)
## x y
## Min. : 4.0 Min. : 5.4
## 1st Qu.: 6.5 1st Qu.: 6.2
## Median : 9.0 Median : 7.1
## Mean : 9.0 Mean : 7.5
## 3rd Qu.:11.5 3rd Qu.: 8.0
## Max. :14.0 Max. :12.7
x3
## [1] 9
y3
## [1] 7.5
For data4
x4 <- mean(data4$x)
y4 <- mean(data4$y)
summary(data4)
## x y
## Min. : 8 Min. : 5.2
## 1st Qu.: 8 1st Qu.: 6.2
## Median : 8 Median : 7.0
## Mean : 9 Mean : 7.5
## 3rd Qu.: 8 3rd Qu.: 8.2
## Max. :19 Max. :12.5
x4
## [1] 9
y4
## [1] 7.5
For data1
median(data1$x)
## [1] 9
median(data1$y)
## [1] 7.6
For data2
median(data2$x)
## [1] 9
median(data2$y)
## [1] 8.1
For data3
median(data3$x)
## [1] 9
median(data3$y)
## [1] 7.1
For data4
median(data4$x)
## [1] 8
median(data4$y)
## [1] 7
For data1
sd(data1$x)
## [1] 3.3
sd(data1$y)
## [1] 2
For data2
sd(data2$x)
## [1] 3.3
sd(data2$y)
## [1] 2
For data3
sd(data3$x)
## [1] 3.3
sd(data3$y)
## [1] 2
For data4
sd(data4$x)
## [1] 3.3
sd(data4$y)
## [1] 2
For data1
cor(data1)
## x y
## x 1.00 0.82
## y 0.82 1.00
For data2
cor(data2)
## x y
## x 1.00 0.82
## y 0.82 1.00
For data3
cor(data3)
## x y
## x 1.00 0.82
## y 0.82 1.00
For data4
cor(data4)
## x y
## x 1.00 0.82
## y 0.82 1.00
e1 <- lm(y ~ x,data = data1)
summary(e1)
##
## Call:
## lm(formula = y ~ x, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9213 -0.4558 -0.0414 0.7094 1.8388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000 1.125 2.67 0.0257 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00217
Equation for data1: y = 3.000 + 0.500x
e2 <- lm(y ~ x,data = data2)
summary(e2)
##
## Call:
## lm(formula = y ~ x, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.901 -0.761 0.129 0.949 1.269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.67 0.0258 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
Equation for data2: y = 3.001 + 0.500x
e3 <- lm(y ~ x,data = data3)
summary(e3)
##
## Call:
## lm(formula = y ~ x, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.159 -0.615 -0.230 0.154 3.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.629
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00218
Equation for data3: y = 3.002 + 0.500x
e4 <- lm(y ~ x,data = data4)
summary(e4)
##
## Call:
## lm(formula = y ~ x, data = data4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.002 1.124 2.67 0.0256 *
## x 0.500 0.118 4.24 0.0022 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared: 0.667, Adjusted R-squared: 0.63
## F-statistic: 18 on 1 and 9 DF, p-value: 0.00216
Equation for data4: y = 3.002 + 0.500x
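As a cross-check, the fitted coefficients can also be pulled out programmatically with coef(); stacking them shows the four regression equations are essentially identical.
# Coefficients of the four fitted models, one row per dataset
rbind(data1 = coef(e1), data2 = coef(e2), data3 = coef(e3), data4 = coef(e4))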
For data1
summary(e1)$r.squared
## [1] 0.67
For data2
summary(e2)$r.squared
## [1] 0.67
For data3
summary(e3)$r.squared
## [1] 0.67
For data4
summary(e4)$r.squared
## [1] 0.67
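The per-dataset calculations above can also be collected in a single pass; a compact sketch (the same computations, just looped over a list of the four data frames):
# Means, standard deviations, correlations, and fitted coefficients in one table
datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)
t(sapply(datasets, function(d) {
  fit <- lm(y ~ x, data = d)
  c(mean_x = mean(d$x), mean_y = mean(d$y),
    sd_x = sd(d$x), sd_y = sd(d$y),
    r = cor(d$x, d$y),
    intercept = unname(coef(fit)[1]), slope = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared)
}))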
Conditions to check for a linear regression model: (1) linearity of the relationship, (2) nearly normal residuals, (3) constant variability of the residuals, and (4) independent observations.
Plots for data1
par(mfrow=c(2,2))
plot(data1$x, data1$y)
hist(e1$residuals)
qqnorm(e1$residuals)
qqline(e1$residuals)
The scatterplot for data1 shows a moderately strong linear relationship (consistent with r ≈ 0.82), the Q-Q plot of the residuals is close to normal apart from a couple of points, and the histogram is hard to judge with only eleven residuals. Overall, a linear regression model is reasonable for data1.
Plots for data2
par(mfrow=c(2,2))
plot(data2$x, data2$y)
hist(e2$residuals)
qqnorm(e2$residuals)
qqline(e2$residuals)
The scatterplot for data2 shows a clearly curved pattern, so the relationship is not linear and a linear regression model is not appropriate.
Plots for data3
par(mfrow=c(2,2))
plot(data3$x, data3$y)
hist(e3$residuals)
qqnorm(e3$residuals)
qqline(e3$residuals)
The scatterplot for data3 is almost perfectly linear except for one extreme outlier, which pulls the fitted line toward itself and makes the residuals non-normal (visible in both the histogram and the Q-Q plot). Because of this influential outlier, the fitted linear model is not appropriate without first investigating that point.
Plots for data4
par(mfrow=c(2,2))
plot(data4$x, data4$y)
hist(e4$residuals)
qqnorm(e4$residuals)
qqline(e4$residuals)
For data4, all but one observation share the same x value (x = 8), so the apparent trend is driven entirely by a single influential point; the histogram of residuals is not normal and there is no genuine linear relationship between x and y. A linear regression model is not appropriate.
Visualization plays a very important role in analyzing data. Plots help us spot outliers, check model assumptions, and build conclusions and predictions for a dataset; a good graphic summarizes what the dataset is really about. The four datasets above make the point: their means, standard deviations, correlations, and fitted regression lines are essentially identical, yet the plots reveal four very different relationships, and only the graphs show which datasets are actually suited to a linear model. "Seeing is believing" is exactly the power of visualization.
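As a closing sketch of that point, the four datasets can be drawn side by side with their (nearly identical) fitted lines, which makes the differences hidden by the summary statistics immediately visible.
# Scatterplots of the four datasets with their fitted regression lines
par(mfrow = c(2, 2))
for (d_name in c("data1", "data2", "data3", "data4")) {
  d <- get(d_name)                              # the data frames defined above
  plot(d$x, d$y, main = d_name, xlab = "x", ylab = "y")
  abline(lm(y ~ x, data = d), col = "blue")
}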