Part I

Please put the answers for Part I next to the question number (2pts each):

  1. d

The variables daysDrive and gasMonth are discrete, while car is ordinal and color is nominal.

  2. a

Since the distribution is left-skewed, the median is greater than the mean (this eliminates b and d). Also, the mean is clearly greater than 3 (this eliminates c and e).

  3. d

  4. d

  5. b

Q1 <- 37
Q3 <- 49.8
IQR <- Q3 - Q1

# lower fence: below Q1 - 1.5 * IQR
round(Q1 - 1.5 * IQR, 2)
## [1] 17.8

# upper fence: above Q3 + 1.5 * IQR
Q3 + 1.5 * IQR
## [1] 69
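
For illustration only (this is hypothetical data, not the assignment's), the same fences can be used to flag outliers in an arbitrary numeric vector:

# Sketch with made-up values: flag observations outside the 1.5 * IQR fences.
vals <- c(15, 36, 39, 42, 45, 48, 51, 72)    # hypothetical observations
q <- quantile(vals, c(0.25, 0.75))
fences <- q + c(-1.5, 1.5) * IQR(vals)
vals[vals < fences[1] | vals > fences[2]]    # potential outliers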
  6. d

7a. Describe the two distributions (2pts).

Graph A shows the distribution of the individual observations. It is unimodal (there is one clear peak) and strongly right-skewed; because of the right skew, its mean is greater than its median.

Graph B shows the distribution of the means of 500 random samples of size 30 drawn from that population. It is unimodal, bell-shaped, and symmetric, and looks approximately normal. The data spread over roughly three standard deviations on either side of the mean.

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

The means are similar because the sampling distribution of the mean is centered at the population mean. The standard deviations differ because the standard deviation of the sampling distribution (the standard error) shrinks with the sample size; it equals the population standard deviation divided by the square root of the sample size:

SD <- 3.22           # standard deviation of the observations (Graph A)
n  <- 30             # sample size

SE <- SD / sqrt(n)   # standard error of the sample mean (Graph B)
SE
## [1] 0.5878889

7c. What is the statistical principle that describes this phenomenon (2 pts)?

The principle is the central limit theorem. It states that even if a population distribution is strongly non-normal, the sampling distribution of the sample mean will be approximately normal for sufficiently large samples (a common rule of thumb is n ≥ 30). The central limit theorem makes it possible to estimate the population mean from the mean of the sampling distribution, whose standard deviation is called the standard error.
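
As a quick illustration of the theorem (not part of the assignment data), one can simulate it in R; the exponential population below is an arbitrary choice of a strongly right-skewed distribution:

# Sketch: 500 sample means of size-30 samples from a right-skewed population.
# The Exp(1) population is an assumption for illustration only.
set.seed(123)
sample_means <- replicate(500, mean(rexp(30, rate = 1)))

hist(sample_means)   # approximately bell-shaped despite the skewed population
sd(sample_means)     # close to the theoretical standard error
1 / sqrt(30)         # SD of Exp(1) is 1, so SE = 1 / sqrt(30), about 0.18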

Part II

Consider the four data sets, each with two columns (x and y), provided below.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
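
As an optional convenience (not required by the assignment), the four data frames could also be collected in a named list so that the summaries below can be computed in one call:

# Optional: a named list of the four data sets, so summary statistics
# can be computed with sapply() instead of four separate calls.
datasets <- list("data 1" = data1, "data 2" = data2,
                 "data 3" = data3, "data 4" = data4)
sapply(datasets, function(d) c(mean_x = mean(d$x), mean_y = mean(d$y)))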

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

data <- c("data 1","data 2","data 3","data 4")
mean_x <- c(mean(data1$x),mean(data2$x),mean(data3$x),mean(data4$x))
mean_y <- c(mean(data1$y),mean(data2$y),mean(data3$y),mean(data4$y)) 

data_statistic <- data.frame(data,mean_x,mean_y)
data_statistic 
##     data mean_x mean_y
## 1 data 1      9    7.5
## 2 data 2      9    7.5
## 3 data 3      9    7.5
## 4 data 4      9    7.5

b. The median (for x and y separately; 1 pt).

median_x <- c(median(data1$x),median(data2$x),median(data3$x),median(data4$x))
median_y <- c(median(data1$y),median(data2$y),median(data3$y),median(data4$y)) 

data_statistic <- data.frame(data,median_x,median_y)
data_statistic 
##     data median_x median_y
## 1 data 1        9      7.6
## 2 data 2        9      8.1
## 3 data 3        9      7.1
## 4 data 4        8      7.0

c. The standard deviation (for x and y separately; 1 pt).

sd_x <- c(sd(data1$x),sd(data2$x),sd(data3$x),sd(data4$x))
sd_y <- c(sd(data1$y),sd(data2$y),sd(data3$y),sd(data4$y)) 

data_statistic <- data.frame(data,sd_x,sd_y)
data_statistic 
##     data sd_x sd_y
## 1 data 1  3.3    2
## 2 data 2  3.3    2
## 3 data 3  3.3    2
## 4 data 4  3.3    2

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

correlation_x_and_y <- c(cor(data1$x,data1$y),cor(data2$x,data2$y),cor(data3$x,data3$y),cor(data4$x,data4$y))

data_statistic <- data.frame(data,correlation_x_and_y)
data_statistic 
##     data correlation_x_and_y
## 1 data 1                0.82
## 2 data 2                0.82
## 3 data 3                0.82
## 4 data 4                0.82

e. Linear regression equation (2 pts).

# Note: these models regress x on y (x is the response, y the predictor);
# see the note after the fitted equations below.
data1_reg <- lm(data1$x ~ data1$y)
data2_reg <- lm(data2$x ~ data2$y)
data3_reg <- lm(data3$x ~ data3$y)
data4_reg <- lm(data4$x ~ data4$y)

summary(data1_reg)
## 
## Call:
## lm(formula = data1$x ~ data1$y)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.652 -1.512 -0.266  1.234  3.895 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -0.998      2.434   -0.41   0.6916   
## data1$y        1.333      0.314    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
summary(data2_reg)
## 
## Call:
## lm(formula = data2$x ~ data2$y)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.852 -1.432 -0.344  0.847  4.202 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -0.995      2.435   -0.41   0.6925   
## data2$y        1.332      0.314    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
summary(data3_reg)
## 
## Call:
## lm(formula = data3$x ~ data3$y)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.987 -1.373 -0.027  1.320  3.213 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -1.000      2.436   -0.41   0.6910   
## data3$y        1.333      0.315    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
summary(data4_reg)
## 
## Call:
## lm(formula = data4$x ~ data4$y)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.786 -1.412 -0.185  1.455  3.333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -1.004      2.435   -0.41   0.6898   
## data4$y        1.334      0.314    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

Note that the models above regress x on y (for example, lm(data1$x ~ data1$y)), so each fitted equation describes x as a function of y:

Linear equation for data set 1: x = -0.998 + 1.333*y

Linear equation for data set 2: x = -0.995 + 1.332*y

Linear equation for data set 3: x = -1.000 + 1.333*y

Linear equation for data set 4: x = -1.004 + 1.334*y
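
If the goal were instead to predict y from x (the conventional direction), the fit would be lm(y ~ x); a minimal sketch, output not shown:

# Sketch: regressing y on x for each data set and extracting the coefficients.
fits_yx <- lapply(list(data1, data2, data3, data4),
                  function(d) lm(y ~ x, data = d))
lapply(fits_yx, coef)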

f. R-Squared (2 pts).

summary(data1_reg)$r.squared
## [1] 0.67
summary(data2_reg)$r.squared
## [1] 0.67
summary(data3_reg)$r.squared
## [1] 0.67
summary(data4_reg)$r.squared
## [1] 0.67

The R-squared values for all four data sets are equal to 0.67.

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

To justify fitting a linear regression model, the data should meet four conditions:

The first condition is linearity: the data should show a linear trend.

The second condition is nearly normal residuals: the distribution of the residuals should be close to a normal distribution.

The third condition is constant variability: the variability of the points around the least squares line, and of the residuals around the zero line, should be reasonably constant.

The fourth condition is independent observations. (A compact way to examine several of these conditions at once is sketched below.)
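
R's built-in diagnostic plots for a fitted model cover several of these checks; a minimal sketch, using data set 1 as an example:

# Sketch: default lm diagnostic plots (residuals vs. fitted, normal Q-Q,
# scale-location, residuals vs. leverage) for data set 1.
fit1 <- lm(y ~ x, data = data1)
par(mfrow = c(2, 2))
plot(fit1)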

Let’s check the first condition - linearity.

par(mfrow=c(2,2))

#plot for data1
plot(y ~ x, xlab = "x", ylab = "y",data = data1, main="Data Set 1")  
data1_reg <- lm(y ~ x, data = data1)
abline(data1_reg)

#plot for data2
plot(y ~ x, xlab = "x", ylab = "y",data = data2, main="Data Set 2")  
data2_reg <- lm(y ~ x, data = data2)
abline(data2_reg)

#plot for data3
plot(y ~ x, xlab = "x", ylab = "y",data = data3, main="Data Set 3")  
data3_reg <- lm(y ~ x, data = data3)
abline(data3_reg)

#plot for data4
plot(y ~ x, xlab = "x", ylab = "y",data = data4, main="Data Set 4")  
data4_reg <- lm(y ~ x, data = data4)
abline(data4_reg)

Only data sets 1 and 3 show a positive linear trend with a strong relationship between the response and explanatory variables (correlation 0.82). Data set 2 shows a clearly non-linear (curved) trend despite the same correlation, and the apparent relationship in data set 4 is driven by a single extreme observation, so it is hard to treat its points as providing independent evidence of a linear trend.

Only data sets 1 and 3 satisfy the linearity condition.

Neither data set 2 nor data set 4 would be appropriate for a linear regression model.

Let’s check the second condition - nearly normal residuals.

par(mfrow=c(2,2))

#histograms
hist(data1_reg$residuals)
#hist(data2_reg$residuals)
hist(data3_reg$residuals)
#hist(data4_reg$residuals)

#summary statistics
summary(data1_reg$residuals)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1.92   -0.46   -0.04    0.00    0.71    1.84
#summary(data2_reg$residuals)
summary(data3_reg$residuals)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -3.0    -1.4     0.0     0.0     1.3     3.2
#summary(data4_reg$residuals)


#normal probability plots 
qqnorm(data1_reg$residuals)
qqline(data1_reg$residuals)
#qqnorm(data2_reg$residuals)
#qqline(data2_reg$residuals)
qqnorm(data3_reg$residuals)
qqline(data3_reg$residuals)

#qqnorm(data4_reg$residuals)
#qqline(data4_reg$residuals)

It is hard to draw conclusions from the histograms alone, since each sample contains only 11 observations. However, the summary statistics show that the mean and median of the data set 1 residuals are equal, and those of data set 3 are nearly equal.

Looking at the normal probability plots, the data set 3 residuals appear close to normally distributed, as almost all points lie on the line, whereas the data set 1 residuals deviate substantially from the line.

Hence, data set 1 would not be appropriate for a linear regression model.
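
As a numerical complement to the normal probability plots (a sketch only; with just 11 observations the test has little power and should not override the graphical checks), a Shapiro-Wilk test could be applied to the residuals:

# Sketch: Shapiro-Wilk normality test on the residuals of data sets 1 and 3.
shapiro.test(data1_reg$residuals)
shapiro.test(data3_reg$residuals)

Let’s check the third condition - constant variability.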

#par(mfrow=c(2,2))

#plot(data1_reg$residuals ~ data1$x)
#abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

#plot(data2_reg$residuals ~ data2$x)
#abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

plot(data3_reg$residuals ~ data3$x)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

#plot(data4_reg$residuals ~ data4$x)
#abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

The plot shows that the variability of the points increases as x increases, so the variability is not constant.

Data set 3 would therefore not be appropriate for a linear regression model.

In conclusion, none of the four data sets would be appropriate for a linear regression model.

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

Visualization is a critical part of data analysis because it lets us understand the data beyond summary statistics. For example, all four data sets from the previous exercise have the same means, standard deviations, correlation, and R-squared values, yet the plots show that the four data sets are completely different.
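
For reference, the side-by-side scatterplots that support this conclusion can be reproduced as follows (the same plots created in the previous part):

# Re-create the 2 x 2 grid of scatterplots with fitted lines that shows
# how different the four data sets really are despite identical summaries.
par(mfrow = c(2, 2))
dsets <- list("Data Set 1" = data1, "Data Set 2" = data2,
              "Data Set 3" = data3, "Data Set 4" = data4)
for (name in names(dsets)) {
  d <- dsets[[name]]
  plot(y ~ x, data = d, main = name)
  abline(lm(y ~ x, data = d))
}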