1 Part I

Please put the answers for Part I next to the question number (2pts each):

1.1 1. A student is gathering data on the driving experiences of other college students. A description of the data is presented below. Which of the variables are quantitative and discrete?

car: 1=compact, 2=standard size, 3=minivan, 4=SUV, and 5=truck
color: red, blue, green, black, white
daysDrive: number of days per week the student drives
gasMonth: the amount of money the student spends on gas per month

  1. car
  2. daysDrive
  3. daysDrive, car
  4. daysDrive, gasMonth
  5. car, daysDrive, gasMonth

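Answer : B (daysDrive)

Explanation : A quantitative variable whose possible values are only specific, separate points on a scale is called a discrete variable. By this definition, daysDrive (the number of days per week the student drives, a whole number from 0 to 7) is the only variable that qualifies. car and color are categorical (the numeric codes for car are only labels), and gasMonth is a continuous dollar amount.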

1.2 2. A histogram of the GPA of 132 students from this course in Fall 2012 class is presented below. Which estimates of the mean and median are most plausible?

[Figure: histogram of GPA for the 132 students]

Answer : A) Mean = 3.3, median = 3.5

Explanation: We use elimination. The histogram is left-skewed, so the long left tail pulls the mean below the median (mean < median); this eliminates options B and D. Options C and E satisfy mean < median, but the histogram shows that the median cannot be as high as 3.8. That leaves option A.
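The relationship between skew, mean, and median can be illustrated with a few made-up values. The vector below is hypothetical (it is not the class data); it is only chosen to have a long left tail, as the histogram does.

```r
# Hypothetical left-skewed GPA values (illustrative only, not the exam data)
gpa <- c(1.8, 2.4, 2.9, 3.2, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0)

gpa_mean <- mean(gpa)
gpa_median <- median(gpa)

# The long left tail drags the mean below the median
c(mean = gpa_mean, median = gpa_median)
gpa_mean < gpa_median
```

The few low values affect the mean but not the middle value, which is why mean < median is the signature of a left-skewed distribution.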

1.4 4. A study is designed to test whether there is a relationship between natural hair color (brunette, blond, red) and eye color (blue, green, brown). If a large χ2 test statistic is obtained, this suggests that:

  1. there is a difference between average eye color and average hair color.

  2. a person’s hair color is determined by his or her eye color.

  3. there is an association between natural hair color and eye color.

  4. eye color and natural hair color are independent.

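Answer : C, there is an association between natural hair color and eye color.

Explanation : The χ2 (chi-square) statistic measures how far the observed frequencies are from the frequencies expected if hair color and eye color were independent. A large χ2 means the observed and expected frequencies are far apart, which is evidence against independence. Therefore the third choice (C) is right; the fourth choice (D) states the opposite.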
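As a rough sketch of how such a test looks in R, the example below runs `chisq.test()` on a made-up 3 x 3 table. The counts are hypothetical and were chosen so that hair and eye color are clearly related; they are not from any real study.

```r
# Hypothetical counts (rows: hair color, columns: eye color); not real data
counts <- matrix(c(30,  5,  5,
                    5, 30,  5,
                    5,  5, 30),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(hair = c("brunette", "blond", "red"),
                                 eye  = c("blue", "green", "brown")))

test <- chisq.test(counts)

test$expected   # counts expected if hair and eye color were independent
test$statistic  # chi-square statistic: distance between observed and expected
test$parameter  # degrees of freedom: (3 - 1) * (3 - 1) = 4
test$p.value    # small p-value = evidence of an association
```

Because the observed counts sit far from the expected counts, the statistic is large and the p-value is small, which points to an association rather than independence.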

1.5 5. A researcher studying how monkeys remember is interested in examining the distribution of the score on a standard memory task. The researcher wants to produce a boxplot to examine this distribution. Below are summary statistics from the memory task. What values should the researcher use to determine if a particular score is a potential outlier in the boxplot?

min   Q1   median   Q3     max   mean   sd    n
26    37   45       49.8   65    44.4   8.4   50

  1. 37.0 and 49.8
  2. 17.8 and 69.0
  3. 36.0 and 52.8
  4. 26.0 and 50.0
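  5. 19.2 and 69.9

# Summary statistics from the memory task
# (suffixed names avoid masking base R functions such as min() and mean())
min_score <- 26
Q1 <- 37
median_score <- 45
Q3 <- 49.8
max_score <- 65
mean_score <- 44.4
sd_score <- 8.4
n <- 50

# Interquartile range: 49.8 - 37 = 12.8
iqr <- Q3 - Q1

# Boxplot fences: scores beyond Q1 - 1.5 * IQR or Q3 + 1.5 * IQR are potential outliers
Upper_limit <- Q3 + 1.5 * iqr
Upper_limit
## [1] 69
Lower_Limit <- Q1 - 1.5 * iqr
Lower_Limit
## [1] 17.8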

Answer : B (17.8 and 69.0)

1.6 6) The ___ are resistant to outliers, whereas the ___ are not.

  1. mean and median; standard deviation and interquartile range
  2. mean and standard deviation; median and interquartile range
  3. standard deviation and interquartile range; mean and median
  4. median and interquartile range; mean and standard deviation
  5. median and standard deviation; mean and interquartile range

Answer : D

The median and IQR are resistant to outliers because they depend only on the middle of the ordered data, whereas the mean and standard deviation use every value and are not resistant.
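A small illustration of this resistance, using made-up scores: the second vector is identical to the first except that its largest value is replaced by an extreme one. Both vectors are hypothetical.

```r
# Hypothetical scores; the second vector replaces the largest value with an outlier
scores <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
scores_outlier <- c(1, 2, 3, 4, 5, 6, 7, 8, 90)

compare <- rbind(
  original     = c(median = median(scores), IQR = IQR(scores),
                   mean = mean(scores), sd = sd(scores)),
  with_outlier = c(median = median(scores_outlier), IQR = IQR(scores_outlier),
                   mean = mean(scores_outlier), sd = sd(scores_outlier))
)
compare
```

The median and IQR columns are the same in both rows, while the mean and standard deviation change substantially.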

1.7 7) Figure A below represents the distribution of an observed variable. Figure B below represents the distribution of the mean from 500 random samples of size 30 from A. The mean of A is 5.05 and the mean of B is 5.04. The standard deviations of A and B are 3.22 and 0.58, respectively.

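[Figure A: distribution of the observed variable]

[Figure B: distribution of the means of 500 random samples of size 30 from A]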

1.8 7a. Describe the two distributions (2pts).

Answer : Figure A (observations): The distribution is unimodal and skewed to the right, so the mean is greater than the median. Its spread is wide (SD = 3.22). Figure B (sampling distribution of the mean): The distribution is unimodal, fairly symmetric, and approximately normal. Its spread is much narrower (SD = 0.58), although it is centered at nearly the same value as A.

1.9 7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

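Answer : Figure B shows the means of 500 random samples, each of size 30, drawn from the distribution in Figure A. Sample means are centered on the mean of the distribution they are drawn from, so the two means are nearly equal (5.05 and 5.04). Averaging 30 observations cancels out much of the variability of individual observations, so the standard deviation of the sampling distribution (the standard error) is SD/sqrt(n), which is much smaller than the SD of A: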

Standard_error <- 3.22/sqrt(30)
Standard_error
## [1] 0.5878889

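1.10 7c. What is the statistical principle that describes this phenomenon (2 pts)?

The statistical principle is the Central Limit Theorem (CLT): when its conditions are met, the sampling distribution of the mean is approximately normal, centered at the population mean, with standard deviation SD/sqrt(n). Conditions: 1) the observations are independent (random samples); 2) the sample size is large enough (n = 30 here) given how skewed the data are. Both hold here, which is why Figure B looks nearly normal even though Figure A is skewed.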
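The behavior described in 7a to 7c can be reproduced with a short simulation. The data behind Figure A are not available, so the sketch below assumes a right-skewed gamma population whose mean and SD are close to those reported for A; the shape, scale, and seed are illustrative choices only.

```r
set.seed(606)  # arbitrary seed, for reproducibility only

# Assumed population: a right-skewed gamma distribution chosen so that its
# mean (about 5.04) and SD (about 3.22) resemble Figure A. Not the exam data.
shape <- 2.46
scale <- 2.05
pop_mean <- shape * scale
pop_sd <- sqrt(shape) * scale

n <- 30            # size of each sample
n_samples <- 500   # number of samples

# Draw 500 samples of size 30 and keep the mean of each
sample_means <- replicate(n_samples, mean(rgamma(n, shape = shape, scale = scale)))

# Centers agree; the spread of the means is close to pop_sd / sqrt(n)
c(pop_mean = pop_mean, mean_of_means = mean(sample_means))
c(theoretical_se = pop_sd / sqrt(n), sd_of_means = sd(sample_means))

hist(sample_means, main = "Means of 500 samples of size 30", xlab = "Sample mean")
```

The histogram of `sample_means` should look roughly symmetric and much narrower than the skewed population it was drawn from, mirroring the contrast between Figures A and B.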

2 Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

2.1 a. The mean (for x and y separately; 1 pt).

For data1

x1 <- mean(data1$x)
y1 <- mean(data1$y)
summary(data1)
##        x              y       
##  Min.   : 4.0   Min.   : 4.3  
##  1st Qu.: 6.5   1st Qu.: 6.3  
##  Median : 9.0   Median : 7.6  
##  Mean   : 9.0   Mean   : 7.5  
##  3rd Qu.:11.5   3rd Qu.: 8.6  
##  Max.   :14.0   Max.   :10.8
x1
## [1] 9
y1
## [1] 7.5

For data2

x2 <- mean(data2$x)
y2 <- mean(data2$y)
summary(data2)
##        x              y      
##  Min.   : 4.0   Min.   :3.1  
##  1st Qu.: 6.5   1st Qu.:6.7  
##  Median : 9.0   Median :8.1  
##  Mean   : 9.0   Mean   :7.5  
##  3rd Qu.:11.5   3rd Qu.:8.9  
##  Max.   :14.0   Max.   :9.3
x2
## [1] 9
y2
## [1] 7.5

For data3

x3 <- mean(data3$x)
y3 <- mean(data3$y)
summary(data3)
##        x              y       
##  Min.   : 4.0   Min.   : 5.4  
##  1st Qu.: 6.5   1st Qu.: 6.2  
##  Median : 9.0   Median : 7.1  
##  Mean   : 9.0   Mean   : 7.5  
##  3rd Qu.:11.5   3rd Qu.: 8.0  
##  Max.   :14.0   Max.   :12.7
x3
## [1] 9
y3
## [1] 7.5

For data4

x4 <- mean(data4$x)
y4 <- mean(data4$y)
summary(data4)
##        x            y       
##  Min.   : 8   Min.   : 5.2  
##  1st Qu.: 8   1st Qu.: 6.2  
##  Median : 8   Median : 7.0  
##  Mean   : 9   Mean   : 7.5  
##  3rd Qu.: 8   3rd Qu.: 8.2  
##  Max.   :19   Max.   :12.5
x4
## [1] 9
y4
## [1] 7.5

2.2 b. The median (for x and y separately; 1 pt).

For data1

median(data1$x)
## [1] 9
median(data1$y)
## [1] 7.6

For data2

median(data2$x)
## [1] 9
median(data2$y)
## [1] 8.1

For data3

median(data3$x)
## [1] 9
median(data3$y)
## [1] 7.1

For data4

median(data4$x)
## [1] 8
median(data4$y)
## [1] 7

2.3 c. The standard deviation (for x and y separately; 1 pt).

For data1

sd(data1$x)
## [1] 3.3
sd(data1$y)
## [1] 2

For data2

sd(data2$x)
## [1] 3.3
sd(data2$y)
## [1] 2

For data3

sd(data3$x)
## [1] 3.3
sd(data3$y)
## [1] 2

For data4

sd(data4$x)
## [1] 3.3
sd(data4$y)
## [1] 2

2.4 For each x and y pair, calculate (also to two decimal places; 1 pt):

2.5 d. The correlation (1 pt).

For data1

cor(data1)
##      x    y
## x 1.00 0.82
## y 0.82 1.00

For data2

cor(data2)
##      x    y
## x 1.00 0.82
## y 0.82 1.00

For data3

cor(data3)
##      x    y
## x 1.00 0.82
## y 0.82 1.00

For data4

cor(data4)
##      x    y
## x 1.00 0.82
## y 0.82 1.00
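The summaries in parts a to d can also be computed in one pass and shown to exactly two decimal places. Note that `options(digits=2)` sets significant digits, not decimal places, which is why values above print as 7.5 rather than 7.50. The sketch below assumes that R's built-in `anscombe` data frame (columns `x1`..`x4`, `y1`..`y4`) holds the same values as data1 to data4; the name `summary_stats` is introduced here for illustration.

```r
# Assumes the built-in `anscombe` data frame (columns x1..x4, y1..y4)
# holds the same values as data1..data4 above.
summary_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    median_x = median(x), median_y = median(y),
    sd_x = sd(x), sd_y = sd(y),
    cor_xy = cor(x, y))
})
colnames(summary_stats) <- paste0("data", 1:4)

# round() fixes the number of decimal places; format(nsmall = 2) keeps trailing zeros
print(format(round(summary_stats, 2), nsmall = 2), quote = FALSE)
```

Laying the four datasets side by side makes it easy to see that the means, standard deviations, and correlations agree to two decimals, while the medians do not.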

2.6 e. Linear regression equation (2 pts).

e1 <- lm(y ~ x,data = data1)
summary(e1)
## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217

Equation for data1 : y = 3.000 + 0.500x

e2 <- lm(y ~ x,data = data2)
summary(e2)
## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

Equation for data2 : y = 3.001 + 0.500x

e3 <- lm(y ~ x,data = data3)
summary(e3)
## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

Equation for data3: y = 3.002 + 0.500x

e4 <- lm(y ~ x,data = data4)
summary(e4)
## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216

Equation for data4: y = 3.002 + 0.500x

2.7 f. R-Squared (2 pts).

For data1

summary(e1)$r.squared
## [1] 0.67

For data2

summary(e2)$r.squared
## [1] 0.67

For data3

summary(e3)$r.squared
## [1] 0.67

For data4

summary(e4)$r.squared
## [1] 0.67
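The coefficients and R-squared values from parts e and f can be collected into one table rather than read off four separate summaries. This is a minimal sketch, again assuming the built-in `anscombe` data frame matches data1 to data4; `fits` and `fit_summary` are names introduced for illustration.

```r
# Fit y_i ~ x_i for each of the four Anscombe pairs
# (assumed to hold the same values as data1..data4 above)
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})
names(fits) <- paste0("data", 1:4)

# One row per dataset: intercept, slope, and R-squared
fit_summary <- t(sapply(fits, function(fit) {
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared)
}))
round(fit_summary, 2)
```

All four rows should show essentially the same fitted line and R-squared, which is why the numeric output alone cannot tell the datasets apart.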

2.8 For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Conditions to check for linear regression model:

  1. Linearity
  2. Nearly normal residuals
  3. Constant variability
  4. Observations are independent of each other (we can make an assumption here).
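The first three conditions above can be checked with a small plotting helper. The function below is a hypothetical sketch, not part of the original analysis: it adds the fitted line to the scatterplot and a residuals-versus-fitted plot, which is the usual way to look at constant variability. It assumes a data frame with columns `x` and `y`, and uses the built-in `anscombe` data (assumed to match data1) for the usage example.

```r
# Hypothetical helper (not part of the original analysis): draws four
# diagnostic plots for a simple regression of y on x and returns the fit.
check_conditions <- function(df, label = "") {
  fit <- lm(y ~ x, data = df)
  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))  # restore the plotting layout afterwards

  # Linearity: scatterplot with the fitted line
  plot(df$x, df$y, main = paste("Scatterplot", label), xlab = "x", ylab = "y")
  abline(fit)

  # Constant variability: residuals should scatter evenly around zero
  plot(fitted(fit), resid(fit), main = "Residuals vs fitted",
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)

  # Nearly normal residuals: histogram and normal Q-Q plot
  hist(resid(fit), main = "Residuals", xlab = "Residual")
  qqnorm(resid(fit))
  qqline(resid(fit))

  invisible(fit)
}

# Usage with the first Anscombe dataset (assumed to match data1 above)
example_df <- data.frame(x = anscombe$x1, y = anscombe$y1)
fit1 <- check_conditions(example_df, "data1")
```

With only 11 observations per dataset, the histogram and Q-Q plot are rough guides at best; the scatterplot and the residuals-versus-fitted plot usually carry most of the information.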

Plots for data1

par(mfrow=c(2,2))
plot(data1$x, data1$y)
hist(e1$residuals)
qqnorm(e1$residuals)
qqline(e1$residuals)

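The scatterplot for data1 shows a roughly linear upward trend with moderate scatter around the line. The Q-Q plot stays close to the reference line, and the residual histogram, although coarse with only 11 points, shows no strong skew or extreme outlier. The conditions are reasonably met, so a linear regression model is appropriate for data1.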

Plots for data2

par(mfrow=c(2,2))
plot(data2$x, data2$y)
hist(e2$residuals)
qqnorm(e2$residuals)
qqline(e2$residuals)

The scatterplot for data2 follows a clear curve, so the linearity condition fails. A linear regression model is not appropriate.

Plots for data3

par(mfrow=c(2,2))
plot(data3$x, data3$y)
hist(e3$residuals)
qqnorm(e3$residuals)
qqline(e3$residuals)

The scatterplot for data3 is almost perfectly linear except for a single outlier (x = 13, y = 12.74). That point produces the largest residual (3.24), skews the residual histogram to the right, and pulls the fitted line away from the other ten points. A linear regression model fit to all 11 observations is therefore not appropriate.

Plots for data4

par(mfrow=c(2,2))
plot(data4$x, data4$y)
hist(e4$residuals)
qqnorm(e4$residuals)
qqline(e4$residuals)

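The scatterplot for data4 shows ten observations at x = 8 and a single observation at x = 19. That one high-leverage point alone determines the slope of the fitted line, so the data give no real evidence of a linear relationship between x and y, and the residual plots cannot check the conditions in a meaningful way. A linear regression model is not appropriate.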

2.9 Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create.

Visualization plays a very important role in analyzing data. Plots help us identify outliers, check model conditions, and support conclusions and predictions about a dataset; a graph summarizes what a dataset looks like in a way that numbers alone cannot. The four datasets above are a good example: they have essentially the same means, standard deviations, correlation (0.82), regression line (y = 3.00 + 0.50x), and R-squared (0.67), yet their scatterplots look completely different. Only the plots reveal the curve in data2, the outlier in data3, and the single influential point in data4. “Seeing is believing”: that is the power of visualization.
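One way to make this point in a single figure is to draw all four scatterplots on common axes with their fitted lines. The sketch below assumes the built-in `anscombe` data frame matches data1 to data4; the axis limits and line color are arbitrary choices.

```r
# Assumes the built-in `anscombe` data frame matches data1..data4 above
old_par <- par(mfrow = c(2, 2))
slopes <- numeric(4)

for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  slopes[i] <- coef(fit)[["x"]]

  # Common axis limits make the four panels directly comparable
  plot(x, y, main = paste0("data", i), xlim = c(3, 20), ylim = c(2, 14))
  abline(fit, col = "red")  # nearly the same fitted line in every panel
}

par(old_par)  # restore the previous plotting layout
```

The red line is practically identical in every panel, while the point patterns differ, which is the contrast the summary statistics hide.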