A: (c) daysDrive, car. Both variables are numerical: car can take fixed values between 1 and 5, and daysDrive can take only non-negative whole-number values.


A: For n = 132 students, the median is the average of the \(66^{th}\) and \(67^{th}\) ordered observations (position \(\frac{n + 1}{2} = 66.5\)). Both fall in the \(9^{th}\) bin, near its lower edge (the \(8^{th}\) bin ends at the \(62^{nd}\) observation). If 3.5 GPA is the \(9^{th}\) bin’s average, the median will be somewhat less than 3.5 GPA.

The estimates listed in (b), mean = 3.5 and median = 3.3, are the most plausible.

#number of students
n = 132
#students per bin
s.per<-c(2.5,2.5,0,5,5,14,5,13,13,21,20)
s.df<-data.frame(sper = s.per, stringsAsFactors = F)
s.df$students<-round(n*s.df$sper/100)
s.df$cumstudentsperbin<-cumsum(s.df$students)
s.df
##    sper students cumstudentsperbin
## 1   2.5        3                 3
## 2   2.5        3                 6
## 3   0.0        0                 6
## 4   5.0        7                13
## 5   5.0        7                20
## 6  14.0       18                38
## 7   5.0        7                45
## 8  13.0       17                62
## 9  13.0       17                79
## 10 21.0       28               107
## 11 20.0       26               133
med<-round((n + 1)/2)
#median
med
## [1] 66

A: (d) Both studies (a) and (b) could be conducted in order to establish that the treatment does indeed cause improvement with regard to fever in Ebola patients.


A: The chi-square statistic is calculated as \(\chi^2 = \sum{\frac{(Observed - Expected)^2}{Expected}}\)

Null hypothesis \(H_0\): There is no relationship between hair color and eye color; they are independent.

Alternative hypothesis \(H_A\): There is a relationship between hair color and eye color; they are dependent.

For the chi-square statistic \(\chi^2\) to be large, the observed counts must differ substantially from the expected counts.

A large \(\chi^2\) corresponds to a small p-value and would lead us to reject the null hypothesis \(H_0\) in favor of the alternative hypothesis \(H_A\) at the 5% significance level (\(\alpha = 0.05\)). If \(\chi^2\) is small, the p-value is large and we fail to reject \(H_0\).

Failing to reject \(H_0\), we would conclude (d): eye color and natural hair color are independent.
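
As a minimal sketch of how the statistic above is computed, the code below uses a small hypothetical 2x2 table of counts (the numbers are made up for illustration and are not the data from this question):

#hypothetical 2x2 table of observed counts (illustrative only)
obs <- matrix(c(30, 20, 10, 40), nrow = 2, byrow = TRUE,
              dimnames = list(hair = c("dark", "light"), eye = c("dark", "light")))
#expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
#chi-square statistic: sum((Observed - Expected)^2 / Expected)
sum((obs - expected)^2 / expected)
#same statistic from R's built-in test (without continuity correction)
chisq.test(obs, correct = FALSE)$statistic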


A: From the given data, the first quartile is \(Q1 = 37\) and the third quartile is \(Q3 = 49.8\).

Interquartile Range \(IQR = Q3 - Q1 = 49.8 - 37 = 12.8\), Lower Limit \(LL = Q1 - 1.5 \times IQR = 17.8\), Upper Limit \(UL = Q3 + 1.5 \times IQR = 69\)

Any observed value less than 17.8 or greater than 69.0 is considered an outlier: (b) 17.8 and 69.0.

#given summary statistics
q1 <- 37
q3 <- 49.8
m <- 44.4
sd <- 8.4
iqr <- q3 - q1

#outlier bounds using the 1.5 * IQR rule
bounds <- data.frame("iqr" = c(q3 - q1), "lower" = c(q1 - (1.5*iqr)), "upper" = c(q3 + (1.5*iqr)), stringsAsFactors = F)
bounds
##    iqr lower upper
## 1 12.8  17.8    69

A: The mean (\(\mu\)) is affected by extreme values (outliers), and the standard deviation (\(\sigma\)) is calculated using the mean. Hence neither the mean nor the standard deviation is resistant to outliers.

(d) median and interquartile range; mean and standard deviation
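
A quick illustration with made-up numbers (not from this question): adding a single extreme value shifts the mean and standard deviation noticeably, while the median and IQR barely move.

#illustrative data (made up): the same values with and without one extreme outlier
x <- c(40, 42, 44, 45, 46, 48, 50)
x.out <- c(x, 150)
c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x))
c(mean = mean(x.out), sd = sd(x.out), median = median(x.out), IQR = IQR(x.out))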


  1. Describe the two distributions (2 pts).

A: Figure A represents a histogram of actual observations collected from some population. Figure B represents a histogram of 500 means, each calculated from a random sample of size 30 drawn from those actual observations.

Example: Suppose a survey is conducted on Batman’s response to a minor incident in Gotham City. Sentiments were gathered from 1000 randomly selected individuals on a scale of 1 to 5, where one indicates a very negative and five a very positive sentiment. In this case, Figure A represents a histogram of the sentiments of the 1000 individuals. Once the survey is completed, a sample of 30 responses is selected at random from these 1000 responses and its mean is calculated and recorded. The process of picking 30 random responses from the 1000 original observations is repeated 500 times, and each time the mean is computed and recorded. Figure B represents the histogram of those 500 means.

Figure A can be referred to as a histogram of the observed data distribution, and Figure B as a histogram of the sampling distribution of the sample mean; its approximately normal shape is what is meant by the normal approximation.

Figure A suggests the data are right skewed, indicating that outliers exist. Figure B indicates the sample means are approximately normally distributed; the histogram is unimodal.

  1. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

A: For Figure A, the standard deviation (\(\sigma\)) is 3.22 and the mean (\(\mu\)) is 5.05. The standard deviation measures how much the observations vary from one another and is calculated as the square root of the variance.

Standard deviation \(\sigma = \sqrt{variance}\)

For Figure B, the histogram of 500 sample means, the standard deviation of those means is known as the standard error (SE). The estimated mean and standard error of the samples help determine whether the mean of the actual observations falls within a specified confidence interval, usually a 95% confidence interval.

Standard error\((SE) = \frac{\sigma}{\sqrt{n}}\)

Since the standard error divides \(\sigma\) by \(\sqrt{n}\), it will always be smaller than the standard deviation of the actual observations (for \(n > 1\)). The bigger the sample size (\(n\)), the smaller the standard error.
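
A minimal simulation sketch of this relationship (the right-skewed population below is simulated, not the survey data from the figures): the standard deviation of 500 sample means of size 30 comes out close to \(\sigma / \sqrt{n}\), well below the population standard deviation.

set.seed(606)
#simulated right-skewed population, standing in for the observed data in Figure A
population <- rexp(10000, rate = 1/5)
#500 sample means, each from a random sample of size 30
sample.means <- replicate(500, mean(sample(population, size = 30)))
#population sd vs. sd of the sample means (standard error) vs. sigma / sqrt(n)
sd(population)
sd(sample.means)
sd(population) / sqrt(30)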

  1. What is the statistical principle that describes this phenomenon (2 pts)?

A: The Central Limit Theorem is the statistical principle that explains this phenomenon.

According to the central limit theorem, regardless of the underlying distribution of the observations, if the sample size is sufficiently large the sample mean will be approximately normally distributed. In other words, if the sample size is large then the sample mean will be close to the mean of the actual observations (the population mean), and the standard error will be small. A sample size (\(n\)) is commonly treated as large when \(n \ge 30\).


library(dplyr)
library(knitr)
options(scipen=1, digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5), y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68)) 
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5), y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74)) 
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5), y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73)) 
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8), y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89)) 
  1. The mean (for x and y separately; 1 pt).
  2. The median (for x and y separately; 1 pt).
  3. The standard deviation (for x and y separately; 1 pt).
  4. The correlation (1 pt).

A:

options(scipen=1, digits=2)

#Calculate mean, median, sd and correlation for each dataset
datasets <- list(data1 = data1, data2 = data2, data3 = data3, data4 = data4)

outdf <- do.call(rbind, lapply(names(datasets), function(nm) {
  d <- datasets[[nm]]
  data.frame("dataset" = nm,
             "mean.x" = format(mean(d$x), digits=2, nsmall=2),
             "mean.y" = format(mean(d$y), digits=2, nsmall=2),
             "median.x" = format(median(d$x), digits=2, nsmall=2),
             "median.y" = format(median(d$y), digits=2, nsmall=2),
             "sd.x" = format(sd(d$x), digits=2, nsmall=2),
             "sd.y" = format(sd(d$y), digits=2, nsmall=2),
             "correlation.x.y" = format(cor(d$x, d$y), digits=2, nsmall=2),
             stringsAsFactors = F)
}))

rownames(outdf) <- NULL
outdf %>% kable(digits = 2, format='pandoc', caption = "Mean, Median, SD and Correlation of x and y for Datasets")
Mean, Median, SD and Correlation of x and y for Datasets

dataset   mean.x   mean.y   median.x   median.y   sd.x   sd.y   correlation.x.y
--------  -------  -------  ---------  ---------  -----  -----  ----------------
data1     9.00     7.50     9.00       7.58       3.32   2.03   0.82
data2     9.00     7.50     9.00       8.14       3.32   2.03   0.82
data3     9.00     7.50     9.00       7.11       3.32   2.03   0.82
data4     9.00     7.50     8.00       7.04       3.32   2.03   0.82

  1. Linear regression equation (2 pts).
  2. R-Squared (2 pts).

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

A: When the data meet the following conditions, it is reasonable to fit a linear regression model.

1. The residuals of the model are nearly normal.

2. The variability of the residuals is nearly constant.

3. The residuals are independent.

4. Each variable is linearly related to the outcome.


For dataset data1, the intercept is \(\beta_0 = 3.000\) and the slope is \(\beta_1 = 0.500\)

Linear regression equation is \(y = \beta_0 + \beta_1x\)

\(y = 3.000 + 0.500x\)

\(R^2 = 0.667\)

The scatterplot of the independent variable x against the dependent variable y shows that as x increases, y increases. While the relationship is not perfectly linear, a straight line can help partially explain the connection between these variables.

The histogram of the residuals doesn’t have the ideal bell-shaped appearance and suggests there are some outliers in the data. However, a histogram can be strongly influenced by the choice of intervals for the bars. In the normal Q-Q plot of the residuals, the points lie fairly close to the line, with some deviation near the ends. There is not enough evidence to conclude that the residuals are not nearly normal.

Reading the Residuals vs. Fitted plot from left to right, the average of the residuals remains approximately zero. If there were no scatter and all the actual data points fell on the estimated regression line, the dots on this plot would lie on the gray dashed line (residual = 0). The red line on the plot is a scatterplot smoother, showing the average value of the residuals at each fitted value; it is relatively flat and lies close to the gray dashed line. The variation of the residuals appears to be roughly constant, meeting the second condition, constant variability of the residuals.

The Scale-Location plot uses the fitted values on the x-axis and the square root of the standardized residuals on the y-axis; the residuals are rescaled so that they have mean zero and variance one, and the plot is also known as absolute values of residuals against fitted values. The red line shows the trend and is relatively flat, suggesting that the variance of the residuals does not change as a function of the fitted values. This plot is taken here as evidence that the residuals are independent.

The plot Data1.x vs. Residuals shows a fairly random pattern, with positive and negative residuals scattered across the range of x. This random pattern satisfies the condition that each variable is linearly related to the outcome.

Meeting all four conditions indicates that a linear model provides a decent fit to the data1 dataset.

The correlation \(R = 0.82\) is positive and close to 1, suggesting a linear relationship exists between the variables x and y of the data1 dataset.

#Linear regression data1 dataset
dlm <- lm(y ~ x, data = data1)
summary(dlm)
## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217
par(mfrow = c(2, 2))
plot(y ~ x, data = data1)
abline(dlm)

plot(dlm)

#Residuals Histogram
hist(dlm$residuals)

plot(data1$x, residuals(dlm), xlab="data1 x", ylab="Residuals", main = "Data1.x Vs. Residuals")
abline(h=0, col="red")


For dataset data2, the intercept is \(\beta_0 = 3.001\) and the slope is \(\beta_1 = 0.500\)

Linear regression equation is \(y = \beta_0 + \beta_1x\)

\(y = 3.001 + 0.500x\)

\(R^2 = 0.666\)

The scatterplot of the independent variable x against the dependent variable y shows a parabolic pattern. Similarly, the Residuals vs. Fitted plot and the Data2.x vs. Residuals plot show non-random patterns, suggesting the relationship between the variables is non-linear.

Although \(R^2 = 0.666\) and the correlation coefficient \(R = 0.82\) suggest there is a positive relationship between the variables, visual inspection of the plots indicates otherwise.

Since the plots show the data do not meet the conditions of the linear regression model, a linear model is not the best fit for the data2 dataset.

#Linear regression data2 dataset
dlm <- lm(y ~ x, data = data2)
summary(dlm)
## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
par(mfrow = c(2, 2))
plot(y ~ x, data = data2)
abline(dlm)

plot(dlm)

#Residuals Histogram
hist(dlm$residuals)

plot(data2$x, residuals(dlm), xlab="data2 x", ylab="Residuals", main = "Data2.x Vs. Residuals")
abline(h=0, col="red")


For dataset data3, the intercept is \(\beta_0 = 3.002\) and the slope is \(\beta_1 = 0.500\)

Linear regression equation is \(y = 3.002 + 0.500x\)

\(R^2 = 0.666\)

The scatterplot of the independent variable x against the dependent variable y shows that as x increases, y increases. The relationship is not perfectly linear, but a straight line could help partially explain the connection between these variables.

The Residuals vs. Fitted plot and the Data3.x vs. Residuals plot show a systematic downward trend with a single large outlier, rather than random scatter in the data points.

Although \(R^2 = 0.666\) and the correlation coefficient \(R = 0.82\) suggest there is a positive relationship between the variables, visual inspection of the plots indicates otherwise.

Since the plots show the data do not meet the conditions of the linear regression model, a linear model is not the best fit for the data3 dataset.

#Linear regression data3 dataset
dlm <- lm(y ~ x, data = data3)
summary(dlm)
## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218
par(mfrow = c(2, 2))
plot(y ~ x, data = data3)
abline(dlm)

plot(dlm)

#Residuals Histogram
hist(dlm$residuals)

plot(data3$x, residuals(dlm), xlab="data3 x", ylab="Residuals", main = "Data3.x Vs. Residuals")
abline(h=0, col="red")


For dataset data4, the intercept is \(\beta_0 = 3.002\) and the slope is \(\beta_1 = 0.500\)

Linear regression equation is \(y = 3.002 + 0.500x\)

\(R^2 = 0.667\)

The scatterplot of the independent variable x against the dependent variable y shows no upward or downward trend. For most of the observations, the dependent variable y takes different values for the same value of x; different responses for the same value of the predictor suggest that x and y do not have a linear relationship. Similarly, the Residuals vs. Fitted plot and the Data4.x vs. Residuals plot show non-random patterns, suggesting the relationship between the variables is non-linear.

Although \(R^2 = 0.667\) and the correlation coefficient \(R = 0.82\) suggest there is a positive relationship between the variables, visual inspection of the plots indicates otherwise.

Since the plots show the data do not meet the conditions of the linear regression model, a linear model is not the best fit for the data4 dataset.

#Linear regression data4 dataset
dlm <- lm(y ~ x, data = data4)
summary(dlm)
## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x              0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216
par(mfrow = c(2, 2))
plot(y ~ x, data = data4)
abline(dlm)

plot(dlm)

#Residuals Histogram
hist(dlm$residuals)

plot(data4$x, residuals(dlm), xlab="data4 x", ylab="Residuals", main = "Data4.x Vs. Residuals")
abline(h=0, col="red")


Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

A: For all four datasets, the linear regression summaries (\(R^2\), \(R\), slope \(\beta_1\) and intercept \(\beta_0\)) suggest a positive relationship between the independent variable x and the dependent variable y. Reading the visualizations (plots) for datasets data2, data3 and data4 suggests otherwise. It is always good practice to compare summary statistics with visualizations when analyzing data; otherwise one might arrive at incorrect conclusions.