Part II

Consider the four datasets, each with two columns (x and y), provided below.

#options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

data.frame(data=c("data1","data2","data3","data4"),
           mean_x=round(c(mean(data1$x), mean(data2$x), mean(data3$x), mean(data4$x)), 2),
           mean_y=round(c(mean(data1$y), mean(data2$y), mean(data3$y), mean(data4$y)), 2))

b. The median (for x and y separately; 1 pt).

data.frame(data=c("data1","data2","data3","data4"),
           median_x=round(c(median(data1$x), median(data2$x), median(data3$x), median(data4$x)), 2),
           median_y=round(c(median(data1$y), median(data2$y), median(data3$y), median(data4$y)), 2))

c. The standard deviation (for x and y separately; 1 pt).

data.frame(data=c("data1","data2","data3","data4"),
           sd_x=round(c(sd(data1$x), sd(data2$x), sd(data3$x), sd(data4$x)), 2),
           sd_y=round(c(sd(data1$y), sd(data2$y), sd(data3$y), sd(data4$y)), 2))
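Parts a to c can also be computed in one pass. The following is a minimal sketch, not the method used above: it assumes the four data frames hold the same values as R's built-in `anscombe` dataset (they appear to match) and rebuilds them from it so the block is self-contained. The names `quartet`, `summarise_xy`, and `summary_table` are illustrative only.

```r
# Rebuild the four data frames from R's built-in anscombe data
# (assumption: it holds the same values as data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# Hypothetical helper: all six summary statistics for one dataset.
summarise_xy <- function(d) {
  c(mean_x = mean(d$x), mean_y = mean(d$y),
    median_x = median(d$x), median_y = median(d$y),
    sd_x = sd(d$x), sd_y = sd(d$y))
}

# One row per dataset, one column per statistic, rounded to two decimals.
summary_table <- round(t(sapply(quartet, summarise_xy)), 2)
print(summary_table)
```

Applying one function to a list of data frames avoids repeating each call four times and keeps the rounding in a single place.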

For each x and y pair, calculate (also to two decimal places; 1 pt):

d. The correlation (1 pt).

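# Pearson correlation between x and y for each dataset, rounded to two decimal places
data.frame(data=c("data1","data2","data3","data4"),
           correlation=round(c(cor(data1$x, data1$y), cor(data2$x, data2$y),
                               cor(data3$x, data3$y), cor(data4$x, data4$y)), 2))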

e. Linear regression equation (2 pts).

# Fit y as a linear function of x for each dataset
lm_data1 <- lm(y ~ x, data=data1)
lm_data2 <- lm(y ~ x, data=data2)
lm_data3 <- lm(y ~ x, data=data3)
lm_data4 <- lm(y ~ x, data=data4)

Linear regression equation (\(y = {\beta}_0 + {\beta}_1 * x\)) for:

data1: \(y = 3 + 0.5 * x\)

data2: \(y = 3.001 + 0.5 * x\)

data3: \(y = 3.002 + 0.5 * x\)

data4: \(y = 3.002 + 0.5 * x\)
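The equations above can be read off the fitted models instead of being typed by hand. This is a sketch under the assumption that the data match R's built-in `anscombe` dataset, which is used here only to keep the block self-contained; `quartet`, `fits`, `coefs`, and `equations` are illustrative names.

```r
# Rebuild the four data frames from R's built-in anscombe data
# (assumption: it holds the same values as data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# Fit one simple linear regression per dataset.
fits <- lapply(quartet, function(d) lm(y ~ x, data = d))

# coef() returns c(intercept, slope); collect them as one row per dataset.
coefs <- round(t(sapply(fits, coef)), 3)
colnames(coefs) <- c("intercept", "slope")

# Format each fit as an equation string.
equations <- sprintf("y = %.3f + %.3f * x", coefs[, "intercept"], coefs[, "slope"])
names(equations) <- names(fits)
print(equations)
```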

f. R-Squared (2 pts).

# R-squared for each fitted model, rounded to two decimal places
data.frame(data=c("data1","data2","data3","data4"),
           r_squared=round(c(summary(lm_data1)$r.squared, summary(lm_data2)$r.squared,
                             summary(lm_data3)$r.squared, summary(lm_data4)$r.squared), 2))
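For a simple linear regression with one predictor, R-squared equals the square of the Pearson correlation between x and y, so parts d and f can be cross-checked against each other. A minimal sketch, again assuming the data match R's built-in `anscombe` dataset; `quartet`, `r_sq`, and `cor_sq` are illustrative names.

```r
# Rebuild the four data frames from R's built-in anscombe data
# (assumption: it holds the same values as data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# R-squared from the fitted model and the squared correlation, per dataset.
r_sq <- sapply(quartet, function(d) summary(lm(y ~ x, data = d))$r.squared)
cor_sq <- sapply(quartet, function(d) cor(d$x, d$y)^2)

# Show both side by side; the two columns should agree.
print(round(cbind(r_squared = r_sq, cor_squared = cor_sq), 2))

# Fails loudly if the identity does not hold within numerical tolerance.
stopifnot(isTRUE(all.equal(r_sq, cor_sq)))
```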

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Data1: No. The normal Q-Q plot of the residuals appears S-shaped.

# Histogram and normal Q-Q plot of the residuals, side by side
par(mfrow=c(1,2))
hist(lm_data1$residuals)

qqnorm(lm_data1$residuals)
qqline(lm_data1$residuals)

Data2: No. The normal Q-Q plot of the residuals appears S-shaped.

# Histogram and normal Q-Q plot of the residuals, side by side
par(mfrow=c(1,2))
hist(lm_data2$residuals)

qqnorm(lm_data2$residuals)
qqline(lm_data2$residuals)

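Data3: Yes. The points on the normal Q-Q plot of the residuals appear to follow a straight line.

# Histogram and normal Q-Q plot of the residuals, side by side
par(mfrow=c(1,2))
hist(lm_data3$residuals)

qqnorm(lm_data3$residuals)
qqline(lm_data3$residuals)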

Data4: No. The normal Q-Q plot of the residuals appears S-shaped.

# Histogram and normal Q-Q plot of the residuals, side by side
par(mfrow=c(1,2))
hist(lm_data4$residuals)

qqnorm(lm_data4$residuals)
qqline(lm_data4$residuals)
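Histograms and Q-Q plots check whether the residuals look normal; a residuals-versus-fitted plot is a complementary check that targets linearity and constant variance more directly. The sketch below is an addition rather than part of the original answer, and assumes the data match R's built-in `anscombe` dataset; `quartet` and `fits` are illustrative names.

```r
# Rebuild the four data frames from R's built-in anscombe data
# (assumption: it holds the same values as data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

fits <- lapply(quartet, function(d) lm(y ~ x, data = d))

# One residuals-vs-fitted panel per dataset in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
for (name in names(fits)) {
  plot(fitted(fits[[name]]), resid(fits[[name]]),
       main = name, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)  # reference line at zero residual
}
par(op)  # restore the previous plotting layout
```

If a linear model is adequate, the points should scatter around the dashed line with no visible curve or trend. Note that in data4 every x value but one equals 8, so its fitted values fall at only two positions on the horizontal axis.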

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

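Presenting data visually makes difficult concepts easier to grasp and can reveal patterns that summary numbers hide. Patterns or trends that might go unnoticed in text-based data are easier to recognize in a plot. For example, the four datasets above have nearly identical regression equations (approximately \(y = 3 + 0.5 * x\)), so the fitted coefficients alone cannot tell them apart.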
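One way to see this is to plot each dataset with its fitted regression line. This is a minimal sketch, assuming the data match R's built-in `anscombe` dataset; `quartet` is an illustrative name, and the axis limits are chosen only so that all four panels share the same scale.

```r
# Rebuild the four data frames from R's built-in anscombe data
# (assumption: it holds the same values as data1..data4 above).
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# Scatterplot of y against x with the least-squares line, one panel per dataset.
op <- par(mfrow = c(2, 2))
for (name in names(quartet)) {
  d <- quartet[[name]]
  plot(d$x, d$y, main = name, xlab = "x", ylab = "y",
       xlim = c(3, 20), ylim = c(2, 14))  # shared limits for easy comparison
  abline(lm(y ~ x, data = d), col = "red")
}
par(op)  # restore the previous plotting layout
```

Using the same axis limits in every panel makes differences in shape visible even though the fitted lines are almost the same.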

DATA 606 Spring 2018 - Final Exam

Ohannes Ohannessian

Part I