Part I

Please put the answers for Part I next to the question number (2pts each):

7a. Describe the two distributions (2pts).

Figure A is skewed to the right. Figure B is unimodal and more condensed.

7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).

The means are similar because Figure B is generated from the means of 500 random samples drawn from Figure A. Each sample mean is an estimate of the mean of A, so the distribution of sample means is centered at approximately the same value. The standard deviations differ significantly because B contains only sample means rather than individual observations. Averaging pulls extreme values toward the center, so the values higher than 10 in A do not appear in B, and the spread of B is the standard error, roughly the standard deviation of A divided by the square root of the sample size.

7c. What is the statistical principle that describes this phenomenon (2 pts)?

The Central Limit Theorem describes this phenomenon: the distribution of sample means approaches a normal distribution as the sample size increases, even when the underlying distribution is skewed.
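
The behavior described in 7b and 7c can be illustrated with a small simulation. This is only a sketch: the exponential population, the sample size of 30, and the seed are assumptions chosen for illustration, since the exam does not say how Figures A and B were generated.

```r
set.seed(606)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed "population" standing in for Figure A
population <- rexp(10000, rate = 1 / 5)

# Hypothetical sample size; the exam does not state it
n <- 30

# 500 sample means, standing in for Figure B
sample_means <- replicate(500, mean(sample(population, n)))

# The two means should be close to each other
c(mean_A = mean(population), mean_B = mean(sample_means))

# The SD of the sample means should be near sd(A) / sqrt(n)
c(sd_A = sd(population),
  sd_B = sd(sample_means),
  sd_A_over_sqrt_n = sd(population) / sqrt(n))
```

Plotting `hist(population)` next to `hist(sample_means)` should show a skewed histogram and a narrower, roughly bell-shaped one.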

Part II

Consider the four datasets, each with two columns (x and y), provided below.

options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately; 1 pt).

b. The median (for x and y separately; 1 pt).

c. The standard deviation (for x and y separately; 1 pt).

d. The correlation (1 pt).

e. Linear regression equation (2 pts).

f. R-Squared (2 pts).

calculate <- function(data){
  x <- data$x
  y <- data$y

  # Summary statistics for each column
  meanx <- mean(x)
  meany <- mean(y)
  medianx <- median(x)
  mediany <- median(y)
  standarddevx <- sd(x)
  standarddevy <- sd(y)

  # Pearson correlation (named so it does not mask the cor() function)
  correlation <- cor(x, y)

  # Regress y on x, so the equation is y = y_intercept + slope * x
  fit <- lm(y ~ x)
  y_intercept <- coef(fit)[["(Intercept)"]]
  slope <- coef(fit)[["x"]]
  r_squared <- summary(fit)$r.squared

  return(list(meanx = meanx, meany = meany,
              medianx = medianx, mediany = mediany,
              standarddevx = standarddevx, standarddevy = standarddevy,
              cor = correlation, y_intercept = y_intercept,
              slope = slope, r_squared = r_squared))
}

data1_df <- as.data.frame(calculate(data1))
data2_df <- as.data.frame(calculate(data2))
data3_df <- as.data.frame(calculate(data3))
data4_df <- as.data.frame(calculate(data4))
final_df <- rbind(data1_df, data2_df, data3_df, data4_df)
rownames(final_df) <- c("data1", "data2", "data3", "data4")
final_df
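
As a cross-check on parts d through f, the slope, intercept, and R-squared of a simple linear regression of y on x can be recovered from the summary statistics alone: slope = r * sd(y) / sd(x), intercept = mean(y) - slope * mean(x), and R-squared = r^2. The sketch below applies these identities to data1 (copied from above) and compares them with `lm()`. The variable names are illustrative.

```r
data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                          7.24, 4.26, 10.84, 4.82, 5.68))

# Regression quantities derived from the summary statistics
r <- cor(data1$x, data1$y)
slope <- r * sd(data1$y) / sd(data1$x)
intercept <- mean(data1$y) - slope * mean(data1$x)
r_squared <- r^2

# The same quantities from a fitted model of y on x
fit <- lm(y ~ x, data = data1)

# Both comparisons should report TRUE
all.equal(unname(coef(fit)), c(intercept, slope))
all.equal(summary(fit)$r.squared, r_squared)
```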

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots! (4 pts)

Each pair has the same number of data points, so we are most concerned with the residuals and the normality of the data. Both are examined below.

library(ggfortify)

## Loading required package: ggplot2

library(ggplot2)

Data 1

Data 1 appears to be normally distributed with normally distributed residuals. It is appropriate to fit a linear model to these data.

a <- ggplot(data1, aes(x)) 
a + geom_density()

autoplot(lm(y ~ x, data = data1), label.size = 3)

Data 2

Data 2 also appears to be normally distributed; however, the residuals vs. fitted plot shows a clear pattern, which suggests that the relationship is non-linear. Use simple linear regression with caution.

a <- ggplot(data2, aes(x)) 
a + geom_density()

autoplot(lm(y ~ x, data = data2), label.size = 3)
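
One way to probe the suspected non-linearity in data2 is to compare the straight-line fit with a fit that adds a squared term. This is a sketch of a possible follow-up check, not part of the original analysis; the quadratic form is an assumption made for illustration.

```r
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1,
                          6.13, 3.1, 9.13, 7.26, 4.74))

# Straight-line model, as used in the diagnostic plots
linear_fit <- lm(y ~ x, data = data2)

# Hypothetical alternative: add a squared term to capture curvature
quadratic_fit <- lm(y ~ x + I(x^2), data = data2)

# Compare the proportion of variance explained by each model
c(linear = summary(linear_fit)$r.squared,
  quadratic = summary(quadratic_fit)$r.squared)
```

A much higher R-squared for the quadratic model would support the reading that a straight line is the wrong functional form here.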

Data 3

Data 3 appears to be normally distributed; however, there is an outlier, as shown in the residual plots. Use a linear model with caution.

a <- ggplot(data3, aes(x)) 
a + geom_density()

autoplot(lm(y ~ x, data = data3), label.size = 3)
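
To see how much the outlier in data3 affects the fit, one option is a sensitivity check: locate the observation with the largest absolute residual and refit without it. This is a sketch only. Dropping a point is not automatically justified, and the names `fit_all` and `fit_trimmed` are illustrative.

```r
data3 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                          6.08, 5.39, 8.15, 6.42, 5.73))

# Fit on all eleven points
fit_all <- lm(y ~ x, data = data3)

# Row with the largest absolute residual
outlier_row <- which.max(abs(residuals(fit_all)))
data3[outlier_row, ]

# Refit without that row and compare the coefficients
fit_trimmed <- lm(y ~ x, data = data3[-outlier_row, ])
rbind(all_points = coef(fit_all),
      without_outlier = coef(fit_trimmed))
```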

Data 4

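Data 4 does not have a normal distribution, and there is an outlier in the residuals. Every x value is 8 except for a single point at x = 19, so that one point determines the fitted line. I would advise against using a linear model.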

a <- ggplot(data4, aes(x)) 
a + geom_density()

autoplot(lm(y ~ x, data = data4), label.size = 3)

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_path).
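
The influence of the single point at x = 19 in data4 can be quantified with leverage (hat) values. The sketch below uses base R's `hatvalues()`. A leverage of 1 means the fitted line passes exactly through that point, which leaves its standardized residual undefined; that is a plausible, though unverified, reason one row is dropped in the warnings above.

```r
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                          5.25, 12.5, 5.56, 7.91, 6.89))

fit4 <- lm(y ~ x, data = data4)

# Leverage of each observation; values near 1 dominate the fit
leverage <- hatvalues(fit4)
round(leverage, 2)

# Index of the most influential observation
which.max(leverage)
```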

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

Looking only at the raw numbers and summary statistics for data1, data2, data3, and data4, it would be nearly impossible to determine the nature of the relationships or whether a simple linear regression model is valid, because the four datasets share almost the same means, standard deviations, correlations, and regression lines. Visualizations allow us to interpret and analyze data and to make valid and interesting connections that we would not be able to make otherwise. They are among the most important tools in data science and data analytics.
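
One possible visualization is a faceted scatter plot of all four datasets with the fitted regression line overlaid. This is a sketch rather than the plot from the original submission. It assumes the four datasets match R's built-in `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`), which the values above appear to; the `quartet` and `p` names are illustrative.

```r
library(ggplot2)

# Stack the built-in anscombe columns into long format
# (assumption: these match data1 through data4 above)
quartet <- data.frame(
  dataset = rep(paste0("data", 1:4), each = nrow(anscombe)),
  x = c(anscombe$x1, anscombe$x2, anscombe$x3, anscombe$x4),
  y = c(anscombe$y1, anscombe$y2, anscombe$y3, anscombe$y4)
)

# One panel per dataset, each with its own least-squares line
p <- ggplot(quartet, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ dataset)
p
```

Placing the panels side by side makes it easy to see that the fitted lines are nearly identical while the point patterns are very different.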

DATA 606 Fall 2017 - Final Exam

Michele Bradley