Please put the answers for Part I next to the question number (2pts each):
7a. Describe the two distributions (2pts).
Figure A is skewed to the right. Figure B is unimodal and more tightly concentrated.
7b. Explain why the means of these two distributions are similar but the standard deviations are not (2 pts).
The means are similar because Figure B is generated from the means of 500 random samples drawn from Figure A, so the sample means center on the mean of the original distribution. The standard deviations differ substantially because Figure B contains only sample means, and averaging smooths out extreme observations: the standard deviation of a sample mean is approximately the population standard deviation divided by the square root of the sample size. The values higher than 10 in Figure A are averaged away and do not appear in Figure B's distribution.
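The reasoning in 7b can be checked with a small simulation. This is only a sketch: the exponential population, its mean of 5, and the sample size of 30 are assumptions chosen for illustration, since the figures do not state how they were generated. Only the count of 500 samples comes from the answer above.

```r
set.seed(42)

# Assumed right-skewed population (exponential with mean 5), standing in for Figure A
population <- rexp(10000, rate = 1 / 5)

# 500 sample means, standing in for Figure B; the sample size of 30 is an assumption
sample_means <- replicate(500, mean(sample(population, size = 30)))

# The two means should be close to each other
c(population = mean(population), sample_means = mean(sample_means))

# The standard deviation of the sample means should be roughly sd(population) / sqrt(30)
c(population = sd(population), sample_means = sd(sample_means))
```

The spread shrinks because each point in the second distribution is an average, not because the underlying data changed.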
7c. What is the statistical principle that describes this phenomenon (2 pts)?
The Central Limit Theorem describes this phenomenon: the distribution of sample means approaches a normal distribution as the sample size increases, even when the underlying distribution is skewed.
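One way to see the theorem at work is to watch the skewness of the sample means fall as the sample size grows. The sketch below assumes an exponential population and uses a hypothetical helper, `skewness`, defined here because base R does not provide one.

```r
set.seed(1)

# Hypothetical helper: a simple moment-based sample skewness
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Assumed sample sizes; n = 1 reproduces the skewed population itself
sizes <- c(1, 5, 30, 100)

# For each sample size, draw 2000 sample means from an assumed exponential population
skew_by_n <- sapply(sizes, function(n) {
  skewness(replicate(2000, mean(rexp(n, rate = 1 / 5))))
})
names(skew_by_n) <- paste0("n=", sizes)

# Skewness near 0 indicates a more symmetric, normal-looking distribution
skew_by_n
```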
Consider the four datasets, each with two columns (x and y), provided below.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))
For each column, calculate (to two decimal places):
calculate <- function(data){
  x <- data$x
  y <- data$y
  # Regress y on x (not x on y), matching the diagnostic plots later in this document
  fit <- lm(y ~ x)
  list(
    meanx = mean(x),
    meany = mean(y),
    medianx = median(x),
    mediany = median(y),
    standarddevx = sd(x),
    standarddevy = sd(y),
    cor = cor(x, y),
    # Pull coefficients by name so the intercept and slope cannot be mixed up
    y_intercept = coef(fit)[["(Intercept)"]],
    slope = coef(fit)[["x"]],
    r_squared = summary(fit)$r.squared
  )
}
data1_df <- as.data.frame(calculate(data1))
data2_df <- as.data.frame(calculate(data2))
data3_df <- as.data.frame(calculate(data3))
data4_df <- as.data.frame(calculate(data4))
final_df <- rbind(data1_df, data2_df, data3_df, data4_df)
rownames(final_df) <- c("data1", "data2", "data3", "data4")
final_df
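These four datasets match Anscombe's quartet, which ships with R as the built-in `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`). As a cross-check on the summary table, the sketch below fits y on x for each pair using that built-in copy. It is an independent illustration, not part of the assignment's required output.

```r
# Fit y_i ~ x_i for each of the four pairs in the built-in anscombe data frame
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Collect intercepts and slopes into one matrix, one row per dataset
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
rownames(coefs) <- paste0("data", 1:4)

round(coefs, 2)
```

All four fits should land on nearly the same line, which is why the summary statistics alone cannot distinguish the datasets.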
Each dataset has the same number of data points (11), so the main concerns are the distribution of the data and the behavior of the residuals. Both are examined below.
library(ggfortify)
## Loading required package: ggplot2
library(ggplot2)
Data 1 appears to have a normal distribution and normal residuals. It is appropriate to fit a linear model to this dataset.
a <- ggplot(data1, aes(x))
a + geom_density()
autoplot(lm(y ~ x, data = data1), label.size = 3)
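The visual impression of normal residuals for Data 1 could be supplemented with a formal test. The sketch below uses the Shapiro-Wilk test from base R; choosing this particular test is an assumption of the example, and with only 11 points it has little power, so it should be read alongside the plots rather than in place of them.

```r
data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68))

fit1 <- lm(y ~ x, data = data1)

# Shapiro-Wilk test on the residuals; a large p-value means no evidence against normality
sw <- shapiro.test(residuals(fit1))
sw$p.value
```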
Data 2 also appears to have a normal distribution; however, the residuals vs. fitted plot suggests that the relationship may be non-linear. Use simple linear regression with caution.
a <- ggplot(data2, aes(x))
a + geom_density()
autoplot(lm(y ~ x, data = data2), label.size = 3)
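The suspicion of non-linearity in Data 2 can be probed by comparing the straight-line fit with one that adds a squared term. The quadratic form is an assumption made for this sketch, suggested by the curved residual pattern rather than stated in the assignment.

```r
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74))

linear <- lm(y ~ x, data = data2)
quadratic <- lm(y ~ x + I(x^2), data = data2)  # assumed alternative model

# Compare how much variance each model explains
r2 <- c(linear = summary(linear)$r.squared,
        quadratic = summary(quadratic)$r.squared)
r2
```

If the quadratic model explains far more variance, a straight line is the wrong functional form for this dataset.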
Data 3 appears to have a normal distribution; however, the residuals reveal an outlier. Use a linear model with caution.
a <- ggplot(data3, aes(x))
a + geom_density()
autoplot(lm(y ~ x, data = data3), label.size = 3)
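To gauge how much the outlier in Data 3 drives the fit, one option is to locate the largest residual and refit without that point. This is a sensitivity check only; dropping a point from a real analysis would need a substantive justification.

```r
data3 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73))

fit_all <- lm(y ~ x, data = data3)

# Row with the largest absolute residual
outlier_row <- unname(which.max(abs(residuals(fit_all))))

# Refit without that row and compare slopes and R-squared
fit_trimmed <- lm(y ~ x, data = data3[-outlier_row, ])
rbind(all_points = c(slope = coef(fit_all)[["x"]], r_squared = summary(fit_all)$r.squared),
      trimmed = c(slope = coef(fit_trimmed)[["x"]], r_squared = summary(fit_trimmed)$r.squared))
```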
Data 4 does not have a normal distribution, and there is an outlier in the residuals. I would advise against using a linear model.
a <- ggplot(data4, aes(x))
a + geom_density()
autoplot(lm(y ~ x, data = data4), label.size = 3)
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
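A plausible explanation for the two warnings above is that one observation in Data 4 has a leverage of 1, which leaves its standardized residual undefined, so the diagnostic plots cannot draw it. The sketch below checks the leverage directly with base R's `hatvalues`; the link to the ggfortify warnings is an inference, not something the output states.

```r
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))

fit4 <- lm(y ~ x, data = data4)

# Leverage of each observation; the single point at x = 19 determines the slope alone
lev <- unname(hatvalues(fit4))
round(lev, 2)

# Row with the highest leverage
high_leverage_row <- which.max(lev)
high_leverage_row
```

Because every other point shares x = 8, the fitted line must pass through the lone point at x = 19, so the slope rests entirely on one observation.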
Looking only at the raw numbers for data1, data2, data3, and data4, it would be very difficult to determine correlations, possible relationships, or whether a simple linear regression model is valid. Visualizations allow us to interpret and analyze data and to make valid, interesting connections that we could not make otherwise. Visualization is one of the most important tools in data science and data analytics.