# Data 606 Fall 2016 - Final Exam
# Part I
# a. Describe the two distributions
# Distribution A is unimodal and strongly skewed to the right, with a wide spread (mean 5.05,
# standard deviation 3.22). Distribution B is roughly symmetric, with values ranging from about 3.0 to 6.5.

# b. Explain why the means of these two distributions are similar but the standard deviations are not.
# Distribution B is the distribution of the means of 500 random samples of size 30 drawn from A.
# Each sample mean estimates the mean of A, so, in line with the Central Limit Theorem, B is centered
# close to the mean of A. Averaging 30 observations reduces variability, so the standard deviation of B
# is the standard error of the mean, which is smaller than the standard deviation of A by a factor of sqrt(n).
# The normal model for the sample mean tends to be very good when the sample consists of at least 30
# independent observations and the population data are not strongly skewed.
# The standard error is computed as follows:
s = 3.22
n = 30
standard_error = s / sqrt(n)
standard_error
## [1] 0.5878889
# c. What is the statistical principle that describes this phenomenon?
# The phenomenon is described by the Central Limit Theorem: the normal model for the sample mean tends to be very good when the sample consists of at least 30 independent observations and the population data are not strongly skewed.
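
# A minimal simulation sketch of the idea in parts b and c. The original population A is not
# available here, so `skewed_pop` is a hypothetical right-skewed stand-in (a gamma distribution
# chosen only because its mean and spread are roughly similar to A); all names are illustrative.
set.seed(606)
skewed_pop <- rgamma(10000, shape = 2.5, scale = 2)   # assumed stand-in for distribution A
# Draw 500 random samples of size 30 and keep the mean of each one.
sample_means <- replicate(500, mean(sample(skewed_pop, size = 30)))
# The two means should be close to each other.
c(population_mean = mean(skewed_pop), mean_of_sample_means = mean(sample_means))
# The spread of the sample means should be close to sd / sqrt(n).
c(theoretical_se = sd(skewed_pop) / sqrt(30), sd_of_sample_means = sd(sample_means))
# The histogram of the sample means should look far more symmetric than the population.
hist(sample_means, main = "Means of 500 samples of size 30", xlab = "Sample mean")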

# Part II
options(digits=2) 
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5), y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68)) 
data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5), y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74)) 
data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5), y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73)) 
data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8), y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89)) 

# a. The mean (for x and y separately)
mean(data1$x)
## [1] 9
mean(data1$y)
## [1] 7.5
# The mean of x in data1 is 9, the mean of y in data1 is 7.5.
mean(data2$x)
## [1] 9
mean(data2$y)
## [1] 7.5
# The mean of x in data2 is 9, the mean of y in data2 is 7.5.
mean(data3$x)
## [1] 9
mean(data3$y)
## [1] 7.5
# The mean of x in data3 is 9, the mean of y in data3 is 7.5.
mean(data4$x)
## [1] 9
mean(data4$y)
## [1] 7.5
# The mean of x in data4 is 9, the mean of y in data4 is 7.5.

# b. The median (for x and y separately)
median(data1$x)
## [1] 9
median(data1$y)
## [1] 7.6
# The median of x in data1 is 9, the median of y in data1 is 7.6.
median(data2$x)
## [1] 9
median(data2$y)
## [1] 8.1
# The median of x in data2 is 9, the median of y in data2 is 8.1.
median(data3$x)
## [1] 9
median(data3$y)
## [1] 7.1
# The median of x in data3 is 9, the median of y in data3 is 7.1.
median(data4$x)
## [1] 8
median(data4$y)
## [1] 7
# The median of x in data4 is 8, the median of y in data4 is 7.

# c. The standard deviation (for x and y separately)
sd(data1$x)
## [1] 3.3
sd(data1$y)
## [1] 2
# The standard deviation of x in data1 is 3.3, the standard deviation of y in data1 is 2.
sd(data2$x)
## [1] 3.3
sd(data2$y)
## [1] 2
# The standard deviation of x in data2 is 3.3, the standard deviation of y in data2 is 2.
sd(data3$x)
## [1] 3.3
sd(data3$y)
## [1] 2
# The standard deviation of x in data3 is 3.3, the standard deviation of y in data3 is 2.
sd(data4$x)
## [1] 3.3
sd(data4$y)
## [1] 2
# The standard deviation of x in data4 is 3.3, the standard deviation of y in data4 is 2.

# d. The correlation.
cor(data1$x, data1$y)
## [1] 0.82
# The correlation (x,y) in data1 is 0.82.
cor(data2$x, data2$y)
## [1] 0.82
# The correlation (x,y) in data2 is 0.82.
cor(data3$x, data3$y)
## [1] 0.82
# The correlation (x,y) in data3 is 0.82.
cor(data4$x, data4$y)
## [1] 0.82
# The correlation (x,y) in data4 is 0.82.

# e. Linear regression equation.
lm(y ~ x, data1)
## 
## Call:
## lm(formula = y ~ x, data = data1)
## 
## Coefficients:
## (Intercept)            x  
##         3.0          0.5
# The equation of data1 is y = 0.5x + 3.
lm(y ~ x, data2)
## 
## Call:
## lm(formula = y ~ x, data = data2)
## 
## Coefficients:
## (Intercept)            x  
##         3.0          0.5
# The equation of data2 is y = 0.5x + 3.
lm(y ~ x, data3)
## 
## Call:
## lm(formula = y ~ x, data = data3)
## 
## Coefficients:
## (Intercept)            x  
##         3.0          0.5
# The equation of data3 is y = 0.5x + 3.
lm(y ~ x, data4)
## 
## Call:
## lm(formula = y ~ x, data = data4)
## 
## Coefficients:
## (Intercept)            x  
##         3.0          0.5
# The equation of data4 is y = 0.5x + 3.

# f. R-squared
summary(lm(y ~ x, data1))$r.squared
## [1] 0.67
# The R-squared of data1 is 0.67.
summary(lm(y ~ x, data2))$r.squared
## [1] 0.67
# The R-squared of data2 is 0.67.
summary(lm(y ~ x, data3))$r.squared
## [1] 0.67
# The R-squared of data3 is 0.67.
summary(lm(y ~ x, data4))$r.squared
## [1] 0.67
# The R-squared of data4 is 0.67.
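
# A compact way to put the statistics from parts a to f side by side. This is only a sketch:
# it assumes R's built-in `anscombe` data frame, whose columns x1..x4 and y1..y4 hold the same
# values as data1..data4 above. The names `quartet` and `summary_table` are illustrative.
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]], y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)
# One column per data set, one row per statistic.
summary_table <- sapply(quartet, function(d) {
  fit <- lm(y ~ x, data = d)
  c(mean_x    = mean(d$x),
    mean_y    = mean(d$y),
    median_x  = median(d$x),
    median_y  = median(d$y),
    sd_x      = sd(d$x),
    sd_y      = sd(d$y),
    cor_xy    = cor(d$x, d$y),
    intercept = coef(fit)[["(Intercept)"]],
    slope     = coef(fit)[["x"]],
    r_squared = summary(fit)$r.squared)
})
round(summary_table, 2)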

# For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots!
# Data 1
# First, we draw a scatterplot for data1.
plot(data1$x, data1$y)

# The plot shows a fairly strong positive linear relationship between y and x, with no obvious curvature or outliers. The correlation is 0.82 and R-squared is 0.67. It is appropriate to estimate a linear regression model.
# Data 2
plot(data2$x, data2$y)

# The scatterplot shows a curved relationship that appears quadratic rather than linear. It is not appropriate to estimate a linear regression model.
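
# One way the curvature could be checked numerically, as a sketch only. It assumes R's built-in
# `anscombe` data frame, whose x2/y2 columns hold the same values as data2. The names
# `fit2_linear` and `fit2_quadratic` are illustrative.
fit2_linear    <- lm(y2 ~ x2, data = anscombe)
fit2_quadratic <- lm(y2 ~ x2 + I(x2^2), data = anscombe)   # adds a squared term
# Compare how much variation each model explains.
c(linear = summary(fit2_linear)$r.squared, quadratic = summary(fit2_quadratic)$r.squared)
# A systematic arch in the residuals of the straight-line fit also signals curvature.
plot(anscombe$x2, resid(fit2_linear), xlab = "x", ylab = "Residual of linear fit")
abline(h = 0, lty = 2)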

# Data 3
plot(data3$x, data3$y)

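# The plot shows a positive linear relationship between y and x, although one point (x = 13, y = 12.74) lies well above the pattern of the others. It is appropriate to estimate a linear regression model, with the caveat that this outlier influences the fitted line.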
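
# A sketch of how sensitive the data3 fit is to individual points. It assumes R's built-in
# `anscombe` data frame, whose x3/y3 columns hold the same values as data3. The names
# `fit3_all`, `largest_resid`, and `fit3_trimmed` are illustrative.
fit3_all <- lm(y3 ~ x3, data = anscombe)
# Find the observation with the largest absolute residual.
largest_resid <- which.max(abs(resid(fit3_all)))
anscombe[largest_resid, c("x3", "y3")]
# Refit without that observation and compare the coefficients.
fit3_trimmed <- lm(y3 ~ x3, data = anscombe[-largest_resid, ])
rbind(all_points = coef(fit3_all), without_largest_residual = coef(fit3_trimmed))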

# Data 4
plot(data4$x, data4$y)

# The plot shows that all observations but one share the same x value (x = 8), so the points form a vertical stack and the fitted line is determined by the single point at x = 19. It is not appropriate to estimate a linear regression model.
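
# A sketch of why the data4 fit is not trustworthy. It assumes R's built-in `anscombe` data
# frame, whose x4/y4 columns hold the same values as data4. The names `lone_point`, `fit4_all`,
# and `fit4_rest` are illustrative.
lone_point <- which(anscombe$x4 != 8)        # the only observation with a different x
fit4_all   <- lm(y4 ~ x4, data = anscombe)
# Leverage (hat value) of that observation; values near 1 mean the fit passes through it.
hatvalues(fit4_all)[lone_point]
# Without that observation x is constant, so a slope cannot be estimated at all.
fit4_rest <- lm(y4 ~ x4, data = anscombe[-lone_point, ])
coef(fit4_rest)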

# Explain why it is important to include appropriate visualizations when analyzing data.  Include any visualization(s) you create.
# Using plots or graphs to visualize large amounts of data is easier than poring over tables or reports. Data visualization is a quick, easy way to convey concepts. In this exam, the mean, standard deviation, correlation, regression equation, and R-squared are essentially the same for these 4 data sets. Without plotting
# the data, we would not know which model is appropriate for each data set.
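
# A sketch of one visualization that makes this point: all four scatterplots on shared axes,
# each with its fitted regression line. It assumes R's built-in `anscombe` data frame, whose
# columns x1..x4 and y1..y4 hold the same values as data1..data4. The name `panels` and the
# axis limits are illustrative choices.
panels <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]], y = anscombe[[paste0("y", i)]])
})
old_par <- par(mfrow = c(2, 2))              # 2 x 2 grid of plots
for (i in seq_along(panels)) {
  d <- panels[[i]]
  plot(d$x, d$y, main = paste("Data", i), xlab = "x", ylab = "y",
       xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x, data = d), col = "red")   # nearly the same line in every panel
}
par(old_par)                                 # restore the previous plot layout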