CUNY606 - Final Project

This document is the final exam for CUNY 606 - Intro to Statistics and Probabilities - Spring 2016.

Part II

Data Sets:

# Four data sets of 11 (x, y) pairs each (the values match Anscombe's quartet).
# Sets 1 to 3 share the same x values.
options(digits=2)
data1 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68))

data2 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74))

data3 <- data.frame(x=c(10,8,13,9,11,14,6,4,12,7,5),
                    y=c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73))

data4 <- data.frame(x=c(8,8,8,8,8,8,8,19,8,8,8),
                    y=c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89))

For each column, calculate (to two decimal places):

a. The mean (for x and y separately):

For each data set, we will calculate the mean of x and the mean of y, denoted mx and my respectively.

The results will be displayed in a tabular form:

Data set	variable x	variable y
Set 1	9.00	7.50
Set 2	9.00	7.50
Set 3	9.00	7.50
Set 4	9.00	7.50
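As a minimal sketch of how these means could be reproduced, the four data frames can be collected in a list and passed to `colMeans`. The list name `sets` and the shared vector `x123` are names introduced here for illustration only; the values are those defined under "Data Sets".

```r
# Data sets as defined at the top of this document.
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
sets <- list(
  set1 = data.frame(x = x123, y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  set2 = data.frame(x = x123, y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  set3 = data.frame(x = x123, y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  set4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

# One row per data set, one column per variable (mx and my).
means <- t(sapply(sets, colMeans))
print(round(means, 2))
```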

b. The median (for x and y separately):

For each data set, we will calculate the median for x and y denoted mdx and mdy respectively.

Again, we will display the results in a tabular form:

Data set	variable x	variable y
Set 1	9.00	7.58
Set 2	9.00	8.14
Set 3	9.00	7.11
Set 4	8.00	7.04
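The medians could be obtained the same way, applying `median` to each column of each data frame. This is a sketch only; `sets` and `x123` are illustrative names that do not appear in the original code.

```r
# Data sets as defined at the top of this document.
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
sets <- list(
  set1 = data.frame(x = x123, y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  set2 = data.frame(x = x123, y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  set3 = data.frame(x = x123, y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  set4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

# Median of x and y (mdx and mdy) for each data set.
medians <- t(sapply(sets, function(d) sapply(d, median)))
print(round(medians, 2))
```

With 11 observations per column, each median is simply the 6th value after sorting.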

c. The standard deviation (for x and y separately):

For each data set, we will calculate the standard deviation for x and y denoted sdx and sdy respectively.

Again, we will display the results in a tabular form:

Data	variable x	variable y
set 1	3.32	2.03
set 2	3.32	2.03
set 3	3.32	2.03
set 4	3.32	2.03
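A possible way to reproduce the standard deviations is shown below. The sketch also checks R's `sd` against the sample formula (dividing by n - 1) written out by hand; `manual_sd`, `sets` and `x123` are hypothetical names used only for this illustration.

```r
# Data sets as defined at the top of this document.
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
sets <- list(
  set1 = data.frame(x = x123, y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  set2 = data.frame(x = x123, y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  set3 = data.frame(x = x123, y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  set4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

# Sample standard deviation written out explicitly (denominator n - 1).
manual_sd <- function(v) sqrt(sum((v - mean(v))^2) / (length(v) - 1))

sds    <- t(sapply(sets, function(d) sapply(d, sd)))         # sdx and sdy
manual <- t(sapply(sets, function(d) sapply(d, manual_sd)))  # same, by hand

print(round(sds, 2))
print(all.equal(sds, manual))
```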

For each x and y pair, calculate (also to two decimal places):

d. The correlation:

For each pair (x,y), we will calculate the correlation between x and y.

The correlation for each pair is shown below:

Data	Correlation (x,y)
Set 1	0.82
Set 2	0.82
Set 3	0.82
Set 4	0.82
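The correlations above could be computed with `cor`, as in the following sketch. The names `sets`, `x123` and `correlations` are introduced here for illustration and are not part of the original code.

```r
# Data sets as defined at the top of this document.
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
sets <- list(
  set1 = data.frame(x = x123, y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  set2 = data.frame(x = x123, y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  set3 = data.frame(x = x123, y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  set4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

# Pearson correlation between x and y for each data set.
correlations <- sapply(sets, function(d) cor(d$x, d$y))
print(round(correlations, 2))
```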

e. Linear regression equation:

For each data set, we will determine the regression line equation.

Regression line equation:
$\hat{y} = \beta_0 + \beta_1 \cdot x$

For the first data set, we have: y = 3 + 0.5 x

For the second data set, we have: y = 3 + 0.5 x

For the third data set, we have: y = 3 + 0.5 x

For the fourth data set, we have: y = 3 + 0.5 x
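One way to obtain these equations is to fit `lm(y ~ x)` to each data set and read off the coefficients. In this sketch the column labels `beta0` and `beta1` follow the notation of the regression line equation; `sets`, `x123` and `coefs` are illustrative names.

```r
# Data sets as defined at the top of this document.
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
sets <- list(
  set1 = data.frame(x = x123, y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  set2 = data.frame(x = x123, y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  set3 = data.frame(x = x123, y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  set4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

# Least squares intercept (beta0) and slope (beta1) for each data set.
coefs <- t(sapply(sets, function(d) coef(lm(y ~ x, data = d))))
colnames(coefs) <- c("beta0", "beta1")
print(round(coefs, 2))
```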

f. R-Squared:

The R-Squared for the first data set is 0.67

The R-Squared for the second data set is 0.67

The R-Squared for the third data set is 0.67

The R-Squared for the fourth data set is 0.67
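The R-Squared values could be read from the model summary, as sketched below. For a simple linear regression with one predictor, R-Squared equals the square of the correlation between x and y, which the sketch also checks. The names `sets`, `x123`, `r2` and `r2_from_cor` are illustrative.

```r
# Data sets as defined at the top of this document.
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
sets <- list(
  set1 = data.frame(x = x123, y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  set2 = data.frame(x = x123, y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  set3 = data.frame(x = x123, y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  set4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

# R-Squared from the fitted model, and from the squared correlation.
r2          <- sapply(sets, function(d) summary(lm(y ~ x, data = d))$r.squared)
r2_from_cor <- sapply(sets, function(d) cor(d$x, d$y)^2)

print(round(r2, 2))
print(all.equal(r2, r2_from_cor))
```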

g. Linear Regression Model Evaluation:

For each pair, is it appropriate to estimate a linear regression model? Why or why not? Be specific as to why for each pair and include appropriate plots!

For each data set, we would like to evaluate whether it is appropriate to estimate a linear regression model. We will therefore check the following conditions for each data set.

Linearity: The data should show a linear trend.
Nearly normal residuals: Generally, the residuals must be nearly normal.
Constant variability: The variability of points around the least squares line remains roughly constant.
Independent observations: The observations of the data set must be independent.

For the first condition, we will draw the scatter plot and observe the pattern.
For the second condition, we will draw the qqplot of the residuals and observe the pattern.
For the third condition, we will draw the residual plot and observe the pattern.
For the fourth condition, we will assume independence unless we have evidence to the contrary.
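The three plots described above could be produced with a small helper such as the one below. This is a sketch, not necessarily how the plots in this report were generated; `check_linear_model` is a hypothetical function name, and the example applies it to the first data set only.

```r
# First data set, as defined at the top of this document.
data1 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68))

# Hypothetical helper: fits y ~ x, draws the three diagnostic plots side by
# side, and returns the fitted model invisibly.
check_linear_model <- function(d, label) {
  fit <- lm(y ~ x, data = d)
  op <- par(mfrow = c(1, 3))
  on.exit(par(op))

  # Linearity: scatter plot with the least squares line.
  plot(d$x, d$y, main = paste(label, "scatter plot"), xlab = "x", ylab = "y")
  abline(fit)

  # Nearly normal residuals: qqplot of the residuals.
  qqnorm(resid(fit), main = paste(label, "qqplot of residuals"))
  qqline(resid(fit))

  # Constant variability: residuals against x.
  plot(d$x, resid(fit), main = paste(label, "residuals vs x"),
       xlab = "x", ylab = "residual")
  abline(h = 0, lty = 2)

  invisible(fit)
}

fit1 <- check_linear_model(data1, "Data set 1:")
```

The same call can be repeated for the other data sets by passing a different data frame and label.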

Evaluation for Data Set 1:

First, we will look at the scatter plot.

From the scatter plot, it appears that there is a positive linear relationship. The correlation for this data set is 0.82, which would indicate a strong relationship.

We will now look at the residuals, starting with their qqplot.

From the qqplot of the residuals, it appears that the points are close to the line, with the exception of the two points with the lowest values.
We will now plot the histogram of the residuals.

The histogram of the residuals shows a unimodal, roughly symmetrical distribution. Because the data are sparse and discrete, it is difficult to draw a firm conclusion about the residual distribution from the histogram alone. We will therefore defer to the qqplot of the residuals to check whether the residuals follow a “normal” distribution.

We will now observe the residual plot and determine whether the points have a constant variability.

From this graph, we can conclude that the variability of the data is constant.

We will assume independence of the observations.

In summary for data set 1, it would be appropriate to estimate a regression line model.

Data Set # 2

We will now perform a similar evaluation on the second data set.

We first look at the scatter plot.

From the scatter plot, it appears that there is a strong relationship between the variables, but it is not a linear one. It appears to be a parabola. The correlation for this data set is 0.82.

From this observation, we conclude that it would not be appropriate to estimate a regression line model.
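The curvature can also be checked numerically. The sketch below compares the straight-line fit with a fit that adds a squared term; the quadratic model is an assumption introduced here for illustration and is not part of the original analysis.

```r
# Second data set, as defined at the top of this document.
data2 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74))

linear    <- lm(y ~ x, data = data2)
quadratic <- lm(y ~ x + I(x^2), data = data2)  # illustrative alternative model

# Share of variance explained by each model.
print(summary(linear)$r.squared)
print(summary(quadratic)$r.squared)

# Residuals of the straight-line fit against x: a curved pattern here
# indicates that the linearity condition is not met.
plot(data2$x, resid(linear), xlab = "x", ylab = "residual",
     main = "Data set 2: residuals of the linear fit")
abline(h = 0, lty = 2)
```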

Data Set # 3

We will now look at the scatter plot for the third data set.

From the scatter plot, it appears that there is a positive linear relationship between the variables, with one outlier. The correlation between the two variables is 0.82, which would denote a strong relationship.

We will now look at the qqplot for the residuals.

Except for the outlier, all other points are on the line, which would indicate that the residuals follow a normal distribution.

The histogram of the residuals appears roughly uniform, and the outlier would need to be accounted for. Since we have only 11 data points, it is difficult to interpret the histogram of the residuals. We will rely on the qqplot of the residuals to evaluate their distribution.

We will now look at the residual plot against the x variable to observe the variability of the data.

From this plot, we can observe that the variability of the points increases as the x variable increases. The variability is therefore not constant.

We will conclude that for Data Set 3, it would not be appropriate to estimate a regression line model.
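To see how much the single outlier drives the fit, one option is to locate the observation with the largest absolute residual and refit without it. This is a sketch under that assumption; dropping the point is shown only to illustrate its influence and was not done in the original analysis.

```r
# Third data set, as defined at the top of this document.
data3 <- data.frame(x = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                    y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73))

fit3 <- lm(y ~ x, data = data3)

# Row with the largest absolute residual, taken here as the outlier.
outlier <- which.max(abs(resid(fit3)))
print(data3[outlier, ])

# Refit without that row and compare the coefficients of the two models.
fit3_trimmed <- lm(y ~ x, data = data3[-outlier, ])
print(rbind(all_points = coef(fit3), without_outlier = coef(fit3_trimmed)))
```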

Data Set # 4:

From the scatter plot, the data points lie on a vertical line at x = 8, with a single outlier far away from the rest of the data. Yet the fitted line is still y = 3 + 0.5 x. We would surmise that the observations may not be independent.

We therefore conclude that it would not be appropriate to estimate a regression line model.
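The dependence of the fitted line on the single point at x = 19 can be illustrated with leverage values (`hatvalues`). Leverage is not discussed in the original analysis; it is brought in here only as a hedged illustration of why the line cannot be trusted.

```r
# Fourth data set, as defined at the top of this document.
data4 <- data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))

fit4 <- lm(y ~ x, data = data4)

# Leverage of each observation; the point at x = 19 has leverage 1,
# meaning the fitted line is forced through it.
lev <- hatvalues(fit4)
print(round(lev, 2))

# Without that point, x is constant and no slope can be estimated:
# lm() reports the x coefficient as NA.
far_point    <- which(data4$x == 19)
fit4_without <- lm(y ~ x, data = data4[-far_point, ])
print(coef(fit4_without))
```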

Explain why it is important to include appropriate visualizations when analyzing data. Include any visualization(s) you create. (2 pts)

These cases clearly illustrate the shortcoming of relying only on summary statistics and calculated results: the four data sets gave nearly identical results, and only the medians differed. However, when we looked at the various graphs, we obtained a different understanding of each data set. Visualization is an important aspect of data analysis, as are summary statistics. Both should be taken into consideration when performing statistical analysis.

(see graph in the above section)

In addition, we will graph histograms for each component of the data sets. Please note that data1$x, data2$x and data3$x have the same values, so we will only have one graph for these.

Histograms for data sets
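A minimal sketch of how these histograms could be drawn with base R follows. The 2 by 3 panel layout and the names `sets` and `x123` are choices made here for illustration; the original layout is not known.

```r
# Data sets as defined at the top of this document.
x123 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
sets <- list(
  set1 = data.frame(x = x123, y = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)),
  set2 = data.frame(x = x123, y = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74)),
  set3 = data.frame(x = x123, y = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)),
  set4 = data.frame(x = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
                    y = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
)

# Six panels: one shared x histogram for sets 1 to 3, one for x in set 4,
# then one y histogram per data set.
op <- par(mfrow = c(2, 3))
hist(sets$set1$x, main = "x, data sets 1 to 3", xlab = "x")
hist(sets$set4$x, main = "x, data set 4", xlab = "x")
for (i in seq_along(sets)) {
  hist(sets[[i]]$y, main = paste("y, data set", i), xlab = "y")
}
par(op)
```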

Thank you


Valerie Briot

May 17, 2016

Part I

a. Describe the two distributions

b. Explain why the means of these two distributions are similar but the standard deviations are not

c. What is the statistical principle that describes this phenomenon (2 pts)?