IS607 - Project 2

Introduction
Methodology
Exploratory Analysis
- Descriptive Statistics
- y=f(x)
  - Group I
  - Group II
  - Group III
  - Group IV
Outlier Handling
Data Modeling
- Group I
- Group II
- Group III
- Group IV
Conclusion

Introduction

Project 2 of the IS607 class requires an exploratory data analysis of four given data sets.

The data consist of two variables (X, Y), with values recorded for each of four groups (I, II, III, IV).

Our objectives in this analysis are:

Identify data quality issues.
Suggest hypotheses about the cause of observed phenomena.
Assess assumptions on which statistical inference will be based.
Support the selection of appropriate statistical tools and techniques.

Methodology

The methodology of this analysis consists of:

Capture the provided data in CSV format with the following columns: group, x, y
Load the data into R
Conduct exploratory analysis, including:
- Obtain descriptive statistics by group
- Produce basic y~x charts with a fitted linear regression line
- Interpret the results
Model the data, including:
- Handle outliers
- Fit linear or polynomial regression models
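
The first two steps can be sketched as follows. The original CSV file is not shown, so this sketch assumes the four groups match R's built-in `anscombe` data set (columns `x1`..`x4`, `y1`..`y4`), which agrees with the summary statistics reported in this analysis. The names `quartet` and `quartet.csv` are hypothetical.

```r
# Assumption: the four groups match base R's `anscombe` data set
# (columns x1..x4 and y1..y4); the original CSV file is not available here.
groups <- c("I", "II", "III", "IV")

# Stack the wide columns into the long layout described above: group, x, y.
quartet <- do.call(rbind, lapply(seq_along(groups), function(i) {
  data.frame(
    group = groups[i],
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))
quartet$group <- factor(quartet$group, levels = groups)

str(quartet)

# The same frame could be written out and read back as CSV
# ("quartet.csv" is a hypothetical file name):
# write.csv(quartet, "quartet.csv", row.names = FALSE)
# quartet <- read.csv("quartet.csv")
```

Keeping the data in long format (one row per observation) makes the by-group summaries and fits in the later steps straightforward.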

Exploratory Analysis

Descriptive Statistics

X Variable

##   Group x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
## 1     I    4.0       6.5      9.0    9.0      11.5   14.0
## 2    II    4.0       6.5      9.0    9.0      11.5   14.0
## 3   III    4.0       6.5      9.0    9.0      11.5   14.0
## 4    IV    8.0       8.0      8.0    9.0       8.0   19.0

##   group 4 5 6 7  8 9 10 11 12 13 14 19
## 1     I 1 1 1 1  1 1  1  1  1  1  1  0
## 2    II 1 1 1 1  1 1  1  1  1  1  1  0
## 3   III 1 1 1 1  1 1  1  1  1  1  1  0
## 4    IV 0 0 0 0 10 0  0  0  0  0  0  1

From the summary and frequency table above, we see that the values of x in groups I, II and III are identical: X takes each integer value from 4 to 14 exactly once.

Group IV, however, has only two distinct values, X=8 (n=10) and X=19 (n=1), yet it has the same mean as groups I, II and III.
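
The two tables above could be reproduced along these lines. This is a minimal sketch, assuming the x values of groups I to IV equal `anscombe$x1`..`anscombe$x4` from base R; the object names are illustrative.

```r
# Assumption: the x values of groups I-IV equal anscombe$x1..x4 from base R.
group_levels <- c("I", "II", "III", "IV")
x_long <- data.frame(
  group = factor(rep(group_levels, each = nrow(anscombe)), levels = group_levels),
  x = c(anscombe$x1, anscombe$x2, anscombe$x3, anscombe$x4)
)

# Six-number summary of x within each group (one row per group).
x_summary <- t(sapply(split(x_long$x, x_long$group), summary))
print(x_summary)

# Frequency of each distinct x value per group.
x_counts <- table(x_long$group, x_long$x)
print(x_counts)
```

The frequency table is what exposes group IV: the summary alone shows identical means, but only the counts reveal that ten of its eleven observations share one x value.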

Y Variable

##   Group y.Min. y.1st Qu. y.Median y.Mean y.3rd Qu. y.Max.
## 1     I    4.3       6.3      7.6    7.5       8.6   10.8
## 2    II    3.1       6.7      8.1    7.5       8.9    9.3
## 3   III    5.4       6.2      7.1    7.5       8.0   12.7
## 4    IV    5.2       6.2      7.0    7.5       8.2   12.5

Although the mean of Y is 7.5 in every group, the distribution of Y values differs across groups.
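
A short check of this observation, again assuming the y values of groups I to IV equal `anscombe$y1`..`anscombe$y4` from base R (the object names are illustrative):

```r
# Assumption: the y values of groups I-IV equal anscombe$y1..y4 from base R.
y_by_group <- list(I = anscombe$y1, II = anscombe$y2,
                   III = anscombe$y3, IV = anscombe$y4)

# The mean is the same in every group (to one decimal place)...
y_means <- sapply(y_by_group, mean)
print(round(y_means, 1))

# ...but the quartiles and extremes differ from group to group.
y_quartiles <- t(sapply(y_by_group, quantile))
print(round(y_quartiles, 1))
```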

y=f(x)

Because the distributions of X and Y differ across groups, we analyze the relationship between X and Y separately for each group (I, II, III, IV).

For this part of the analysis we plot the (x, y) pairs of each group alongside a linear fit to characterize the relationship between X and Y.

Group I

The chart above shows that the data in group I follow a linear trend, with an acceptable R-squared and p-value.
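
The fit statistics mentioned here can be extracted with `lm()` and `summary()`. The sketch below assumes group I corresponds to columns `x1` and `y1` of base R's `anscombe` data; it is not necessarily the code used to produce the original chart.

```r
# Assumption: group I corresponds to columns x1 and y1 of base R's `anscombe`.
group1 <- data.frame(x = anscombe$x1, y = anscombe$y1)

fit1 <- lm(y ~ x, data = group1)
fit1_summary <- summary(fit1)

# Coefficient of determination and the p-value of the slope term.
r_squared <- fit1_summary$r.squared
slope_p   <- fit1_summary$coefficients["x", "Pr(>|t|)"]

print(coef(fit1))
print(c(r_squared = r_squared, slope_p = slope_p))

# To draw the scatter plot with the fitted line:
# plot(y ~ x, data = group1); abline(fit1)
```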

Group II

Here we see that group II's regression output is essentially identical to that of group I. However, the points follow a clear curve, so a quadratic approximation is likely a better fit.
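
One way to check whether a quadratic term helps is to fit both models and compare them. This is a sketch under the assumption that group II corresponds to columns `x2` and `y2` of base R's `anscombe` data.

```r
# Assumption: group II corresponds to columns x2 and y2 of base R's `anscombe`.
group2 <- data.frame(x = anscombe$x2, y = anscombe$y2)

linear_fit    <- lm(y ~ x, data = group2)
quadratic_fit <- lm(y ~ x + I(x^2), data = group2)

# R-squared of each model, side by side.
r2 <- c(linear    = summary(linear_fit)$r.squared,
        quadratic = summary(quadratic_fit)$r.squared)
print(round(r2, 4))

# F-test for whether the squared term improves on the straight line.
print(anova(linear_fit, quadratic_fit))
```

`I(x^2)` is needed because `^` has a special meaning inside an R formula; wrapping it in `I()` makes it an arithmetic square.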

Group III

In this case we can see the presence of an outlier (13,12.7). This value clearly skews the fitted line. If it is removed, a much better fit can be obtained.
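
The effect of removing the outlier can be checked by fitting the line with and without that point. The sketch assumes group III corresponds to columns `x3` and `y3` of base R's `anscombe` data, where x = 13 occurs exactly once.

```r
# Assumption: group III corresponds to columns x3 and y3 of base R's `anscombe`.
group3 <- data.frame(x = anscombe$x3, y = anscombe$y3)

# Drop the single point at x = 13 identified above as an outlier.
group3_trimmed <- subset(group3, x != 13)

fit_all     <- lm(y ~ x, data = group3)
fit_trimmed <- lm(y ~ x, data = group3_trimmed)

# Intercept, slope and R-squared for both fits.
comparison <- rbind(
  all_points      = c(coef(fit_all),     r_squared = summary(fit_all)$r.squared),
  outlier_removed = c(coef(fit_trimmed), r_squared = summary(fit_trimmed)$r.squared)
)
print(round(comparison, 4))
```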

Group IV

This series appears to include multiple Y observations for X=8. These Y values can be averaged into a single observation.

However, the resulting function does not appear to differ much from y=0.5x + 3.
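
The aggregation and its effect on the fitted line can be sketched as follows, assuming group IV corresponds to columns `x4` and `y4` of base R's `anscombe` data.

```r
# Assumption: group IV corresponds to columns x4 and y4 of base R's `anscombe`.
group4 <- data.frame(x = anscombe$x4, y = anscombe$y4)

# Average the repeated y observations at each distinct x value.
group4_avg <- aggregate(y ~ x, data = group4, FUN = mean)
print(group4_avg)

# Fit the line on the raw data and on the averaged data.
fit_raw <- lm(y ~ x, data = group4)
fit_avg <- lm(y ~ x, data = group4_avg)
print(rbind(raw = coef(fit_raw), averaged = coef(fit_avg)))
```

Because x takes only two distinct values, the least-squares line through the raw data already passes through the mean of Y at each x, so averaging should leave the coefficients unchanged. What changes is that the averaged data contain only two points, so the fit is exact and carries no information about residual scatter.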

Outlier Handling

For the purpose of fitting the data to a model, we will apply the following manipulations to the data (by group):

I: Even though the R-squared is relatively low, the scatter around the linear fit is not caused by outliers; it is intrinsic to the data. Therefore, we will use the data directly without any modifications.
II: In this case, we will attempt to fit the data to a polynomial model without making any modifications to the data.
III: There is clearly one outlier (x=13,y=12.74) which throws the linear model in the wrong direction. We will remove this data point in order to obtain a model that more adequately fits the remainder of the data.
IV: There appear to be multiple Y observations for X=8. We will aggregate these multiple observations into one by averaging all the corresponding Y values.

Data Modeling

Conclusion

Hypotheses: The analysis supports treating X as the predictor and Y as the response across all groups of observations.
Assumption: Based on the regression analysis we identified that X and Y move in the same direction without major outliers except in groups III and IV. Under this assumption, we handled outliers in order to improve the quality of the data. Outliers must be handled with care. We retained, removed or aggregated data points with the goal of obtaining a better fit to either a linear or polynomial model.
Selection of technique: Simple linear or polynomial regression is suitable for analyzing each data set since there are only two variables in each set. However, multiple linear regression with interactions would be another option if the goal were to examine the relationship between the data sets. Our goal was to obtain a good regression fit for each group.

Mauricio Alarcon, Jamey Etherton

March 14, 2015
