Project 2 of the IS607 class requires an exploratory data analysis of four given data sets.
Each data set consists of two variables (X, Y), with values recorded for four groups (I, II, III, IV).
Our objectives in this analysis are:
The methodology of this analysis consists of:
##   Group x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
## 1     I    4.0       6.5      9.0    9.0      11.5   14.0
## 2    II    4.0       6.5      9.0    9.0      11.5   14.0
## 3   III    4.0       6.5      9.0    9.0      11.5   14.0
## 4    IV    8.0       8.0      8.0    9.0       8.0   19.0
##   group  4  5  6  7  8  9 10 11 12 13 14 19
## 1     I  1  1  1  1  1  1  1  1  1  1  1  0
## 2    II  1  1  1  1  1  1  1  1  1  1  1  0
## 3   III  1  1  1  1  1  1  1  1  1  1  1  0
## 4    IV  0  0  0  0 10  0  0  0  0  0  0  1
From the summary and plot above we see that the values of X in groups I, II, and III are identical: X takes each integer value from 4 to 14 exactly once.
Group IV, however, has only two distinct values, X=8 (n=10) and X=19 (n=1), yet it has the same mean (9.0) as groups I, II, and III.
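The frequency table and means above can be reproduced with a short sketch. It is written in Python for illustration only (the original output appears to come from R), and it assumes the data are the X values of Anscombe's quartet, which match the summaries shown. The names `x_by_group` and `x_frequencies` are hypothetical.

```python
from collections import Counter
from statistics import mean

# Assumption: X values of Anscombe's quartet, which match the tables above.
x_by_group = {
    "I":   [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    "II":  [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    "III": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    "IV":  [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
}


def x_frequencies(values):
    """Return {x value: number of occurrences}, ordered by x value."""
    return dict(sorted(Counter(values).items()))


x_counts = {group: x_frequencies(vals) for group, vals in x_by_group.items()}
x_means = {group: mean(vals) for group, vals in x_by_group.items()}

for group in x_by_group:
    print(group, "mean =", x_means[group], "counts =", x_counts[group])
```

Counting occurrences per group makes the contrast explicit: the mean alone cannot distinguish group IV from the other three.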
##   Group y.Min. y.1st Qu. y.Median y.Mean y.3rd Qu. y.Max.
## 1     I    4.3       6.3      7.6    7.5       8.6   10.8
## 2    II    3.1       6.7      8.1    7.5       8.9    9.3
## 3   III    5.4       6.2      7.1    7.5       8.0   12.7
## 4    IV    5.2       6.2      7.0    7.5       8.2   12.5
Although the mean of Y is 7.5 in every group, the distribution of Y values differs across groups.
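A per-group summary of Y can be sketched as follows. This is an illustrative Python version, not the original code, and it assumes the Y values of Anscombe's quartet, which agree with the rounded summaries in this report. The `"inclusive"` method of `statistics.quantiles` interpolates the same way as R's default quantile type, so the quartiles should agree with the table up to rounding. The names `y_by_group` and `summarize` are hypothetical.

```python
from statistics import mean, quantiles

# Assumption: Y values of Anscombe's quartet, which match the summaries above.
y_by_group = {
    "I":   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II":  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV":  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}


def summarize(values):
    """Six-number summary: min, quartiles, mean, and max."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
    }


y_summary = {group: summarize(vals) for group, vals in y_by_group.items()}

for group, stats in y_summary.items():
    print(group, {name: round(value, 2) for name, value in stats.items()})
```

The means coincide, while the minima, maxima, and quartiles do not, which is the difference the text describes.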
Because the basic distributions of the X and Y values differ, we break the analysis down by group (I, II, III, IV).
For this part of the analysis we plot the (x, y) pairs of each group, alongside a linear fit, to characterize the relationship between X and Y.
From the chart above we can see that the data in group I follow a linear trend, with acceptable R² and p-values.
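The linear fit for group I can be sketched with ordinary least squares. This Python version is illustrative and assumes the group I values of Anscombe's quartet. It reports the slope, intercept, and R² only. The p-value is omitted because it needs a t-distribution, which the standard library does not provide. The name `fit_line` is hypothetical.

```python
from statistics import mean

# Assumption: group I of Anscombe's quartet, which matches the summaries above.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


slope, intercept, r_squared = fit_line(x1, y1)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f}")
```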
Here we see that group II’s regression output is identical to that of group I. However, a quadratic approximation appears to be a better fit.
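One way to check the quadratic suggestion is to fit polynomials of degree 1 and 2 to group II and compare their R² values. The sketch below is illustrative Python, not the original code. It assumes the group II values of Anscombe's quartet and solves the least-squares normal equations directly, which is adequate for a data set this small. The names `polyfit` and `r_squared` are hypothetical helpers defined here, not library functions.

```python
# Assumption: group II of Anscombe's quartet, which matches the summaries above.
x2 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] (lowest power first)."""
    size = degree + 1
    # Augmented normal equations: A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    aug = [
        [sum(x ** (i + j) for x in xs) for j in range(size)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(size)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(size):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * p for a, p in zip(aug[row], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def r_squared(xs, ys, coeffs):
    """Coefficient of determination for the fitted polynomial."""
    y_bar = sum(ys) / len(ys)
    predicted = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


linear = polyfit(x2, y2, 1)
quadratic = polyfit(x2, y2, 2)
r2_linear = r_squared(x2, y2, linear)
r2_quadratic = r_squared(x2, y2, quadratic)
print(f"linear R^2={r2_linear:.3f}  quadratic R^2={r2_quadratic:.3f}")
```

Under these assumptions the quadratic fit explains almost all of the variance in Y, while the straight line leaves roughly a third unexplained.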
In group III we can see an outlier at (13, 12.7). This point clearly skews the fitted line; if it is removed, a much better fit can be obtained.
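The effect of the outlier can be illustrated by fitting a line, dropping the point with the largest absolute residual, and fitting again. The sketch below is illustrative Python and assumes the group III values of Anscombe's quartet. Choosing the point by largest residual is one simple option, not necessarily the procedure used in the original analysis. The name `fit_line` is hypothetical.

```python
from statistics import mean

# Assumption: group III of Anscombe's quartet, which matches the summaries above.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept, r_squared)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Fit on all points, then locate the point with the largest absolute residual.
slope_all, intercept_all, r2_all = fit_line(x3, y3)
residuals = [y - (intercept_all + slope_all * x) for x, y in zip(x3, y3)]
worst = max(range(len(x3)), key=lambda i: abs(residuals[i]))
outlier = (x3[worst], y3[worst])

# Refit without that single point.
x_clean = [x for i, x in enumerate(x3) if i != worst]
y_clean = [y for i, y in enumerate(y3) if i != worst]
slope_clean, intercept_clean, r2_clean = fit_line(x_clean, y_clean)

print("outlier:", outlier)
print(f"with outlier:    slope={slope_all:.3f} R^2={r2_all:.3f}")
print(f"without outlier: slope={slope_clean:.3f} R^2={r2_clean:.3f}")
```

Removing a point changes the data being modeled, so the refit should be reported alongside the original fit rather than in place of it.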
For the purpose of fitting the data to a model, we will apply the following manipulations to the data (by group):
Hypotheses: In every group of observations, X is the predictor variable and Y is the response variable.
Assumption: Based on the regression analysis, we identified that X and Y move in the same direction, with no major outliers except in groups III and IV. Under this assumption, we handled outliers in order to improve the quality of the data. Handling outliers requires care: we retained, removed, aggregated, or adjusted data points with the goal of obtaining a better fit to either a linear or a polynomial model.
Selection of technique: Simple linear or polynomial regression is suitable for analyzing each data set, since each contains only two variables. Multiple linear regression with interactions would be another option if the goal were to explore relationships among the data sets. Our goal was to obtain a good linear regression.