Initial Analysis

We are given 4 data sets and asked to perform an exploratory analysis. Before any analysis, the data must be read into R objects so that we can use R’s statistical functions. On inspection, the 4 data sets contain no NA values and are small (each has just 2 variables and 11 observations). Since the data is in tabular form, the most suitable R object is a data.frame. Four data frames are created to hold the 4 data sets: D1, D2, D3 and D4 hold data sets I, II, III and IV respectively. Each data frame has 2 variables, X and Y.

NOTE: Data sets I, II and III have identical X values, so in the following R code the X vector is prepared only once and reused to create the D1, D2 and D3 data frames. Data set IV has different X values, so a new X vector is prepared to create the D4 data frame.

X <- c(10,8,13,9,11,14,6,4,12,7,5)  
Y <- c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68)

D1 <- data.frame(X=X,Y=Y)

Y <- c(9.14,8.14,8.74,8.77,9.26,8.10,6.13,3.1,9.13,7.26,4.74)
D2 <- data.frame(X=X,Y=Y)

Y <- c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73)
D3 <- data.frame(X=X,Y=Y)

X <- c(8,8,8,8,8,8,8,19,8,8,8)
Y <- c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89)

D4 <- data.frame(X=X,Y=Y)
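The properties stated above (no NA values, 11 observations of 2 variables each, identical X values in data sets I to III) can be checked in code rather than by eye. The sketch below is self-contained and rests on one assumption: that the four data sets hold the same values as R's built-in `anscombe` data set (columns `x1`..`x4`, `y1`..`y4`). The name `quartet` is only illustrative.

```r
# Assumption: the built-in `anscombe` data set holds the same values as D1..D4.
quartet <- list(
  D1 = data.frame(X = anscombe$x1, Y = anscombe$y1),
  D2 = data.frame(X = anscombe$x2, Y = anscombe$y2),
  D3 = data.frame(X = anscombe$x3, Y = anscombe$y3),
  D4 = data.frame(X = anscombe$x4, Y = anscombe$y4)
)

# Does any data frame contain missing values?
has_na <- sapply(quartet, anyNA)

# Rows (first row of the result) and columns (second row) of each data frame
sizes <- sapply(quartet, dim)

# Do data sets I, II and III share the same X vector?
same_x <- identical(quartet$D1$X, quartet$D2$X) &&
  identical(quartet$D2$X, quartet$D3$X)

print(has_na)
print(sizes)
print(same_x)
```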

Let us print the contents of all 4 data sets, along with their summaries. Since the data sets have few observations, I am printing their full contents. In real world problems the data sets are often large, and it is not practical to list their whole contents. In that case we can use functions such as head() and tail() to display the first and last few records of a data set respectively.
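As a small illustration of head() and tail(), the sketch below shows the first and last three observations of data set I. It assumes that data set I matches columns `x1` and `y1` of R's built-in `anscombe` data set, so that the block can run on its own.

```r
# Assumption: data set I matches columns x1 and y1 of the built-in `anscombe` data.
D1 <- data.frame(X = anscombe$x1, Y = anscombe$y1)

first_rows <- head(D1, 3)  # first 3 observations
last_rows  <- tail(D1, 3)  # last 3 observations

print(first_rows)
print(last_rows)
```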

Contents of D1 Data frame and its summary information

D1

##     X     Y
## 1  10  8.04
## 2   8  6.95
## 3  13  7.58
## 4   9  8.81
## 5  11  8.33
## 6  14  9.96
## 7   6  7.24
## 8   4  4.26
## 9  12 10.84
## 10  7  4.82
## 11  5  5.68

summary(D1)

##        X              Y         
##  Min.   : 4.0   Min.   : 4.260  
##  1st Qu.: 6.5   1st Qu.: 6.315  
##  Median : 9.0   Median : 7.580  
##  Mean   : 9.0   Mean   : 7.501  
##  3rd Qu.:11.5   3rd Qu.: 8.570  
##  Max.   :14.0   Max.   :10.840

Contents of D2 Data frame and its summary information

D2

##     X    Y
## 1  10 9.14
## 2   8 8.14
## 3  13 8.74
## 4   9 8.77
## 5  11 9.26
## 6  14 8.10
## 7   6 6.13
## 8   4 3.10
## 9  12 9.13
## 10  7 7.26
## 11  5 4.74

summary(D2)

##        X              Y        
##  Min.   : 4.0   Min.   :3.100  
##  1st Qu.: 6.5   1st Qu.:6.695  
##  Median : 9.0   Median :8.140  
##  Mean   : 9.0   Mean   :7.501  
##  3rd Qu.:11.5   3rd Qu.:8.950  
##  Max.   :14.0   Max.   :9.260

Contents of D3 Data frame and its summary information

D3

##     X     Y
## 1  10  7.46
## 2   8  6.77
## 3  13 12.74
## 4   9  7.11
## 5  11  7.81
## 6  14  8.84
## 7   6  6.08
## 8   4  5.39
## 9  12  8.15
## 10  7  6.42
## 11  5  5.73

summary(D3)

##        X              Y        
##  Min.   : 4.0   Min.   : 5.39  
##  1st Qu.: 6.5   1st Qu.: 6.25  
##  Median : 9.0   Median : 7.11  
##  Mean   : 9.0   Mean   : 7.50  
##  3rd Qu.:11.5   3rd Qu.: 7.98  
##  Max.   :14.0   Max.   :12.74

Contents of D4 Data frame and its summary information

D4

##     X     Y
## 1   8  6.58
## 2   8  5.76
## 3   8  7.71
## 4   8  8.84
## 5   8  8.47
## 6   8  7.04
## 7   8  5.25
## 8  19 12.50
## 9   8  5.56
## 10  8  7.91
## 11  8  6.89

summary(D4)

##        X            Y         
##  Min.   : 8   Min.   : 5.250  
##  1st Qu.: 8   1st Qu.: 6.170  
##  Median : 8   Median : 7.040  
##  Mean   : 9   Mean   : 7.501  
##  3rd Qu.: 8   3rd Qu.: 8.190  
##  Max.   :19   Max.   :12.500

The summaries of the data frames (D1, D2, D3 and D4) show that the variables X and Y are numeric in every data frame. We will use the table() function to see how the observations are distributed in each data frame.

table(D1$X)

## 
##  4  5  6  7  8  9 10 11 12 13 14 
##  1  1  1  1  1  1  1  1  1  1  1

table(D1$Y)

## 
##  4.26  4.82  5.68  6.95  7.24  7.58  8.04  8.33  8.81  9.96 10.84 
##     1     1     1     1     1     1     1     1     1     1     1

table(D1$X,D1$Y)

##     
##      4.26 4.82 5.68 6.95 7.24 7.58 8.04 8.33 8.81 9.96 10.84
##   4     1    0    0    0    0    0    0    0    0    0     0
##   5     0    0    1    0    0    0    0    0    0    0     0
##   6     0    0    0    0    1    0    0    0    0    0     0
##   7     0    1    0    0    0    0    0    0    0    0     0
##   8     0    0    0    1    0    0    0    0    0    0     0
##   9     0    0    0    0    0    0    0    0    1    0     0
##   10    0    0    0    0    0    0    1    0    0    0     0
##   11    0    0    0    0    0    0    0    1    0    0     0
##   12    0    0    0    0    0    0    0    0    0    0     1
##   13    0    0    0    0    0    1    0    0    0    0     0
##   14    0    0    0    0    0    0    0    0    0    1     0

table(D2$X)

## 
##  4  5  6  7  8  9 10 11 12 13 14 
##  1  1  1  1  1  1  1  1  1  1  1

table(D2$Y)

## 
##  3.1 4.74 6.13 7.26  8.1 8.14 8.74 8.77 9.13 9.14 9.26 
##    1    1    1    1    1    1    1    1    1    1    1

table(D2$X,D2$Y)

##     
##      3.1 4.74 6.13 7.26 8.1 8.14 8.74 8.77 9.13 9.14 9.26
##   4    1    0    0    0   0    0    0    0    0    0    0
##   5    0    1    0    0   0    0    0    0    0    0    0
##   6    0    0    1    0   0    0    0    0    0    0    0
##   7    0    0    0    1   0    0    0    0    0    0    0
##   8    0    0    0    0   0    1    0    0    0    0    0
##   9    0    0    0    0   0    0    0    1    0    0    0
##   10   0    0    0    0   0    0    0    0    0    1    0
##   11   0    0    0    0   0    0    0    0    0    0    1
##   12   0    0    0    0   0    0    0    0    1    0    0
##   13   0    0    0    0   0    0    1    0    0    0    0
##   14   0    0    0    0   1    0    0    0    0    0    0

table(D3$X)

## 
##  4  5  6  7  8  9 10 11 12 13 14 
##  1  1  1  1  1  1  1  1  1  1  1

table(D3$Y)

## 
##  5.39  5.73  6.08  6.42  6.77  7.11  7.46  7.81  8.15  8.84 12.74 
##     1     1     1     1     1     1     1     1     1     1     1

table(D3$X,D3$Y)

##     
##      5.39 5.73 6.08 6.42 6.77 7.11 7.46 7.81 8.15 8.84 12.74
##   4     1    0    0    0    0    0    0    0    0    0     0
##   5     0    1    0    0    0    0    0    0    0    0     0
##   6     0    0    1    0    0    0    0    0    0    0     0
##   7     0    0    0    1    0    0    0    0    0    0     0
##   8     0    0    0    0    1    0    0    0    0    0     0
##   9     0    0    0    0    0    1    0    0    0    0     0
##   10    0    0    0    0    0    0    1    0    0    0     0
##   11    0    0    0    0    0    0    0    1    0    0     0
##   12    0    0    0    0    0    0    0    0    1    0     0
##   13    0    0    0    0    0    0    0    0    0    0     1
##   14    0    0    0    0    0    0    0    0    0    1     0

table(D4$X)

## 
##  8 19 
## 10  1

table(D4$Y)

## 
## 5.25 5.56 5.76 6.58 6.89 7.04 7.71 7.91 8.47 8.84 12.5 
##    1    1    1    1    1    1    1    1    1    1    1

table(D4$X,D4$Y)

##     
##      5.25 5.56 5.76 6.58 6.89 7.04 7.71 7.91 8.47 8.84 12.5
##   8     1    1    1    1    1    1    1    1    1    1    0
##   19    0    0    0    0    0    0    0    0    0    0    1

Except for the D4 data frame, all the data frames (D1, D2, D3) have distinct values for both X and Y. In D4, there are 10 observations with X=8 and one observation with X=19, so the value X=19 might be an outlier. Let us draw box plots to find any outliers in the X and Y variables of the 4 data frames. But before doing any graphical analysis, we have to perform some data massaging, to make the analysis and the creation of graphs easier.
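For larger data sets, reading a full table() output is impractical, and the same distinctness check can be done with anyDuplicated() and unique(). This is a minimal sketch, assuming data sets I and IV match the corresponding columns of R's built-in `anscombe` data set.

```r
# Assumption: data sets I and IV match the built-in `anscombe` columns.
D1 <- data.frame(X = anscombe$x1, Y = anscombe$y1)
D4 <- data.frame(X = anscombe$x4, Y = anscombe$y4)

# anyDuplicated() returns 0 when every value is distinct
dup_x1 <- anyDuplicated(D1$X)
dup_y1 <- anyDuplicated(D1$Y)

# Number of distinct X values in D4
n_distinct_x4 <- length(unique(D4$X))

print(c(dup_x1 = dup_x1, dup_y1 = dup_y1, n_distinct_x4 = n_distinct_x4))
```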

Data Massaging

To make our analysis easier, let us create a temporary data frame D_Temp that holds the data in a format that is easy to plot. Its contents will change depending on our need. Our first step is to draw box plots for the X and Y variables in the D1, D2 and D3 data frames. These three data frames have the same X values, so my plan is to draw the box plots for their variables side by side. The D_Temp data frame will have two variables, “type” and “value”. The type variable will take 4 different values: X1, Y1, Y2 and Y3. The type=X1 observations represent the X values shared by D1, D2 and D3, which are therefore stored only once. The observations with type=Y1, Y2 and Y3 represent the Y values from D1, D2 and D3 respectively.

D_Temp data frame creation, to plot box plots for D1, D2, D3 data frame variables (side-by-side):

D_Temp <- data.frame(type=c(rep('X1',length(D1$X)),rep('Y1',length(D1$Y)),rep('Y2',length(D2$Y)),rep('Y3',length(D3$Y))),value=c(D1$X,D1$Y,D2$Y,D3$Y))

Let us display the data in the D_Temp data frame. Observe that the X and Y values of D1, D2 and D3 are pivoted into a single value column, and that the type variable of D_Temp identifies the variable and data frame each observation comes from.

D_Temp

##    type value
## 1    X1 10.00
## 2    X1  8.00
## 3    X1 13.00
## 4    X1  9.00
## 5    X1 11.00
## 6    X1 14.00
## 7    X1  6.00
## 8    X1  4.00
## 9    X1 12.00
## 10   X1  7.00
## 11   X1  5.00
## 12   Y1  8.04
## 13   Y1  6.95
## 14   Y1  7.58
## 15   Y1  8.81
## 16   Y1  8.33
## 17   Y1  9.96
## 18   Y1  7.24
## 19   Y1  4.26
## 20   Y1 10.84
## 21   Y1  4.82
## 22   Y1  5.68
## 23   Y2  9.14
## 24   Y2  8.14
## 25   Y2  8.74
## 26   Y2  8.77
## 27   Y2  9.26
## 28   Y2  8.10
## 29   Y2  6.13
## 30   Y2  3.10
## 31   Y2  9.13
## 32   Y2  7.26
## 33   Y2  4.74
## 34   Y3  7.46
## 35   Y3  6.77
## 36   Y3 12.74
## 37   Y3  7.11
## 38   Y3  7.81
## 39   Y3  8.84
## 40   Y3  6.08
## 41   Y3  5.39
## 42   Y3  8.15
## 43   Y3  6.42
## 44   Y3  5.73
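The same long layout can also be produced with base R's stack(), which avoids writing the rep() calls by hand. This is only a sketch of an alternative, not the approach used in this report; it assumes data sets I to III match the built-in `anscombe` columns, and the name `long` is illustrative.

```r
# Assumption: data sets I-III match the built-in `anscombe` columns.
D1 <- data.frame(X = anscombe$x1, Y = anscombe$y1)
D2 <- data.frame(X = anscombe$x2, Y = anscombe$y2)
D3 <- data.frame(X = anscombe$x3, Y = anscombe$y3)

# stack() turns a named list of vectors into two columns:
# `values` (the data) and `ind` (the name of the list element)
long <- stack(list(X1 = D1$X, Y1 = D1$Y, Y2 = D2$Y, Y3 = D3$Y))
names(long) <- c("value", "type")

# Each type should contribute 11 observations
print(table(long$type))
```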

Let us plot Box plots for X1, Y1, Y2, and Y3 values of D_Temp data frame.

library("ggplot2")

ggplot(D_Temp,aes(x=type,y=value))+
 geom_boxplot()

Figure-1: Box plots of the X and Y variables in the D1, D2 and D3 data sets. The X1 box plot shows the X values shared by D1, D2 and D3; the Y1, Y2 and Y3 box plots show the Y values of D1, D2 and D3 respectively.

In the above box plots (Figure-1), the X1 box plot represents the X values of D1 (and of D2 and D3, since all three data frames have the same X values). The Y1, Y2 and Y3 box plots represent the Y variable in D1, D2 and D3 respectively.

From the four box plots, we can infer the following:

None of the X values is an outlier.
The Y values of D2 have one outlier (at the low end).
The Y values of D3 have one outlier (at the high end).
The Y observations in D2 are skewed to the left (toward lower values of Y).
The Y observations in D3 are skewed to the right (toward higher values of Y).
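The outliers read off the box plots can be cross-checked numerically. The sketch below uses base R's boxplot.stats(), which flags points lying more than 1.5 times the interquartile range beyond the hinges. Conventions for computing quartiles can differ slightly between functions, so treat this as a cross-check rather than a reproduction of the plot. It assumes data sets I to III match the built-in `anscombe` columns.

```r
# Assumption: data sets I-III match the built-in `anscombe` columns.
D1 <- data.frame(X = anscombe$x1, Y = anscombe$y1)
D2 <- data.frame(X = anscombe$x2, Y = anscombe$y2)
D3 <- data.frame(X = anscombe$x3, Y = anscombe$y3)

# $out holds the values beyond 1.5 * IQR from the hinges
out_x  <- boxplot.stats(D1$X)$out  # shared X values of D1, D2, D3
out_y1 <- boxplot.stats(D1$Y)$out
out_y2 <- boxplot.stats(D2$Y)$out
out_y3 <- boxplot.stats(D3$Y)$out

print(list(X1 = out_x, Y1 = out_y1, Y2 = out_y2, Y3 = out_y3))
```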

Let us draw box plots for the D4 data frame as well. We plot the D4 variables separately because the X values of D4 differ from those of the other data frames. We will reuse the D_Temp data frame to hold the pivoted data of D4, overwriting its existing contents.

D_Temp <- data.frame(type=c(rep('X4',length(D4$X)),rep('Y4',length(D4$Y))),value=c(D4$X,D4$Y))

ggplot(D_Temp,aes(x=type,y=value))+
 geom_boxplot()

Figure-2: Box plot of D4’s X and Y variables

In the above box plots (Figure-2), the X4 box plot represents the values of the X variable in data set D4, and the Y4 box plot represents the values of the Y variable in D4. We can infer that the X variable of D4 has one outlier and the Y variable has one outlier.

Given the numerical nature of the observations, regression analysis would give us more information about the data. But let us first look for clusters across the data sets. This helps us decide whether to perform regression analysis on the 4 data sets separately or on the combined data.

We will concatenate the D1, D2, D3, and D4 data frames vertically into D_Temp data frame.

D_Temp <- rbind(merge(data.frame(D1),c('D1')), merge(data.frame(D2),c('D2')), merge(data.frame(D3),c('D3')), merge(data.frame(D4),c('D4')))
names(D_Temp) <- c('X','Y','Source')
# The Source variable identifies the data frame that each observation belongs to

D_Temp

##     X     Y Source
## 1  10  8.04     D1
## 2   8  6.95     D1
## 3  13  7.58     D1
## 4   9  8.81     D1
## 5  11  8.33     D1
## 6  14  9.96     D1
## 7   6  7.24     D1
## 8   4  4.26     D1
## 9  12 10.84     D1
## 10  7  4.82     D1
## 11  5  5.68     D1
## 12 10  9.14     D2
## 13  8  8.14     D2
## 14 13  8.74     D2
## 15  9  8.77     D2
## 16 11  9.26     D2
## 17 14  8.10     D2
## 18  6  6.13     D2
## 19  4  3.10     D2
## 20 12  9.13     D2
## 21  7  7.26     D2
## 22  5  4.74     D2
## 23 10  7.46     D3
## 24  8  6.77     D3
## 25 13 12.74     D3
## 26  9  7.11     D3
## 27 11  7.81     D3
## 28 14  8.84     D3
## 29  6  6.08     D3
## 30  4  5.39     D3
## 31 12  8.15     D3
## 32  7  6.42     D3
## 33  5  5.73     D3
## 34  8  6.58     D4
## 35  8  5.76     D4
## 36  8  7.71     D4
## 37  8  8.84     D4
## 38  8  8.47     D4
## 39  8  7.04     D4
## 40  8  5.25     D4
## 41 19 12.50     D4
## 42  8  5.56     D4
## 43  8  7.91     D4
## 44  8  6.89     D4
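A more direct way to build the same combined data frame is to attach the label with cbind() before stacking with rbind(). This is a hedged sketch of an alternative to the merge() call, assuming the four data sets match the built-in `anscombe` columns; the name `D_All` is illustrative.

```r
# Assumption: the four data sets match the built-in `anscombe` columns.
D1 <- data.frame(X = anscombe$x1, Y = anscombe$y1)
D2 <- data.frame(X = anscombe$x2, Y = anscombe$y2)
D3 <- data.frame(X = anscombe$x3, Y = anscombe$y3)
D4 <- data.frame(X = anscombe$x4, Y = anscombe$y4)

# cbind() recycles the single label over all rows of each data frame;
# rbind() then stacks the four labelled data frames vertically
D_All <- rbind(
  cbind(D1, Source = "D1"),
  cbind(D2, Source = "D2"),
  cbind(D3, Source = "D3"),
  cbind(D4, Source = "D4")
)

print(table(D_All$Source))
```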

ggplot(D_Temp,aes(x=X,y=Y,color=Source,shape=Source))+
geom_point(pch=16,size=5)

Figure-3: Scatter plot of (X,Y) observations from all the data frames (D1, D2, D3, D4)

In the above scatter plot (Figure-3), we can see that the observations of each of the 4 data frames follow their own pattern, and that there are no common patterns or clusters across the data frames. This becomes more evident when the scatter plots of X and Y are drawn separately for each data frame.

library("gridExtra")

## Loading required package: grid

P1 <- ggplot(data=D1,aes(x=X, y=Y)) +
  geom_point(pch=16,color="blue",size=4) +
  labs(title="D1 data", x="X", y="Y")

P2 <- ggplot(data=D2,aes(x=X, y=Y)) +
  geom_point(pch=16,color="blue",size=4) +
  labs(title="D2 data", x="X", y="Y")

P3 <- ggplot(data=D3,aes(x=X, y=Y)) +
  geom_point(pch=16,color="blue",size=4) +
  labs(title="D3 data", x="X", y="Y")

P4 <- ggplot(data=D4,aes(x=X, y=Y)) +
  geom_point(pch=16,color="blue",size=4) +
  labs(title="D4 data", x="X", y="Y")

grid.arrange(P1, P2, P3, P4, ncol=2)

Figure-4: Scatter plots of (X,Y) values of D1, D2, D3 and D4 Data frames

The above scatter plots (Figure-4) show that each data frame has a differently shaped scatter plot, so we cannot simply fit one common regression model to all the data (X and Y variables) in D1, D2, D3 and D4.
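The differences between the scatter plots are easy to miss in numerical summaries alone. The sketch below computes the mean, variance and correlation for each data frame; for these data the values come out nearly identical, which is why the plots are needed. It assumes the four data sets match the built-in `anscombe` columns; `quartet` and `stats` are illustrative names.

```r
# Assumption: the four data sets match the built-in `anscombe` columns.
quartet <- list(
  D1 = data.frame(X = anscombe$x1, Y = anscombe$y1),
  D2 = data.frame(X = anscombe$x2, Y = anscombe$y2),
  D3 = data.frame(X = anscombe$x3, Y = anscombe$y3),
  D4 = data.frame(X = anscombe$x4, Y = anscombe$y4)
)

# One column per data frame, one row per statistic
stats <- sapply(quartet, function(d) {
  c(mean_x = mean(d$X),
    mean_y = mean(d$Y),
    var_x  = var(d$X),
    cor_xy = cor(d$X, d$Y))
})

print(round(stats, 3))
```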

Let us perform regression analysis separately, on each of the 4 data sets.

Regression Analysis

We will plot the linear regression line on each of the scatter plots, and also obtain the corresponding residual plot to determine whether the linear regression model is a reasonable fit for the respective scatter plot. If the residual plot shows a specific pattern, we have to perform further analysis to see whether another (for example, non-linear) model fits the data better. In this analysis, I treat Y as the dependent variable and X as the explanatory (or independent) variable. Please NOTE that a regression model on its own cannot identify a cause and effect relationship. Hence we cannot conclude from the fit alone that Y depends on X, or that X depends on Y. Establishing a cause and effect relationship between 2 variables requires more than fitting a model (typically a designed experiment), which is out of the scope of this analysis.
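To make the terms concrete, the sketch below computes the least-squares line and its residuals by hand for data set I and compares them with lm(). For simple linear regression the slope is cov(X, Y) / var(X) and the intercept is mean(Y) minus the slope times mean(X); a residual is the observed Y minus the fitted Y. The block assumes data set I matches columns `x1` and `y1` of the built-in `anscombe` data set.

```r
# Assumption: data set I matches columns x1 and y1 of the built-in `anscombe` data.
D1 <- data.frame(X = anscombe$x1, Y = anscombe$y1)

# Least-squares estimates for simple linear regression
slope     <- cov(D1$X, D1$Y) / var(D1$X)
intercept <- mean(D1$Y) - slope * mean(D1$X)

# Residual = observed Y - fitted Y
res_by_hand <- D1$Y - (intercept + slope * D1$X)

# The same quantities from lm()
fit <- lm(Y ~ X, data = D1)

print(c(intercept = intercept, slope = slope))
print(coef(fit))
```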

Let us now use the linear regression models, plot the scatter plots, and the corresponding residual plots.

Fit_D1 <- lm(Y~X,data=D1)

P1 <- ggplot(data=D1,aes(x=X, y=Y)) +
  geom_point(pch=16,color="blue",size=4) +
  geom_smooth(method="lm",color="red") +
  labs(title="D1 data scatter plot", x="X", y="Y")

R1 <- ggplot(data=D1,aes(x=X, y=residuals(Fit_D1))) +
  geom_point(pch=16,color="blue",size=4) +
  labs(title="D1 residual plot", x="X", y="Residual")

Fit_D2 <- lm(Y~X,data=D2)
  
P2 <- ggplot(data=D2,aes(x=X, y=Y)) +
  geom_point(pch=16,color="blue",size=4) +
  geom_smooth(method="lm",color="red") +
  labs(title="D2 data scatter plot", x="X", y="Y")

R2 <- ggplot(data=D2,aes(x=X, y=residuals(Fit_D2))) +
  geom_point(pch=16,color="blue",size=4) +
  labs(title="D2 residual plot", x="X", y="Residual")

Fit_D3 <- lm(Y~X,data=D3)
 
P3 <- ggplot(data=D3,aes(x=X, y=Y)) +
  geom_point(pch=16,color="blue",size=4) +
  geom_smooth(method="lm",color="red") +
  labs(title="D3 data scatter plot", x="X", y="Y")

R3 <- ggplot(data=D3,aes(x=X, y=residuals(Fit_D3))) +
  geom_point(pch=16,color="blue",size=4) +
  labs(title="D3 residual plot", x="X", y="Residual")
 

Fit_D4 <- lm(Y~X,data=D4)

P4 <- ggplot(data=D4,aes(x=X, y=Y)) +
  geom_point(pch=16,color="blue",size=4) +
  geom_smooth(method="lm",color="red") +
  labs(title="D4 data scatter plot", x="X", y="Y")

R4 <- ggplot(data=D4,aes(x=X, y=residuals(Fit_D4))) +
  geom_point(pch=16,color="blue",size=4) +
  labs(title="D4 residual plot", x="X", y="Residual")



grid.arrange(P1, R1, ncol=2)

summary(Fit_D1)

## 
## Call:
## lm(formula = Y ~ X, data = D1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## X             0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

grid.arrange(P2, R2, ncol=2)

summary(Fit_D2)

## 
## Call:
## lm(formula = Y ~ X, data = D2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## X              0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

grid.arrange(P3, R3, ncol=2)

summary(Fit_D3)

## 
## Call:
## lm(formula = Y ~ X, data = D3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## X             0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

grid.arrange(P4, R4, ncol=2)

summary(Fit_D4)

## 
## Call:
## lm(formula = Y ~ X, data = D4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## X             0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165

Figure-5: Linear regression models, Residual plots, Summary of the models

Looking at the residual plots, all of them except that of D1 show a specific pattern, which suggests that a different (for example, non-linear) regression model may be needed for D2, D3 and D4. Note, however, that a residual plot may also show a pattern because of outliers.

While the residual plot of D1 does not show any pattern, the R-Squared value of the D1 linear model is approximately 66%, which is fairly low. So we will check whether any other regression model on D1 can improve the R-Squared value.
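The four model summaries can be placed side by side to show how similar the fitted lines are even though the residual plots differ. This is a minimal sketch, assuming the four data sets match the built-in `anscombe` columns; `quartet`, `fits` and `coef_table` are illustrative names.

```r
# Assumption: the four data sets match the built-in `anscombe` columns.
quartet <- list(
  D1 = data.frame(X = anscombe$x1, Y = anscombe$y1),
  D2 = data.frame(X = anscombe$x2, Y = anscombe$y2),
  D3 = data.frame(X = anscombe$x3, Y = anscombe$y3),
  D4 = data.frame(X = anscombe$x4, Y = anscombe$y4)
)

# Fit the same linear model to every data frame
fits <- lapply(quartet, function(d) lm(Y ~ X, data = d))

# One row per data frame: intercept, slope and R-squared
coef_table <- t(sapply(fits, function(f) {
  c(coef(f), r_squared = summary(f)$r.squared)
}))

print(round(coef_table, 4))
```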

Let us find the R-Squared value for the following transformations of Y and X (in D1): \[\log(Y), \log(X)\] \[Y, X^2\] \[Y, 1/X\]

Fit_D1 <- lm(log(D1$Y) ~ I(log(D1$X)))
summary(Fit_D1)

## 
## Call:
## lm(formula = log(D1$Y) ~ I(log(D1$X)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.297173 -0.053124  0.000695  0.111994  0.202071 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    0.7036     0.2793   2.519  0.03283 * 
## I(log(D1$X))   0.5994     0.1292   4.640  0.00122 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1662 on 9 degrees of freedom
## Multiple R-squared:  0.7052, Adjusted R-squared:  0.6724 
## F-statistic: 21.53 on 1 and 9 DF,  p-value: 0.00122

Fit_D1 <- lm(D1$Y ~ I(D1$X^2))
summary(Fit_D1)

## 
## Call:
## lm(formula = D1$Y ~ I(D1$X^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9790 -0.7867  0.0375  0.7460  1.9406 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.099749   0.748383   6.814 7.78e-05 ***
## I(D1$X^2)   0.026386   0.006949   3.797  0.00424 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.328 on 9 degrees of freedom
## Multiple R-squared:  0.6157, Adjusted R-squared:  0.573 
## F-statistic: 14.42 on 1 and 9 DF,  p-value: 0.004235

Fit_D1 <- lm(D1$Y ~ I(1/D1$X))
summary(Fit_D1)

## 
## Call:
## lm(formula = D1$Y ~ I(1/D1$X))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2744 -0.4855  0.2536  0.7848  2.0082 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.2643     0.9733  11.573 1.05e-06 ***
## I(1/D1$X)   -29.1896     6.9636  -4.192  0.00234 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.246 on 9 degrees of freedom
## Multiple R-squared:  0.6613, Adjusted R-squared:  0.6236 
## F-statistic: 17.57 on 1 and 9 DF,  p-value: 0.002335

Among these transformations, the log(Y) versus log(X) model has the greatest R-Squared value, 70.52%. But the plain linear regression of Y on X has an R-Squared value of about 66%, almost the same as the log transformation. So we will use the linear regression obtained in Figure-5 to model the D1 data. The regression line is given below (see the summary of Fit_D1 in Figure-5 for more information): \[Y=0.5001X + 3.0001\]
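The comparison of candidate models can be written more compactly by keeping the formulas in a list and passing the data frame through the `data` argument. This is a sketch under the assumption that data set I matches columns `x1` and `y1` of the built-in `anscombe` data set; `models` and `r2` are illustrative names. One caveat: the R-Squared of the log(Y) model is measured on the log scale, so it is not strictly comparable with the models that use Y directly.

```r
# Assumption: data set I matches columns x1 and y1 of the built-in `anscombe` data.
D1 <- data.frame(X = anscombe$x1, Y = anscombe$y1)

# Candidate models, including the plain linear fit for reference
models <- list(
  linear     = Y ~ X,
  log_log    = log(Y) ~ log(X),
  x_squared  = Y ~ I(X^2),
  reciprocal = Y ~ I(1 / X)
)

# R-squared of each model fitted to D1
r2 <- sapply(models, function(f) summary(lm(f, data = D1))$r.squared)

print(round(r2, 4))
```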

Fitting regression line for D2

The scatter plot of D2 (in Figure-5) shows a curved relationship between X and Y: the shape of the scatter plot looks like a parabola, which suggests a quadratic rather than a linear relationship. The residual plot of D2 also has a specific pattern, so the linear regression model is not appropriate for D2 data. Let us fit a polynomial regression model of degree 2 to find the relationship between the X and Y variables in the D2 data frame.

Fit_D2 <- lm(Y ~ poly(X, 2, raw=TRUE),data=D2)
P2 <- ggplot(data=D2,aes(x=X, y=Y)) +
     geom_point(pch=16,color="blue",size=4) +
     geom_smooth(method="lm",color="red") +
     # Overlay the degree-2 polynomial (same degree as Fit_D2) in black
     stat_smooth(method="lm", se=TRUE, fill=NA, formula=y ~ poly(x, 2, raw=TRUE),colour="black") +
     labs(title="D2 data scatter plot", x="X", y="Y")

R2 <- ggplot(data=D2,aes(x=X, y=residuals(Fit_D2))) +
  geom_point(pch=16,color="blue",size=4) +
  labs(title="D2 residual plot", x="X", y="Residual")

grid.arrange(P2, R2, ncol=2)

Figure-6: Polynomial regression of degree 2, to model D2 Data

The regression equation for D2 is given by printing Fit_D2

print(Fit_D2)

## 
## Call:
## lm(formula = Y ~ poly(X, 2, raw = TRUE), data = D2)
## 
## Coefficients:
##             (Intercept)  poly(X, 2, raw = TRUE)1  poly(X, 2, raw = TRUE)2  
##                 -5.9957                   2.7808                  -0.1267

The regression function for D2 is obtained as \[Y=-0.1267X^2+2.7808X-5.9957\]

The output of summary(Fit_D2) shows that the R-Squared value for the above polynomial function rounds to 1, which indicates an almost perfect fit.

summary(Fit_D2)

## 
## Call:
## lm(formula = Y ~ poly(X, 2, raw = TRUE), data = D2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0013287 -0.0011888 -0.0006294  0.0008741  0.0023776 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -5.9957343  0.0043299   -1385   <2e-16 ***
## poly(X, 2, raw = TRUE)1  2.7808392  0.0010401    2674   <2e-16 ***
## poly(X, 2, raw = TRUE)2 -0.1267133  0.0000571   -2219   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001672 on 8 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.378e+06 on 2 and 8 DF,  p-value: < 2.2e-16
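Since the linear model is nested inside the quadratic one, the improvement can also be checked with anova(), which tests whether the extra squared term reduces the residual sum of squares by more than chance would explain. This is a sketch, assuming data set II matches columns `x2` and `y2` of the built-in `anscombe` data set; `fit_lin`, `fit_quad` and `cmp` are illustrative names.

```r
# Assumption: data set II matches columns x2 and y2 of the built-in `anscombe` data.
D2 <- data.frame(X = anscombe$x2, Y = anscombe$y2)

fit_lin  <- lm(Y ~ X, data = D2)
fit_quad <- lm(Y ~ poly(X, 2, raw = TRUE), data = D2)

# F-test comparing the nested models (row 2 refers to the quadratic model)
cmp <- anova(fit_lin, fit_quad)

print(cmp)
print(summary(fit_quad)$r.squared)
```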

Fitting regression line for D3

In Figure-5, the scatter plot of D3 suggests that a linear regression model should fit the data. But the residual plot of D3 shows a specific pattern, which suggests that the fitted linear model is not appropriate. However, the D3 scatter plot contains an outlier (Y = 12.74). So we will first remove this observation from the D3 data frame, and then plot the linear regression line again along with the new residual plot. If the new residual plot shows no pattern, we can consider the linear regression model obtained after removing the outlier to be appropriate for the D3 data.

# Keep every row except the outlier (a logical filter is safe even when nothing matches)
D3 <- D3[D3$Y != 12.74, ]

Fit_D3 <- lm(Y~X,data=D3)
 
P3 <- ggplot(data=D3,aes(x=X, y=Y)) +
  geom_point(pch=16,color="blue",size=4) +
  geom_smooth(method="lm",color="red") +
  labs(title="D3 data scatter plot", x="X", y="Y")

R3 <- ggplot(data=D3,aes(x=X, y=residuals(Fit_D3))) +
  geom_point(pch=16,color="blue",size=4) +
  labs(title="D3 residual plot", x="X", y="Residual")

grid.arrange(P3, R3, ncol=2)

summary(Fit_D3)

## 
## Call:
## lm(formula = Y ~ X, data = D3)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0041558 -0.0022240  0.0000649  0.0018182  0.0050649 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.0056494  0.0029242    1370   <2e-16 ***
## X           0.3453896  0.0003206    1077   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003082 on 8 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.161e+06 on 1 and 8 DF,  p-value: < 2.2e-16

print(Fit_D3)

## 
## Call:
## lm(formula = Y ~ X, data = D3)
## 
## Coefficients:
## (Intercept)            X  
##      4.0056       0.3454

Figure-7: Modified scatter plot, Linear Regression line and Residual plot of D3 data (after removing the outlier of D3$Y)

The above plots show that, after removing the outlier Y = 12.74 from the D3 data frame, the linear model looks appropriate for the D3 data, since there is no specific pattern in the residual plot. The R-Squared value of Fit_D3 (the new linear model) also rounds to 1. Hence the following model is appropriate for the D3 data.

The regression function to model D3 data is given below \[Y=0.3454X + 4.0056\]
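The outlier in D3 was identified visually; an influence measure such as Cook's distance gives a numerical check. The sketch below fits the line to the full data set III and finds the observation with the largest Cook's distance. It is an illustration rather than part of the original analysis, and it assumes data set III matches columns `x3` and `y3` of the built-in `anscombe` data set; `D3_full` is an illustrative name for the data before the outlier is removed.

```r
# Assumption: data set III matches columns x3 and y3 of the built-in `anscombe` data.
D3_full <- data.frame(X = anscombe$x3, Y = anscombe$y3)

fit_full <- lm(Y ~ X, data = D3_full)

# Cook's distance measures how much each observation shifts the fitted line
cd <- cooks.distance(fit_full)
most_influential <- unname(which.max(cd))

print(round(cd, 3))
print(D3_full[most_influential, ])
```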

Fitting regression line for D4

For the D4 data frame also, we see an outlier at X=19. So we will remove this observation and check whether the resulting residual plot for the linear regression line is free of any specific pattern. We will also check whether the R-Squared value of the new linear model, after removing the outlier (X=19), is high.

# Keep every row except the outlier (a logical filter is safe even when nothing matches)
D4 <- D4[D4$X != 19, ]


Fit_D4 <- lm(Y~X,data=D4)

P4 <- ggplot(data=D4,aes(x=X, y=Y)) +
  geom_point(pch=16,color="blue",size=4) +
  geom_smooth(method="lm",color="red") +
  labs(title="D4 data scatter plot", x="X", y="Y")

R4 <- ggplot(data=D4,aes(x=X, y=residuals(Fit_D4))) +
  geom_point(pch=16,color="blue",size=4) +
  labs(title="D4 residual plot", x="X", y="Residual")



grid.arrange(P4, R4, ncol=2)

Figure-8: Modified scatter plot, Linear Regression line and Residual plot of D4 data (after removing the outlier of D4$X)

For D4, the residual plot also shows a specific pattern, even after removing the outlier X=19. Once the outlier is removed, every remaining observation has X=8, so the standard deviation of X is 0 and neither a correlation between X and Y nor a slope of Y on X can be computed. Hence the remaining observations can only be described as X=8: Y can assume any value, but X remains the same. If we do not remove the outlier, we get the following linear model for D4 (see Figure-5):

\[Y=0.4999X+3.0017\]
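Both points about D4 can be checked numerically. The sketch below shows that the X=19 observation has a leverage (hat value) of 1, meaning the fitted line is forced through it, and that lm() cannot estimate a slope once that observation is removed. It assumes data set IV matches columns `x4` and `y4` of the built-in `anscombe` data set; `D4_full` and `D4_trim` are illustrative names.

```r
# Assumption: data set IV matches columns x4 and y4 of the built-in `anscombe` data.
D4_full <- data.frame(X = anscombe$x4, Y = anscombe$y4)

# Leverage of each observation in the full linear fit
lev <- hatvalues(lm(Y ~ X, data = D4_full))

# Remove the X = 19 observation and refit
D4_trim  <- D4_full[D4_full$X != 19, ]
fit_trim <- lm(Y ~ X, data = D4_trim)

print(round(lev, 3))
print(sd(D4_trim$X))   # no variation left in X
print(coef(fit_trim))  # slope is NA; intercept equals mean(Y)
```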

In summary, here are the best models to model the I, II, III, and IV data sets:

Data set I: \[Y=0.5001X + 3.0001\]

Data set II: \[Y=-0.1267X^2+2.7808X-5.9957\]

Data set III (After eliminating the outlier): \[Y=0.3454X + 4.0056\]

Data set IV (After eliminating the outlier): \[X = 8\]

Data set IV (by including the outlier): \[Y=0.4999X+3.0017\]

For the data set IV, unless we have some more observations we cannot determine which of the two models is optimal.

MSDA 607 - Project 2 - Exploratory Analysis

Sekhar Mekala

Sunday, March 08, 2015

End of Project Report