Data Setup

To begin, the file containing the four datasets was loaded into the RStudio environment using read.table. The columns were then grouped in two ways: by dataset (each x paired with its y) and by variable (all x columns together and all y columns together).

ds <- read.table("C:/Users/Public/proj2.csv", header = TRUE, sep = ",")
ds
##    xI    yI xII  yII xIII  yIII xIV   yIV
## 1  10  8.04  10 9.14   10  7.46   8  6.58
## 2   8  6.95   8 8.14    8  6.77   8  5.76
## 3  13  7.58  13 8.74   13 12.74   8  7.71
## 4   9  8.81   9 8.77    9  7.11   8  8.84
## 5  11  8.33  11 9.26   11  7.81   8  8.47
## 6  14  9.96  14 8.10   14  8.84   8  7.04
## 7   6  7.24   6 6.13    6  6.08   8  5.25
## 8   4  4.26   4 3.10    4  5.39  19 12.50
## 9  12 10.84  12 9.13   12  8.15   8  5.56
## 10  7  4.82   7 7.26    7  6.42   8  7.91
## 11  5  5.68   5 4.74    5  5.73   8  6.89
attach(ds)

#define four individual datasets
I <- cbind(xI,yI)
II <- cbind(xII,yII)
III <- cbind(xIII,yIII)
IV <- cbind(xIV,yIV)

#combine columns
xall <- cbind(xI, xII, xIII, xIV)
yall <- cbind(yI, yII, yIII, yIV)

Exploratory Analysis

1. Descriptive statistics

#summary
summary(ds)
##        xI             yI              xII            yII       
##  Min.   : 4.0   Min.   : 4.260   Min.   : 4.0   Min.   :3.100  
##  1st Qu.: 6.5   1st Qu.: 6.315   1st Qu.: 6.5   1st Qu.:6.695  
##  Median : 9.0   Median : 7.580   Median : 9.0   Median :8.140  
##  Mean   : 9.0   Mean   : 7.501   Mean   : 9.0   Mean   :7.501  
##  3rd Qu.:11.5   3rd Qu.: 8.570   3rd Qu.:11.5   3rd Qu.:8.950  
##  Max.   :14.0   Max.   :10.840   Max.   :14.0   Max.   :9.260  
##       xIII           yIII            xIV          yIV        
##  Min.   : 4.0   Min.   : 5.39   Min.   : 8   Min.   : 5.250  
##  1st Qu.: 6.5   1st Qu.: 6.25   1st Qu.: 8   1st Qu.: 6.170  
##  Median : 9.0   Median : 7.11   Median : 8   Median : 7.040  
##  Mean   : 9.0   Mean   : 7.50   Mean   : 9   Mean   : 7.501  
##  3rd Qu.:11.5   3rd Qu.: 7.98   3rd Qu.: 8   3rd Qu.: 8.190  
##  Max.   :14.0   Max.   :12.74   Max.   :19   Max.   :12.500
#correlation
cor(xI,yI)
## [1] 0.8164205
cor(xII,yII)
## [1] 0.8162365
cor(xIII,yIII)
## [1] 0.8162867
cor(xIV,yIV)
## [1] 0.8165214
#covariance of each x-y pair (var(x, y) with two arguments returns the covariance, not the variance)
var(xI,yI)
## [1] 5.501
var(xII,yII)
## [1] 5.5
var(xIII,yIII)
## [1] 5.497
var(xIV,yIV)
## [1] 5.499
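
The two-argument var(x, y) calls above return covariances, while the Results section quotes per-variable variances. The following is a minimal sketch of how those variances could be obtained; it rebuilds the data frame from the values printed in Data Setup so that it runs on its own, and the names x, variances and covariances are introduced here for illustration only.

```r
# Rebuild the data frame from the values printed in Data Setup.
x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)  # shared by sets I, II and III
ds <- data.frame(
  xI   = x, yI   = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68),
  xII  = x, yII  = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74),
  xIII = x, yIII = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
  xIV  = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
  yIV  = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)
)

# One-argument var() gives the sample variance of a single column.
variances <- sapply(ds, var)

# cov(x, y) is the explicit spelling of the two-argument var(x, y).
covariances <- c(
  I   = cov(ds$xI, ds$yI),
  II  = cov(ds$xII, ds$yII),
  III = cov(ds$xIII, ds$yIII),
  IV  = cov(ds$xIV, ds$yIV)
)

print(round(variances, 3))
print(round(covariances, 3))
```

Using cov() for the paired calls and one-argument var() for the per-column values keeps the two quantities from being confused in the output.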

2. Regression

## 
## Call:
## lm(formula = yI ~ xI, data = ds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## xI            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
## 
## Call:
## lm(formula = yII ~ xII, data = ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## xII            0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
## 
## Call:
## lm(formula = yIII ~ xIII, data = ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## xIII          0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
## 
## Call:
## lm(formula = yIV ~ xIV, data = ds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## xIV           0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165
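
The code that produced the four summaries above is not shown in the report. A minimal sketch of one way to fit the same four models, assuming the column names from Data Setup, is given below; the helper names sets, fits and coefs are hypothetical, and the data frame is rebuilt from the printed values so that the block runs on its own.

```r
# Rebuild the data frame from the values printed in Data Setup.
x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)  # shared by sets I, II and III
ds <- data.frame(
  xI   = x, yI   = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68),
  xII  = x, yII  = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74),
  xIII = x, yIII = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
  xIV  = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
  yIV  = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)
)

# Predictor and response column for each of the four datasets.
sets <- list(
  I   = c("xI", "yI"),
  II  = c("xII", "yII"),
  III = c("xIII", "yIII"),
  IV  = c("xIV", "yIV")
)

# Fit y ~ x for each dataset; reformulate() builds the formula from the names.
fits <- lapply(sets, function(p) lm(reformulate(p[1], response = p[2]), data = ds))

# Collect intercept and slope into one small table, one row per dataset.
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
print(round(coefs, 3))

# The full summary for a single fit, in the same form as the output above.
print(summary(fits$I))
```

Collecting the coefficients into one table makes it easy to see at a glance that the four fitted lines are nearly identical.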

3. Visualization

## Loading required package: ggplot2

Scatterplots

## Loading required package: grid

Boxplots
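
The plotting code for this step is not included in the report. As a rough sketch only, side-by-side boxplots of all eight columns could be drawn with base graphics as follows; the original figure may have been produced differently (ggplot2 is loaded above), and the name bp is introduced here for illustration.

```r
# Rebuild the data frame from the values printed in Data Setup.
x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)  # shared by sets I, II and III
ds <- data.frame(
  xI   = x, yI   = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68),
  xII  = x, yII  = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74),
  xIII = x, yIII = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
  xIV  = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
  yIV  = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)
)

# One box per column; boxplot() also returns the statistics it drew.
bp <- boxplot(ds, main = "Boxplots of all eight variables", ylab = "Value")

# Points drawn beyond the whiskers, with the column each belongs to.
print(data.frame(column = bp$names[bp$group], value = bp$out))
```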

Best fit line
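
The code for the scatterplots and best fit lines is not included in the report. The sketch below shows one possible way to draw the four scatterplots with their least-squares lines in a 2 x 2 grid; it uses base graphics as a stand-in for the ggplot2 and grid packages loaded above, and the names sets and slopes are hypothetical.

```r
# Rebuild the data frame from the values printed in Data Setup.
x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)  # shared by sets I, II and III
ds <- data.frame(
  xI   = x, yI   = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68),
  xII  = x, yII  = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74),
  xIII = x, yIII = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
  xIV  = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
  yIV  = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)
)

# Predictor and response column for each of the four datasets.
sets <- list(
  I   = c("xI", "yI"),
  II  = c("xII", "yII"),
  III = c("xIII", "yIII"),
  IV  = c("xIV", "yIV")
)

# Draw the four panels on common axes so the shapes can be compared directly.
op <- par(mfrow = c(2, 2))
slopes <- sapply(names(sets), function(s) {
  px <- ds[[sets[[s]][1]]]
  py <- ds[[sets[[s]][2]]]
  plot(px, py, main = paste("Dataset", s), xlab = "x", ylab = "y",
       xlim = c(3, 20), ylim = c(2, 14), pch = 19)
  fit <- lm(py ~ px)
  abline(fit, col = "red")   # least-squares line for this panel
  unname(coef(fit)[2])       # keep the slope for reference
})
par(op)  # restore the previous graphics settings

print(round(slopes, 3))
```

Fixing xlim and ylim across panels is a deliberate choice here: with shared axes the fitted lines coincide visually while the point patterns clearly differ.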

Violin plots (Must have “vioplot” package installed.)

## Warning: package 'vioplot' was built under R version 3.1.3
## Loading required package: sm
## Warning: package 'sm' was built under R version 3.1.3
## Package 'sm', version 2.2-5.4: type help(sm) for summary information

Parallel plot (Must have “lattice” package installed.)

## Loading required package: lattice

Association plot

Results

Each of the four datasets has the same or nearly the same mean, variance, correlation, and linear regression line: the mean of x is 9, the variance of x is 11, the mean of y is 7.5, the variance of y is about 4.1, the correlation between x and y is 0.816, and the linear regression line is y = 3 + 0.5x.

Conclusions

The four datasets have nearly identical descriptive statistics, but the plots show that each has a different distribution. The plots also reveal outlying points in datasets III and IV. This exercise demonstrates how important it is to visualize the data during exploratory data analysis, because relying on descriptive statistics alone may lead to misleading conclusions.