IS607 - Project 2

Exploratory Analysis

1. Descriptive statistics

#summary
summary(ds)

##        xI             yI              xII            yII       
##  Min.   : 4.0   Min.   : 4.260   Min.   : 4.0   Min.   :3.100  
##  1st Qu.: 6.5   1st Qu.: 6.315   1st Qu.: 6.5   1st Qu.:6.695  
##  Median : 9.0   Median : 7.580   Median : 9.0   Median :8.140  
##  Mean   : 9.0   Mean   : 7.501   Mean   : 9.0   Mean   :7.501  
##  3rd Qu.:11.5   3rd Qu.: 8.570   3rd Qu.:11.5   3rd Qu.:8.950  
##  Max.   :14.0   Max.   :10.840   Max.   :14.0   Max.   :9.260  
##       xIII           yIII            xIV          yIV        
##  Min.   : 4.0   Min.   : 5.39   Min.   : 8   Min.   : 5.250  
##  1st Qu.: 6.5   1st Qu.: 6.25   1st Qu.: 8   1st Qu.: 6.170  
##  Median : 9.0   Median : 7.11   Median : 8   Median : 7.040  
##  Mean   : 9.0   Mean   : 7.50   Mean   : 9   Mean   : 7.501  
##  3rd Qu.:11.5   3rd Qu.: 7.98   3rd Qu.: 8   3rd Qu.: 8.190  
##  Max.   :14.0   Max.   :12.74   Max.   :19   Max.   :12.500

#correlation
cor(xI,yI)

## [1] 0.8164205

cor(xII,yII)

## [1] 0.8162365

cor(xIII,yIII)

## [1] 0.8162867

cor(xIV,yIV)

## [1] 0.8165214

#covariance (var() called with two arguments returns the covariance of x and y)
var(xI,yI)

## [1] 5.501

var(xII,yII)

## [1] 5.5

var(xIII,yIII)

## [1] 5.497

var(xIV,yIV)

## [1] 5.499
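The two-argument calls above report the covariance of each x/y pair; the variance of each variable on its own comes from the one-argument form, var(x). A minimal sketch for tabulating all of these figures side by side follows. It assumes R's built-in anscombe data frame (columns x1 to x4 and y1 to y4) as a stand-in for ds, whose setup code is not shown in this section; the name stats is illustrative only.

```r
# Assumption: the built-in `anscombe` data frame stands in for the report's `ds`.
data(anscombe)

# One row of descriptive statistics per dataset (I to IV).
stats <- t(sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), var_x = var(x),        # one-argument var(): variance
    mean_y = mean(y), var_y = var(y),
    cov_xy = cov(x, y),                      # same value as var(x, y)
    cor_xy = cor(x, y))
}))
rownames(stats) <- c("I", "II", "III", "IV")

print(round(stats, 3))
```

Laying the statistics out as one row per dataset makes it easy to see how little they differ across the four sets.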

2. Regression

## 
## Call:
## lm(formula = yI ~ xI, data = ds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## xI            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

## 
## Call:
## lm(formula = yII ~ xII, data = ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## xII            0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

## 
## Call:
## lm(formula = yIII ~ xIII, data = ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## xIII          0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

## 
## Call:
## lm(formula = yIV ~ xIV, data = ds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## xIV           0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165
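The code that produced the four summaries above is not shown. One way the fits could be generated and compared is sketched below, again assuming the built-in anscombe data frame in place of ds; fit_one, models, coefs and r2 are hypothetical names introduced for illustration.

```r
# Assumption: the built-in `anscombe` data frame stands in for the report's `ds`.
data(anscombe)

# Fit y_i ~ x_i for dataset i (hypothetical helper, not from the report).
fit_one <- function(i) {
  f <- reformulate(paste0("x", i), response = paste0("y", i))
  lm(f, data = anscombe)
}

models <- lapply(1:4, fit_one)

# Collect intercept, slope and R-squared into one comparison table.
coefs <- t(sapply(models, coef))
dimnames(coefs) <- list(c("I", "II", "III", "IV"), c("intercept", "slope"))
r2 <- sapply(models, function(m) summary(m)$r.squared)

print(round(cbind(coefs, r.squared = r2), 4))
```

Calling summary(models[[1]]) should print a full report of the same kind as the first block above, with x1 and y1 in place of xI and yI.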

3. Visualization

## Loading required package: ggplot2

Scatterplots

## Loading required package: grid

Boxplots

Best fit line
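The plotting code is not included in this section. As a rough sketch only, scatterplots with a fitted line for each dataset could be drawn with base R graphics as follows. The report itself loads ggplot2, so this is a swapped-in technique rather than the authors' code, and it assumes the built-in anscombe data frame in place of ds; labels and fits are illustrative names.

```r
# Assumption: the built-in `anscombe` data frame stands in for the report's `ds`.
# Base graphics are used here for self-containment; the report uses ggplot2.
data(anscombe)

labels <- c("I", "II", "III", "IV")
fits <- vector("list", 4)

op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))  # 2 x 2 grid of panels
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fits[[i]] <- lm(y ~ x)
  # Shared axis limits so the four panels are directly comparable.
  plot(x, y, xlim = c(3, 20), ylim = c(2, 14), pch = 19,
       main = paste("Dataset", labels[i]))
  abline(fits[[i]], col = "red")                 # least-squares line
}
par(op)                                          # restore graphics settings
```

Using the same axis limits in every panel keeps the nearly identical regression lines visually aligned while the point patterns differ.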

Violin plots (Must have “vioplot” package installed.)

## Warning: package 'vioplot' was built under R version 3.1.3

## Loading required package: sm

## Warning: package 'sm' was built under R version 3.1.3

## Package 'sm', version 2.2-5.4: type help(sm) for summary information

Parallel plot (Must have “lattice” package installed.)

## Loading required package: lattice

Association plot

Results

Each of the four datasets has the same or nearly the same mean, variance, correlation and linear regression line: the mean of x is 9, the variance of x is 11, the mean of y is about 7.5, the variance of y is about 4.1, the covariance of x and y is about 5.5, the correlation between x and y is about 0.816, and the fitted regression line is y = 3 + 0.5x.

Conclusions

The four datasets have nearly identical descriptive statistics, but the plots show that their distributions are quite different. The plots also reveal outlying points in datasets III and IV. This exercise demonstrates how important it is to visualize the data during exploratory data analysis: relying on descriptive statistics alone can lead to misleading conclusions.
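The unusual points in datasets III and IV can also be located numerically. The sketch below is one possible check, not part of the original analysis: it assumes the built-in anscombe data frame in place of ds, and it uses the largest absolute residual (dataset III) and the largest leverage, or hat value (dataset IV), as the criteria. The names fit3, fit4, i3 and i4 are illustrative.

```r
# Assumption: the built-in `anscombe` data frame stands in for the report's `ds`.
data(anscombe)

fit3 <- lm(y3 ~ x3, data = anscombe)
fit4 <- lm(y4 ~ x4, data = anscombe)

# Dataset III: the observation furthest from the fitted line.
i3 <- unname(which.max(abs(resid(fit3))))
print(anscombe[i3, c("x3", "y3")])

# Dataset IV: the observation with the highest leverage on the fit.
i4 <- unname(which.max(hatvalues(fit4)))
print(anscombe[i4, c("x4", "y4")])
print(max(hatvalues(fit4)))
```

In dataset IV every x value but one is 8, so the single point at x = 19 determines the slope on its own; its residual is zero, which is why leverage rather than residual size is used to find it.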

Sonya Hong, Honey Berk

Sunday, March 15, 2015