To begin, the four datasets were loaded into the RStudio environment using read.table. The data were then grouped in two ways: by dataset (one x-y pair per dataset) and by variable (all x columns together and all y columns together).
ds <- read.table("C:/Users/Public/proj2.csv", header = TRUE, sep = ",")
ds
##    xI    yI xII  yII xIII  yIII xIV   yIV
## 1  10  8.04  10 9.14   10  7.46   8  6.58
## 2   8  6.95   8 8.14    8  6.77   8  5.76
## 3  13  7.58  13 8.74   13 12.74   8  7.71
## 4   9  8.81   9 8.77    9  7.11   8  8.84
## 5  11  8.33  11 9.26   11  7.81   8  8.47
## 6  14  9.96  14 8.10   14  8.84   8  7.04
## 7   6  7.24   6 6.13    6  6.08   8  5.25
## 8   4  4.26   4 3.10    4  5.39  19 12.50
## 9  12 10.84  12 9.13   12  8.15   8  5.56
## 10  7  4.82   7 7.26    7  6.42   8  7.91
## 11  5  5.68   5 4.74    5  5.73   8  6.89
attach(ds)
#define four individual datasets
I <- cbind(xI,yI)
II <- cbind(xII,yII)
III <- cbind(xIII,yIII)
IV <- cbind(xIV,yIV)
#combine columns
xall <- cbind(xI, xII, xIII, xIV)
yall <- cbind(yI, yII, yIII, yIV)
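As a minimal sketch of an alternative that avoids attach(), the four datasets can be kept in a named list. The CSV path above is specific to the author's machine, so this sketch assumes R's built-in `anscombe` data frame as a stand-in; it appears to hold the same values under the column names x1..x4 and y1..y4. The names `ds_demo`, `roman` and `sets` are illustrative and not part of the original analysis.

```r
# Stand-in for the CSV file: R's built-in Anscombe data (an assumption).
ds_demo <- datasets::anscombe

# Rename x1..x4 / y1..y4 to the xI..xIV / yI..yIV names used in this report.
roman <- c("I", "II", "III", "IV")
names(ds_demo) <- c(paste0("x", roman), paste0("y", roman))

# Build one two-column data frame per dataset, without attach().
sets <- lapply(roman, function(r) {
  data.frame(x = ds_demo[[paste0("x", r)]],
             y = ds_demo[[paste0("y", r)]])
})
names(sets) <- roman

# Inspect the structure of one dataset.
str(sets$IV)
```

Keeping the pairs in a list makes it possible to apply the same function to every dataset with lapply() or sapply() rather than repeating each call four times.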
1. Descriptive statistics
#summary
summary(ds)
## xI yI xII yII
## Min. : 4.0 Min. : 4.260 Min. : 4.0 Min. :3.100
## 1st Qu.: 6.5 1st Qu.: 6.315 1st Qu.: 6.5 1st Qu.:6.695
## Median : 9.0 Median : 7.580 Median : 9.0 Median :8.140
## Mean : 9.0 Mean : 7.501 Mean : 9.0 Mean :7.501
## 3rd Qu.:11.5 3rd Qu.: 8.570 3rd Qu.:11.5 3rd Qu.:8.950
## Max. :14.0 Max. :10.840 Max. :14.0 Max. :9.260
## xIII yIII xIV yIV
## Min. : 4.0 Min. : 5.39 Min. : 8 Min. : 5.250
## 1st Qu.: 6.5 1st Qu.: 6.25 1st Qu.: 8 1st Qu.: 6.170
## Median : 9.0 Median : 7.11 Median : 8 Median : 7.040
## Mean : 9.0 Mean : 7.50 Mean : 9 Mean : 7.501
## 3rd Qu.:11.5 3rd Qu.: 7.98 3rd Qu.: 8 3rd Qu.: 8.190
## Max. :14.0 Max. :12.74 Max. :19 Max. :12.500
#correlation
cor(xI,yI)
## [1] 0.8164205
cor(xII,yII)
## [1] 0.8162365
cor(xIII,yIII)
## [1] 0.8162867
cor(xIV,yIV)
## [1] 0.8165214
#covariance (var() called with two arguments returns the covariance of x and y)
var(xI,yI)
## [1] 5.501
var(xII,yII)
## [1] 5.5
var(xIII,yIII)
## [1] 5.497
var(xIV,yIV)
## [1] 5.499
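In R, var(x, y) with two arguments returns the covariance of x and y, so the four values above are covariances. The per-variable variances quoted in the Results can be obtained with single-argument calls. The following is a minimal sketch, assuming R's built-in `anscombe` data frame (columns x1..x4, y1..y4) as a stand-in for the CSV file; the name `stats` is illustrative.

```r
# Stand-in for the CSV file: R's built-in Anscombe data (an assumption).
a <- datasets::anscombe

# For each dataset, compute var(x), var(y) and cov(x, y).
stats <- sapply(1:4, function(i) {
  x <- a[[paste0("x", i)]]
  y <- a[[paste0("y", i)]]
  c(var_x = var(x), var_y = var(y), cov_xy = cov(x, y))
})
colnames(stats) <- c("I", "II", "III", "IV")

# One column per dataset, one row per statistic.
round(stats, 3)
```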
2. Regression
##
## Call:
## lm(formula = yI ~ xI, data = ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## xI 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
##
## Call:
## lm(formula = yII ~ xII, data = ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9009 -0.7609 0.1291 0.9491 1.2691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.667 0.02576 *
## xII 0.500 0.118 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
##
## Call:
## lm(formula = yIII ~ xIII, data = ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1586 -0.6146 -0.2303 0.1540 3.2411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0025 1.1245 2.670 0.02562 *
## xIII 0.4997 0.1179 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
##
## Call:
## lm(formula = yIV ~ xIV, data = ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017 1.1239 2.671 0.02559 *
## xIV 0.4999 0.1178 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
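The code that produced the four summaries above is not shown in the report. A sketch of how such fits could be generated follows; it assumes R's built-in `anscombe` data frame (columns x1..x4, y1..y4) as a stand-in for the CSV file, and the names `fits` and `coefs` are illustrative rather than the author's.

```r
# Stand-in for the CSV file: R's built-in Anscombe data (an assumption).
a <- datasets::anscombe

# Fit y_i ~ x_i by ordinary least squares for each of the four datasets.
fits <- lapply(1:4, function(i) {
  lm(as.formula(paste0("y", i, " ~ x", i)), data = a)
})
names(fits) <- c("I", "II", "III", "IV")

# Collect intercepts and slopes in one table, one row per dataset.
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
round(coefs, 4)

# Full regression summary for a single dataset.
summary(fits$I)
```

Tabulating the coefficients side by side makes the near-identical intercepts and slopes easier to compare than four separate summaries.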
3. Visualization
Packages loaded for the plots: ggplot2, grid, vioplot (with its dependency sm, version 2.2-5.4) and lattice.
Scatterplots
Boxplots
Best fit line
Violin plots (the "vioplot" package must be installed)
Parallel plot (the "lattice" package must be installed)
Association plot
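The plotting code is not included in the report. As a rough sketch of the scatterplot and best-fit-line panels only, the following uses base R graphics instead of ggplot2, so no extra packages are required. It assumes R's built-in `anscombe` data frame as a stand-in for the CSV file; the axis limits and colours are arbitrary choices, not the author's.

```r
# Stand-in for the CSV file: R's built-in Anscombe data (an assumption).
a <- datasets::anscombe
roman <- c("I", "II", "III", "IV")

# Arrange the four panels in a 2 x 2 grid, saving the old settings.
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

for (i in 1:4) {
  x <- a[[paste0("x", i)]]
  y <- a[[paste0("y", i)]]
  # Shared axis limits so the four panels are directly comparable.
  plot(x, y, xlim = c(3, 20), ylim = c(2, 14), pch = 19,
       main = paste("Dataset", roman[i]))
  # Overlay the least-squares line for this dataset.
  abline(lm(y ~ x), col = "red")
}

# Restore the previous graphics settings.
par(op)
```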
Results
Each of the four datasets has the same or nearly the same mean, variance, correlation and linear regression line: the mean of x is 9, the variance of x is 11, the mean of y is about 7.5, the variance of y is about 4.1, the correlation between x and y is about 0.816, and the fitted regression line is approximately y = 3 + 0.5x.
Conclusions
The four datasets have nearly identical descriptive statistics, but the plots show that their distributions differ markedly. The plots also reveal outlying points in datasets III and IV. This exercise demonstrates how important it is to visualize the data during exploratory data analysis, because descriptive statistics alone can lead to misleading conclusions.
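The outlying points can also be located numerically. The sketch below goes slightly beyond the original analysis: it looks for the largest residual in dataset III and the highest-leverage point in dataset IV. It assumes R's built-in `anscombe` data frame as a stand-in for the CSV file, and the names `fit3`, `fit4`, `out3`, `lev4` and `out4` are illustrative.

```r
# Stand-in for the CSV file: R's built-in Anscombe data (an assumption).
a <- datasets::anscombe

fit3 <- lm(y3 ~ x3, data = a)
fit4 <- lm(y4 ~ x4, data = a)

# Dataset III: the row whose residual is largest in absolute value.
out3 <- which.max(abs(resid(fit3)))

# Dataset IV: the row with the highest leverage (hat value).
lev4 <- hatvalues(fit4)
out4 <- which.max(lev4)

a[out3, c("x3", "y3")]
a[out4, c("x4", "y4")]
```

In dataset IV every x value but one is 8, so the fitted slope is determined entirely by the single row where x is 19. That is why a leverage measure, rather than a residual, is used to flag it.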