In this exercise, we will analyze four small datasets, each composed of 11 observations of two variables: x and y.

First, let’s load the required packages:

library(stringr)   # str_split()
library(plyr)      # ldply()
library(ggplot2)   # ggplot(), geom_line(), geom_point(), geom_tile()
library(grid)
library(gridExtra) # grid.arrange()
library(GGally)    # ggpairs()
library(scales)
library(reshape2)  # melt()

And our data:

x1 <- c(10.0,8.0,13.0,9.0,11.0,14.0,6.0,4.0,12.0,7.0,5.0)
y1 <- c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68)
x2 <- c(10.0,8.0,13.0,9.0,11.0,14.0,6.0,4.0,12.0,7.0,5.0)
y2 <- c(9.14,8.14,8.74,8.77,9.26,8.10,6.13,3.10,9.13,7.26,4.74)
x3 <- c(10.0,8.0,13.0,9.0,11.0,14.0,6.0,4.0,12.0,7.0,5.0)
y3 <- c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73)
x4 <- c(8.0,8.0,8.0,8.0,8.0,8.0,8.0,19.0,8.0,8.0,8.0)
y4 <- c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89)
df1 <- data.frame(x1,y1) # Load each pair into a dataframe and then compile into a master dataframe
df2 <- data.frame(x2,y2)
df3 <- data.frame(x3,y3)
df4 <- data.frame(x4,y4)
master <- data.frame(df1,df2,df3,df4)
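A "long" layout, with one row per observation and a column naming the dataset, is often easier to summarize or facet than the wide `master` frame. The following is a minimal sketch, assuming that R's built-in `anscombe` data frame holds the same values as the vectors typed in above (it appears to); the names `long` and `set` are illustrative and not part of the original analysis.

```r
# Assumption: the built-in `anscombe` data frame matches x1..x4 / y1..y4 above.
data(anscombe)

# Stack the four (x, y) pairs into one long data frame with a dataset label.
long <- data.frame(
  set = factor(rep(paste0("df", 1:4), each = nrow(anscombe))),
  x   = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE),
  y   = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

head(long)
```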

Let’s get a brief overview of the stats. We can see that the means are virtually identical across the pairs (9.0 for every x and about 7.50 for every y), while the medians and quartiles differ.

#The following steps take the summary data and convert them into a "long" dataframe
sumStats <- as.data.frame(summary(master)) 
sumStats$Freq <- as.character(sumStats$Freq)
interlist <- str_split(sumStats$Freq, ":")
interdf <- ldply(interlist)
colnames(interdf) <- c("Stat", "Value")
sumStats <- cbind(sumStats, interdf)
sumStats <- sumStats[-c(1,3)]
colnames(sumStats)[1] <- "Variable"
sumStats
##    Variable    Stat    Value
## 1        x1 Min.       4.0  
## 2        x1 1st Qu.    6.5  
## 3        x1 Median     9.0  
## 4        x1 Mean       9.0  
## 5        x1 3rd Qu.   11.5  
## 6        x1 Max.      14.0  
## 7        y1 Min.     4.260  
## 8        y1 1st Qu.  6.315  
## 9        y1 Median   7.580  
## 10       y1 Mean     7.501  
## 11       y1 3rd Qu.  8.570  
## 12       y1 Max.    10.840  
## 13       x2 Min.       4.0  
## 14       x2 1st Qu.    6.5  
## 15       x2 Median     9.0  
## 16       x2 Mean       9.0  
## 17       x2 3rd Qu.   11.5  
## 18       x2 Max.      14.0  
## 19       y2 Min.     3.100  
## 20       y2 1st Qu.  6.695  
## 21       y2 Median   8.140  
## 22       y2 Mean     7.501  
## 23       y2 3rd Qu.  8.950  
## 24       y2 Max.     9.260  
## 25       x3 Min.       4.0  
## 26       x3 1st Qu.    6.5  
## 27       x3 Median     9.0  
## 28       x3 Mean       9.0  
## 29       x3 3rd Qu.   11.5  
## 30       x3 Max.      14.0  
## 31       y3 Min.      5.39  
## 32       y3 1st Qu.   6.25  
## 33       y3 Median    7.11  
## 34       y3 Mean      7.50  
## 35       y3 3rd Qu.   7.98  
## 36       y3 Max.     12.74  
## 37       x4 Min.         8  
## 38       x4 1st Qu.      8  
## 39       x4 Median       8  
## 40       x4 Mean         9  
## 41       x4 3rd Qu.      8  
## 42       x4 Max.        19  
## 43       y4 Min.     5.250  
## 44       y4 1st Qu.  6.170  
## 45       y4 Median   7.040  
## 46       y4 Mean     7.501  
## 47       y4 3rd Qu.  8.190  
## 48       y4 Max.    12.500
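The same comparison can be made more compactly by computing a few statistics per column. This is a sketch rather than the method used above; it assumes the built-in `anscombe` data frame matches `master`, and `stats` is an illustrative name.

```r
# Assumption: the built-in `anscombe` data frame matches `master` above.
data(anscombe)

# One column per variable, one row per statistic.
stats <- sapply(anscombe, function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v))
})

round(stats, 3)
```

Reading across the `mean` row makes the near-identical means easy to see, while the `median` row shows where the sets diverge.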

Now let’s build some graphs to visualize the datasets.

sumStats$Value <- as.numeric(sumStats$Value)
wideData <- reshape(sumStats, direction="wide", idvar="Variable", timevar="Stat")
wideData["Value.Range"] <- wideData$Value.Max - wideData$Value.Min # Let's add the Range and IQR of each set
wideData["Value.IQR"] <- wideData["Value.3rd Qu."] - wideData["Value.1st Qu."]
p1 <- ggplot(df1, aes(x=x1, y=y1)) + geom_line() # Now for visuals
p2 <- ggplot(df2, aes(x=x2, y=y2)) + geom_line()
p3 <- ggplot(df3, aes(x=x3, y=y3)) + geom_line()
p4 <- ggplot(df4, aes(x=x4, y=y4)) + geom_line()
grid.arrange(p1,p2,p3,p4,ncol=2,top="X,Y Correlations") # gridExtra >= 2.0 uses `top` instead of `main`

As we can see, x and y relate differently in each dataset. Note that geom_line() connects the points in order of x, treating each row as an ordered pair. In the upper-left panel, df1 shows a positive but noisy relationship. To its right, df2 looks parabolic: y rises with x until it reaches a peak and then falls. In the bottom-left panel, df3 shows a steady positive linear relationship, with one outlier at (x3,y3) = (13,12.74). Sorting the raw data by y3 puts the outlier in the last row:

df3[order(y3),]
##    x3    y3
## 8   4  5.39
## 11  5  5.73
## 7   6  6.08
## 10  7  6.42
## 2   8  6.77
## 4   9  7.11
## 1  10  7.46
## 5  11  7.81
## 9  12  8.15
## 6  14  8.84
## 3  13 12.74
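Two of the visual impressions above can be checked numerically. The sketch below assumes the built-in `anscombe` data frame matches the data entered earlier; `quad_fit`, `line_fit` and `outlier_row` are illustrative names. It fits a quadratic to the second pair to test the "parabolic" reading, and fits a straight line to the third pair to find the observation that sits farthest from it.

```r
# Assumption: the built-in `anscombe` data frame matches df2 and df3 above.
data(anscombe)

# If y2 is close to a parabola in x2, a quadratic fit should explain
# nearly all of the variance.
quad_fit <- lm(y2 ~ x2 + I(x2^2), data = anscombe)
quad_r2  <- summary(quad_fit)$r.squared
quad_r2

# For the third pair, the outlier is the row with the largest
# absolute residual from a straight-line fit.
line_fit    <- lm(y3 ~ x3, data = anscombe)
outlier_row <- unname(which.max(abs(residuals(line_fit))))
anscombe[outlier_row, c("x3", "y3")]
```

The flagged row should be the same (13, 12.74) point that appears last in the sorted output above.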

The bottom-right panel shows an odd distribution for df4. Apart from one outlier at x4 = 19, x is the same (8) for every value of y, so x explains none of the variation in y among those ten points. This is clearer if we look at the distribution with a scatterplot:

ggplot(df4, aes(x=x4, y=y4)) + geom_point()
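The structure of df4 can also be confirmed without a plot. This is a small sketch, again assuming the built-in `anscombe` data frame matches the data above; `x_counts` and `rest` are illustrative names.

```r
# Assumption: the built-in `anscombe` data frame matches df4 above.
data(anscombe)

# How often does each x value occur?
x_counts <- table(anscombe$x4)
x_counts

# Without the single point at x = 19, x has no spread at all, so a
# correlation with y cannot even be computed for the remaining rows.
rest <- anscombe[anscombe$x4 != 19, ]
sd(rest$x4)
```

Because the remaining x values have zero variance, any correlation reported for this pair depends entirely on the one point at x = 19.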

We can also look at a correlation matrix of all 8 x and y variables:

cor(master[,1:8])
##            x1         y1         x2         y2         x3         y3
## x1  1.0000000  0.8164205  1.0000000  0.8162365  1.0000000  0.8162867
## y1  0.8164205  1.0000000  0.8164205  0.7500054  0.8164205  0.4687167
## x2  1.0000000  0.8164205  1.0000000  0.8162365  1.0000000  0.8162867
## y2  0.8162365  0.7500054  0.8162365  1.0000000  0.8162365  0.5879193
## x3  1.0000000  0.8164205  1.0000000  0.8162365  1.0000000  0.8162867
## y3  0.8162867  0.4687167  0.8162867  0.5879193  0.8162867  1.0000000
## x4 -0.5000000 -0.5290927 -0.5000000 -0.7184365 -0.5000000 -0.3446610
## y4 -0.3140467 -0.4891162 -0.3140467 -0.4780949 -0.3140467 -0.1554718
##            x4         y4
## x1 -0.5000000 -0.3140467
## y1 -0.5290927 -0.4891162
## x2 -0.5000000 -0.3140467
## y2 -0.7184365 -0.4780949
## x3 -0.5000000 -0.3140467
## y3 -0.3446610 -0.1554718
## x4  1.0000000  0.8165214
## y4  0.8165214  1.0000000
ggpairs(cor(master[,1:8]),axisLabels="internal")

The previous graph is very difficult to interpret. Instead, we will build a heatmap to visualize correlation.

allCor <- cor(master[,1:8])
allCor2 <- melt(allCor,varnames=c("x","y"),value.name="Correlation")
allCor3 <- allCor2[order(allCor2$Correlation),]
ggplot(allCor3,aes(x=x, y=y)) +
  geom_tile(aes(fill=Correlation)) + 
  theme_minimal() +
  labs(x=NULL,y=NULL)

We can glean at least two useful pieces of information from this heatmap:

  1. The variables in df1, df2, and df3 are all positively correlated with each other (from about 0.47 to 1.00), but are all negatively correlated with the variables in df4.

  2. The lightest tiles mark the highest correlations, including a perfect correlation among the x variables in df1, df2, and df3 (the three vectors are identical). We can confirm that the three correlations are equal:

# Chain the pairwise comparisons with &&; a == (b == c) would compare a against TRUE/FALSE instead
allCor["x1","x2"] == allCor["x2","x3"] && allCor["x2","x3"] == allCor["x1","x3"]
## [1] TRUE
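The full matrix mixes within-pair and cross-pair correlations, so it can help to pull out only the correlation of each x with its own y. The following sketch assumes the built-in `anscombe` data frame matches `master`; `pair_cor` is an illustrative name.

```r
# Assumption: the built-in `anscombe` data frame matches `master` above.
data(anscombe)

# Correlate each x column with the y column of the same dataset.
pair_cor <- mapply(cor,
                   anscombe[paste0("x", 1:4)],
                   anscombe[paste0("y", 1:4)])
names(pair_cor) <- paste0("df", 1:4)

round(pair_cor, 3)
```

These four values correspond to the (x1,y1), (x2,y2), (x3,y3) and (x4,y4) entries of the matrix above, which are nearly identical despite the very different shapes seen in the plots.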