In this exercise, we will analyze the data in four small datasets, each composed of 11 observations of two variables: x and y.
First, let’s load the required packages:
## Loading required package: stringr
## Loading required package: plyr
## Loading required package: ggplot2
## Loading required package: grid
## Loading required package: gridExtra
## Loading required package: GGally
## Loading required package: scales
## Loading required package: reshape2
And our data:
x1 <- c(10.0,8.0,13.0,9.0,11.0,14.0,6.0,4.0,12.0,7.0,5.0)
y1 <- c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68)
x2 <- c(10.0,8.0,13.0,9.0,11.0,14.0,6.0,4.0,12.0,7.0,5.0)
y2 <- c(9.14,8.14,8.74,8.77,9.26,8.10,6.13,3.10,9.13,7.26,4.74)
x3 <- c(10.0,8.0,13.0,9.0,11.0,14.0,6.0,4.0,12.0,7.0,5.0)
y3 <- c(7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73)
x4 <- c(8.0,8.0,8.0,8.0,8.0,8.0,8.0,19.0,8.0,8.0,8.0)
y4 <- c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89)
df1 <- data.frame(x1,y1) # Load each pair into a dataframe and then compile into a master dataframe
df2 <- data.frame(x2,y2)
df3 <- data.frame(x3,y3)
df4 <- data.frame(x4,y4)
master <- data.frame(df1,df2,df3,df4)
Let’s get a brief overview of the stats. We can see that the means are nearly identical across the four pairs (9.0 for every x and about 7.50 for every y), while the medians differ somewhat more.
#The following steps take the summary data and convert them into a "long" dataframe
sumStats <- as.data.frame(summary(master))
sumStats$Freq <- as.character(sumStats$Freq)
interlist <- str_split(sumStats$Freq, ":")
interdf <- ldply(interlist)
colnames(interdf) <- c("Stat", "Value")
sumStats <- cbind(sumStats, interdf)
sumStats <- sumStats[-c(1,3)]
colnames(sumStats)[1] <- "Variable"
sumStats
## Variable Stat Value
## 1 x1 Min. 4.0
## 2 x1 1st Qu. 6.5
## 3 x1 Median 9.0
## 4 x1 Mean 9.0
## 5 x1 3rd Qu. 11.5
## 6 x1 Max. 14.0
## 7 y1 Min. 4.260
## 8 y1 1st Qu. 6.315
## 9 y1 Median 7.580
## 10 y1 Mean 7.501
## 11 y1 3rd Qu. 8.570
## 12 y1 Max. 10.840
## 13 x2 Min. 4.0
## 14 x2 1st Qu. 6.5
## 15 x2 Median 9.0
## 16 x2 Mean 9.0
## 17 x2 3rd Qu. 11.5
## 18 x2 Max. 14.0
## 19 y2 Min. 3.100
## 20 y2 1st Qu. 6.695
## 21 y2 Median 8.140
## 22 y2 Mean 7.501
## 23 y2 3rd Qu. 8.950
## 24 y2 Max. 9.260
## 25 x3 Min. 4.0
## 26 x3 1st Qu. 6.5
## 27 x3 Median 9.0
## 28 x3 Mean 9.0
## 29 x3 3rd Qu. 11.5
## 30 x3 Max. 14.0
## 31 y3 Min. 5.39
## 32 y3 1st Qu. 6.25
## 33 y3 Median 7.11
## 34 y3 Mean 7.50
## 35 y3 3rd Qu. 7.98
## 36 y3 Max. 12.74
## 37 x4 Min. 8
## 38 x4 1st Qu. 8
## 39 x4 Median 8
## 40 x4 Mean 9
## 41 x4 3rd Qu. 8
## 42 x4 Max. 19
## 43 y4 Min. 5.250
## 44 y4 1st Qu. 6.170
## 45 y4 Median 7.040
## 46 y4 Mean 7.501
## 47 y4 3rd Qu. 8.190
## 48 y4 Max. 12.500
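For a more compact comparison of the means and medians, the following base R sketch may help. It rebuilds the eight variables so that it runs on its own; `quickStats` is a name introduced here for illustration and does not appear in the original analysis.

```r
# Rebuild the eight variables so this sketch is self-contained
master <- data.frame(
  x1 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
  y1 = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68),
  x2 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
  y2 = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74),
  x3 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
  y3 = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
  x4 = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
  y4 = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)
)

# One column per variable, one row per statistic (hypothetical helper name)
quickStats <- sapply(master, function(v) c(mean = mean(v), median = median(v)))
round(quickStats, 3)
```

Reading across the rows of `quickStats` makes it easy to see which statistics agree between the pairs and which do not.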
Now let’s build some graphs to visualize the dataset.
sumStats$Value <- as.numeric(sumStats$Value)
wideData <- reshape(sumStats, direction="wide", idvar="Variable", timevar="Stat")
wideData["Value.Range"] <- wideData$Value.Max - wideData$Value.Min # Let's add the Range and IQR of each set
wideData["Value.IQR"] <- wideData["Value.3rd Qu."] - wideData["Value.1st Qu."]
p1 <- ggplot(df1, aes(x=x1, y=y1)) + geom_line() # Now for visuals
p2 <- ggplot(df2, aes(x=x2, y=y2)) + geom_line()
p3 <- ggplot(df3, aes(x=x3, y=y3)) + geom_line()
p4 <- ggplot(df4, aes(x=x4, y=y4)) + geom_line()
grid.arrange(p1,p2,p3,p4,ncol=2,top="X,Y Correlations") # gridExtra >= 2.0 takes top= for the title (older versions used main=)
As we can see, x and y relate differently in each dataset, with each row plotted as an ordered (x, y) pair. In the upper-left panel, df1 shows a positive but noisy relationship. To its right, df2 looks parabolic: y rises with x until it peaks around x = 11, then falls. In the bottom-left panel, df3 shows a steady positive relationship with one outlier at (x3,y3) = (13,12.74). The following call sorts df3 by y3 so that the outlier stands out in the raw data:
df3[order(y3),]
## x3 y3
## 8 4 5.39
## 11 5 5.73
## 7 6 6.08
## 10 7 6.42
## 2 8 6.77
## 4 9 7.11
## 1 10 7.46
## 5 11 7.81
## 9 12 8.15
## 6 14 8.84
## 3 13 12.74
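The visual impressions of df2 and df3 can be checked numerically. The sketch below is one possible check, not part of the original analysis: it fits a quadratic to df2 and compares the correlation in df3 with and without the point at x3 = 13. The names `quadFit`, `isOutlier`, `corWith`, and `corWithout` are introduced here for illustration.

```r
# Values repeated here so the sketch runs on its own
x2 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y2 <- c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74)
x3 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y3 <- c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)

# df2: if the shape is parabolic, a quadratic in x2 should fit very closely
quadFit <- lm(y2 ~ x2 + I(x2^2))
quadR2 <- summary(quadFit)$r.squared

# df3: correlation with and without the suspected outlier at x3 = 13
isOutlier <- x3 == 13
corWith <- cor(x3, y3)
corWithout <- cor(x3[!isOutlier], y3[!isOutlier])

c(quadR2 = quadR2, corWith = corWith, corWithout = corWithout)
```

A quadratic R-squared close to 1 would support the parabolic reading of df2, and a jump in correlation once the outlier is dropped would support the reading of df3 as a straight line plus one unusual point.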
Our bottom-right panel shows an odd distribution for df4. Ten of the eleven points share the same x value (8), so x tells us nothing about y among them; the only exception is a single outlier at (19, 12.50). This is clearer if we look at the distribution with a scatterplot:
ggplot(df4, aes(x=x4, y=y4)) + geom_point()
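To see how much the single outlier drives the relationship in df4, one option is to compare the correlation with and without it. This is a sketch under the assumption that the point at x4 = 19 is the outlier; `isOutlier`, `corAll`, `sdRest`, and `corRest` are illustrative names.

```r
# Values repeated here so the sketch runs on its own
x4 <- c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)
y4 <- c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)

isOutlier <- x4 == 19

corAll <- cor(x4, y4)          # correlation over all 11 points
sdRest <- sd(x4[!isOutlier])   # spread of x once the outlier is dropped

# With no spread left in x, cor() cannot be computed: it returns NA with a warning
corRest <- suppressWarnings(cor(x4[!isOutlier], y4[!isOutlier]))

c(corAll = corAll, sdRest = sdRest, corRest = corRest)
```

If `sdRest` is zero and `corRest` is `NA`, the correlation in df4 comes entirely from the one outlying point.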
We can also look at a correlation matrix of all 8 x and y variables:
cor(master[,1:8])
## x1 y1 x2 y2 x3 y3
## x1 1.0000000 0.8164205 1.0000000 0.8162365 1.0000000 0.8162867
## y1 0.8164205 1.0000000 0.8164205 0.7500054 0.8164205 0.4687167
## x2 1.0000000 0.8164205 1.0000000 0.8162365 1.0000000 0.8162867
## y2 0.8162365 0.7500054 0.8162365 1.0000000 0.8162365 0.5879193
## x3 1.0000000 0.8164205 1.0000000 0.8162365 1.0000000 0.8162867
## y3 0.8162867 0.4687167 0.8162867 0.5879193 0.8162867 1.0000000
## x4 -0.5000000 -0.5290927 -0.5000000 -0.7184365 -0.5000000 -0.3446610
## y4 -0.3140467 -0.4891162 -0.3140467 -0.4780949 -0.3140467 -0.1554718
## x4 y4
## x1 -0.5000000 -0.3140467
## y1 -0.5290927 -0.4891162
## x2 -0.5000000 -0.3140467
## y2 -0.7184365 -0.4780949
## x3 -0.5000000 -0.3140467
## y3 -0.3446610 -0.1554718
## x4 1.0000000 0.8165214
## y4 0.8165214 1.0000000
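Most entries in the full matrix pair a variable from one dataset with a variable from another, matched only by row position. If only the within-dataset correlations are of interest, they can be pulled out directly. The sketch below is one way to do that in base R; `xs`, `ys`, and `pairCor` are illustrative names.

```r
# Rebuild the eight variables so this sketch is self-contained
master <- data.frame(
  x1 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
  y1 = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68),
  x2 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
  y2 = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74),
  x3 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
  y3 = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
  x4 = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
  y4 = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)
)

# Split the columns into x's and y's, then correlate them pair by pair
xs <- master[c("x1", "x2", "x3", "x4")]
ys <- master[c("y1", "y2", "y3", "y4")]
pairCor <- mapply(cor, xs, ys)   # cor(x1, y1), cor(x2, y2), ...

round(pairCor, 4)
```

These four values correspond to the (x1, y1), (x2, y2), (x3, y3), and (x4, y4) cells of the matrix above.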
ggpairs(cor(master[,1:8]),axisLabels="internal")
The previous graph is very difficult to interpret. Instead, we will build a heatmap to visualize correlation.
allCor <- cor(master[,1:8])
allCor2 <- melt(allCor,varnames=c("x","y"),value.name="Correlation")
allCor3 <- allCor2[order(allCor2$Correlation),]
ggplot(allCor3,aes(x=x, y=y)) +
geom_tile(aes(fill=Correlation)) +
theme_minimal() +
labs(x=NULL,y=NULL)
We can glean at least two useful pieces of information from this heatmap:
The variables in df1, df2, and df3 are all positively correlated with each other (coefficients from roughly 0.47 to 1.0), but all of them are negatively correlated with the variables in df4.
The lightest tiles suggest a perfect or near-perfect correlation among the x variables in df1, df2, and df3, which is expected because x1, x2, and x3 hold identical values. We will test whether those three correlations are equal.
allCor["x1","x2"] == allCor["x2","x3"] && allCor["x2","x3"] == allCor["x1","x3"] # chain the two comparisons with && so all three values are compared
## [1] TRUE
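A note on comparing three values for equality in R: wrapping one comparison in parentheses, as in `a == (b == c)`, compares `a` against a logical result, which R coerces to 1 or 0. That form only returns TRUE for three equal values when they all happen to equal 1. The sketch below uses made-up values (`v1`, `v2`, `v3`) to illustrate the difference.

```r
# Three equal values that are not 1 (illustrative values only)
v1 <- 0.5
v2 <- 0.5
v3 <- 0.5

# Inner comparison is TRUE, so this becomes 0.5 == TRUE, i.e. 0.5 == 1
wrapped <- v1 == (v2 == v3)

# Chaining with && checks both comparisons directly
chained <- v1 == v2 && v2 == v3

c(wrapped = wrapped, chained = chained)
```

For computed correlations, `isTRUE(all.equal(a, b))` may be a safer comparison than `==`, since it tolerates small floating-point differences.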