Bi-variate Relationship

Most of the time, we are interested in understanding the relationship between two numeric variables. In a statistical setting, we usually have one variable that is the ‘Output’ and one or more variables that are the ‘Input’.
The output variable is called the Response Variable, denoted by y, and is also known as the dependent variable. The input variable is called the Explanatory Variable, denoted by x, and is also known as the independent variable or predictor.

Scatterplots can be used to examine the relationship between two numerical variables: each dot represents one observation, positioned by its values on the two variables. We put the Response variable on the y-axis (vertical axis) and the Explanatory variable on the x-axis (horizontal axis).

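You will need to install and load the “openintro” package to use the datasets mentioned below. Please also install and load the “ggplot2” and “dplyr” packages; the %>% pipe used later comes from “magrittr” and is also made available by dplyr. The column names used below (such as tailL and totalL) may differ between openintro releases, so check names(possum) if a column is not found.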

In ggplot(), after mapping our variables to their respective axes with aes(), we use geom_point() to plot the points.

library(openintro)
library(ggplot2)
library(dplyr)   # provides %>%, filter() and summarise() used later

#scatterplot of the total length of a possum as a function of the length of its tail.

ggplot(data = possum, aes(x= tailL, y= totalL)) + geom_point()

We can use geom_boxplot() to show the relationship between a numerical variable and a categorical explanatory variable. Scatterplots can be considered a generalization of side-by-side boxplots: we can get from one to the other by discretizing the explanatory variable with the cut() function, which takes a numeric vector and chops it into discrete chunks.

ggplot(data = ncbirths, aes(x=cut(weeks, breaks = 5), y= weight)) + geom_boxplot()

Note: Notice how the median birth weight tends to increase with the length of the pregnancy (weeks) across the 5 groups.
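
To see what cut() does on its own, here is a minimal sketch on a small made-up vector. The values in weeks are hypothetical and chosen only for illustration; they are not taken from ncbirths.

```r
# Hypothetical gestation lengths in weeks (illustrative values, not from ncbirths)
weeks <- c(26, 31, 35, 37, 38, 39, 39, 40, 41, 44)

# cut() splits the range of the vector into 5 equal-width intervals
# and returns a factor that labels each value with its interval
week_groups <- cut(weeks, breaks = 5)

levels(week_groups)   # the 5 interval labels
table(week_groups)    # how many values fall in each interval
```

Each level of the resulting factor becomes one box when it is mapped to the x-axis of geom_boxplot().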

Characterizing Bi-variate relationship

We usually look for 4 characteristics in a relationship. They are:

  1. Form: The overall shape made by the points. It could be linear or non-linear (for example quadratic).
  2. Direction: The direction is positive if one variable tends to increase as the other increases, and negative if one tends to decrease as the other increases.
  3. Strength: Strength is governed by how much scatter is present, i.e. whether the points cluster tightly around the form or are organised loosely.
  4. Outliers: Points lying far away from the rest. They can be exceptions that help clarify the general trend.

Let’s draw scatterplots from different datasets and look at their relationships.

#scatterplot illustrating how the brain weight of a mammal varies as a function of its body weight.
ggplot(data = mammals, aes(x=BodyWt, y=BrainWt)) +geom_point()

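On the raw scale the graph suggests only a weak relationship: most points are bunched near the origin, with a few very large mammals far from the rest.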

#scatterplot illustrating how the slugging percentage (SLG) of a baseball player varies as a function of his on-base percentage (OBP).
ggplot(data = mlbBat10, aes(x= OBP, y=SLG)) + geom_point()

The graph shows a positive, linear, strong relationship with a few exceptions as outliers.

#scatterplot illustrating how a person's weight varies as a function of their height.
ggplot(data = bdims, aes(x=hgt, y=wgt, color=factor(sex))) + geom_point()

The graph shows a positive, linear, strong relationship; the points are coloured by sex.

#scatterplot illustrating how the amount that a person smokes on weekdays varies as a function of their age.
ggplot(data = smoking, aes(x=age, y=amtWeekdays)) + geom_point()

The graph shows a non-linear relationship.

Transformation

The relationship between two variables might not be linear. Sometimes a scatterplot shows strange and inscrutable patterns, and sometimes no meaningful relationship at all.

Other times, a careful transformation of one or both variables might reveal a clear relationship. ggplot2 has several ways of viewing transformed relationships: coord_trans() transforms the coordinates of the plot, while scale_x_log10() and scale_y_log10() apply a base-10 log transformation to the respective axis.

E.g. in the scatterplot of brain weight against body weight among mammals, the relationship was hard to read. But it shows a meaningful relationship once we transform both variables.

#transforming using coord_trans()
ggplot(data = mammals, aes(x=BodyWt, y= BrainWt)) + geom_point() + coord_trans(x="log10", y="log10")

#transforming using scale_x_log10() & scale_y_log10()
ggplot(data = mammals, aes(x=BodyWt, y= BrainWt)) + geom_point() + 
  scale_x_log10() + scale_y_log10()

We can see clearly in the above two plots that, after transforming, the graph shows a positive, linear, strong relationship between the two variables.
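
The effect of a log transformation can also be checked numerically. The sketch below uses simulated data, not the mammals dataset: body_wt and brain_wt are hypothetical variables generated from an assumed power law with multiplicative noise. It compares how linear the relationship is before and after taking log10 of both variables.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated, hypothetical data: brain_wt follows a power law in body_wt
# with multiplicative noise (assumed model, for illustration only)
body_wt  <- 10^runif(200, min = -2, max = 4)
brain_wt <- 5 * body_wt^0.75 * 10^rnorm(200, sd = 0.2)

# Strength of the linear relationship on the raw scale and on the log10 scale
r_raw <- cor(body_wt, brain_wt)
r_log <- cor(log10(body_wt), log10(brain_wt))

c(raw = r_raw, log10 = r_log)
```

A power law becomes a straight line once both axes are on a log scale, which is why the transformed plot looks so much cleaner.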

Mis-representing Data

ggplot(data = mlbBat10, aes(x=SB, y=HR)) + geom_point()

The scatterplot above shows the relationship between the number of home runs (HR) hit by major league baseball players in 2010 and the number of bases they stole (SB).

Since both variables are integer-valued, several observations share the same coordinates and are drawn on top of each other in the plot. This over-plotting can mis-represent the data.

To solve this problem, we can use the ‘alpha’ argument of geom_point(), which sets the transparency of the points.

ggplot(data = mlbBat10, aes(x= SB, y=HR)) + geom_point(alpha = 0.5)

Darker points now show where over-plotting is happening.

Another way to avoid mis-representing the data is to add ‘jitter’, a small amount of random noise in the x and/or y direction.

ggplot(data = mlbBat10, aes(x= SB, y=HR)) + geom_point(alpha = 0.5, position= "jitter")
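
To get a feel for how much over-plotting integer-valued data can hide, here is a small sketch with simulated counts. The vectors sb and hr are hypothetical Poisson draws, not the mlbBat10 columns, and base R's jitter() stands in for the noise that position = "jitter" adds inside ggplot2.

```r
set.seed(1)  # reproducible simulation

# Hypothetical integer-valued counts (not the mlbBat10 data)
sb <- rpois(500, lambda = 3)
hr <- rpois(500, lambda = 5)

# Number of observations that share their (x, y) position with an earlier one
n_overlap <- sum(duplicated(data.frame(sb, hr)))

# jitter() adds a small amount of uniform noise to each value
sb_j <- jitter(sb)
hr_j <- jitter(hr)
n_overlap_jittered <- sum(duplicated(data.frame(sb_j, hr_j)))

c(before = n_overlap, after = n_overlap_jittered)
```

The noise is small compared with the gap between neighbouring integers, so the overall pattern is preserved while the hidden points become visible.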

Correlation

Correlation is a way to quantify the strength of a linear relationship, usually summarised by the correlation coefficient. The value of the correlation coefficient lies between -1 and 1. The magnitude corresponds to the strength, whereas the sign corresponds to the direction.

A correlation coefficient close to 1 (e.g. 0.959) represents a near-perfect positive correlation, a value around 0.756 represents a strong relationship, a value around 0.536 a moderate relationship, and a value around 0.241 a weak relationship.

If there is no linear correlation between two variables, the coefficient would be close to 0 (e.g. 0.003). This might mean that the values of x give no information about the values of y.

If the direction of the relationship is negative, the correlation coefficient is negative as well, but it is still the magnitude of the coefficient that determines the strength of the relationship.

run10 %>% filter( divPlace <= 10 ) %>%
   ggplot(aes(x=age, y=pace, color=gender)) + geom_point()

The above graph shows the pace of the top 10 finishers in each age group against their age. The relationship is non-linear, with a correlation coefficient of magnitude 0.68, but we know that pace and age are more closely related than this coefficient suggests, because the coefficient only measures the strength of the linear part of a relationship.
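
An extreme version of this effect can be sketched with made-up data. In the example below, y is completely determined by x, yet the correlation coefficient is essentially zero because the relationship is quadratic rather than linear. The vectors are illustrative and unrelated to run10.

```r
# A perfectly deterministic but non-linear (quadratic) relationship
x <- seq(-3, 3, by = 0.5)
y <- x^2

# The correlation coefficient only measures linear association,
# so it is essentially zero here even though y is fully determined by x
r_quadratic <- cor(x, y)
r_quadratic
```

This is why it is worth looking at the scatterplot before trusting a correlation coefficient.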

Correlation can be computed using cor(x, y). Since the quantity is symmetric with respect to x and y, it doesn’t matter in which order you pass the variables.

By default, cor() returns NA if any value is missing. To override this, we can use the use argument. Setting use = "pairwise.complete.obs" computes the correlation coefficient from only those observations where neither x nor y is missing.
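
As a rough sketch of what cor() computes by default (Pearson's coefficient), the example below works it out by hand on a small hypothetical sample and checks the sign and symmetry properties described above. The numbers in x and y are made up for illustration.

```r
# Small hypothetical sample
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Pearson's r: the sum of cross-deviations divided by the square root
# of the product of the two sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)   # same value from the built-in function
r_swapped <- cor(y, x)   # the order of the arguments does not matter
r_negative <- cor(x, -y) # reversing the direction flips only the sign

c(manual = r_manual, builtin = r_builtin,
  swapped = r_swapped, negative = r_negative)
```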

#calculating correlation when neither variable has missing values
ncbirths %>% summarise(N=n(), r=cor(mage, weight))
##      N          r
## 1 1000 0.05506589
#calculating correlation with some missing values
ncbirths %>% summarise(N=n(), r=cor(weight, weeks, use= "pairwise.complete.obs"))
##      N         r
## 1 1000 0.6701013
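
The behaviour with missing values can be reproduced on a tiny example. The vectors below are hypothetical; the sketch shows the default NA result, the use = "pairwise.complete.obs" override, and an equivalent calculation that drops incomplete pairs by hand.

```r
# Hypothetical vectors with one missing value in y
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, NA, 8, 10)

# By default any missing value makes the result NA
r_default <- cor(x, y)

# Keep only the pairs where both x and y are observed
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

# Equivalent by hand: drop incomplete pairs first
ok <- complete.cases(x, y)
r_by_hand <- cor(x[ok], y[ok])

c(default = r_default, pairwise = r_pairwise, by_hand = r_by_hand)
```

Note that N = n() in the summarise() calls above still counts every row, including those that are dropped when computing r.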

Reference

DataCamp course on Correlation and Regression by Ben Baumer.