2024-09-23

Introduction

  • Bivariate analysis is a form of statistical analysis in which the relationship between two variables is examined. Typically, one variable is treated as dependent and the other as independent.

    – Ex: Finding the relationship between the heights and weights of diabetic patients

  • Bivariate data can also be two sets of data that are dependent on each other.

    – Ex: Traffic accidents compared with weather on a particular day

  • Bivariate analysis helps to identify patterns and trends, and forms the basis for more complex multivariate analyses.

Types of Bivariate Analysis

A Few Types of Bivariate Analysis Explained

  • Scatter plot - shows a point for each observation, with the x-coordinate given by the value of one variable and the y-coordinate by the value of the other.

  • Regression analysis - a widely used statistical method for investigating the relationship between two continuous (quantitative) variables.

  • (Pearson) correlation coefficient - a statistical measure of how strongly, and in what direction, two variables are linearly related.

    • 0 : no linear correlation.
    • 1 or -1 : perfect correlation (1 is positive, -1 is negative).
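
As a minimal sketch of the regression item above, the following fits a simple linear regression with base R's `lm()`. The choice of `mpg` and `wt` from the built-in mtcars data is an assumption made for illustration only.

```r
# Simple linear regression of fuel economy (mpg) on weight (wt).
# mtcars ships with base R; the variable pair is an illustrative choice.
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                          # intercept and slope of the fitted line
r_squared <- summary(fit)$r.squared
r_squared                          # share of the variance in mpg explained by wt
```

With a single predictor, the R-squared of the fit equals the square of the Pearson correlation between the two variables, which ties regression back to the correlation coefficient.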

(Pearson) Correlation

For two numeric variables, we often use the (Pearson) correlation coefficient:

\[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}\]

Where:

  • \(r\) is the correlation coefficient.
  • \(x_i\) and \(y_i\) are the individual sample values of the X and Y variables.
  • \(\bar{x}\) and \(\bar{y}\) are the sample means of the X and Y variables.
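
The formula can be checked by computing it term by term and comparing with R's built-in `cor()`. This is a sketch only; using Sepal.Length and Petal.Length from iris as X and Y is an assumption made for illustration.

```r
# X and Y are an illustrative pair of numeric columns from iris
x <- iris$Sepal.Length
y <- iris$Petal.Length

dx <- x - mean(x)   # deviations of x from its mean
dy <- y - mean(y)   # deviations of y from its mean

# Numerator: sum of cross-products; denominator: root of the product of sums of squares
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(x, y)   # method = "pearson" is the default

all.equal(r_manual, r_builtin)
```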

Note: Bivariate analysis is not the same as two-sample data analysis, where X and Y are not directly related (the observations are not paired).

Example Dataset: iris and mtcars

Using the iris and mtcars datasets that ship with R for our examples, we apply the three types of bivariate analysis.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Scatter Plot: Sepal Length vs. Petal Length
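
A plot of this kind can be sketched with base R graphics. The colouring by species, the legend, and the labels are assumptions for illustration, not necessarily how the original figure was drawn.

```r
# Integer colour codes (1, 2, 3), one per species
species_cols <- as.integer(iris$Species)

plot(iris$Sepal.Length, iris$Petal.Length,
     col = species_cols, pch = 19,
     xlab = "Sepal Length", ylab = "Petal Length",
     main = "Sepal Length vs. Petal Length")

# Legend entries follow the factor level order used for the colour codes
legend("topleft", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
```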

Box Plot: Species vs. Petal Length
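
For a categorical variable against a numeric one, a box plot can be sketched with base R's `boxplot()` and its formula interface. The labels are assumptions for illustration.

```r
# One box per species; boxplot() invisibly returns the summary statistics it draws
bp <- boxplot(Petal.Length ~ Species, data = iris,
              xlab = "Species", ylab = "Petal Length",
              main = "Petal Length by Species")

bp$names        # group labels, in plotting order
bp$stats[3, ]   # row 3 holds the median of each group
```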

3D Scatter Plot: Petal Length, Petal Width, Sepal Length
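
Base R graphics has no 3D scatter function, so this sketch swaps in `cloud()` from the lattice package (a recommended package bundled with standard R installations). Whether the original figure used lattice or another package is not stated, so treat this choice as an assumption.

```r
library(lattice)

# z ~ x * y: Sepal.Length on the vertical axis, petal measurements on the base
p3d <- cloud(Sepal.Length ~ Petal.Length * Petal.Width,
             data = iris, groups = Species, auto.key = TRUE,
             xlab = "Petal Length", ylab = "Petal Width", zlab = "Sepal Length")

print(p3d)   # lattice objects are drawn when printed
```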

Correlation Matrix (mtcars dataset)

As the correlation plot shows, the number of cylinders (cyl) and weight (wt) have the strongest negative correlations with miles per gallon (mpg) among the variable pairs shown.
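
One way to confirm this reading is to take the mpg row of the correlation matrix and sort it. A minimal sketch; the helper name `mpg_cor` is introduced here for illustration.

```r
# Correlations of every mtcars variable with mpg
mpg_cor <- cor(mtcars)["mpg", ]

# Drop the trivial self-correlation (mpg with mpg = 1)
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Ascending order: the most negative correlations come first
mpg_sorted <- sort(mpg_cor)
mpg_sorted
```

In the sorted vector, wt and cyl appear first, matching the values of about -0.868 and -0.852 in the mpg row of the matrix.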

Chi-Square Test

The Chi-Square test identifies discrepancies between the observed and the expected frequencies in one or more categories. For two categorical variables, we can use the Chi-Square test of independence:

\[\chi^2 = \sum \frac{(O - E)^2}{E}\]

Where:

  • \(O\) is the observed frequency.
  • \(E\) is the expected frequency.
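
The statistic can be computed directly from the formula and compared with `chisq.test()`. This is a sketch under assumptions: the pair of categorical variables (`am` and `vs` from mtcars) is chosen for illustration, and `correct = FALSE` turns off the continuity correction that R applies to 2x2 tables by default, so that the result matches the plain formula.

```r
# Observed frequencies O: cross-tabulation of transmission (am) and engine shape (vs)
tab <- table(am = mtcars$am, vs = mtcars$vs)

# Expected frequencies E under independence: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Sum of (O - E)^2 / E over all cells
chi_manual <- sum((tab - expected)^2 / expected)

# Built-in test without the continuity correction
chi_test <- chisq.test(tab, correct = FALSE)

all.equal(chi_manual, unname(chi_test$statistic))
```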

R Code for Bivariate Analysis

# Correlation analysis
library(dplyr)     # provides %>% and select_if()
library(corrplot)  # provides corrplot()
mtcars_numeric <- mtcars %>% select_if(is.numeric)
cor_matrix <- cor(mtcars_numeric)
cor_matrix
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

# Specific example: correlation between cyl and mpg
cor(mtcars$cyl, mtcars$mpg)
## [1] -0.852162

T-test for groups Petal.Length ~ Species

t.test(Petal.Length ~ Species, data = iris, subset = Species != "virginica")
## 
##  Welch Two Sample t-test
## 
## data:  Petal.Length by Species
## t = -39.493, df = 62.14, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group setosa and group versicolor is not equal to 0
## 95 percent confidence interval:
##  -2.939618 -2.656382
## sample estimates:
##     mean in group setosa mean in group versicolor 
##                    1.462                    4.260
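
To show where the reported t statistic and degrees of freedom come from, the Welch test can be reproduced by hand. A minimal sketch; the helper names are introduced here for illustration.

```r
# Petal lengths for the two species being compared
setosa     <- iris$Petal.Length[iris$Species == "setosa"]
versicolor <- iris$Petal.Length[iris$Species == "versicolor"]

# Squared standard error of each group mean
v1 <- var(setosa) / length(setosa)
v2 <- var(versicolor) / length(versicolor)

# Welch t statistic: difference in means over the combined standard error
t_manual <- (mean(setosa) - mean(versicolor)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (v1 + v2)^2 /
  (v1^2 / (length(setosa) - 1) + v2^2 / (length(versicolor) - 1))

round(c(t = t_manual, df = df_manual), 3)
```

Because the variances are not pooled, the degrees of freedom are generally not a whole number, which is why the output above reports df = 62.14.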