In this tutorial, we will consider how best to analyze and visualize two continuous variables. These techniques are widely used and considered foundational in quantitative research methods.
We will cover:
Pearson’s R,
Spearman’s Rank, and
Kendall’s Tau
Scatterplots
Bivariate Linear Regression
Standardized Coefficients
data(mtcars)
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
skim(mtcars)
Name | mtcars |
Number of rows | 32 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
numeric | 11 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
cyl | 0 | 1 | 6.19 | 1.79 | 4.00 | 4.00 | 6.00 | 8.00 | 8.00 | ▆▁▃▁▇ |
disp | 0 | 1 | 230.72 | 123.94 | 71.10 | 120.83 | 196.30 | 326.00 | 472.00 | ▇▃▃▃▂ |
hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
drat | 0 | 1 | 3.60 | 0.53 | 2.76 | 3.08 | 3.70 | 3.92 | 4.93 | ▇▃▇▅▁ |
wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
qsec | 0 | 1 | 17.85 | 1.79 | 14.50 | 16.89 | 17.71 | 18.90 | 22.90 | ▃▇▇▂▁ |
vs | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
am | 0 | 1 | 0.41 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
gear | 0 | 1 | 3.69 | 0.74 | 3.00 | 3.00 | 4.00 | 4.00 | 5.00 | ▇▁▆▁▂ |
carb | 0 | 1 | 2.81 | 1.62 | 1.00 | 2.00 | 2.00 | 4.00 | 8.00 | ▇▂▅▁▁ |
As discussed in lecture, Pearson’s R correlation can be thought of as a measure of how much shared variance two continuous variables have with each other. When one variable misses its central tendency by x amount, how similar is the miss to its central tendency for the other variable? Pearson’s R gives us an easy way to assess this measure of association.
We will be using the cor
function from the
rstatix
package first. This package allows you to specify
the measure of association you want to analyze with following key
options: “pearson”, “kendall”, “spearman”
We will review each one in turn using the mtcars dataset. You can either specify specific variables to use in the correlation or correlate the entire dataset. However, you will frequently have variables that are not numeric which will make the table harder to read and understand. It can be important to simply reduce the variables for the correlation analysis to a few select variables.
We will use the cor
function from the
rstatix
package. This allows you to produce the three
correlation types we have reviewed in class along with other useful
options.
r_select <- cor(mtcars$mpg, mtcars$hp, method = "pearson")
r_select
## [1] -0.7761684
r_all <-cor(mtcars, method="pearson", use="complete.obs") #complete.obs includes units with data for each variable
r_all
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
r_all_rnd <- round(r_all, 3) #Rounds the correlation matrix to 3 decimal points for easier viewing
r_all_rnd
## mpg cyl disp hp drat wt qsec vs am gear
## mpg 1.000 -0.852 -0.848 -0.776 0.681 -0.868 0.419 0.664 0.600 0.480
## cyl -0.852 1.000 0.902 0.832 -0.700 0.782 -0.591 -0.811 -0.523 -0.493
## disp -0.848 0.902 1.000 0.791 -0.710 0.888 -0.434 -0.710 -0.591 -0.556
## hp -0.776 0.832 0.791 1.000 -0.449 0.659 -0.708 -0.723 -0.243 -0.126
## drat 0.681 -0.700 -0.710 -0.449 1.000 -0.712 0.091 0.440 0.713 0.700
## wt -0.868 0.782 0.888 0.659 -0.712 1.000 -0.175 -0.555 -0.692 -0.583
## qsec 0.419 -0.591 -0.434 -0.708 0.091 -0.175 1.000 0.745 -0.230 -0.213
## vs 0.664 -0.811 -0.710 -0.723 0.440 -0.555 0.745 1.000 0.168 0.206
## am 0.600 -0.523 -0.591 -0.243 0.713 -0.692 -0.230 0.168 1.000 0.794
## gear 0.480 -0.493 -0.556 -0.126 0.700 -0.583 -0.213 0.206 0.794 1.000
## carb -0.551 0.527 0.395 0.750 -0.091 0.428 -0.656 -0.570 0.058 0.274
## carb
## mpg -0.551
## cyl 0.527
## disp 0.395
## hp 0.750
## drat -0.091
## wt 0.428
## qsec -0.656
## vs -0.570
## am 0.058
## gear 0.274
## carb 1.000
Variables in this table run both vertically and horizontally. To find a correlation between two specific variables, find the first variable in the vertical column on the left and then move horizontally until you find the second variable in the row at the top of the table. The closer to 1 or -1 the correlation coefficient is the stronger the two variables are related.
You will always see the diagonal correlations = 1; that is because the diagonal represents the correlation of a variable with itself. And a variable is always perfectly correlated with itself.
Even with fewer decimal points, a table with this many correlations
can be difficult to fully interpret. You can also visualize the
correlations using various graphs. Here, we use the
corrplot
package to visually represent the correlations in
the above table.
r_cars <- cor(mtcars, method = "pearson")
num<-corrplot(r_cars, method = 'number') # colorful number
num
## $corr
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
##
## $corrPos
## xName yName x y corr
## 1 mpg mpg 1 11 1.00000000
## 2 mpg cyl 1 10 -0.85216196
## 3 mpg disp 1 9 -0.84755138
## 4 mpg hp 1 8 -0.77616837
## 5 mpg drat 1 7 0.68117191
## 6 mpg wt 1 6 -0.86765938
## 7 mpg qsec 1 5 0.41868403
## 8 mpg vs 1 4 0.66403892
## 9 mpg am 1 3 0.59983243
## 10 mpg gear 1 2 0.48028476
## 11 mpg carb 1 1 -0.55092507
## 12 cyl mpg 2 11 -0.85216196
## 13 cyl cyl 2 10 1.00000000
## 14 cyl disp 2 9 0.90203287
## 15 cyl hp 2 8 0.83244745
## 16 cyl drat 2 7 -0.69993811
## 17 cyl wt 2 6 0.78249579
## 18 cyl qsec 2 5 -0.59124207
## 19 cyl vs 2 4 -0.81081180
## 20 cyl am 2 3 -0.52260705
## 21 cyl gear 2 2 -0.49268660
## 22 cyl carb 2 1 0.52698829
## 23 disp mpg 3 11 -0.84755138
## 24 disp cyl 3 10 0.90203287
## 25 disp disp 3 9 1.00000000
## 26 disp hp 3 8 0.79094859
## 27 disp drat 3 7 -0.71021393
## 28 disp wt 3 6 0.88797992
## 29 disp qsec 3 5 -0.43369788
## 30 disp vs 3 4 -0.71041589
## 31 disp am 3 3 -0.59122704
## 32 disp gear 3 2 -0.55556920
## 33 disp carb 3 1 0.39497686
## 34 hp mpg 4 11 -0.77616837
## 35 hp cyl 4 10 0.83244745
## 36 hp disp 4 9 0.79094859
## 37 hp hp 4 8 1.00000000
## 38 hp drat 4 7 -0.44875912
## 39 hp wt 4 6 0.65874789
## 40 hp qsec 4 5 -0.70822339
## 41 hp vs 4 4 -0.72309674
## 42 hp am 4 3 -0.24320426
## 43 hp gear 4 2 -0.12570426
## 44 hp carb 4 1 0.74981247
## 45 drat mpg 5 11 0.68117191
## 46 drat cyl 5 10 -0.69993811
## 47 drat disp 5 9 -0.71021393
## 48 drat hp 5 8 -0.44875912
## 49 drat drat 5 7 1.00000000
## 50 drat wt 5 6 -0.71244065
## 51 drat qsec 5 5 0.09120476
## 52 drat vs 5 4 0.44027846
## 53 drat am 5 3 0.71271113
## 54 drat gear 5 2 0.69961013
## 55 drat carb 5 1 -0.09078980
## 56 wt mpg 6 11 -0.86765938
## 57 wt cyl 6 10 0.78249579
## 58 wt disp 6 9 0.88797992
## 59 wt hp 6 8 0.65874789
## 60 wt drat 6 7 -0.71244065
## 61 wt wt 6 6 1.00000000
## 62 wt qsec 6 5 -0.17471588
## 63 wt vs 6 4 -0.55491568
## 64 wt am 6 3 -0.69249526
## 65 wt gear 6 2 -0.58328700
## 66 wt carb 6 1 0.42760594
## 67 qsec mpg 7 11 0.41868403
## 68 qsec cyl 7 10 -0.59124207
## 69 qsec disp 7 9 -0.43369788
## 70 qsec hp 7 8 -0.70822339
## 71 qsec drat 7 7 0.09120476
## 72 qsec wt 7 6 -0.17471588
## 73 qsec qsec 7 5 1.00000000
## 74 qsec vs 7 4 0.74453544
## 75 qsec am 7 3 -0.22986086
## 76 qsec gear 7 2 -0.21268223
## 77 qsec carb 7 1 -0.65624923
## 78 vs mpg 8 11 0.66403892
## 79 vs cyl 8 10 -0.81081180
## 80 vs disp 8 9 -0.71041589
## 81 vs hp 8 8 -0.72309674
## 82 vs drat 8 7 0.44027846
## 83 vs wt 8 6 -0.55491568
## 84 vs qsec 8 5 0.74453544
## 85 vs vs 8 4 1.00000000
## 86 vs am 8 3 0.16834512
## 87 vs gear 8 2 0.20602335
## 88 vs carb 8 1 -0.56960714
## 89 am mpg 9 11 0.59983243
## 90 am cyl 9 10 -0.52260705
## 91 am disp 9 9 -0.59122704
## 92 am hp 9 8 -0.24320426
## 93 am drat 9 7 0.71271113
## 94 am wt 9 6 -0.69249526
## 95 am qsec 9 5 -0.22986086
## 96 am vs 9 4 0.16834512
## 97 am am 9 3 1.00000000
## 98 am gear 9 2 0.79405876
## 99 am carb 9 1 0.05753435
## 100 gear mpg 10 11 0.48028476
## 101 gear cyl 10 10 -0.49268660
## 102 gear disp 10 9 -0.55556920
## 103 gear hp 10 8 -0.12570426
## 104 gear drat 10 7 0.69961013
## 105 gear wt 10 6 -0.58328700
## 106 gear qsec 10 5 -0.21268223
## 107 gear vs 10 4 0.20602335
## 108 gear am 10 3 0.79405876
## 109 gear gear 10 2 1.00000000
## 110 gear carb 10 1 0.27407284
## 111 carb mpg 11 11 -0.55092507
## 112 carb cyl 11 10 0.52698829
## 113 carb disp 11 9 0.39497686
## 114 carb hp 11 8 0.74981247
## 115 carb drat 11 7 -0.09078980
## 116 carb wt 11 6 0.42760594
## 117 carb qsec 11 5 -0.65624923
## 118 carb vs 11 4 -0.56960714
## 119 carb am 11 3 0.05753435
## 120 carb gear 11 2 0.27407284
## 121 carb carb 11 1 1.00000000
##
## $arg
## $arg$type
## [1] "full"
num2<-corrplot(r_cars, method = 'number', type="lower") #Only Show Lower Part of Diag
num2
## $corr
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
##
## $corrPos
## xName yName x y corr
## 1 mpg mpg 1 11 1.00000000
## 2 mpg cyl 1 10 -0.85216196
## 3 mpg disp 1 9 -0.84755138
## 4 mpg hp 1 8 -0.77616837
## 5 mpg drat 1 7 0.68117191
## 6 mpg wt 1 6 -0.86765938
## 7 mpg qsec 1 5 0.41868403
## 8 mpg vs 1 4 0.66403892
## 9 mpg am 1 3 0.59983243
## 10 mpg gear 1 2 0.48028476
## 11 mpg carb 1 1 -0.55092507
## 12 cyl cyl 2 10 1.00000000
## 13 cyl disp 2 9 0.90203287
## 14 cyl hp 2 8 0.83244745
## 15 cyl drat 2 7 -0.69993811
## 16 cyl wt 2 6 0.78249579
## 17 cyl qsec 2 5 -0.59124207
## 18 cyl vs 2 4 -0.81081180
## 19 cyl am 2 3 -0.52260705
## 20 cyl gear 2 2 -0.49268660
## 21 cyl carb 2 1 0.52698829
## 22 disp disp 3 9 1.00000000
## 23 disp hp 3 8 0.79094859
## 24 disp drat 3 7 -0.71021393
## 25 disp wt 3 6 0.88797992
## 26 disp qsec 3 5 -0.43369788
## 27 disp vs 3 4 -0.71041589
## 28 disp am 3 3 -0.59122704
## 29 disp gear 3 2 -0.55556920
## 30 disp carb 3 1 0.39497686
## 31 hp hp 4 8 1.00000000
## 32 hp drat 4 7 -0.44875912
## 33 hp wt 4 6 0.65874789
## 34 hp qsec 4 5 -0.70822339
## 35 hp vs 4 4 -0.72309674
## 36 hp am 4 3 -0.24320426
## 37 hp gear 4 2 -0.12570426
## 38 hp carb 4 1 0.74981247
## 39 drat drat 5 7 1.00000000
## 40 drat wt 5 6 -0.71244065
## 41 drat qsec 5 5 0.09120476
## 42 drat vs 5 4 0.44027846
## 43 drat am 5 3 0.71271113
## 44 drat gear 5 2 0.69961013
## 45 drat carb 5 1 -0.09078980
## 46 wt wt 6 6 1.00000000
## 47 wt qsec 6 5 -0.17471588
## 48 wt vs 6 4 -0.55491568
## 49 wt am 6 3 -0.69249526
## 50 wt gear 6 2 -0.58328700
## 51 wt carb 6 1 0.42760594
## 52 qsec qsec 7 5 1.00000000
## 53 qsec vs 7 4 0.74453544
## 54 qsec am 7 3 -0.22986086
## 55 qsec gear 7 2 -0.21268223
## 56 qsec carb 7 1 -0.65624923
## 57 vs vs 8 4 1.00000000
## 58 vs am 8 3 0.16834512
## 59 vs gear 8 2 0.20602335
## 60 vs carb 8 1 -0.56960714
## 61 am am 9 3 1.00000000
## 62 am gear 9 2 0.79405876
## 63 am carb 9 1 0.05753435
## 64 gear gear 10 2 1.00000000
## 65 gear carb 10 1 0.27407284
## 66 carb carb 11 1 1.00000000
##
## $arg
## $arg$type
## [1] "lower"
cir<-corrplot(r_cars, method = 'circle', type="lower", diag=FALSE) #Removes the diag corr of 1 & uses different sized circles to represent the correlation coefficient
cir
## $corr
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
##
## $corrPos
## xName yName x y corr
## 1 mpg cyl 1 10 -0.85216196
## 2 mpg disp 1 9 -0.84755138
## 3 mpg hp 1 8 -0.77616837
## 4 mpg drat 1 7 0.68117191
## 5 mpg wt 1 6 -0.86765938
## 6 mpg qsec 1 5 0.41868403
## 7 mpg vs 1 4 0.66403892
## 8 mpg am 1 3 0.59983243
## 9 mpg gear 1 2 0.48028476
## 10 mpg carb 1 1 -0.55092507
## 11 cyl disp 2 9 0.90203287
## 12 cyl hp 2 8 0.83244745
## 13 cyl drat 2 7 -0.69993811
## 14 cyl wt 2 6 0.78249579
## 15 cyl qsec 2 5 -0.59124207
## 16 cyl vs 2 4 -0.81081180
## 17 cyl am 2 3 -0.52260705
## 18 cyl gear 2 2 -0.49268660
## 19 cyl carb 2 1 0.52698829
## 20 disp hp 3 8 0.79094859
## 21 disp drat 3 7 -0.71021393
## 22 disp wt 3 6 0.88797992
## 23 disp qsec 3 5 -0.43369788
## 24 disp vs 3 4 -0.71041589
## 25 disp am 3 3 -0.59122704
## 26 disp gear 3 2 -0.55556920
## 27 disp carb 3 1 0.39497686
## 28 hp drat 4 7 -0.44875912
## 29 hp wt 4 6 0.65874789
## 30 hp qsec 4 5 -0.70822339
## 31 hp vs 4 4 -0.72309674
## 32 hp am 4 3 -0.24320426
## 33 hp gear 4 2 -0.12570426
## 34 hp carb 4 1 0.74981247
## 35 drat wt 5 6 -0.71244065
## 36 drat qsec 5 5 0.09120476
## 37 drat vs 5 4 0.44027846
## 38 drat am 5 3 0.71271113
## 39 drat gear 5 2 0.69961013
## 40 drat carb 5 1 -0.09078980
## 41 wt qsec 6 5 -0.17471588
## 42 wt vs 6 4 -0.55491568
## 43 wt am 6 3 -0.69249526
## 44 wt gear 6 2 -0.58328700
## 45 wt carb 6 1 0.42760594
## 46 qsec vs 7 4 0.74453544
## 47 qsec am 7 3 -0.22986086
## 48 qsec gear 7 2 -0.21268223
## 49 qsec carb 7 1 -0.65624923
## 50 vs am 8 3 0.16834512
## 51 vs gear 8 2 0.20602335
## 52 vs carb 8 1 -0.56960714
## 53 am gear 9 2 0.79405876
## 54 am carb 9 1 0.05753435
## 55 gear carb 10 1 0.27407284
##
## $arg
## $arg$type
## [1] "lower"
?mtcars #Describes what each variable is for easier interpretation
## starting httpd help server ... done
There are other ways to customize this type of plot. Use
?corrplot
to see different options or there are several
good tutorials on Google that goes through the plots in detail.
Ultimately, we use correlation tables like this in an exploratory fashion. It allows us to quickly assess the relationships between our variables. This process can be especially helpful at building high quality multivariate regression models and for picking key variables if you have sample size limitations (like we do with this dataset of only 32).
Now, let’s review the formula for Pearson’s R correlation and then
manually calculate it using the mtcars
data. We will use
the hp
variable as the x and the mpg
variable
as the y. Computationally it does not matter but theoretically we want
to always think about the y variable as the DV and the x as the IV. We
will do this calculation in 4-steps.
#Step 1: Calculates the deviations for both x and y
new_mt<-mtcars %>%
mutate(hp_avg=mean(hp),
mpg_avg=mean(mpg),
x_dev=hp-hp_avg,
y_dev=mpg-mpg_avg)
#Step 2: Calculate the product of residuals for mpg and hp
sum_product_residuals <- new_mt %>%
summarise(sum_product = sum(x_dev * y_dev)) %>%
pull(sum_product)
# Step 3: Calculate the sums of squares for the residuals of each variable
sum_squares_mpg <- new_mt %>%
summarise(sum_squares_mpg = sum(y_dev^2)) %>%
pull(sum_squares_mpg)
sum_squares_hp <- new_mt %>%
summarise(sum_squares_hp = sum(x_dev^2)) %>%
pull(sum_squares_hp)
# Step 4: Calculate Pearson's r
pearson_r <- sum_product_residuals / sqrt(sum_squares_mpg * sum_squares_hp)
# Output the result
pearson_r
## [1] -0.7761684
cor(mtcars$mpg, mtcars$hp, method="pearson")
## [1] -0.7761684
After manually calculating each individual component part of the
formula and the final Pearson’s R correlation coefficient, we compare
the manual calculation to the value returned by the cor
function above. We see that they are identical meaning we manually
replicated the formula accurately.
Now, we look at using the Spearman’s Rank correlation as an alternative measure of association for two variables. As discussed in lecture, this approach ranks each variable from 1:n and then looks at the relationship between their rankings. If all of the ranks were in perfect order, 1-1, 2-2, 3-3, etc, then you would have a perfect correlation of 1.
We might want to consider this approach due to a few reasons.
Nature of Variables: Some variables in the mtcars dataset, such as cyl (number of cylinders) and gear (number of forward gears), are ordinal with only a few distinct values. Pearson’s R is designed to work best with continuous, normally distributed variables. For ordinal data or variables with a limited number of categories, Spearman’s rank correlation is generally a better approach as it is less sensitive to the actual distribution of the data.
Distribution of Continuous Variables: For the continuous variables in the dataset, many exhibit skewed distributions, which can violate the normality assumption required for Pearson’s correlation. In such cases, using Spearman’s rank correlation is more appropriate because it ranks the data, thereby minimizing the influence of outliers and providing a more robust measure of association for skewed distributions.
rho_all <-cor(mtcars, method="spearman", use="complete.obs")
rho_all_rnd <- round(rho_all, 3)
rho_all_rnd
## mpg cyl disp hp drat wt qsec vs am gear
## mpg 1.000 -0.911 -0.909 -0.895 0.651 -0.886 0.467 0.707 0.562 0.543
## cyl -0.911 1.000 0.928 0.902 -0.679 0.858 -0.572 -0.814 -0.522 -0.564
## disp -0.909 0.928 1.000 0.851 -0.684 0.898 -0.460 -0.724 -0.624 -0.594
## hp -0.895 0.902 0.851 1.000 -0.520 0.775 -0.667 -0.752 -0.362 -0.331
## drat 0.651 -0.679 -0.684 -0.520 1.000 -0.750 0.092 0.447 0.687 0.745
## wt -0.886 0.858 0.898 0.775 -0.750 1.000 -0.225 -0.587 -0.738 -0.676
## qsec 0.467 -0.572 -0.460 -0.667 0.092 -0.225 1.000 0.792 -0.203 -0.148
## vs 0.707 -0.814 -0.724 -0.752 0.447 -0.587 0.792 1.000 0.168 0.283
## am 0.562 -0.522 -0.624 -0.362 0.687 -0.738 -0.203 0.168 1.000 0.808
## gear 0.543 -0.564 -0.594 -0.331 0.745 -0.676 -0.148 0.283 0.808 1.000
## carb -0.657 0.580 0.540 0.733 -0.125 0.500 -0.659 -0.634 -0.064 0.115
## carb
## mpg -0.657
## cyl 0.580
## disp 0.540
## hp 0.733
## drat -0.125
## wt 0.500
## qsec -0.659
## vs -0.634
## am -0.064
## gear 0.115
## carb 1.000
The third measure of association we will review is the Kendall’s Tau correlation. This counts the number of concordant and discordant pairs of variables between all two observations in the dataset. The more similar the variables are, the higher number of concordant pairs they will have increasing the correlation coefficient.
This approach is mostly used on small sample sizes, like we have in this example.
tau_all <-cor(mtcars, method="kendall", use="complete.obs")
tau_all_rnd <- round(tau_all, 3)
tau_all_rnd
## mpg cyl disp hp drat wt qsec vs am gear
## mpg 1.000 -0.795 -0.768 -0.743 0.465 -0.728 0.315 0.590 0.469 0.433
## cyl -0.795 1.000 0.814 0.785 -0.551 0.728 -0.449 -0.771 -0.495 -0.513
## disp -0.768 0.814 1.000 0.666 -0.499 0.743 -0.301 -0.603 -0.520 -0.476
## hp -0.743 0.785 0.666 1.000 -0.383 0.611 -0.473 -0.631 -0.304 -0.279
## drat 0.465 -0.551 -0.499 -0.383 1.000 -0.547 0.033 0.375 0.576 0.584
## wt -0.728 0.728 0.743 0.611 -0.547 1.000 -0.142 -0.488 -0.614 -0.544
## qsec 0.315 -0.449 -0.301 -0.473 0.033 -0.142 1.000 0.658 -0.169 -0.091
## vs 0.590 -0.771 -0.603 -0.631 0.375 -0.488 0.658 1.000 0.168 0.270
## am 0.469 -0.495 -0.520 -0.304 0.576 -0.614 -0.169 0.168 1.000 0.771
## gear 0.433 -0.513 -0.476 -0.279 0.584 -0.544 -0.091 0.270 0.771 1.000
## carb -0.504 0.465 0.414 0.596 -0.095 0.371 -0.506 -0.577 -0.059 0.098
## carb
## mpg -0.504
## cyl 0.465
## disp 0.414
## hp 0.596
## drat -0.095
## wt 0.371
## qsec -0.506
## vs -0.577
## am -0.059
## gear 0.098
## carb 1.000
Now let’s compare the correlations for the MPG variable across the three approaches.
# Extract the first column from each correlation matrix
r_first_col <- r_all_rnd[, 1]
tau_first_col <- tau_all_rnd[, 1]
rho_first_col <- rho_all_rnd[, 1]
# Combine them into a tibble with appropriately named columns
correlation_tibble <- tibble(
Variable = rownames(r_all_rnd),
Pearson = r_first_col,
Kendall = tau_first_col,
Spearman = rho_first_col
)
# Print the resulting tibble
print(correlation_tibble)
## # A tibble: 11 × 4
## Variable Pearson Kendall Spearman
## <chr> <dbl> <dbl> <dbl>
## 1 mpg 1 1 1
## 2 cyl -0.852 -0.795 -0.911
## 3 disp -0.848 -0.768 -0.909
## 4 hp -0.776 -0.743 -0.895
## 5 drat 0.681 0.465 0.651
## 6 wt -0.868 -0.728 -0.886
## 7 qsec 0.419 0.315 0.467
## 8 vs 0.664 0.59 0.707
## 9 am 0.6 0.469 0.562
## 10 gear 0.48 0.433 0.543
## 11 carb -0.551 -0.504 -0.657
Another way to visualize the relationship between two variables is to plot them in a scatterplot. Scatterplots are very useful tools at examining the relationship between two variables. It can quickly give you a sense about any type of relationship such as linear or some non-normal one.
We will look at a couple of easy ways to produce these graphs. To
start we will use the Base R function plot
to quickly
produce a scatterplot for each of the variable pairs. If you use all the
variables, it can be difficult to read as demonstrated in the first
table. Good approach for this is to run the plot command first, then
choose the variables from what you see that have the appropriate
variable type for a scatterplot.
Here, we pick the variables that are continuous or ordinal with what appears to be a relationship with another variable. We also drop the dichotomous variables as they do not reveal as much information about the relationship in a scatterplot.
plot(mtcars)
mtcars_select<-mtcars %>%
dplyr::select(mpg, cyl, disp, hp, wt, qsec)
plot(mtcars_select)
#Scatterplot
scatter_plot <- mtcars %>%
ggplot(aes(x = hp, y = mpg)) +
geom_point() + # Scatter plot points
labs(
title = "Scatterplot of MPG & Horsepower from mtcars data",
x = "Car Horsepower",
y = "Avg MPF"
) +
theme_minimal()
scatter_plot
#Scatterplot w/Best Fit Line
scatter_plot_bf <- mtcars %>%
ggplot(aes(x = hp, y = mpg)) +
geom_point() + # Scatter plot points
geom_smooth(method = "lm", se = FALSE, color = "blue") + # Linear best fit line
labs(
title = "Scatterplot of MPG & Horsepower from mtcars data",
x = "Car Horsepower",
y = "Avg MPF"
) +
theme_minimal()
scatter_plot_bf
## `geom_smooth()` using formula = 'y ~ x'
###Adding Correlation Coefficient to Graph
pearson_r <- cor(mtcars$mpg, mtcars$hp, method = "pearson")
scatter_plot_r <- mtcars %>%
ggplot(aes(x = hp, y = mpg)) +
geom_point() + # Scatter plot points
geom_smooth(method = "lm", se = FALSE, color = "blue") + # Linear best fit line
labs(
title = "Scatterplot of MPG & Horsepower from mtcars data",
x = "Car Horsepower",
y = "Avg MPF"
) +
theme_minimal()+
annotate("text",
x = max(mtcars$hp) * 0.4,
y = min(mtcars$mpg) * 1.2,
label = paste("Pearson's r =", round(pearson_r, 2)),
hjust = 0, vjust = 0.5,
size = 4, color = "red")
scatter_plot_r
## `geom_smooth()` using formula = 'y ~ x'
To illustrate visually what different scatterplots look like when two continuous variables are correlated at different rates, we can randomly sample two while setting the population correlation to be a specific value. We will do this for 4 Pearson’s R values:
0 Correlation
.3 Correlation
.6 Correlation
.9 Correlation
The MASS
package allows us to simulate variables that
are conditionally related - i.e. correlated at pre-specified levels.
Because we will be randomly sampling from a population, our
pre-specificed correlations will not be exact but should be roughly
accurate.
Once we load the MASS
package, we follow these
steps:
Set the seed for replication
Define the number of cases - here we use 100
Define the population mean for both variables - we use 0
Define the covariance matrix with population correlation
Randomly sample the two variables
Save the correlation between the two variables
Graph the scatterplot and annotate with the correlation
b. Must specify using geom_smooth(method="lm"....)
for
best fit line
###Simulating Conditional Probability
library(MASS)
# Set the seed for reproducibility
set.seed(1934)
# Define the number of cases
n <- 100
# Define the means of the two variables
mu <- c(0, 0)
# Define the covariance matrix with the specified correlation
sigma3 <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)
# Generate the data
simulated_data3 <- mvrnorm(n = n, mu = mu, Sigma = sigma3)
# Convert to a data frame for easier handling
simulated_data_df3 <- as.data.frame(simulated_data3)
colnames(simulated_data_df3) <- c("Variable1", "Variable2")
# Calculate and print the correlation to verify
r3<-cor(simulated_data_df3$Variable1, simulated_data_df3$Variable2)
b<-ggplot(simulated_data_df3, aes(x = Variable1, y = Variable2)) +
geom_point() + # Scatter plot points
geom_smooth(method = "lm", se = FALSE, color = "blue") + # Linear best fit line
labs(
title = "Scatterplot of Simulated Data",
x = "Variable 1",
y = "Variable 2"
) +
theme_minimal()+
annotate("text",
x = max(simulated_data_df3$Variable1) * 0.4,
y = min(simulated_data_df3$Variable2) * 1.2,
label = paste("Pearson's r =", round(r3, 2)),
hjust = 0, vjust = 0.5,
size = 4, color = "red")
b
## `geom_smooth()` using formula = 'y ~ x'
Here, we used ggplot
to create 4 scatterplots with
different pre-specified correlations. As the correlation coefficient
increases(decreases if negative), the best fit line becomes steeper.
This is because as the correlation gets close to 1 or -1, the variance
between the two variables is becoming nearly identical.
create_plot <- function(correlation) {
set.seed(1934)
n <- 100
mu <- c(0, 0)
sigma <- matrix(c(1, correlation, correlation, 1), nrow = 2)
simulated_data <- mvrnorm(n = n, mu = mu, Sigma = sigma)
simulated_data_df <- as.data.frame(simulated_data)
colnames(simulated_data_df) <- c("Variable1", "Variable2")
r <- cor(simulated_data_df$Variable1, simulated_data_df$Variable2)
plot <- ggplot(simulated_data_df, aes(x = Variable1, y = Variable2)) +
geom_point() + # Scatter plot points
geom_smooth(method = "lm", se = FALSE, color = "blue") + # Linear best fit line
labs(
title = paste("Scatterplot (r =", round(r, 2), ")"),
x = "Variable 1",
y = "Variable 2"
) +
theme_minimal() +
annotate("text",
x = max(simulated_data_df$Variable1) * 0.4,
y = min(simulated_data_df$Variable2) * 1.2,
label = paste("Pearson's r =", round(r, 2)),
hjust = 0, vjust = 0.5,
size = 4, color = "red")
return(plot)
}
# Create plots for different correlations
plot_r0 <- create_plot(0)
plot_r3 <- create_plot(0.3)
plot_r6 <- create_plot(0.6)
plot_r9 <- create_plot(0.9)
# Arrange plots in a 2x2 grid
grid.arrange(plot_r0, plot_r3, plot_r6, plot_r9, ncol = 2, nrow = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
We also used the annotate
function from
ggplot2
to add the actual correlation coefficient. While
this is not always necessary, it can be helpful as you start to
visualize what different correlation coefficients look like in the a
scatterplot.
Because scatterplots can be so useful, the following function allows you to quickly create a scatterplot between any two variables in a dataset along with its Pearson’s R correlation coefficient. Functions are something that are manually created that allows the user to do the same thing more quickly and efficiently. In this function, you have to specify 3 things:
To get a function to work, you have to first load it into your R environment. After it is loaded, you simply have to call the function to work and specify the required information. Note this function is also saved as a stand alone file on our Canvas page.
scatter <- function(data, x, y) {
# Ensure the variables exist in the data frame
if(!(x %in% names(data)) | !(y %in% names(data))) {
stop("The specified variables are not in the data frame.")
}
# Calculate Pearson's correlation
correlation <- cor(data[[x]], data[[y]], use = "complete.obs")
# Create the scatterplot
plot <- ggplot(data, aes(x = .data[[x]], y = .data[[y]])) +
geom_point() + # Scatter plot points
geom_smooth(method = "lm", se = FALSE, color = "blue") + # Linear best fit line
labs(
title = paste("Scatterplot of", y, "vs", x),
x = x,
y = y
) +
theme_minimal() +
annotate("text",
x = Inf, y = -Inf,
label = paste("Pearson's r =", round(correlation, 2)),
hjust = 1.1, vjust = -0.5,
size = 4, color = "red")
return(plot)
}
scatter(mtcars, "mpg", "wt")
## `geom_smooth()` using formula = 'y ~ x'
# Example usage
# df <- data.frame(var1 = rnorm(100), var2 = rnorm(100))
# scatter(df, "var1", "var2")
Correlations are useful in that they statistically assess how two variables are related to each other. While the coefficient is useful to us, it does not tell us other information that can be useful in understanding how one variable influences another.
We turn to bivariate - meaning two variables - linear regression to better assess the real-world influence of some x on some y. Linear regression is one of most frequently used analytic approaches because of its simplicity yet strength. As we will see, a linear regression provides us with additional information that correlation coefficients do not. Most notably, the beta weight which is the estimated real-world impact on expected y for increasing the x variable by 1 unit. This technique allows us to better understand the world and how variables are inter-related compared to correlations. Once again though, this is considered a correlational approach in most instances. You have to use theory and empirics to argue for causation with observational data.
What is the estimated impact of increasing a car’s horsepower on the average miles per gallon for a car in the 1970s? Above we saw a strong correlation between the two variables of -.776 (Pearson’s R). While this reveals a strong relationship between the two, it does not tell us the real-world impact of increasing horsepower in a car on miles-per-gallon estimates. That number would be useful to car engineer’s working to increase MPG in their fleet.
Because we have already looked at the histograms and scatterplot for the two variables, we know that OLS is an appropriate modeling technique to use with these variable types.
We use the lm
function from the rstatix
package to estimate the linear regression. We usually want to do
something with the results so get in the habit of saving them instead of
simply displaying them immediately. Here we save the OLS model as
lm1
which has mpg
as the dependent variable
being predicted by hp
, the measure of a car’s horsepower.
The dependent variable always comes before the ~
and the
independent variables after it. You finish this line of code by
including data=data_name
at the end updating with the
appropriate name of your data.
lm1<-lm(mpg~hp, data=mtcars)
summary(lm1)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
By running the summary
command, we see the results of
the bivariate regression model. Specifically, we see the intercept of
30.09, meaning the estimated MPG
for a car with 0
horsepower. This is an instance of when the intercept does not make
sense so no need to interpret it. Later in the semester we will learn
how to use stargazer
and jtools
to create
nicer looking regression tables.
We also see the beta weight estimate for how hp
is
related to mpg
with a value of -.0682. The negative sign
tells you that as the independent variable gets larger, here the more
horsepower a car has, the smaller the dependent variables gets on
average, here the less MPG a car gets on average. The coefficient value
also tells us something specific as well beyond the direction of the
relationship. It tells us that on average, for every 1 unit more
horsepower in a car, the average miles-per-gallon for the auto does down
about .0682 MPG.
We also see three other bits of information about the uncertainty in that estimate: standard error, t-statistic, and p-value. The p-value will come with stars included if you have a significant effect (i.e. a p-value < some specified level). Here we see a large t-statistic of -6.742 with associated p-value of 1.79e-07. This is simply scientific notation to indicate that the p-value has a lot of 0s to begin. Here, we would reject the null hypothesis that there was no relationship between a car’s horsepower and miles-per-gallon.
One useful trick at ascertaining a significant p-value is to divide the absolute value of coefficient size by the standard error. If the value is >= 2, you would reject the null hypothesis of no difference. This is just a quick trick that should not replace the value of precisely calculating the values.
It can be helpful to think about the scales each variable is measured in to decide if that is a big effect or not. Here we review descriptive stats for just the two variables included in the model.
mtcars %>%
dplyr::select(mpg, hp) %>%
get_summary_stats(type=("full"))
## # A tibble: 2 × 13
## variable n min max median q1 q3 iqr mad mean sd se
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mpg 32 10.4 33.9 19.2 15.4 22.8 7.38 5.41 20.1 6.03 1.06
## 2 hp 32 52 335 123 96.5 180 83.5 77.1 147. 68.6 12.1
## # ℹ 1 more variable: ci <dbl>
For the mpg
variable, we see the range(max-min) is
roughly 24 mpg while for the hp
variable, the range is
around 270. So a 1 unit increase in horsepower is very small. When
thinking about the effect size from a linear regression, always review
the variable descriptives to get a sense of the range. While a 1 unit
increase in HP would not be much, a 100 unit change would be. To
interpret the beta weight of .0682 in a more realistic scenario, simply
multiply the bw x the new value of 100 for 6.82. Here, you would say
that for every 100 unit increase in horsepower, the average MPG for a
car would decrease around 6.82 MPG.
Frequently, you find yourself in need to saving the predicted value
of y and the residuals from a linear regression. That can easily be
handled in a variety of ways. We will use the predict
function from the rstatix
package.
Here we predict the expected value of y from the lm1
model before calculating the residual as the difference between y and
the expected value of y from the model.
mtcars$ev_mpg<-predict(lm1) #Save expected value of y from linear model lm1
mtcars$residual<-mtcars$mpg-mtcars$ev_mpg #Calculate the residuals based on y and expected y
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb ev_mpg
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 22.59375
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 22.59375
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 23.75363
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 22.59375
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 18.15891
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 22.93489
## residual
## Mazda RX4 -1.5937500
## Mazda RX4 Wag -1.5937500
## Datsun 710 -0.9536307
## Hornet 4 Drive -1.1937500
## Hornet Sportabout 0.5410881
## Valiant -4.8348913
list(
mpg = mtcars$mpg,
expected_mpg = mtcars$ev_mpg,
residual = mtcars$residual)
## $mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
##
## $expected_mpg
## [1] 22.593750 22.593750 23.753631 22.593750 18.158912 22.934891 13.382932
## [8] 25.868707 23.617174 21.706782 21.706782 17.817770 17.817770 17.817770
## [15] 16.112064 15.429781 14.406357 25.595794 26.550990 25.664022 23.480718
## [22] 19.864619 19.864619 13.382932 18.158912 25.595794 23.890087 22.389065
## [29] 12.086595 18.158912 7.242387 22.661978
##
## $residual
## [1] -1.59374995 -1.59374995 -0.95363068 -1.19374995 0.54108812 -4.83489134
## [7] 0.91706759 -1.46870730 -0.81717412 -2.50678234 -3.90678234 -1.41777049
## [13] -0.51777049 -2.61777049 -5.71206353 -5.02978075 0.29364342 6.80420581
## [19] 3.84900992 8.23597754 -1.98071757 -4.36461883 -4.66461883 -0.08293241
## [25] 1.04108812 1.70420581 2.10991276 8.01093488 3.71340487 1.54108812
## [31] 7.75761261 -1.26197823
You can also use the ggpredict
function from the
ggeffects
package which calculates the expected value of y
at different levels of x. We will work with this package frequently
later in the semester when we focus on data visualization. Here, we use
it just to easily review the expected MPG by different horsepower
levels.
a<-ggpredict(lm1, terms=("hp")) #Sets values based on the data itself
a
## # Predicted values of mpg
##
## hp | Predicted | 95% CI
## --------------------------------
## 50 | 26.69 | [24.25, 29.12]
## 85 | 24.30 | [22.41, 26.19]
## 120 | 21.91 | [20.41, 23.41]
## 155 | 19.52 | [18.12, 20.93]
## 195 | 16.79 | [15.08, 18.51]
## 230 | 14.41 | [12.19, 16.62]
## 265 | 12.02 | [ 9.20, 14.83]
## 335 | 7.24 | [ 3.11, 11.38]
##
## Not all rows are shown in the ouput. Use `print(..., n = Inf)` to show
## all rows.
b<-ggpredict(lm1, terms=("hp [50:350, by =100]")) #Manually set values of X
b
## # Predicted values of mpg
##
## hp | Predicted | 95% CI
## --------------------------------
## 50 | 26.69 | [24.25, 29.12]
## 150 | 19.86 | [18.47, 21.26]
## 250 | 13.04 | [10.49, 15.59]
## 350 | 6.22 | [ 1.79, 10.65]
We can also use the deviance for both x and y to calculate the beta weight of -.0682 that was revealed in the previous linear regression. We will once again do that one step at a time following the formula discussed in lecture.
#Step 1: Calculates the deviations for both x and y
new_mt<-mtcars %>%
mutate(hp_avg=mean(hp),
mpg_avg=mean(mpg),
x_dev=hp-hp_avg,
y_dev=mpg-mpg_avg)
#Step 2: Calculate the Beta
beta <- new_mt %>%
summarise(
numerator = sum(x_dev*y_dev),
denominator= sum(x_dev^2),
beta = numerator/denominator)
print(beta)
## numerator denominator beta
## 1 -9942.694 145726.9 -0.06822828
summary(lm1)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
Once again, we verify our work and see that the manual calculation of
the beta weight and the beta weight from the lm
model are
identical. Both this illustration and the previous one are designed to
get you thinking specifically about how these coefficients are
calculated. You generally will never need to hand-calculate these things
but it is good to know how to do it. If nothing else, it should help you
with the intuition behind the calculations and especially why
understanding deviance/residuals is so important in these types of
analyses.
The above linear regression beta weight is called an unstandardized coefficient meaning it takes on the scale of itself. The Pearson’s R correlation coefficient is simply a standardized version of this unstandardized beta weight.
We will illustrate this is two ways: First, using the
QuantPsych
package to directly calculate the standardized
coefficient and then second calculating z-scores for the x and y
variable and then estimating a linear regression on these transformed
variables.
We first reestimate the lm1 model then we use the
lm.beta
function from the QuantPsych
package.
This calculates the standardized beta weight which should match the
Pearson’s R correlation exactly.
lm1<-lm(mpg ~ hp, data=mtcars)
lm1_standardized <- lm.beta(lm1) #Use lm.beta from QuantPsych package.
summary(lm1_standardized)# Print the standardized coefficients
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7762 -0.7762 -0.7762 -0.7762 -0.7762 -0.7762
cor(mtcars$mpg, mtcars$hp, method="pearson")
## [1] -0.7761684
We see that both values are -.776, showing that the standardized coefficient from a linear regression and a Pearson’s R correlation is the same thing. This only works in a bivariate world and not in a multi-variate one.
Now, we calculate the z-score for both mpg and hp then reestimate the regression model. The coefficient should be -.776.
mtcars<-mtcars %>% #This calculates the z-score while keeping all of the original variables in the df
mutate(mean_mpg = mean(mpg),
n=n(),
sd_mpg=sd(mpg),
zscore_mpg = ((mpg - mean_mpg)/sd_mpg),
mean_hp = mean(hp),
sd_hp=sd(hp),
zscore_hp = ((hp - mean_hp)/sd_hp))
lm_z<-lm(zscore_mpg ~ zscore_hp, data=mtcars)
stargazer(lm_z, type="text", digits=4)
##
## ===============================================
## Dependent variable:
## ---------------------------
## zscore_mpg
## -----------------------------------------------
## zscore_hp -0.7762***
## (0.1151)
##
## Constant -0.0000
## (0.1133)
##
## -----------------------------------------------
## Observations 32
## R2 0.6024
## Adjusted R2 0.5892
## Residual Std. Error 0.6409 (df = 30)
## F Statistic 45.4598*** (df = 1; 30)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
And that is exactly what we see in these regression results. The
coefficient between the z-score version of x and the z-score version of
y is -.7762, just like the Pearson’s R coefficient and the standardized
beta weight from the QuantPsych
package.