Correlations, Scatter Plots and OLS Intitution

Introduction

In this tutorial, we will consider how best to analyze and visualize two continuous variables. These techniques are widely used and considered foundational in quantitative research methods.

We will cover:

Pearson’s R,
Spearman’s Rank, and
Kendall’s Tau
Scatterplots
Bivariate Linear Regression
Standardized Coefficients

data(mtcars)
summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

skim(mtcars)

Data summary
Name	mtcars
Number of rows	32
Number of columns	11
_______________________
Column type frequency:
numeric	11
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
mpg	1	20.09	6.03	10.40	15.43	19.20	22.80	33.90	▃▇▅▁▂
cyl	1	6.19	1.79	4.00	4.00	6.00	8.00	8.00	▆▁▃▁▇
disp	1	230.72	123.94	71.10	120.83	196.30	326.00	472.00	▇▃▃▃▂
hp	1	146.69	68.56	52.00	96.50	123.00	180.00	335.00	▇▇▆▃▁
drat	1	3.60	0.53	2.76	3.08	3.70	3.92	4.93	▇▃▇▅▁
wt	1	3.22	0.98	1.51	2.58	3.33	3.61	5.42	▃▃▇▁▂
qsec	1	17.85	1.79	14.50	16.89	17.71	18.90	22.90	▃▇▇▂▁
vs	1	0.44	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▆
am	1	0.41	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▆
gear	1	3.69	0.74	3.00	3.00	4.00	4.00	5.00	▇▁▆▁▂
carb	1	2.81	1.62	1.00	2.00	2.00	4.00	8.00	▇▂▅▁▁

Pearson’s R Correlation

As discussed in lecture, Pearson’s R correlation can be thought of as a measure of how much shared variance two continuous variables have with each other. When one variable misses its central tendency by x amount, how similar is the miss to its central tendency for the other variable? Pearson’s R gives us an easy way to assess this measure of association.

We will be using the cor function from the rstatix package first. This package allows you to specify the measure of association you want to analyze with following key options: “pearson”, “kendall”, “spearman”

We will review each one in turn using the mtcars dataset. You can either specify specific variables to use in the correlation or correlate the entire dataset. However, you will frequently have variables that are not numeric which will make the table harder to read and understand. It can be important to simply reduce the variables for the correlation analysis to a few select variables.

We will use the cor function from the rstatix package. This allows you to produce the three correlation types we have reviewed in class along with other useful options.

r_select <- cor(mtcars$mpg, mtcars$hp, method = "pearson")
r_select

## [1] -0.7761684

r_all <-cor(mtcars, method="pearson", use="complete.obs") #complete.obs includes units with data for each variable
r_all

##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

r_all_rnd <- round(r_all, 3) #Rounds the correlation matrix to 3 decimal points for easier viewing
r_all_rnd

##         mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear
## mpg   1.000 -0.852 -0.848 -0.776  0.681 -0.868  0.419  0.664  0.600  0.480
## cyl  -0.852  1.000  0.902  0.832 -0.700  0.782 -0.591 -0.811 -0.523 -0.493
## disp -0.848  0.902  1.000  0.791 -0.710  0.888 -0.434 -0.710 -0.591 -0.556
## hp   -0.776  0.832  0.791  1.000 -0.449  0.659 -0.708 -0.723 -0.243 -0.126
## drat  0.681 -0.700 -0.710 -0.449  1.000 -0.712  0.091  0.440  0.713  0.700
## wt   -0.868  0.782  0.888  0.659 -0.712  1.000 -0.175 -0.555 -0.692 -0.583
## qsec  0.419 -0.591 -0.434 -0.708  0.091 -0.175  1.000  0.745 -0.230 -0.213
## vs    0.664 -0.811 -0.710 -0.723  0.440 -0.555  0.745  1.000  0.168  0.206
## am    0.600 -0.523 -0.591 -0.243  0.713 -0.692 -0.230  0.168  1.000  0.794
## gear  0.480 -0.493 -0.556 -0.126  0.700 -0.583 -0.213  0.206  0.794  1.000
## carb -0.551  0.527  0.395  0.750 -0.091  0.428 -0.656 -0.570  0.058  0.274
##        carb
## mpg  -0.551
## cyl   0.527
## disp  0.395
## hp    0.750
## drat -0.091
## wt    0.428
## qsec -0.656
## vs   -0.570
## am    0.058
## gear  0.274
## carb  1.000

Variables in this table run both vertically and horizontally. To find a correlation between two specific variables, find the first variable in the vertical column on the left and then move horizontally until you find the second variable in the row at the top of the table. The closer to 1 or -1 the correlation coefficient is the stronger the two variables are related.

You will always see the diagonal correlations = 1; that is because the diagonal represents the correlation of a variable with itself. And a variable is always perfectly correlated with itself.

Even with fewer decimal points, a table with this many correlations can be difficult to fully interpret. You can also visualize the correlations using various graphs. Here, we use the corrplot package to visually represent the correlations in the above table.

r_cars <- cor(mtcars, method = "pearson")
num<-corrplot(r_cars, method = 'number') # colorful number

num

## $corr
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
## 
## $corrPos
##     xName yName  x  y        corr
## 1     mpg   mpg  1 11  1.00000000
## 2     mpg   cyl  1 10 -0.85216196
## 3     mpg  disp  1  9 -0.84755138
## 4     mpg    hp  1  8 -0.77616837
## 5     mpg  drat  1  7  0.68117191
## 6     mpg    wt  1  6 -0.86765938
## 7     mpg  qsec  1  5  0.41868403
## 8     mpg    vs  1  4  0.66403892
## 9     mpg    am  1  3  0.59983243
## 10    mpg  gear  1  2  0.48028476
## 11    mpg  carb  1  1 -0.55092507
## 12    cyl   mpg  2 11 -0.85216196
## 13    cyl   cyl  2 10  1.00000000
## 14    cyl  disp  2  9  0.90203287
## 15    cyl    hp  2  8  0.83244745
## 16    cyl  drat  2  7 -0.69993811
## 17    cyl    wt  2  6  0.78249579
## 18    cyl  qsec  2  5 -0.59124207
## 19    cyl    vs  2  4 -0.81081180
## 20    cyl    am  2  3 -0.52260705
## 21    cyl  gear  2  2 -0.49268660
## 22    cyl  carb  2  1  0.52698829
## 23   disp   mpg  3 11 -0.84755138
## 24   disp   cyl  3 10  0.90203287
## 25   disp  disp  3  9  1.00000000
## 26   disp    hp  3  8  0.79094859
## 27   disp  drat  3  7 -0.71021393
## 28   disp    wt  3  6  0.88797992
## 29   disp  qsec  3  5 -0.43369788
## 30   disp    vs  3  4 -0.71041589
## 31   disp    am  3  3 -0.59122704
## 32   disp  gear  3  2 -0.55556920
## 33   disp  carb  3  1  0.39497686
## 34     hp   mpg  4 11 -0.77616837
## 35     hp   cyl  4 10  0.83244745
## 36     hp  disp  4  9  0.79094859
## 37     hp    hp  4  8  1.00000000
## 38     hp  drat  4  7 -0.44875912
## 39     hp    wt  4  6  0.65874789
## 40     hp  qsec  4  5 -0.70822339
## 41     hp    vs  4  4 -0.72309674
## 42     hp    am  4  3 -0.24320426
## 43     hp  gear  4  2 -0.12570426
## 44     hp  carb  4  1  0.74981247
## 45   drat   mpg  5 11  0.68117191
## 46   drat   cyl  5 10 -0.69993811
## 47   drat  disp  5  9 -0.71021393
## 48   drat    hp  5  8 -0.44875912
## 49   drat  drat  5  7  1.00000000
## 50   drat    wt  5  6 -0.71244065
## 51   drat  qsec  5  5  0.09120476
## 52   drat    vs  5  4  0.44027846
## 53   drat    am  5  3  0.71271113
## 54   drat  gear  5  2  0.69961013
## 55   drat  carb  5  1 -0.09078980
## 56     wt   mpg  6 11 -0.86765938
## 57     wt   cyl  6 10  0.78249579
## 58     wt  disp  6  9  0.88797992
## 59     wt    hp  6  8  0.65874789
## 60     wt  drat  6  7 -0.71244065
## 61     wt    wt  6  6  1.00000000
## 62     wt  qsec  6  5 -0.17471588
## 63     wt    vs  6  4 -0.55491568
## 64     wt    am  6  3 -0.69249526
## 65     wt  gear  6  2 -0.58328700
## 66     wt  carb  6  1  0.42760594
## 67   qsec   mpg  7 11  0.41868403
## 68   qsec   cyl  7 10 -0.59124207
## 69   qsec  disp  7  9 -0.43369788
## 70   qsec    hp  7  8 -0.70822339
## 71   qsec  drat  7  7  0.09120476
## 72   qsec    wt  7  6 -0.17471588
## 73   qsec  qsec  7  5  1.00000000
## 74   qsec    vs  7  4  0.74453544
## 75   qsec    am  7  3 -0.22986086
## 76   qsec  gear  7  2 -0.21268223
## 77   qsec  carb  7  1 -0.65624923
## 78     vs   mpg  8 11  0.66403892
## 79     vs   cyl  8 10 -0.81081180
## 80     vs  disp  8  9 -0.71041589
## 81     vs    hp  8  8 -0.72309674
## 82     vs  drat  8  7  0.44027846
## 83     vs    wt  8  6 -0.55491568
## 84     vs  qsec  8  5  0.74453544
## 85     vs    vs  8  4  1.00000000
## 86     vs    am  8  3  0.16834512
## 87     vs  gear  8  2  0.20602335
## 88     vs  carb  8  1 -0.56960714
## 89     am   mpg  9 11  0.59983243
## 90     am   cyl  9 10 -0.52260705
## 91     am  disp  9  9 -0.59122704
## 92     am    hp  9  8 -0.24320426
## 93     am  drat  9  7  0.71271113
## 94     am    wt  9  6 -0.69249526
## 95     am  qsec  9  5 -0.22986086
## 96     am    vs  9  4  0.16834512
## 97     am    am  9  3  1.00000000
## 98     am  gear  9  2  0.79405876
## 99     am  carb  9  1  0.05753435
## 100  gear   mpg 10 11  0.48028476
## 101  gear   cyl 10 10 -0.49268660
## 102  gear  disp 10  9 -0.55556920
## 103  gear    hp 10  8 -0.12570426
## 104  gear  drat 10  7  0.69961013
## 105  gear    wt 10  6 -0.58328700
## 106  gear  qsec 10  5 -0.21268223
## 107  gear    vs 10  4  0.20602335
## 108  gear    am 10  3  0.79405876
## 109  gear  gear 10  2  1.00000000
## 110  gear  carb 10  1  0.27407284
## 111  carb   mpg 11 11 -0.55092507
## 112  carb   cyl 11 10  0.52698829
## 113  carb  disp 11  9  0.39497686
## 114  carb    hp 11  8  0.74981247
## 115  carb  drat 11  7 -0.09078980
## 116  carb    wt 11  6  0.42760594
## 117  carb  qsec 11  5 -0.65624923
## 118  carb    vs 11  4 -0.56960714
## 119  carb    am 11  3  0.05753435
## 120  carb  gear 11  2  0.27407284
## 121  carb  carb 11  1  1.00000000
## 
## $arg
## $arg$type
## [1] "full"

num2<-corrplot(r_cars, method = 'number',  type="lower")  #Only Show Lower Part of Diag

num2

## $corr
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
## 
## $corrPos
##    xName yName  x  y        corr
## 1    mpg   mpg  1 11  1.00000000
## 2    mpg   cyl  1 10 -0.85216196
## 3    mpg  disp  1  9 -0.84755138
## 4    mpg    hp  1  8 -0.77616837
## 5    mpg  drat  1  7  0.68117191
## 6    mpg    wt  1  6 -0.86765938
## 7    mpg  qsec  1  5  0.41868403
## 8    mpg    vs  1  4  0.66403892
## 9    mpg    am  1  3  0.59983243
## 10   mpg  gear  1  2  0.48028476
## 11   mpg  carb  1  1 -0.55092507
## 12   cyl   cyl  2 10  1.00000000
## 13   cyl  disp  2  9  0.90203287
## 14   cyl    hp  2  8  0.83244745
## 15   cyl  drat  2  7 -0.69993811
## 16   cyl    wt  2  6  0.78249579
## 17   cyl  qsec  2  5 -0.59124207
## 18   cyl    vs  2  4 -0.81081180
## 19   cyl    am  2  3 -0.52260705
## 20   cyl  gear  2  2 -0.49268660
## 21   cyl  carb  2  1  0.52698829
## 22  disp  disp  3  9  1.00000000
## 23  disp    hp  3  8  0.79094859
## 24  disp  drat  3  7 -0.71021393
## 25  disp    wt  3  6  0.88797992
## 26  disp  qsec  3  5 -0.43369788
## 27  disp    vs  3  4 -0.71041589
## 28  disp    am  3  3 -0.59122704
## 29  disp  gear  3  2 -0.55556920
## 30  disp  carb  3  1  0.39497686
## 31    hp    hp  4  8  1.00000000
## 32    hp  drat  4  7 -0.44875912
## 33    hp    wt  4  6  0.65874789
## 34    hp  qsec  4  5 -0.70822339
## 35    hp    vs  4  4 -0.72309674
## 36    hp    am  4  3 -0.24320426
## 37    hp  gear  4  2 -0.12570426
## 38    hp  carb  4  1  0.74981247
## 39  drat  drat  5  7  1.00000000
## 40  drat    wt  5  6 -0.71244065
## 41  drat  qsec  5  5  0.09120476
## 42  drat    vs  5  4  0.44027846
## 43  drat    am  5  3  0.71271113
## 44  drat  gear  5  2  0.69961013
## 45  drat  carb  5  1 -0.09078980
## 46    wt    wt  6  6  1.00000000
## 47    wt  qsec  6  5 -0.17471588
## 48    wt    vs  6  4 -0.55491568
## 49    wt    am  6  3 -0.69249526
## 50    wt  gear  6  2 -0.58328700
## 51    wt  carb  6  1  0.42760594
## 52  qsec  qsec  7  5  1.00000000
## 53  qsec    vs  7  4  0.74453544
## 54  qsec    am  7  3 -0.22986086
## 55  qsec  gear  7  2 -0.21268223
## 56  qsec  carb  7  1 -0.65624923
## 57    vs    vs  8  4  1.00000000
## 58    vs    am  8  3  0.16834512
## 59    vs  gear  8  2  0.20602335
## 60    vs  carb  8  1 -0.56960714
## 61    am    am  9  3  1.00000000
## 62    am  gear  9  2  0.79405876
## 63    am  carb  9  1  0.05753435
## 64  gear  gear 10  2  1.00000000
## 65  gear  carb 10  1  0.27407284
## 66  carb  carb 11  1  1.00000000
## 
## $arg
## $arg$type
## [1] "lower"

cir<-corrplot(r_cars, method = 'circle',  type="lower", diag=FALSE) #Removes the diag corr of 1 & uses different sized circles to represent the correlation coefficient

cir

## $corr
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
## 
## $corrPos
##    xName yName  x  y        corr
## 1    mpg   cyl  1 10 -0.85216196
## 2    mpg  disp  1  9 -0.84755138
## 3    mpg    hp  1  8 -0.77616837
## 4    mpg  drat  1  7  0.68117191
## 5    mpg    wt  1  6 -0.86765938
## 6    mpg  qsec  1  5  0.41868403
## 7    mpg    vs  1  4  0.66403892
## 8    mpg    am  1  3  0.59983243
## 9    mpg  gear  1  2  0.48028476
## 10   mpg  carb  1  1 -0.55092507
## 11   cyl  disp  2  9  0.90203287
## 12   cyl    hp  2  8  0.83244745
## 13   cyl  drat  2  7 -0.69993811
## 14   cyl    wt  2  6  0.78249579
## 15   cyl  qsec  2  5 -0.59124207
## 16   cyl    vs  2  4 -0.81081180
## 17   cyl    am  2  3 -0.52260705
## 18   cyl  gear  2  2 -0.49268660
## 19   cyl  carb  2  1  0.52698829
## 20  disp    hp  3  8  0.79094859
## 21  disp  drat  3  7 -0.71021393
## 22  disp    wt  3  6  0.88797992
## 23  disp  qsec  3  5 -0.43369788
## 24  disp    vs  3  4 -0.71041589
## 25  disp    am  3  3 -0.59122704
## 26  disp  gear  3  2 -0.55556920
## 27  disp  carb  3  1  0.39497686
## 28    hp  drat  4  7 -0.44875912
## 29    hp    wt  4  6  0.65874789
## 30    hp  qsec  4  5 -0.70822339
## 31    hp    vs  4  4 -0.72309674
## 32    hp    am  4  3 -0.24320426
## 33    hp  gear  4  2 -0.12570426
## 34    hp  carb  4  1  0.74981247
## 35  drat    wt  5  6 -0.71244065
## 36  drat  qsec  5  5  0.09120476
## 37  drat    vs  5  4  0.44027846
## 38  drat    am  5  3  0.71271113
## 39  drat  gear  5  2  0.69961013
## 40  drat  carb  5  1 -0.09078980
## 41    wt  qsec  6  5 -0.17471588
## 42    wt    vs  6  4 -0.55491568
## 43    wt    am  6  3 -0.69249526
## 44    wt  gear  6  2 -0.58328700
## 45    wt  carb  6  1  0.42760594
## 46  qsec    vs  7  4  0.74453544
## 47  qsec    am  7  3 -0.22986086
## 48  qsec  gear  7  2 -0.21268223
## 49  qsec  carb  7  1 -0.65624923
## 50    vs    am  8  3  0.16834512
## 51    vs  gear  8  2  0.20602335
## 52    vs  carb  8  1 -0.56960714
## 53    am  gear  9  2  0.79405876
## 54    am  carb  9  1  0.05753435
## 55  gear  carb 10  1  0.27407284
## 
## $arg
## $arg$type
## [1] "lower"

?mtcars #Describes what each variable is for easier interpretation

## starting httpd help server ... done

There are other ways to customize this type of plot. Use ?corrplot to see different options or there are several good tutorials on Google that goes through the plots in detail.

Ultimately, we use correlation tables like this in an exploratory fashion. It allows us to quickly assess the relationships between our variables. This process can be especially helpful at building high quality multivariate regression models and for picking key variables if you have sample size limitations (like we do with this dataset of only 32).

Manual Calculation of Pearson’s R

Now, let’s review the formula for Pearson’s R correlation and then manually calculate it using the mtcars data. We will use the hp variable as the x and the mpg variable as the y. Computationally it does not matter but theoretically we want to always think about the y variable as the DV and the x as the IV. We will do this calculation in 4-steps.

Calculate the deviance - difference from each unit value to the mean value of that variable - for both the x and y variable in the correlation.
Calculate the sum of the product of the deviances
Calculate the sum of the product of the sum of squares
Calculate the actual Pearson’s R coefficient

#Step 1: Calculates the deviations for both x and y
new_mt<-mtcars %>% 
  mutate(hp_avg=mean(hp),
         mpg_avg=mean(mpg),
          x_dev=hp-hp_avg,
         y_dev=mpg-mpg_avg)
  
#Step 2: Calculate the product of residuals for mpg and hp
sum_product_residuals <- new_mt %>%
  summarise(sum_product = sum(x_dev * y_dev)) %>%
  pull(sum_product)

# Step 3: Calculate the sums of squares for the residuals of each variable
sum_squares_mpg <- new_mt %>%
  summarise(sum_squares_mpg = sum(y_dev^2)) %>%
  pull(sum_squares_mpg)

sum_squares_hp <- new_mt %>%
  summarise(sum_squares_hp = sum(x_dev^2)) %>%
  pull(sum_squares_hp)

# Step 4: Calculate Pearson's r
pearson_r <- sum_product_residuals / sqrt(sum_squares_mpg * sum_squares_hp)

# Output the result
pearson_r

## [1] -0.7761684

cor(mtcars$mpg, mtcars$hp, method="pearson")

## [1] -0.7761684

After manually calculating each individual component part of the formula and the final Pearson’s R correlation coefficient, we compare the manual calculation to the value returned by the cor function above. We see that they are identical meaning we manually replicated the formula accurately.

Spearman’s Rank Correlation

Now, we look at using the Spearman’s Rank correlation as an alternative measure of association for two variables. As discussed in lecture, this approach ranks each variable from 1:n and then looks at the relationship between their rankings. If all of the ranks were in perfect order, 1-1, 2-2, 3-3, etc, then you would have a perfect correlation of 1.

We might want to consider this approach due to a few reasons.

Nature of Variables: Some variables in the mtcars dataset, such as cyl (number of cylinders) and gear (number of forward gears), are ordinal with only a few distinct values. Pearson’s R is designed to work best with continuous, normally distributed variables. For ordinal data or variables with a limited number of categories, Spearman’s rank correlation is generally a better approach as it is less sensitive to the actual distribution of the data.
Distribution of Continuous Variables: For the continuous variables in the dataset, many exhibit skewed distributions, which can violate the normality assumption required for Pearson’s correlation. In such cases, using Spearman’s rank correlation is more appropriate because it ranks the data, thereby minimizing the influence of outliers and providing a more robust measure of association for skewed distributions.

rho_all <-cor(mtcars, method="spearman", use="complete.obs")
rho_all_rnd <- round(rho_all, 3)
rho_all_rnd

##         mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear
## mpg   1.000 -0.911 -0.909 -0.895  0.651 -0.886  0.467  0.707  0.562  0.543
## cyl  -0.911  1.000  0.928  0.902 -0.679  0.858 -0.572 -0.814 -0.522 -0.564
## disp -0.909  0.928  1.000  0.851 -0.684  0.898 -0.460 -0.724 -0.624 -0.594
## hp   -0.895  0.902  0.851  1.000 -0.520  0.775 -0.667 -0.752 -0.362 -0.331
## drat  0.651 -0.679 -0.684 -0.520  1.000 -0.750  0.092  0.447  0.687  0.745
## wt   -0.886  0.858  0.898  0.775 -0.750  1.000 -0.225 -0.587 -0.738 -0.676
## qsec  0.467 -0.572 -0.460 -0.667  0.092 -0.225  1.000  0.792 -0.203 -0.148
## vs    0.707 -0.814 -0.724 -0.752  0.447 -0.587  0.792  1.000  0.168  0.283
## am    0.562 -0.522 -0.624 -0.362  0.687 -0.738 -0.203  0.168  1.000  0.808
## gear  0.543 -0.564 -0.594 -0.331  0.745 -0.676 -0.148  0.283  0.808  1.000
## carb -0.657  0.580  0.540  0.733 -0.125  0.500 -0.659 -0.634 -0.064  0.115
##        carb
## mpg  -0.657
## cyl   0.580
## disp  0.540
## hp    0.733
## drat -0.125
## wt    0.500
## qsec -0.659
## vs   -0.634
## am   -0.064
## gear  0.115
## carb  1.000

Kendall’s Tau Correlation

The third measure of association we will review is the Kendall’s Tau correlation. This counts the number of concordant and discordant pairs of variables between all two observations in the dataset. The more similar the variables are, the higher number of concordant pairs they will have increasing the correlation coefficient.

This approach is mostly used on small sample sizes, like we have in this example.

tau_all <-cor(mtcars, method="kendall", use="complete.obs")
tau_all_rnd <- round(tau_all, 3)
tau_all_rnd

##         mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear
## mpg   1.000 -0.795 -0.768 -0.743  0.465 -0.728  0.315  0.590  0.469  0.433
## cyl  -0.795  1.000  0.814  0.785 -0.551  0.728 -0.449 -0.771 -0.495 -0.513
## disp -0.768  0.814  1.000  0.666 -0.499  0.743 -0.301 -0.603 -0.520 -0.476
## hp   -0.743  0.785  0.666  1.000 -0.383  0.611 -0.473 -0.631 -0.304 -0.279
## drat  0.465 -0.551 -0.499 -0.383  1.000 -0.547  0.033  0.375  0.576  0.584
## wt   -0.728  0.728  0.743  0.611 -0.547  1.000 -0.142 -0.488 -0.614 -0.544
## qsec  0.315 -0.449 -0.301 -0.473  0.033 -0.142  1.000  0.658 -0.169 -0.091
## vs    0.590 -0.771 -0.603 -0.631  0.375 -0.488  0.658  1.000  0.168  0.270
## am    0.469 -0.495 -0.520 -0.304  0.576 -0.614 -0.169  0.168  1.000  0.771
## gear  0.433 -0.513 -0.476 -0.279  0.584 -0.544 -0.091  0.270  0.771  1.000
## carb -0.504  0.465  0.414  0.596 -0.095  0.371 -0.506 -0.577 -0.059  0.098
##        carb
## mpg  -0.504
## cyl   0.465
## disp  0.414
## hp    0.596
## drat -0.095
## wt    0.371
## qsec -0.506
## vs   -0.577
## am   -0.059
## gear  0.098
## carb  1.000

Now let’s compare the correlations for the MPG variable across the three approaches.

# Extract the first column from each correlation matrix
r_first_col <- r_all_rnd[, 1]
tau_first_col <- tau_all_rnd[, 1]
rho_first_col <- rho_all_rnd[, 1]

# Combine them into a tibble with appropriately named columns
correlation_tibble <- tibble(
  Variable = rownames(r_all_rnd),
  Pearson = r_first_col,
  Kendall = tau_first_col,
  Spearman = rho_first_col
)

# Print the resulting tibble
print(correlation_tibble)

## # A tibble: 11 × 4
##    Variable Pearson Kendall Spearman
##    <chr>      <dbl>   <dbl>    <dbl>
##  1 mpg        1       1        1    
##  2 cyl       -0.852  -0.795   -0.911
##  3 disp      -0.848  -0.768   -0.909
##  4 hp        -0.776  -0.743   -0.895
##  5 drat       0.681   0.465    0.651
##  6 wt        -0.868  -0.728   -0.886
##  7 qsec       0.419   0.315    0.467
##  8 vs         0.664   0.59     0.707
##  9 am         0.6     0.469    0.562
## 10 gear       0.48    0.433    0.543
## 11 carb      -0.551  -0.504   -0.657

Another way to visualize the relationship between two variables is to plot them in a scatterplot. Scatterplots are very useful tools at examining the relationship between two variables. It can quickly give you a sense about any type of relationship such as linear or some non-normal one.

We will look at a couple of easy ways to produce these graphs. To start we will use the Base R function plot to quickly produce a scatterplot for each of the variable pairs. If you use all the variables, it can be difficult to read as demonstrated in the first table. Good approach for this is to run the plot command first, then choose the variables from what you see that have the appropriate variable type for a scatterplot.

Here, we pick the variables that are continuous or ordinal with what appears to be a relationship with another variable. We also drop the dichotomous variables as they do not reveal as much information about the relationship in a scatterplot.

plot(mtcars)

mtcars_select<-mtcars %>% 
  dplyr::select(mpg, cyl, disp, hp, wt, qsec)

plot(mtcars_select)

#Scatterplot 
scatter_plot <- mtcars %>%
  ggplot(aes(x = hp, y = mpg)) +
  geom_point() +  # Scatter plot points
  labs(
    title = "Scatterplot of MPG & Horsepower from mtcars data",
    x = "Car Horsepower",
    y = "Avg MPF"
  ) +
  theme_minimal()
scatter_plot

#Scatterplot w/Best Fit Line
scatter_plot_bf <- mtcars %>%
  ggplot(aes(x = hp, y = mpg)) +
  geom_point() +  # Scatter plot points
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Linear best fit line
  labs(
    title = "Scatterplot of MPG & Horsepower from mtcars data",
    x = "Car Horsepower",
    y = "Avg MPF"
  ) +
  theme_minimal()
scatter_plot_bf

## `geom_smooth()` using formula = 'y ~ x'

###Adding Correlation Coefficient to Graph
pearson_r <- cor(mtcars$mpg, mtcars$hp, method = "pearson")
scatter_plot_r <- mtcars %>%
  ggplot(aes(x = hp, y = mpg)) +
  geom_point() +  # Scatter plot points
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Linear best fit line
  labs(
    title = "Scatterplot of MPG & Horsepower from mtcars data",
    x = "Car Horsepower",
    y = "Avg MPF"
  ) +
  theme_minimal()+
  annotate("text", 
           x = max(mtcars$hp) * 0.4, 
           y = min(mtcars$mpg) * 1.2, 
           label = paste("Pearson's r =", round(pearson_r, 2)),
           hjust = 0, vjust = 0.5, 
           size = 4, color = "red")

scatter_plot_r

## `geom_smooth()` using formula = 'y ~ x'

Scatterplots of Different Pearson’s R Correlation Coefficients

To illustrate visually what different scatterplots look like when two continuous variables are correlated at different rates, we can randomly sample two while setting the population correlation to be a specific value. We will do this for 4 Pearson’s R values:

0 Correlation
.3 Correlation
.6 Correlation
.9 Correlation

The MASS package allows us to simulate variables that are conditionally related - i.e. correlated at pre-specified levels. Because we will be randomly sampling from a population, our pre-specificed correlations will not be exact but should be roughly accurate.

Once we load the MASS package, we follow these steps:

Set the seed for replication
Define the number of cases - here we use 100
Define the population mean for both variables - we use 0
Define the covariance matrix with population correlation
Randomly sample the two variables
Save the correlation between the two variables
Graph the scatterplot and annotate with the correlation

b. Must specify using geom_smooth(method="lm"....) for best fit line

###Simulating Conditional Probability
library(MASS)
# Set the seed for reproducibility
set.seed(1934)
# Define the number of cases
n <- 100
# Define the means of the two variables
mu <- c(0, 0)
# Define the covariance matrix with the specified correlation
sigma3 <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)
# Generate the data
simulated_data3 <- mvrnorm(n = n, mu = mu, Sigma = sigma3)

# Convert to a data frame for easier handling
simulated_data_df3 <- as.data.frame(simulated_data3)
colnames(simulated_data_df3) <- c("Variable1", "Variable2")


# Calculate and print the correlation to verify
r3<-cor(simulated_data_df3$Variable1, simulated_data_df3$Variable2)

b<-ggplot(simulated_data_df3, aes(x = Variable1, y = Variable2)) +
  geom_point() +  # Scatter plot points
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Linear best fit line
  labs(
    title = "Scatterplot of Simulated Data",
    x = "Variable 1",
    y = "Variable 2"
  ) +
  theme_minimal()+
  annotate("text", 
           x = max(simulated_data_df3$Variable1) * 0.4, 
           y = min(simulated_data_df3$Variable2) * 1.2, 
           label = paste("Pearson's r =", round(r3, 2)),
           hjust = 0, vjust = 0.5, 
           size = 4, color = "red")

b

## `geom_smooth()` using formula = 'y ~ x'

Here, we used ggplot to create 4 scatterplots with different pre-specified correlations. As the correlation coefficient increases(decreases if negative), the best fit line becomes steeper. This is because as the correlation gets close to 1 or -1, the variance between the two variables is becoming nearly identical.

create_plot <- function(correlation) {
  set.seed(1934)
  n <- 100
  mu <- c(0, 0)
  sigma <- matrix(c(1, correlation, correlation, 1), nrow = 2)
  simulated_data <- mvrnorm(n = n, mu = mu, Sigma = sigma)
  simulated_data_df <- as.data.frame(simulated_data)
  colnames(simulated_data_df) <- c("Variable1", "Variable2")
  
  r <- cor(simulated_data_df$Variable1, simulated_data_df$Variable2)
  
  plot <- ggplot(simulated_data_df, aes(x = Variable1, y = Variable2)) +
    geom_point() +  # Scatter plot points
    geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Linear best fit line
    labs(
      title = paste("Scatterplot (r =", round(r, 2), ")"),
      x = "Variable 1",
      y = "Variable 2"
    ) +
    theme_minimal() +
    annotate("text", 
             x = max(simulated_data_df$Variable1) * 0.4, 
             y = min(simulated_data_df$Variable2) * 1.2, 
             label = paste("Pearson's r =", round(r, 2)),
             hjust = 0, vjust = 0.5, 
             size = 4, color = "red")
  
  return(plot)
}

# Create plots for different correlations
plot_r0 <- create_plot(0)
plot_r3 <- create_plot(0.3)
plot_r6 <- create_plot(0.6)
plot_r9 <- create_plot(0.9)

# Arrange plots in a 2x2 grid
grid.arrange(plot_r0, plot_r3, plot_r6, plot_r9, ncol = 2, nrow = 2)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

We also used the annotate function from ggplot2 to add the actual correlation coefficient. While this is not always necessary, it can be helpful as you start to visualize what different correlation coefficients look like in the a scatterplot.

Because scatterplots can be so useful, the following function allows you to quickly create a scatterplot between any two variables in a dataset along with its Pearson’s R correlation coefficient. Functions are something that are manually created that allows the user to do the same thing more quickly and efficiently. In this function, you have to specify 3 things:

Name of the dataset
Name of the y variable
Name of the x variable

To get a function to work, you have to first load it into your R environment. After it is loaded, you simply have to call the function to work and specify the required information. Note this function is also saved as a stand alone file on our Canvas page.

scatter <- function(data, x, y) {
  # Ensure the variables exist in the data frame
  if(!(x %in% names(data)) | !(y %in% names(data))) {
    stop("The specified variables are not in the data frame.")
  }
  
  # Calculate Pearson's correlation
  correlation <- cor(data[[x]], data[[y]], use = "complete.obs")
  
  # Create the scatterplot
  plot <- ggplot(data, aes(x = .data[[x]], y = .data[[y]])) +
    geom_point() +  # Scatter plot points
    geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Linear best fit line
    labs(
      title = paste("Scatterplot of", y, "vs", x),
      x = x,
      y = y
    ) +
    theme_minimal() +
    annotate("text", 
             x = Inf, y = -Inf, 
             label = paste("Pearson's r =", round(correlation, 2)),
             hjust = 1.1, vjust = -0.5, 
             size = 4, color = "red")
  
  return(plot)
}

scatter(mtcars, "mpg", "wt")

## `geom_smooth()` using formula = 'y ~ x'

# Example usage
# df <- data.frame(var1 = rnorm(100), var2 = rnorm(100))
# scatter(df, "var1", "var2")

Bivariate Linear Regression

Correlations are useful in that they statistically assess how two variables are related to each other. While the coefficient is useful to us, it does not tell us other information that can be useful in understanding how one variable influences another.

We turn to bivariate - meaning two variables - linear regression to better assess the real-world influence of some x on some y. Linear regression is one of most frequently used analytic approaches because of its simplicity yet strength. As we will see, a linear regression provides us with additional information that correlation coefficients do not. Most notably, the beta weight which is the estimated real-world impact on expected y for increasing the x variable by 1 unit. This technique allows us to better understand the world and how variables are inter-related compared to correlations. Once again though, this is considered a correlational approach in most instances. You have to use theory and empirics to argue for causation with observational data.

Bivariate OLS: MPG and Horsepower

What is the estimated impact of increasing a car’s horsepower on the average miles per gallon for a car in the 1970s? Above we saw a strong correlation between the two variables of -.776 (Pearson’s R). While this reveals a strong relationship between the two, it does not tell us the real-world impact of increasing horsepower in a car on miles-per-gallon estimates. That number would be useful to car engineer’s working to increase MPG in their fleet.

Because we have already looked at the histograms and scatterplot for the two variables, we know that OLS is an appropriate modeling technique to use with these variable types.

We use the lm function from the rstatix package to estimate the linear regression. We usually want to do something with the results so get in the habit of saving them instead of simply displaying them immediately. Here we save the OLS model as lm1 which has mpg as the dependent variable being predicted by hp, the measure of a car’s horsepower. The dependent variable always comes before the ~ and the independent variables after it. You finish this line of code by including data=data_name at the end updating with the appropriate name of your data.

lm1<-lm(mpg~hp, data=mtcars)
summary(lm1)

## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

By running the summary command, we see the results of the bivariate regression model. Specifically, we see the intercept of 30.09, meaning the estimated MPG for a car with 0 horsepower. This is an instance of when the intercept does not make sense so no need to interpret it. Later in the semester we will learn how to use stargazer and jtools to create nicer looking regression tables.

We also see the beta weight estimate for how hp is related to mpg with a value of -.0682. The negative sign tells you that as the independent variable gets larger, here the more horsepower a car has, the smaller the dependent variables gets on average, here the less MPG a car gets on average. The coefficient value also tells us something specific as well beyond the direction of the relationship. It tells us that on average, for every 1 unit more horsepower in a car, the average miles-per-gallon for the auto does down about .0682 MPG.

We also see three other bits of information about the uncertainty in that estimate: standard error, t-statistic, and p-value. The p-value will come with stars included if you have a significant effect (i.e. a p-value < some specified level). Here we see a large t-statistic of -6.742 with associated p-value of 1.79e-07. This is simply scientific notation to indicate that the p-value has a lot of 0s to begin. Here, we would reject the null hypothesis that there was no relationship between a car’s horsepower and miles-per-gallon.

One useful trick at ascertaining a significant p-value is to divide the absolute value of coefficient size by the standard error. If the value is >= 2, you would reject the null hypothesis of no difference. This is just a quick trick that should not replace the value of precisely calculating the values.

It can be helpful to think about the scales each variable is measured in to decide if that is a big effect or not. Here we review descriptive stats for just the two variables included in the model.

mtcars %>% 
  dplyr::select(mpg, hp) %>% 
get_summary_stats(type=("full"))

## # A tibble: 2 × 13
##   variable     n   min   max median    q1    q3   iqr   mad  mean    sd    se
##   <fct>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mpg         32  10.4  33.9   19.2  15.4  22.8  7.38  5.41  20.1  6.03  1.06
## 2 hp          32  52   335    123    96.5 180   83.5  77.1  147.  68.6  12.1 
## # ℹ 1 more variable: ci <dbl>

For the mpg variable, we see the range(max-min) is roughly 24 mpg while for the hp variable, the range is around 270. So a 1 unit increase in horsepower is very small. When thinking about the effect size from a linear regression, always review the variable descriptives to get a sense of the range. While a 1 unit increase in HP would not be much, a 100 unit change would be. To interpret the beta weight of .0682 in a more realistic scenario, simply multiply the bw x the new value of 100 for 6.82. Here, you would say that for every 100 unit increase in horsepower, the average MPG for a car would decrease around 6.82 MPG.

Frequently, you find yourself in need to saving the predicted value of y and the residuals from a linear regression. That can easily be handled in a variety of ways. We will use the predict function from the rstatix package.

Here we predict the expected value of y from the lm1 model before calculating the residual as the difference between y and the expected value of y from the model.

mtcars$ev_mpg<-predict(lm1) #Save expected value of y from linear model lm1
mtcars$residual<-mtcars$mpg-mtcars$ev_mpg #Calculate the residuals based on y and expected y 

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb   ev_mpg
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 22.59375
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 22.59375
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 23.75363
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 22.59375
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 18.15891
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 22.93489
##                     residual
## Mazda RX4         -1.5937500
## Mazda RX4 Wag     -1.5937500
## Datsun 710        -0.9536307
## Hornet 4 Drive    -1.1937500
## Hornet Sportabout  0.5410881
## Valiant           -4.8348913

list(
  mpg = mtcars$mpg,
  expected_mpg = mtcars$ev_mpg,
  residual = mtcars$residual)

## $mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
## 
## $expected_mpg
##  [1] 22.593750 22.593750 23.753631 22.593750 18.158912 22.934891 13.382932
##  [8] 25.868707 23.617174 21.706782 21.706782 17.817770 17.817770 17.817770
## [15] 16.112064 15.429781 14.406357 25.595794 26.550990 25.664022 23.480718
## [22] 19.864619 19.864619 13.382932 18.158912 25.595794 23.890087 22.389065
## [29] 12.086595 18.158912  7.242387 22.661978
## 
## $residual
##  [1] -1.59374995 -1.59374995 -0.95363068 -1.19374995  0.54108812 -4.83489134
##  [7]  0.91706759 -1.46870730 -0.81717412 -2.50678234 -3.90678234 -1.41777049
## [13] -0.51777049 -2.61777049 -5.71206353 -5.02978075  0.29364342  6.80420581
## [19]  3.84900992  8.23597754 -1.98071757 -4.36461883 -4.66461883 -0.08293241
## [25]  1.04108812  1.70420581  2.10991276  8.01093488  3.71340487  1.54108812
## [31]  7.75761261 -1.26197823

You can also use the ggpredict function from the ggeffects package which calculates the expected value of y at different levels of x. We will work with this package frequently later in the semester when we focus on data visualization. Here, we use it just to easily review the expected MPG by different horsepower levels.

a<-ggpredict(lm1, terms=("hp")) #Sets values based on the data itself
a

## # Predicted values of mpg
## 
##  hp | Predicted |         95% CI
## --------------------------------
##  50 |     26.69 | [24.25, 29.12]
##  85 |     24.30 | [22.41, 26.19]
## 120 |     21.91 | [20.41, 23.41]
## 155 |     19.52 | [18.12, 20.93]
## 195 |     16.79 | [15.08, 18.51]
## 230 |     14.41 | [12.19, 16.62]
## 265 |     12.02 | [ 9.20, 14.83]
## 335 |      7.24 | [ 3.11, 11.38]

## 
## Not all rows are shown in the ouput. Use `print(..., n = Inf)` to show
##   all rows.

b<-ggpredict(lm1, terms=("hp [50:350, by =100]")) #Manually set values of X 
b

## # Predicted values of mpg
## 
##  hp | Predicted |         95% CI
## --------------------------------
##  50 |     26.69 | [24.25, 29.12]
## 150 |     19.86 | [18.47, 21.26]
## 250 |     13.04 | [10.49, 15.59]
## 350 |      6.22 | [ 1.79, 10.65]

Manually Calculationg Unstandardized Beta Weight

We can also use the deviance for both x and y to calculate the beta weight of -.0682 that was revealed in the previous linear regression. We will once again do that one step at a time following the formula discussed in lecture.

#Step 1: Calculates the deviations for both x and y
new_mt<-mtcars %>% 
  mutate(hp_avg=mean(hp),
         mpg_avg=mean(mpg),
          x_dev=hp-hp_avg,
         y_dev=mpg-mpg_avg)

#Step 2: Calculate the Beta
beta <- new_mt %>% 
  summarise(
    numerator = sum(x_dev*y_dev),
    denominator= sum(x_dev^2),
    beta = numerator/denominator)

print(beta)

##   numerator denominator        beta
## 1 -9942.694    145726.9 -0.06822828

summary(lm1)

## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Once again, we verify our work and see that the manual calculation of the beta weight and the beta weight from the lm model are identical. Both this illustration and the previous one are designed to get you thinking specifically about how these coefficients are calculated. You generally will never need to hand-calculate these things but it is good to know how to do it. If nothing else, it should help you with the intuition behind the calculations and especially why understanding deviance/residuals is so important in these types of analyses.

Standardized Beta Weights

The above linear regression beta weight is called an unstandardized coefficient meaning it takes on the scale of itself. The Pearson’s R correlation coefficient is simply a standardized version of this unstandardized beta weight.

We will illustrate this is two ways: First, using the QuantPsych package to directly calculate the standardized coefficient and then second calculating z-scores for the x and y variable and then estimating a linear regression on these transformed variables.

We first reestimate the lm1 model then we use the lm.beta function from the QuantPsych package. This calculates the standardized beta weight which should match the Pearson’s R correlation exactly.

lm1<-lm(mpg ~ hp, data=mtcars)
lm1_standardized <- lm.beta(lm1) #Use lm.beta from QuantPsych package. 
summary(lm1_standardized)# Print the standardized coefficients

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7762 -0.7762 -0.7762 -0.7762 -0.7762 -0.7762

cor(mtcars$mpg, mtcars$hp, method="pearson")

## [1] -0.7761684

We see that both values are -.776, showing that the standardized coefficient from a linear regression and a Pearson’s R correlation is the same thing. This only works in a bivariate world and not in a multi-variate one.

Now, we calculate the z-score for both mpg and hp then reestimate the regression model. The coefficient should be -.776.

mtcars<-mtcars %>% #This calculates the z-score while keeping all of the original variables in the df
  mutate(mean_mpg = mean(mpg),
         n=n(),
         sd_mpg=sd(mpg), 
         zscore_mpg = ((mpg - mean_mpg)/sd_mpg), 
         mean_hp = mean(hp),
         sd_hp=sd(hp), 
         zscore_hp = ((hp - mean_hp)/sd_hp))

lm_z<-lm(zscore_mpg ~ zscore_hp, data=mtcars)
stargazer(lm_z, type="text", digits=4)

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                             zscore_mpg         
## -----------------------------------------------
## zscore_hp                   -0.7762***         
##                              (0.1151)          
##                                                
## Constant                      -0.0000          
##                              (0.1133)          
##                                                
## -----------------------------------------------
## Observations                    32             
## R2                            0.6024           
## Adjusted R2                   0.5892           
## Residual Std. Error      0.6409 (df = 30)      
## F Statistic           45.4598*** (df = 1; 30)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

And that is exactly what we see in these regression results. The coefficient between the z-score version of x and the z-score version of y is -.7762, just like the Pearson’s R coefficient and the standardized beta weight from the QuantPsych package.