Giovanni Minchio giovanni.minchio@unitn.it
Yuxin Zhang yuxin.zhang@unitn.it
Quantitative Methods Lab, Lesson 4.2
24
Oct. 2024
Model 1: y = var1 + var2
Model 2: y = var1 + var2 + var3
Model 3: y = var1 + var3
Because variables are measured with different scales
And here comes standardized coefficients
Standardized regression coefficients allow researchers to compare the relative magnitude of the effects of different explanatory variables in the path model by adjusting the standard deviations such that all the variables, despite different units of measurement, have equal standard deviations (Lleras, 2005).
Optional:
Let’s practice with our old friend ESS10
data set
lrscale
on
agea
Source | SS df MS Number of obs = 32,101
-------------+---------------------------------- F(1, 32099) = 10.27
Model | 56.5785141 1 56.5785141 Prob > F = 0.0014
Residual | 176894.649 32,099 5.51090842 R-squared = 0.0003
-------------+---------------------------------- Adj R-squared = 0.0003
Total | 176951.228 32,100 5.51249931 Root MSE = 2.3475
------------------------------------------------------------------------------
lrscale | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
agea | .0023079 .0007203 3.20 0.001 .0008961 .0037197
_cons | 5.09883 .0391211 130.33 0.000 5.022151 5.175508
------------------------------------------------------------------------------
How would you interpret the intercept?
Without centering, it represents the predicted value of the dependent variable when age = 0
Subtract the mean of variable from the value for each observation (this results in a mean of 0)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
agea | 37,319 50.8512 18.41309 15 90
(292 missing values generated)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
c_agea | 37,319 1.71e-06 18.41309 -35.8512 39.1488
r(mean)
after summarize
command Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
agea | 37,319 50.8512 18.41309 15 90
(292 missing values generated)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
c1_agea | 37,319 -6.24e-08 18.41309 -35.8512 39.1488
Source | SS df MS Number of obs = 32,101
-------------+---------------------------------- F(1, 32099) = 10.27
Model | 56.5785172 1 56.5785172 Prob > F = 0.0014
Residual | 176894.649 32,099 5.51090842 R-squared = 0.0003
-------------+---------------------------------- Adj R-squared = 0.0003
Total | 176951.228 32,100 5.51249931 Root MSE = 2.3475
------------------------------------------------------------------------------
lrscale | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
c1_agea | .0023079 .0007203 3.20 0.001 .0008961 .0037197
_cons | 5.216191 .0131045 398.04 0.000 5.190505 5.241876
------------------------------------------------------------------------------
How do you interpret the intercept now?
After centering, the intercept represents the predicted value of the dependent variable when the centered age is 0, corresponding to its mean value. So, the intercept now reflects the expected value of the dependent variable when age is at its average.
(When a person is at the average age, their predicted position on the left-right political scale is 5.22)
lrscale
on
variables agea
, rlgdgr
, and
gndr
Source | SS df MS Number of obs = 31,890
-------------+---------------------------------- F(3, 31886) = 146.09
Model | 2385.32457 3 795.10819 Prob > F = 0.0000
Residual | 173547.133 31,886 5.44273767 R-squared = 0.0136
-------------+---------------------------------- Adj R-squared = 0.0135
Total | 175932.458 31,889 5.5170265 Root MSE = 2.333
------------------------------------------------------------------------------
lrscale | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
agea | 5.15e-06 .0007318 0.01 0.994 -.0014292 .0014395
rlgdgr | .0782304 .0042905 18.23 0.000 .0698208 .0866399
|
gndr |
Female | -.3185983 .0264032 -12.07 0.000 -.3703495 -.266847
_cons | 5.018478 .0425542 117.93 0.000 4.935071 5.101886
------------------------------------------------------------------------------
How would you interpret the coefficients one by one?
A standardized variable (also called z-score) is a variable that has been rescaled to have a mean of zero and a standard deviation of one.
standardized variable = (variable - mean)/sd
Or more formally, the formula for calculating sample z-score is
\[ z = \frac{x - \bar{x}}{s} \]
Where:
Now let’s standardize our numerical variables lrscale
,
agea
and rlgdgr
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
lrscale | 32,314 5.216501 2.348016 0 10
agea | 37,319 50.8512 18.41309 15 90
rlgdgr | 37,203 4.75123 3.15127 0 10
(x-m)/sd
gen z_lrscale = (lrscale - 5.216501) / 2.348016
gen z_agea = (agea - 50.8512) / 18.41309
gen z_rlgdgr = (rlgdgr - 4.75123) / 3.15127
(5,297 missing values generated)
(292 missing values generated)
(408 missing values generated)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
z_lrscale | 32,314 -1.59e-07 1 -2.221663 2.037251
z_agea | 37,319 9.13e-08 .9999999 -1.94705 2.12614
z_rlgdgr | 37,203 -8.40e-08 1 -1.507719 1.665605
Rounding error: the mean of a standardized variable is rarely exactly 0
std
in egen
command(5,297 missing values generated)
(292 missing values generated)
(408 missing values generated)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
z_lrscale | 32,314 -1.59e-07 1 -2.221663 2.037251
z1_agea | 37,319 8.99e-10 1 -1.94705 2.12614
z1_rlgdgr | 37,203 -1.47e-09 1 -1.507719 1.665605
The means and standard deviations are not exactly the same as those calculated manually; again, these differences are due to very small rounding errors.
Source | SS df MS Number of obs = 31,890
-------------+---------------------------------- F(3, 31886) = 146.09
Model | 432.658636 3 144.219545 Prob > F = 0.0000
Residual | 31478.5948 31,886 .98722307 R-squared = 0.0136
-------------+---------------------------------- Adj R-squared = 0.0135
Total | 31911.2534 31,889 1.00069784 Root MSE = .99359
------------------------------------------------------------------------------
z_lrscale | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
z1_agea | .0000404 .0057386 0.01 0.994 -.0112076 .0112883
z1_rlgdgr | .1049929 .0057583 18.23 0.000 .0937064 .1162794
|
gndr |
Female | -.1356883 .0112449 -12.07 0.000 -.1577287 -.1136479
_cons | .0740751 .0081398 9.10 0.000 .0581207 .0900295
------------------------------------------------------------------------------
How you interpret these coefficients?
The coefficients show how many standard deviations the outcome variable changes for a one standard deviation change in the explanatory variable.
E.g., for religiosity, the coefficient is 0.10 in standardized regression. So, one standard deviation increase in religiosity is associated with a 0.10 standard deviation increase in the left-right scale (outcome variable).
Source | SS df MS Number of obs = 31,890
-------------+---------------------------------- F(3, 31886) = 146.09
Model | 2385.32456 3 795.108188 Prob > F = 0.0000
Residual | 173547.133 31,886 5.44273767 R-squared = 0.0136
-------------+---------------------------------- Adj R-squared = 0.0135
Total | 175932.458 31,889 5.5170265 Root MSE = 2.333
------------------------------------------------------------------------------
lrscale | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
z1_agea | .0000948 .0134744 0.01 0.994 -.0263155 .0265051
z1_rlgdgr | .246525 .0135206 18.23 0.000 .2200242 .2730258
|
gndr |
Female | -.3185983 .0264032 -12.07 0.000 -.3703495 -.266847
_cons | 5.39043 .0191125 282.04 0.000 5.352969 5.427892
------------------------------------------------------------------------------
How you interpret these coefficients?
When the outcome is not standardized, but the explanatory variables are, the coefficients represent how much the outcome changes (in its original units) for a 1 standard deviation change in the explanatory variable.
E.g., one standard deviation increase in religiosity is associated with a 0.25 unit increase in the original left-right scale.
Sounds weird, no..?
So it is up to you to decide whether or not to standardize your variables, depending on contexts.
normalized variable = (variable - min)/(max - min)
Or formula for variable X
\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \]
Where:
Note: normalization has a restricted range while standardization does not, so normalization is more likely to be affected by outliers!
Lets’ normalize variable lrscale
summarize
, and then r(max)
and
r(min)
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
lrscale | 32,314 5.216501 2.348016 0 10
(5,297 missing values generated)
See more options of stored results in r()
for
summarize
here
lrscale
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
norm_lrscale | 32,314 .5216501 .2348016 0 1
E.g., “How often pray apart from at religious services?”
pray:
1 Every day
2 More than once a week
3 Once a week
4 At least once a month
5 Only on special holy days
6 Less often
7 Never
.a Refusal
.b Don't know
.c No answer
Reverse and recode it, so that:
Tty it yourself first
This is how I do it
How often |
pray apart |
from at |
religious |
services | Freq. Percent Cum.
------------+-----------------------------------
1 | 6,502 17.61 17.61
2 | 3,143 8.51 26.13
3 | 2,364 6.40 32.53
4 | 2,632 7.13 39.66
5 | 2,805 7.60 47.26
6 | 6,169 16.71 63.97
7 | 13,300 36.03 100.00
------------+-----------------------------------
Total | 36,915 100.00
(696 missing values generated)
pray_rev | Freq. Percent Cum.
------------+-----------------------------------
0 | 13,300 36.03 36.03
1 | 6,169 16.71 52.74
2 | 2,805 7.60 60.34
3 | 2,632 7.13 67.47
4 | 2,364 6.40 73.87
5 | 3,143 8.51 82.39
6 | 6,502 17.61 100.00
------------+-----------------------------------
Total | 36,915 100.00
Now we see there are two similar variables, hmsacld
and
freehms
, regarding people’s opinions on LGBT+ rights.
hmsacld
: gay and lesbian couples right to adopt children
freehms
: gays and lesbians free to live life as they
wish
hmsacld Gay and lesbian couples right to adopt children
---------------------------------------------------------------------------------------------------------
Type: Numeric (byte)
Label: hmsacld
Range: [1,5] Units: 1
Unique values: 5 Missing .: 0/37,611
Unique mv codes: 3 Missing .*: 1,280/37,611
Tabulation: Freq. Numeric Label
7,740 1 Agree strongly
7,727 2 Agree
6,584 3 Neither agree nor disagree
6,514 4 Disagree
7,766 5 Disagree strongly
270 .a Refusal
894 .b Don't know
116 .c No answer
---------------------------------------------------------------------------------------------------------
freehms Gays and lesbians free to live life as they wish
---------------------------------------------------------------------------------------------------------
Type: Numeric (byte)
Label: freehms
Range: [1,5] Units: 1
Unique values: 5 Missing .: 0/37,611
Unique mv codes: 3 Missing .*: 885/37,611
Tabulation: Freq. Numeric Label
12,649 1 Agree strongly
12,655 2 Agree
5,799 3 Neither agree nor disagree
2,990 4 Disagree
2,633 5 Disagree strongly
226 .a Refusal
564 .b Don't know
95 .c No answer
We want to combine them into one variable, how should we do it?
Try it yourself first!
This is how I do it
(1,280 missing values generated)
(885 missing values generated)
Gay and |
lesbian |
couples |
right to |
adopt |
children | Freq. Percent Cum.
------------+-----------------------------------
1 | 7,740 21.30 21.30
2 | 7,727 21.27 42.57
3 | 6,584 18.12 60.69
4 | 6,514 17.93 78.62
5 | 7,766 21.38 100.00
------------+-----------------------------------
Total | 36,331 100.00
hmsacld_rev | Freq. Percent Cum.
------------+-----------------------------------
0 | 7,766 21.38 21.38
1 | 6,514 17.93 39.31
2 | 6,584 18.12 57.43
3 | 7,727 21.27 78.70
4 | 7,740 21.30 100.00
------------+-----------------------------------
Total | 36,331 100.00
Gays and |
lesbians |
free to |
live life |
as they |
wish | Freq. Percent Cum.
------------+-----------------------------------
1 | 12,649 34.44 34.44
2 | 12,655 34.46 68.90
3 | 5,799 15.79 84.69
4 | 2,990 8.14 92.83
5 | 2,633 7.17 100.00
------------+-----------------------------------
Total | 36,726 100.00
freehms_rev | Freq. Percent Cum.
------------+-----------------------------------
0 | 2,633 7.17 7.17
1 | 2,990 8.14 15.31
2 | 5,799 15.79 31.10
3 | 12,655 34.46 65.56
4 | 12,649 34.44 100.00
------------+-----------------------------------
Total | 36,726 100.00
(1,660 missing values generated)
gay_rights_ |
sum | Freq. Percent Cum.
------------+-----------------------------------
0 | 2,218 6.17 6.17
1 | 1,564 4.35 10.52
2 | 2,739 7.62 18.14
3 | 3,502 9.74 27.88
4 | 5,979 16.63 44.51
5 | 3,905 10.86 55.37
6 | 5,669 15.77 71.14
7 | 3,403 9.47 80.61
8 | 6,972 19.39 100.00
------------+-----------------------------------
Total | 35,951 100.00
Or…
(1,660 missing values generated)
gay_rights_ |
mean | Freq. Percent Cum.
------------+-----------------------------------
0 | 2,218 6.17 6.17
.5 | 1,564 4.35 10.52
1 | 2,739 7.62 18.14
1.5 | 3,502 9.74 27.88
2 | 5,979 16.63 44.51
2.5 | 3,905 10.86 55.37
3 | 5,669 15.77 71.14
3.5 | 3,403 9.47 80.61
4 | 6,972 19.39 100.00
------------+-----------------------------------
Total | 35,951 100.00
You can also simply use egen
for non-missing values
in the variables
gay_rights_ |
sum1 | Freq. Percent Cum.
------------+-----------------------------------
0 | 2,977 7.92 7.92
1 | 1,713 4.55 12.47
2 | 2,933 7.80 20.27
3 | 3,899 10.37 30.63
4 | 6,140 16.33 46.96
5 | 3,905 10.38 57.34
6 | 5,669 15.07 72.41
7 | 3,403 9.05 81.46
8 | 6,972 18.54 100.00
------------+-----------------------------------
Total | 37,611 100.00
(505 missing values generated)
gay_rights_ |
mean1 | Freq. Percent Cum.
------------+-----------------------------------
0 | 2,472 6.66 6.66
.5 | 1,564 4.21 10.88
1 | 2,888 7.78 18.66
1.5 | 3,502 9.44 28.10
2 | 6,173 16.64 44.73
2.5 | 3,905 10.52 55.26
3 | 6,066 16.35 71.61
3.5 | 3,403 9.17 80.78
4 | 7,133 19.22 100.00
------------+-----------------------------------
Total | 37,106 100.00
Note that two approaches treat missing values differently, so you may obtain different results.
Sum:
hmsacld_rev + freehms_rev
: . + . = ..hmsacld_rev + freehms_rev
: . + 1 = ..rowtotal(hmsacld_rev freehms_rev)
: rowtotal(. + .) =
0rowtotal(hmsacld_rev freehms_rev)
: rowtotal(. + 1) =
0+1 =1Mean:
(hmsacld_rev + freehms_rev) / 2
: (. + .) / 2 = ..(hmsacld_rev + freehms_rev) / 2
: (. + 1) / 2 = ..rowmean(hmsacld_rev freehms_rev)
: rowmean(. .) =
..rowmean(hmsacld_rev freehms_rev)
: rowmean(. 1) = 1/1 =
1We can change reference category in categorical variables. Now let’s try with gender.
gndr Gender
---------------------------------------------------------------------------------------------------------
Type: Numeric (byte)
Label: gndr
Range: [1,2] Units: 1
Unique values: 2 Missing .: 0/37,611
Tabulation: Freq. Numeric Label
17,463 1 Male
20,148 2 Female
Source | SS df MS Number of obs = 32,314
-------------+---------------------------------- F(1, 32312) = 96.95
Model | 532.921349 1 532.921349 Prob > F = 0.0000
Residual | 177614.441 32,312 5.49685691 R-squared = 0.0030
-------------+---------------------------------- Adj R-squared = 0.0030
Total | 178147.362 32,313 5.51317927 Root MSE = 2.3445
------------------------------------------------------------------------------
lrscale | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
gndr |
Female | -.2571983 .0261212 -9.85 0.000 -.3083969 -.2059997
_cons | 5.351865 .0189501 282.42 0.000 5.314722 5.389008
------------------------------------------------------------------------------
We can change reference group by specifying the reference
category using ib(#)
(b
stands for “base”,
#
indicates the category number you want to set as the
base/reference group)
Source | SS df MS Number of obs = 32,314
-------------+---------------------------------- F(1, 32312) = 96.95
Model | 532.921349 1 532.921349 Prob > F = 0.0000
Residual | 177614.441 32,312 5.49685691 R-squared = 0.0030
-------------+---------------------------------- Adj R-squared = 0.0030
Total | 178147.362 32,313 5.51317927 Root MSE = 2.3445
------------------------------------------------------------------------------
lrscale | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
gndr |
Female | -.2571983 .0261212 -9.85 0.000 -.3083969 -.2059997
_cons | 5.351865 .0189501 282.42 0.000 5.314722 5.389008
------------------------------------------------------------------------------
Source | SS df MS Number of obs = 32,314
-------------+---------------------------------- F(1, 32312) = 96.95
Model | 532.921349 1 532.921349 Prob > F = 0.0000
Residual | 177614.441 32,312 5.49685691 R-squared = 0.0030
-------------+---------------------------------- Adj R-squared = 0.0030
Total | 178147.362 32,313 5.51317927 Root MSE = 2.3445
------------------------------------------------------------------------------
lrscale | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
gndr |
Male | .2571983 .0261212 9.85 0.000 .2059997 .3083969
_cons | 5.094667 .0179781 283.38 0.000 5.059429 5.129905
------------------------------------------------------------------------------
Use ESS10.dta
Investigate variables trstprt
, trstplt
,
rlgatnd
, pray
Transform these variables using what we learned today (as you
want but justify it), then run a simple regression with the created
additive indices: general trust in politics
(outcome) and
religiosity
(explanatory)
Come up with a potential third explanatory variable, then run a multiple regression
Compare and interpret two regression tables
Submit only your do-file with comments to Moodle In-class assignment (Week 4), with:
* use asterisk "*" at the beginning of a line for comments in Stata (the comment line will appear in green in your do-file)