More on Linear Regression in Stata

Giovanni Minchio giovanni.minchio@unitn.it
Yuxin Zhang yuxin.zhang@unitn.it

Quantitative Methods Lab, Lesson 4.2
24 Oct. 2024

On Tuesday, you learned…

Multiple linear regression with Stata
Reading regression tables
Simpson’s paradox

How many predictors in one regression model?

Depends on the theoretical model and the dataset
Avoid garbage in, garbage out models (too many predictors with noise, unclear what are you looking for)

Nested Models

Model 1: y = var1 + var2

Model 2: y = var1 + var2 + var3

Model 3: y = var1 + var3

Model 1 2 are nested
Model 2 3 are nested
Model 1 3 are not nested

Coefficients cannot be directly compared!

Because variables are measured with different scales

Image source here

And here comes standardized coefficients

Standardization (z-score scaling)

Standardized regression coefficients allow researchers to compare the relative magnitude of the effects of different explanatory variables in the path model by adjusting the standard deviations such that all the variables, despite different units of measurement, have equal standard deviations (Lleras, 2005).

Optional:

Pro: compare the relative effect magnitude of included variables
Con: less meaningful/intuitive interpretation of coefficients

Variable transformation

Centering (e.g., at mean)
Standardization
Normalization (0 to 1)
Reversing scales
Creating additive indices
Changing reference group (categorical predictors)

In Stata

Let’s practice with our old friend ESS10 data set

cd "/Users/yuxin/Documents/STATALAB2024-25"
use "datafile/ESS10.dta", clear

Regression

Let’s first regress the outcome lrscale on agea

regress lrscale agea

      Source |       SS           df       MS      Number of obs   =    32,101
-------------+----------------------------------   F(1, 32099)     =     10.27
       Model |  56.5785141         1  56.5785141   Prob > F        =    0.0014
    Residual |  176894.649    32,099  5.51090842   R-squared       =    0.0003
-------------+----------------------------------   Adj R-squared   =    0.0003
       Total |  176951.228    32,100  5.51249931   Root MSE        =    2.3475

------------------------------------------------------------------------------
     lrscale | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        agea |   .0023079   .0007203     3.20   0.001     .0008961    .0037197
       _cons |    5.09883   .0391211   130.33   0.000     5.022151    5.175508
------------------------------------------------------------------------------

How would you interpret the intercept?

Without centering, it represents the predicted value of the dependent variable when age = 0

Mean centering

Subtract the mean of variable from the value for each observation (this results in a mean of 0)

summarize agea

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        agea |     37,319     50.8512    18.41309         15         90

gen c_agea = agea - 50.8512
sum c_agea

(292 missing values generated)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      c_agea |     37,319    1.71e-06    18.41309   -35.8512    39.1488

Or use r(mean) after summarize command

sum agea
gen c1_agea = agea - r(mean)
sum c1_agea

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        agea |     37,319     50.8512    18.41309         15         90

(292 missing values generated)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     c1_agea |     37,319   -6.24e-08    18.41309   -35.8512    39.1488

Now, let’s regress again

regress lrscale c1_agea

      Source |       SS           df       MS      Number of obs   =    32,101
-------------+----------------------------------   F(1, 32099)     =     10.27
       Model |  56.5785172         1  56.5785172   Prob > F        =    0.0014
    Residual |  176894.649    32,099  5.51090842   R-squared       =    0.0003
-------------+----------------------------------   Adj R-squared   =    0.0003
       Total |  176951.228    32,100  5.51249931   Root MSE        =    2.3475

------------------------------------------------------------------------------
     lrscale | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     c1_agea |   .0023079   .0007203     3.20   0.001     .0008961    .0037197
       _cons |   5.216191   .0131045   398.04   0.000     5.190505    5.241876
------------------------------------------------------------------------------

How do you interpret the intercept now?

After centering, the intercept represents the predicted value of the dependent variable when the centered age is 0, corresponding to its mean value. So, the intercept now reflects the expected value of the dependent variable when age is at its average.

(When a person is at the average age, their predicted position on the left-right political scale is 5.22)

Regression coefficients

Now, let’s first regress the outcome lrscale on variables agea, rlgdgr, and gndr

regress lrscale agea rlgdgr i.gndr

      Source |       SS           df       MS      Number of obs   =    31,890
-------------+----------------------------------   F(3, 31886)     =    146.09
       Model |  2385.32457         3   795.10819   Prob > F        =    0.0000
    Residual |  173547.133    31,886  5.44273767   R-squared       =    0.0136
-------------+----------------------------------   Adj R-squared   =    0.0135
       Total |  175932.458    31,889   5.5170265   Root MSE        =     2.333

------------------------------------------------------------------------------
     lrscale | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        agea |   5.15e-06   .0007318     0.01   0.994    -.0014292    .0014395
      rlgdgr |   .0782304   .0042905    18.23   0.000     .0698208    .0866399
             |
        gndr |
     Female  |  -.3185983   .0264032   -12.07   0.000    -.3703495    -.266847
       _cons |   5.018478   .0425542   117.93   0.000     4.935071    5.101886
------------------------------------------------------------------------------

How would you interpret the coefficients one by one?

Scaling

A standardized variable (also called z-score) is a variable that has been rescaled to have a mean of zero and a standard deviation of one.

Subtracted the mean of the variable from the value for each observation (this results in a mean of 0)
Divide the difference between individual values and mean by standard deviation (this results in a sd of 1)

standardized variable = (variable - mean)/sd

Or more formally, the formula for calculating sample z-score is

\[ z = \frac{x - \bar{x}}{s} \]

Where:

\(x\): individual data point
\(\bar{x}\): sample mean
\(s\): sample standard deviation

Image source here

Now let’s standardize our numerical variables lrscale, agea and rlgdgr

First, check the means

sum lrscale agea rlgdgr

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     lrscale |     32,314    5.216501    2.348016          0         10
        agea |     37,319     50.8512    18.41309         15         90
      rlgdgr |     37,203     4.75123     3.15127          0         10

Generate standardized variables by (x-m)/sd

gen z_lrscale = (lrscale - 5.216501) /  2.348016 
gen z_agea = (agea - 50.8512) / 18.41309 
gen z_rlgdgr = (rlgdgr - 4.75123) / 3.15127

(5,297 missing values generated)

(292 missing values generated)

(408 missing values generated)

Check if the mean is very close to 0

sum z_lrscale z_agea z_rlgdgr

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
   z_lrscale |     32,314   -1.59e-07           1  -2.221663   2.037251
      z_agea |     37,319    9.13e-08    .9999999   -1.94705    2.12614
    z_rlgdgr |     37,203   -8.40e-08           1  -1.507719   1.665605

Rounding error: the mean of a standardized variable is rarely exactly 0

Or being lazy and use std in egen command

egen z1_lrscale = std(lrscale)
egen z1_agea = std(agea)
egen z1_rlgdgr = std(rlgdgr)

(5,297 missing values generated)

(292 missing values generated)

(408 missing values generated)

sum z_lrscale z1_agea z1_rlgdgr

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
   z_lrscale |     32,314   -1.59e-07           1  -2.221663   2.037251
     z1_agea |     37,319    8.99e-10           1   -1.94705    2.12614
   z1_rlgdgr |     37,203   -1.47e-09           1  -1.507719   1.665605

The means and standard deviations are not exactly the same as those calculated manually; again, these differences are due to very small rounding errors.

Regression rescaled

Now let’s rerun our regression with rescaled variables

regress z_lrscale z1_agea z1_rlgdgr i.gndr

      Source |       SS           df       MS      Number of obs   =    31,890
-------------+----------------------------------   F(3, 31886)     =    146.09
       Model |  432.658636         3  144.219545   Prob > F        =    0.0000
    Residual |  31478.5948    31,886   .98722307   R-squared       =    0.0136
-------------+----------------------------------   Adj R-squared   =    0.0135
       Total |  31911.2534    31,889  1.00069784   Root MSE        =    .99359

------------------------------------------------------------------------------
   z_lrscale | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     z1_agea |   .0000404   .0057386     0.01   0.994    -.0112076    .0112883
   z1_rlgdgr |   .1049929   .0057583    18.23   0.000     .0937064    .1162794
             |
        gndr |
     Female  |  -.1356883   .0112449   -12.07   0.000    -.1577287   -.1136479
       _cons |   .0740751   .0081398     9.10   0.000     .0581207    .0900295
------------------------------------------------------------------------------

How you interpret these coefficients?

The coefficients show how many standard deviations the outcome variable changes for a one standard deviation change in the explanatory variable.

E.g., for religiosity, the coefficient is 0.10 in standardized regression. So, one standard deviation increase in religiosity is associated with a 0.10 standard deviation increase in the left-right scale (outcome variable).

Or only use standardized explanatory variables

regress lrscale z1_agea z1_rlgdgr i.gndr

      Source |       SS           df       MS      Number of obs   =    31,890
-------------+----------------------------------   F(3, 31886)     =    146.09
       Model |  2385.32456         3  795.108188   Prob > F        =    0.0000
    Residual |  173547.133    31,886  5.44273767   R-squared       =    0.0136
-------------+----------------------------------   Adj R-squared   =    0.0135
       Total |  175932.458    31,889   5.5170265   Root MSE        =     2.333

------------------------------------------------------------------------------
     lrscale | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     z1_agea |   .0000948   .0134744     0.01   0.994    -.0263155    .0265051
   z1_rlgdgr |    .246525   .0135206    18.23   0.000     .2200242    .2730258
             |
        gndr |
     Female  |  -.3185983   .0264032   -12.07   0.000    -.3703495    -.266847
       _cons |    5.39043   .0191125   282.04   0.000     5.352969    5.427892
------------------------------------------------------------------------------

How you interpret these coefficients?

When the outcome is not standardized, but the explanatory variables are, the coefficients represent how much the outcome changes (in its original units) for a 1 standard deviation change in the explanatory variable.

E.g., one standard deviation increase in religiosity is associated with a 0.25 unit increase in the original left-right scale.

Sounds weird, no..?

So it is up to you to decide whether or not to standardize your variables, depending on contexts.

Normalize a variable to [0, 1] range

normalized variable = (variable - min)/(max - min)

Or formula for variable X

\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \]

Where:

\(x'\): normalized data point in X
\(x\): original data point in X
\(\min(x)\): minimum value in X
\(\max(x)\): maximum value in X

Note: normalization has a restricted range while standardization does not, so normalization is more likely to be affected by outliers!

Lets’ normalize variable lrscale

Run summarize, and then r(max) and r(min)

sum lrscale
gen norm_lrscale = (lrscale - r(min)) / (r(max)-r(min))

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     lrscale |     32,314    5.216501    2.348016          0         10

(5,297 missing values generated)

See more options of stored results in r() for summarize here

Check normalized lrscale

sum norm_lrscale

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
norm_lrscale |     32,314    .5216501    .2348016          0          1

Reversing a scale

E.g., “How often pray apart from at religious services?”

label list pray

pray:
           1 Every day
           2 More than once a week
           3 Once a week
           4 At least once a month
           5 Only on special holy days
           6 Less often
           7 Never
          .a Refusal
          .b Don't know
          .c No answer

Original scale: 1 == Every day, 7 == Never

Reverse and recode it, so that:

Reversed scale: 0 == Never, 6 == Every day

Tty it yourself first

This is how I do it

Check values

tab pray, nolabel

  How often |
 pray apart |
    from at |
  religious |
   services |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      6,502       17.61       17.61
          2 |      3,143        8.51       26.13
          3 |      2,364        6.40       32.53
          4 |      2,632        7.13       39.66
          5 |      2,805        7.60       47.26
          6 |      6,169       16.71       63.97
          7 |     13,300       36.03      100.00
------------+-----------------------------------
      Total |     36,915      100.00

Subtract the values from 7

gen pray_rev = 7 - pray

(696 missing values generated)

Check

tab pray_rev

   pray_rev |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     13,300       36.03       36.03
          1 |      6,169       16.71       52.74
          2 |      2,805        7.60       60.34
          3 |      2,632        7.13       67.47
          4 |      2,364        6.40       73.87
          5 |      3,143        8.51       82.39
          6 |      6,502       17.61      100.00
------------+-----------------------------------
      Total |     36,915      100.00

Creating an additive index

Now we see there are two similar variables, hmsacld and freehms, regarding people’s opinions on LGBT+ rights.

hmsacld: gay and lesbian couples right to adopt children freehms: gays and lesbians free to live life as they wish

codebook hmsacld freehms

hmsacld                                                   Gay and lesbian couples right to adopt children
---------------------------------------------------------------------------------------------------------

                  Type: Numeric (byte)
                 Label: hmsacld

                 Range: [1,5]                         Units: 1
         Unique values: 5                         Missing .: 0/37,611
       Unique mv codes: 3                        Missing .*: 1,280/37,611

            Tabulation: Freq.   Numeric  Label
                        7,740         1  Agree strongly
                        7,727         2  Agree
                        6,584         3  Neither agree nor disagree
                        6,514         4  Disagree
                        7,766         5  Disagree strongly
                          270        .a  Refusal
                          894        .b  Don't know
                          116        .c  No answer

---------------------------------------------------------------------------------------------------------
freehms                                                  Gays and lesbians free to live life as they wish
---------------------------------------------------------------------------------------------------------

                  Type: Numeric (byte)
                 Label: freehms

                 Range: [1,5]                         Units: 1
         Unique values: 5                         Missing .: 0/37,611
       Unique mv codes: 3                        Missing .*: 885/37,611

            Tabulation: Freq.   Numeric  Label
                       12,649         1  Agree strongly
                       12,655         2  Agree
                        5,799         3  Neither agree nor disagree
                        2,990         4  Disagree
                        2,633         5  Disagree strongly
                          226        .a  Refusal
                          564        .b  Don't know
                           95        .c  No answer

We want to combine them into one variable, how should we do it?

Try it yourself first!

Image source here

Creating an additive index

This is how I do it

First, I reverse the values to make them more intuitive

gen hmsacld_rev = 5 - hmsacld
gen freehms_rev = 5 - freehms

(1,280 missing values generated)

(885 missing values generated)

Check

tab hmsacld, nolabel
tab hmsacld_rev

tab freehms, nolabel
tab freehms_rev

    Gay and |
    lesbian |
    couples |
   right to |
      adopt |
   children |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      7,740       21.30       21.30
          2 |      7,727       21.27       42.57
          3 |      6,584       18.12       60.69
          4 |      6,514       17.93       78.62
          5 |      7,766       21.38      100.00
------------+-----------------------------------
      Total |     36,331      100.00


hmsacld_rev |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      7,766       21.38       21.38
          1 |      6,514       17.93       39.31
          2 |      6,584       18.12       57.43
          3 |      7,727       21.27       78.70
          4 |      7,740       21.30      100.00
------------+-----------------------------------
      Total |     36,331      100.00


   Gays and |
   lesbians |
    free to |
  live life |
    as they |
       wish |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     12,649       34.44       34.44
          2 |     12,655       34.46       68.90
          3 |      5,799       15.79       84.69
          4 |      2,990        8.14       92.83
          5 |      2,633        7.17      100.00
------------+-----------------------------------
      Total |     36,726      100.00


freehms_rev |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,633        7.17        7.17
          1 |      2,990        8.14       15.31
          2 |      5,799       15.79       31.10
          3 |     12,655       34.46       65.56
          4 |     12,649       34.44      100.00
------------+-----------------------------------
      Total |     36,726      100.00

Combine them by: summing the variables

gen gay_rights_sum = hmsacld_rev + freehms_rev
tab gay_rights_sum

(1,660 missing values generated)


gay_rights_ |
        sum |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,218        6.17        6.17
          1 |      1,564        4.35       10.52
          2 |      2,739        7.62       18.14
          3 |      3,502        9.74       27.88
          4 |      5,979       16.63       44.51
          5 |      3,905       10.86       55.37
          6 |      5,669       15.77       71.14
          7 |      3,403        9.47       80.61
          8 |      6,972       19.39      100.00
------------+-----------------------------------
      Total |     35,951      100.00

Or…

Combine them by: averaging the variables

gen gay_rights_mean = (hmsacld_rev + freehms_rev) / 2
tab gay_rights_mean

(1,660 missing values generated)


gay_rights_ |
       mean |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,218        6.17        6.17
         .5 |      1,564        4.35       10.52
          1 |      2,739        7.62       18.14
        1.5 |      3,502        9.74       27.88
          2 |      5,979       16.63       44.51
        2.5 |      3,905       10.86       55.37
          3 |      5,669       15.77       71.14
        3.5 |      3,403        9.47       80.61
          4 |      6,972       19.39      100.00
------------+-----------------------------------
      Total |     35,951      100.00

You can also simply use egen for non-missing values in the variables

Summing

egen gay_rights_sum1 = rowtotal(hmsacld_rev freehms_rev)
tab gay_rights_sum1

gay_rights_ |
       sum1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,977        7.92        7.92
          1 |      1,713        4.55       12.47
          2 |      2,933        7.80       20.27
          3 |      3,899       10.37       30.63
          4 |      6,140       16.33       46.96
          5 |      3,905       10.38       57.34
          6 |      5,669       15.07       72.41
          7 |      3,403        9.05       81.46
          8 |      6,972       18.54      100.00
------------+-----------------------------------
      Total |     37,611      100.00

Averaging

egen gay_rights_mean1 = rowmean(hmsacld_rev freehms_rev)
tab gay_rights_mean1

(505 missing values generated)


gay_rights_ |
      mean1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,472        6.66        6.66
         .5 |      1,564        4.21       10.88
          1 |      2,888        7.78       18.66
        1.5 |      3,502        9.44       28.10
          2 |      6,173       16.64       44.73
        2.5 |      3,905       10.52       55.26
          3 |      6,066       16.35       71.61
        3.5 |      3,403        9.17       80.78
          4 |      7,133       19.22      100.00
------------+-----------------------------------
      Total |     37,106      100.00

Note that two approaches treat missing values differently, so you may obtain different results.

Sum:
- hmsacld_rev + freehms_rev: . + . = ..
- hmsacld_rev + freehms_rev: . + 1 = ..
- rowtotal(hmsacld_rev freehms_rev): rowtotal(. + .) = 0
- rowtotal(hmsacld_rev freehms_rev): rowtotal(. + 1) = 0+1 =1
Mean:
- (hmsacld_rev + freehms_rev) / 2: (. + .) / 2 = ..
- (hmsacld_rev + freehms_rev) / 2: (. + 1) / 2 = ..
- rowmean(hmsacld_rev freehms_rev): rowmean(. .) = ..
- rowmean(hmsacld_rev freehms_rev): rowmean(. 1) = 1/1 = 1

Changing reference group in model

We can change reference category in categorical variables. Now let’s try with gender.

Check current reference group (by default it is the first group)

codebook gndr

gndr                                                                                               Gender
---------------------------------------------------------------------------------------------------------

                  Type: Numeric (byte)
                 Label: gndr

                 Range: [1,2]                         Units: 1
         Unique values: 2                         Missing .: 0/37,611

            Tabulation: Freq.   Numeric  Label
                       17,463         1  Male
                       20,148         2  Female

Inspect current regression

regress lrscale i.gndr

      Source |       SS           df       MS      Number of obs   =    32,314
-------------+----------------------------------   F(1, 32312)     =     96.95
       Model |  532.921349         1  532.921349   Prob > F        =    0.0000
    Residual |  177614.441    32,312  5.49685691   R-squared       =    0.0030
-------------+----------------------------------   Adj R-squared   =    0.0030
       Total |  178147.362    32,313  5.51317927   Root MSE        =    2.3445

------------------------------------------------------------------------------
     lrscale | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        gndr |
     Female  |  -.2571983   .0261212    -9.85   0.000    -.3083969   -.2059997
       _cons |   5.351865   .0189501   282.42   0.000     5.314722    5.389008
------------------------------------------------------------------------------

We can change reference group by specifying the reference category using ib(#) (b stands for “base”, # indicates the category number you want to set as the base/reference group)

b1: “Male” is reference

regress lrscale ib1.gndr

      Source |       SS           df       MS      Number of obs   =    32,314
-------------+----------------------------------   F(1, 32312)     =     96.95
       Model |  532.921349         1  532.921349   Prob > F        =    0.0000
    Residual |  177614.441    32,312  5.49685691   R-squared       =    0.0030
-------------+----------------------------------   Adj R-squared   =    0.0030
       Total |  178147.362    32,313  5.51317927   Root MSE        =    2.3445

------------------------------------------------------------------------------
     lrscale | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        gndr |
     Female  |  -.2571983   .0261212    -9.85   0.000    -.3083969   -.2059997
       _cons |   5.351865   .0189501   282.42   0.000     5.314722    5.389008
------------------------------------------------------------------------------

b2: “Female” is reference

regress lrscale ib2.gndr

      Source |       SS           df       MS      Number of obs   =    32,314
-------------+----------------------------------   F(1, 32312)     =     96.95
       Model |  532.921349         1  532.921349   Prob > F        =    0.0000
    Residual |  177614.441    32,312  5.49685691   R-squared       =    0.0030
-------------+----------------------------------   Adj R-squared   =    0.0030
       Total |  178147.362    32,313  5.51317927   Root MSE        =    2.3445

------------------------------------------------------------------------------
     lrscale | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        gndr |
       Male  |   .2571983   .0261212     9.85   0.000     .2059997    .3083969
       _cons |   5.094667   .0179781   283.38   0.000     5.059429    5.129905
------------------------------------------------------------------------------

Steps in regression modeling

Justify a limit number of predictors based on theories/literature
Hypothesizing
Run regression models, adding predictors stepwisely
Assumptions check (next week)
Interpret results
Go back to hypotheses
Refine model / draw conclusion

All models are wrong, but some are useful (George Box, 1976).

Image source here

In-class assignment 4.2

Use ESS10.dta
Investigate variables trstprt, trstplt, rlgatnd, pray
Transform these variables using what we learned today (as you want but justify it), then run a simple regression with the created additive indices: general trust in politics (outcome) and religiosity (explanatory)
Come up with a potential third explanatory variable, then run a multiple regression
Compare and interpret two regression tables
Submit only your do-file with comments to Moodle In-class assignment (Week 4), with:

tidy presentation of variables
variable transformation and justification
output tables and interpretation

* use asterisk "*" at the beginning of a line for comments in Stata (the comment line will appear in green in your do-file)

I will upload the solution to Moodle by the end of the class, here: