Linear Regression in Stata

Giovanni Minchio giovanni.minchio@unitn.it
Yuxin Zhang yuxin.zhang@unitn.it

Quantitative Methods Lab, Lesson 4.1
22 Oct. 2024

Common problems in Assignment 2

Using a scatterplot to visualize ordinal variables:

Try one of the other plot types we covered instead:

graph bar, over(gay_shame) over(lrscale)

graph box gay_shame, over(lrscale)

graph bar, over(gay_shame) by(lrscale)

Common problems in Assignment 2

Treating ordinal variables as categorical (e.g., a 0-10 Likert scale) and running Pearson’s chi-squared test

Pros:

easy
works for both nominal/ordinal

Cons:

does not take the ordinal information into account (it treats categories as nominal): it tests only for association, not direction or strength
requires sufficient sample size in each category

Likert-type or other ordinal variables with five or more categories are often treated as continuous with little harm. If your variable is ordinal, Spearman’s rank correlation is preferred.
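As a rough illustration of the two approaches, the sketch below uses Stata's built-in auto data rather than the assignment data, with rep78 (a 1-5 repair rating) standing in for an ordinal variable. That substitution, and the scalar names rho_s and n_s, are assumptions made only so the example runs on its own.

```stata
* Sketch: chi-squared vs. rank correlation for an ordinal variable
* (rep78 from the built-in auto data stands in for a Likert-type item)
sysuse auto, clear

* Pearson's chi-squared treats the rep78 categories as unordered;
* the expected option prints expected counts so sparse cells are visible
tabulate rep78 foreign, chi2 expected

* Spearman's rank correlation uses the ordering and reports direction and strength
spearman rep78 mpg
scalar rho_s = r(rho)
scalar n_s   = r(N)
display "Spearman's rho = " %6.3f rho_s "  (N = " n_s ")"
```

The chi-squared test only says whether the two variables are associated; the sign and size of Spearman's rho additionally tell you in which direction and how strongly.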

More info see here

Common problems in Assignment 2

Association tests → no causal statements!!

Common problems in Assignment 2

4.1 The p-value does not tell you the strength of an association! It is the probability of obtaining a result at least as extreme as the one observed, assuming H0 is true.

4.2 We reject H0 (not H1) when p-value < .05

Common problems in Assignment 2

Forgetting to check the direction of a scale (order of variable values)

E.g., “How often pray apart from at religious services?”

label list pray

pray:
           1 Every day
           2 More than once a week
           3 Once a week
           4 At least once a month
           5 Only on special holy days
           6 Less often
           7 Never
          .a Refusal
          .b Don't know
          .c No answer

label list rlgdgr

rlgdgr:
           0 Not at all religious
           1 1
           2 2
           3 3
           4 4
           5 5
           6 6
           7 7
           8 8
           9 9
          10 Very religious
          .a Refusal
          .b Don't know
          .c No answer

corr pray rlgdgr

(obs=36,701)

             |     pray   rlgdgr
-------------+------------------
        pray |   1.0000
      rlgdgr |  -0.6821   1.0000
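The negative sign in the output above comes from the coding: pray runs from 1 = every day to 7 = never, while rlgdgr runs from 0 = not at all religious to 10 = very religious. One common fix is to reverse the scale before correlating. The sketch below uses a tiny made-up dataset (not ESS values) so it runs on its own; pray_rev is a hypothetical variable name.

```stata
* Toy data (made-up values, NOT from the ESS): pray is coded 1 = every day ... 7 = never
clear
input pray rlgdgr
1 10
2 8
3 7
4 5
5 4
6 2
7 0
end

* Reverse the 1-7 scale so that higher values mean praying more often
generate pray_rev = 8 - pray

quietly correlate pray rlgdgr
scalar r_orig = r(rho)
quietly correlate pray_rev rlgdgr
scalar r_rev = r(rho)
display "original coding: " %7.4f r_orig "   reversed coding: " %7.4f r_rev
```

Reversing a scale flips the sign of the correlation and leaves its magnitude unchanged. In the real ESS data the same generate line would apply, and observations with missing codes (.a, .b, .c) stay missing.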

Recap of bivariate tests

Variable	Binary	Nominal/ordinal	Interval/ratio
Binary	Chi-squared	Chi-squared	T-test
Nominal/ordinal	Chi-squared	Chi-squared	ANOVA
Interval/ratio	T-test	ANOVA	Correlation
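For reference, here is a sketch of one Stata command per type of cell in the table, using the built-in auto data as a stand-in (an assumption; the variable pairings are chosen only for illustration, and n_ttest and n_anova are hypothetical scalar names).

```stata
* Sketch: one command per cell type in the recap table (built-in auto data)
sysuse auto, clear

* categorical x categorical: chi-squared test
tabulate rep78 foreign, chi2

* binary x interval/ratio: two-sample t-test
ttest mpg, by(foreign)
scalar n_ttest = r(N_1) + r(N_2)

* nominal/ordinal x interval/ratio: one-way ANOVA
oneway mpg rep78, tabulate
scalar n_anova = r(N)

* interval/ratio x interval/ratio: Pearson correlation
pwcorr mpg weight, sig obs
```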

More than two variables??

The main advantage of regression modelling is that it can test several predictors at once.

Linear regression

Simple linear regression characterizes the relationship between one dependent variable and one independent variable with a straight line.
Multiple linear regression characterizes the relationship between one dependent variable and two or more independent variables.

Some important uses of regression:

Exploring associations
Prediction
Extrapolation
Causal inference

Regression with various treatments and effects

Image source: Gelman et al., 2021

Run simple linear regression in Stata

Load example dataset

help dta_examples

You can load the dataset by clicking on “use”, or use sysuse with the name of the dataset.

sysuse auto.dta, clear

sysdescribe auto

Contains data                                 1978 automobile data
 Observations:            74                  13 Apr 2022 17:45
    Variables:            12                  
----------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
----------------------------------------------------------------------------------
make            str18   %-18s                 Make and model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %8.0g                 Repair record 1978
headroom        float   %6.1f                 Headroom (in.)
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear ratio
foreign         byte    %8.0g      origin     Car origin
----------------------------------------------------------------------------------
Sorted by: foreign

Browse the data

browse

summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        make |          0
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
       rep78 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5
-------------+---------------------------------------------------------
       trunk |         74    13.75676    4.277404          5         23
      weight |         74    3019.459    777.1936       1760       4840
      length |         74    187.9324    22.26634        142        233
        turn |         74    39.64865    4.399354         31         51
displacement |         74    197.2973    91.83722         79        425
-------------+---------------------------------------------------------
  gear_ratio |         74    3.014865    .4562871       2.19       3.89
     foreign |         74    .2972973    .4601885          0          1

Now, let’s use mpg and weight to run some tests

mpg: MPG (miles per gallon). A higher MPG means that the vehicle is more fuel-efficient. It indicates that the car can travel more miles for every gallon of fuel consumed.
weight: vehicle weight in pounds (lbs).

Visualize variables

hist mpg

hist weight

scatter mpg weight

Run correlation test

pwcorr mpg weight, sig obs

             |      mpg   weight
-------------+------------------
         mpg |   1.0000 
             |
             |       74
             |
      weight |  -0.8072   1.0000 
             |   0.0000
             |       74       74
             |

Simple linear regression equation

\[ y = a + b \cdot x \]

Where:

\(y\) = predicted value of the dependent variable
\(a\) = intercept of the regression line
\(b\) = slope of the regression line (regression coefficient)
\(x\) = independent variable

The regression coefficient (slope) in simple linear regression can be calculated from the correlation coefficient. The formula for the slope (\(b\)) of the regression line is:

\[ b = r \cdot \left( \frac{SD_y}{SD_x} \right) \]

Where:

\(b\) = slope of the regression line (regression coefficient)
\(r\) = correlation coefficient between the independent variable (\(x\)) and the dependent variable (\(y\))
\(SD_y\) = standard deviation of the dependent variable (\(y\))
\(SD_x\) = standard deviation of the independent variable (\(x\))
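The slope formula can be checked by hand with mpg and weight from the auto data. The sketch below is one way to do it; the scalar names are hypothetical, and the intercept line uses the standard OLS result \(a = \bar{y} - b \cdot \bar{x}\), which is added here and is not part of the formula above.

```stata
* Sketch: recover the simple-regression slope from r, SD_y and SD_x
sysuse auto, clear

quietly correlate mpg weight
scalar r_xy = r(rho)

quietly summarize mpg
scalar sd_y   = r(sd)
scalar mean_y = r(mean)

quietly summarize weight
scalar sd_x   = r(sd)
scalar mean_x = r(mean)

* b = r * (SD_y / SD_x)
scalar b_manual = r_xy * sd_y / sd_x
* The fitted line passes through the point of means, so a = mean_y - b * mean_x
scalar a_manual = mean_y - b_manual * mean_x
display "by hand:  b = " %10.7f b_manual "   a = " %9.5f a_manual

* Compare with the coefficients that regress reports
quietly regress mpg weight
display "regress:  b = " %10.7f _b[weight] "   a = " %9.5f _b[_cons]
```

Both lines should agree with the coefficient on weight and the constant shown in the output of regress mpg weight.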

Simple linear regression

Run simple regression

regress mpg weight

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(1, 72)        =    134.62
       Model |   1591.9902         1   1591.9902   Prob > F        =    0.0000
    Residual |  851.469256        72  11.8259619   R-squared       =    0.6515
-------------+----------------------------------   Adj R-squared   =    0.6467
       Total |  2443.45946        73  33.4720474   Root MSE        =    3.4389

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |  -.0060087   .0005179   -11.60   0.000    -.0070411   -.0049763
       _cons |   39.44028   1.614003    24.44   0.000     36.22283    42.65774
------------------------------------------------------------------------------

Report:

A significant model was found, \(F(1, 72) = 134.62\), \(p < .001\), explaining approximately 65% of the variance in mpg (\(R^2_{\text{adj}} = 0.65\)).

The regression equation is: \[ \text{mpg}_{\text{predicted}} = 39.44 - 0.006 \times \text{weight} \]

There is a significant negative association between mpg and weight (\(p < .001\)). Specifically, for each one-pound increase in weight, the predicted mpg decreases by approximately 0.006, or about 6 mpg per 1,000 lbs. The standard error of the slope is 0.0005, and we are 95% confident that the true slope falls between -0.007 and -0.005.
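The numbers in this report can be read off the output, but they are also kept in Stata's stored results after regress. The following sketch shows one way to retrieve them and to rebuild the confidence interval by hand; the scalar names and the rescaled variable weight_1000 are hypothetical and introduced only for illustration.

```stata
* Sketch: pull the reported quantities from stored results after regress
sysuse auto, clear
regress mpg weight

* Model fit: F statistic, its p-value, and adjusted R-squared
display "F(" e(df_m) ", " e(df_r) ") = " %8.2f e(F)
display "p = " %6.4f Ftail(e(df_m), e(df_r), e(F))
display "Adj R-squared = " %6.4f e(r2_a)

* Slope, its standard error, and the 95% CI rebuilt as b +/- t * SE
scalar b_w    = _b[weight]
scalar se_w   = _se[weight]
scalar t_crit = invttail(e(df_r), 0.025)
scalar ci_lb  = b_w - t_crit * se_w
scalar ci_ub  = b_w + t_crit * se_w
display "slope = " %10.7f b_w "   95% CI: [" %10.7f ci_lb ", " %10.7f ci_ub "]"

* Rescaling weight to 1,000 lbs multiplies the slope by 1,000 and makes it easier to read
generate double weight_1000 = weight / 1000
regress mpg weight_1000
```

Rescaling the predictor changes only the units of the coefficient; the t statistic, p-value, and R-squared stay the same.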

Try this here

Don’t confuse SE with SD!

Standard Deviation (SD) measures the amount of variation or dispersion in the data.

\[ SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]

\(x_i\) = individual values
\(\bar{x}\) = sample mean
\(n\) = number of observations

Standard Error (SE) measures how far the sample mean is likely to be from the true population mean.

\[ SE = \frac{SD}{\sqrt{n}} \]

\(SD\) = standard deviation
\(n\) = number of observations

A 95% confidence interval (95% CI) can be calculated using SE.

For large samples, you can use a z-score of 1.96 for a 95% confidence level:

\[ CI = \text{mean} \pm (1.96 \times SE) \]

For small samples, or when the population standard deviation is unknown and must be estimated from the sample, use the t-distribution with \(n - 1\) degrees of freedom:

\[ CI = \text{mean} \pm (t \times SE) \]

Where: \(t\) is the critical value from the t-distribution table for the 95% CI.
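The formulas above can be applied step by step in Stata. This is a minimal sketch for the mean of mpg in the auto data, with hypothetical scalar names; it assumes a reasonably recent Stata, where the built-in interval command is written ci means (older versions use ci mpg).

```stata
* Sketch: SD, SE and a t-based 95% CI for the mean of mpg, computed by hand
sysuse auto, clear

quietly summarize mpg
scalar n_obs    = r(N)
scalar mean_mpg = r(mean)
scalar sd_mpg   = r(sd)                       // spread of the data

scalar se_mpg = sd_mpg / sqrt(n_obs)          // precision of the sample mean
scalar t_crit = invttail(n_obs - 1, 0.025)    // two-sided 95% critical value
scalar ci_lo  = mean_mpg - t_crit * se_mpg
scalar ci_hi  = mean_mpg + t_crit * se_mpg

display "SD = " %7.4f sd_mpg "   SE = " %7.4f se_mpg
display "95% CI: [" %7.4f ci_lo ", " %7.4f ci_hi "]"

* Built-in equivalent for comparison
ci means mpg
```

Note that the SD describes how much individual cars differ, while the SE shrinks as the sample grows because it describes the uncertainty of the mean.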

Check this article here for more explanation

Plot predicted/fitted line

Now, we want to plot the predicted values of mpg by car weights. To do this, we need the predicted (fitted) values from our regression.

Predict values of mpg

predict mpghat

The predict command, when used after a regression, is called a post-estimation command. As specified, it creates a new variable called mpghat.
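To see what predict does under the hood, the sketch below rebuilds the fitted values from the stored coefficients and also asks predict for the residuals. The names mpghat_manual and mpg_resid are hypothetical; the block reloads the auto data so that it runs on its own.

```stata
* Sketch: what predict computes after regress
sysuse auto, clear
regress mpg weight

predict mpghat                                   // default: fitted values (linear prediction)
generate mpghat_manual = _b[_cons] + _b[weight] * weight

predict mpg_resid, residuals                     // observed mpg minus fitted mpg

list make mpg mpghat mpghat_manual mpg_resid in 1/5
```

The two fitted-value columns should match, and each residual equals mpg minus mpghat.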

Now we can plot the predicted values of mpg by weight

scatter mpg weight || line mpghat weight

Or use lfit, which fits the regression and draws the line in one step, so you can skip running predict mpghat yourself

scatter mpg weight || lfit mpg weight

Now let’s include a third variable, foreign (the car’s origin).

Check variable

codebook foreign

foreign                                                                 Car origin
----------------------------------------------------------------------------------

                  Type: Numeric (byte)
                 Label: origin

                 Range: [0,1]                         Units: 1
         Unique values: 2                         Missing .: 0/74

            Tabulation: Freq.   Numeric  Label
                           52         0  Domestic
                           22         1  Foreign

Visualize variable

graph bar (count), over(foreign)

graph box price, over(foreign)

Run multiple linear regression

regress mpg weight foreign

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     69.75
       Model |   1619.2877         2  809.643849   Prob > F        =    0.0000
    Residual |  824.171761        71   11.608053   R-squared       =    0.6627
-------------+----------------------------------   Adj R-squared   =    0.6532
       Total |  2443.45946        73  33.4720474   Root MSE        =    3.4071

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |  -.0065879   .0006371   -10.34   0.000    -.0078583   -.0053175
     foreign |  -1.650029   1.075994    -1.53   0.130      -3.7955    .4954422
       _cons |    41.6797   2.165547    19.25   0.000     37.36172    45.99768
------------------------------------------------------------------------------

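Specify that foreign is categorical using the factor-variable prefix i. (i.var). Because foreign is already coded 0/1, the estimates are identical to the model above; the prefix matters for categorical variables with more than two values.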

regress mpg weight i.foreign

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     69.75
       Model |   1619.2877         2  809.643849   Prob > F        =    0.0000
    Residual |  824.171761        71   11.608053   R-squared       =    0.6627
-------------+----------------------------------   Adj R-squared   =    0.6532
       Total |  2443.45946        73  33.4720474   Root MSE        =    3.4071

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |  -.0065879   .0006371   -10.34   0.000    -.0078583   -.0053175
             |
     foreign |
    Foreign  |  -1.650029   1.075994    -1.53   0.130      -3.7955    .4954422
       _cons |    41.6797   2.165547    19.25   0.000     37.36172    45.99768
------------------------------------------------------------------------------
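The i. prefix makes a visible difference once a categorical predictor has more than two levels. As a sketch, rep78 (five levels in the auto data) is used here purely for illustration; it is not part of the lesson's models.

```stata
* Sketch: factor-variable notation with a five-level predictor
sysuse auto, clear

* i.rep78 enters indicators for levels 2-5; the lowest level (1) is the base category
regress mpg weight i.rep78

* ib3.rep78 uses level 3 as the base category instead
regress mpg weight ib3.rep78
```

Each rep78 coefficient is the difference in predicted mpg between that level and the base category, holding weight constant. Five cars have a missing rep78, so these models use 69 observations rather than 74.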

See more basic examples using auto dataset here.

Omitted-variable/confounding effect

What is the third variable that is ignored, which may be associated with both the independent and dependent variables in the following bivariate associations?

The taller a person is, the higher their IQ scores.
People who travel more tend to be healthier.
Fewer pirates are associated with higher global temperatures.
Higher sunscreen sales correlate with higher skin cancer rates.

No more after-lab homework for Stata exercises!

SO… We will do them during the labs :)

Don’t panic! I am here to help as much as I can :)

In-class assignment 4.1

Download the dataset simulation_data.csv from Moodle

Use online documentation here to figure out how to import csv files.

(P.s., don’t forget to set up or change your working directory and save your data files in the desired folder first!)

cd ""

Check if you imported the data correctly using browse
Once you’ve correctly imported the data, check variables using describe
Check correlation between var_depend and var_independ
Run simple regression using explanatory variable var_independ and outcome variable var_depend, and interpret regression output
Visualize predicted values
Add the variable group as a third explanatory variable in your model, and run multiple regression
Interpret the new results.
Visualize the new predicted values of var_depend by var_independ, grouped by the third variable group.
Compare and discuss with your peers: what happened and why?
Think about a potential similar real-life scenario and write it down.
Upload a PDF file to Moodle by the end of today’s lab.

In the PDF file, there should be:

regression outputs
interpretation of outputs
visualization of predicted values
a “real-life scenario” you came up with

Solution: importing CSV

This is how I imported the csv file (not the only solution)

cd "/Users/yuxin/Documents/STATALAB2024-25"
import delimited "datafile/simulation_data.csv", clear

These are the outputs

regress var_depend var_independ

      Source |       SS           df       MS      Number of obs   =     2,000
-------------+----------------------------------   F(1, 1998)      =     93.97
       Model |  170.821983         1  170.821983   Prob > F        =    0.0000
    Residual |  3632.17802     1,998  1.81790692   R-squared       =    0.0449
-------------+----------------------------------   Adj R-squared   =    0.0444
       Total |        3803     1,999  1.90245123   Root MSE        =    1.3483

------------------------------------------------------------------------------
  var_depend | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
var_independ |   .2119379   .0218636     9.69   0.000       .16906    .2548159
       _cons |   2.245977   .0692218    32.45   0.000     2.110222    2.381731
------------------------------------------------------------------------------

regress var_depend var_independ i.group

      Source |       SS           df       MS      Number of obs   =     2,000
-------------+----------------------------------   F(2, 1997)      =   1535.56
       Model |      2304.5         2     1152.25   Prob > F        =    0.0000
    Residual |      1498.5     1,997  .750375566   R-squared       =    0.6060
-------------+----------------------------------   Adj R-squared   =    0.6056
       Total |        3803     1,999  1.90245123   Root MSE        =    .86624

------------------------------------------------------------------------------
  var_depend | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
var_independ |        -.5   .0193795   -25.80   0.000    -.5380061   -.4619939
     1.group |      -2.85   .0534466   -53.32   0.000    -2.954817   -2.745183
       _cons |        5.7   .0785717    72.55   0.000     5.545909    5.854091
------------------------------------------------------------------------------

Solution: assignment

This is the plot with the fitted regression line

scatter var_depend var_independ || lfit var_depend var_independ, lwidth(thick)

lwidth() is used to adjust the line width of the fitted line, see details in documentation here

This is the plot with the regression line fitted separately within each group (one panel per group)

scatter var_depend var_independ || lfit var_depend var_independ, by(group) lwidth(thick)

A relevant real-life scenario:

The probability of dying from COVID-19 can appear to be positively correlated with vaccination, because older people are both more likely to be vaccinated and more likely to die. After controlling for age group, this correlation disappears or becomes negative.

Simpson’s paradox
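The sign reversal can be reproduced with simulated data. The sketch below does not recreate simulation_data.csv; the variable names (grp, x, y), the seed, and every parameter value are assumptions chosen only to produce the pattern.

```stata
* Sketch: simulate a Simpson's paradox pattern (all values are made up)
clear
set seed 2024
set obs 2000

generate byte grp = _n > 1000                  // two equal-sized groups, coded 0/1
generate x = rnormal(2 + 3*grp, 1)             // group 1 has higher x on average
generate y = 2 - 0.5*x + 4*grp + rnormal()     // within each group, y falls as x rises

* Pooled model: ignores the group variable
regress y x
scalar b_pooled = _b[x]

* Adjusted model: controls for group
regress y x i.grp
scalar b_within = _b[x]

display "pooled slope = " %6.3f b_pooled "   slope controlling for group = " %6.3f b_within

* Pooled line versus one line per group
twoway (scatter y x, msize(tiny)) (lfit y x) ///
       (lfit y x if grp == 0) (lfit y x if grp == 1), ///
       legend(order(2 "Pooled" 3 "Group 0" 4 "Group 1"))
```

The pooled slope comes out positive because group 1 has both higher x and higher y; once group is held constant, the slope is negative within each group.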
