Giovanni Minchio giovanni.minchio@unitn.it
Yuxin Zhang yuxin.zhang@unitn.it
Quantitative Methods Lab, Lesson 4.1
22 Oct. 2024
So maybe try some of the other approaches we learned?
graph bar, over(gay_shame) over(lrscale)
graph box gay_shame, over(lrscale)
graph bar, over(gay_shame) by(lrscale)
Pros:
Cons:
Likert-type or ordinal variables with five or more categories are often treated as continuous without much harm. Still, if your variable is ordinal, the Spearman correlation is preferred.
For more info, see here.
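For example, with the two ESS items from this lesson as stand-ins, the two correlation commands might look like this (a minimal sketch; spearman is rank-based, so it is the safer choice for ordinal variables):

```stata
pwcorr gay_shame lrscale, sig   // Pearson: treats values as continuous scores
spearman gay_shame lrscale      // Spearman: rank-based, preferred for ordinal data
```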
4.1 The p-value does not tell you the strength of association! It is the probability of observing a result at least as extreme as the one observed, assuming H0 is true.
4.2 We reject H0 (not H1) when p-value < .05
Forgetting to check the direction of a scale (order of variable values)
E.g., “How often pray apart from at religious services?”
pray:
1 Every day
2 More than once a week
3 Once a week
4 At least once a month
5 Only on special holy days
6 Less often
7 Never
.a Refusal
.b Don't know
.c No answer
rlgdgr:
0 Not at all religious
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 Very religious
.a Refusal
.b Don't know
.c No answer
(obs=36,701)
| pray rlgdgr
-------------+------------------
pray | 1.0000
rlgdgr | -0.6821 1.0000
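Note the sign: pray runs from 1 "Every day" to 7 "Never", the opposite direction to rlgdgr (0 "Not at all religious" to 10 "Very religious"). That is why the correlation is negative, not because praying and religiosity conflict. A minimal sketch of a fix, assuming the ESS variables above:

```stata
* Reverse-code pray so that higher values mean praying more often
gen pray_rev = 8 - pray      // 1 "Every day" -> 7; 7 "Never" -> 1; missings stay missing
corr pray_rev rlgdgr         // the correlation now comes out positive
```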
Variable | Binary | Nominal/ordinal | Interval/ratio |
---|---|---|---|
Binary | Chi-squared | Chi-squared | T-test |
Nominal/ordinal | Chi-squared | Chi-squared | ANOVA |
Interval/ratio | T-test | ANOVA | Correlation |
The main advantage of regression modelling is that it can test several predictors at once.
Simple linear regression analysis characterizes the relationship between one dependent variable and one independent variable with a line.
Multiple linear regression analysis characterizes the relationship between one dependent variable and more than one independent variable.
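In equation form, multiple linear regression extends the simple model with one slope per predictor:

\[ y = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \]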
Some important uses of regression:
Load the example dataset using sysuse followed by the name of the dataset (e.g., sysuse auto, clear), then inspect it with describe:

Contains data
 Observations:            74                  1978 automobile data
    Variables:            12                  13 Apr 2022 17:45
----------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
----------------------------------------------------------------------------------
make str18 %-18s Make and model
price int %8.0gc Price
mpg int %8.0g Mileage (mpg)
rep78 int %8.0g Repair record 1978
headroom float %6.1f Headroom (in.)
trunk int %8.0g Trunk space (cu. ft.)
weight int %8.0gc Weight (lbs.)
length int %8.0g Length (in.)
turn int %8.0g Turn circle (ft.)
displacement int %8.0g Displacement (cu. in.)
gear_ratio float %6.2f Gear ratio
foreign byte %8.0g origin Car origin
----------------------------------------------------------------------------------
Sorted by: foreign
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
make | 0
price | 74 6165.257 2949.496 3291 15906
mpg | 74 21.2973 5.785503 12 41
rep78 | 69 3.405797 .9899323 1 5
headroom | 74 2.993243 .8459948 1.5 5
-------------+---------------------------------------------------------
trunk | 74 13.75676 4.277404 5 23
weight | 74 3019.459 777.1936 1760 4840
length | 74 187.9324 22.26634 142 233
turn | 74 39.64865 4.399354 31 51
displacement | 74 197.2973 91.83722 79 425
-------------+---------------------------------------------------------
gear_ratio | 74 3.014865 .4562871 2.19 3.89
foreign | 74 .2972973 .4601885 0 1
Now, let’s use mpg and weight to run some tests.

mpg: miles per gallon (MPG). A higher MPG means that the vehicle is more fuel-efficient: the car can travel more miles for every gallon of fuel consumed.

weight: vehicle weight in pounds (lbs).
Visualize variables
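The correlation output below comes from pwcorr with the obs and sig options, which print the number of observations and the p-values; a quick scatter plot is also a natural first look (a minimal sketch):

```stata
scatter mpg weight              // visual check of the relationship
pwcorr mpg weight, obs sig      // correlation with N and p-values
```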
| mpg weight
-------------+------------------
mpg | 1.0000
|
| 74
|
weight | -0.8072 1.0000
| 0.0000
| 74 74
|
\[ y = a + b \cdot x \]
Where: \(y\) is the outcome (dependent variable), \(x\) is the predictor (independent variable), \(a\) is the intercept, and \(b\) is the slope (regression coefficient).
The regression coefficient (slope) in simple linear regression can be calculated using correlation coefficient. The formula for the slope (\(b\)) of the regression line is:
\[ b = r \cdot \left( \frac{SD_y}{SD_x} \right) \]
Where: \(r\) is the correlation coefficient between \(x\) and \(y\), and \(SD_y\) and \(SD_x\) are the standard deviations of \(y\) and \(x\).
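As a sanity check, we can plug the numbers from the outputs above into this formula (r = -0.8072 from the correlation; the standard deviations come from summarize):

\[ b = -0.8072 \cdot \left( \frac{5.785503}{777.1936} \right) \approx -0.0060 \]

which matches the weight coefficient in the regress mpg weight output below.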
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(1, 72) = 134.62
Model | 1591.9902 1 1591.9902 Prob > F = 0.0000
Residual | 851.469256 72 11.8259619 R-squared = 0.6515
-------------+---------------------------------- Adj R-squared = 0.6467
Total | 2443.45946 73 33.4720474 Root MSE = 3.4389
------------------------------------------------------------------------------
mpg | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
weight | -.0060087 .0005179 -11.60 0.000 -.0070411 -.0049763
_cons | 39.44028 1.614003 24.44 0.000 36.22283 42.65774
------------------------------------------------------------------------------
Report:
A significant model was found, \(F(1, 72) = 134.62\), \(p < .001\), explaining approximately 65% of the variance in mpg (\(R^2_{\text{adj}} = 0.65\)).
The regression equation is: \[ \text{mpg}_{\text{predicted}} = 39.44 - 0.006 \times \text{weight} \]
There is a significant association between mpg and weight (\(p < .001\)). Specifically, for each one-unit increase in weight, the predicted mpg value decreases by approximately 0.006. The standard error of the slope is 0.0005, and we are 95% confident that the true slope falls between -0.007 and -0.005.
Try this here
\[ SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
\[ SE = \frac{SD}{\sqrt{n}} \]
\[ CI = \text{mean} \pm (1.96 \times SE) \]
\[ CI = \text{mean} \pm (t \times SE) \]
Where: \(t\) is the critical value from the t-distribution table for the 95% CI.
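As an example, the 95% CI for the slope in the regression above follows from its coefficient and standard error (with \(t \approx 1.993\) for 72 degrees of freedom):

\[ CI = -0.0060087 \pm (1.993 \times 0.0005179) \approx [-0.0070, -0.0050] \]

matching the interval reported by Stata.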
Check this article here for more explanation
Now, we want to plot the predicted values of mpg by car weight. To do this, we need the predicted (fitted) values from our regression.

The predict command, when used after a regression, is called a post-estimation command. As specified, it creates a new variable called mpghat.
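A minimal sketch of this workflow (the plot overlays the observed points and the fitted line):

```stata
regress mpg weight
predict mpghat                                     // new variable with fitted values
twoway (scatter mpg weight) (line mpghat weight, sort)
```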
You can also plot mpg by weight using lfit directly, so you can skip running predict mpghat manually, as sketched below.
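A sketch of the lfit shortcut (it estimates and draws the fitted line in one step):

```stata
twoway (scatter mpg weight) (lfit mpg weight)
```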
Now let’s include a third variable, foreign: the car’s origin.
foreign Car origin
----------------------------------------------------------------------------------
Type: Numeric (byte)
Label: origin
Range: [0,1] Units: 1
Unique values: 2 Missing .: 0/74
Tabulation: Freq. Numeric Label
52 0 Domestic
22 1 Foreign
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(2, 71) = 69.75
Model | 1619.2877 2 809.643849 Prob > F = 0.0000
Residual | 824.171761 71 11.608053 R-squared = 0.6627
-------------+---------------------------------- Adj R-squared = 0.6532
Total | 2443.45946 73 33.4720474 Root MSE = 3.4071
------------------------------------------------------------------------------
mpg | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
weight | -.0065879 .0006371 -10.34 0.000 -.0078583 -.0053175
foreign | -1.650029 1.075994 -1.53 0.130 -3.7955 .4954422
_cons | 41.6797 2.165547 19.25 0.000 37.36172 45.99768
------------------------------------------------------------------------------
We can tell Stata that foreign is categorical using the prefix i.var:
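Presumably, the command producing the output below is:

```stata
regress mpg weight i.foreign
```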
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(2, 71) = 69.75
Model | 1619.2877 2 809.643849 Prob > F = 0.0000
Residual | 824.171761 71 11.608053 R-squared = 0.6627
-------------+---------------------------------- Adj R-squared = 0.6532
Total | 2443.45946 73 33.4720474 Root MSE = 3.4071
------------------------------------------------------------------------------
mpg | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
weight | -.0065879 .0006371 -10.34 0.000 -.0078583 -.0053175
|
foreign |
Foreign | -1.650029 1.075994 -1.53 0.130 -3.7955 .4954422
_cons | 41.6797 2.165547 19.25 0.000 37.36172 45.99768
------------------------------------------------------------------------------
See more basic examples using the auto dataset here.
What is the third variable that is ignored, which may be associated with both the independent and dependent variables in the following bivariate associations?
The taller a person is, the higher their IQ scores.
People who travel more tend to be healthier.
Fewer pirates are associated with higher global temperatures.
Higher sunscreen sales correlate with higher skin cancer rates.
SO… We will do them during the labs :)
Don’t panic! I am here to help as much as I can :)
Download simulation_data.csv from Moodle. (P.S., don’t forget to set up or change your working directory and save your data files in the desired folder first!)

cd ""
Check if you imported the data correctly using browse. Once you’ve correctly imported the data, check the variables using describe.
- Check the correlation between var_depend and var_independ.
- Run a simple regression using the explanatory variable var_independ and the outcome variable var_depend, and interpret the regression output.
- Visualize the predicted values.
- Add the variable group as a third explanatory variable in your model, and run a multiple regression.
- Interpret the new results.
- Visualize the new predicted values of var_depend by var_independ, grouped by the third variable group.
- Compare and discuss with your peers: what happened and why?
- Think about a potential similar real-life scenario and write it down.
Upload a PDF-formatted file to Moodle by the end of today’s lab:
In the PDF file, there should be:
This is how I imported the csv file (not the only solution):
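A minimal sketch, assuming the file is saved in your working directory:

```stata
import delimited "simulation_data.csv", clear
```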
These are the outputs:
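They were presumably produced by the following two models (the i. prefix in the second treats group as categorical, matching the 1.group row below):

```stata
regress var_depend var_independ
regress var_depend var_independ i.group
```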
Source | SS df MS Number of obs = 2,000
-------------+---------------------------------- F(1, 1998) = 93.97
Model | 170.821983 1 170.821983 Prob > F = 0.0000
Residual | 3632.17802 1,998 1.81790692 R-squared = 0.0449
-------------+---------------------------------- Adj R-squared = 0.0444
Total | 3803 1,999 1.90245123 Root MSE = 1.3483
------------------------------------------------------------------------------
var_depend | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
var_independ | .2119379 .0218636 9.69 0.000 .16906 .2548159
_cons | 2.245977 .0692218 32.45 0.000 2.110222 2.381731
------------------------------------------------------------------------------
Source | SS df MS Number of obs = 2,000
-------------+---------------------------------- F(2, 1997) = 1535.56
Model | 2304.5 2 1152.25 Prob > F = 0.0000
Residual | 1498.5 1,997 .750375566 R-squared = 0.6060
-------------+---------------------------------- Adj R-squared = 0.6056
Total | 3803 1,999 1.90245123 Root MSE = .86624
------------------------------------------------------------------------------
var_depend | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
var_independ | -.5 .0193795 -25.80 0.000 -.5380061 -.4619939
1.group | -2.85 .0534466 -53.32 0.000 -2.954817 -2.745183
_cons | 5.7 .0785717 72.55 0.000 5.545909 5.854091
------------------------------------------------------------------------------
This is the plot with the fitted regression line. lwidth() is used to adjust the line width of the fitted line; see details in the documentation here.
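A sketch of how such a plot might be built (variable names from the exercise; the if conditions draw one fitted line per group, and lwidth() thickens them):

```stata
twoway (scatter var_depend var_independ) ///
       (lfit var_depend var_independ if group == 0, lwidth(thick)) ///
       (lfit var_depend var_independ if group == 1, lwidth(thick))
```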
A relevant real-life scenario:
The probability of death from COVID appears to be positively correlated with vaccination, but after controlling for age group, this correlation disappears or becomes negative.