Giovanni Minchio giovanni.minchio@unitn.it
Yuxin Zhang yuxin.zhang@unitn.it
Quantitative Methods Lab, Lesson 2.2
10 Oct. 2024
Bivariate analysis enables the examination of the association between two variables, helping to identify whether a correlation exists and assess its strength. For researchers, it can serve as an initial check before proceeding to more complex analyses.
A hypothesis is a statement about a population parameter that is subject to verification. Data are then used to test the validity of the statement.
Hypothesis testing is a procedure based on sample evidence and probability theory to assess whether the hypothesis is plausible.
Null hypothesis (\(H_0\)): ?
Alternate hypothesis (\(H_1\)): Elderly people visit hospitals for medical check-ups more often than younger people.
Null hypothesis (\(H_0\)): There is no difference in voting behavior between those who went to college and those who did not.
Alternate hypothesis (\(H_1\)): ?
Which are independent and dependent variables?
Which test? (To be answered later)
Scatter plot
Correlation
Two-sample t-test
One-way ANOVA (Analysis of Variance)
Chi-squared test
Stating your null (\(H_0\)) and alternate (\(H_1\)) hypothesis
↓
Selecting significance level and the test to be used
In the social sciences, typically the .05 (5 percent) level
Image source: https://www.abtasty.com/blog/type-1-and-type-2-errors/
↓
Check assumptions for the chosen test
If the assumptions are not met, the test may yield inaccurate results
↓
Calculate the test statistic
↓
Deciding whether to reject or fail to reject null hypothesis (\(H_0\))
↓
Reporting and interpreting results (and maybe visualizing them)
↓
Deciding if further analysis is needed
How many observations (sample size)?
Key assumptions:
Random/representative sample
Measurement scales of variables
Normal/near-to-normal distributions
No outliers
A scatter plot reveals potential relationships between changes in two numeric variables.
Let’s see how internet use (netustm) is related to age (agea).
After loading the data, you may want to retain only the variables we will use today in your Stata workspace (tidier, less RAM usage, faster computations).
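A minimal sketch of this setup step. The file name is a placeholder, and `cntry` is assumed to be the country identifier that the Italy subsample will need later; neither comes from the slides.

```stata
* Placeholder file name: point this at your own copy of the ESS data
use "ESS10.dta", clear

* Keep only the variables used in this lesson
* (cntry is an assumed country identifier, needed later for Italy)
keep cntry netustm agea vote gndr prtcleit

* Scatter plot: internet use on the y-axis, age on the x-axis
twoway scatter netustm agea
```

With tens of thousands of respondents the points overlap heavily, so the plot mainly shows the overall direction of the relationship.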
Hypotheses:
Null Hypothesis (\(H_0\)): there is no correlation between the variables.
Alternative Hypothesis (\(H_1\)): there is a significant correlation between the variables.
The correlation coefficient is measured on a scale that varies from + 1 through 0 to – 1.
Image source: https://www.scribbr.com/statistics/correlation-coefficient/
histogram with bin() to show less/more detail
correlate
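The commands behind this step might look as follows. The bin counts are arbitrary illustration values, not the ones used in class.

```stata
* Distribution of each variable; bin() sets the number of bars
histogram agea, bin(20)
histogram netustm, bin(50)

* Pearson correlation, using only cases observed on both variables
correlate netustm agea
```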
(obs=27,395)
| netustm agea
-------------+------------------
netustm | 1.0000
agea | -0.2968 1.0000
pwcorr (“pw” stands for “pairwise”)
| netustm agea
-------------+------------------
netustm | 1.0000
agea | -0.2968 1.0000
pwcorr, with p-values shown via sig
| netustm agea
-------------+------------------
netustm | 1.0000
|
|
agea | -0.2968 1.0000
| 0.0000
|
pwcorr, with significant coefficients flagged via star()
| netustm agea
-------------+------------------
netustm | 1.0000
agea | -0.2968* 1.0000
pwcorr, with significant coefficients flagged via star() and sample sizes shown via obs
| netustm agea
-------------+------------------
netustm | 1.0000
| 27598
|
agea | -0.2968* 1.0000
| 27395 37319
|
r(27393) = -.30, p < .001
[Report: Pearson correlation coefficient r(df = n – 2) = the r statistic, p = p value.]
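The four pwcorr variants shown above could be produced with calls like these. The .05 threshold passed to star() is an assumption, chosen to match the significance level named earlier.

```stata
* Coefficients only
pwcorr netustm agea

* Add the p-value below each coefficient
pwcorr netustm agea, sig

* Star coefficients that are significant at the .05 level
pwcorr netustm agea, star(.05)

* Stars plus the number of observations behind each coefficient
pwcorr netustm agea, star(.05) obs
```

The degrees of freedom in the report come from the pairwise sample size: 27,395 - 2 = 27,393.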
Rules of thumb:
Perfect correlation: r = ±1
Strong correlation: r between ±.50 and ±1
Moderate correlation: r between ±.30 and ±.49
Weak correlation: Values below ±.30
No correlation: r = 0
spearman
Number of observations = 27,395
+-----------------+
| Key |
|-----------------|
| rho |
| Number of obs |
| p-value |
+-----------------+
| netustm agea
-------------+------------------
netustm | 1.0000
| 27395
| .
|
agea | -0.3465* 1.0000
| 27395 27395
| 0.0000 .
|
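A sketch of a call that would give output in this layout. The stats() list mirrors the key box above, and the .05 star threshold is again an assumption.

```stata
* Spearman rank correlation with rho, sample size, and p-value
spearman netustm agea, stats(rho obs p) star(.05)
```

Because Spearman's rho works on ranks, it does not rely on normally distributed variables.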
Comparing the mean values of two independent groups.
Hypotheses:
\(H_0\): The means of the two groups are equal.
\(H_1\): The means of the two groups are not equal.
E.g., comparing test scores of two separate groups of students.
Now let’s test the relationship between the continuous variable agea and the binary variable vote.
label list vote
* drop the additional category "Not eligible to vote" to make it binary
drop if vote == 3
vote:
1 Yes
2 No
3 Not eligible to vote
.a Refusal
.b Don't know
.c No answer
(2,594 observations deleted)
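With vote reduced to two categories, the age distributions of the two groups can be compared visually. The following is only a sketch; bwidth(3) is an arbitrary starting value for the kernel bandwidth.

```stata
* One histogram panel per group
twoway histogram agea, by(vote)

* Same panels with a kernel density estimate overlaid;
* larger bwidth() values give a smoother curve
twoway (histogram agea) (kdensity agea, bwidth(3)), by(vote)
```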
twoway histogram
twoway histogram with a fitted density estimate
bwidth to make it pretty. Try some different values by yourself.
If equal variances (homoscedasticity/homogeneity of variances)?
robvar measurement_variable, by(grouping_variable)
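Filled in with the two variables of this example, the template becomes:

```stata
* Levene's robust test for equal variances of age across vote groups
robvar agea, by(vote)
```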
Voted last | Summary of Age of respondent,
national | calculated
election | Mean Std. dev. Freq.
------------+------------------------------------
Yes | 53.687204 16.845207 26,602
No | 48.287032 18.984173 7,696
------------+------------------------------------
Total | 52.47548 17.493509 34,298
W0 = 280.19467 df(1, 34296) Pr > F = 0.00000000
W50 = 275.55155 df(1, 34296) Pr > F = 0.00000000
W10 = 277.58440 df(1, 34296) Pr > F = 0.00000000
We see that the standard deviation in age is higher for nonvoters compared to voters. We still want to know if this difference is statistically significant:
W0 = 280.19: This is the test statistic for Levene’s test centered at the mean. The corresponding p < .001
W50 = 275.55: Centered at the median. The corresponding p < .001
W10 = 277.58: Centered using the 10% trimmed mean. This means that the top 5% and bottom 5% of values are trimmed out so they don’t overly influence the test. The corresponding p < .001
The p-value for each version of Levene’s test is below .001. This indicates that there is a statistically significant difference in the variances of age between voters and nonvoters in our sample. We therefore need the t-test with unequal variances.
ttest with equal variances
Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
Yes | 26,602 53.6872 .1032807 16.84521 53.48477 53.88964
No | 7,696 48.28703 .2164009 18.98417 47.86283 48.71124
---------+--------------------------------------------------------------------
Combined | 34,298 52.47548 .0944588 17.49351 52.29034 52.66062
---------+--------------------------------------------------------------------
diff | 5.400172 .2245414 4.960063 5.84028
------------------------------------------------------------------------------
diff = mean(Yes) - mean(No) t = 24.0498
H0: diff = 0 Degrees of freedom = 34296
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000
The 26,602 participants who reported voting (M = 53.69, SD = 16.85) were on average 5.40 years older than the 7,696 participants who reported not voting (M = 48.29, SD = 18.98), t(34296) = 24.05, p < .001. We reject \(H_0\).
Conclusion: there is a statistically significant difference in mean age between those who voted and those who did not, with the voter group having the higher mean age.
ttest with unequal variances, adding the unequal option
Two-sample t test with unequal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
Yes | 26,602 53.6872 .1032807 16.84521 53.48477 53.88964
No | 7,696 48.28703 .2164009 18.98417 47.86283 48.71124
---------+--------------------------------------------------------------------
Combined | 34,298 52.47548 .0944588 17.49351 52.29034 52.66062
---------+--------------------------------------------------------------------
diff | 5.400172 .2397838 4.930154 5.870189
------------------------------------------------------------------------------
diff = mean(Yes) - mean(No) t = 22.5210
H0: diff = 0 Satterthwaite's degrees of freedom = 11428.3
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000
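As a cross-check that needs no data file, the unequal-variances test can be rebuilt from the group sizes, means, and standard deviations printed above, using Stata's immediate command ttesti. This is a sketch only. The dataset commands in the comments are a guess at what produced the two outputs.

```stata
* With the lesson data loaded, the two outputs above would come from
* something like:
*   ttest agea, by(vote)
*   ttest agea, by(vote) unequal

* Immediate form: n, mean, sd for group 1 (Yes), then group 2 (No)
ttesti 26602 53.6872 16.84521 7696 48.28703 18.98417, unequal

* Stored results: t statistic and Satterthwaite degrees of freedom
display "t  = " %7.4f r(t)
display "df = " %9.1f r(df_t)
```

The results should agree with the table above up to the rounding of the inputs.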
What about comparing means for more than two groups (non-binary)?
One-way ANOVA (Analysis of Variance)
Hypotheses:
\(H_0\): there is no difference between the group means.
\(H_1\): at least one group is different from the others.
Let’s test the relationship between age (agea) and party closeness in Italy (prtcleit).
(32,586 observations deleted)
Which party feel closer to, Italy | Freq. Percent Cum.
-------------------------------------+-----------------------------------
Movimento 5 Stelle | 96 16.96 16.96
Partido Democratico | 184 32.51 49.47
Lega | 76 13.43 62.90
Forza Italia | 55 9.72 72.61
Fratelli d'Italia con Giorgia Meloni | 104 18.37 90.99
Liberi e Uguali (LEU) | 8 1.41 92.40
+ Europa | 2 0.35 92.76
Noi con l'Italia - UDC | 4 0.71 93.46
Potere al popolo | 6 1.06 94.52
SVP-PATT | 8 1.41 95.94
Altro | 3 0.53 96.47
Italia Viva | 4 0.71 97.17
Unione Valdotaine | 1 0.18 97.35
Partito Comunista | 5 0.88 98.23
Vox Italia | 1 0.18 98.41
Partito Socialista | 2 0.35 98.76
Verdi/ Europa Verde | 1 0.18 98.94
Italexit | 3 0.53 99.47
Azione di Calenda | 3 0.53 100.00
-------------------------------------+-----------------------------------
Total | 566 100.00
prtcleit:
1 Movimento 5 Stelle
2 Partido Democratico
3 Lega
4 Forza Italia
5 Fratelli d'Italia con Giorgia Meloni
6 Liberi e Uguali (LEU)
7 + Europa
8 Noi con l'Italia - UDC
9 Potere al popolo
10 Casapound Italia
11 Italia Europa Insieme
12 Il popolo della famiglia
13 Civica Popolare Lorenzin
14 SVP-PATT
31 Altro
33 Italia Viva
34 Unione Valdotaine
35 Partito Comunista
36 Vox Italia
37 Partito Socialista
38 Verdi/ Europa Verde
39 Italexit
40 Azione di Calenda
.a Not applicable
.b Refusal
.c Don't know
.d No answer
Which party |
feel closer |
to, Italy | Freq. Percent Cum.
------------+-----------------------------------
1 | 96 16.96 16.96
2 | 184 32.51 49.47
3 | 76 13.43 62.90
4 | 55 9.72 72.61
5 | 104 18.37 90.99
6 | 8 1.41 92.40
7 | 2 0.35 92.76
8 | 4 0.71 93.46
9 | 6 1.06 94.52
14 | 8 1.41 95.94
31 | 3 0.53 96.47
33 | 4 0.71 97.17
34 | 1 0.18 97.35
35 | 5 0.88 98.23
36 | 1 0.18 98.41
37 | 2 0.35 98.76
38 | 1 0.18 98.94
39 | 3 0.53 99.47
40 | 3 0.53 100.00
------------+-----------------------------------
Total | 566 100.00
(1,916 observations deleted)
Which party feel closer to, Italy | Freq. Percent Cum.
-------------------------------------+-----------------------------------
Movimento 5 Stelle | 96 18.64 18.64
Partido Democratico | 184 35.73 54.37
Lega | 76 14.76 69.13
Forza Italia | 55 10.68 79.81
Fratelli d'Italia con Giorgia Meloni | 104 20.19 100.00
-------------------------------------+-----------------------------------
Total | 515 100.00
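The data preparation behind these tables might be sketched as below. Two assumptions are made: that `cntry` holds the country code "IT", and that the five largest parties are selected by their codes 1 to 5.

```stata
* Restrict to the Italian sample (assumed country variable and code)
keep if cntry == "IT"

* Party closeness with labels, the label definitions, and raw codes
tabulate prtcleit
label list prtcleit
tabulate prtcleit, nolabel

* Keep the five largest parties (codes 1-5); this also drops
* respondents with missing values on prtcleit
keep if inrange(prtcleit, 1, 5)
tabulate prtcleit
```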
If equal variances (homoscedasticity)?
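The same Levene check as in the t-test section, now with the party variable as the grouping variable:

```stata
* Levene's robust test for equal variances of age across parties
robvar agea, by(prtcleit)
```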
Which party | Summary of Age of respondent,
feel closer | calculated
to, Italy | Mean Std. dev. Freq.
------------+------------------------------------
Movimento | 47.479167 15.879798 96
Partido D | 58.429348 16.45213 184
Lega | 56.746667 15.347412 75
Forza Ita | 55.527273 17.415249 55
Fratelli | 57.490196 15.174722 102
------------+------------------------------------
Total | 55.630859 16.482288 512
W0 = 1.02265803 df(4, 507) Pr > F = 0.39499018
W50 = 0.91322224 df(4, 507) Pr > F = 0.45588719
W10 = 1.00011288 df(4, 507) Pr > F = 0.40700853
All p-values are well above .05, so we cannot reject the null hypothesis that the variances are equal. In other words, there is no statistically significant difference in the variances of age among parties in the sample. We proceed with the equal-variance ANOVA.
oneway
Analysis of variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 8266.80661 4 2066.70165 8.03 0.0000
Within groups 130554.426 507 257.503798
------------------------------------------------------------------------
Total 138821.232 511 271.665817
Bartlett's equal-variances test: chi2(4) = 1.9010 Prob>chi2 = 0.754
Report: a one-way ANOVA was performed to compare mean age across the five party-closeness groups, F(4, 507) = 8.03, p < .001. We reject \(H_0\): mean age differs significantly between at least two party groups.
[F(between groups df, within groups df) = F-value, p = p-value]
P.S.: the table also reports Bartlett’s test for equal variances, an alternative check of whether the variances across groups are equal.
Difference between Levene’s test and Bartlett’s test: both test the assumption of equal variances. However, Bartlett’s test requires approximately normal distributions, while Levene’s test does not assume normality and is therefore more robust when the data are not normally distributed.
From the results so far, we know that at least one of the group means is different from the other group means. Next, we can do pairwise comparisons of means to see how groups differ.
oneway command with the multiple-comparison option bonferroni
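In Stata's oneway syntax the response variable comes first and the grouping variable second, so the two ANOVA runs in this section would look like:

```stata
* Plain one-way ANOVA of age by party closeness
oneway agea prtcleit

* Same model with Bonferroni-adjusted pairwise comparisons
oneway agea prtcleit, bonferroni
```

In the comparison table, each cell shows the row mean minus the column mean, with the Bonferroni-adjusted p-value underneath. Here only the comparisons involving Movimento 5 Stelle fall below .05.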
Analysis of variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 8266.80661 4 2066.70165 8.03 0.0000
Within groups 130554.426 507 257.503798
------------------------------------------------------------------------
Total 138821.232 511 271.665817
Bartlett's equal-variances test: chi2(4) = 1.9010 Prob>chi2 = 0.754
Comparison of Age of respondent, calculated
by Which party feel closer to, Italy
(Bonferroni)
Row Mean-|
Col Mean | Moviment Partido Lega Forza It
---------+--------------------------------------------
Partido | 10.9502
| 0.000
|
Lega | 9.2675 -1.68268
| 0.002 1.000
|
Forza It | 8.04811 -2.90208 -1.21939
| 0.032 1.000 1.000
|
Fratelli | 10.011 -.939152 .743529 1.96292
| 0.000 1.000 1.000 1.000
For small samples (expected frequency below 5 in any cell), use Fisher’s exact test.
Hypotheses:
\(H_0\): there is no association between the variables (independent).
\(H_1\): there is an association between the variables (dependent).
Let’s test if vote is associated with gndr.
We want to test the variables in the original data set before we dropped countries.
vote:
1 Yes
2 No
3 Not eligible to vote
.a Refusal
.b Don't know
.c No answer
(2,594 observations deleted)
chi2 option in tabulate
| Voted last national
| election
Gender | Yes No | Total
-----------+----------------------+----------
Male | 12,503 3,465 | 15,968
Female | 14,291 4,299 | 18,590
-----------+----------------------+----------
Total | 26,794 7,764 | 34,558
Pearson chi2(1) = 10.0231 Pr = 0.002
Variable order: row variable first, then column variable
row to show within-row relative frequencies
+----------------+
| Key |
|----------------|
| frequency |
| row percentage |
+----------------+
| Voted last national
| election
Gender | Yes No | Total
-----------+----------------------+----------
Male | 12,503 3,465 | 15,968
| 78.30 21.70 | 100.00
-----------+----------------------+----------
Female | 14,291 4,299 | 18,590
| 76.87 23.13 | 100.00
-----------+----------------------+----------
Total | 26,794 7,764 | 34,558
| 77.53 22.47 | 100.00
Pearson chi2(1) = 10.0231 Pr = 0.002
\(\chi^2\)(1, N = 34558) = 10.02, p = .002. Since the p-value is below .05, we reject \(H_0\): there is an association between gender and voting behavior.
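The test can also be reproduced without the data file by typing the four cell counts into Stata's immediate command tabi. This is a sketch. The dataset command in the comment is a guess at what produced the tables above.

```stata
* With the lesson data loaded, the tables above would come from
* something like:
*   tabulate gndr vote, chi2 row

* Immediate form: rows separated by a backslash (Male, then Female)
tabi 12503 3465 \ 14291 4299, chi2 row

* Stored results: Pearson chi-squared and its p-value
display "chi2 = " %7.4f r(chi2)
display "p    = " %6.4f r(p)
```

For tables with very small expected counts, the exact option requests Fisher's exact test instead of chi2.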
Always nice to visualize your data!
Parametric tests: normally distributed samples
Non-parametric tests: not normally distributed samples
Most parametric tests have their nonparametric counterparts, e.g.:
Parametric test | Nonparametric test |
---|---|
Pearson correlation | Spearman correlation |
Two-sample t-test | Mann-Whitney U test |
One-way ANOVA | Kruskal-Wallis test |
– | Chi-squared test |
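For reference, the nonparametric counterparts in this table are available in Stata under the commands sketched below, applied to today's variables. Each line assumes the data are in the state used in the matching parametric section: vote reduced to two groups for the first, the five-party Italian subsample for the second.

```stata
* Mann-Whitney U test (Wilcoxon rank-sum) instead of the t-test
ranksum agea, by(vote)

* Kruskal-Wallis test instead of one-way ANOVA
kwallis agea, by(prtcleit)

* Spearman correlation instead of Pearson correlation
spearman netustm agea
```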
In case you need any additional tests that we didn’t have time to cover in class, you can look them up on Google (your bff in coding).
Variable | Binary | Nominal/ordinal | Interval/ratio |
---|---|---|---|
Binary | Chi-squared | Chi-squared | T-test |
Nominal/ordinal | Chi-squared | Chi-squared | ANOVA |
Interval/ratio | T-test | ANOVA | Correlation |
Due date: by 13.Oct.2024 23:59
Work individually
Use a data set we have downloaded (ESS_italy/ESS10)
(P.s., more information about the variables can be found in the corresponding codebook, which is available to download from the website)
Choose two variables that interest you and describe each of them
Visualize the relationship between them
Choose an appropriate test to check their association
By the end, we want to receive two files:
I expect to see in the PDF file:
Name the file surname_quanlab_2
Upload these two files to Moodle “Lab Materials” section
How would you rate this week’s lessons on a scale of cats?
Image source: https://x.com/catecoin/status/1742834810140402105