Hypothesis Testing in Stata: Bivariate Tests


Giovanni Minchio
Yuxin Zhang


Quantitative Methods Lab, Lesson 2.2
10 Oct. 2024

What is bivariate analysis?

Bivariate analysis enables the examination of the association between two variables, helping to identify whether a correlation exists and assess its strength. For researchers, it can serve as an initial check before proceeding to more complex analyses.

Recap

A hypothesis is a statement about a population parameter subject to verification. Data are then used to test the validity of the statement.

Hypothesis testing is a procedure based on sample evidence and probability theory to assess whether the hypothesis is plausible.

Null & alternative hypotheses



Which are independent and dependent variables?

Which test? (To be answered later)

Types of bivariate analysis

Recall: steps of hypothesis testing

Stating your null (\(H_0\)) and alternative (\(H_1\)) hypotheses

Selecting significance level and the test to be used

In the social sciences, typically the .05 (5 percent) level

Image source: https://www.abtasty.com/blog/type-1-and-type-2-errors/

Check assumptions for the chosen test

If assumptions are not met, the test may yield inaccurate results

Calculate the test statistic

Deciding whether to reject or fail to reject null hypothesis (\(H_0\))

Reporting and interpreting results (and maybe visualizing them)

Deciding if further analysis is needed

Things to check before tests

  • How many observations (sample size)?

  • Key assumptions:

    • Random/representative sample

    • Measurement scales of variables

    • Normal/near-to-normal distributions

    • No outliers

Let’s learn them with examples.

  • set directory and load data
cd "/Users/yuxin/Documents/STATALAB2024-25"
use "datafile/ESS10.dta", clear 
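
The checks listed under “Things to check before tests” can be sketched with a few built-in commands once the data are loaded. This is only a starting point, and the choice of variables (netustm, agea, vote, gndr) is an assumption based on what this lab uses later.

```stata
* sample size
count
* measurement scales: storage types, variable labels, value labels
describe netustm agea vote gndr
* how many values are missing in the numeric variables
misstable summarize netustm agea
* percentiles, skewness and kurtosis give a first impression of the distribution
summarize netustm agea, detail
* points beyond the whiskers are potential outliers
graph box netustm
```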

Scatterplot

A scatter plot reveals potential relationships between changes in two numeric variables.

Let’s see how internet use (netustm) is related to age (agea).

scatter netustm agea || lfit netustm agea

Variables for today’s lab

After loading the data, you may want to keep only the variables we will use today in your Stata workspace (tidier, less RAM usage, faster computations).

keep cntry netustm agea vote prtcleit gndr

Correlation: two interval/ratio variables

Hypotheses:

  • Null Hypothesis (\(H_0\)): there is no correlation between the variables.

  • Alternative Hypothesis (\(H_1\)): there is a correlation between the variables.

The correlation coefficient ranges from +1 through 0 to −1.

Pearson’s vs. Spearman’s correlation

  1. Pearson’s r, or Pearson’s correlation coefficient, describes the linear relationship between two numeric variables.
  • Additional assumption: linear relationship
  2. Spearman’s rho, or Spearman’s rank correlation coefficient, is an alternative to Pearson’s r. Use it when your data fail to meet the assumptions of Pearson’s r, e.g., at least one of your variables is ordinal, or the distributions are non-normal.
  • Additional assumption: monotonic relationship

Image source: https://www.scribbr.com/statistics/correlation-coefficient/

Distributions

  • check distributions using histogram
histogram netustm

histogram agea

  • change the number of bins with bin() to show less/more detail
histogram agea, bin(15)

histogram agea, bin(100)

  • overlay a kernel density estimate
hist agea, bin(100) kdensity

Add a title

hist agea, bin(100) kdensity title("Distribution of Age")
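
Histograms can be complemented with other built-in diagnostics. The sketch below is an optional addition, not part of the lab’s required steps. With samples as large as ESS10, formal normality tests tend to reject even for small departures, so the plots are usually more informative.

```stata
* quantile-normal plot: points close to the diagonal suggest approximate normality
qnorm agea
* skewness and kurtosis test for normality
sktest netustm agea
```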

  • Pearson’s correlation using correlate
correlate netustm agea
(obs=27,395)

             |  netustm     agea
-------------+------------------
     netustm |   1.0000
        agea |  -0.2968   1.0000
  • use pwcorr, “pw” stands for “pairwise”
pwcorr netustm agea
             |  netustm     agea
-------------+------------------
     netustm |   1.0000 
        agea |  -0.2968   1.0000 
  • use pwcorr, with the p-value shown by the option sig
pwcorr netustm agea, sig
             |  netustm     agea
-------------+------------------
     netustm |   1.0000 
             |
             |
        agea |  -0.2968   1.0000 
             |   0.0000
             |
  • use pwcorr, with star() marking coefficients significant at the given level
pwcorr netustm agea, star(.05)
             |  netustm     agea
-------------+------------------
     netustm |   1.0000 
        agea |  -0.2968*  1.0000 
  • use pwcorr, with significance stars from star() and the sample size from obs
pwcorr netustm agea, star(.05) obs
             |  netustm     agea
-------------+------------------
     netustm |   1.0000 
             |    27598
             |
        agea |  -0.2968*  1.0000 
             |    27395    37319
             |

r(27393) = -.30, p < .001

[Report: Pearson correlation coefficient r(df = n – 2) = the r statistic, p = p value.]
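
The p-value for a Pearson correlation comes from a t statistic with n − 2 degrees of freedom. As a rough check, plugging in the rounded values from the output above (r = −0.2968, n = 27,395):

\[
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.2968\,\sqrt{27393}}{\sqrt{1-0.2968^2}} \approx -51.4
\]

A statistic this far from 0 corresponds to a p-value far below .001, which is why Stata prints 0.0000.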

Rules of thumb:

  • Perfect correlation: r = ±1

  • Strong correlation: r between ±.50 and ±1

  • Moderate correlation: r between ±.30 and ±.49

  • Weak correlation: r between 0 and ±.30 (exclusive)

  • No correlation: r = 0

  • Spearman’s rank correlation using spearman
spearman netustm agea, stats(rho obs p) star(0.05) matrix
Number of observations = 27,395

+-----------------+
|  Key            |
|-----------------|
|   rho           |
|   Number of obs |
|   p-value       |
+-----------------+

             |  netustm     agea
-------------+------------------
     netustm |   1.0000 
             |    27395 
             |        . 
             |
        agea |  -0.3465*  1.0000 
             |    27395    27395 
             |   0.0000        . 
             |

Two-sample t-test: one interval/ratio & one binary

Comparing the mean values of two independent groups.

Hypotheses:

  • \(H_0\): The means of the two groups are equal.

  • \(H_1\): The means of the two groups are not equal.

E.g., comparing test scores of two separate groups of students.

There are different t-tests, check them out:

https://www.scribbr.com/statistics/t-test/

Now let’s test the relationship between the continuous variable agea and the binary variable vote.

label list vote
* drop the additional category "Not eligible to vote" to make it binary
drop if vote == 3
vote:
           1 Yes
           2 No
           3 Not eligible to vote
          .a Refusal
          .b Don't know
          .c No answer

(2,594 observations deleted)

Check assumptions

  • check distribution of age in each group using twoway histogram
twoway histogram agea, discrete by(vote, total) 

  • add a twoway histogram with a fitted density estimate
twoway histogram agea, discrete by(vote, total) || kdensity agea

  • you can change the kernel bandwidth in bwidth to make it pretty. Try some different values by yourself.
twoway histogram agea, discrete by(vote, total) || kdensity agea, bwidth(15)

  • comparing means by boxplot
graph box agea, over(vote)

Check variances

Are the variances equal (homoscedasticity/homogeneity of variances)?

  • We can use Levene’s test for equal variance using robvar measurement_variable, by(grouping_variable)
robvar agea, by(vote)
 Voted last |    Summary of Age of respondent,
   national |             calculated
   election |        Mean   Std. dev.       Freq.
------------+------------------------------------
        Yes |   53.687204   16.845207      26,602
         No |   48.287032   18.984173       7,696
------------+------------------------------------
      Total |    52.47548   17.493509      34,298

W0  =  280.19467   df(1, 34296)     Pr > F = 0.00000000

W50 =  275.55155   df(1, 34296)     Pr > F = 0.00000000

W10 =  277.58440   df(1, 34296)     Pr > F = 0.00000000

We see that the standard deviation in age is higher for nonvoters compared to voters. We still want to know if this difference is statistically significant:

W0 = 280.19: the test statistic for Levene’s test centered at the mean. The corresponding p < .001.

W50 = 275.55: centered at the median. The corresponding p < .001.

W10 = 277.58: centered at the 10% trimmed mean, meaning that the top 5% and bottom 5% of values are trimmed so they don’t overly influence the test. The corresponding p < .001.

The p-value for each version of Levene’s test is below .001. This indicates a statistically significant difference in the variances of age between voters and nonvoters in our sample, so we need the t-test with unequal variances.
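
For comparison, Stata also offers the classical variance-ratio F test through sdtest. It is shown here only as a sketch: unlike Levene’s test it assumes normally distributed data, so robvar remains the safer choice for this sample.

```stata
* variance-ratio F test of equal standard deviations across the two vote groups
sdtest agea, by(vote)
```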

Now t-test,

  • simple two-sample t-test using ttest with equal variances
ttest agea, by(vote)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
     Yes |  26,602     53.6872    .1032807    16.84521    53.48477    53.88964
      No |   7,696    48.28703    .2164009    18.98417    47.86283    48.71124
---------+--------------------------------------------------------------------
Combined |  34,298    52.47548    .0944588    17.49351    52.29034    52.66062
---------+--------------------------------------------------------------------
    diff |            5.400172    .2245414                4.960063     5.84028
------------------------------------------------------------------------------
    diff = mean(Yes) - mean(No)                                   t =  24.0498
H0: diff = 0                                     Degrees of freedom =    34296

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

The 26,602 participants who reported voting (M = 53.69, SD = 16.85) had a statistically significantly higher mean age than the 7,696 participants who did not report voting (M = 48.29, SD = 18.98), with a mean difference of 5.40 years, t(34296) = 24.05, p < .001. We reject \(H_0\).

Conclude: there is a statistically significant difference in the average ages between those who voted and those who did not, with the voter group having a higher mean age.

  • ttest with unequal variances, adding the option unequal
ttest agea, by(vote) unequal
Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
     Yes |  26,602     53.6872    .1032807    16.84521    53.48477    53.88964
      No |   7,696    48.28703    .2164009    18.98417    47.86283    48.71124
---------+--------------------------------------------------------------------
Combined |  34,298    52.47548    .0944588    17.49351    52.29034    52.66062
---------+--------------------------------------------------------------------
    diff |            5.400172    .2397838                4.930154    5.870189
------------------------------------------------------------------------------
    diff = mean(Yes) - mean(No)                                   t =  22.5210
H0: diff = 0                     Satterthwaite's degrees of freedom =  11428.3

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
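
With unequal variances, the standard error of the difference is built from the two group standard errors rather than from a pooled standard deviation. Using the values in the output above:

\[
t = \frac{\bar{x}_{\text{Yes}} - \bar{x}_{\text{No}}}{\sqrt{SE_{\text{Yes}}^2 + SE_{\text{No}}^2}} = \frac{5.400172}{\sqrt{0.1032807^2 + 0.2164009^2}} = \frac{5.400172}{0.2397838} \approx 22.52
\]

In the usual reporting format this gives t(11428.3) = 22.52, p < .001, so the conclusion does not change: voters are on average older than nonvoters.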



What about comparing means for more than two groups (non-binary)?

One-way ANOVA: one interval/ratio & one nominal/ordinal

(Analysis of Variance)

Hypotheses:

  • \(H_0\): there is no difference between the group means.

  • \(H_1\): at least one group mean is different from the others.

Let’s test the relationship between age (agea) and the party respondents feel closest to in Italy (prtcleit).

Recoding

  • keep only data from Italy
keep if cntry == "IT"
(32,586 observations deleted)
  • check frequencies of parties
tab prtcleit
   Which party feel closer to, Italy |      Freq.     Percent        Cum.
-------------------------------------+-----------------------------------
                  Movimento 5 Stelle |         96       16.96       16.96
                 Partido Democratico |        184       32.51       49.47
                                Lega |         76       13.43       62.90
                        Forza Italia |         55        9.72       72.61
Fratelli d'Italia con Giorgia Meloni |        104       18.37       90.99
               Liberi e Uguali (LEU) |          8        1.41       92.40
                            + Europa |          2        0.35       92.76
              Noi con l'Italia - UDC |          4        0.71       93.46
                    Potere al popolo |          6        1.06       94.52
                            SVP-PATT |          8        1.41       95.94
                               Altro |          3        0.53       96.47
                         Italia Viva |          4        0.71       97.17
                   Unione Valdotaine |          1        0.18       97.35
                   Partito Comunista |          5        0.88       98.23
                          Vox Italia |          1        0.18       98.41
                  Partito Socialista |          2        0.35       98.76
                 Verdi/ Europa Verde |          1        0.18       98.94
                            Italexit |          3        0.53       99.47
                   Azione di Calenda |          3        0.53      100.00
-------------------------------------+-----------------------------------
                               Total |        566      100.00
  • let’s keep only these bigger categories for analysis (n>50)
* check label list
label list prtcleit
prtcleit:
           1 Movimento 5 Stelle
           2 Partido Democratico
           3 Lega
           4 Forza Italia
           5 Fratelli d'Italia con Giorgia Meloni
           6 Liberi e Uguali (LEU)
           7 + Europa
           8 Noi con l'Italia - UDC
           9 Potere al popolo
          10 Casapound Italia
          11 Italia Europa Insieme
          12 Il popolo della famiglia
          13 Civica Popolare Lorenzin
          14 SVP-PATT
          31 Altro
          33 Italia Viva
          34 Unione Valdotaine
          35 Partito Comunista
          36 Vox Italia
          37 Partito Socialista
          38 Verdi/ Europa Verde
          39 Italexit
          40 Azione di Calenda
          .a Not applicable
          .b Refusal
          .c Don't know
          .d No answer
tab prtcleit, nolabel
Which party |
feel closer |
  to, Italy |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         96       16.96       16.96
          2 |        184       32.51       49.47
          3 |         76       13.43       62.90
          4 |         55        9.72       72.61
          5 |        104       18.37       90.99
          6 |          8        1.41       92.40
          7 |          2        0.35       92.76
          8 |          4        0.71       93.46
          9 |          6        1.06       94.52
         14 |          8        1.41       95.94
         31 |          3        0.53       96.47
         33 |          4        0.71       97.17
         34 |          1        0.18       97.35
         35 |          5        0.88       98.23
         36 |          1        0.18       98.41
         37 |          2        0.35       98.76
         38 |          1        0.18       98.94
         39 |          3        0.53       99.47
         40 |          3        0.53      100.00
------------+-----------------------------------
      Total |        566      100.00
keep if prtcleit < 6
(1,916 observations deleted)
  • check if it worked
tab prtcleit
   Which party feel closer to, Italy |      Freq.     Percent        Cum.
-------------------------------------+-----------------------------------
                  Movimento 5 Stelle |         96       18.64       18.64
                 Partido Democratico |        184       35.73       54.37
                                Lega |         76       14.76       69.13
                        Forza Italia |         55       10.68       79.81
Fratelli d'Italia con Giorgia Meloni |        104       20.19      100.00
-------------------------------------+-----------------------------------
                               Total |        515      100.00

Distributions

  • check distributions by parties
twoway histogram agea, discrete by(prtcleit, total) || kdensity agea, bwidth(10)

  • box plot
graph box agea, over(prtcleit)

Check variances

Are the variances equal (homoscedasticity)?

  • Levene’s test
robvar agea, by(prtcleit)
Which party |    Summary of Age of respondent,
feel closer |             calculated
  to, Italy |        Mean   Std. dev.       Freq.
------------+------------------------------------
  Movimento |   47.479167   15.879798          96
  Partido D |   58.429348    16.45213         184
       Lega |   56.746667   15.347412          75
  Forza Ita |   55.527273   17.415249          55
  Fratelli  |   57.490196   15.174722         102
------------+------------------------------------
      Total |   55.630859   16.482288         512

W0  =  1.02265803   df(4, 507)     Pr > F = 0.39499018

W50 =  0.91322224   df(4, 507)     Pr > F = 0.45588719

W10 =  1.00011288   df(4, 507)     Pr > F = 0.40700853

Well, all p-values are pretty large, and we cannot reject the null hypothesis which states that the variances are equal. In other words, there is NO statistically significant difference in the variances of age among parties in the sample. We proceed with equal variance ANOVA then.

One-way ANOVA

  • using oneway
oneway agea prtcleit
                        Analysis of variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      8266.80661      4   2066.70165      8.03     0.0000
 Within groups      130554.426    507   257.503798
------------------------------------------------------------------------
    Total           138821.232    511   271.665817

Bartlett's equal-variances test: chi2(4) =   1.9010    Prob>chi2 = 0.754

Report: a one-way ANOVA was performed to compare mean age across the groups defined by party closeness, F(4, 507) = 8.03, p < .001. We reject \(H_0\): mean age differs significantly between at least two party groups.

[F(between groups df, within groups df) = F-value, p = p-value]
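
The F statistic is the ratio of the between-groups to the within-groups mean square, and each mean square is a sum of squares divided by its degrees of freedom. With the values from the table above:

\[
F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{8266.80661 / 4}{130554.426 / 507} = \frac{2066.70165}{257.503798} \approx 8.03
\]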

P.s., in the table we see an alternative Bartlett’s test for equal variances, which also checks whether the variances across different groups are equal or not.

Difference between Levene’s test and Bartlett’s test: both are used to test the assumption of equal variances. However, Bartlett’s test requires more or less normal distributions, while Levene’s test does not assume normality and is therefore more robust when the data are not normally distributed.

From the results so far, we know that at least one of the group means is different from the other group means. Next, we can do pairwise comparisons of means to see how groups differ.

  • oneway command with multiple-comparison option using bonferroni
oneway agea prtcleit, bonferroni
                        Analysis of variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      8266.80661      4   2066.70165      8.03     0.0000
 Within groups      130554.426    507   257.503798
------------------------------------------------------------------------
    Total           138821.232    511   271.665817

Bartlett's equal-variances test: chi2(4) =   1.9010    Prob>chi2 = 0.754

                 Comparison of Age of respondent, calculated
                     by Which party feel closer to, Italy
                                (Bonferroni)
Row Mean-|
Col Mean |   Moviment   Partido        Lega   Forza It
---------+--------------------------------------------
Partido  |    10.9502
         |      0.000
         |
    Lega |     9.2675   -1.68268
         |      0.002      1.000
         |
Forza It |    8.04811   -2.90208   -1.21939
         |      0.032      1.000      1.000
         |
Fratelli |     10.011   -.939152    .743529    1.96292
         |      0.000      1.000      1.000      1.000
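
Each cell of the comparison table shows the row mean minus the column mean, with the Bonferroni-adjusted p-value underneath. Here only the comparisons involving Movimento 5 Stelle fall below .05, so its supporters are on average younger than those of the other four parties. Bonferroni is one of several adjustments built into oneway; the sketch below shows the others, and which one to prefer is a judgment call that this lab does not settle.

```stata
* Scheffe and Sidak adjustments for the same pairwise comparisons
oneway agea prtcleit, scheffe
oneway agea prtcleit, sidak
* the tabulate option adds a table of group means and standard deviations
oneway agea prtcleit, bonferroni tabulate
```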

Pearson’s chi-squared test (\(\chi^2\)): two nominal/ordinal

  • Additional assumptions:
    • levels/categories of the variables are mutually exclusive
    • an expected count of at least 5 in each cell of the table

For very small cell counts (fewer than 5 expected observations per cell), use Fisher’s exact test.
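
In Stata, Fisher’s exact test is available through the exact option of tabulate. A minimal sketch using the immediate form tabi, with made-up counts for a small 2×2 table (these numbers are hypothetical and not taken from ESS10):

```stata
* hypothetical cell counts, for illustration only
tabi 3 1 \ 1 4, exact
```

With variables in memory, the equivalent would be tabulate rowvar colvar, exact (rowvar and colvar are placeholder names).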

Hypotheses:

  • \(H_0\): there is no association between the variables (independent).

  • \(H_1\): there is an association between the variables (dependent).

Let’s test if vote is associated with gndr.

We want to test the variables in the original data set before we dropped countries.

  • so first reload original data
use "datafile/ESS10.dta", clear 
  • recode vote
label list vote
drop if vote == 3
vote:
           1 Yes
           2 No
           3 Not eligible to vote
          .a Refusal
          .b Don't know
          .c No answer

(2,594 observations deleted)
  • chi-squared of independence using the option chi2 in tabulate
tabulate gndr vote, chi2
           |  Voted last national
           |       election
    Gender |       Yes         No |     Total
-----------+----------------------+----------
      Male |    12,503      3,465 |    15,968 
    Female |    14,291      4,299 |    18,590 
-----------+----------------------+----------
     Total |    26,794      7,764 |    34,558 

          Pearson chi2(1) =  10.0231   Pr = 0.002

Variable order in tabulate: row variable first, column variable second.

  • option row to show within-row relative frequencies
tabulate gndr vote, row chi2
+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |  Voted last national
           |       election
    Gender |       Yes         No |     Total
-----------+----------------------+----------
      Male |    12,503      3,465 |    15,968 
           |     78.30      21.70 |    100.00 
-----------+----------------------+----------
    Female |    14,291      4,299 |    18,590 
           |     76.87      23.13 |    100.00 
-----------+----------------------+----------
     Total |    26,794      7,764 |    34,558 
           |     77.53      22.47 |    100.00 

          Pearson chi2(1) =  10.0231   Pr = 0.002

\(\chi^2\)(df = 1, N = 34558) = 10.02, p = .002. Since the p-value is below .05, we reject \(H_0\): there is an association between gender and voting behavior.
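
To see where the statistic comes from, the expected counts under independence can be displayed with the expected option. Each expected count is row total × column total / N; for men who voted this is 15,968 × 26,794 / 34,558 ≈ 12,380.5, against an observed 12,503. A sketch using tabi with the cell counts copied from the table above:

```stata
* observed counts from the gndr-by-vote table; prints expected frequencies as well
tabi 12503 3465 \ 14291 4299, chi2 expected
```

On the data in memory, tabulate gndr vote, chi2 expected should produce the same table.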

Always nice to visualize your data!

  • bar chart in percent
graph bar (percent), over(gndr) by(vote)

  • bar chart in counts
graph bar (count), over(gndr) by(vote)

  • you can change bar colors
graph bar, over(vote) over(gndr) ascategory asyvars bar(1, fcolor(red)) bar(2, fcolor(green))

  • other colors (search on Google for something like “color palettes Stata”)
graph bar, over(vote) over(gndr) ascategory asyvars bar(1, fcolor(ebblue)) bar(2, fcolor(sandb))

Parametric & nonparametric tests

  • Parametric tests: assume normally distributed data

  • Non-parametric tests: do not assume normally distributed data


Most parametric tests have their nonparametric counterparts, e.g.:


Parametric test        Nonparametric test
---------------------  ---------------------
Pearson correlation    Spearman correlation
Two-sample t-test      Mann-Whitney U test
One-way ANOVA          Kruskal-Wallis test
-                      Chi-squared test
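
The nonparametric counterparts in the table are built into Stata. A sketch applied to this lab’s variables; the kwallis line assumes the Italian subset with the five largest parties from the ANOVA section is in memory, not the reloaded full data set.

```stata
* Spearman's rank correlation (already used above)
spearman netustm agea
* Mann-Whitney U test, called the Wilcoxon rank-sum test in Stata
ranksum agea, by(vote)
* Kruskal-Wallis test
kwallis agea, by(prtcleit)
```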




In case you need any additional tests that we didn’t have time to cover in class, you can look them up on Google (your bff in coding).

Summary: tests learned today

Variable          Binary        Nominal/ordinal   Interval/ratio
----------------  ------------  ----------------  --------------
Binary            Chi-squared   Chi-squared       T-test
Nominal/ordinal   Chi-squared   Chi-squared       ANOVA
Interval/ratio    T-test        ANOVA             Correlation

Mandatory Assignment 2

Due date: 13 Oct. 2024, 23:59

  • Work individually

  • Use a data set we have downloaded (ESS_italy/ESS10)

(P.s., more information about the variables can be found in the corresponding codebook, which is available to download from the website)

  • Choose two variables of interest to you and describe each of them

  • Visualize the relationship between them

  • Choose an appropriate test to check their association

  • By the end, we want to receive two files:

    • a do-file, where you record your code
    • a PDF, where you organize your report with selected Stata output

    I expect to see in the PDF file:

    • your hypotheses (\(H_0\) and \(H_1\))
    • variable description
    • key assumption check
    • visualization
    • test output and its report
    • one line of conclusion
  • Name the file surname_quanlab_2

  • Upload these two files to Moodle “Lab Materials” section

Ciao. That’s all, pals.

How would you rate this week’s lessons on a scale of cats?

Image source: https://x.com/catecoin/status/1742834810140402105