Hypothesis Testing in Stata: Bivariate Tests

Giovanni Minchio giovanni.minchio@unitn.it
Yuxin Zhang yuxin.zhang@unitn.it

Quantitative Methods Lab, Lesson 2.2
10 Oct. 2024

What is bivariate analysis?

Bivariate analysis enables the examination of the association between two variables, helping to identify whether a correlation exists and assess its strength. For researchers, it can serve as an initial check before proceeding to more complex analyses.

Recap

What is a hypothesis?

A hypothesis is a statement about a population parameter that is subject to verification. Data are then used to test the validity of the statement.

What is hypothesis testing?

A procedure based on sample evidence and probability theory to assess whether the hypothesis is plausible.

Null & alternative hypotheses

Null hypothesis (\(H_0\)): ?
Alternative hypothesis (\(H_1\)): Elderly people visit hospitals for medical check-ups more often than younger people.

Null hypothesis (\(H_0\)): There is no difference in voting behavior between those who went to college and those who did not.
Alternative hypothesis (\(H_1\)): ?

Which are independent and dependent variables?

Which test? (To be answered later)

Types of bivariate analysis

Scatter plot
Correlation
Two-sample t-test
One-way ANOVA (Analysis of Variance)
Chi-squared test

Recall: steps of hypothesis testing

Stating your null (\(H_0\)) and alternative (\(H_1\)) hypotheses

↓

Selecting significance level and the test to be used

In the social sciences, typically the .05 (5 percent) level

Image source: https://www.abtasty.com/blog/type-1-and-type-2-errors/

↓

Check assumptions for the chosen test

If assumptions are not met, the test may yield inaccurate results

↓

Calculate the test statistic

↓

Deciding whether to reject or fail to reject null hypothesis (\(H_0\))

↓

Reporting and interpreting results (and maybe visualizing them)

↓

Deciding if further analysis is needed

Things to check before tests

How many observations (sample size)?
Key assumptions:
- Random/representative sample
- Measurement scales of variables
- Normal/near-to-normal distributions
- No outliers

Let’s learn them with examples.

set directory and load data

cd "/Users/yuxin/Documents/STATALAB2024-25"
use "datafile/ESS10.dta", clear

Scatterplot

A scatter plot reveals potential relationships between changes in two numeric variables.

Let’s see how internet use netustm is related to age agea

scatter netustm agea || lfit netustm agea

Variables for today’s lab

After loading the data, you may want to retain only the variables we will use today in your Stata work space. (more tidy/less RAM usage/faster computations)

keep cntry netustm agea vote prtcleit gndr
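Before running any test, it helps to confirm the sample size and how many values are missing. A minimal sketch using built-in commands, assuming the variables kept above:

```stata
* storage types and variable labels
describe netustm agea vote prtcleit gndr
* non-missing observations, mean, SD, min and max
summarize netustm agea
* missing values per variable (only variables with missing values are listed)
misstable summarize netustm agea vote gndr
```

The Obs column of summarize shows how many cases each test can actually use.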

Correlation: two interval/ratio variables

Hypotheses:

Null Hypothesis (\(H_0\)): there is no correlation between the variables.
Alternative Hypothesis (\(H_1\)): there is a correlation between the variables.

The correlation coefficient is measured on a scale that varies from +1 through 0 to -1.

Pearson’s vs. Spearman’s correlation

Pearson’s r, or Pearson’s correlation coefficient, describes the linear relationship between two numeric variables.

Additional assumption: linear relationship

Spearman’s rho, or Spearman’s rank correlation coefficient, is an alternative to Pearson’s r. You should use Spearman’s rho when your data fail to meet the assumptions of Pearson’s r, e.g., when at least one of your variables is ordinal or the distributions are not normal.

Additional assumption: monotonic relationship

Image source: https://www.scribbr.com/statistics/correlation-coefficient/

Distributions

check distributions using histogram

histogram netustm

histogram agea

change the number of bins with bin() to show less/more detail

histogram agea, bin(15)

histogram agea, bin(100)

overlay a kernel density estimate

hist agea, bin(100) kdensity

add a title

hist agea, bin(100) kdensity title("Distribution of Age")
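To judge how close a distribution is to normal, one option is to overlay a normal curve and inspect a quantile plot. This is only a sketch of one possible check; the choice of plots and the bin count are assumptions, not part of the lab's required steps:

```stata
* histogram with a normal density based on the sample mean and SD
histogram agea, bin(30) normal
* quantiles of agea against quantiles of a normal distribution
qnorm agea
* skewness and kurtosis (a normal distribution has skewness 0 and kurtosis 3)
summarize agea, detail
```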

Pearson’s correlation using correlate

correlate netustm agea

(obs=27,395)

             |  netustm     agea
-------------+------------------
     netustm |   1.0000
        agea |  -0.2968   1.0000

use pwcorr, “pw” stands for “pairwise”

pwcorr netustm agea

             |  netustm     agea
-------------+------------------
     netustm |   1.0000 
        agea |  -0.2968   1.0000

use pwcorr, with p-value in sig

pwcorr netustm agea, sig

             |  netustm     agea
-------------+------------------
     netustm |   1.0000 
             |
             |
        agea |  -0.2968   1.0000 
             |   0.0000
             |

use pwcorr, with star() to flag coefficients significant at the given level

pwcorr netustm agea, star(.05)

             |  netustm     agea
-------------+------------------
     netustm |   1.0000 
        agea |  -0.2968*  1.0000

use pwcorr, with star() for significance and obs for the pairwise sample sizes

pwcorr netustm agea, star(.05) obs

             |  netustm     agea
-------------+------------------
     netustm |   1.0000 
             |    27598
             |
        agea |  -0.2968*  1.0000 
             |    27395    37319
             |

r(27393) = -.30, p < .001

[Report: Pearson correlation coefficient r(df = n – 2) = the r statistic, p = p value.]
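The p-value printed by pwcorr comes from a t statistic with n - 2 degrees of freedom. A minimal sketch of how the reported numbers could be recovered from the results stored by correlate (r(rho) and r(N)); the scalar names rho_hat, df_r, t_r and p_r are made up for illustration:

```stata
quietly correlate netustm agea
scalar rho_hat = r(rho)
scalar df_r    = r(N) - 2
* t = r * sqrt(df / (1 - r^2))
scalar t_r     = rho_hat * sqrt(df_r / (1 - rho_hat^2))
* two-sided p-value from the t distribution
scalar p_r     = 2 * ttail(df_r, abs(t_r))
display "r(" df_r ") = " %5.2f rho_hat ", t = " %6.2f t_r ", p = " %6.4f p_r
```

With r = -.2968 and df = 27393, t is roughly -51, which is why the p-value is far below .001.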

Rules of thumb:

Perfect correlation: r = ±1
Strong correlation: r between ±.50 and ±1
Moderate correlation: r between ±.30 and ±.49
Weak correlation: r between 0 and ±.29
No correlation: r = 0

Spearman’s rank correlation using spearman

spearman netustm agea, stats(rho obs p) star(0.05) matrix

Number of observations = 27,395

+-----------------+
|  Key            |
|-----------------|
|   rho           |
|   Number of obs |
|   p-value       |
+-----------------+

             |  netustm     agea
-------------+------------------
     netustm |   1.0000 
             |    27395 
             |        . 
             |
        agea |  -0.3465*  1.0000 
             |    27395    27395 
             |   0.0000        . 
             |

Two-sample t-test: one interval/ratio & one binary

Comparing the mean values of two independent groups.

Hypotheses:

\(H_0\): The means of the two groups are equal.
\(H_1\): The means of the two groups are not equal.

E.g., comparing test scores of two separate groups of students.

There are different t-tests, check them out:

https://www.scribbr.com/statistics/t-test/

Now let’s test the relationship between the continuous variable agea and the binary variable vote

label list vote
* drop the additional category "Not eligible to vote" to make it binary
drop if vote == 3

vote:
           1 Yes
           2 No
           3 Not eligible to vote
          .a Refusal
          .b Don't know
          .c No answer

(2,594 observations deleted)
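Dropping rows cannot be undone without reloading the data. A possible alternative, sketched here with a hypothetical variable name voted and label name votedlbl, is to build a 0/1 indicator and leave the original variable untouched:

```stata
* 1 = voted, 0 = did not vote; "Not eligible to vote" and missing codes stay missing
generate byte voted = (vote == 1) if inlist(vote, 1, 2)
label define votedlbl 0 "No" 1 "Yes"
label values voted votedlbl
* cross-check the new variable against the original
tabulate vote voted, missing
```

Later commands would then use by(voted) instead of by(vote).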

Check assumptions

check distribution of age in each group using twoway histogram

twoway histogram agea, discrete by(vote, total)

add a twoway histogram with a fitted density estimate

twoway histogram agea, discrete by(vote, total) || kdensity agea

you can change the kernel bandwidth in bwidth to make it pretty. Try some different values by yourself.

twoway histogram agea, discrete by(vote, total) || kdensity agea, bwidth(15)

comparing means by boxplot

graph box agea, over(vote)

Check variances

Are the variances equal (homoscedasticity/homogeneity of variances)?

We can test for equal variances with Levene’s test, using robvar measurement_variable, by(grouping_variable)

robvar agea, by(vote)

 Voted last |    Summary of Age of respondent,
   national |             calculated
   election |        Mean   Std. dev.       Freq.
------------+------------------------------------
        Yes |   53.687204   16.845207      26,602
         No |   48.287032   18.984173       7,696
------------+------------------------------------
      Total |    52.47548   17.493509      34,298

W0  =  280.19467   df(1, 34296)     Pr > F = 0.00000000

W50 =  275.55155   df(1, 34296)     Pr > F = 0.00000000

W10 =  277.58440   df(1, 34296)     Pr > F = 0.00000000

We see that the standard deviation in age is higher for nonvoters compared to voters. We still want to know if this difference is statistically significant:

W0 = 280.19: the test statistic for Levene’s test centered at the mean, p < .001.

W50 = 275.55: centered at the median, p < .001.

W10 = 277.58: centered at the 10% trimmed mean. This means that the top 5% and bottom 5% of values are trimmed out so they don’t overly influence the test, p < .001.

The p-value for each version of Levene’s test is below .001. This indicates a statistically significant difference in the variance of age between voters and nonvoters in our sample, so we need the t-test with unequal variances.
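Stata also offers a classical variance-ratio F test through sdtest. Unlike Levene’s test it assumes normally distributed data, so it is sketched here only as a cross-check, not as a replacement for robvar:

```stata
* group sizes, means and standard deviations side by side
tabstat agea, by(vote) statistics(n mean sd)
* F test of equal standard deviations in the two groups
sdtest agea, by(vote)
```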

Now the t-test.

simple two-sample t-test using ttest, which assumes equal variances by default (shown here for comparison)

ttest agea, by(vote)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
     Yes |  26,602     53.6872    .1032807    16.84521    53.48477    53.88964
      No |   7,696    48.28703    .2164009    18.98417    47.86283    48.71124
---------+--------------------------------------------------------------------
Combined |  34,298    52.47548    .0944588    17.49351    52.29034    52.66062
---------+--------------------------------------------------------------------
    diff |            5.400172    .2245414                4.960063     5.84028
------------------------------------------------------------------------------
    diff = mean(Yes) - mean(No)                                   t =  24.0498
H0: diff = 0                                     Degrees of freedom =    34296

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

The 26,602 participants who reported voting (M = 53.69, SD = 16.85) were on average 5.40 years older than the 7,696 participants who did not report voting (M = 48.29, SD = 18.98), a statistically significant difference, t(34296) = 24.05, p < .001. We reject \(H_0\).

Conclude: there is a statistically significant difference in the average ages between those who voted and those who did not, with the voter group having a higher mean age.
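The test statistic can be reproduced by hand from the output above, which may help when reading the table. With equal variances assumed, t is the mean difference divided by its standard error, and the degrees of freedom are the two group sizes minus 2:

\[
t = \frac{\bar{x}_{\text{Yes}} - \bar{x}_{\text{No}}}{SE_{\text{diff}}} = \frac{5.400172}{0.2245414} \approx 24.05, \qquad df = 26602 + 7696 - 2 = 34296
\]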

ttest with unequal variances, adding the option unequal

ttest agea, by(vote) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
     Yes |  26,602     53.6872    .1032807    16.84521    53.48477    53.88964
      No |   7,696    48.28703    .2164009    18.98417    47.86283    48.71124
---------+--------------------------------------------------------------------
Combined |  34,298    52.47548    .0944588    17.49351    52.29034    52.66062
---------+--------------------------------------------------------------------
    diff |            5.400172    .2397838                4.930154    5.870189
------------------------------------------------------------------------------
    diff = mean(Yes) - mean(No)                                   t =  22.5210
H0: diff = 0                     Satterthwaite's degrees of freedom =  11428.3

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
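With unequal variances, the standard error of the difference is built from the two group standard errors rather than from a pooled standard deviation. Using the values in the table above:

\[
SE_{\text{diff}} = \sqrt{SE_{\text{Yes}}^2 + SE_{\text{No}}^2} = \sqrt{0.1032807^2 + 0.2164009^2} \approx 0.2398, \qquad t = \frac{5.400172}{0.2398} \approx 22.52
\]

Reported in the same style as before: t(11428.3) = 22.52, p < .001. The conclusion is unchanged: voters are older on average.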

What about comparing means for more than two groups (non-binary)?

One-way ANOVA: one interval/ratio & one nominal/ordinal

(Analysis of Variance)

Hypotheses:

\(H_0\): there is no difference between the group means.
\(H_1\): at least one group mean is different from the others.

Let’s test the relationship between age agea and the party respondents feel closer to in Italy prtcleit.

Recoding

keep only data from Italy

keep if cntry == "IT"

(32,586 observations deleted)

check frequencies of parties

tab prtcleit

   Which party feel closer to, Italy |      Freq.     Percent        Cum.
-------------------------------------+-----------------------------------
                  Movimento 5 Stelle |         96       16.96       16.96
                 Partido Democratico |        184       32.51       49.47
                                Lega |         76       13.43       62.90
                        Forza Italia |         55        9.72       72.61
Fratelli d'Italia con Giorgia Meloni |        104       18.37       90.99
               Liberi e Uguali (LEU) |          8        1.41       92.40
                            + Europa |          2        0.35       92.76
              Noi con l'Italia - UDC |          4        0.71       93.46
                    Potere al popolo |          6        1.06       94.52
                            SVP-PATT |          8        1.41       95.94
                               Altro |          3        0.53       96.47
                         Italia Viva |          4        0.71       97.17
                   Unione Valdotaine |          1        0.18       97.35
                   Partito Comunista |          5        0.88       98.23
                          Vox Italia |          1        0.18       98.41
                  Partito Socialista |          2        0.35       98.76
                 Verdi/ Europa Verde |          1        0.18       98.94
                            Italexit |          3        0.53       99.47
                   Azione di Calenda |          3        0.53      100.00
-------------------------------------+-----------------------------------
                               Total |        566      100.00

let’s keep only the larger categories for analysis (n > 50)

* check label list
label list prtcleit

prtcleit:
           1 Movimento 5 Stelle
           2 Partido Democratico
           3 Lega
           4 Forza Italia
           5 Fratelli d'Italia con Giorgia Meloni
           6 Liberi e Uguali (LEU)
           7 + Europa
           8 Noi con l'Italia - UDC
           9 Potere al popolo
          10 Casapound Italia
          11 Italia Europa Insieme
          12 Il popolo della famiglia
          13 Civica Popolare Lorenzin
          14 SVP-PATT
          31 Altro
          33 Italia Viva
          34 Unione Valdotaine
          35 Partito Comunista
          36 Vox Italia
          37 Partito Socialista
          38 Verdi/ Europa Verde
          39 Italexit
          40 Azione di Calenda
          .a Not applicable
          .b Refusal
          .c Don't know
          .d No answer

tab prtcleit, nolabel

Which party |
feel closer |
  to, Italy |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         96       16.96       16.96
          2 |        184       32.51       49.47
          3 |         76       13.43       62.90
          4 |         55        9.72       72.61
          5 |        104       18.37       90.99
          6 |          8        1.41       92.40
          7 |          2        0.35       92.76
          8 |          4        0.71       93.46
          9 |          6        1.06       94.52
         14 |          8        1.41       95.94
         31 |          3        0.53       96.47
         33 |          4        0.71       97.17
         34 |          1        0.18       97.35
         35 |          5        0.88       98.23
         36 |          1        0.18       98.41
         37 |          2        0.35       98.76
         38 |          1        0.18       98.94
         39 |          3        0.53       99.47
         40 |          3        0.53      100.00
------------+-----------------------------------
      Total |        566      100.00

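* codes 1-5 are the five largest parties; missing values count as larger than any number, so they are dropped too
keep if prtcleit < 6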

(1,916 observations deleted)

check if it worked

tab prtcleit

   Which party feel closer to, Italy |      Freq.     Percent        Cum.
-------------------------------------+-----------------------------------
                  Movimento 5 Stelle |         96       18.64       18.64
                 Partido Democratico |        184       35.73       54.37
                                Lega |         76       14.76       69.13
                        Forza Italia |         55       10.68       79.81
Fratelli d'Italia con Giorgia Meloni |        104       20.19      100.00
-------------------------------------+-----------------------------------
                               Total |        515      100.00

Distributions

check distributions by parties

twoway histogram agea, discrete by(prtcleit, total) || kdensity agea, bwidth(10)

box plot

graph box agea, over(prtcleit)

Check variances

Are the variances equal (homoscedasticity)?

Levene’s test

robvar agea, by(prtcleit)

Which party |    Summary of Age of respondent,
feel closer |             calculated
  to, Italy |        Mean   Std. dev.       Freq.
------------+------------------------------------
  Movimento |   47.479167   15.879798          96
  Partido D |   58.429348    16.45213         184
       Lega |   56.746667   15.347412          75
  Forza Ita |   55.527273   17.415249          55
  Fratelli  |   57.490196   15.174722         102
------------+------------------------------------
      Total |   55.630859   16.482288         512

W0  =  1.02265803   df(4, 507)     Pr > F = 0.39499018

W50 =  0.91322224   df(4, 507)     Pr > F = 0.45588719

W10 =  1.00011288   df(4, 507)     Pr > F = 0.40700853

Well, all p-values are well above .05, so we cannot reject the null hypothesis that the variances are equal. In other words, there is NO statistically significant difference in the variance of age among parties in the sample. We proceed with the equal-variance ANOVA then.

One-way ANOVA

using oneway

oneway agea prtcleit

                        Analysis of variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      8266.80661      4   2066.70165      8.03     0.0000
 Within groups      130554.426    507   257.503798
------------------------------------------------------------------------
    Total           138821.232    511   271.665817

Bartlett's equal-variances test: chi2(4) =   1.9010    Prob>chi2 = 0.754

Report: a one-way ANOVA was performed to compare mean age across the five party groups, F(4, 507) = 8.03, p < .001. We reject \(H_0\): mean age differs significantly between at least two party groups.

[F(between groups df, within groups df) = F-value, p = p-value]
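The F statistic can be checked against the ANOVA table: it is the ratio of the two mean squares. The same table also allows a quick effect-size calculation; \(\eta^2\) is a common measure that is not part of the oneway output and is added here only as an illustration:

\[
F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{2066.70165}{257.503798} \approx 8.03, \qquad \eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}} = \frac{8266.80661}{138821.232} \approx 0.06
\]

So party closeness accounts for roughly 6 percent of the variation in age in this subsample.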

P.S. The output also includes Bartlett’s test for equal variances, an alternative test of whether the variances across groups are equal.

Difference between Levene’s test and Bartlett’s test: both test the assumption of equal variances. However, Bartlett’s test requires approximately normal distributions, while Levene’s test does not, so Levene’s test should be more robust when the data are not normally distributed.

From the results so far, we know that at least one of the group means is different from the other group means. Next, we can do pairwise comparisons of means to see how groups differ.

oneway command with multiple-comparison option using bonferroni

oneway agea prtcleit, bonferroni

                        Analysis of variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      8266.80661      4   2066.70165      8.03     0.0000
 Within groups      130554.426    507   257.503798
------------------------------------------------------------------------
    Total           138821.232    511   271.665817

Bartlett's equal-variances test: chi2(4) =   1.9010    Prob>chi2 = 0.754

                 Comparison of Age of respondent, calculated
                     by Which party feel closer to, Italy
                                (Bonferroni)
Row Mean-|
Col Mean |   Moviment   Partido        Lega   Forza It
---------+--------------------------------------------
Partido  |    10.9502
         |      0.000
         |
    Lega |     9.2675   -1.68268
         |      0.002      1.000
         |
Forza It |    8.04811   -2.90208   -1.21939
         |      0.032      1.000      1.000
         |
Fratelli |     10.011   -.939152    .743529    1.96292
         |      0.000      1.000      1.000      1.000
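Each cell in the Bonferroni table shows the difference in mean age (row mean minus column mean) with its adjusted p-value underneath. Here Movimento 5 Stelle differs from each of the other four groups (p ≤ .032), while no other pair differs. As a possible alternative view of the same comparisons, pwmean also reports confidence intervals; this is a sketch, not the command used in the lab:

```stata
* pairwise differences in mean age with Bonferroni-adjusted p-values and CIs
pwmean agea, over(prtcleit) mcompare(bonferroni) effects
```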

Pearson’s chi-squared test (\(\chi^2\)): two nominal/ordinal

Additional assumptions:
- levels/categories of the variables are mutually exclusive
- an expected count of at least 5 in each cell of the input table

For tables with very small cells (expected counts < 5), use Fisher’s exact test.

Hypotheses:

\(H_0\): there is no association between the variables (independent).
\(H_1\): there is an association between the variables (dependent).

Let’s test if vote is associated with gndr.

We want to test the variables in the original data set, before we dropped the other countries.

so first reload original data

use "datafile/ESS10.dta", clear

drop the extra category of vote again

label list vote
drop if vote == 3

vote:
           1 Yes
           2 No
           3 Not eligible to vote
          .a Refusal
          .b Don't know
          .c No answer

(2,594 observations deleted)

chi-squared of independence using the option chi2 in tabulate

tabulate gndr vote, chi2

           |  Voted last national
           |       election
    Gender |       Yes         No |     Total
-----------+----------------------+----------
      Male |    12,503      3,465 |    15,968 
    Female |    14,291      4,299 |    18,590 
-----------+----------------------+----------
     Total |    26,794      7,764 |    34,558 

          Pearson chi2(1) =  10.0231   Pr = 0.002

variable order in tabulate: row variable first, then column variable
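To check the assumption about cell counts, tabulate can print the expected frequency under independence below each observed count. A minimal sketch:

```stata
* observed and expected frequencies, plus the chi-squared test
tabulate gndr vote, chi2 expected
* for sparse tables, the exact option requests Fisher's exact test instead
* tabulate gndr vote, exact
```

With cells in the thousands, as here, the expected counts are far above 5.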

option row to show within-row relative frequencies

tabulate gndr vote, row chi2

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |  Voted last national
           |       election
    Gender |       Yes         No |     Total
-----------+----------------------+----------
      Male |    12,503      3,465 |    15,968 
           |     78.30      21.70 |    100.00 
-----------+----------------------+----------
    Female |    14,291      4,299 |    18,590 
           |     76.87      23.13 |    100.00 
-----------+----------------------+----------
     Total |    26,794      7,764 |    34,558 
           |     77.53      22.47 |    100.00 

          Pearson chi2(1) =  10.0231   Pr = 0.002

\(\chi^2\)(df = 1, N = 34,558) = 10.02, p = .002. Since the p-value is below .05, we reject \(H_0\): there is an association between gender and voting behavior.
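With more than 34,000 observations, even a small difference (78.30% vs. 76.87% voting) is statistically significant, so it is worth looking at the strength of the association too. One common measure, not covered above, is Cramér’s V, which tabulate reports with the V option:

```stata
* chi-squared test plus Cramér's V as a measure of association
tabulate gndr vote, chi2 V
```

For a 2×2 table its magnitude is sqrt(chi2 / N), here sqrt(10.0231 / 34558) ≈ .02, which points to a very weak association.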

Always nice to visualize your data!

bar chart in percent

graph bar (percent), over(gndr) by(vote)

bar chart in counts

graph bar (count), over(gndr) by(vote)

you can change bar colors

graph bar, over(vote) over(gndr) ascategory asyvars bar(1, fcolor(red)) bar(2, fcolor(green))

other colors (search Google for something like “color palettes Stata”)

graph bar, over(vote) over(gndr) ascategory asyvars bar(1, fcolor(ebblue)) bar(2, fcolor(sandb))

Parametric & nonparametric tests

Parametric tests: assume the data follow a particular distribution (typically normal)
Non-parametric tests: make no such assumption, so they suit data that are not normally distributed

Most parametric tests have their nonparametric counterparts, e.g.:

Parametric test	Nonparametric test
Pearson correlation	Spearman correlation
Two-sample t-test	Mann-Whitney U test
One-way ANOVA	Kruskal-Wallis test
–	Chi-squared test
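The nonparametric counterparts in the table are available as built-in Stata commands. A minimal sketch, assuming the data are prepared as in the earlier sections (vote reduced to two categories, and the Italian subsample for prtcleit):

```stata
* Mann-Whitney U (Wilcoxon rank-sum) test: age by voting status
* (the by() variable must have exactly two groups)
ranksum agea, by(vote)
* Kruskal-Wallis test: age across party groups
kwallis agea, by(prtcleit)
```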

In case you need any additional tests that we didn’t have time to cover in class, you can look them up on Google (your bff in coding).

Summary: tests learned today

Variable	Binary	Nominal/ordinal	Interval/ratio
Binary	Chi-squared	Chi-squared	T-test
Nominal/ordinal	Chi-squared	Chi-squared	ANOVA
Interval/ratio	T-test	ANOVA	Correlation

Mandatory Assignment 2

Due date: 13 Oct. 2024, 23:59

Work individually
Use a data set we have downloaded (ESS_italy/ESS10)

(P.s., more information about the variables can be found in the corresponding codebook, which is available to download from the website)

Choose two variables of interest to you and describe each of them
Visualize the relationship between them
Choose an appropriate test to check their association
By the end, we want to receive two files:
- a do-file, where you record your code
- a PDF, where you organize your report with selected Stata output
I expect to see in the PDF file:
- your hypotheses (\(H_0\) and \(H_1\))
- variable description
- key assumption check
- visualization
- test output and its report
- one line of conclusion
Name the file surname_quanlab_2
Upload these two files to Moodle “Lab Materials” section

Ciao. That’s all, pals.

How would you rate this week’s lessons on a scale of cats?

Image source: https://x.com/catecoin/status/1742834810140402105