Please install the “stargazer” package if you have not already done so.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stargazer)
##
## Please cite as:
##
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
cars_data<-mtcars
One of the interesting things about statistics is that it allows us to calculate odds and probabilities rather than just point estimates (“as x goes up, y goes up”). With odds and probabilities, we can say things like “as x goes up, the chances of y happening go up”. This is very useful because we live in a world that is strongly influenced by seemingly random events, and probabilities give us a language for tackling that uncertainty.
The easiest way to determine how an independent variable affects the odds of an outcome is to choose a dependent variable that is binary (i.e., 0 or 1).
Luckily, the great thing about R is that almost anything can be turned into a binary variable. For example, we can turn “Mercedes” (the car brand) into its own binary variable:
cars_data<-mtcars %>% mutate(MERCEDES=ifelse(grepl("Merc",rownames(mtcars)),1,0))
Now we can see what the odds of a car being a Mercedes are, given the variables in the dataset:
merc_glm<-glm(MERCEDES~hp+disp+vs+am+mpg+drat,data=cars_data,family="binomial")
summary(merc_glm)
##
## Call:
## glm(formula = MERCEDES ~ hp + disp + vs + am + mpg + drat, family = "binomial",
## data = cars_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.19474 18.47262 0.389 0.6969
## hp 0.01524 0.04522 0.337 0.7360
## disp -0.05872 0.03516 -1.670 0.0949 .
## vs -7.74271 7.66406 -1.010 0.3124
## am -30.11643 3764.68437 -0.008 0.9936
## mpg 0.09283 0.52085 0.178 0.8585
## drat 1.92980 5.17325 0.373 0.7091
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33.621 on 31 degrees of freedom
## Residual deviance: 11.602 on 25 degrees of freedom
## AIC: 25.602
##
## Number of Fisher Scoring iterations: 19
The difference between the null deviance and the residual deviance (33.621 - 11.602 = 22.019) is much greater than the difference in degrees of freedom (31 - 25 = 6), so this model is probably significant. What about “disp”? It is technically not significant at the 0.05 level, but it is close (p = 0.0949). Let’s treat it as significant for the sake of discussion.
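If you want to formalize this intuition, the drop in deviance for nested models is approximately chi-squared distributed, with degrees of freedom equal to the number of predictors added. A minimal sketch, pulling those quantities straight from the fitted model:
# likelihood-ratio style check: null deviance minus residual deviance,
# compared against a chi-squared distribution with 6 degrees of freedom
pchisq(merc_glm$null.deviance - merc_glm$deviance,
       df = merc_glm$df.null - merc_glm$df.residual,
       lower.tail = FALSE)
A p-value well below 0.05 here supports reading the model as significant overall.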
What is the meaning of an estimate of -0.058?
The coefficients of a binomial GLM are log-odds, so we can exponentiate -0.058 to find the effect of engine displacement (disp) on the odds of the car being a Mercedes, all else being equal:
exp(-0.058)
## [1] 0.9436499
0.94 means that for every additional unit of engine displacement, the odds of the car being a Mercedes go down by 1 - 0.94 = 0.06, or 6%.
If the estimate were, say, 0.07, the change in the odds of the car being a Mercedes would be:
exp(0.07)
## [1] 1.072508
1.07 means that the odds of the car being a Mercedes go up by 7% for every additional unit of engine displacement.
But that 0.07 is just a number I made up. The actual log-odds estimate is -0.058, which exponentiates to 0.94. What does this tell you about Mercedes? In general, the smaller the engine displacement, the more likely the car is to be a Mercedes, all else being equal. Or, to put it another way, Mercedes cars are somewhat associated with smaller engine displacements.
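If you want to see the odds ratios for every predictor at once, you can exponentiate the whole coefficient vector; a quick sketch:
# odds ratios for all coefficients: values below 1 shrink the odds of
# "Mercedes", values above 1 increase them
exp(coef(merc_glm))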
To visualize this better, let’s add the probabilities from the model to the cars dataset:
# note: format() returns characters (to keep the printout out of scientific
# notation), so MERC_PROB must be converted back to numeric before plotting
cars_data$MERC_PROB<-format(predict(merc_glm,type="response"),scientific = FALSE)
head(cars_data)
## mpg cyl disp hp drat wt qsec vs am gear carb MERCEDES
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0
## MERC_PROB
## Mazda RX4 0.000000000643365066
## Mazda RX4 Wag 0.000000000643365066
## Datsun 710 0.000000000004898261
## Hornet 4 Drive 0.002259373126975573
## Hornet Sportabout 0.030419088629643758
## Valiant 0.005750168822446475
We can now plot the predicted “Mercedes” probability against the displacement of the engines:
plot(as.numeric(cars_data$MERC_PROB), cars_data$disp, xlab = "Predicted Probability of Mercedes", ylab = "Engine Displacement",
    main = "Predicted Probability of Mercedes vs. Engine Displacement (R - 2024)")
# regress displacement on the plotted probabilities so the red trend line
# matches the x-axis of the scatterplot
abline(lm(cars_data$disp~as.numeric(cars_data$MERC_PROB)),col='red',lwd=3)
This is not a direct visualization of the GLM (the red line is a linear model rather than a binomial model), but the overall shape of the trend is similar. The shallowness of the relationship between the predicted probability of “Mercedes” and engine displacement seems to support the model’s suggestion that the relationship is barely significant.
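If you prefer the tidyverse (loaded at the top of this page), here is a rough ggplot2 sketch of the same figure; note that MERC_PROB has to be converted back to numeric because format() returned characters:
# scatterplot of predicted probability vs. displacement, with a linear trend line
ggplot(cars_data, aes(x = as.numeric(MERC_PROB), y = disp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Predicted Probability of Mercedes", y = "Engine Displacement")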
There are other kinds of generalized linear models you can run (they are called different “families”), but only the binomial family is fit for an outcome variable that is, well, binomial (0 or 1).
We can demonstrate that the binomial family is better for this data by trying to fit a different family, the “poisson” family (not poison):
merc_poisson_glm<-glm(MERCEDES~hp+disp+vs+am+mpg+drat,data=cars_data,family="poisson")
summary(merc_poisson_glm)
##
## Call:
## glm(formula = MERCEDES ~ hp + disp + vs + am + mpg + drat, family = "poisson",
## data = cars_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.697e+00 1.073e+01 0.251 0.802
## hp -2.371e-03 3.090e-02 -0.077 0.939
## disp -2.946e-02 1.921e-02 -1.534 0.125
## vs -5.171e+00 5.141e+00 -1.006 0.314
## am -2.532e+01 5.532e+03 -0.005 0.996
## mpg -4.631e-02 3.134e-01 -0.148 0.883
## drat 2.027e+00 3.441e+00 0.589 0.556
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 21.278 on 31 degrees of freedom
## Residual deviance: 5.858 on 25 degrees of freedom
## AIC: 33.858
##
## Number of Fisher Scoring iterations: 19
Notice that even the mildly significant p-value of “disp” is no longer significant in the poisson model. In fact, nothing is significant. This is because the poisson model is not appropriate for binary outcomes.
Without p-values for the overall model, how can we decide which model is better? Our old friend AIC:
AIC(merc_glm,merc_poisson_glm)
## df AIC
## merc_glm 7 25.60157
## merc_poisson_glm 7 33.85797
We see here that merc_glm is a better fit to the data than merc_poisson_glm, because a binomial GLM is more appropriate for binary (0, 1) data than a poisson GLM (which isn’t appropriate here at all).
So how do you know what GLM family to use, and when?
There are more types of GLMs than just the following, but these are the types you will most likely encounter in this program:
| Dependent variable | Independent variable | Method |
|---|---|---|
| Continuous | Continuous or categorical | Gaussian (family="gaussian") |
| Binary | Categorical or continuous | Binomial (family="binomial") |
| Counts | Categorical and continuous integers | Poisson (family="poisson") |
| Continuous (skewed)* | Continuous and positive | Gamma (family="Gamma") |
*Gamma only really works for data skewed to the right; left-skewed data would benefit from a Gaussian fit after logging the independent variable.
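To see the poisson family used on data it actually suits, here is a sketch with a genuine count variable from mtcars, carb (the number of carburetors); the choice of predictors here is mine, purely for illustration:
# carb is a true count (1 to 8 carburetors), so poisson is a reasonable family
carb_pois <- glm(carb ~ hp + wt, data = cars_data, family = "poisson")
summary(carb_pois)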
Each family can also have a “link” specified:

- “Identity” - for linear relationships
- “Log” - when the dependent variable is positive or when estimating rates
- “Logit” - when the dependent variable is binary
- “Inverse” - for skewed data
- “Inverse-Square” - for skewed data with outliers
Don’t worry about the links too much; each family has its own default link, which will probably work just fine.
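If you ever do want to set a link explicitly, it goes inside the family function. A minimal sketch using this chapter’s variables; the call below is equivalent to family="binomial", because “logit” is the binomial default:
# explicit link specification (same model as the default binomial fit)
merc_logit <- glm(MERCEDES ~ disp, data = cars_data, family = binomial(link = "logit"))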
Remember these types of variables as well:
- Continuous - can be written on a number line (1.1, 1.2, 1.3, 1.4). Example: height in cm (160.5 cm, etc.)
- Binary - either it happened, or it didn’t (1, 0). Example: patient lived after surgery (0) or died (1)
- Counts - integers you can “count” on your fingers (1, 2, 3). Example: there were 4 car crashes yesterday
- Continuous (skewed - Gamma) - same as continuous but positive and with a grouping around extreme values. Example: prices in the diamonds dataset
- Categorical - different “categories” the data can be sorted into. Example: OJ vs. Vitamin C (from ToothGrowth)
How big of a sample do you need to get reliable estimates?
Rule of thumb: it depends on how diverse the population is. The more similar a population is, the more representative a small random sample of that population will be. Think of opinion polling within a single family versus polling the same number of people from many different countries. Would the opinions in the family be more easily modeled than the opinions of people from distant lands?
Is there a way to mathematically determine the right sample size? Actually, there is:
Let “n” be our sample size
Let “P” be the population size (the population we are trying to measure)
Let “e” be the significance level (typically 0.05)

n = P / (1 + P * e^2)
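This formula is easy to wrap in a small helper function (the name sample_size is my own, for illustration):
# n = P / (1 + P * e^2), with e defaulting to 0.05
sample_size <- function(P, e = 0.05) {
  P / (1 + P * e^2)
}
sample_size(2000) # matches the calculation below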
For a population of 2,000 randomly selected people, with a significance of 0.05:
(2000)/(1+2000*(0.05)^2)
## [1] 333.3333
…so we should randomly select 333 persons to sample a population of 2,000.
You might think that 333 out of 2,000 is a lot (and it is), so obviously when sampling for the entire population of the US, the number of people to be polled must be huge. However, this is not the case:
n = ?
P = 350,000,000
e = 0.05
(350000000)/(1+350000000*(0.05)^2)
## [1] 399.9995
You are reading this correctly: you only need to sample about 400 people to accurately model the behavior of 350 million people (at a significance level of 0.05, assuming all people are TRULY RANDOMLY selected).
If you want a more precise estimate, say e = 0.03 (most political polls are +/- 3%), then the calculation changes somewhat:
(350000000)/(1+350000000*(0.03)^2)
## [1] 1111.108
It went from 400 to about 1,111. Still, this is pretty small compared to the entire US population. You may notice that most political polls tend to ask about “1,100 likely voters” or “1,100 registered voters” – this is why.
The really neat thing is that this applies not just to human populations, but to populations of anything. Schools, cars, objects made in a factory, anything.
At some point, you will want to display the information you get from your models in a paper. There are easy and hard ways to do this. The hard way is to try to create a table in Microsoft Word (or Excel), put your data in there, and copy/paste it into your paper.
The easy way is to use stargazer. First, let’s create some regression models:
cars_gamma<-glm(mpg~wt+disp+qsec,data=cars_data,family="Gamma")
cars_gauss<-glm(mpg~wt+disp+qsec,data=cars_data,family="gaussian")
Then, let’s plug them into stargazer. Remember to put results='asis' into the {r} chunk options:
stargazer(cars_gamma,cars_gauss,type="html")
| | Dependent variable: mpg | |
|---|---|---|
| | gamma | normal |
| | (1) | (2) |
| wt | 0.011*** | -5.034*** |
| | (0.003) | (1.224) |
| disp | 0.00002 | -0.0001 |
| | (0.00002) | (0.011) |
| qsec | -0.002** | 0.927** |
| | (0.001) | (0.342) |
| Constant | 0.044*** | 19.778*** |
| | (0.012) | (5.938) |
| Observations | 32 | 32 |
| Log Likelihood | -68.307 | -75.360 |
| Akaike Inf. Crit. | 144.613 | 158.720 |
| Note: | \*p<0.1; \*\*p<0.05; \*\*\*p<0.01 | |
The result is a bunch of HTML, but it will make more sense once the code is knitted.
We see that the Gamma model fits a bit better than the “normal” Gaussian model (it has a lower AIC), probably because MPG is skewed a bit to the right (most MPG values are around 15-20, but they run up into the mid-30s).
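You can confirm this with the same AIC() comparison we used earlier; it reports the same “Akaike Inf. Crit.” values as the stargazer table:
# lower AIC = better fit; the Gamma model wins here
AIC(cars_gamma, cars_gauss)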
Now you can copy/paste these results into your paper, or just take a screenshot / snippet and paste that as an image.
Citations:
Dobson, A. J. (2002). *An Introduction to Generalized Linear Models* (2nd ed.). Chapman and Hall/CRC.
Israel, G. D. (1992). *Determining Sample Size*. Fact Sheet PEOD-6, University of Florida. November 1992.