Variable Types, Measurement, Descriptive Statistics

Author

Jack T. Rametta

Following the lecture material, this week we’re going to talk about measurement, variable types, and descriptive statistics. This material might seem dry, but you need these preliminaries if you want to analyze real data. A reminder: unless noted otherwise, all the data used in discussion are simulated, so please don’t draw substantive conclusions from the toy examples.

# Load necessary packages, if you don't have them use install.packages("packagename")
library(ggplot2)
library(tidyverse)
library(kableExtra)
library(forcats)
library(ggthemes)
library(distributional)
# 
set.seed(1995)
#
# ggplot2 theme setup 
theme_teach <- function(){
  theme_few() + 
  theme(axis.text = element_text(color = "black", face = "bold"),
        plot.title = element_text(color="black", face="bold", hjust = .5),
        axis.title = element_text(color = "black", face = "bold")) 
}

Types of variables

There are lots of ways to classify variables (annoyingly, every textbook seems to invent slightly different categories). For this course, we’ll consider three types: categorical, continuous, and ordinal. You must recognize the differences between these types!
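As a rough illustration of how the three types are usually stored in R, here is a minimal sketch. The object names and values are invented for this example and are not part of the discussion data.

```r
# Toy values invented for illustration only
major <- factor(c("economics", "biology", "economics"))   # categorical: labels with no order

ideology <- factor(c("Left", "Right", "Far Left"),
                   levels = c("Far Left", "Left", "Middle of the Road", "Right", "Far Right"),
                   ordered = TRUE)                         # ordinal: labels with a meaningful order

age <- c(23.5, 41.2, 67.9)                                 # continuous: numeric values

is.ordered(major)           # FALSE: no ordering among majors
is.ordered(ideology)        # TRUE
ideology[1] < ideology[2]   # TRUE: "Left" sits below "Right" on the scale
mean(age)                   # arithmetic is only meaningful for the numeric variable
```

Comparisons such as `<` work on the ordered factor but not on the unordered one, which is one practical way to see the difference between ordinal and categorical variables.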

Here are several simulated variables. Tell me what type each of them is. To begin with, what type of variable is age?

n <- 1000 #sample size
age_dist <- distributional::dist_beta(shape1 = 2, shape2 = 5) #define the age distro
sex_dist <- distributional::dist_binomial(1,.5)
# 
df <- data.frame(age = unlist(generate(age_dist,n))*100,
                 sex = unlist(generate(sex_dist,n)),
                 education = as.factor(sample(c("political science","economics","psych","biology"),n,T,c(0.15, 0.25, 0.35, 0.25))),
                 ideology = as.factor(sample(c("Far Left","Left","Middle of the Road","Right","Far Right"),n,T,c(.15,.25,.23,.25,.12))),
                 birth_place = as.factor(sample(c("Sacramento","Elsewhere"),n,T,c(.6,.4))),
                 years_of_education = rnorm(n,12,4))
# 
df$years_of_education <- ifelse(df$years_of_education < 0, 0, df$years_of_education) 
df$ideology.num <- as.numeric(df$ideology)
#
df |> ggplot(aes(x = age)) + geom_histogram(color = "white",fill = "forestgreen",alpha = .8) + xlab("Distribution of Age") + ylab("Count") + theme_teach()

How about sex?

df |> ggplot(aes(x = sex)) + 
  geom_histogram(color = "white",fill = "forestgreen",alpha = .8) + 
  xlab("Distribution of Sex") + ylab("Count") + theme_teach()

How about education?

df |> 
  ggplot(aes(x = education)) + 
  geom_bar(color = "white",fill = "forestgreen",alpha = .8) + 
  xlab("Distribution of College Major") + ylab("Count") + 
  theme_teach()

What about this other measurement of education, years of schooling?

df |> 
  ggplot(aes(x = years_of_education)) + 
  geom_histogram(color = "white",fill = "forestgreen",alpha = .8) + 
  xlab("Distribution of Years of Education") + ylab("Count") + 
  theme_teach()

How about ideology?

df |> 
  mutate(ideology = fct_relevel(ideology, c("Far Left","Left","Middle of the Road","Right","Far Right"))) |> 
  ggplot(aes(x = ideology)) + 
  geom_bar(color = "white",fill = "forestgreen",alpha = .8) + 
  xlab("Distribution of Self-Identified Ideology") + ylab("Count") + 
  theme_teach()

How about birth place?

df |> 
  ggplot(aes(x = birth_place)) + 
  geom_bar(color = "white",fill = "forestgreen",alpha = .8) + 
  xlab("Distribution of Birth Places") + ylab("Count") + 
  theme_teach()

If you’re confused by these you should review them!

Measurement

Measurement is very important. You can have a great theory, a huge sample, a great research design, and so on, but if your measures are bad you can easily get the wrong answer. Measures vary in their reliability and validity. What are these, by the way?

Take a simple example. For a moment, pretend we’re omniscient Martians who can gather measurements with near-perfect reliability and validity. We measure political ideology by scanning the brains of 20,000 Americans and see that it’s distributed in the American mass public on a unidimensional left-right scale, as below.

ideology <- c(rnorm(10000,-.5,.25),rnorm(10000,.5,.25))
ggplot(data = NULL,aes(x = ideology)) + geom_histogram(color = "white",fill = "grey") + 
  xlab("Ideology") + 
  ylab("Count") + 
  ggtitle("Oracle (true) Distribution of Political Ideology") + 
  theme_teach()

Well, this is great! In real life, to get a clean continuous measure like this you’d need to ask many participants lots of questions. Then you could combine those responses to derive a single ideology value for each person. The process of combining the responses is quite complex, but you can think of it as taking a simple average across many questions.
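To make the "simple average across many questions" intuition concrete, here is a minimal sketch. It assumes each survey item equals a person's latent ideology plus independent noise; the number of items, the noise level, and all object names are hypothetical choices for illustration, and real scaling methods are more involved.

```r
set.seed(1)
n_resp  <- 500   # hypothetical number of respondents
n_items <- 10    # hypothetical number of survey questions

true_ideo <- rnorm(n_resp, 0, 0.5)   # latent ideology we cannot observe directly

# Each column is one noisy item: latent value plus independent measurement error
items <- sapply(seq_len(n_items), function(j) true_ideo + rnorm(n_resp, 0, 1))

scale_score <- rowMeans(items)       # simple average across items, one score per person

cor(items[, 1], true_ideo)           # a single noisy item
cor(scale_score, true_ideo)          # the averaged scale tracks the latent value more closely
```

Averaging helps here because the independent errors partly cancel out across items, while the shared latent signal does not.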

If we wanted to, we could then use this measure as an independent variable (IV) or dependent variable (DV), for example as an IV predicting Trump approval.

trump_approve <- 50 + 100*ideology^3 + rnorm(20000,0,20)
trump_approve <- ifelse(trump_approve < 0,0,trump_approve)
trump_approve <- ifelse(trump_approve > 100,100,trump_approve)

data.frame(ideology,trump_approve) |> ggplot(aes(x = ideology,y = trump_approve)) + 
  geom_point(shape = 1,alpha = .1) + 
  geom_smooth(se = F,color = "red",method = "lm",formula = y ~ splines::bs(x, 4),size = 1) + 
  xlab("Ideology") +
  ylab("Trump Approval (Feeling Therm)") + 
  ylim(0,100) + 
  theme_teach()

Not too shabby. But now let’s pretend we’re economists who know nothing about the literature on measuring political ideology. This won’t stop us from plowing ahead anyway (we’re economists after all). We’ll take a small (N = 1,000) sample and use a single coarse, five-category self-reported measure. Here’s a table counting the respondents in each category.

small <- data.frame(ideology,trump_approve) |> sample_n(1000)
#
small$ideo_cats <- cut(x = small$ideology,breaks = c(-5,-.75,-.5,.5,.75,5))
levels(small$ideo_cats) <- c("Far Left","Left","Moderate","Right","Far Right")
#
table(small$ideo_cats)


 Far Left      Left  Moderate     Right Far Right 
       92       174       486       148       100

What happened here? Our categorical measure is built from the perfect continuous distribution from before, yet almost half of the sample lands in the Moderate category. (Hint: look at the cut points!)
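One way to explore the hint is to compute what share of the oracle distribution falls between each pair of cut points. The sketch below reuses the mixture parameters and breaks from the code above; the helper `mix_cdf` is a name introduced here for illustration.

```r
# CDF of the oracle mixture: half N(-0.5, 0.25), half N(0.5, 0.25)
mix_cdf <- function(q) 0.5 * pnorm(q, -0.5, 0.25) + 0.5 * pnorm(q, 0.5, 0.25)

breaks <- c(-5, -0.75, -0.5, 0.5, 0.75, 5)   # same cut points passed to cut() above

# Probability mass between consecutive cut points
shares <- diff(mix_cdf(breaks))
names(shares) <- c("Far Left", "Left", "Moderate", "Right", "Far Right")

round(shares, 3)
```

The Moderate interval runs from the center of one mode to the center of the other, so it captures roughly half of the distribution even though the true distribution has very few people near zero.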

Now let’s see if we can recover that relationship with Trump approval.

small |> 
  ggplot(aes(x = ideo_cats,y = trump_approve)) + 
  geom_boxplot() + 
  geom_smooth(aes(x = as.numeric(ideo_cats)),method = "lm",formula = y ~ splines::bs(x, 4),size = 1,se = F,color = "red") + 
  xlab("Ideology") +
  ylab("Trump Approval (Feeling Therm)") + 
  ylim(0,100) + 
  theme_teach()

We still manage to recover the basic relationship, but the nonlinearity is now gone. We’ve lost a lot of information by converting the continuous scale to a categorical one.

Let’s move on to another example and pretend we have a two-wave survey. In both waves, we ask respondents if they voted in the last election. We know that in reality about 66% of the population voted.

r_lie <- .3333333 # among those who didn't vote, the probability of lying about it when asked; assume 1/3

val_df <- data.frame(vote_real = rbinom(1000,1,.666666666))

val_df |> mutate(vote1 = case_when(vote_real == 1 ~ 1,
                                   vote_real == 0 ~ rbinom(1000,1,prob = r_lie)),
                 vote2 = vote1) |> 
  summarize(real_vote_perc = mean(vote_real),
            self_report_wave_1 = mean(vote1),
            self_report_wave_2 = mean(vote2))

  real_vote_perc self_report_wave_1 self_report_wave_2
1          0.663              0.773              0.773

Assess the validity and reliability of our self-reported measure.
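As a starting point for that assessment, here is a minimal sketch that puts numbers on the two ideas, treating reliability as consistency across repeated measurements and validity as agreement with the truth. It re-simulates the same setup with base R; the seed and object names are choices made for this example, so the exact values will differ from the table above.

```r
set.seed(2)
n_resp <- 1000

vote_real <- rbinom(n_resp, 1, 2/3)          # true turnout
lie       <- rbinom(n_resp, 1, 1/3)          # would this person lie if they did not vote?
vote1     <- ifelse(vote_real == 1, 1, lie)  # wave 1 self-report: voters tell the truth
vote2     <- vote1                           # wave 2: everyone repeats their wave 1 answer

mean(vote1 == vote2)            # agreement across waves (consistency of the measure)
mean(vote1 == vote_real)        # agreement with the truth (accuracy of the measure)
mean(vote1) - mean(vote_real)   # how much the self-report overstates turnout
```

Because nonvoters can only err in one direction here, the self-reported rate can never fall below the true rate in this simulation.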

Descriptive Statistics

Descriptive statistics are simple measures of central tendency and dispersion that give you a sense of how the different variables in your data are distributed.

Measures of the central tendency attempt to summarize the “middle” or “center” of the distribution in a single number. The two most important measures of the central tendency of a distribution are the mean (average) and median. What are these?
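A tiny example, using made-up numbers, shows how the two measures can disagree when one value is far from the rest:

```r
x <- c(1, 2, 3, 4, 100)   # invented values; one observation is much larger than the others

mean(x)     # 22: the sum (110) divided by the number of observations (5)
median(x)   # 3: the middle value once the data are sorted
```

The single large value pulls the mean upward but leaves the median unchanged, which is worth keeping in mind for the skewed distributions discussed in this section.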

Here’s an example of both on our age distribution from earlier:

df |> ggplot(aes(x = age)) + 
  geom_histogram(color = "white",fill = "grey") + 
  geom_vline(xintercept = mean(df$age),color = "dodgerblue",size = 2) + 
  geom_vline(xintercept = median(df$age),color = "firebrick1",size = 2) + 
  annotate("text", x = 33, y = 70, angle = 90, label = "Mean",color = "dodgerblue",size = 10) + 
  annotate("text", x = 24, y = 40, angle = 90,label = "Median",color = "firebrick1",size = 10) + 
  ylab("Count") + 
  xlab("Distribution of Age") + 
  theme_teach()

Why is the mean greater than the median here?

Let’s take a more extreme example, the distribution of net worth among US presidents.

p.wealth <- data.frame(
  potus = c("Donald Trump", "George Washington", "Thomas Jefferson", "Theodore Roosevelt", "Andrew Jackson",
            "James Madison", "Lyndon B. Johnson", "Herbert Hoover", "John F. Kennedy", "Bill Clinton",
            "Franklin D. Roosevelt", "John Tyler", "Barack Obama", "George W. Bush", "James Monroe",
            "Martin Van Buren", "Grover Cleveland", "George H. W. Bush", "John Quincy Adams", "John Adams",
            "Richard Nixon", "Ronald Reagan", "James K. Polk", "Dwight D. Eisenhower", "Joe Biden",
            "Gerald Ford", "Jimmy Carter", "Zachary Taylor", "William Henry Harrison", "Benjamin Harrison",
            "Millard Fillmore", "Rutherford B. Hayes", "William Howard Taft", "Franklin Pierce", "William McKinley",
            "Warren G. Harding", "James Buchanan", "Abraham Lincoln", "Andrew Johnson", "Ulysses S. Grant",
            "James A. Garfield", "Chester A. Arthur", "Woodrow Wilson", "Calvin Coolidge", "Harry S. Truman"),
  wealth = c(3000, 707, 284, 168, 159,
             136, 131, 100, 99, 90,
             79, 68, 48, 47, 36,
             34, 33, 31, 27, 25,
             20, 16, 13, 10, 10,
             9, 9, 8, 7, 7,
             5, 3, 3, 2, 1,
             1, 1, 1, 1, 1,
             1, 1, 1, 1, 1)
)
#
p.wealth |> ggplot(aes(x = wealth)) + geom_histogram(color = "white",fill = "grey") + 
  geom_vline(xintercept = mean(p.wealth$wealth),color = "dodgerblue",size = 2) + 
  geom_vline(xintercept = median(p.wealth$wealth),color = "firebrick1",size = 2) + 
  annotate("text", x = 240, y = 25, label = "Mean",angle = 90, color = "dodgerblue",size = 10) + 
  ylab("Count") + 
  xlab("Presidential Net Worth (Millions of 2022 Dollars)") + 
  theme_teach()

Uh oh! What’s going on here? This is real data by the way (to the extent we can measure this well).

Let’s do one more. Remember the distribution of Trump feeling thermometers from last week? Let’s take a look at it.

# 
df2 <- data.frame(treat = rbinom(1000,1,.5)) #generate treatment assignment
df2$trump_support <- df2$treat*5 + rnorm(1000,75,5) #simulate the outcome variable 
index_nt <- sample(seq(1,1000,1),250)
# let's make it bimodal for never trumpers. 
df2[index_nt,]$trump_support <- df2[index_nt,]$trump_support - rnorm(250,50,5) 
df2$treat <- ifelse(df2$treat == 1, "Treatment","Control")
#
# simple histogram 
df2 |> 
  filter(treat == "Control") |> 
  ggplot(aes(x = trump_support)) + 
  geom_histogram(alpha=0.8, color = "white",fill = "grey") + 
  xlab("Trump Support (0-100)") + 
  ylab("Count") + 
  xlim(0,100) +
  geom_vline(xintercept = mean(df2$trump_support),color = "dodgerblue",size = 2) + 
  geom_vline(xintercept = median(df2$trump_support),color = "firebrick1",size = 2) + 
  annotate("text", x = 61, y = 70, angle = 90, label = "Mean",color = "dodgerblue",size = 10) + 
  annotate("text", x = 79, y = 40, angle = 90,label = "Median",color = "firebrick1",size = 10) + 
  theme_teach()

What’s going on here? How should we think about the central tendency of this distribution?

Now let’s talk about variance. Variance is a measure of how “wide” a distribution is, or in other words, the level of dispersion in a distribution. In practice, we usually report the standard deviation instead. The variance is the expected value of the squared deviation from the mean, and the standard deviation is its square root, which puts it back in the same units as the variable itself.

Here’s an example of two distributions with exactly the same mean but different variances.

two_devs <- data.frame(X1 = rnorm(n,0,1),
                       X2 = rnorm(n,0,10))
two_devs_melt <- reshape2::melt(two_devs)
two_devs_melt |> 
  ggplot(aes(x = value, y= variable,color = variable,fill = variable )) + 
  ggdist::stat_dotsinterval(aes(color = variable,fill = variable),quantiles = 75) +
  ylab("") + 
  xlab("") + 
  theme_teach() + 
  theme(legend.position = "none")

What do we notice here?

To calculate the standard deviation, we first calculate the sample mean. Then for each observation, we take the difference from the mean, square those and sum them up. Then we divide that by the sample size minus one and take the square root, then BAM: we have the standard deviation. You can see the manual calculation in the code below or use the built in R function.
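Written as a formula, the steps just described look like this. The symbols are notation introduced here for illustration: $x_i$ is the $i$-th observation, $n$ the sample size, $\bar{x}$ the sample mean, and $s$ the sample standard deviation.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

The quantity under the square root is the sample variance, so squaring $s$ recovers it.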

#sqrt(sum((mean(two_devs$X1) - two_devs$X1)^2)/(length(two_devs$X1) - 1)) #manual sd, same result as sd()
sd(two_devs$X1) #built in command

[1] 0.9998152

Interpret the standard deviation here.

And on the wider distribution:

#sqrt(sum((mean(two_devs$X2) - two_devs$X2)^2)/(length(two_devs$X2) - 1)) #manual sd, same result as sd()
sd(two_devs$X2) #built in command

[1] 9.930511

Putting this all together, you’re likely to see tables like this one that display descriptive statistics for a given dataset. In your own work you should always look at your data both in raw form AND summarized with descriptive statistics. I can’t begin to tell you how often I see analyses in published work where the author clearly never looked at their data and skipped straight to some kind of modeling. This is always a mistake, and it can lead to all kinds of strange conclusions and bad analysis.

modelsummary::datasummary_skim(df)

                    Unique (#)  Mean    SD  Min  Median   Max
age                       1000  29.4  16.1  1.3    27.4  84.9
sex                          2   0.5   0.5  0.0     1.0   1.0
years_of_education        1000  11.9   3.9  0.7    12.0  23.4
ideology.num                 5   3.3   1.4  1.0     3.0   5.0
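If you don’t have a table-making package handy, a similar summary can be sketched in base R. The toy data frame and the helper name `describe` below are invented for this example and only cover numeric columns.

```r
# Invented toy data; not the simulated discussion data
toy <- data.frame(age = c(22, 35, 41, 58),
                  years_of_education = c(12, 16, 12, 18))

# Hypothetical helper: the same summary columns for one numeric variable
describe <- function(v) {
  c(Mean = mean(v), SD = sd(v), Min = min(v), Median = median(v), Max = max(v))
}

# One row per variable, one column per statistic
desc_table <- t(sapply(toy, describe))
round(desc_table, 1)
```

This is only a quick check, not a replacement for also plotting each variable.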