COMM621 Quantitative Research Methods

Weiai Wayne Xu curiositybits.cc Department of Communication, UMass-Amherst

2019-02-14


This tutorial series introduces computational tools, tricks, and tips that you will find useful in conducting quantitative social science research. I developed the tutorials for COMM621, taught at UMass-Amherst. If you are an instructor or a student of quantitative research methods, feel free to adapt them for your class.

If you want to learn R for mining, analyzing, and visualizing social media data, here is another set of tutorials I’ve developed: https://curiositybits.shinyapps.io/R_social_data_analytics/#section-data-frames

Samples and sampling distribution

If you recall from reading Charles Wheelan’s Naked Statistics, the magic of inferential statistics lies in the central limit theorem (the LeBron James of statistics, as Wheelan calls it). The theorem implies that a sufficiently large, properly drawn sample will resemble the population we want to make inferences about, and that the means of repeated samples will cluster predictably around the population mean.

Below we demonstrate the central limit theorem using part of the open data provided by the Los Angeles Police Department (LAPD). The dataset contains all arrests made by the LAPD in 2016. For simplicity, I included only the columns that are relevant to the demo. Read here for a description of the variables.

library(DT)
library(readr)
library(ggplot2)
library(pastecs)
library(sampling)
library(gmodels)
library(data.table)

arrests <- read.csv("https://curiositybits.cc/files/la_arrests_2016.csv")

#show the first 500 arrests
datatable(arrests[1:500,], options = list(pageLength = 5)) 

Before drawing samples, take a look at the distribution of age (the age of those arrested) in the population data. We run the following code to get a set of descriptive stats, including the median, mean, standard deviation, etc. Because the dataset includes all of the arrests in 2016, we can treat it as population data. The population mean, as noted in the output below, is 34.93.

#the age info is stored in the AGE column
age <- arrests$AGE 
stat.desc(age, basic = F) #show the descriptive stats of age
##       median         mean      SE.mean CI.mean.0.95          var 
##   32.0000000   34.9274404    0.0894575    0.1753430  177.8987791 
##      std.dev     coef.var 
##   13.3378701    0.3818737
qplot(age, geom="histogram") #make a histogram
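As a quick sanity check on that output, the SE.mean reported by stat.desc() is simply the standard deviation divided by the square root of the number of cases; a minimal sketch:

#SE of the mean = sd / sqrt(n); should match SE.mean above (~0.089)
n <- sum(!is.na(age)) #count non-missing ages
sd(age, na.rm = TRUE) / sqrt(n)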

We will construct a sample of 100 from age to see how the sample mean approximates the population mean.

sample1 <- sample(age, 100)
summary(sample1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   23.75   30.50   35.76   45.25   85.00
qplot(sample1, geom="histogram") 

Based on the histogram of sample1, the age values are quite dispersed, with notable outliers. Next, we construct a sample of 500 to see if the new sample mean moves closer to the population mean.

sample2 <- sample(age, 500)
summary(sample2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   24.00   33.00   35.65   46.00   78.00
qplot(sample2, geom="histogram") 

Try a bigger sample. This time we draw 5000 cases.

sample3 <- sample(age, 5000)
summary(sample3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.00   32.00   34.85   45.00   85.00
qplot(sample3, geom="histogram") 

The histogram of sample3 looks much closer to the distribution of age in the population data.

Based on the central limit theorem, if we draw a large enough number of samples, the distribution of sample means will approximate the normal distribution (the bell curve), and the larger each sample is, the tighter that distribution will be.

Let’s demonstrate how that plays out in our data. Below we draw 5000 samples with each sample containing 1000 cases. We plot the sample means in the histogram shown below.

sample_means <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(age, 1000)
  sample_means[i] <- mean(samp)
}
qplot(sample_means, geom="histogram") 

summary(sample_means)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   33.19   34.65   34.92   34.92   35.19   36.42
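To see the “tighter” part of the theorem directly, here is a minimal sketch extending the demo above, which compares the spread of sample means for two sample sizes:

#the spread of sample means shrinks as the sample size grows
means_100 <- replicate(5000, mean(sample(age, 100), na.rm = TRUE))
means_1000 <- replicate(5000, mean(sample(age, 1000), na.rm = TRUE))
sd(means_100)  #larger spread
sd(means_1000) #smaller spread, roughly 1/sqrt(10) of the above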

Sampling designs

To arrive at generalizable conclusions about the population, we must use properly constructed samples. Below, we introduce several commonly used sampling designs.

In our case, we have access to the population data (all arrests made in LA in 2016), so it is fairly easy to produce a simple random sample. In the code below, we draw a sample of 500 cases without replacement.

randomsample <- arrests[sample(1:nrow(arrests), 500, replace=FALSE),] 
knitr::kable(head(randomsample, 5)) #show the first 5 cases
|       | AREA_DESC | AGE | SEX_CD | DESCENT_CD | GRP_DESC                       | ARST_TYP_CD | CHARGE     | CHRG_DESC                  | LOCATION          |
|-------|-----------|-----|--------|------------|--------------------------------|-------------|------------|----------------------------|-------------------|
| 10091 | Mission   | 24  | M      | O          | Other Assaults                 | M           | 243(A)PC   | BATTERY ON PERSON          | 13600 GLENOAKS BL |
| 4923  | Foothill  | 43  | F      | B          | Drunkeness                     | M           | 41.27CLAMC | DRINKING IN PUBLIC         | LEHIGH            |
| 3042  | Central   | 57  | M      | B          | Miscellaneous Other Violations | M           | LAMC       | LOS ANGELES MUNICIPAL CODE | 6TH ST            |
| 22117 | Wilshire  | 57  | M      | H          | Driving Under Influence        | F           | 23153(A)VC | DUI ALCOHOL/DRUGS W/INJURY | LA BREA           |
| 15027 | Pacific   | 39  | M      | O          | NA                             | M           | 5411PUC    | NA                         | 86TH              |
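One caveat: sample() draws different rows every time it runs, so your table will differ from the one above. If you want a reproducible sample, set a random seed first (the seed value below is an arbitrary choice):

set.seed(123) #any fixed number works; 123 is arbitrary
randomsample <- arrests[sample(1:nrow(arrests), 500, replace=FALSE),]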

Take a look at the descriptive stats of the random sample and see how cases are distributed by sex and geographic areas.

mean(randomsample$AGE) #show mean
## [1] 35.25
sd(randomsample$AGE) #show standard deviation
## [1] 13.44578
table(randomsample$SEX_CD) #show frequency count by sex
## 
##   F   M 
## 107 393
table(randomsample$AREA_DESC) #show frequency count by geographic areas
## 
## 77th Street     Central  Devonshire    Foothill      Harbor  Hollenbeck 
##          39          55          12          16          16          14 
##   Hollywood     Mission      Newton N Hollywood   Northeast     Olympic 
##          39          28          23          23          16          16 
##     Pacific     Rampart   Southeast   Southwest     Topanga    Van Nuys 
##          43          24           8          34          17          29 
##     West LA West Valley    Wilshire 
##          19          15          14

As you can see from the sample above (and from the population it came from), the majority of those arrested are male. It is also not far-fetched to say that arrests are concentrated in certain areas. But what if we set aside how cases are actually distributed and want a sample with equal representation from both sexes and all LA areas?
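To express the sex breakdown as proportions rather than raw counts, a quick sketch:

prop.table(table(randomsample$SEX_CD)) #share of F and M in the sample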

Below we create a stratified sample based on two stratification variables: sex and geographic area. The sex variable is called SEX_CD, and the geographic area info is stored in AREA_DESC.

arrests <- data.table(arrests)
setkey(arrests, AREA_DESC, SEX_CD)

#display the number of cases in each stratum
knitr::kable(arrests[, .N, keyby = list(AREA_DESC, SEX_CD)])
| AREA_DESC   | SEX_CD | N    |
|-------------|--------|------|
| 77th Street | F      | 321  |
| 77th Street | M      | 949  |
| Central     | F      | 716  |
| Central     | M      | 2308 |
| Devonshire  | F      | 170  |
| Devonshire  | M      | 446  |
| Foothill    | F      | 129  |
| Foothill    | M      | 569  |
| Harbor      | F      | 140  |
| Harbor      | M      | 542  |
| Hollenbeck  | F      | 103  |
| Hollenbeck  | M      | 585  |
| Hollywood   | F      | 397  |
| Hollywood   | M      | 1718 |
| Mission     | F      | 263  |
| Mission     | M      | 848  |
| Newton      | F      | 166  |
| Newton      | M      | 888  |
| N Hollywood | F      | 254  |
| N Hollywood | M      | 801  |
| Northeast   | F      | 123  |
| Northeast   | M      | 536  |
| Olympic     | F      | 130  |
| Olympic     | M      | 699  |
| Pacific     | F      | 346  |
| Pacific     | M      | 1622 |
| Rampart     | F      | 160  |
| Rampart     | M      | 762  |
| Southeast   | F      | 203  |
| Southeast   | M      | 582  |
| Southwest   | F      | 236  |
| Southwest   | M      | 1033 |
| Topanga     | F      | 178  |
| Topanga     | M      | 482  |
| Van Nuys    | F      | 315  |
| Van Nuys    | M      | 861  |
| West LA     | F      | 96   |
| West LA     | M      | 370  |
| West Valley | F      | 156  |
| West Valley | M      | 471  |
| Wilshire    | F      | 105  |
| Wilshire    | M      | 451  |

You will note that there are 42 rows in the above output, meaning there are 42 strata. Run the code below to randomly select 15 cases from each stratum. The parameter “srswr” stands for “simple random sampling with replacement,” which is the method used to select units within each stratum. Run ?strata for the help page.

set.seed(2)
ss <- data.table(strata(arrests, c("AREA_DESC", "SEX_CD"), rep(15, 42), "srswr"))
datatable(ss, options = list(pageLength = 5)) 
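Note that strata() returns the design information (row IDs, strata, and inclusion probabilities) rather than the full rows. To attach the original arrest variables to the sampled units, the sampling package provides getdata(); a minimal sketch, assuming ss still carries the ID_unit column that strata() produced:

ss_full <- getdata(arrests, ss) #merge sampled units with their original columns
datatable(ss_full, options = list(pageLength = 5))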

Cluster sampling is another commonly used design. The code below demonstrates a simple one-stage cluster sample: we randomly select 5 areas (the clusters) using simple random sampling without replacement (“srswor”), and every case in the selected areas enters the sample.

cs <- cluster(arrests, clustername=c("AREA_DESC"),size=5,method="srswor")

#see which areas are selected in the sample and the number of cases in each area
cs <- data.table(cs)
knitr::kable(cs[, .N, keyby = list(AREA_DESC)])
| AREA_DESC   | N    |
|-------------|------|
| Hollenbeck  | 688  |
| Newton      | 1054 |
| N Hollywood | 1055 |
| Pacific     | 1968 |
| Rampart     | 922  |
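If you instead want a two-stage design (first sample areas, then sample cases within each selected area), here is a minimal data.table sketch; the seed and the 100-cases-per-area figure are arbitrary choices for illustration:

set.seed(3) #arbitrary seed for reproducibility
picked_areas <- sample(unique(arrests$AREA_DESC), 5) #stage 1: sample 5 areas
#stage 2: draw up to 100 cases within each selected area
twostage <- arrests[AREA_DESC %in% picked_areas, .SD[sample(.N, min(.N, 100))], by = AREA_DESC]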

Is your statistical significance substantive?

One of the most misunderstood concepts in social science is the p-value. It is not an intuitive idea to wrap your head around; even many scientists struggle to explain it in layman’s terms, as shown in the video below.

What is a p-value?

We use the p-value to indicate statistical significance. The magic number here is .05: a p-value below .05 is conventionally taken to indicate a statistically significant relationship between variables. A common interpretation is that there is a 95% chance that the relationship found in the study sample is NOT due to random chance, and thus that the finding from the sample accurately reflects the pattern in the population.

But, even that is an over-extrapolation. Read this article by FiveThirtyEight.

This is how the article explains the p-value:

So what information can you glean from a p-value? The most straightforward explanation I found came from Stuart Buck, vice president of research integrity at the Laura and John Arnold Foundation. Imagine, he said, that you have a coin that you suspect is weighted toward heads. (Your null hypothesis is then that the coin is fair.) You flip it 100 times and get more heads than tails. The p-value won’t tell you whether the coin is fair, but it will tell you the probability that you’d get at least as many heads as you did if the coin was fair. That’s it — nothing more.
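We can make Buck’s coin example concrete in R. Suppose (hypothetically) you flip the coin 100 times and get 60 heads; the p-value is the probability of getting at least 60 heads if the coin were fair:

#one-sided exact test: P(at least 60 heads out of 100 | fair coin)
binom.test(60, 100, p = 0.5, alternative = "greater")$p.value
#returns roughly 0.028 -- that probability is the p-value, nothing more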

Putting the p-value to the test

We will use a dataset containing the math scores of all students in the New York area. We download the data and name the data frame math. The scores are in the Mean.Scale.Score column and students’ gender is stored in Category.

library(DT)
library(readr)
library(ggplot2)
library(pastecs)
library(sampling)
library(gmodels)
library(dplyr)
library(data.table)

math <- read.csv("https://curiositybits.cc/files/math.csv")

#add a column for row id. We will need the row ids for merging datasets in later steps
math <- mutate(math, ID_unit = rownames(math)) 
math$ID_unit <- as.integer(math$ID_unit)

# Convert the Category column into a categorical variable
math$Category <- as.factor(math$Category)

#show the first 50 observations
datatable(math[1:50,], options = list(pageLength = 5)) 

Since this dataset can be treated as the population data, we know exactly how female and male students differ in math scores. Run the code below to see the mean difference. Female students do slightly better than male students, but the difference is small (less than 2 points).

aggregate(math$Mean.Scale.Score,by=list(Gender=math$Category), mean)
##   Gender        x
## 1 Female 300.2872
## 2   Male 298.7334
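Since dplyr is already loaded, the same group means can also be computed with a pipe, which some readers may find more familiar:

math %>%
  group_by(Category) %>%
  summarise(mean_score = mean(Mean.Scale.Score)) #same means as aggregate() above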

In most research scenarios, we don’t have access to the population data and thus have to resort to sampling. Next, we draw a random sample of 100, compare means across groups, and make inferences about the population using this sample. Because this is a comparison of means between two groups, we will use a simple t-test.

sample1 <- sample_n(math, 100)
t.test(Mean.Scale.Score ~ Category,data=sample1)
## 
##  Welch Two Sample t-test
## 
## data:  Mean.Scale.Score by Category
## t = -0.57588, df = 97.961, p-value = 0.566
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.751783   5.915172
## sample estimates:
## mean in group Female   mean in group Male 
##             296.6383             299.0566

What is the p-value of the above t-test? Is it statistically significant?

If you see a p-value higher than .05, as we do here, the test is NOT significant: the observed difference could be due to random chance and is unlikely to reflect a true difference in the population. The 95 percent confidence interval gives you a range of plausible values for the mean difference. Notice that 0 is inside the range, which is another way of saying the data are consistent with no difference at all.
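If you want to pull these numbers out programmatically rather than reading them off the printout, the t-test result is a list you can index; a quick sketch:

tt <- t.test(Mean.Scale.Score ~ Category, data=sample1) #store the test object
tt$p.value  #the p-value
tt$conf.int #the 95% confidence interval for the mean difference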

Let’s run a test using a bigger sample (n=1000). Is the test statistically significant this time?

sample2 <- sample_n(math, 1000)
t.test(Mean.Scale.Score ~ Category,data=sample2)
## 
##  Welch Two Sample t-test
## 
## data:  Mean.Scale.Score by Category
## t = 0.44966, df = 995.83, p-value = 0.6531
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.061520  3.287114
## sample estimates:
## mean in group Female   mean in group Male 
##             300.0813             299.4685

Try an even bigger sample (n=5000).

sample3 <- sample_n(math, 5000)
t.test(Mean.Scale.Score ~ Category,data=sample3)
## 
##  Welch Two Sample t-test
## 
## data:  Mean.Scale.Score by Category
## t = 3.4766, df = 4997.4, p-value = 0.0005122
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9315152 3.3404871
## sample estimates:
## mean in group Female   mean in group Male 
##             300.2922             298.1562

Very likely, you will see a significant difference with a large sample. A rule of thumb: all else being equal, you are more likely to obtain statistically significant results when using larger samples.

In our case, with a sample of 5000, even very tiny differences in math scores become detectable. But is the difference meaningful? In other words, does statistical significance mean substantive significance? It does not.

A p-value says nothing about the “strength of association,” or the meaningfulness of a difference. Yes, female students have better math scores than male students, but the difference, in both the sample and the population, is negligible; it would be absurd to talk about a substantive gap in mathematical skills when male and female students are just a couple of points apart.
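One common way to quantify the strength of association is an effect size such as Cohen’s d: the mean difference divided by the pooled standard deviation. A minimal by-hand sketch on sample3 (your exact numbers will vary because the sample is random):

f <- sample3$Mean.Scale.Score[sample3$Category == "Female"]
m <- sample3$Mean.Scale.Score[sample3$Category == "Male"]
#pooled standard deviation of the two groups
pooled_sd <- sqrt(((length(f)-1)*var(f) + (length(m)-1)*var(m)) / (length(f)+length(m)-2))
(mean(f) - mean(m)) / pooled_sd #around 0.1 here -- conventionally a very small effect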

The takeaway? Statistical significance ≠ substantive significance.

Some social scientists are accused of engaging in p-hacking to inflate their scholarly contributions. When they have trouble finding significant results, they may be tempted to try a bigger sample without asking whether a significant but tiny difference between groups is truly meaningful. There is an entire episode of Last Week Tonight with John Oliver dedicated to uncovering p-hacking.


A work by Weiai Wayne Xu

curiositybits.cc