Types of research questions
There are two very different types of statistics, each designed to use
data to answer a different type of research question.
- Descriptive statistics – techniques used to describe and condense
data
- Inferential statistics – techniques used to draw conclusions from
data
Descriptive statistics
- Goal is to summarize the main features of the data
Measures of Central Tendency
- Central tendency: Summarize data with a single value that best
represents it
# Imagine you want to measure perceptions of a public policy program.
# We asked N people to rate it on a scale of 1 – 7.
dataop <- sample(c(1:7), replace = TRUE, size = 2200) # simulate N = 2200 ratings
table(dataop)   # frequency table
mean(dataop)    # mean
median(dataop)  # median
summary(dataop) # summary statistics
hist(dataop)    # histogram
# How do you choose the appropriate measure of central tendency?
# Level of measurement
##### Most appropriate measure?
# Age (18-65) - Mean/Median
# Support for LGBTQ rights (1 to 5) - Mean
# Risk-taking (1 to 7) - Mean
# Partisan Identity - Mode
# Religion - Mode
# Health (poor, fair, good, excellent) - Mean/Median/Mode
# For nominal variables (factors, e.g., race or party), the mean and median do not work; use the mode.
x=c("PT", "PSL", "PSDB", "DEM", "PT", "PDT", "PDT", "PT", "REDE", "PSL", "PT")
mean(x)
sort(table(x))
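# Base R has no statistical mode function (mode() reports the storage type),
# so a small helper is handy. A sketch with a hypothetical helper, not part of
# the original script:
get_mode <- function(v) {
  freq <- table(v)
  names(freq)[freq == max(freq)] # returns ties as well
}
get_mode(x) # "PT"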
# Variability: Measure of how compressed/spread out data points are
# Standard Deviation
# [Figure: the standard-deviation formula, s = sqrt(sum((x_i - mean(x))^2) / (n - 1))]
sqrt(sum((dataop - mean(dataop))^2) / (length(dataop) - 1)) # standard deviation by hand
sd(dataop) # standard deviation
# Whereas the mean, median, and mode are truly different measures of central tendency,
# variance and standard deviation are directly related: the SD is the square root of the variance.
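# A quick check of that relationship (sketch):
var(dataop)                          # variance
sd(dataop)^2                         # the same value
all.equal(var(dataop), sd(dataop)^2) # TRUE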
psych::describe(dataop)
# Skewness and kurtosis tell us about the shape of the distribution.
# If the distribution is perfectly normal, mean = median = mode, and skewness and (excess) kurtosis are zero.
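# psych also exposes these measures directly (a sketch; these are the same values describe() reports):
psych::skew(dataop)    # near 0 for a symmetric distribution
psych::kurtosi(dataop) # excess kurtosis; 0 for a normal distribution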
# The beauty of descriptive statistics
# Can summarize an entire distribution with 2 numbers!
# Typically Mean and Standard deviation
# [Figure: normal curve with bands marking 1, 2, and 3 standard deviations around the mean]
# For normally distributed data, about 68% of observations fall within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD.
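# Verifying the 68-95-99.7 rule with the standard normal CDF (sketch):
pnorm(1) - pnorm(-1) # ~0.683 within 1 SD
pnorm(2) - pnorm(-2) # ~0.954 within 2 SD
pnorm(3) - pnorm(-3) # ~0.997 within 3 SD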
Hypothesis Testing

Chi-squared Test
# The chi-squared (χ²) test, based on the χ² distribution, analyzes whether there is a relationship between two categorical variables, e.g., gender, age bracket, neighborhood, or ethnicity.
# Are men more conservative than women?
chisq.test(databr$petista, databr$homem) # .91 n.s.
chisq.test(databr$tucano, databr$homem) # .73 n.s.
chisq.test(databr$ideology, databr$homem)
chisqgender<-chisq.test(databr$ideology, databr$homem)
format(chisqgender$p.value, scientific=FALSE) # Men are more conservative in Brazil.
# The test is based on the simple idea of comparing the frequencies you observe in each category to the frequencies you would expect in those categories by chance.
table(databr$homem, databr$ideology)
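# databr is loaded elsewhere; a self-contained sketch of the same workflow with
# simulated (hypothetical) variables:
set.seed(42)
sim_gender   <- sample(c(0, 1), size = 500, replace = TRUE)
sim_ideology <- sample(c("left", "center", "right"), size = 500, replace = TRUE)
chisq.test(table(sim_gender, sim_ideology)) # independent by construction, so a large p-value is expected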
T-TEST: Difference of means
- The t-test analyzes whether there is a significant difference between
the means of two groups/samples.
?t.test
t.test(group1$varX, group2$varX) # generic form; both variables must be numeric
# t-tests can answer three types of research questions:
# Single-sample t-tests: Is this sample from this particular population? (sketched below)
# Related-samples t-tests: Was there a change between two time points? (sketched below)
## Repeated-measures design – the same person is in both samples
# Independent-samples t-tests: Are these two samples from the same population?
## Used when researchers want to compare two different sets of participants.
# e.g., Do Petistas and Anti-petistas have different views on abortion?
# e.g., Do people who read Carta Capital have more political knowledge than people who read Tititi?
# e.g., Are people who saw a negative political ad less likely to vote for the candidate than people who didn't?
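# The code below demonstrates the independent-samples case; the other two
# designs, sketched with simulated (hypothetical) data:
set.seed(1)
t.test(rnorm(100, mean = 4.2), mu = 4) # single-sample: is the mean different from 4?
wave1 <- rnorm(50, mean = 3.5)
wave2 <- wave1 + rnorm(50, mean = 0.3) # simulated change between two time points
t.test(wave1, wave2, paired = TRUE)    # related-samples (repeated measures)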
hs<-subset(databr, homem==1) # men
ms<-subset(databr, homem==0) # women
t.test(ms$ideology, hs$ideology) # Significant
ind.t.test<-t.test(ideology ~ homem, databr) # same test via the formula interface
# We can also calculate Pearson's r (the effect size) using R.
# The value of t is stored in our model as a variable called statistic[[1]]
# and the degrees of freedom are stored as parameter[[1]].
# The value of the effect size of Pearson r correlation varies between -1 (a perfect negative corr) to +1 (a perfect positive corr).
# According to Cohen (1988, 1992), the effect size is LOW if r varies around 0.1,
# MEDIUM if r varies around 0.3, and LARGE if r varies more than 0.5.
t<-ind.t.test$statistic[[1]]
df<-ind.t.test$parameter[[1]]
r <- sqrt( t^2 / (t^2+df) ) # Rosenthal, 1991; Rosnow & Rosenthal, 2005
round(r, 3) # small effect size
# A t-test looks at the t-statistic, the t-distribution values, and the degrees of freedom to determine the significance.
# To conduct a test with three or more means, one must use an analysis of variance.
TUKEY TEST & ANOVA - Analysis of Variance
- In an analysis of variance, the null hypothesis is that there is no
difference between the groups. Instead of just 2 groups, we compare
several groups at the same time.
What if your research question involves more than two groups?
- Does listening to rock, country, or classical music increase
studying efficiency?
- Do independents differ from left- and right-wing partisans, or are
they simply partisans in disguise?
You could systematically compare all possible pairs of these groups
using independent-samples t-tests, but conducting multiple tests
inflates the familywise (experiment-wise) error rate.
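For example, assuming independent tests at alpha = .05:
1 - 0.95^3  # 3 pairwise tests: probability of at least one false positive ~ .14
1 - 0.95^10 # 10 pairwise tests: ~ .40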

Instead of looking at the difference between population means, ANOVA
(analysis of variance) calculates the variance between means. H0: μ1 =
μ2 = … = μk
- Calculating an ANOVA is different from calculating t-tests: variation
rather than a simple difference.
- F-tests: always one-tailed, because variances can never be less than
zero.
databr$partidario2<-as.factor(databr$partidario2) # Transforming into a categorical variable
databr$idpart<-factor(databr$idpart, levels=c(-2, -1, 0, 1, 2),
labels=c("Petista Forte", "Petista Moderado",
"Indiferente","Tucano Moderado", "Tucano Forte"))
# ONE-WAY ANOVA
a1<-aov(ideology ~ partidario2, data = databr); summary(a1) # significant
a2<-aov(ideology ~ idpart, data = databr); summary(a2) # significant
# Sum Sq: the variation attributable to each source – the model term (between groups) and the residuals (within groups).
# Mean Sq: the Sum Sq divided by its degrees of freedom.
# F value: The F-ratio is a measure of the ratio of the variation explained by the model and the variation explained by unsystematic factors.
# In other words, it is the ratio of how good the model is against how bad it is (how much error there is).
# F-Value: Mean Sq Model / Mean Sq Residuals
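# Checking F = Mean Sq(model) / Mean Sq(residuals) on the fitted model (sketch):
tab <- summary(a1)[[1]]
tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2] # matches the F value printed above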
# Compute Tukey Honest Significant Differences (TukeyHSD)
# Create a set of confidence intervals on the differences between the means of the levels of a factor.
# First aov() then TukeyHSD()
posthoc1 <- TukeyHSD(x=a1, 'partidario2', conf.level=0.95) # 3 Groups
posthoc2 <- TukeyHSD(x=a2, 'idpart', conf.level=0.95) # 5 Groups
plot(posthoc1)
plot(posthoc2)
# 2-WAY ANOVA
a1_2way<-aov(ideology ~ partidario2 + homem, data = databr); summary(a1_2way)
a2_2way<-aov(ideology ~ idpart + homem, data = databr); summary(a2_2way)
# Note on TukeyHSD: the intervals constructed in this way apply exactly only to balanced designs, where the same number of observations is made at each level of the factor.
# The function incorporates an adjustment for sample size that produces sensible intervals for mildly unbalanced designs.
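# A quick way to inspect how balanced the design is (sketch):
table(databr$partidario2, databr$homem) # cell counts per factor combination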
FURTHER READING
- Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics
Using R. Sage Publications. Chapters 9-10.