Types of research questions
There are two broad types of statistics, each designed to answer a different type of research question.
  • Descriptive statistics – techniques used to describe and condense data
  • Inferential statistics – techniques used to draw conclusions from data
Descriptive statistics
Measures of Central Tendency
# Imagine you want to measure perceptions of a public policy program.
# We asked N people to rate it on a scale of 1 – 7. 
dataop <- sample(1:7, size = 2200, replace = TRUE) # 2,200 simulated ratings

table(dataop) # Frequency table
mean(dataop) # Mean
median(dataop) # Median
summary(dataop) # Summary statistics
hist(dataop) # Histogram

# How do you choose the appropriate measure of central tendency?
# Level of measurement

##### Most appropriate measure?
# Age (18-65) - Mean/Median
# Support for LGBTQ rights (1 to 5) - Mean
# Risk-taking (1 to 7) - Mean
# Partisan Identity - Mode
# Religion - Mode
# Health (poor, fair, good, excellent) - Median/Mode (ordinal: the categories are ordered but not equally spaced)

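# For nominal variables (factors, e.g. race or party), the mean and median are undefined. Use the mode.
x <- c("PT", "PSL", "PSDB", "DEM", "PT", "PDT", "PDT", "PT", "REDE", "PSL", "PT")
mean(x) # Returns NA with a warning: the mean is undefined for character data
sort(table(x)) # Counts in ascending order; the mode (PT) is listed last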
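Base R has no function for the statistical mode (its mode() returns the storage type of an object). A minimal sketch of a helper, here given the hypothetical name stat_mode, could look like this:

```r
# stat_mode() is a hypothetical helper name, not part of base R.
# It returns every value tied for the highest frequency.
stat_mode <- function(v) {
  counts <- table(v)
  names(counts)[counts == max(counts)]
}

parties <- c("PT", "PSL", "PSDB", "DEM", "PT", "PDT", "PDT", "PT", "REDE", "PSL", "PT")
stat_mode(parties) # "PT" (it appears 4 times)
```

Returning all ties rather than a single value is a design choice: a nominal variable can have more than one mode.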

# Variability: Measure of how compressed/spread out data points are

# Standard Deviation
sd_fig <- magick::image_read("/Users/rb5286/Library/Mobile Documents/com~apple~CloudDocs/PIBIC2021/figures/sd.png"); plot(sd_fig) # named sd_fig so the object does not share a name with sd()

sqrt(sum((dataop-mean(dataop))^2 / (length(dataop)-1))) # Standard deviation computed by hand
sd(dataop) # Standard deviation (built-in)

# Whereas mean, median, and mode are all truly different from each other,
# variance and standard deviation are directly related: the standard deviation is the square root of the variance.
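The relationship between the two measures can be checked directly. The sketch below uses simulated ratings (the seed and sample size are arbitrary choices for illustration):

```r
set.seed(1) # arbitrary seed, only for reproducibility
ratings <- sample(1:7, size = 200, replace = TRUE) # simulated data

var(ratings) # variance, in squared units of the scale
sd(ratings)  # standard deviation, in the original units of the scale

# The squared standard deviation equals the variance
all.equal(sd(ratings)^2, var(ratings))
```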

psych::describe(dataop)
# Skewness and kurtosis tell us about the shape of the distribution.
# If the distribution is perfectly normal, mean = median = mode, and skewness and excess kurtosis are both zero.

# The beauty of descriptive statistics
# Can summarize an entire distribution with 2 numbers!
# Typically Mean and Standard deviation
sd2 <- magick::image_read("/Users/rb5286/Library/Mobile Documents/com~apple~CloudDocs/PIBIC2021/figures/sd2.png"); plot(sd2)
# For a normal distribution, about 68% of the data lies within 1 SD of the mean, about 95% within 2 SD, and about 99.7% within 3 SD.
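These coverage figures come from the normal distribution and can be reproduced with pnorm(). The helper name within_sd is an assumption made for this sketch:

```r
# within_sd() is a hypothetical helper: the share of a normal
# distribution that lies within k standard deviations of the mean.
within_sd <- function(k) pnorm(k) - pnorm(-k)

round(within_sd(1:3), 4) # 0.6827 0.9545 0.9973

# Exactly 95% corresponds to roughly 1.96 SD rather than 2 SD
qnorm(0.975)
```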
Hypothesis Testing

Chi-squared Test
# The chi-squared (χ2) test analyzes whether there is a relationship between two categorical variables, for example gender, age range, neighborhood, or ethnicity.

# Are men more conservative than women?
chisq.test(databr$petista, databr$homem) # .91 n.s.
chisq.test(databr$tucano, databr$homem) # .73 n.s.

chisq.test(databr$ideology, databr$homem)

chisqgender<-chisq.test(databr$ideology, databr$homem)
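format(chisqgender$p.value, scientific=FALSE) # Significant: ideology and gender are associated. Men are more conservative in Brazil (the direction is read from the cross-table, not from the test itself).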

# Based on the simple idea of comparing the frequencies you observe in certain categories to the frequencies you might expect to get in those categories by chance.
table(databr$homem, databr$ideology)
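The comparison of observed and expected frequencies can be made explicit. The sketch below uses a made-up 2 x 3 table (the counts are invented for illustration and are not taken from databr) and rebuilds the statistic by hand:

```r
# Made-up counts for illustration only (not from databr)
toy <- matrix(c(30, 20, 10,
                20, 30, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("women", "men"),
                              ideology = c("left", "center", "right")))

# Expected count in each cell if the two variables were unrelated:
# (row total * column total) / grand total
expected <- outer(rowSums(toy), colSums(toy)) / sum(toy)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_manual <- sum((toy - expected)^2 / expected)

res <- chisq.test(toy)
all.equal(unname(res$statistic), chi2_manual)
all.equal(res$expected, expected, check.attributes = FALSE)
```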
T-TEST: Difference in Means
?t.test

t.test(group1$varX, group2$varX) # Variables must be numeric

# t-tests can answer three types of research questions:
# Single-sample t-tests: Is this sample from this particular population?

# Related-samples t-tests: Was there a change between two time points?
## Repeated measures design – the same people are in both samples

# Independent-samples t-tests: Are these two samples from the same population?
## When researchers want to compare two different sets of participants.

# e.g., Do Petistas and Anti-petistas have different views on abortion?
# e.g., Do people who read Carta Capital have more political knowledge than people who read Tititi?
# e.g., Are people who saw a negative political ad less likely to vote for the candidate than people who didn't?
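The three designs map onto three ways of calling t.test(). The sketch below runs each one on simulated data (all variable names, means, and sample sizes are assumptions chosen for illustration):

```r
set.seed(42) # arbitrary seed; all data below are simulated
before <- rnorm(30, mean = 4.0, sd = 1)          # time 1
after  <- before + rnorm(30, mean = 0.5, sd = 1) # same people, time 2
other  <- rnorm(35, mean = 4.5, sd = 1)          # a different set of people

# Single-sample: is the sample mean different from a known value (here 4)?
one_sample <- t.test(before, mu = 4)

# Related-samples: did the same people change between two time points?
paired <- t.test(after, before, paired = TRUE)

# Independent-samples: do two separate groups differ?
# (t.test() defaults to the Welch version, which does not assume equal variances)
independent <- t.test(before, other)

one_sample
paired
independent
```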

hs<-subset(databr, homem==1) # Men
ms<-subset(databr, homem==0) # Women

t.test(ms$ideology, hs$ideology) # Significant

ind.t.test<-t.test(ideology ~ homem, databr)
# We can also calculate Pearson's r (the effect size) in R.
# The value of t is stored in the test object as statistic[[1]]
# and the degrees of freedom are stored as parameter[[1]].

# Pearson's r ranges from -1 (a perfect negative correlation) to +1 (a perfect positive correlation).
# According to Cohen (1988, 1992), the effect size is SMALL if r is around 0.1,
# MEDIUM if r is around 0.3, and LARGE if r is 0.5 or more.

t_val <- ind.t.test$statistic[[1]] # named t_val and df_val so they do not share names with t() and df()
df_val <- ind.t.test$parameter[[1]]
r <- sqrt( t_val^2 / (t_val^2 + df_val) ) # Rosenthal, 1991; Rosnow & Rosenthal, 2005
round(r, 3) # small effect size

# A t-test looks at the t-statistic, the t-distribution values, and the degrees of freedom to determine the significance. 

# To conduct a test with three or more means, one must use an analysis of variance.
TUKEY TEST & ANOVA - Variance Analysis
What if your research question involves more than two groups?
  • Does listening to rock, country, or classical music increase studying efficiency?
  • Do independents differ from left- and right-wing partisans, or are they simply partisans in disguise?
You could systematically compare all possible pairs of these groups using independent-samples t-tests, but conducting multiple tests inflates the family-wise error rate, that is, the probability of at least one false positive across the whole set of tests.
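How quickly the chance of a false positive grows can be worked out directly. The sketch below assumes independent tests, each run at alpha = 0.05 (real pairwise comparisons share data, so treat the figures as an approximation):

```r
alpha <- 0.05
k <- 3:5                  # number of groups
m <- choose(k, 2)         # number of pairwise t-tests needed
fwer <- 1 - (1 - alpha)^m # chance of at least one false positive, assuming independent tests

data.frame(groups = k, comparisons = m, fwer = round(fwer, 3))
# 3 groups ->  3 comparisons -> 0.143
# 4 groups ->  6 comparisons -> 0.265
# 5 groups -> 10 comparisons -> 0.401
```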

Instead of looking at the difference between two means, ANOVA (analysis of variance) compares the variation between group means with the variation within groups. H0: m_1 = m_2 = … = m_k
  • Calculating ANOVA is different than calculating t-tests: difference versus variation.
  • F-tests: always one-tailed, because variances can never be less than zero.
databr$partidario2<-as.factor(databr$partidario2) # Transforming into a categorical variable
databr$idpart<-factor(databr$idpart, levels=c(-2, -1, 0, 1, 2), 
                      labels=c("Petista Forte", "Petista Moderado", 
                               "Indiferente","Tucano Moderado", "Tucano Forte"))

# ONE-WAY ANOVA
a1<-aov(ideology ~ partidario2, data = databr); summary(a1) # Significant
a2<-aov(ideology ~ idpart, data = databr); summary(a2) # Significant

# Sum Sq: the sum of squares, i.e. the variation attributed to each row of the table: the model term (between groups) and the residuals (within groups).
# Mean Sq: the sum of squares divided by its degrees of freedom (Sum Sq / Df).
# F value: The F-ratio is a measure of the ratio of the variation explained by the model and the variation explained by unsystematic factors. 
# In other words, it is the ratio of how good the model is against how bad it is (how much error there is).

# F-Value: Mean Sq Model / Mean Sq Residuals
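The F-ratio can be rebuilt by hand from the sums of squares. The sketch below uses made-up scores for three groups (invented for illustration, not taken from databr) and checks the result against aov():

```r
# Made-up scores for three groups of four (illustration only)
toy <- data.frame(
  score = c(2, 3, 3, 4,  4, 5, 5, 6,  6, 7, 7, 8),
  group = factor(rep(c("a", "b", "c"), each = 4))
)

grand_mean  <- mean(toy$score)
group_means <- tapply(toy$score, toy$group, mean)
group_sizes <- tapply(toy$score, toy$group, length)

# Between-groups (model) and within-groups (residual) sums of squares
ss_model <- sum(group_sizes * (group_means - grand_mean)^2)
ss_resid <- sum((toy$score - ave(toy$score, toy$group))^2)

# Mean squares: each sum of squares divided by its degrees of freedom
ms_model <- ss_model / (nlevels(toy$group) - 1)
ms_resid <- ss_resid / (nrow(toy) - nlevels(toy$group))

f_manual <- ms_model / ms_resid # 24 for these made-up scores

aov_table <- summary(aov(score ~ group, data = toy))[[1]]
all.equal(f_manual, aov_table[["F value"]][1])
```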

# Compute Tukey Honest Significant Differences (TukeyHSD)
# Create a set of confidence intervals on the differences between the means of the levels of a factor.

# First aov() then TukeyHSD()
posthoc1 <- TukeyHSD(x=a1, 'partidario2', conf.level=0.95) # 3 Groups
posthoc2 <- TukeyHSD(x=a2, 'idpart', conf.level=0.95) # 5 Groups 
plot(posthoc1)
plot(posthoc2)

# 2-WAY ANOVA
a1_2way<-aov(ideology ~ partidario2 + homem, data = databr); summary(a1_2way)
a2_2way<-aov(ideology ~ idpart + homem, data = databr); summary(a2_2way)

# Note on TukeyHSD(): its intervals apply exactly only to balanced designs, where each level of the factor has the same number of observations.
# The function incorporates an adjustment for sample size that produces sensible intervals for mildly unbalanced designs.
FURTHER READING