Descriptive Statistics in R

Types of research questions

There are two broad types of statistics, each designed to use data to answer a different type of research question.

Descriptive statistics – techniques used to describe and condense data
Inferential statistics – techniques used to draw conclusions from data

Descriptive statistics

The goal is to summarize the main features of the data.

Measures of Central Tendency

Central tendency: Summarize data with a single value that best represents it

# Imagine you want to measure perceptions of a public policy program.
# We asked N people to rate it on a scale from 1 to 7.
dataop <- sample(1:7, size = 2200, replace = TRUE) # Simulate N = 2200 ratings

table(dataop) # Frequency table
mean(dataop) # Mean
median(dataop) # Median
summary(dataop) # Summary statistics
hist(dataop) # Histogram

# How do you choose the appropriate measure of central tendency?
# Level of measurement

##### Most appropriate measure?
# Age (18-65) - Mean/Median
# Support for LGBTQ rights (1 to 5) - Mean
# Risk-taking (1 to 7) - Mean
# Partisan Identity - Mode
# Religion - Mode
# Health (poor, fair, good, excellent) - Mean/Median/Mode

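# For nominal variables (factors, e.g., race), the mean and median do not work; use the mode.
x <- c("PT", "PSL", "PSDB", "DEM", "PT", "PDT", "PDT", "PT", "REDE", "PSL", "PT")
mean(x) # Returns NA with a warning because x is not numeric
sort(table(x)) # Frequencies in ascending order; the mode is the last entry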
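
Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type instead), so a small helper is a common workaround. The sketch below uses a hypothetical helper name, `stat_mode`, which is an assumption for illustration and not part of any package; it returns every value tied for the highest frequency.

```r
# Hypothetical helper (not part of base R): most frequent value(s) of a vector.
stat_mode <- function(v) {
  counts <- table(v)                    # frequency of each category
  names(counts)[counts == max(counts)]  # keep all categories tied for the maximum
}

x <- c("PT", "PSL", "PSDB", "DEM", "PT", "PDT", "PDT", "PT", "REDE", "PSL", "PT")
stat_mode(x) # "PT" (appears 4 times, more than any other party)
```

Because the result comes from `names()`, it is always a character vector, even for numeric input.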

# Variability: Measure of how compressed/spread out data points are

# Standard Deviation
# The figure is read from a path on the author's machine.
sd_fig <- magick::image_read("/Users/rb5286/Library/Mobile Documents/com~apple~CloudDocs/PIBIC2021/figures/sd.png"); plot(sd_fig)

sqrt(sum((dataop - mean(dataop))^2 / (length(dataop) - 1))) # Standard deviation computed by hand
sd(dataop) # Standard deviation

# Whereas the mean, median, and mode are genuinely different quantities,
# the variance and standard deviation are directly related: the standard deviation is the square root of the variance.
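
A quick check of that relationship, using a small made-up vector (`scores` is illustrative data, not part of the survey):

```r
scores <- c(2, 4, 4, 5, 7)           # illustrative data only

var(scores)                          # sample variance
sd(scores)                           # sample standard deviation
all.equal(sd(scores)^2, var(scores)) # TRUE: squaring the SD recovers the variance
```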

psych::describe(dataop)
# Skewness and kurtosis describe the shape of the distribution.
# For a perfectly normal distribution, mean = median = mode, and skewness and excess kurtosis are both zero.

# The beauty of descriptive statistics
# Can summarize an entire distribution with 2 numbers!
# Typically Mean and Standard deviation
sd2 <- magick::image_read("/Users/rb5286/Library/Mobile Documents/com~apple~CloudDocs/PIBIC2021/figures/sd2.png"); plot(sd2)
# For a normal distribution, about 68% of the data lie within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD.
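
For a normal distribution, the share of data within k standard deviations of the mean can be computed directly from the normal CDF. A minimal sketch using `pnorm()`; the variable names are illustrative:

```r
k <- 1:3                         # number of standard deviations from the mean
coverage <- pnorm(k) - pnorm(-k) # area under the normal curve between -k and +k
round(coverage, 4)               # 0.6827 0.9545 0.9973
```

These shares hold for normally distributed data only; skewed or heavy-tailed variables can deviate from them considerably.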

Hypothesis Testing

Chi-squared Test

# The chi-squared (χ2) test analyzes whether there is a relationship between two categorical variables, such as gender, age range, neighborhood, or ethnicity.

# Are men more conservative than women?
chisq.test(databr$petista, databr$homem) # .91 n.s.
chisq.test(databr$tucano, databr$homem) # .73 n.s.

chisq.test(databr$ideology, databr$homem)

chisqgender<-chisq.test(databr$ideology, databr$homem)
format(chisqgender$p.value, scientific=FALSE) # Men are more conservative in Brazil.

# The test is based on the simple idea of comparing the frequencies observed in each category with the frequencies expected in those categories by chance.
table(databr$homem, databr$ideology)
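
To make the observed-versus-expected comparison concrete, the sketch below computes the expected counts and the chi-squared statistic by hand and compares them with `chisq.test()`. The counts are made up for illustration and are not taken from `databr`; `correct = FALSE` turns off the continuity correction that R applies to 2x2 tables by default, so that the two results match.

```r
# Illustrative counts only (not from databr)
observed <- matrix(c(30, 20,
                     20, 30),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(gender = c("women", "men"),
                                   ideology = c("left", "right")))

# Expected count per cell under independence: row total * column total / N
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
chi_sq <- sum((observed - expected)^2 / expected)

test <- chisq.test(observed, correct = FALSE)
all.equal(chi_sq, unname(test$statistic))                    # TRUE
all.equal(expected, test$expected, check.attributes = FALSE) # TRUE
```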

T-TEST: Means difference

The t-test analyzes whether there is a significant difference between the means of two groups/samples.

?t.test

t.test(group1$varX, group2$varX) # Variables must be numeric

# t-tests can answer three types of research questions:
# Single-sample t-tests: Is this sample from this particular population?

# Related-samples t-tests: Was there a change between two time points?
## Repeated-measures design: the same people are in both samples.

# Independent-samples t-tests: Are these two samples from the same population?
## Used when researchers want to compare two different sets of participants.

# e.g., Do Petistas and Anti-petistas have different views on abortion?
# e.g., Do people who read Carta Capital have more political knowledge than people who read Tititi?
# e.g., Are people who saw a negative political ad less likely to vote for the candidate than people who didn’t?
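
The three designs map onto different arguments of `t.test()`. The sketch below uses simulated data; the vectors `before`, `after`, and `other` are hypothetical and only stand in for real measurements.

```r
set.seed(42)                                     # make the simulation reproducible
before <- rnorm(30, mean = 4.0, sd = 1)          # first measurement, 30 people
after  <- before + rnorm(30, mean = 0.5, sd = 1) # same people, second measurement
other  <- rnorm(30, mean = 4.5, sd = 1)          # a different set of 30 people

one_sample  <- t.test(before, mu = 4)                   # single sample vs. a known population mean
paired      <- t.test(after, before, paired = TRUE)     # related samples (repeated measures)
independent <- t.test(before, other)                    # independent samples

independent$method # "Welch Two Sample t-test"
```

By default `t.test()` runs Welch's version of the independent-samples test, which does not assume equal variances; passing `var.equal = TRUE` gives the classic pooled-variance test.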

hs <- subset(databr, homem == 1) # Men
ms <- subset(databr, homem == 0) # Women

t.test(ms$ideology, hs$ideology) # Significant

ind.t.test<-t.test(ideology ~ homem, databr)
# We can also compute Pearson's r as an effect size in R.
# The value of t is stored in the test object as statistic[[1]],
# and the degrees of freedom are stored as parameter[[1]].

# Pearson's r ranges from -1 (a perfect negative correlation) to +1 (a perfect positive correlation).
# According to Cohen (1988, 1992), the effect size is SMALL if r is around 0.1,
# MEDIUM if r is around 0.3, and LARGE if r is 0.5 or more.

t_value <- ind.t.test$statistic[[1]] # t statistic
dof <- ind.t.test$parameter[[1]] # degrees of freedom
r <- sqrt(t_value^2 / (t_value^2 + dof)) # Rosenthal, 1991; Rosnow & Rosenthal, 2005
round(r, 3) # small effect size

# A t-test looks at the t-statistic, the t-distribution values, and the degrees of freedom to determine the significance. 

# To conduct a test with three or more means, one must use an analysis of variance.

TUKEY TEST & ANOVA - Variance Analysis

In an analysis of variance, the null hypothesis is that there is no difference between the group means. Instead of just two groups, we compare several groups at the same time.

What if your research question involves more than two groups?

Does listening to rock, country, or classical music increase studying efficiency?
Do independents differ from left- and right-wing partisans, or are they simply partisans in disguise?

You could systematically compare all possible combinations of these groups using independent-samples t-tests. However, each test keeps its own error rate, so conducting multiple tests inflates the familywise (experiment-wise) Type I error rate.
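
A rough sketch of how quickly the chance of at least one false positive grows with the number of pairwise comparisons. It assumes the tests are independent, which pairwise t-tests on shared groups are not exactly, so the figures are approximations:

```r
alpha   <- 0.05                     # Type I error rate of each individual test
groups  <- 3:5                      # number of groups being compared
n_tests <- choose(groups, 2)        # pairwise comparisons needed: 3, 6, 10

# Probability of at least one false positive across all tests
familywise <- 1 - (1 - alpha)^n_tests
round(familywise, 3)                # 0.143 0.265 0.401
```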

Instead of looking at the difference between pairs of population means, ANOVA (analysis of variance) compares the variance between the group means with the variance within the groups. H0: m_1 = m_2 = … = m_n

Calculating an ANOVA is different from calculating a t-test: a t-test works with a difference between means, an ANOVA with variation.
F-tests are always one-tailed, because variances can never be less than zero.

databr$partidario2<-as.factor(databr$partidario2) # Transforming into a categorical variable
databr$idpart<-factor(databr$idpart, levels=c(-2, -1, 0, 1, 2), 
                      labels=c("Petista Forte", "Petista Moderado", 
                               "Indiferente","Tucano Moderado", "Tucano Forte"))

# ONE-WAY ANOVA
a1 <- aov(ideology ~ partidario2, data = databr); summary(a1) # Significant
a2 <- aov(ideology ~ idpart, data = databr); summary(a2) # Significant

# Sum Sq: the sum of squares. The factor row gives the variation explained by the model (between groups);
# the Residuals row gives the variation due to extraneous factors (within groups).
# Mean Sq: each sum of squares divided by its degrees of freedom.
# F value: the ratio of the variation explained by the model to the variation explained by unsystematic factors.
# In other words, it is the ratio of how good the model is to how bad it is (how much error there is).

# F-Value: Mean Sq Model / Mean Sq Residuals
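
The same ratio can be reproduced by hand from the ANOVA table. Since `databr` is not available here, this sketch uses `PlantGrowth`, a dataset that ships with R (plant weight under a control and two treatment groups); the object names are illustrative.

```r
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]                  # ANOVA table: row 1 = group (model), row 2 = Residuals

mean_sq  <- tab[["Sum Sq"]] / tab[["Df"]] # Mean Sq = Sum Sq / degrees of freedom
f_manual <- mean_sq[1] / mean_sq[2]       # F = Mean Sq Model / Mean Sq Residuals

all.equal(f_manual, tab[["F value"]][1])  # TRUE
```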

# Compute Tukey Honest Significant Differences (TukeyHSD)
# Create a set of confidence intervals on the differences between the means of the levels of a factor.

# First aov() then TukeyHSD()
posthoc1 <- TukeyHSD(x=a1, 'partidario2', conf.level=0.95) # 3 Groups
posthoc2 <- TukeyHSD(x=a2, 'idpart', conf.level=0.95) # 5 Groups 
plot(posthoc1)
plot(posthoc2)

# 2-WAY ANOVA
a1_2way<-aov(ideology ~ partidario2 + homem, data = databr); summary(a1_2way)
a2_2way<-aov(ideology ~ idpart + homem, data = databr); summary(a2_2way)

# Note on TukeyHSD(): its intervals apply exactly only to balanced designs, where the same number of observations is made at each level of the factor.
# TukeyHSD() incorporates an adjustment for sample size that produces sensible intervals for mildly unbalanced designs.

Robert Vidigal, PhD

FURTHER READING