Introduction

This vignette demonstrates how to use functions from the ssstats package to carry out common statistical tasks in R. To follow along, load the required packages:

library(ssstats)
library(palmerpenguins)

Summary and Distribution of Continuous Variables

When working with continuous data, it is important to examine measures of central tendency (mean, median), spread (standard deviation, IQR), the shape of the distribution, and the correlation between variables. We will use the penguins dataset from the palmerpenguins package.

mean_median(penguins, bill_length_mm, bill_depth_mm, flipper_length_mm)
## # A tibble: 3 × 3
##   variable          mean_sd      median_iqr  
##   <chr>             <chr>        <chr>       
## 1 bill_depth_mm     17.2 (2.0)   17.3 (3.1)  
## 2 bill_length_mm    43.9 (5.5)   44.5 (9.3)  
## 3 flipper_length_mm 200.9 (14.1) 197.0 (23.0)
normality_correlation(penguins, method = "spearman")
## $`Normality Test`
##            Variable Test Statistic p-value
## 1    bill_length_mm        0.97485   1e-05
## 2     bill_depth_mm        0.97258   0e+00
## 3 flipper_length_mm        0.95155   0e+00
## 4       body_mass_g        0.95921   0e+00
## 5              year        0.79228   0e+00
## 
## $`Correlation Matrix`
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm                 1            NA                NA          NA
## bill_depth_mm                 NA             1                NA          NA
## flipper_length_mm             NA            NA                 1          NA
## body_mass_g                   NA            NA                NA           1
## year                          NA            NA                NA          NA
##                   year
## bill_length_mm      NA
## bill_depth_mm       NA
## flipper_length_mm   NA
## body_mass_g         NA
## year                 1
## 
## $`QQ Plots`

The correlation matrix includes NA values due to missing data. To avoid this, remove rows with missing values before applying the normality_correlation() function. (Ideally, numerically coded categorical variables such as year would also be excluded; a sketch for dropping it follows the output below.)

penguins_clean <- na.omit(penguins)
normality_correlation(penguins_clean, method = "spearman")
## $`Normality Test`
##            Variable Test Statistic p-value
## 1    bill_length_mm        0.97434   1e-05
## 2     bill_depth_mm        0.97329   1e-05
## 3 flipper_length_mm        0.95171   0e+00
## 4       body_mass_g        0.95801   0e+00
## 5              year        0.79395   0e+00
## 
## $`Correlation Matrix`
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm           1.00000      -0.21386           0.66975     0.57648
## bill_depth_mm           -0.21386       1.00000          -0.51732    -0.42928
## flipper_length_mm        0.66975      -0.51732           1.00000     0.84039
## body_mass_g              0.57648      -0.42928           0.84039     1.00000
## year                     0.04290      -0.05706           0.15580     0.01845
##                       year
## bill_length_mm     0.04290
## bill_depth_mm     -0.05706
## flipper_length_mm  0.15580
## body_mass_g        0.01845
## year               1.00000
## 
## $`QQ Plots`
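
If you also want to exclude numerically coded categorical variables such as year, a minimal sketch is shown below. It uses dplyr, which the examples above do not load, hence the :: prefix; the output is omitted.

# drop year before testing normality and computing correlations
penguins_numeric <- dplyr::select(penguins_clean, -year)
normality_correlation(penguins_numeric, method = "spearman")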

Counts and Percentages for Categorical Variables

For categorical variables, use n_pct() to generate one-way or two-way frequency and percentage tables. For example, to examine the distribution of penguins by sex:

n_pct(penguins_clean, sex)
##     sex     n (pct)
##  female 165 (49.5%)
##    male 168 (50.5%)

You can also summarize by two variables. For instance, to see the distribution of species across years:

n_pct(penguins_clean, row_var = year, col_var = species)
## # A tibble: 3 × 4
##    year Adelie     Chinstrap  Gentoo    
##   <int> <chr>      <chr>      <chr>     
## 1  2007 44 (30.1%) 26 (38.2%) 33 (27.7%)
## 2  2008 50 (34.2%) 18 (26.5%) 45 (37.8%)
## 3  2009 52 (35.6%) 24 (35.3%) 41 (34.5%)

By default, percentages are column-based. You can also specify type = "row" or type = "all" to change the denominator.
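
For example, assuming the type argument behaves as described above, row-based percentages would be requested with:

n_pct(penguins_clean, row_var = year, col_var = species, type = "row")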

One-Sample Inference for Means and Proportions

Depending on our research goals, we may want to compare an observed mean or proportion against a hypothesized value. The ssstats package provides the one_mean_* and one_prop_* functions to perform hypothesis tests and compute confidence intervals for means and proportions, respectively.

For demonstration purposes, let us test whether the mean bill length is greater than 40 mm (assuming 40 mm is a relevant benchmark for penguin bill length):

one_mean_HT(penguins_clean, continuous = bill_length_mm, mu = 40, alternative = "greater", alpha = 0.05)
## One-sample t-test for the population mean:
## Null: H0: μ = 40
## Alternative: H1: μ > 40
## Test statistic: t(332) = 13.323
## p-value: p < 0.001
## Conclusion: Reject the null hypothesis (p = < 0.001 < α = 0.05)

As shown in the output, we obtain a very small p-value and reject the null hypothesis that \(\mu = 40\) at the \(\alpha = 0.05\) level. Since this is a one-sided test (alternative = "greater"), we conclude that there is sufficient evidence to suggest the true population mean bill length exceeds 40 mm.
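
As a quick cross-check (not part of ssstats), base R's t.test() reproduces the same one-sided test:

t.test(penguins_clean$bill_length_mm, mu = 40, alternative = "greater")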

To compute the corresponding confidence interval and point estimate:

one_mean_CI(data = penguins_clean, continuous = bill_length_mm, confidence = 0.95)
## # A tibble: 1 × 5
##    mean    sd conf_level ci_lower ci_upper
##   <dbl> <dbl> <chr>         <dbl>    <dbl>
## 1  44.0  5.47 95%            43.4     44.6

We can conduct a similar analysis for proportions. For example, let us test the null hypothesis that 50% of the penguins in our research population are of the Adelie species:

one_prop_HT(data = penguins_clean, grouping = species, success = "Adelie", p = 0.5, alternative = "two.sided", alpha = 0.05)
## One-sample z-test for the population proportion:
## Null: H0: π = 0.5
## Alternative: H1: π ≠ 0.5
## Test statistic: z = 2.25
## p-value: p = 0.025
## Conclusion: Reject the null hypothesis (p = 0.0247 < α = 0.05)
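
The statistic appears to match the uncorrected z-test, which you can reproduce with base R's prop.test() (it reports the chi-squared statistic, i.e., z squared), using the 146 successes out of 333 shown in the confidence-interval output below:

prop.test(x = 146, n = 333, p = 0.5, alternative = "two.sided", correct = FALSE)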

To obtain the confidence interval for this proportion:

one_prop_CI(data = penguins_clean, grouping = species, success = "Adelie", confidence = 0.95)
## # A tibble: 1 × 6
##   proportion successes sample_size conf_level ci_lower ci_upper
##        <dbl>     <int>       <int> <chr>         <dbl>    <dbl>
## 1      0.438       146         333 95%           0.386    0.492

Comparing Groups: Independent Samples t-Test

Suppose we want to compare bill length between male and female penguins. First, check the assumption of within-group normality using QQ plots:

independent_qq_plot(penguins_clean, variable = "bill_length_mm", group = "sex")

The QQ plots suggest deviations from normality, but with a large enough sample size, the Central Limit Theorem justifies using a t-test. Please note that this example is for illustration only. In practice, a non-parametric test may be more appropriate.
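
If you prefer the non-parametric route, base R's Wilcoxon rank-sum test is one option:

wilcox.test(bill_length_mm ~ sex, data = penguins_clean)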

Now perform a two-sample t-test assuming equal variance:

independent_mean_HT(
  penguins_clean,
  continuous = "bill_length_mm",
  grouping = "sex",
  alternative = "two.sided",
  alpha = 0.05,
  variance = "equal"
)
## Two-sample t-test for two independent means and equal variance:
## Null: H₀: μ₁ − μ₂ = 0
## Alternative: H₁: μ₁ − μ₂ ≠ 0
## Test statistic: t(331) = -6.667
## p-value: p < 0.001
## Conclusion: Reject the null hypothesis (p = < 0.001 < α = 0.05)

The result suggests a statistically significant difference in bill length between sexes. To estimate the magnitude of this difference, we compute a 95% confidence interval:

independent_mean_CI(
  penguins_clean,
  grouping = "sex",
  continuous = "bill_length_mm",
  confidence = 0.95
)
## The point estimate for the difference in means is x̄₁ − x̄₂ = -3.7578
## The 95% confidence interval for μ₁ − μ₂ is (-4.8666, -2.649)

With 95% confidence, females have bills that are, on average, 2.65 to 4.87 mm shorter than males, which is a more informative conclusion than simply noting a difference!

Comparing Groups: Beyond t-Test

In the previous example, we performed a two-sample t-test to determine whether there was a statistically significant difference between male and female penguins with respect to bill length. However, we did not account for differences in bill length by species. We can fit a regression model to determine whether the same difference in bill length persists after adjusting for species.

model <- lm(bill_length_mm ~ sex + species, data = penguins_clean)

Now, we use the anova_check() function to assess the assumptions of a linear model.

anova_check(model)

There is no clear evidence of either heteroscedasticity or non-normality of residuals; however, there are some points with large residuals (~10). Let us further examine these points.

In linear regression, outliers are points with large residuals, while influential points are those that significantly affect the model’s estimates. We examine residuals to assess model fit and Cook’s distance to identify observations that unduly influence the regression.

We begin with outlier detection. While definitions vary, we define outliers as observations with standardized residuals greater than 2.5 or less than -2.5. The outlier_graph() function returns a ggplot object, allowing further customization with +; in the example below, point shapes are varied by penguin species to aid interpretation.

outlier_count(penguins_clean, model)
## # A tibble: 2 × 2
##   outlier       count
##   <chr>         <int>
## 1 Not Suspected   328
## 2 Suspected         5
outlier_graph(df = penguins_clean, model = model, 
              x_var = bill_depth_mm, y_var = bill_length_mm) +
  ggplot2::aes(shape = species) +
  ggplot2::scale_shape_manual(values = c(16, 17, 15))
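
The 2.5 cutoff can also be checked by hand with base R's rstandard(); the count below should agree with the outlier_count() table above:

# standardized residuals from the fitted model
std_res <- rstandard(model)
sum(abs(std_res) > 2.5)  # number of suspected outliers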

There is no indication of data entry errors (such as negative lengths) or other anomalies that would justify excluding these observations. Therefore, we retain them in the analytic sample and proceed to examine influential points.

As with outliers, the definition of an influential observation varies. In this package, users can choose among three commonly used Cook's distance thresholds for flagging influential points: (1) a rule that flags values exceeding three times the mean Cook's distance; (2) the convention that highlights points with Cook's distance greater than 0.5 or 1; and (3) a rule that flags values above \(4/n\), where \(n\) is the sample size. One of these is applied by default; see the cooks() documentation for which.

cooks(model, show.threshold = TRUE, n_labels_desired = 3)
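
To see how the candidate thresholds compare for this model, they can be computed by hand from base R's cooks.distance():

d <- cooks.distance(model)
c(three_times_mean = 3 * mean(d), four_over_n = 4 / length(d))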

Observations 179, 283, and 296 appear to be influential. While alternative modeling approaches, such as robust regression, quantile regression, or generalized linear models with a gamma distribution, may be appropriate, we proceed under the assumption that retaining these points does not unduly compromise our analysis. We now turn to the regression coefficients to connect our findings with the earlier t-test results.

summary(model)$coef
##                   Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)      36.977019  0.2307447 160.25076 2.727713e-314
## sexmale           3.693907  0.2548031  14.49710  3.597005e-37
## speciesChinstrap 10.009851  0.3412900  29.32946  8.148878e-94
## speciesGentoo     8.697533  0.2871096  30.29342  3.431234e-97

Even after adjusting for species, there is a significant difference in mean bill length between males and females. Specifically, holding species constant, males are expected to have bills that are 3.694 mm longer than females (95% CI: 3.193, 4.195). These results are consistent with the earlier t-test, though the estimated difference is slightly smaller.
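
The reported interval for the sex coefficient can be reproduced with base R's confint():

confint(model)["sexmale", ]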

Dependent t-Test

The penguins dataset does not contain variables suitable for demonstrating dependent t-test procedures using ssstats. To that end, we will generate a synthetic dataset: simulated standardized pre- and post-test scores for 100 individuals. This type of data is commonly referred to as “wide” format, where each row corresponds to an individual and each column to a measurement occasion. The dependent t-test functions in ssstats require data in this format. If your data are currently in “long” format (where each row is a single measurement), consider reshaping with tools from packages such as tidyr; a sketch follows.
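
For instance, if your long data had hypothetical columns id, time (taking values "pre" and "post"), and score, one way to widen it is:

# hypothetical long-format data: one row per measurement occasion
long <- data.frame(id    = rep(1:3, each = 2),
                   time  = rep(c("pre", "post"), times = 3),
                   score = rnorm(6))
# reshape to wide: one row per individual, one column per occasion
wide <- tidyr::pivot_wider(long, id_cols = id,
                           names_from = time, values_from = score)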

set.seed(123)  

# Number of observations
n <- 100

# Mean and standard deviation for pre and post
mu_pre <- 0
mu_post <- 1
sd_pre <- 1
sd_post <- 1

# correlation between pre and post
rho <- 0.6 

# Define variance-covariance matrix for bivariate normal

Sigma <- matrix(c(sd_pre^2, rho * sd_pre * sd_post,
                  rho * sd_pre * sd_post, sd_post^2), 
                nrow = 2)

bivn <- MASS::mvrnorm(n = n, mu = c(mu_pre, mu_post), Sigma = Sigma)
pre <- bivn[, 1]
post <- bivn[, 2]
id <- 1:n

data <- data.frame(id = id, pre = pre, post = post)

Let us begin by examining the distribution of the difference between the post- and pre-test scores. Specifically, we will inspect the QQ plot of the differences to assess the normality assumption required for the paired t-test. We will also compute the mean and median of the difference for a quick descriptive summary.

# Compare means and medians of pre and post scores
dependent_mean_median(data, col1 = pre, col2 = post, accuracy = 3)
## # A tibble: 3 × 3
##   Variable   `Mean (SD)`   `Median (IQR)`
##   <chr>      <chr>         <chr>         
## 1 post – pre 0.904 (0.865) 0.798 (1.135) 
## 2 pre        0.129 (0.943) 0.072 (1.095) 
## 3 post       1.033 (0.905) 1.092 (1.161)
# Examine QQ plot of the difference (post - pre)
dependent_qq_plot(data, var1 = "pre", var2 = "post")

Assuming that all appropriate criteria for a t-test are met, we can proceed with the dependent t-test.

dependent_mean_HT(data = data, col1 = pre, col2 = post, alternative = "two.sided", alpha = 0.05, mu = 0)
## Paired t-test for the mean of differences:
## Null: H₀: μ_d = 0
## Alternative: H₁: μ_d ≠ 0
## Test statistic: t(99) = -10.45
## p-value: p < 0.001
## Conclusion: Reject the null hypothesis (p = < 0.001 < α = 0.05)

We complete this example by examining the 95% confidence interval for this difference.

dependent_mean_CI(data = data, col1 = pre, col2 = post, confidence = 0.95)
## The point estimate for the mean difference is x̄ = -0.9038.
## The point estimate for the standard deviation of differences is s = 0.8649.
## The 95% confidence interval for the mean difference μ_d is (-1.0754, -0.7322).
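
As with the earlier examples, base R's paired t.test() offers a cross-check of these results:

t.test(data$pre, data$post, paired = TRUE, conf.level = 0.95)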