library(tidyverse)
This function accepts a vector of raw data (established in step 2), stores it in a one-column data frame whose column takes the name given in variable_name, and then calculates and prints the primary descriptive statistics.
create_and_describe_data <- function(raw_data_vector, variable_name = "Value") {
  # 1. Store the raw data vector as a one-column data frame
  data_frame <- data.frame(Data = raw_data_vector)
  # 2. Rename that column using the variable_name argument
  names(data_frame) <- variable_name
  x <- data_frame[[variable_name]]  # shorthand for the values
  # 3. Calculate descriptive statistics for the raw values
  # Measures of central tendency (mean and median)
  data_mean <- mean(x, na.rm = TRUE)
  data_median <- median(x, na.rm = TRUE)
  # Geometric mean: only defined when every value is positive
  if (any(x <= 0, na.rm = TRUE)) {
    geometric_mean <- NA
    gm_warning <- "Geometric Mean cannot be calculated due to zero or negative values."
  } else {
    # Calculate Geometric Mean: exp(mean(log(x)))
    geometric_mean <- exp(mean(log(x), na.rm = TRUE))
    gm_warning <- "Geometric Mean calculated successfully."
  }
  # Measures of variability
  data_sd <- sd(x, na.rm = TRUE)
  data_variance <- var(x, na.rm = TRUE)
  data_range <- range(x, na.rm = TRUE)
  data_iqr <- IQR(x, na.rm = TRUE)
  # Summary and sample size (non-missing values only)
  data_summary <- summary(x)
  sample_size <- sum(!is.na(x))
  # 4. Print the results
  cat("--- Descriptive Statistics Summary --- \n\n")
  cat(paste("Variable Name:", variable_name, "\n"))
  cat(paste("Sample Size (n):", sample_size, "\n\n"))
  cat("Measures of Central Tendency: \n")
  cat(paste(" Mean:", round(data_mean, 2), "\n"))
  cat(paste(" Median:", round(data_median, 2), "\n\n"))
  cat(paste(" Geometric Mean:", ifelse(is.na(geometric_mean), "N/A", round(geometric_mean, 2)), "\n"))
  cat(paste(" *Note:", gm_warning, "\n\n"))
  cat("Measures of Variability: \n")
  cat(paste(" Standard Deviation (SD):", round(data_sd, 2), "\n"))
  cat(paste(" Variance:", round(data_variance, 2), "\n"))
  cat(paste(" Range (Min - Max):", data_range[1], "to", data_range[2], "\n"))
  cat(paste(" Interquartile Range (IQR):", round(data_iqr, 2), "\n\n"))
  cat("Five Number Summary (from summary() function): \n")
  print(data_summary)
  # Return the data frame invisibly so the result can be assigned and reused
  invisible(data_frame)
}
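The geometric-mean step in the function uses the log form exp(mean(log(x))). The sketch below checks that form against the direct definition (the n-th root of the product) and shows why values at or below zero are screened out first. The vector `x` holds illustrative values only, and the names `gm_log`, `gm_product`, `zero_case`, and `negative_case` are hypothetical, not part of the function above.

```r
# Illustrative values only; not data from the worked example
x <- c(2, 8, 4)

gm_log <- exp(mean(log(x)))              # log form, as used in the function
gm_product <- prod(x)^(1 / length(x))    # direct definition: n-th root of the product

print(c(log_form = gm_log, product_form = gm_product))
stopifnot(isTRUE(all.equal(gm_log, gm_product)))

# Why values <= 0 are screened out before taking logs:
zero_case <- exp(mean(log(c(0, 5))))         # log(0) is -Inf, so the result collapses to 0
negative_case <- suppressWarnings(log(-1))   # log of a negative number is NaN
print(zero_case)
print(negative_case)
```

The log form is usually preferred for long vectors because the running product can overflow or underflow, while the mean of the logs stays on a manageable scale.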
Use the c() function to enter your raw data, then call the custom function. Example:
# Enter your raw values here (replace the example values with your own)
example_data <- c(85, 92, 105, 88, 95, 110, 80, 98, 102, 90, 115, 87)
# Run the function create_and_describe_data on the example data
example_df <- create_and_describe_data(
  raw_data_vector = example_data,
  variable_name = "example_unit" # Give your variable a good name
)
## --- Descriptive Statistics Summary ---
##
## Variable Name: example_unit
## Sample Size (n): 12
##
## Measures of Central Tendency:
## Mean: 95.58
## Median: 93.5
##
## Geometric Mean: 95.05
## *Note: Geometric Mean calculated successfully.
##
## Measures of Variability:
## Standard Deviation (SD): 10.66
## Variance: 113.72
## Range (Min - Max): 80 to 115
## Interquartile Range (IQR): 15
##
## Five Number Summary (from summary() function):
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   80.00   87.75   93.50   95.58  102.75  115.00
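The quartiles and IQR printed above can be reproduced by hand. This is a minimal sketch, assuming R's default quantile rule (type 7), which summary() and IQR() use unless another type is requested. The names `q` and `iqr_by_hand` are illustrative.

```r
example_data <- c(85, 92, 105, 88, 95, 110, 80, 98, 102, 90, 115, 87)

# Quartiles with the default interpolation rule (type 7)
q <- quantile(example_data, probs = c(0.25, 0.5, 0.75), type = 7)
print(q)

# The IQR is the distance between the third and first quartiles
iqr_by_hand <- unname(q["75%"] - q["25%"])
print(iqr_by_hand)

# Cross-check against the built-in function used inside create_and_describe_data
stopifnot(isTRUE(all.equal(iqr_by_hand, IQR(example_data))))

# The reported range is simply the minimum and maximum
print(range(example_data))
```

Other quantile types can give slightly different quartiles for small samples, so it helps to state which rule was used when reporting an IQR.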
| Function | Purpose | Use |
|---|---|---|
| c() | Concatenates (combines) your raw input numbers into a vector. | Used to define the raw data set (the sample). |
| data.frame() | Converts the vector into a formal data frame object. | Keeps data handling consistent, especially when passing data to ggplot2 or to functions such as t.test(). |
| mean(), median(), sd(), var() | Calculate the sample statistics. | The foundation of descriptive biostatistics. |
| range(), IQR() | Calculate simple and robust measures of data spread. | Used for assessing variability and checking for outliers. |
| summary() | Provides the five-number summary (Min, Q1, Median, Q3, Max) plus the mean. | Quick check for central tendency, spread, and symmetry/skewness. |
| length() | Gives the number of elements in the vector, which is the sample size (\(n\)) when there are no missing values. | Necessary for inferential statistics (e.g., standard error calculations). |

| Function | Purpose | Inputs | Output | Interpretation | Example |
|---|---|---|---|---|---|
| Binomial | | | | | |
| dbinom() | Exact probability of \(x\) successes, \(P(X = x)\). | x = number of successes; size = number of trials (\(n\)); prob = probability of success (\(p\)) | A single probability (0 to 1). | The height of the probability mass function at \(x\). | dbinom(x=2, size=10, prob=0.2) |
| pbinom() | Cumulative probability \(P(X \le q)\). | q = number of successes; size = number of trials (\(n\)); prob = probability of success (\(p\)) | A single cumulative probability (0 to 1). | The total probability mass up to and including \(q\). | pbinom(q=2, size=10, prob=0.2) |
| qbinom() | Quantile: finds the number of successes for a given cumulative probability. | p = cumulative probability; size = number of trials (\(n\)); prob = probability of success | An integer number of successes (the quantile). | The smallest number of successes \(x\) for which \(P(X \le x)\) is at least the cumulative probability p. | qbinom(p=0.95, size=20, prob=0.1) |
| Poisson | | | | | |
| dpois() | Exact probability of \(x\) events, \(P(X = x)\). | x = number of events; lambda (\(\lambda\)) = average rate of occurrence | A single probability (0 to 1). | The height of the probability mass function at \(x\). | dpois(x=3, lambda=4.5) |
| ppois() | Cumulative probability \(P(X \le q)\). | q = number of events; lambda (\(\lambda\)) = average rate of occurrence | A single cumulative probability (0 to 1). | The total probability mass up to and including \(q\). | ppois(q=3, lambda=4.5) |
| qpois() | Quantile: finds the number of events for a given cumulative probability. | p = cumulative probability; lambda (\(\lambda\)) = average rate of occurrence | An integer number of events (the quantile). | The smallest number of events \(x\) for which \(P(X \le x)\) is at least the cumulative probability p. | qpois(p=0.90, lambda=3.5) |
| Normal | | | | | |
| pnorm() | Cumulative probability \(P(X \le q)\). | q = specified value; mean (\(\mu\)); sd (\(\sigma\)) | A single cumulative probability (0 to 1). | The area to the left of \(q\). This is the one-sided \(p\)-value for a left-tailed test; an upper-tailed test uses 1 - pnorm(q). | pnorm(q=1.96, mean=0, sd=1) |
| qnorm() | Quantile/critical value for a given cumulative probability. | p = cumulative probability; mean (\(\mu\)); sd (\(\sigma\)) | A single value on the scale of the distribution (a \(z\)-score when mean=0 and sd=1). | The critical \(z\)-score needed to define a confidence interval or rejection region. | qnorm(p=0.975, mean=0, sd=1) |
| t-Distribution | | | | | |
| pt() | Cumulative probability \(P(T \le q)\). | q = specified \(t\)-value; df = degrees of freedom (\(n-1\) for a one-sample test) | A single cumulative probability (0 to 1). | The area to the left of \(q\). This is the one-sided \(p\)-value for a left-tailed \(t\)-test; an upper-tailed test uses 1 - pt(q, df). | pt(q=2.04, df=29) |
| qt() | Quantile/critical value (\(t\)) for a given cumulative probability. | p = cumulative probability; df = degrees of freedom (\(n-1\) for a one-sample test) | A single \(t\)-value (critical value). | The critical \(t\)-value needed to define a confidence interval or rejection region. | qt(p=0.975, df=29) |
| Chi-Squared (\(\chi^2\)) | | | | | |
| pchisq() | Cumulative probability \(P(\chi^2 \le q)\). | q = specified \(\chi^2\) value; df = degrees of freedom | A single cumulative probability (0 to 1). | The area to the left of \(q\). The \(p\)-value for a goodness-of-fit or independence test is the upper tail, 1 - pchisq(q, df). | pchisq(q=3.84, df=1) |
| qchisq() | Quantile/critical value (\(\chi^2\)) for a given cumulative probability. | p = cumulative probability; df = degrees of freedom | A single \(\chi^2\) value (critical value). | The critical \(\chi^2\) threshold for significance level \(\alpha\), obtained with p = \(1 - \alpha\). | qchisq(p=0.95, df=1) |
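
The d, p, and q functions in the table are related to one another, and a short check can make those relationships concrete. The sketch below uses only base R functions and the example arguments from the table. The variable names (`cumulative`, `summed`, `z_crit`, `p_upper`, `p_two_sided`) are illustrative, and the two-sided calculation in step 4 is an added illustration rather than something the table itself specifies.

```r
# 1. p*() is the running total of d*(): P(X <= 2) for Binomial(size = 10, prob = 0.2)
cumulative <- pbinom(q = 2, size = 10, prob = 0.2)
summed <- sum(dbinom(x = 0:2, size = 10, prob = 0.2))
stopifnot(isTRUE(all.equal(cumulative, summed)))

# 2. q*() inverts p*() for continuous distributions
z_crit <- qnorm(p = 0.975, mean = 0, sd = 1)   # roughly 1.96
stopifnot(isTRUE(all.equal(pnorm(z_crit), 0.975)))

# 3. Upper-tail p-value for a chi-squared statistic
p_upper <- pchisq(q = 3.84, df = 1, lower.tail = FALSE)
stopifnot(isTRUE(all.equal(p_upper, 1 - pchisq(q = 3.84, df = 1))))

# 4. Two-sided p-value for a t statistic (area in both tails)
t_stat <- 2.04
p_two_sided <- 2 * pt(q = -abs(t_stat), df = 29)

print(c(z_crit = z_crit, p_upper = p_upper, p_two_sided = p_two_sided))
```

Using lower.tail = FALSE rather than subtracting from 1 tends to be more accurate for very small tail probabilities, since it avoids rounding error in the subtraction.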