library(tidyverse)

Descriptive Statistics

FUNCTION: create_and_describe_data

1. Custom Function to Create a Data Frame from Raw Values and Describe the Data

This function accepts a vector of raw data (established in step 2), converts it into a one-column data frame (the column starts out as "Data" and is renamed to variable_name), and then calculates and prints the primary descriptive statistics.

create_and_describe_data <- function(raw_data_vector, variable_name = "Value") {
  
  # 1. Store the raw data vector as a one-column data frame
  data_frame <- data.frame(Data = raw_data_vector)
  
  # 2. Rename that column to the name passed in as variable_name
  names(data_frame) <- variable_name
  
  # 3. Calculate Descriptive Statistics for our Raw Values
  
  # Measures of Central Tendency (mean and median)
  data_mean <- mean(data_frame[[variable_name]], na.rm = TRUE)
  data_median <- median(data_frame[[variable_name]], na.rm = TRUE)
  
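  # Geometric Mean (defined only when every non-missing value is positive)
  # na.rm = TRUE keeps any() from returning NA, which would make if() fail
  if (any(data_frame[[variable_name]] <= 0, na.rm = TRUE)) {
    geometric_mean <- NA
    gm_warning <- "Geometric Mean cannot be calculated due to zero or negative values."
  } else {
    # Calculate Geometric Mean: exp(mean(log(x))), ignoring missing values
    geometric_mean <- exp(mean(log(data_frame[[variable_name]]), na.rm = TRUE))
    gm_warning <- "Geometric Mean calculated successfully."
  }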
  
  # Measures of Variability
  data_sd <- sd(data_frame[[variable_name]], na.rm = TRUE)
  data_variance <- var(data_frame[[variable_name]], na.rm = TRUE)
  data_range <- range(data_frame[[variable_name]], na.rm = TRUE)
  data_iqr <- IQR(data_frame[[variable_name]], na.rm = TRUE)
  
  # Summary and Sample Size (n counts non-missing values, matching na.rm = TRUE above)
  data_summary <- summary(data_frame[[variable_name]])
  sample_size <- sum(!is.na(data_frame[[variable_name]]))
  
  # 4. Print the results
  cat("--- Descriptive Statistics Summary --- \n\n")
  cat(paste("Variable Name:", variable_name, "\n"))
  cat(paste("Sample Size (n):", sample_size, "\n\n"))
  
  cat("Measures of Central Tendency: \n")
  cat(paste("  Mean:", round(data_mean, 2), "\n"))
  cat(paste("  Median:", round(data_median, 2), "\n\n"))
  cat(paste("  Geometric Mean:", ifelse(is.na(geometric_mean), "N/A", round(geometric_mean, 2)), "\n"))
  cat(paste("  *Note:", gm_warning, "\n\n"))
  
  cat("Measures of Variability: \n")
  cat(paste("  Standard Deviation (SD):", round(data_sd, 2), "\n"))
  cat(paste("  Variance:", round(data_variance, 2), "\n"))
  cat(paste("  Range (Min - Max):", data_range[1], "to", data_range[2], "\n"))
  cat(paste("  Interquartile Range (IQR):", round(data_iqr, 2), "\n\n"))
  
  cat("Five Number Summary (from summary() function): \n")
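  print(data_summary)
  
  # Return the data frame invisibly so the result can be assigned (e.g., to example_df)
  invisible(data_frame)
}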

2. Inputting the Raw Values and Running the Statistics

Once the create_and_describe_data function is saved, this is all you need to run (it will also go in the function sheet).

Use the c() function to input your raw data, and then call your custom function. Example:

# raw values go in here (e.g., replace example with your variable)
example_data <- c(85, 92, 105, 88, 95, 110, 80, 98, 102, 90, 115, 87)

# this will run the function create_and_describe_data
example_df <- create_and_describe_data(
  raw_data_vector = example_data, 
  variable_name = "example_unit" # Give your variable a good name
)
## --- Descriptive Statistics Summary --- 
## 
## Variable Name: example_unit 
## Sample Size (n): 12 
## 
## Measures of Central Tendency: 
##   Mean: 95.58 
##   Median: 93.5 
## 
##   Geometric Mean: 95.05 
##   *Note: Geometric Mean calculated successfully. 
## 
## Measures of Variability: 
##   Standard Deviation (SD): 10.66 
##   Variance: 113.72 
##   Range (Min - Max): 80 to 115 
##   Interquartile Range (IQR): 15 
## 
## Five Number Summary (from summary() function): 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   80.00   87.75   93.50   95.58  102.75  115.00
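
The geometric mean line in the function uses the log form exp(mean(log(x))). As a quick check, the sketch below compares it with the textbook definition (the n-th root of the product of the values) on the same example values. The names gm_log and gm_root are made up for this illustration.

```r
# Same example values as above, repeated so this block runs on its own
example_data <- c(85, 92, 105, 88, 95, 110, 80, 98, 102, 90, 115, 87)

# Log form used inside create_and_describe_data()
gm_log <- exp(mean(log(example_data)))

# Textbook definition: n-th root of the product of the values
gm_root <- prod(example_data)^(1 / length(example_data))

# The two forms agree up to floating-point error
all.equal(gm_log, gm_root)

# For positive data, the geometric mean never exceeds the arithmetic mean
gm_log <= mean(example_data)
```

The log form is usually the safer choice because the product of many large values can overflow, while the mean of the logs stays small.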

Key Functions Used (Recap):

| Function | Purpose in Custom Function | Biostatistics Context |
|---|---|---|
| c() | Concatenates (combines) your raw input numbers into a vector. | Used to define the raw data set (the sample). |
| data.frame() | Converts the vector into a formal data frame object. | Essential for consistency, especially when passing data to functions such as ggplot() (from ggplot2) or t.test(). |
| mean(), median(), sd(), var() | Calculate the sample statistics. | The foundation of descriptive biostatistics. |
| range(), IQR() | Calculate simple and robust measures of data spread. | Used for assessing variability and checking for outliers. |
| summary() | Provides the five-number summary plus the mean (Min, Q1, Median, Mean, Q3, Max). | Quick check for central tendency, spread, and symmetry/skewness. |
| length() | Gives the number of elements in the vector, which is the sample size (\(n\)) when there are no missing values. | Necessary for all inferential statistics (e.g., standard error calculations). |
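
To make the last two rows of the recap concrete, here is a minimal sketch of a standard error calculation and an outlier check on the example values. The 1.5 × IQR fence rule is an assumption brought in for illustration (it is a common convention, not something defined in these notes), and the variable names are made up.

```r
# Same example values as in step 2, repeated so this block runs on its own
example_data <- c(85, 92, 105, 88, 95, 110, 80, 98, 102, 90, 115, 87)

# Standard error of the mean: SD divided by the square root of n
n <- length(example_data)
se <- sd(example_data) / sqrt(n)

# Quartiles and IQR (names = FALSE drops the "25%" / "75%" labels)
q1 <- quantile(example_data, 0.25, names = FALSE)
q3 <- quantile(example_data, 0.75, names = FALSE)
iqr <- IQR(example_data)

# Assumed convention: flag values more than 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- example_data[example_data < lower_fence | example_data > upper_fence]

se
outliers  # empty here: every value lies inside the fences
```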

Probability Distributions

Prefix explanation for all distribution functions (p and q are the most important)

  • Function Type | Description | Example (for Normal Distribution)
  • p…() | Cumulative distribution function (probability P(X ≤ x)). | pnorm(q=1.96, mean=0, sd=1)
    • The output of any function starting with p is an area under the curve.
    • On a graph: you supply the x coordinate, the mean (the center of the distribution), and the standard deviation; the output is the area under the curve to the left of x.
    • lower.tail = FALSE gives the area to the right of x instead (the right shaded region).
  • q…() | Quantile function (inverse of p…(); finds the value x for a given probability). | qnorm(p=0.975, mean=0, sd=1)
    • The output of any function starting with q is the x value corresponding to the area under the curve (probability/proportion) given as the first argument.
    • On a graph: you supply the cumulative probability, the mean, and the standard deviation; the output is the x coordinate that has that much area to its left.
    • lower.tail = FALSE gives the x coordinate that has the supplied area to its right.
  • d…() | Density function (height of the probability distribution at x). | dnorm(x=1.96, mean=0, sd=1)
    • Most likely needed only for discrete distributions, where it gives the probability of exactly x successes. For continuous distributions the height is not a probability.
  • r…() | Random number generation (often used for simulation/bootstrapping). | rnorm(n=100, mean=10, sd=2)
    • Most likely won't be needed.
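
The four prefixes can be tried side by side. The sketch below uses only base R functions; the variable names and the seed value are arbitrary choices for illustration.

```r
# p: area to the left of x = 1.96 under the standard normal curve (about 0.975)
left_area <- pnorm(q = 1.96, mean = 0, sd = 1)

# lower.tail = FALSE: area to the right of the same x (about 0.025)
right_area <- pnorm(q = 1.96, mean = 0, sd = 1, lower.tail = FALSE)

# q undoes p: feeding the left area back in recovers x = 1.96
x_back <- qnorm(p = left_area, mean = 0, sd = 1)

# q with lower.tail = FALSE: the x with 2.5% of the area to its right,
# which is the same point as qnorm(0.975)
x_upper <- qnorm(p = 0.025, mean = 0, sd = 1, lower.tail = FALSE)

# d on a discrete distribution: probability of exactly 2 successes in 10 trials
exact_2 <- dbinom(x = 2, size = 10, prob = 0.2)

# The same number as the difference of two cumulative probabilities
exact_2_check <- pbinom(q = 2, size = 10, prob = 0.2) -
  pbinom(q = 1, size = 10, prob = 0.2)

# r: 100 random draws; set.seed() makes the draws reproducible
set.seed(1)
draws <- rnorm(n = 100, mean = 10, sd = 2)
```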

All distribution functions we will need for this midterm in R

Essential R Functions for Common Probability Distributions
| Function | Purpose | Inputs | Output | Qualitative Output | Example |
|---|---|---|---|---|---|
| Binomial | | | | | |
| dbinom() | Exact probability of \(x\) successes. | x = number of successes (output will be the exact probability); size = number of trials (\(n\)); prob = probability of success (\(p\)) | A single probability (0 to 1). | The height of the probability mass function at \(x\). | dbinom(x=2, size=10, prob=0.2) |
| pbinom() | Cumulative probability \(P(X \le x)\). | q = number of successes (output will be the cumulative probability); size = number of trials (\(n\)); prob = probability of success (\(p\)) | A single cumulative probability (0 to 1). | The total mass of the distribution up to and including \(x\). | pbinom(q=2, size=10, prob=0.2) |
| qbinom() | Quantile: finds the number of successes for a given cumulative probability. | p = cumulative probability; size = number of trials (\(n\)); prob = probability of success (\(p\)) | An integer number of successes (the quantile). | The smallest number of successes \(x\) for which \(P(X \le x)\) reaches the cumulative probability supplied as p. | qbinom(p=0.95, size=20, prob=0.1) |
| Poisson | | | | | |
| dpois() | Exact probability of \(x\) events. | x = number of events; lambda (\(\lambda\)) = average rate of occurrence | A single probability (0 to 1). | The height of the probability mass function at \(x\). | dpois(x=3, lambda=4.5) |
| ppois() | Cumulative probability \(P(X \le x)\). | q = number of events (quantile); lambda (\(\lambda\)) = average rate of occurrence | A single cumulative probability (0 to 1). | The total mass of the distribution up to and including \(x\). | ppois(q=3, lambda=4.5) |
| qpois() | Quantile: finds the number of events for a given cumulative probability. | p = cumulative probability; lambda (\(\lambda\)) = average rate of occurrence | An integer number of events (the quantile). | The smallest number of events \(x\) for which \(P(X \le x)\) reaches the cumulative probability supplied as p. | qpois(p=0.90, lambda=3.5) |
| Normal | | | | | |
| pnorm() | Cumulative probability \(P(X \le q)\). | q = specified value; mean (\(\mu\)); sd (\(\sigma\)) | A single cumulative probability (0 to 1). | The lower-tail \(p\)-value for a test statistic \(q\); add lower.tail = FALSE for the upper tail. | pnorm(q=1.96, mean=0, sd=1) |
| qnorm() | Quantile/critical value (\(z\)) for a given \(p\). | p = cumulative probability; mean (\(\mu\)); sd (\(\sigma\)) | A single \(z\)-score (critical value). | The critical \(Z\)-score needed to define a confidence interval or rejection region. | qnorm(p=0.975, mean=0, sd=1) |
| t-Distribution | | | | | |
| pt() | Cumulative probability \(P(T \le t)\). | q = specified t-value; df (\(n-1\)) = degrees of freedom | A single cumulative probability (0 to 1). | The lower-tail \(p\)-value for a \(t\)-test statistic \(q\); add lower.tail = FALSE for the upper tail. | pt(q=2.04, df=29) |
| qt() | Quantile/critical value (\(t\)) for a given \(p\). | p = cumulative probability; df (\(n-1\)) = degrees of freedom | A single \(t\)-value (critical value). | The critical \(T\)-score needed to define a confidence interval or rejection region. | qt(p=0.975, df=29) |
| Chi-Squared (\(\chi^2\)) | | | | | |
| pchisq() | Cumulative probability \(P(\chi^2 \le x)\). | q = specified \(\chi^2\) value; df = degrees of freedom | A single cumulative probability (0 to 1). | The area below the test statistic \(q\). The \(p\)-value for a goodness-of-fit or independence test is the upper tail, so add lower.tail = FALSE (or subtract the result from 1). | pchisq(q=3.84, df=1) |
| qchisq() | Quantile/critical value (\(\chi^2\)) for a given \(p\). | p = cumulative probability; df = degrees of freedom | A single \(\chi^2\) value (critical value). | The critical \(\chi^2\) threshold for a specified significance level \(\alpha\) (use p = \(1-\alpha\)). | qchisq(p=0.95, df=1) |
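
A short sketch of how a few of these functions are typically combined, using the example calls from the table. The two-sided doubling for the t statistic and the confidence interval formula (mean ± critical value × standard error) are standard conventions assumed here rather than taken from the table, and the variable names are made up.

```r
# Chi-squared test p-value: upper tail beyond the statistic (close to 0.05 here)
p_chisq <- pchisq(q = 3.84, df = 1, lower.tail = FALSE)

# Two-sided t-test p-value: double the upper-tail area (also close to 0.05)
p_t <- 2 * pt(q = abs(2.04), df = 29, lower.tail = FALSE)

# Discrete quantile: smallest count k whose cumulative probability reaches 0.95
k <- qbinom(p = 0.95, size = 20, prob = 0.1)
reaches_95 <- pbinom(q = k, size = 20, prob = 0.1) >= 0.95
falls_short <- pbinom(q = k - 1, size = 20, prob = 0.1) < 0.95

# 95% confidence interval for a mean, built from the t critical value
example_data <- c(85, 92, 105, 88, 95, 110, 80, 98, 102, 90, 115, 87)
n <- length(example_data)
se <- sd(example_data) / sqrt(n)
t_crit <- qt(p = 0.975, df = n - 1)
ci <- mean(example_data) + c(-1, 1) * t_crit * se

# Cross-check against the interval reported by t.test()
ci_check <- as.numeric(t.test(example_data)$conf.int)
all.equal(ci, ci_check)
```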