library(tidyverse)

Descriptive Statistics

FUNCTION: create_and_describe_data

1. Custom Function to Create dataframe from raw values and Describe the Data

This function accepts a vector of raw data (established in step 2), converts it into a data frame named ‘Data’, and then calculates the primary descriptive statistics.

create_and_describe_data <- function(raw_data_vector, variable_name = "Value") {
  
  # 1. Taking our raw data vector and calling it data_frame
  data_frame <- data.frame(Data = raw_data_vector)
  
  # 2. Taking the name for our value column and making it the second part of our function
  names(data_frame) <- variable_name
  
  # 3. Calculate Descriptive Statistics for our Raw Values
  
  # Measures of Central Tendency (mean and median)
  data_mean <- mean(data_frame[[variable_name]], na.rm = TRUE)
  data_median <- median(data_frame[[variable_name]], na.rm = TRUE)
  
  # Geometric Mean
  if (any(data_frame[[variable_name]] <= 0)) {
    geometric_mean <- NA
    gm_warning <- "Geometric Mean cannot be calculated due to zero or negative values."
  } else {
    # Calculate Geometric Mean: exp(mean(log(x)))
    geometric_mean <- exp(mean(log(data_frame[[variable_name]])))
    gm_warning <- "Geometric Mean calculated successfully."
  }
  
  # Measures of Variability
  data_sd <- sd(data_frame[[variable_name]], na.rm = TRUE)
  data_variance <- var(data_frame[[variable_name]], na.rm = TRUE)
  data_range <- range(data_frame[[variable_name]], na.rm = TRUE)
  data_iqr <- IQR(data_frame[[variable_name]], na.rm = TRUE)
  
  # Summary and Sample Size
  data_summary <- summary(data_frame[[variable_name]])
  sample_size <- length(data_frame[[variable_name]])
  
  # 4. print
  cat("--- Descriptive Statistics Summary --- \n\n")
  cat(paste("Variable Name:", variable_name, "\n"))
  cat(paste("Sample Size (n):", sample_size, "\n\n"))
  
  cat("Measures of Central Tendency: \n")
  cat(paste("  Mean:", round(data_mean, 2), "\n"))
  cat(paste("  Median:", round(data_median, 2), "\n\n"))
  cat(paste("  Geometric Mean:", ifelse(is.na(geometric_mean), "N/A", round(geometric_mean, 2)), "\n"))
  cat(paste("  *Note:", gm_warning, "\n\n"))
  
  cat("Measures of Variability: \n")
  cat(paste("  Standard Deviation (SD):", round(data_sd, 2), "\n"))
  cat(paste("  Variance:", round(data_variance, 2), "\n"))
  cat(paste("  Range (Min - Max):", data_range[1], "to", data_range[2], "\n"))
  cat(paste("  Interquartile Range (IQR):", round(data_iqr, 2), "\n\n"))
  
  cat("Five Number Summary (from summary() function): \n")
  print(data_summary)
  
}

2. Inputting the raw values to pull then run in statistics : Once you have the create_and_describe_data function saved this is all you need to use - Will Put in Function Sheet

Use the c() function to input your raw data, and then call your custom function. - Example

# raw values go in here (e.g., replace example with your variable)
example_data <- c(85, 92, 105, 88, 95, 110, 80, 98, 102, 90, 115, 87)

# this will run the function create_and_describe_data
example_df <- create_and_describe_data(
  raw_data_vector = example_data, 
  variable_name = "example_unit" # Give your variable a good name
)

## --- Descriptive Statistics Summary --- 
## 
## Variable Name: example_unit 
## Sample Size (n): 12 
## 
## Measures of Central Tendency: 
##   Mean: 95.58 
##   Median: 93.5 
## 
##   Geometric Mean: 95.05 
##   *Note: Geometric Mean calculated successfully. 
## 
## Measures of Variability: 
##   Standard Deviation (SD): 10.66 
##   Variance: 113.72 
##   Range (Min - Max): 80 to 115 
##   Interquartile Range (IQR): 15 
## 
## Five Number Summary (from summary() function): 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   80.00   87.75   93.50   95.58  102.75  115.00

Key Functions Used (Recap):

Function | Purpose in Custom Function | Biostatistics Context |
- c() | Concatenates (combines) your raw input numbers into a vector. | Used to define the raw data set (the sample). |
- data.frame() | Converts the vector into a formal data frame object. | Essential for consistency, especially when passing data to functions like ggplot2 or t.test(). |
- mean(), median(), sd(), var() | Calculate the sample statistics. | The foundation of descriptive biostatistics. |
- range(), IQR() | Calculate simple and robust measures of data spread. | Used for assessing variability and checking for outliers. |
- summary() | Provides the Five Number Summary (Min, Q1, Median, Mean, Q3, Max). | Quick check for central tendency, spread, and symmetry/skewness. |
- length() | Gives the number of elements in the vector, which is the sample size (\(n\)). | Necessary for all inferential statistics (e.g., standard error calculations). |

Probability Distributions

Prefix explanation for all distribution functions - p and q are the most important

Function Type - - - Description - - - Example (for Normal Distribution)
p…(),Cumulative distribution function (probability P(X ≤ x)).
- “pnorm(q=1.96, mean=0, sd=1)”
- the output for any prob function starting with a p will be the area under the curve.
- If you graph the function - the output is the area under the curve while you are giving the x coordinate, the mean that is at the center of the distribution, and the standard deviation of the distribution
- lower.tail = false will give you the right shaded region
q…(),“Quantile function (inverse of p…(), finds the value x for a given probability).”,
- “qnorm(p=0.975, mean=0, sd=1)”
- the output for any prob function starting with a q will be the X value corresponding to the area under the curve (probability/proportion) you input into the first value of the function.
- If you graph the function - the output is the x coordinate given the cumulative probability that you have provided the function, the mean that is at the center of the distribution, and the standard deviation of the distribution
- lower.tail = FALSE will give you the x - coordinate given the area under the curve is to the right of the x - coordinate
d…(),Density function (height of the probability distribution).,“dnorm(x=1.96, mean=0, sd=1)”
- Most likely will only need to use for Discrete probability Functions - will give you the probability of the exact number of successes happening
r…(),Random number generation (often used for simulation/bootstrapping).,“rnorm(n=100, mean=10, sd=2)” - most likely won’t need

All distribution functions we will need for this midterm in r

Essential R Functions for Common Probability Distributions
Function	Purpose	Inputs	Output	Qualitative.Output	Example
Binomial
dbinom()	Exact probability of \(x\) successes.	x = number of successes (output will be exact prob.) number of trials (\(n\)) probability of success (\(p\))	A single probability (0 to 1).	The height of the probability mass function at \(x\).	dbinom(x=2, size=10, prob=0.2)
pbinom()	Cumulative probability \(P(X \le x)\).	q = number of successes (output will be the cumulative prob.) number of trials (\(n\)) probability of success (\(p\))	A single cumulative probability (0 to 1).	The total area/mass under the distribution up to and including \(x\).	pbinom(q=2, size=10, prob=0.2)
qbinom()	Quantile: Finds the number of successes (\(q\)) for a given \(p\).	p = cumulative probability number of trials (\(n\)) probability of success (\(p\))	An integer number of successes (the quantile).	The minimum number of successes needed to achieve the cumulative probability \(p\).	qbinom(p=0.95, size=20, prob=0.1)
Poisson
dpois()	Exact probability of \(x\) events.	x = number of events lambda (\(\lambda\)) = average rate of occurrence	A single probability (0 to 1).	The height of the probability mass function at \(x\).	dpois(x=3, lambda=4.5)
ppois()	Cumulative probability \(P(X \le x)\).	q = number of events (quantile) lambda (\(\lambda\)) = average rate of occurrence	A single cumulative probability (0 to 1).	The total area/mass under the distribution up to and including \(x\).	ppois(q=3, lambda=4.5)
qpois()	Quantile: Finds the number of events (\(q\)) for a given \(p\).	p = cumulative probability lambda (\(\lambda\)) = average rate of occurrence	An integer number of events (the quantile).	The minimum number of events needed to achieve the cumulative probability \(p\).	qpois(p=0.90, lambda=3.5)
Normal
pnorm()	Cumulative probability \(P(X \le q)\).	q = specified value mean (\(\mu\)) sd (\(\sigma\))	A single cumulative probability (0 to 1).	The \(p\)-value corresponding to a one-sided test statistic \(q\).	pnorm(q=1.96, mean=0, sd=1)
qnorm()	Quantile/Critical value (\(z\)) for a given \(p\).	p = cumulative probability mean (\(\mu\)) sd (\(\sigma\))	A single \(z\)-score (critical value).	The critical \(Z\)-score needed to define a confidence interval or rejection region.	qnorm(p=0.975, mean=0, sd=1)
t-Distribution
pt()	Cumulative probability \(P(T \le t)\).	q = specified t-value df (\(n-1\)) = degrees of freedom	A single cumulative probability (0 to 1).	The \(p\)-value corresponding to a one-sided \(t\)-test statistic \(q\).	pt(q=2.04, df=29)
qt()	Quantile/Critical value (\(t\)) for a given \(p\).	p = cumulative probability df (\(n-1\)) = degrees of freedom	A single \(t\)-value (critical value).	The critical \(T\)-score needed to define a confidence interval or rejection region.	qt(p=0.975, df=29)
Chi-Squared (\(\chi^2\))
pchisq()	Cumulative probability \(P(\chi^2 \le x)\).	q = specified \(\chi^2\) value df = degrees of freedom	A single cumulative probability (0 to 1).	The \(p\)-value for the goodness-of-fit or independence test statistic \(q\).	pchisq(q=3.84, df=1)
qchisq()	Quantile/Critical value (\(\chi^2\)) for a given \(p\).	p = cumulative probability df = degrees of freedom	A single \(\chi^2\) value (critical value).	The critical \(\chi^2\) threshold for a specified significance level \(\alpha\).	qchisq(p=0.95, df=1)

Biostatistics Midterm 1 - Markdown information through chapter 7

Alex Tan