# Load and attach the tidyverse packages
library("tidyverse")
# Load library for skewness and kurtosis
library("moments")
# Load library for combining ggplot2 plots
library("patchwork")
# Load library for confidence bands in Q-Q plots
library("ggpubr")
# Load library for normality tests
library("nortest")
Scope: Population includes all elements; a sample includes only a subset.
Cost & Time: Studying a population is often expensive and slow; samples are faster and more feasible.
Accuracy: A well-designed sample can provide accurate estimates of population parameters.
Measures: Population uses parameters (e.g., \(\mu\), \(\sigma\)); sample uses statistics (e.g., \(\bar{x}\), \(s\)).
In Business Intelligence (BI) projects, especially survey-based data collection, defining the population and designing an appropriate sampling strategy are critical before building the questionnaire.
Why it matters:
Ensures representativeness of responses
Reduces bias in results
Optimizes sample size vs. cost
Improves decision-making reliability
A company wants to measure customer satisfaction in Monterrey:
Population: All customers who purchased in the last year
Sample: 400 customers selected using stratified sampling
Alert: If the sample is poorly designed (e.g., only surveying frequent buyers), results will be biased and misleading.
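As a rough sketch of how the stratified sample in the example above could be drawn, the base R code below uses proportional allocation. The customer frame, the three segment names, and their sizes are invented for illustration; only the sample size of 400 comes from the example.

```r
# Hypothetical sampling frame: 10,000 customers in three made-up segments.
set.seed(123)  # fixed seed so the draw is reproducible
customers <- data.frame(
  id = 1:10000,
  segment = rep(c("occasional", "regular", "frequent"),
                times = c(6000, 3000, 1000))
)

n_total <- 400  # sample size from the example

# Proportional allocation: each stratum's share of the sample equals
# its share of the population.
stratum_sizes <- sapply(split(customers$id, customers$segment), length)
allocation <- round(n_total * stratum_sizes / sum(stratum_sizes))

# Draw a simple random sample inside each stratum.
sampled <- do.call(rbind, lapply(names(allocation), function(s) {
  stratum <- customers[customers$segment == s, ]
  stratum[sample(nrow(stratum), allocation[[s]]), ]
}))

table(sampled$segment)
```

Under these assumed segment sizes, frequent buyers make up 10% of the frame and therefore 10% of the sample, so they cannot dominate the results the way they would in a survey of frequent buyers only.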
Sampling is a fundamental concept in Business Intelligence (BI), allowing us to make inferences about a population using a subset of data.
Common mistakes:
Defining the wrong population
Using convenience samples when generalization is required
Ignoring sample size calculations
Poor questionnaire design (leading or biased questions)
Not accounting for non-response bias
Bias: excludes dinner customers
Result: misleading satisfaction insights
Balanced representation
Reliable insights for decision-making
In probabilistic sampling, every element in the population has a known, non-zero probability of being selected.
Two fundamental pillars make probabilistic sampling reliable for Business Intelligence:
Randomness: Every element is selected by a chance mechanism with a known, non-zero probability (the same probability for all elements in simple random sampling). This ensures unbiased selection and allows valid statistical inference.
Sample Size Calculation (Representativeness): Proper calculation of the sample size ensures that the sample accurately represents the population, controlling error and confidence levels.
Without these two elements, results cannot be generalized to the population with confidence.
Types:
Simple Random Sampling
Systematic Sampling
Stratified Sampling
Cluster Sampling
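The four designs listed above can be sketched in base R as follows. The population frame, the region and store columns, and the sizes are hypothetical and chosen only so that every design returns 100 units.

```r
set.seed(42)
N <- 1000                                  # hypothetical population size
frame <- data.frame(
  id     = 1:N,
  region = rep(c("north", "south", "east", "west"), each = N / 4),
  store  = rep(1:50, each = N / 50)        # 50 stores of 20 customers each
)
n <- 100                                   # target sample size

# 1. Simple random sampling: every unit has the same chance of selection.
srs <- frame[sample(N, n), ]

# 2. Systematic sampling: random start, then every k-th unit in the frame.
k <- N / n
start <- sample(k, 1)
systematic <- frame[seq(start, N, by = k), ]

# 3. Stratified sampling: an independent random draw inside each region.
stratified <- do.call(rbind, lapply(split(frame, frame$region), function(s) {
  s[sample(nrow(s), n / 4), ]
}))

# 4. Cluster sampling: pick whole stores at random and keep everyone in them.
chosen_stores <- sample(unique(frame$store), 5)
cluster <- frame[frame$store %in% chosen_stores, ]

sapply(list(srs = srs, systematic = systematic,
            stratified = stratified, cluster = cluster), nrow)
```

All four draws rely on `sample()`, so the selection probabilities are known in advance, which is what separates these designs from the non-probabilistic ones.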
In non-probabilistic sampling, selection is based on subjective judgment rather than randomization.
Types:
Convenience Sampling
Judgmental Sampling
Quota Sampling
Snowball Sampling
Each element has equal probability of selection
Independence between selections
Sampling frame is complete
No bias in selection process
**Both randomness and sample size are required to achieve representativeness of the sample.**
| Objective | Critical Variable | Type |
|---|---|---|
| Measure customer satisfaction | % satisfied | Qualitative |
| Estimate average spending | Monthly spending | Quantitative |
| Predict churn | Churn (Yes/No) | Qualitative |
| Analyze usage frequency | Number of visits | Quantitative |
Before applying any formula, always follow these steps:
Define the business objective
Identify the key variable
Determine the variable type (quantitative or qualitative)
Identify population type (finite or infinite)
Apply the correct formula
BI Example Context:
A retail company wants to estimate the average monthly spending of customers.
When to Use:
Large or unknown population
Formula
\[ n = \frac{Z^2 \cdot \sigma^2}{E^2} \]
Example:
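A company wants to estimate average spending (in dollars) with 95% confidence (\(Z = 1.96\)), a standard deviation of \(\sigma = 10\), and a margin of error of \(E = 2\):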
\[ n = \frac{(1.96)^2 \cdot (10)^2}{2^2} = 96.04 \approx 97 \]
Interpretation: You need at least 97 customers to estimate the average spending with the desired precision.
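The arithmetic can be checked in R. The short sketch below also shows why the result is rounded up rather than to the nearest integer: with 96 customers the achieved margin of error would slightly exceed the target. Variable names such as `n_raw` are illustrative.

```r
Z <- 1.96      # 95% confidence
sigma <- 10    # standard deviation (dollars)
E <- 2         # margin of error (dollars)

n_raw <- (Z^2 * sigma^2) / E^2   # unrounded sample size
n <- ceiling(n_raw)              # always round up
c(n_raw = n_raw, n = n)

# Margin of error actually achieved with n and with n - 1 customers.
margin <- function(size) Z * sigma / sqrt(size)
c(with_n = margin(n), with_n_minus_1 = margin(n - 1))
```

Rounding down to 96 would give a margin just above 2 dollars, so the precision requirement would not be met.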
When to Use:
Known, finite population (\(N\) is known)
Formula
\[ n = \frac{N \cdot Z^2 \cdot \sigma^2}{E^2 (N - 1) + Z^2 \cdot \sigma^2} \]
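Example:
The same company has a finite population of \(N = 500\) customers, with \(Z = 1.96\), \(\sigma = 10\), and \(E = 2\):
\[ n = \frac{500 \cdot (1.96)^2 \cdot (10)^2}{2^2 (500 - 1) + (1.96)^2 \cdot (10)^2} = 80.70 \approx 81 \]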
Interpretation: You need a smaller sample than the infinite case because the population is limited. You need at least 81 customers to estimate the average spending with the desired precision.
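One way to see why the finite case always gives a smaller sample is to rewrite the finite formula in terms of the infinite-population result. Writing \(n_0\) for that result (the symbol \(n_0\) is introduced here only for convenience), dividing numerator and denominator by \(E^2\) gives:

\[ n_0 = \frac{Z^2 \cdot \sigma^2}{E^2}, \qquad n = \frac{N \cdot Z^2 \cdot \sigma^2}{E^2 (N - 1) + Z^2 \cdot \sigma^2} = \frac{N \cdot n_0}{(N - 1) + n_0} = \frac{n_0}{1 + \frac{n_0 - 1}{N}} \]

With \(n_0 = 96.04\) and \(N = 500\):

\[ n = \frac{96.04}{1 + \frac{95.04}{500}} = \frac{96.04}{1.19008} \approx 80.70 \]

The denominator is greater than 1 whenever \(n_0 > 1\), so \(n < n_0\), and as \(N\) grows the correction term \(\frac{n_0 - 1}{N}\) vanishes and \(n\) approaches \(n_0\).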
BI Example Context:
A company wants to estimate the percentage of satisfied customers.
When to Use:
Large or unknown population
Example: all potential users of a service
Formula
\[ n = \frac{Z^2 \cdot p \cdot q}{E^2} \]
Example:
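A company wants to estimate the percentage of satisfied customers with 95% confidence (\(Z = 1.96\)), an assumed proportion of \(p = 0.5\) (so \(q = 1 - p = 0.5\)), and a margin of error of \(E = 0.05\):
\[ n = \frac{(1.96)^2 \cdot 0.5 \cdot 0.5}{0.05^2} = 384.16 \approx 385 \]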
Interpretation: You need 385 respondents to estimate satisfaction with 5% error.
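The choice of \(p = 0.5\) in this example is the most conservative one, because the product \(p \cdot q\) is largest at 0.5. A small sketch, using the same \(Z\) and \(E\) and an illustrative grid of proportions, makes this visible:

```r
Z <- 1.96
E <- 0.05
p <- seq(0.1, 0.9, by = 0.1)   # candidate proportions (illustrative grid)

# Required sample size for each assumed proportion.
n <- ceiling(Z^2 * p * (1 - p) / E^2)
data.frame(p = p, n = n)
```

The required sample size peaks at \(p = 0.5\) and falls off symmetrically, so using 0.5 when no prior estimate exists guards against undersizing the sample.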
When to Use:
Known, finite population (\(N\) is known)
Formula
\[ n = \frac{N \cdot Z^2 \cdot p \cdot q}{E^2 (N - 1) + Z^2 \cdot p \cdot q} \]
Example:
A company has 1000 customers and uses the same settings (\(Z = 1.96\), \(p = q = 0.5\), \(E = 0.05\)):
\[ n = \frac{1000 \cdot (1.96)^2 \cdot 0.5 \cdot 0.5}{0.05^2 (1000 - 1) + (1.96)^2 \cdot 0.5 \cdot 0.5} = 277.74 \approx 278 \]
Interpretation: The required sample is smaller than 385 because the population is finite. You need 278 respondents to estimate satisfaction with 5% error.
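The four formulas above differ only in the variance term (\(\sigma^2\) for a mean, \(p \cdot q\) for a proportion) and in whether the finite-population term is applied. As a minimal sketch, they can be folded into one helper. The function name `sample_size` and its argument names are illustrative and not part of the original material.

```r
sample_size <- function(type = c("quantitative", "qualitative"),
                        Z, E, sigma = NULL, p = NULL, N = Inf) {
  type <- match.arg(type)

  # Variance term: sigma^2 for a mean, p * (1 - p) for a proportion.
  v <- if (type == "quantitative") {
    stopifnot(!is.null(sigma))
    sigma^2
  } else {
    stopifnot(!is.null(p))
    p * (1 - p)
  }

  # Infinite population unless a finite N is supplied.
  n <- if (is.infinite(N)) {
    Z^2 * v / E^2
  } else {
    N * Z^2 * v / (E^2 * (N - 1) + Z^2 * v)
  }
  ceiling(n)
}

# The four worked examples above.
sample_size("quantitative", Z = 1.96, E = 2, sigma = 10)             # 97
sample_size("quantitative", Z = 1.96, E = 2, sigma = 10, N = 500)    # 81
sample_size("qualitative",  Z = 1.96, E = 0.05, p = 0.5)             # 385
sample_size("qualitative",  Z = 1.96, E = 0.05, p = 0.5, N = 1000)   # 278
```

Leaving `N` at its default of `Inf` selects the infinite-population formula, which mirrors the decision steps: variable type first, population type second.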
Exercise 1)
A streaming platform wants to estimate the average daily usage time (in minutes) of its users. Assuming a confidence level of 95%, a standard deviation of 30 minutes, and a margin of error of 5 minutes, what is the required sample size?
In R:
# Quantitative Variable and Infinite Population
Z <- 1.96
sigma <- 30
E <- 5
sample_size_quant_inf <- function(Z, sigma, E){
  n <- (Z^2 * sigma^2) / (E^2)
  return(ceiling(n))
}
sample_size_quant_inf(Z, sigma, E)
## [1] 139
Interpretation: At least 139 users are needed.
Exercise 2)
An e-commerce company wants to estimate the average purchase value (in dollars). Using a confidence level of 99%, a standard deviation of 50, and a margin of error of 10, how many customers should be included in the sample?
In R:
# Quantitative Variable and Infinite Population
Z <- 2.58
sigma <- 50
E <- 10
sample_size_quant_inf <- function(Z, sigma, E){
  n <- (Z^2 * sigma^2) / (E^2)
  return(ceiling(n))
}
sample_size_quant_inf(Z, sigma, E)
## [1] 167
Interpretation: At least 167 customers should be included in the sample to estimate the average purchase value with the specified precision and confidence level.
Exercise 3)
A fitness app wants to estimate the average number of steps per day. Given a confidence level of 90%, a standard deviation of 2000 steps, and a margin of error of 500 steps, determine the minimum sample size required.
In R:
# Quantitative Variable and Infinite Population
Z <- 1.64
sigma <- 2000
E <- 500
sample_size_quant_inf <- function(Z, sigma, E){
  n <- (Z^2 * sigma^2) / (E^2)
  return(ceiling(n))
}
sample_size_quant_inf(Z, sigma, E)
## [1] 44
Interpretation: At least 44 users are needed to estimate the average number of steps per day with the specified precision.
Exercise 4)
A telecom company wants to estimate the average monthly data usage (in GB). With a 95% confidence level, a standard deviation of 8 GB, and a margin of error of 2 GB, what sample size is needed?
In R:
# Quantitative Variable and Infinite Population
Z <- 1.96
sigma <- 8
E <- 2
sample_size_quant_inf <- function(Z, sigma, E){
  n <- (Z^2 * sigma^2) / (E^2)
  return(ceiling(n))
}
sample_size_quant_inf(Z, sigma, E)
## [1] 62
Interpretation: At least 62 customers are needed to estimate the average monthly data usage with the desired confidence and margin of error.
Exercise 5)
A bank wants to estimate the average transaction amount (in dollars). Using a confidence level of 98%, a standard deviation of 120, and a margin of error of 25, calculate the required sample size.
In R:
# Quantitative Variable and Infinite Population
Z <- 2.33
sigma <- 120
E <- 25
sample_size_quant_inf <- function(Z, sigma, E){
  n <- (Z^2 * sigma^2) / (E^2)
  return(ceiling(n))
}
sample_size_quant_inf(Z, sigma, E)
## [1] 126
Interpretation: At least 126 customers are needed to estimate the average transaction amount with the desired precision.
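The five exercises above repeat the same function with different inputs, so they can also be run in one pass. The sketch below is one possible arrangement: the helper `z_for_conf` and the `case` labels are illustrative, and the Z values are obtained from `qnorm()` and rounded to two decimals, which reproduces the table values used in the exercises (1.96, 2.58, 1.64, 2.33).

```r
# Two-sided critical value for a confidence level, rounded to 2 decimals
# to match the table values used in the exercises.
z_for_conf <- function(conf) round(qnorm(1 - (1 - conf) / 2), 2)

exercises <- data.frame(
  case  = c("streaming usage", "purchase value", "daily steps",
            "data usage", "transaction amount"),
  conf  = c(0.95, 0.99, 0.90, 0.95, 0.98),
  sigma = c(30, 50, 2000, 8, 120),
  E     = c(5, 10, 500, 2, 25)
)

exercises$Z <- z_for_conf(exercises$conf)
exercises$n <- ceiling(exercises$Z^2 * exercises$sigma^2 / exercises$E^2)
exercises
```

Using unrounded critical values from `qnorm()` can change a result by one unit when the unrounded sample size sits close to an integer, so it helps to state which convention is being used.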
Exercise 1)
A retailer has a database of 800 customers and wants to estimate the average monthly spending. With a confidence level of 95%, a standard deviation of 40, and a margin of error of 8, what is the required sample size?
In R:
# Quantitative Variable and Finite Population
N <- 800
Z <- 1.96
sigma <- 40
E <- 8
sample_size_quant_fin <- function(N, Z, sigma, E){
  n <- (N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2)
  return(ceiling(n))
}
sample_size_quant_fin(N, Z, sigma, E)
## [1] 86
Interpretation: At least 86 customers are required to estimate the average monthly spending for this finite population.
Exercise 2)
A SaaS company has 1200 users and wants to estimate the average session duration (in minutes). Using a confidence level of 99%, a standard deviation of 25, and a margin of error of 5 minutes, how large should the sample be?
In R:
# Quantitative Variable and Finite Population
N <- 1200
Z <- 2.58
sigma <- 25
E <- 5
sample_size_quant_fin <- function(N, Z, sigma, E){
  n <- (N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2)
  return(ceiling(n))
}
sample_size_quant_fin(N, Z, sigma, E)
## [1] 147
Interpretation: At least 147 users should be sampled to estimate the average session duration with the specified conditions.
Exercise 3)
A gym has 500 members and wants to estimate the average number of weekly visits. Given a confidence level of 90%, a standard deviation of 3 visits, and a margin of error of 1 visit, determine the sample size.
In R:
# Quantitative Variable and Finite Population
N <- 500
Z <- 1.64
sigma <- 3
E <- 1
sample_size_quant_fin <- function(N, Z, sigma, E){
  n <- (N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2)
  return(ceiling(n))
}
sample_size_quant_fin(N, Z, sigma, E)
## [1] 24
Interpretation: At least 24 members are needed to estimate the average number of weekly visits.
Exercise 4)
A delivery company has 300 drivers and wants to estimate the average delivery time (in minutes). With a confidence level of 95%, a standard deviation of 12 minutes, and a margin of error of 3 minutes, calculate the required sample size.
In R:
# Quantitative Variable and Finite Population
N <- 300
Z <- 1.96
sigma <- 12
E <- 3
sample_size_quant_fin <- function(N, Z, sigma, E){
  n <- (N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2)
  return(ceiling(n))
}
sample_size_quant_fin(N, Z, sigma, E)
## [1] 52
Interpretation: At least 52 drivers are required to estimate the average delivery time with the desired precision.
Exercise 5)
A university has 1500 students and wants to estimate the average study hours per week. Using a confidence level of 98%, a standard deviation of 10 hours, and a margin of error of 2 hours, what is the minimum sample size needed?
In R:
# Quantitative Variable and Finite Population
N <- 1500
Z <- 2.33
sigma <- 10
E <- 2
sample_size_quant_fin <- function(N, Z, sigma, E){
  n <- (N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2)
  return(ceiling(n))
}
sample_size_quant_fin(N, Z, sigma, E)
## [1] 125
Interpretation: At least 125 students are needed to estimate the average study hours per week.
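To see how the population size drives the finite-population result, the sketch below reuses the retailer's settings from the first exercise in this group (Z = 1.96, sigma = 40, E = 8) and varies N. The grid of N values is an illustrative assumption.

```r
sample_size_quant_fin <- function(N, Z, sigma, E) {
  ceiling((N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2))
}

Z <- 1.96
sigma <- 40
E <- 8
N_values <- c(100, 800, 5000, 1e5, 1e6)   # hypothetical population sizes

n_finite <- sapply(N_values, sample_size_quant_fin,
                   Z = Z, sigma = sigma, E = E)
n_infinite <- ceiling(Z^2 * sigma^2 / E^2)   # infinite-population benchmark

data.frame(N = N_values, n_finite = n_finite, n_infinite = n_infinite)
```

The finite result never exceeds the infinite-population benchmark and converges to it as N grows, so the finite formula matters most when the required sample is a sizeable share of the population.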
Exercise 1)
A company wants to estimate the percentage of satisfied customers. Assuming a confidence level of 95%, an estimated proportion of 50%, and a margin of error of 5%, what sample size is required?
In R:
# Qualitative Variable and Infinite Population
Z <- 1.96
p <- 0.5
E <- 0.05
sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}
sample_size_qual_inf(Z, p, E)
## [1] 385
Interpretation: At least 385 respondents are required to estimate the percentage of satisfied customers with the stated confidence level and margin of error.
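Because the margin of error enters the formula squared, tightening it is expensive. The sketch below keeps Z = 1.96 and p = 0.5 from this exercise and compares three margins; the values 10% and 2.5% are added here purely for comparison.

```r
Z <- 1.96
p <- 0.5
E <- c(0.10, 0.05, 0.025)   # 10%, 5% (the exercise), and 2.5%

n <- ceiling(Z^2 * p * (1 - p) / E^2)
data.frame(margin_of_error = E, n = n)
```

Each halving of the margin of error roughly quadruples the required number of respondents, which is the trade-off between precision and survey cost.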
Exercise 2)
A mobile app company wants to estimate the percentage of users who upgrade to premium. With a confidence level of 99%, an estimated proportion of 40%, and a margin of error of 4%, how many users should be surveyed?
In R:
# Qualitative Variable and Infinite Population
Z <- 2.58
p <- 0.4
E <- 0.04
sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}
sample_size_qual_inf(Z, p, E)
## [1] 999
Interpretation: At least 999 respondents are required to estimate the percentage of users who upgrade to premium with the stated confidence level and margin of error.
Exercise 3)
A bank wants to estimate the percentage of customers using mobile banking services. Given a confidence level of 90%, an estimated proportion of 60%, and a margin of error of 3%, determine the required sample size.
In R:
# Qualitative Variable and Infinite Population
Z <- 1.64
p <- 0.6
E <- 0.03
sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}
sample_size_qual_inf(Z, p, E)
## [1] 718
Interpretation: At least 718 respondents are required to estimate the percentage of customers using mobile banking services with the stated confidence level and margin of error.
Exercise 4)
A retailer wants to estimate the percentage of customers who prefer online shopping. Using a 95% confidence level, an estimated proportion of 70%, and a margin of error of 5%, what is the necessary sample size?
In R:
# Qualitative Variable and Infinite Population
Z <- 1.96
p <- 0.7
E <- 0.05
sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}
sample_size_qual_inf(Z, p, E)
## [1] 323
Interpretation: At least 323 respondents are required to estimate the percentage of customers who prefer online shopping with the stated confidence level and margin of error.
Exercise 5)
A streaming platform wants to estimate the percentage of users who churn. With a confidence level of 98%, an estimated proportion of 30%, and a margin of error of 6%, calculate the sample size.
In R:
# Qualitative Variable and Infinite Population
Z <- 2.33
p <- 0.3
E <- 0.06
sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}
sample_size_qual_inf(Z, p, E)
## [1] 317
Interpretation: At least 317 respondents are required to estimate the percentage of users who churn with the stated confidence level and margin of error.
Exercise 1)
A company has 1000 customers and wants to estimate the percentage of satisfied customers. Assuming a confidence level of 95%, an estimated proportion of 50%, and a margin of error of 5%, what is the required sample size?
In R:
# Qualitative Variable and Finite Population
N <- 1000
Z <- 1.96
p <- 0.5
E <- 0.05
sample_size_qual_fin <- function(N, Z, p, E){
  q <- 1 - p
  n <- (N * Z^2 * p * q) / (E^2 * (N - 1) + Z^2 * p * q)
  return(ceiling(n))
}
sample_size_qual_fin(N, Z, p, E)
## [1] 278
Interpretation: At least 278 respondents are required to estimate the percentage of satisfied customers for this finite population.
Exercise 2)
A bank has 2000 clients and wants to estimate the percentage of customers using online banking. Using a confidence level of 99%, an estimated proportion of 40%, and a margin of error of 4%, how many clients should be included in the sample?
In R:
# Qualitative Variable and Finite Population
N <- 2000
Z <- 2.58
p <- 0.4
E <- 0.04
sample_size_qual_fin <- function(N, Z, p, E){
  q <- 1 - p
  n <- (N * Z^2 * p * q) / (E^2 * (N - 1) + Z^2 * p * q)
  return(ceiling(n))
}
sample_size_qual_fin(N, Z, p, E)
## [1] 667
Interpretation: At least 667 respondents are required to estimate the percentage of customers using online banking for this finite population.
Exercise 3)
A university has 800 students and wants to estimate the percentage of students attending classes regularly. Given a confidence level of 90%, an estimated proportion of 60%, and a margin of error of 3%, determine the sample size.
In R:
# Qualitative Variable and Finite Population
N <- 800
Z <- 1.64
p <- 0.6
E <- 0.03
sample_size_qual_fin <- function(N, Z, p, E){
  q <- 1 - p
  n <- (N * Z^2 * p * q) / (E^2 * (N - 1) + Z^2 * p * q)
  return(ceiling(n))
}
sample_size_qual_fin(N, Z, p, E)
## [1] 379
Interpretation: At least 379 respondents are required to estimate the percentage of students attending classes regularly for this finite population.
Exercise 4)
A retail chain has 600 customers and wants to estimate the percentage using loyalty programs. With a confidence level of 95%, an estimated proportion of 70%, and a margin of error of 5%, what is the required sample size?
In R:
# Qualitative Variable and Finite Population
N <- 600
Z <- 1.96
p <- 0.7
E <- 0.05
sample_size_qual_fin <- function(N, Z, p, E){
  q <- 1 - p
  n <- (N * Z^2 * p * q) / (E^2 * (N - 1) + Z^2 * p * q)
  return(ceiling(n))
}
sample_size_qual_fin(N, Z, p, E)
## [1] 211
Interpretation: At least 211 respondents are required to estimate the percentage of customers using loyalty programs for this finite population.
Exercise 5)
A telecom company has 1500 users and wants to estimate the percentage of customers who will churn. Using a confidence level of 98%, an estimated proportion of 30%, and a margin of error of 6%, calculate the required sample size.
In R:
# Qualitative Variable and Finite Population
N <- 1500
Z <- 2.33
p <- 0.3
E <- 0.06
sample_size_qual_fin <- function(N, Z, p, E){
  q <- 1 - p
  n <- (N * Z^2 * p * q) / (E^2 * (N - 1) + Z^2 * p * q)
  return(ceiling(n))
}
sample_size_qual_fin(N, Z, p, E)
## [1] 262
Interpretation: At least 262 respondents are required to estimate the percentage of customers who will churn for this finite population.
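The five finite-population exercises above use the same Z, p, and E as the five infinite-population proportion exercises, so the two sets of results can be compared side by side. The sketch below is one way to tabulate them; the helper names `n_infinite` and `n_finite` and the `sampling_fraction` column are illustrative additions.

```r
n_infinite <- function(Z, p, E) ceiling(Z^2 * p * (1 - p) / E^2)
n_finite <- function(N, Z, p, E) {
  v <- Z^2 * p * (1 - p)
  ceiling(N * v / (E^2 * (N - 1) + v))
}

# Inputs of the five finite-population exercises, in order.
cases <- data.frame(
  N = c(1000, 2000, 800, 600, 1500),
  Z = c(1.96, 2.58, 1.64, 1.96, 2.33),
  p = c(0.5, 0.4, 0.6, 0.7, 0.3),
  E = c(0.05, 0.04, 0.03, 0.05, 0.06)
)

cases$n_inf <- n_infinite(cases$Z, cases$p, cases$E)
cases$n_fin <- n_finite(cases$N, cases$Z, cases$p, cases$E)
cases$sampling_fraction <- round(cases$n_fin / cases$N, 2)
cases
```

The reduction is largest where the infinite-population result is a large share of N, as in the university case with 800 students.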
Silva, P. L. N., et al. Amostragem com R. https://amostragemcomr.github.io/livro/index.html
Wu & Thompson. Sampling Theory and Practice. https://link.springer.com/book/10.1007/978-3-030-44246-0
Bencardino, C. M. Estadística y muestreo. https://www.ecoeediciones.com/product/estadistica-y-muestreo-14a-edicion-ebook/