# Load the tidyverse packages
library("tidyverse")

# Load library for skewness and kurtosis
library("moments")

library("patchwork")

# Load library for Confidence band in Q-Q plot
library("ggpubr")

library("nortest")

1 Introduction

2 Population vs Sample

2.0.1 Key Differences:

  • Scope: Population includes all elements; a sample includes only a subset.

  • Cost & Time: Studying a population is often expensive and slow; samples are faster and more feasible.

  • Accuracy: A well-designed sample can provide accurate estimates of population parameters.

  • Measures: Population uses parameters (e.g., \(\mu\), \(\sigma\)); sample uses statistics (e.g., \(\bar{x}\), \(s\)).

2.0.2 Why This Matters for Questionnaire Design

In Business Intelligence (BI) projects, especially survey-based data collection, defining the population and designing an appropriate sampling strategy are critical before building the questionnaire.

  • Ensures representativeness of responses

  • Reduces bias in results

  • Optimizes sample size vs. cost

  • Improves decision-making reliability

2.0.3 Example

A company wants to measure customer satisfaction in Monterrey:

Population: All customers who purchased in the last year

Sample: 400 customers selected using stratified sampling

Alert: If the sample is poorly designed (e.g., only surveying frequent buyers), results will be biased and misleading.

Sampling is a fundamental concept in Business Intelligence (BI), allowing us to make inferences about a population using a subset of data.

2.1 Common Mistakes in Survey Sampling

  • Defining the wrong population

  • Using convenience samples when generalization is required

  • Ignoring sample size calculations

  • Poor questionnaire design (leading or biased questions)

  • Not accounting for non-response bias

2.2 Mini Case Study (Good vs Bad Design)

  1. Bad Design: A restaurant surveys only customers present during lunch hours.
  • Bias: excludes dinner customers

  • Result: misleading satisfaction insights

  2. Good Design: Random sample of customers across different days and times.
  • Balanced representation

  • Reliable insights for decision-making
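
The effect of the bad design can be illustrated with a small simulation. The sketch below uses entirely hypothetical data: the satisfaction scores, the 40/60 lunch/dinner split, and the names `customers`, `bad_sample`, and `good_sample` are assumptions made for illustration, not figures from a real restaurant.

```r
# Hypothetical population: 1,000 customers, 400 at lunch and 600 at dinner.
# Scores are simulated for illustration only; they are not real data.
set.seed(123)
customers <- data.frame(
  shift = rep(c("lunch", "dinner"), times = c(400, 600))
)
customers$score <- ifelse(
  customers$shift == "lunch",
  rnorm(nrow(customers), mean = 8, sd = 1),  # assumed: lunch guests score higher
  rnorm(nrow(customers), mean = 6, sd = 1)   # assumed: dinner guests score lower
)

# Bad design: survey 100 customers, all from the lunch shift
lunch_only <- customers[customers$shift == "lunch", ]
bad_sample <- lunch_only[sample(nrow(lunch_only), 100), ]

# Good design: simple random sample of 100 customers from the whole population
good_sample <- customers[sample(nrow(customers), 100), ]

# Compare the two sample means with the population mean
c(population  = mean(customers$score),
  bad_design  = mean(bad_sample$score),
  good_design = mean(good_sample$score))
```

Under these assumptions the lunch-only sample overstates satisfaction, while the random sample stays close to the population mean. Increasing the size of the lunch-only sample would not remove the bias, because dinner customers still have no chance of being selected.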

3 Types of Sampling

3.1 Probabilistic Sampling

In probabilistic sampling, every element in the population has a known, non-zero probability of being selected.

  • Why It Is Essential:

Two fundamental pillars make probabilistic sampling reliable for Business Intelligence:

  1. Randomness: Every element is selected by a chance mechanism with a known, non-zero probability (an equal probability in simple random sampling). This ensures unbiased selection and allows valid statistical inference.

  2. Sample Size Calculation (Representativeness): Proper calculation of the sample size ensures that the sample accurately represents the population, controlling error and confidence levels.

Without these two elements, results cannot be generalized to the population with confidence.

Types:

  • Simple Random Sampling

  • Systematic Sampling

  • Stratified Sampling

  • Cluster Sampling
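
The four designs listed above can be sketched in base R. This is a minimal illustration, not a production sampling routine: the data frame `sampling_frame`, its `region` column, and the sample size of 100 are hypothetical.

```r
# Hypothetical sampling frame: 1,000 customers in 4 regions (assumed data)
set.seed(42)
sampling_frame <- data.frame(
  id     = 1:1000,
  region = rep(c("North", "South", "East", "West"), times = c(400, 300, 200, 100))
)
n <- 100
N <- nrow(sampling_frame)

# Simple random sampling: every customer has the same chance of selection
srs <- sampling_frame[sample(N, n), ]

# Systematic sampling: random start, then every k-th customer
k <- N / n                                  # sampling interval (10 here)
start <- sample(k, 1)
systematic <- sampling_frame[seq(start, N, by = k), ]

# Stratified sampling with proportional allocation: 10% within each region
strata <- split(sampling_frame, sampling_frame$region)
stratified <- do.call(rbind, lapply(strata, function(stratum) {
  stratum[sample(nrow(stratum), round(nrow(stratum) * n / N)), ]
}))

# Cluster sampling: pick 2 whole regions at random and keep all their customers
chosen_regions <- sample(unique(sampling_frame$region), 2)
cluster <- sampling_frame[sampling_frame$region %in% chosen_regions, ]

# Number of sampled customers per region under the stratified design
table(stratified$region)
```

With proportional allocation each region contributes the same share to the sample as it does to the frame (40, 30, 20, and 10 customers here). The cluster sample, by contrast, has a size that depends on which regions are drawn.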

3.2 Non-Probabilistic Sampling

In non-probabilistic sampling, selection is based on subjective judgment rather than randomization.

Types:

  • Convenience Sampling

  • Judgmental Sampling

  • Quota Sampling

  • Snowball Sampling

3.3 Probabilistic Sampling: Simple Random Sampling

  1. Assumptions of Randomness
  • Each element has equal probability of selection

  • Independence between selections

  • Sampling frame is complete

  • No bias in selection process

  2. Representativeness
  • The sample size must be properly calculated

Both randomness and sample size are required to achieve representativeness of the sample.

4 Sample Size Calculation

4.0.1 Decision Guide (Which Formula Should I Use?)

| Objective | Critical Variable | Type |
|---|---|---|
| Measure customer satisfaction | % satisfied | Qualitative |
| Estimate average spending | Income | Quantitative |
| Predict churn | Churn (Yes/No) | Qualitative |
| Analyze usage frequency | Number of visits | Quantitative |

4.1 Sample Size Calculation

Before applying any formula, always follow these steps:

  1. Define the business objective

  2. Identify the key variable

  3. Determine the variable type (quantitative or qualitative)

  4. Identify population type (finite or infinite)

  5. Apply the correct formula
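
Steps 3 to 5 can be sketched as a single dispatcher that chooses among the four formulas presented in the sections that follow. The function `sample_size()` and its argument `variability` are hypothetical names introduced here for illustration; `variability` is \(\sigma\) for a quantitative variable and \(p\) for a qualitative one, and leaving `N` at its default treats the population as infinite.

```r
# Hypothetical helper: picks the formula from the variable type and from
# whether the population size N is known.
sample_size <- function(type = c("quantitative", "qualitative"),
                        Z, variability, E, N = Inf) {
  type <- match.arg(type)

  # Step 3: the variable type determines the variance term
  # (sigma^2 for a mean, p * q for a proportion)
  variance <- if (type == "quantitative") {
    variability^2
  } else {
    variability * (1 - variability)
  }

  # Steps 4 and 5: finite formula when N is known, infinite formula otherwise
  n <- if (is.finite(N)) {
    (N * Z^2 * variance) / (E^2 * (N - 1) + Z^2 * variance)
  } else {
    (Z^2 * variance) / E^2
  }

  ceiling(n)  # always round up to a whole respondent
}

# Average spending, unknown population size
sample_size("quantitative", Z = 1.96, variability = 10, E = 2)

# Share of satisfied customers, database of 1,000 customers
sample_size("qualitative", Z = 1.96, variability = 0.5, E = 0.05, N = 1000)
```

Keeping the decision in one place makes the choice of formula explicit and avoids applying the infinite-population formula to a small customer database by accident.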

4.2 Quantitative Variable

BI Example Context:

A retail company wants to estimate the average monthly spending of customers.

  • Objective → Estimate average spending
  • Variable → Spending ($)
  • Type → Quantitative
  • Decision → Which formula to use?

4.2.1 Infinite Population

When to Use:

  • Population is very large or unknown
  • Example: all potential customers, online users

Formula

\[ n = \frac{Z^2 \cdot \sigma^2}{E^2} \]
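
The formula can be read as the margin of error of a confidence interval for a mean, solved for \(n\). A brief sketch, assuming the sample mean is approximately normal and \(\sigma\) is known or estimated from prior data:

\[ E = Z \cdot \frac{\sigma}{\sqrt{n}} \quad \Longrightarrow \quad \sqrt{n} = \frac{Z \cdot \sigma}{E} \quad \Longrightarrow \quad n = \frac{Z^2 \cdot \sigma^2}{E^2} \]

Since \(n\) must be a whole number and rounding down would leave the margin of error slightly above \(E\), the result is always rounded up.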

Example:

A company wants to estimate average spending (in dollars):

  • \(Z = 1.96\) (95% confidence)
  • \(\sigma = 10\) (estimated variability in spending in dollars)
  • \(E = 2\) (acceptable error in dollars)

\[ n = \frac{(1.96)^2 \cdot (10)^2}{2^2} = 96.04 \approx 97 \]

Interpretation: You need at least 97 customers to estimate the average spending with the desired precision.

4.2.1.1 R Functions for Sample Size

# Quantitative Variable and Infinite Population
Z <- 1.96
sigma <- 10
E <- 2

sample_size_quant_inf <- function(Z, sigma, E){
  n <- (Z^2 * sigma^2) / (E^2)
  return(ceiling(n))
}

sample_size_quant_inf(Z, sigma, E)
## [1] 97
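
The value of \(Z\) does not have to be looked up in a table. As a small sketch, the hypothetical helper `z_value()` below obtains the two-sided critical value from the confidence level using base R's `qnorm()`, assuming the standard normal distribution.

```r
# Two-sided critical value of the standard normal for a confidence level
z_value <- function(confidence) {
  alpha <- 1 - confidence       # total probability left in both tails
  qnorm(1 - alpha / 2)          # quantile that leaves alpha / 2 in the upper tail
}

# Confidence levels of 90%, 95%, 98% and 99%, rounded to two decimals
round(z_value(c(0.90, 0.95, 0.98, 0.99)), 2)
```

Rounded to two decimals, these are the values 1.64, 1.96, 2.33, and 2.58 used throughout this document. Passing the unrounded value to the sample size functions can occasionally change the result by one unit.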

4.2.2 Finite Population

When to Use:

  • You know the total number of customers
  • Example: customer database

Formula

\[ n = \frac{N \cdot Z^2 \cdot \sigma^2}{E^2 (N - 1) + Z^2 \cdot \sigma^2} \]

Example:

  • A company has a database of 500 customers:

\[ n = \frac{500 \cdot (1.96)^2 \cdot (10)^2}{2^2 (500 - 1) + (1.96)^2 \cdot (10)^2} = 80.70 \approx 81 \]

Interpretation: You need a smaller sample than the infinite case because the population is limited. You need at least 81 customers to estimate the average spending with the desired precision.

4.2.2.1 R Functions for Sample Size

# Quantitative Variable and Finite Population
N <- 500
Z <- 1.96
sigma <- 10
E <- 2

sample_size_quant_fin <- function(N, Z, sigma, E){
  n <- (N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2)
  return(ceiling(n))
}

sample_size_quant_fin(N, Z, sigma, E)
## [1] 81
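
To see how the population size affects the result, the same function can be evaluated for several values of `N`. The population sizes below are hypothetical; `Z`, `sigma`, and `E` are those of the example. The function is repeated so that the block runs on its own.

```r
# Finite-population formula for a quantitative variable (as defined above)
sample_size_quant_fin <- function(N, Z, sigma, E){
  n <- (N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2)
  return(ceiling(n))
}

# Hypothetical population sizes, from a small database to a very large one
N_values <- c(100, 500, 5000, 50000, 1e6)

# The function is vectorized, so all sizes are computed in one call
data.frame(
  N = N_values,
  n = sample_size_quant_fin(N_values, Z = 1.96, sigma = 10, E = 2)
)
```

The required sample grows with `N` but levels off at 97, the infinite-population result. This is why the simpler infinite-population formula is a reasonable approximation once the population is very large.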

4.3 Qualitative Variable (Proportions)

BI Example Context:

A company wants to estimate the percentage of satisfied customers.

  • Objective → Measure satisfaction
  • Variable → % satisfied
  • Type → Qualitative (proportion)

4.3.1 Infinite Population

When to Use:

  • Large or unknown population

  • Example: all potential users of a service

Formula

\[ n = \frac{Z^2 \cdot p \cdot q}{E^2} \]

Example:

A company wants to estimate the percentage of satisfied customers:

  • \(Z = 1.96\)
  • \(p = 0.5\) (maximum variability → safest choice)
  • \(q = 1 - p = 0.5\)
  • \(E = 0.05\)

\[ n = \frac{(1.96)^2 \cdot 0.5 \cdot 0.5}{0.05^2} = 384.16 \approx 385 \]

Interpretation: You need at least 385 respondents to estimate the satisfaction rate within a 5-percentage-point margin of error at 95% confidence.

4.3.1.1 R Functions for Sample Size

# Qualitative Variable and Infinite Population
Z <- 1.96
p <- 0.5
E <- 0.05

sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}

sample_size_qual_inf(Z, p, E)
## [1] 385
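
The choice \(p = 0.5\) is called the safest because the product \(p \cdot q\) is largest there. A short sketch that evaluates the same function over a grid of proportions (the grid `p_grid` is an illustrative assumption):

```r
# Infinite-population formula for a proportion (as defined above)
sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}

# Required sample size for proportions from 10% to 90%
p_grid <- seq(0.1, 0.9, by = 0.1)
n_grid <- sample_size_qual_inf(Z = 1.96, p = p_grid, E = 0.05)
data.frame(p = p_grid, n = n_grid)

# Proportion that demands the largest sample
p_grid[which.max(n_grid)]
```

The required sample peaks at \(p = 0.5\) and falls symmetrically on either side. When no prior estimate of the proportion is available, using 0.5 guarantees the sample is large enough whatever the true value turns out to be.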

4.3.2 Finite Population

When to Use:

  • Known number of customers

Formula

\[ n = \frac{N \cdot Z^2 \cdot p \cdot q}{E^2 (N - 1) + Z^2 \cdot p \cdot q} \]

Example:

A company has 1000 customers:

  • \(N = 1000\)
  • \(Z = 1.96\)
  • \(p = 0.5\) (maximum variability → safest choice)
  • \(q = 1 - p = 0.5\)
  • \(E = 0.05\)

\[ n = \frac{1000 \cdot (1.96)^2 \cdot 0.5 \cdot 0.5}{0.05^2 (1000 - 1) + (1.96)^2 \cdot 0.5 \cdot 0.5} = 277.74 \approx 278 \]

Interpretation: The required sample is smaller than 385 because the population is finite. You need at least 278 respondents to estimate the satisfaction rate within a 5-percentage-point margin of error at 95% confidence.

4.3.2.1 R Functions for Sample Size

# Qualitative Variable and Finite Population
N <- 1000
Z <- 1.96
p <- 0.5
E <- 0.05

sample_size_qual_fin <- function(N, Z, p, E){
  q <- 1 - p
  n <- (N * Z^2 * p * q) / (E^2 * (N - 1) + Z^2 * p * q)
  return(ceiling(n))
}

sample_size_qual_fin(N, Z, p, E)
## [1] 278
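
As a check on the result, the finite-population formula can be solved for \(E\) to obtain the margin of error that a given sample size actually achieves: \(E = Z \sqrt{\frac{p \cdot q}{n} \cdot \frac{N - n}{N - 1}}\). The helper `margin_qual_fin()` below is a hypothetical name introduced for this sketch.

```r
# Margin of error achieved by a sample of size n drawn from a population of
# size N (proportion case), obtained by solving the formula above for E
margin_qual_fin <- function(n, N, Z, p){
  q <- 1 - p
  Z * sqrt((p * q / n) * ((N - n) / (N - 1)))
}

# With 278 respondents the margin stays within the 0.05 target
margin_qual_fin(n = 278, N = 1000, Z = 1.96, p = 0.5)

# With one respondent fewer the margin exceeds 0.05
margin_qual_fin(n = 277, N = 1000, Z = 1.96, p = 0.5)
```

This confirms that 278 is the smallest whole number of respondents that meets the target, which is the reason the functions round up with `ceiling()` rather than to the nearest integer.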

4.4 Exercises

4.4.1 Quantitative Variable and Infinite Population

Exercise 1)

A streaming platform wants to estimate the average daily usage time (in minutes) of its users. Assuming a confidence level of 95%, a standard deviation of 30 minutes, and a margin of error of 5 minutes, what is the required sample size?

In R:

# Quantitative Variable and Infinite Population
Z <- 1.96
sigma <- 30
E <- 5

sample_size_quant_inf <- function(Z, sigma, E){
  n <- (Z^2 * sigma^2) / (E^2)
  return(ceiling(n))
}

sample_size_quant_inf(Z, sigma, E)
## [1] 139

Interpretation: At least 139 users are needed.

Exercise 2)

An e-commerce company wants to estimate the average purchase value (in dollars). Using a confidence level of 99%, a standard deviation of 50, and a margin of error of 10, how many customers should be included in the sample?

In R:

# Quantitative Variable and Infinite Population
Z <- 2.58
sigma <- 50
E <- 10

sample_size_quant_inf <- function(Z, sigma, E){
  n <- (Z^2 * sigma^2) / (E^2)
  return(ceiling(n))
}

sample_size_quant_inf(Z, sigma, E)
## [1] 167

Interpretation: At least 167 customers should be included in the sample to estimate the average purchase value with the specified precision and confidence level.

Exercise 3)

A fitness app wants to estimate the average number of steps per day. Given a confidence level of 90%, a standard deviation of 2000 steps, and a margin of error of 500 steps, determine the minimum sample size required.

In R:

# Quantitative Variable and Infinite Population
Z <- 1.64
sigma <- 2000
E <- 500

sample_size_quant_inf <- function(Z, sigma, E){
  n <- (Z^2 * sigma^2) / (E^2)
  return(ceiling(n))
}

sample_size_quant_inf(Z, sigma, E)
## [1] 44

Interpretation: At least 44 users are needed to estimate the average number of steps per day with the specified precision.

Exercise 4)

A telecom company wants to estimate the average monthly data usage (in GB). With a 95% confidence level, a standard deviation of 8 GB, and a margin of error of 2 GB, what sample size is needed?

In R:

# Quantitative Variable and Infinite Population
Z <- 1.96
sigma <- 8
E <- 2

sample_size_quant_inf <- function(Z, sigma, E){
  n <- (Z^2 * sigma^2) / (E^2)
  return(ceiling(n))
}

sample_size_quant_inf(Z, sigma, E)
## [1] 62

Interpretation: At least 62 customers are needed to estimate the average monthly data usage with the desired confidence and margin of error.

Exercise 5)

A bank wants to estimate the average transaction amount (in dollars). Using a confidence level of 98%, a standard deviation of 120, and a margin of error of 25, calculate the required sample size.

In R:

# Quantitative Variable and Infinite Population
Z <- 2.33
sigma <- 120
E <- 25

sample_size_quant_inf <- function(Z, sigma, E){
  n <- (Z^2 * sigma^2) / (E^2)
  return(ceiling(n))
}

sample_size_quant_inf(Z, sigma, E)
## [1] 126

Interpretation: At least 126 customers are needed to estimate the average transaction amount with the desired precision.
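
Because the function only uses vectorized arithmetic, the five exercises above can also be solved in a single call. The data frame `exercises` is a hypothetical container for the inputs stated in each exercise.

```r
# Infinite-population formula for a quantitative variable (as defined above)
sample_size_quant_inf <- function(Z, sigma, E){
  n <- (Z^2 * sigma^2) / (E^2)
  return(ceiling(n))
}

# Inputs of Exercises 1 to 5, in order
exercises <- data.frame(
  Z     = c(1.96, 2.58, 1.64, 1.96, 2.33),
  sigma = c(30, 50, 2000, 8, 120),
  E     = c(5, 10, 500, 2, 25)
)

# One vectorized call returns the five sample sizes
exercises$n <- sample_size_quant_inf(exercises$Z, exercises$sigma, exercises$E)
exercises
```

The column `n` reproduces the five answers (139, 167, 44, 62, 126). Laying the inputs out side by side also shows that the result depends on the ratio \(\sigma / E\) and on \(Z\), not on the units of measurement.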

4.4.2 Quantitative Variable and Finite Population

Exercise 1)

A retailer has a database of 800 customers and wants to estimate the average monthly spending. With a confidence level of 95%, a standard deviation of 40, and a margin of error of 8, what is the required sample size?

In R:

# Quantitative Variable and Finite Population
N <- 800
Z <- 1.96
sigma <- 40
E <- 8

sample_size_quant_fin <- function(N, Z, sigma, E){
  n <- (N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2)
  return(ceiling(n))
}

sample_size_quant_fin(N, Z, sigma, E)
## [1] 86

Interpretation: At least 86 customers are required to estimate the average monthly spending for this finite population.

Exercise 2)

A SaaS company has 1200 users and wants to estimate the average session duration (in minutes). Using a confidence level of 99%, a standard deviation of 25, and a margin of error of 5 minutes, how large should the sample be?

In R:

# Quantitative Variable and Finite Population
N <- 1200
Z <- 2.58
sigma <- 25
E <- 5

sample_size_quant_fin <- function(N, Z, sigma, E){
  n <- (N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2)
  return(ceiling(n))
}

sample_size_quant_fin(N, Z, sigma, E)
## [1] 147

Interpretation: At least 147 users should be sampled to estimate the average session duration with the specified conditions.

Exercise 3)

A gym has 500 members and wants to estimate the average number of weekly visits. Given a confidence level of 90%, a standard deviation of 3 visits, and a margin of error of 1 visit, determine the sample size.

In R:

# Quantitative Variable and Finite Population
N <- 500
Z <- 1.64
sigma <- 3
E <- 1

sample_size_quant_fin <- function(N, Z, sigma, E){
  n <- (N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2)
  return(ceiling(n))
}

sample_size_quant_fin(N, Z, sigma, E)
## [1] 24

Interpretation: At least 24 members are needed to estimate the average number of weekly visits.

Exercise 4)

A delivery company has 300 drivers and wants to estimate the average delivery time (in minutes). With a confidence level of 95%, a standard deviation of 12 minutes, and a margin of error of 3 minutes, calculate the required sample size.

In R:

# Quantitative Variable and Finite Population
N <- 300
Z <- 1.96
sigma <- 12
E <- 3

sample_size_quant_fin <- function(N, Z, sigma, E){
  n <- (N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2)
  return(ceiling(n))
}

sample_size_quant_fin(N, Z, sigma, E)
## [1] 52

Interpretation: At least 52 drivers are required to estimate the average delivery time with the desired precision.

Exercise 5)

A university has 1500 students and wants to estimate the average study hours per week. Using a confidence level of 98%, a standard deviation of 10 hours, and a margin of error of 2 hours, what is the minimum sample size needed?

In R:

# Quantitative Variable and Finite Population
N <- 1500
Z <- 2.33
sigma <- 10
E <- 2

sample_size_quant_fin <- function(N, Z, sigma, E){
  n <- (N * Z^2 * sigma^2) / (E^2 * (N - 1) + Z^2 * sigma^2)
  return(ceiling(n))
}

sample_size_quant_fin(N, Z, sigma, E)
## [1] 125

Interpretation: At least 125 students are needed to estimate the average study hours per week.

4.4.3 Qualitative Variable and Infinite Population

Exercise 1)

A company wants to estimate the percentage of satisfied customers. Assuming a confidence level of 95%, an estimated proportion of 50%, and a margin of error of 5%, what sample size is required?

In R:

# Qualitative Variable and Infinite Population
Z <- 1.96
p <- 0.5
E <- 0.05

sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}

sample_size_qual_inf(Z, p, E)
## [1] 385

Interpretation: At least 385 respondents are required to estimate the percentage of satisfied customers with the stated confidence level and margin of error.

Exercise 2)

A mobile app company wants to estimate the percentage of users who upgrade to premium. With a confidence level of 99%, an estimated proportion of 40%, and a margin of error of 4%, how many users should be surveyed?

In R:

# Qualitative Variable and Infinite Population
Z <- 2.58
p <- 0.4
E <- 0.04

sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}

sample_size_qual_inf(Z, p, E)
## [1] 999

Interpretation: At least 999 respondents are required to estimate the percentage of users who upgrade to premium with the stated confidence level and margin of error.

Exercise 3)

A bank wants to estimate the percentage of customers using mobile banking services. Given a confidence level of 90%, an estimated proportion of 60%, and a margin of error of 3%, determine the required sample size.

In R:

# Qualitative Variable and Infinite Population
Z <- 1.64
p <- 0.6
E <- 0.03

sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}

sample_size_qual_inf(Z, p, E)
## [1] 718

Interpretation: At least 718 respondents are required to estimate the percentage of customers using mobile banking services with the stated confidence level and margin of error.

Exercise 4)

A retailer wants to estimate the percentage of customers who prefer online shopping. Using a 95% confidence level, an estimated proportion of 70%, and a margin of error of 5%, what is the necessary sample size?

In R:

# Qualitative Variable and Infinite Population
Z <- 1.96
p <- 0.7
E <- 0.05

sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}

sample_size_qual_inf(Z, p, E)
## [1] 323

Interpretation: At least 323 respondents are required to estimate the percentage of customers who prefer online shopping with the stated confidence level and margin of error.

Exercise 5)

A streaming platform wants to estimate the percentage of users who churn. With a confidence level of 98%, an estimated proportion of 30%, and a margin of error of 6%, calculate the sample size.

In R:

# Qualitative Variable and Infinite Population
Z <- 2.33
p <- 0.3
E <- 0.06

sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}

sample_size_qual_inf(Z, p, E)
## [1] 317

Interpretation: At least 317 respondents are required to estimate the percentage of users who churn with the stated confidence level and margin of error.

4.4.4 Qualitative Variable and Finite Population

Exercise 1)

A company has 1000 customers and wants to estimate the percentage of satisfied customers. Assuming a confidence level of 95%, an estimated proportion of 50%, and a margin of error of 5%, what is the required sample size?

In R:

# Qualitative Variable and Finite Population
N <- 1000
Z <- 1.96
p <- 0.5
E <- 0.05

sample_size_qual_fin <- function(N, Z, p, E){
  q <- 1 - p
  n <- (N * Z^2 * p * q) / (E^2 * (N - 1) + Z^2 * p * q)
  return(ceiling(n))
}

sample_size_qual_fin(N, Z, p, E)
## [1] 278

Interpretation: At least 278 respondents are required to estimate the percentage of satisfied customers for this finite population.

Exercise 2)

A bank has 2000 clients and wants to estimate the percentage of customers using online banking. Using a confidence level of 99%, an estimated proportion of 40%, and a margin of error of 4%, how many clients should be included in the sample?

In R:

# Qualitative Variable and Finite Population
N <- 2000
Z <- 2.58
p <- 0.4
E <- 0.04

sample_size_qual_fin <- function(N, Z, p, E){
  q <- 1 - p
  n <- (N * Z^2 * p * q) / (E^2 * (N - 1) + Z^2 * p * q)
  return(ceiling(n))
}

sample_size_qual_fin(N, Z, p, E)
## [1] 667

Interpretation: At least 667 respondents are required to estimate the percentage of customers using online banking for this finite population.

Exercise 3)

A university has 800 students and wants to estimate the percentage of students attending classes regularly. Given a confidence level of 90%, an estimated proportion of 60%, and a margin of error of 3%, determine the sample size.

In R:

# Qualitative Variable and Finite Population
N <- 800
Z <- 1.64
p <- 0.6
E <- 0.03

sample_size_qual_fin <- function(N, Z, p, E){
  q <- 1 - p
  n <- (N * Z^2 * p * q) / (E^2 * (N - 1) + Z^2 * p * q)
  return(ceiling(n))
}

sample_size_qual_fin(N, Z, p, E)
## [1] 379

Interpretation: At least 379 respondents are required to estimate the percentage of students attending classes regularly for this finite population.

Exercise 4)

A retail chain has 600 customers and wants to estimate the percentage using loyalty programs. With a confidence level of 95%, an estimated proportion of 70%, and a margin of error of 5%, what is the required sample size?

In R:

# Qualitative Variable and Finite Population
N <- 600
Z <- 1.96
p <- 0.7
E <- 0.05

sample_size_qual_fin <- function(N, Z, p, E){
  q <- 1 - p
  n <- (N * Z^2 * p * q) / (E^2 * (N - 1) + Z^2 * p * q)
  return(ceiling(n))
}

sample_size_qual_fin(N, Z, p, E)
## [1] 211

Interpretation: At least 211 respondents are required to estimate the percentage of customers using loyalty programs for this finite population.

Exercise 5)

A telecom company has 1500 users and wants to estimate the percentage of customers who will churn. Using a confidence level of 98%, an estimated proportion of 30%, and a margin of error of 6%, calculate the required sample size.

In R:

# Qualitative Variable and Finite Population
N <- 1500
Z <- 2.33
p <- 0.3
E <- 0.06

sample_size_qual_fin <- function(N, Z, p, E){
  q <- 1 - p
  n <- (N * Z^2 * p * q) / (E^2 * (N - 1) + Z^2 * p * q)
  return(ceiling(n))
}

sample_size_qual_fin(N, Z, p, E)
## [1] 262

Interpretation: At least 262 respondents are required to estimate the percentage of customers who will churn for this finite population.
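
The exercises in 4.4.3 and 4.4.4 use the same values of `Z`, `p`, and `E`, so they can be compared directly to see how much a known population size reduces the required sample. The sketch below repeats both functions so it runs on its own; the data frame `comparison` is a hypothetical container for the exercise inputs.

```r
# Formulas for a proportion, infinite and finite population (as defined above)
sample_size_qual_inf <- function(Z, p, E){
  q <- 1 - p
  n <- (Z^2 * p * q) / (E^2)
  return(ceiling(n))
}

sample_size_qual_fin <- function(N, Z, p, E){
  q <- 1 - p
  n <- (N * Z^2 * p * q) / (E^2 * (N - 1) + Z^2 * p * q)
  return(ceiling(n))
}

# Inputs of Exercises 1 to 5 in section 4.4.4, in order
comparison <- data.frame(
  N = c(1000, 2000, 800, 600, 1500),
  Z = c(1.96, 2.58, 1.64, 1.96, 2.33),
  p = c(0.5, 0.4, 0.6, 0.7, 0.3),
  E = c(0.05, 0.04, 0.03, 0.05, 0.06)
)

# Sample size ignoring N, sample size using N, and the difference
comparison$n_infinite <- with(comparison, sample_size_qual_inf(Z, p, E))
comparison$n_finite   <- with(comparison, sample_size_qual_fin(N, Z, p, E))
comparison$reduction  <- comparison$n_infinite - comparison$n_finite
comparison
```

In every row the finite-population result is smaller, and the reduction is largest where the infinite-population sample would be a large fraction of `N` (Exercise 3, where 718 shrinks to 379 for a population of 800).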