Rafael Yogi Septiadi Putra

Institut Teknologi Sains Bandung
Major: Data Science
Student ID: 52250019
Lecturer: Bakti Siregar, M.Sc., CSD.
Subject: Statistika Dasar

1 Introduction

Statistical inference is the process of using sample data to make generalizations or inferences about a population. While descriptive statistics summarize the characteristics of a sample, inferential statistics allow us to test hypotheses and estimate population parameters under uncertainty (Casella and Berger 2024).


2 Theoretical Framework

2.1 Statistical Hypotheses

Hypothesis testing starts with two opposing statements:

  • Null Hypothesis (\(H_0\)): Assumes no effect or no difference (\(\mu = \mu_0\)).

  • Alternative Hypothesis (\(H_1\)): Assumes an effect or difference exists (\(\mu \neq \mu_0\), \(\mu > \mu_0\), or \(\mu < \mu_0\)).

The Z-test applies when the population standard deviation (\(\sigma\)) is known and sample size is large (\(n \geq 30\)), invoking the Central Limit Theorem (Casella and Berger 2024).

2.2 The Z-Test Logic

The Z-test is used when:

  1. The population standard deviation (\(\sigma\)) is known.

  2. The sample size is large (\(n \geq 30\)), so that, by the Central Limit Theorem, the sampling distribution of the mean is approximately normal.


3 Key Formulas

3.1 One-Sample Z-Test for Means

This formula is used to compare a sample mean (\(\bar{x}\)) to a hypothesized population mean (\(\mu_0\)).

\[Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}\]

Where:

  • \(Z\): The calculated Z-test statistic.

  • \(\bar{x}\): Sample mean.

  • \(\mu_0\): Population mean (under the null hypothesis).

  • \(\sigma\): Population standard deviation.

  • \(n\): Sample size.

  • \(\sigma / \sqrt{n}\): Standard Error (SE) of the mean.

3.2 Two-Sample Z-Test for Means

Used to compare the means of two independent groups when both population variances are known.

\[Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\]
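As a rough illustration of the two-sample formula, the sketch below evaluates it in Python using only the standard library. The function name `two_sample_z` and the numbers passed to it are hypothetical and do not come from this report; the hypothesized difference \(\mu_1 - \mu_2\) is assumed to default to 0.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, null_diff=0.0):
    """Return (z, two-sided p-value) for two independent samples with known sigmas."""
    se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)   # standard error of the difference
    z = ((mean1 - mean2) - null_diff) / se
    p = 2 * NormalDist().cdf(-abs(z))            # two-tailed p-value
    return z, p

# Hypothetical inputs, chosen only to exercise the formula.
z, p = two_sample_z(mean1=52.0, mean2=50.0, sigma1=6.0, sigma2=8.0, n1=36, n2=64)
print(f"z = {z:.4f}, p = {p:.4f}")
```

With these made-up inputs both variance terms equal 1, so the standard error is \(\sqrt{2}\) and the statistic is easy to check by hand.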

3.3 Confidence Interval for the Mean

Although a confidence interval is not a “test,” it is a core part of inference: it estimates the range in which the true population mean plausibly lies.

\[CI = \bar{x} \pm Z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right)\]

Where:

  • \(Z_{\alpha/2}\): The critical value (e.g., 1.96 for a 95% confidence level).
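The interval formula can be sketched in Python as follows. This is a minimal example that assumes \(\sigma\) is known; the helper name `z_confidence_interval` is illustrative, and the inputs are the figures given later in Case Study 1 (\(\bar{x} = 116\), \(\sigma = 15\), \(n = 64\)).

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Two-sided confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    margin = z_crit * sigma / sqrt(n)             # critical value times standard error
    return xbar - margin, xbar + margin

# Figures from Case Study 1: x-bar = 116, sigma = 15, n = 64.
lower, upper = z_confidence_interval(116, 15, 64)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

The resulting interval runs from roughly 112.3 to 119.7 minutes and does not contain 120, which agrees with the two-tailed test result at \(\alpha = 0.05\) in Case Study 1.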

4 Decision Rules

After calculating the \(Z\) statistic, we either compare it to a critical value or compute a p-value (Ismay and Kim 2025; DScience Labs 2025):

  1. P-value Approach:

    • If \(P \leq \alpha\): Reject \(H_0\) (the result is statistically significant).

    • If \(P > \alpha\): Fail to reject \(H_0\) (insufficient evidence).

  2. Critical Value Approach (Z-score), for a two-tailed test:

    • If \(|Z_{calc}| > |Z_{crit}|\): Reject \(H_0\).

    • Common \(Z_{crit}\) for \(\alpha = 0.05\) (two-tailed): \(\pm 1.96\).
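Both decision rules can be expressed in a few lines of Python. The sketch below is illustrative only: the function names `critical_value` and `decide` are assumptions, and the decision function covers the two-tailed case.

```python
from statistics import NormalDist

def critical_value(alpha=0.05, two_tailed=True):
    """Positive critical z-value for the given significance level."""
    tail = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail)

def decide(z_calc, alpha=0.05):
    """Two-tailed decision using the p-value approach."""
    p_value = 2 * NormalDist().cdf(-abs(z_calc))
    return "Reject H0" if p_value <= alpha else "Fail to reject H0"

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: Z_crit = ±{critical_value(alpha):.3f}")

print(decide(-2.1333))  # statistic from Case Study 1
print(decide(1.0))      # a hypothetical, weaker statistic
```

Comparing \(|Z_{calc}|\) with `critical_value(alpha)` leads to the same decision as the p-value rule, because both are derived from the same normal tail area.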


5 Case Study 1

5.1 One-Sample Z-Test (Statistical Hypotheses)

A digital learning platform claims that the average daily study time of its users is 120 minutes. Based on historical records, the population standard deviation is known to be 15 minutes.

A random sample of 64 users shows an average study time of 116 minutes. In summary: μ₀ = 120, σ = 15, n = 64, x̄ = 116.

5.2 Tasks

  1. Formulate the Null Hypothesis (H₀) and Alternative Hypothesis (H₁).
  2. Identify the appropriate statistical test and justify your choice.
  3. Compute the test statistic and p-value using α=0.05.
  4. State the statistical decision.
  5. Interpret the result in a business analytics context.

Given Parameters:

  • Population Mean (\(\mu_0\)): 120

  • Population Standard Deviation (\(\sigma\)): 15

  • Sample Size (\(n\)): 64

  • Sample Mean (\(\bar{x}\)): 116

  • Significance Level (\(\alpha\)): 0.05

5.2.1 Formulate the Null Hypothesis (\(H_0\)) and Alternative Hypothesis (\(H_1\))

To determine if the actual study time differs from the claim, we define:

  • Null Hypothesis (\(H_0\)): \(\mu = 120\) (The true average daily study time is equal to 120 minutes.)

  • Alternative Hypothesis (\(H_1\)): \(\mu \neq 120\) (The true average daily study time differs from 120 minutes.)

5.2.2 Identify the appropriate statistical test and justify your choice

The appropriate test is the One-Sample Z-Test.

Justification:

  1. Known Population Standard Deviation: The value of \(\sigma\) (15 minutes) is provided based on historical records.

  2. Sample Size: The sample size (\(n = 64\)) is greater than 30, which is large enough for the Central Limit Theorem to apply, so the sampling distribution of the mean is approximately normal.

  3. Independent Observations: The users are assumed to be randomly and independently sampled.

5.2.3 Compute the test statistic and p-value using \(\alpha=0.05\)

We perform the calculation using the following formulas.

Test Statistic (Z-score):

\[Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}\]

P-Value: Since this is a two-tailed test, \(P\text{-value} = 2 \times P(Z < -|Z_{calc}|)\).

Hypothesis Test Results for Mean Daily Study Time

| Parameter | Value |
|---|---|
| Hypothesized Mean (μ₀) | 120 |
| Sample Mean (x̄) | 116 |
| Standard Deviation (σ) | 15 |
| Sample Size (n) | 64 |
| Standard Error | 1.875 |
| Z-score | -2.1333 |
| P-value | 0.0329 |
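The figures in this table can be reproduced with a short script. The report's own computation was done in R and is not shown, so the Python sketch below is only an equivalent calculation using the standard library.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar = 120, 15, 64, 116   # values given in the case study

se = sigma / sqrt(n)                      # standard error of the mean
z = (xbar - mu0) / se                     # one-sample Z statistic
p_value = 2 * NormalDist().cdf(-abs(z))   # two-tailed p-value

print(f"SE = {se:.3f}, Z = {z:.4f}, p = {p_value:.4f}")
```

Running it should give a standard error of 1.875, a Z-score of about -2.1333, and a p-value of about 0.0329, matching the table.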

5.2.4 State the statistical decision

To make a decision, we compare the calculated \(P\text{-value}\) to the significance level \(\alpha\):

  • Calculated \(P\text{-value}\): 0.0329

  • Significance Level (\(\alpha\)): 0.05

Decision Rule: If \(P\text{-value} \leq \alpha\), then Reject \(H_0\).

Result: Since 0.0329 < 0.05, we Reject the Null Hypothesis (\(H_0\)).

5.2.5 Interpret the result in a business analytics context

There is sufficient statistical evidence at the 5% significance level (\(\alpha = 0.05\)) to conclude that the average daily study time for users is not 120 minutes.

Business Implications:

  1. Performance Gap: The sample mean (116 minutes) suggests that users are spending significantly less time on the platform than the marketing or operational claim suggests.

  2. Strategic Adjustment: Management should investigate why engagement is lower than historical benchmarks. This could indicate a need for better content, improved user interface, or more effective notification triggers.

  3. Marketing Accuracy: To maintain transparency and brand trust, the platform may need to adjust its external claims or focus on re-engagement campaigns to bring the average back up to the 120-minute target.


6 Case Study 2

6.1 One-Sample T-Test (σ Unknown, Small Sample)

A UX Research Team investigates whether the average task completion time of a new application differs from 10 minutes.

The following data are collected from 10 users:

Task Completion Times for 10 Participants

| Participant | Completion Time (minutes) |
|---|---|
| 1 | 9.2 |
| 2 | 10.5 |
| 3 | 9.8 |
| 4 | 10.1 |
| 5 | 9.6 |
| 6 | 10.3 |
| 7 | 9.9 |
| 8 | 9.7 |
| 9 | 10.0 |
| 10 | 9.5 |

6.2 Tasks

  1. Define H₀ and H₁ (two-tailed).
  2. Determine the appropriate hypothesis test.
  3. Calculate the t-statistic and p-value at α=0.05.
  4. Make a statistical decision.
  5. Explain how sample size affects inferential reliability.

Data Points (minutes): 9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5

Given Parameters:
  • Hypothesized Mean (\(\mu_0\)): 10.0

  • Significance Level (\(\alpha\)): 0.05

  • Sample Size (\(n\)): 10

6.2.1 Define \(H_0\) and \(H_1\) (two-tailed)

To test if the completion time significantly deviates from the 10-minute benchmark:

  • Null Hypothesis (\(H_0\)): \(\mu = 10\) (The average task completion time is equal to 10 minutes.)

  • Alternative Hypothesis (\(H_1\)): \(\mu \neq 10\) (The average task completion time is not equal to 10 minutes.)

6.2.2 Determine the appropriate hypothesis test

The appropriate test is the One-Sample T-Test.

Justification:
  1. Unknown Population Standard Deviation (\(\sigma\)): We do not have the historical \(\sigma\) for the population; we must estimate it using the sample standard deviation (\(s\)).

  2. Small Sample Size: The sample is small (\(n = 10 < 30\)) and \(\sigma\) is estimated from the data, so the T-distribution is used rather than the Z-distribution.

  3. Normality Assumption: We assume the underlying population of task completion times follows a normal distribution.

6.2.3 Calculate the t-statistic and p-value at \(\alpha=0.05\)

The calculations, performed in R, give the following results:

T-Test Results for Task Completion Time

| Parameter | Value |
|---|---|
| Sample Mean (x̄) | 9.86 |
| Sample Std Dev (s) | 0.3864 |
| T-Statistic | -1.1456 |
| Degrees of Freedom | 9 |
| P-value | 0.2815 |
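The R code behind these results is not included in the report. As a cross-check, the following Python sketch recomputes them with the standard library only. Because the standard library has no t-distribution, the p-value is obtained by numerically integrating the t density with Simpson's rule; the helpers `t_pdf` and `t_two_sided_p` are illustrative names, not part of any library.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (h / 3) * total

times = [9.2, 10.5, 9.8, 10.1, 9.6, 10.3, 9.9, 9.7, 10.0, 9.5]
mu0 = 10.0

n = len(times)
xbar = mean(times)
s = stdev(times)                        # sample standard deviation (n - 1 denominator)
t_stat = (xbar - mu0) / (s / sqrt(n))   # one-sample t statistic
p_value = t_two_sided_p(t_stat, df=n - 1)

print(f"mean = {xbar:.2f}, s = {s:.4f}, t = {t_stat:.4f}, p = {p_value:.4f}")
```

Where SciPy is available, `scipy.stats.ttest_1samp(times, 10)` performs the same test directly.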

6.2.4 Make a statistical decision

We compare the calculated \(P\text{-value}\) to the significance level \(\alpha\):

  • Calculated \(P\text{-value}\): 0.2815

  • Significance Level (\(\alpha\)): 0.05

Result: Since 0.2815 > 0.05, we Fail to Reject the Null Hypothesis (\(H_0\)).

There is no statistically significant evidence to suggest that the average task completion time differs from 10 minutes.

6.2.5 Explain how sample size affects inferential reliability

In inferential statistics, the sample size (\(n\)) plays a critical role in the reliability of the results:
  1. Standard Error: As \(n\) increases, the standard error (\(s / \sqrt{n}\)) decreases. A smaller standard error means the sample mean is a more precise estimate of the population mean.

  2. Statistical Power: Small samples (like \(n=10\)) have lower statistical power, meaning they are less likely to detect a true difference if one exists and are therefore more prone to a Type II Error. This often leads to a “Fail to Reject” decision even when a real but modest difference is present.

  3. T-Distribution Shape: With a small sample size, the T-distribution has “fatter tails” than the normal distribution to account for the uncertainty in estimating \(\sigma\). As \(n\) grows, the T-distribution approaches the Z-distribution (Normal).

  4. Margin of Error: Larger samples yield narrower confidence intervals, providing more “reliability” and certainty when generalizing findings from the sample to the broader user base.
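The effect of \(n\) on the standard error can be seen numerically. The sketch below holds the sample standard deviation fixed at the Case Study 2 value (\(s = 0.3864\)), which is a simplifying assumption, since \(s\) would itself change with a new sample; the larger sample sizes are hypothetical.

```python
from math import sqrt

def standard_error(s, n):
    """Estimated standard error of the mean."""
    return s / sqrt(n)

s = 0.3864  # sample standard deviation reported for Case Study 2

for n in (10, 40, 160):
    print(f"n = {n:>3}: SE = {standard_error(s, n):.4f}")
```

Each fourfold increase in \(n\) halves the standard error, so precision improves with the square root of the sample size rather than in direct proportion to it.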


7 Case Study 3

7.1 Two-Sample T-Test (A/B Testing)

A product analytics team conducts an A/B test to compare the average session duration (minutes) between two versions of a landing page.

A/B Test Results: Session Duration by Version

| Version | Sample Size (n) | Mean (x̄) | Standard Deviation (s) |
|---|---|---|---|
| A | 25 | 4.8 | 1.2 |
| B | 25 | 5.4 | 1.4 |

7.2 Tasks

  1. Formulate the null and alternative hypotheses.
  2. Identify the type of t-test required.
  3. Compute the test statistic and p-value.
  4. Draw a statistical conclusion at α=0.05.
  5. Interpret the result for product decision-making.
Parameters:
  • Significance Level (\(\alpha\)): 0.05

  • Assumption: Independent samples with unequal variances (Welch’s T-test approach).

7.3 Formulate the null and alternative hypotheses

To determine if the change in the landing page affects session duration:
  • Null Hypothesis (\(H_0\)): \(\mu_A = \mu_B\) (There is no difference in the average session duration between Version A and Version B.)

  • Alternative Hypothesis (\(H_1\)): \(\mu_A \neq \mu_B\) (The average session duration differs between Version A and Version B.)

7.3.1 Identify the type of t-test required

The appropriate test is the Independent Two-Sample T-Test.

Justification:
  1. Two Independent Groups: The users seeing Version A are different from the users seeing Version B.

  2. Continuous Data: Session duration is a ratio-scale continuous variable.

  3. Unknown Population Variance: We are using sample standard deviations (\(s_A\) and \(s_B\)) to estimate the population parameters.

  4. Welch’s Adjustment: Given that sample sizes are small and variances may differ, Welch’s T-test is the more robust choice compared to the Pooled-variance T-test.

7.3.2 Compute the test statistic and p-value

The results, calculated in R, are as follows:

Two-Sample T-Test Results (Welch’s T-Test)

| Parameter | Value |
|---|---|
| Version A Mean (x̄₁) | 4.8 |
| Version B Mean (x̄₂) | 5.4 |
| Sample Size (n) | 25 per version |
| Standard Deviation (s) | 1.2 (A), 1.4 (B) |
| Standard Error | 0.3688 |
| T-Statistic | -1.627 |
| Degrees of Freedom | 46.9 |
| P-value | 0.1104 |
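Since only summary statistics are available, Welch's test can be recomputed from them directly. The Python sketch below is a cross-check rather than the report's original R code: it applies the Welch-Satterthwaite formula for the degrees of freedom and integrates the t density numerically for the p-value. The helper names are illustrative.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution (df may be non-integer)."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (h / 3) * total

n_a, mean_a, s_a = 25, 4.8, 1.2   # Version A summary statistics
n_b, mean_b, s_b = 25, 5.4, 1.4   # Version B summary statistics

var_a, var_b = s_a**2 / n_a, s_b**2 / n_b   # per-group variance of the mean
se = sqrt(var_a + var_b)                     # standard error of the difference
t_stat = (mean_a - mean_b) / se
# Welch-Satterthwaite approximation to the degrees of freedom
df = (var_a + var_b) ** 2 / (var_a**2 / (n_a - 1) + var_b**2 / (n_b - 1))
p_value = t_two_sided_p(t_stat, df)

print(f"SE = {se:.4f}, t = {t_stat:.3f}, df = {df:.1f}, p = {p_value:.4f}")
```

The output should agree with the table to the precision shown there.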

7.3.3 Draw a statistical conclusion at \(\alpha=0.05\)

We compare the calculated \(P\text{-value}\) to our significance threshold:

  • Calculated \(P\text{-value}\): 0.1104

  • Significance Level (\(\alpha\)): 0.05

Decision: Since 0.1104 > 0.05, we Fail to Reject the Null Hypothesis (\(H_0\)).

The observed difference in means (4.8 vs 5.4) is not statistically significant at the 5% level.

7.3.4 Interpret the result for product decision-making

From a product management and analytics perspective, the result leads to the following conclusions:

  1. Insufficient Evidence for Change: Although Version B showed a higher mean session duration (5.4 minutes) than Version A (4.8 minutes), we cannot rule out that this difference is due to random chance.

  2. Risk of Implementation: Rolling out Version B based solely on this data risks acting on a false positive, that is, treating a chance difference as a real improvement.

  3. Recommendation:

    • Increase Sample Size: The current sample (\(n=50\) total) might be too small to detect a meaningful effect (low power). It is recommended to run the A/B test longer to collect more data.

    • Review Practical Significance: A 0.6-minute difference might be business-critical. If so, a larger study is justified to verify whether this trend holds.

    • Analyze Sub-segments: Check whether Version B performed better for specific user segments (e.g., mobile vs. desktop) before discarding the design.


8 Case Study 4

8.1 Chi-Square Test of Independence

An e-commerce company examines whether device type is associated with payment method preference.

| Device / Payment | E-Wallet | Credit Card | Cash on Delivery |
|---|---|---|---|
| Mobile | 120 | 80 | 50 |
| Desktop | 60 | 90 | 40 |

8.2 Tasks

  1. State the Null Hypothesis (H₀) and Alternative Hypothesis (H₁).
  2. Identify the appropriate statistical test.
  3. Compute the Chi-Square statistic (χ²).
  4. Determine the p-value at α=0.05.
  5. Interpret the results in terms of digital payment strategy.
Parameters:
  • Significance Level (\(\alpha\)): 0.05

8.2.1 State the Null Hypothesis (\(H_0\)) and Alternative Hypothesis (\(H_1\))

To test the relationship between these two categorical variables:

  • Null Hypothesis (\(H_0\)): Device type and payment method are independent. (There is no association between the device used and the preferred payment method.)

  • Alternative Hypothesis (\(H_1\)): Device type and payment method are dependent. (There is a significant association between the device used and the preferred payment method.)

8.2.2 Identify the appropriate statistical test

The appropriate test is the Chi-Square Test of Independence.

Justification:
  1. Categorical Variables: Both “Device Type” (Mobile, Desktop) and “Payment Method” (E-Wallet, Credit Card, COD) are nominal categorical variables.

  2. Contingency Table: The data is structured in a frequency table (cross-tabulation).

  3. Independence of Observations: Each customer response represents an independent event.

  4. Expected Frequencies: The sample size is large enough such that the expected frequency in each cell is \(\geq 5\).

8.2.3 Compute the Chi-Square statistic (\(\chi^2\))

Representing the data as a matrix in R and running the test gives:

Chi-Square Test Results

| Statistic | Value |
|---|---|
| Chi-Square Statistic (χ²) | 13.7736 |
| Degrees of Freedom | 2 |
| P-value | 0.001021 |
| Significance Level (α) | 0.05 |
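The statistic can also be worked out step by step from the contingency table: expected counts come from the row and column totals, and each cell contributes \((O - E)^2 / E\). The Python sketch below follows that recipe. The p-value line relies on the fact that, for exactly 2 degrees of freedom, the chi-square upper tail equals \(e^{-\chi^2/2}\); this shortcut does not apply to other table sizes.

```python
from math import exp

observed = [
    [120, 80, 50],  # Mobile: E-Wallet, Credit Card, Cash on Delivery
    [60, 90, 40],   # Desktop
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
# Closed form valid only when df == 2.
p_value = exp(-chi2 / 2)

print(f"chi2 = {chi2:.4f}, df = {df}, p = {p_value:.6f}")
```

The smallest expected count is about 38.9 (Desktop, Cash on Delivery), so the requirement that every expected frequency be at least 5 is comfortably met.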

8.2.4 Determine the p-value at \(\alpha=0.05\)

Based on the calculation above:

  • Calculated \(P\text{-value}\): 0.001021

  • Significance Level (\(\alpha\)): 0.05

Decision: Since 0.001021 < 0.05, we Reject the Null Hypothesis (\(H_0\)).

8.2.5 Interpret the results in terms of digital payment strategy

The results indicate a statistically significant association between the device used and the payment method preferred.

Strategic Insights:
  1. Mobile-Wallet Synergy: Mobile users show a much stronger preference for E-Wallets (120 of 250, or 48%) than desktop users (60 of 190, or about 32%). This suggests that “One-tap” or integrated mobile payments are highly effective for the mobile customer segment.

  2. Desktop-Credit Card Preference: Desktop users are more likely to use Credit Cards (90 of 190, or about 47%, versus 80 of 250, or 32%, on mobile). This may be due to the ease of typing long card numbers on a physical keyboard or a perception of higher security for large transactions on a computer.

  3. Strategy Recommendations:

  • Mobile UX: Ensure the E-Wallet checkout flow is the primary option (default) on the mobile app/site to reduce friction.

  • Desktop Marketing: Offer credit card-linked promotions (e.g., cashback or points) specifically to users browsing on desktop.

  • COD Stability: Cash on Delivery remains relatively consistent proportionally, suggesting it serves a specific trust-based segment across both platforms.


9 Case Study 5

9.1 Type I and Type II Errors (Conceptual)

A fintech startup is evaluating a new fraud detection algorithm. The effectiveness of the algorithm is tested using the following hypotheses:

  • \(H_0\): The new algorithm does not reduce fraud (No effect).

  • \(H_1\): The new algorithm reduces fraud (Effective).

9.2 Tasks

  1. Explain a Type I Error (α) in this context.
  2. Explain a Type II Error (β) in this context.
  3. Identify which error is more costly from a business perspective.
  4. Discuss how sample size affects Type II Error.
  5. Explain the relationship between α, β, and statistical power.

9.2.1 Explain a Type I Error (\(\alpha\)) in this context

A Type I Error occurs when we reject a true null hypothesis (a “false positive” from the researcher’s perspective).

  • In this context: The startup concludes that the new algorithm is effective at reducing fraud when, in reality, it is not.

  • Consequence: The company may spend significant resources (money, time, and engineering effort) to deploy an algorithm that provides no actual improvement in security.

9.2.2 Explain a Type II Error (\(\beta\)) in this context

A Type II Error occurs when we fail to reject a false null hypothesis (a “false negative”).

  • In this context: The startup concludes that the new algorithm does not reduce fraud, when in reality, it actually is effective.

  • Consequence: The company misses an opportunity to improve its security posture and continues to lose money to fraudulent transactions that could have been prevented.

9.2.3 Identify which error is more costly from a business perspective

In the context of fraud detection, a Type II Error is generally considered more costly.
  • Rationale: While a Type I Error leads to wasted implementation costs, a Type II Error leads to unmitigated financial loss from fraud. Fraud can result in direct monetary theft, legal liabilities, and a massive loss of customer trust.

  • Risk Appetite: Most fintech companies would prefer to accidentally deploy a “dud” algorithm (Type I) than to dismiss an algorithm that could have stopped a million-dollar security breach (Type II).

9.2.4 Discuss how sample size affects Type II Error

Sample size (\(n\)) has an inverse relationship with the probability of committing a Type II Error (\(\beta\)):
  1. Increased Precision: As the sample size increases, the standard error of the estimate decreases, making the test more sensitive to small differences.

  2. Lowering \(\beta\): A larger sample size provides more evidence. If the algorithm truly works, a larger sample makes it much more likely that the statistical test will detect that effect, thereby reducing the probability of a Type II Error.

  3. Reliability: Small samples often lack the “resolution” to see the impact of the algorithm, leading to a high \(\beta\) (failing to see the truth).
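To make the inverse relationship concrete, the sketch below computes \(\beta\) for a one-sided, one-sample Z-test at several sample sizes. The standardized effect size of 0.2 and the sample sizes are hypothetical values chosen for illustration; the case study itself gives no numbers.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def type_ii_error(effect_size, n, alpha=0.05):
    """Beta for a one-sided, one-sample z-test with a standardized effect size."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    power = STD_NORMAL.cdf(effect_size * sqrt(n) - z_crit)
    return 1 - power

# Hypothetical effect size of 0.2 standard deviations.
for n in (25, 100, 400):
    print(f"n = {n:>3}: beta = {type_ii_error(0.2, n):.3f}")
```

Under these assumptions \(\beta\) falls from roughly 0.74 at \(n = 25\) to below 0.01 at \(n = 400\), which is the pattern described in the list above.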

9.2.5 Explain the relationship between \(\alpha\), \(\beta\), and statistical power

These three concepts are mathematically intertwined:
  • Inverse Relationship (\(\alpha\) vs \(\beta\)): For a fixed sample size and effect size, there is a trade-off. If you make it harder to commit a Type I error (by lowering \(\alpha\) from 0.05 to 0.01), you automatically make it easier to commit a Type II error (\(\beta\) increases), because you are requiring “stronger” evidence to reject \(H_0\).

  • Statistical Power (\(1 - \beta\)): Power is the probability of correctly rejecting a false null hypothesis (detecting an effect that actually exists).

  • The Triad:

    • To increase Power, you can either increase the Sample Size or accept a higher \(\alpha\) (risk of false positive).

    • In business analytics, we aim for high Power (usually 0.80 or higher) to ensure that if our product improvements (like the fraud algorithm) work, our data will actually show it.
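The trade-off can be illustrated numerically. The sketch below assumes a one-sided Z-test and a hypothetical standardized effect size of 0.3 with \(n = 50\); it prints power and \(\beta\) for three values of \(\alpha\), then estimates the sample size needed to reach 0.80 power. The function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def power(effect_size, n, alpha):
    """Power of a one-sided z-test for a standardized effect size."""
    return Z.cdf(effect_size * sqrt(n) - Z.inv_cdf(1 - alpha))

def required_n(effect_size, alpha=0.05, target_power=0.80):
    """Smallest n reaching the target power under the same assumptions."""
    z_sum = Z.inv_cdf(1 - alpha) + Z.inv_cdf(target_power)
    return ceil((z_sum / effect_size) ** 2)

# Hypothetical scenario: effect size 0.3, n = 50.
for alpha in (0.10, 0.05, 0.01):
    pw = power(0.3, 50, alpha)
    print(f"alpha = {alpha:.2f}: power = {pw:.3f}, beta = {1 - pw:.3f}")

print("n needed for 0.80 power at alpha = 0.05:", required_n(0.3))
```

Tightening \(\alpha\) lowers power at a fixed \(n\), and the only way to recover that power without relaxing \(\alpha\) is a larger sample.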


10 Case Study 6

10.1 P-Value and Statistical Decision Making

A churn prediction model evaluation yields the following results:

  • Test statistic = 2.31
  • P-value = 0.021
  • Significance level: α = 0.05

10.2 Tasks

  1. Explain the meaning of the p-value.
  2. Make a statistical decision.
  3. Translate the decision into non-technical language for management.
  4. Discuss the risk if the sample is not representative.
  5. Explain why the p-value does not measure effect size.

10.2.1 Explain the meaning of the p-value

The p-value (0.021) represents the probability of observing a test statistic as extreme as 2.31 (or more extreme) assuming that the null hypothesis (\(H_0\)) is true.
  • In this context, it means that if the churn model or intervention actually had no effect at all, there would be only about a 2.1% chance of seeing results at least this extreme.

  • A low p-value suggests that the observed data is inconsistent with the null hypothesis.
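The report does not say which reference distribution produced the p-value. Assuming a standard normal distribution and a two-sided test (both are assumptions), the reported statistic of 2.31 can be converted to a p-value as follows.

```python
from statistics import NormalDist

z = 2.31  # reported test statistic; a standard normal reference is assumed

p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))  # area in both tails beyond |z|
print(f"two-sided p = {p_two_sided:.3f}")
```

Under those assumptions the result rounds to 0.021, consistent with the reported value.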

10.2.2 Make a statistical decision

To make a decision, we compare the p-value to the pre-determined significance level (\(\alpha\)):

  • Comparison: \(P\text{-value} (0.021) < \alpha (0.05)\).

  • Decision: Reject the Null Hypothesis (\(H_0\)).

The result is considered statistically significant, meaning the evidence is strong enough to conclude that the observed effect is likely not due to random chance.

10.2.3 Translate the decision into non-technical language for management

“Our analysis indicates that the new churn prediction strategy is working. If it had no real effect, results as strong as the ones we observed would turn up only about 2% of the time. Based on this evidence, we can reasonably conclude that the model is effectively identifying or reducing customer churn, and we should proceed with the implementation.”

10.2.4 Discuss the risk if the sample is not representative

If the sample used for the test is not representative of the actual customer population (e.g., it only included long-term loyal customers or users from one specific region), several risks arise:
  1. Sampling Bias: The statistical significance might be ‘true’ for that specific group but fails when applied to the whole company.

  2. Generalization Failure: Management might invest in a global rollout based on these results, only to find the model performs poorly on other customer segments, leading to wasted resources.

  3. Misleading Certainty: We might have high statistical confidence (low p-value) in a result that is fundamentally biased, essentially giving us a “precisely wrong” answer.

10.2.5 Explain why the p-value does not measure effect size

A common misconception is that a smaller p-value means a “bigger” impact. This is incorrect because:
  • P-value measures compatibility with \(H_0\): It tells us how surprising the data would be if there were no effect, not how large that effect is.

  • Sample Size Sensitivity: Even a tiny, practically useless difference (e.g., reducing churn by 0.0001%) can yield a very small p-value if the sample size is large enough.

  • Effect Size: To understand the business impact, we must look at metrics like the magnitude of the churn reduction (e.g., a 5% drop in churn) or Cohen’s d. A p-value of 0.021 tells us the result is statistically significant, but it doesn’t tell us if it will save the company $1,000 or $1,000,000.
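The sample-size sensitivity point can be demonstrated with a small experiment. In the sketch below the observed difference and \(\sigma\) are held fixed at hypothetical values (0.5 and 15) while only \(n\) changes; a Z-test with known \(\sigma\) is assumed.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(diff, sigma, n):
    """Two-sided z-test p-value for a fixed observed difference."""
    z = diff / (sigma / sqrt(n))
    return 2 * NormalDist().cdf(-abs(z))

# Hypothetical: the same 0.5-unit difference observed at growing sample sizes.
for n in (100, 2_500, 10_000):
    print(f"n = {n:>6,}: p = {two_sided_p(0.5, 15, n):.4f}")
```

The effect size never changes, yet the p-value moves from clearly non-significant to well below 0.05 as \(n\) grows, which is why the p-value cannot stand in for a measure of magnitude.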


References

Casella, George, and Roger L. Berger. 2024. Statistical Inference. 2nd ed. New York: Routledge. https://doi.org/10.1201/9781032724546.
DScience Labs. 2025. “Introduction to Statistics: Chapter 9 - Statistical Inference.” 2025. https://bookdown.org/dsciencelabs/intro_statistics/09-Statistical_Inference.html.
Ismay, Chester, and Albert Y. Kim. 2025. Statistical Inference via Data Science: A ModernDive into R and the Tidyverse. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC. https://doi.org/10.1201/9781032724546.