Fityanandra Athar Adyaksa (52250059)


Data Science students at

Enthusiastic about learning

December 21, 2025




Case Study 1

Confidence Interval for Mean, \(\sigma\) Known: An e-commerce platform wants to estimate the average number of daily transactions per user after launching a new feature. Based on large-scale historical data, the population standard deviation is known.

\[ \begin{eqnarray*} \sigma &=& 3.2 \quad \text{(population standard deviation)} \\ n &=& 100 \quad \text{(sample size)} \\ \bar{x} &=& 12.6 \quad \text{(sample mean)} \end{eqnarray*} \]

Tasks

  1. Identify the appropriate statistical test and justify your choice.
  2. Compute the Confidence Intervals for:
    • \(90\%\)
    • \(95\%\)
    • \(99\%\)
  3. Create a comparison visualization of the three confidence intervals.
  4. Interpret the results in a business analytics context.

Calculation

Confidence Interval Formula for Mean (σ Known): \[ CI = \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \] where:

  • \(\bar{x}\): sample mean
  • \(\sigma\): population standard deviation (known)
  • \(n\): sample size
  • \(z_{\alpha/2}\): critical value from the standard normal distribution for confidence level \(1-\alpha\)
  • Standard Error (SE): \(SE = \frac{\sigma}{\sqrt{n}}\)
  • Margin of Error (ME): \(ME = z_{\alpha/2} \cdot SE\)

Solution:

Given: \(\sigma = 3.2,\quad n = 100,\quad \bar{x} = 12.6\)

  1. Calculate Standard Error (SE): \[ SE = \frac{\sigma}{\sqrt{n}} = \frac{3.2}{\sqrt{100}} = \frac{3.2}{10} = 0.32 \]

  2. Determine critical value \(z_{\alpha/2}\):

    • For 90% CI, \(\alpha = 0.10\), \(z_{0.05} = 1.645\)
    • For 95% CI, \(\alpha = 0.05\), \(z_{0.025} = 1.960\)
    • For 99% CI, \(\alpha = 0.01\), \(z_{0.005} = 2.576\)


  3. Calculate Margin of Error (ME) for each confidence level:

    • 90%: \(ME_{90} = 1.645 \times 0.32 = 0.5264\)
    • 95%: \(ME_{95} = 1.960 \times 0.32 = 0.6272\)
    • 99%: \(ME_{99} = 2.576 \times 0.32 = 0.8243\)


  4. Calculate Upper and Lower Confidence Interval bounds:
    • 90% CI: \[ \begin{aligned} \text{Lower} &= \bar{x} - ME_{90} = 12.6 - 0.5264 = 12.0736 \quad (12.07)\\ \text{Upper} &= \bar{x} + ME_{90} = 12.6 + 0.5264 = 13.1264 \quad (13.13) \end{aligned} \]
    • 95% CI: \[ \begin{aligned} \text{Lower} &= 12.6 - 0.6272 = 11.9728 \quad (11.97)\\ \text{Upper} &= 12.6 + 0.6272 = 13.2272 \quad (13.23) \end{aligned} \]
    • 99% CI: \[ \begin{aligned} \text{Lower} &= 12.6 - 0.8243 = 11.7757 \quad (11.78)\\ \text{Upper} &= 12.6 + 0.8243 = 13.4243 \quad (13.42) \end{aligned} \]
# Visualization for Case 1
library(ggplot2)

ci_data1 <- data.frame(
  Level = factor(c("90%", "95%", "99%"), levels = c("90%", "95%", "99%")),
  Mean = 12.6,
  Lower = c(12.074, 11.973, 11.776),
  Upper = c(13.126, 13.227, 13.424)
)
ggplot(ci_data1, aes(x = Level, y = Mean)) +
  geom_point(size = 3, color = "blue") +
  geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.1, linewidth = 1) +
  geom_hline(yintercept = 12.6, linetype = "dashed", alpha = 0.3) +
  labs(title = "Comparison of Confidence Intervals (σ Known)",
       subtitle = "Higher confidence levels result in wider intervals.",
       x = "Confidence Level", y = "Average Daily Transactions per User") +
  theme_minimal()
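
The hand calculation for Case 1 can be cross-checked in base R. The sketch below is one way to do it, assuming only the three inputs given in the case; the variable names (`sigma`, `n`, `xbar`, `conf_levels`, `ci_z`) are illustrative and not part of the original analysis. It takes critical values from `qnorm()` rather than a table, so intermediate figures may differ in the last decimal place.

```r
# Inputs from Case Study 1
sigma <- 3.2   # known population standard deviation
n     <- 100   # sample size
xbar  <- 12.6  # sample mean

conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided critical values z_{alpha/2}, where alpha = 1 - confidence level
z_crit <- qnorm(1 - (1 - conf_levels) / 2)

se <- sigma / sqrt(n)  # standard error
me <- z_crit * se      # margin of error for each confidence level

ci_z <- data.frame(
  Level = sprintf("%.0f%%", conf_levels * 100),
  z     = round(z_crit, 3),
  ME    = round(me, 4),
  Lower = round(xbar - me, 2),
  Upper = round(xbar + me, 2)
)
print(ci_z)
```

Computing the bounds this way also removes the need to hard-code `Lower` and `Upper` in the plotting data frame.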


Business Interpretation:

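With 95% confidence, the team can conclude that the true average number of daily transactions per user lies between 11.97 and 13.23. This range gives a solid basis for judging the success of the new feature. If the business target is 12 transactions/day, the target falls inside the 95% interval: the data are consistent with meeting it, but because the lower bound (11.97) sits just below 12, the team cannot claim at this confidence level that the true mean exceeds the target. The 90% interval (12.07 to 13.13) lies entirely above 12, so the conclusion depends on the confidence level chosen.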
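
To see how the target comparison depends on the confidence level, the short sketch below checks a target of 12 transactions/day against the lower bound of each interval. The target is the hypothetical value used in the paragraph above, not a figure from the data, and the variable names are illustrative.

```r
sigma <- 3.2; n <- 100; xbar <- 12.6   # inputs from Case Study 1
target <- 12                           # hypothetical business target

conf_levels <- c(0.90, 0.95, 0.99)
me    <- qnorm(1 - (1 - conf_levels) / 2) * sigma / sqrt(n)
lower <- xbar - me

target_check <- data.frame(
  Level = sprintf("%.0f%%", conf_levels * 100),
  Lower = round(lower, 2),
  # TRUE when the whole interval lies above the target
  Interval_Above_Target = lower > target
)
print(target_check)
```

Only the 90% interval lies entirely above 12; the 95% and 99% intervals extend below it.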



Case Study 2

Confidence Interval for Mean, \(\sigma\) Unknown: A UX Research team analyzes task completion time (in minutes) for a new mobile application. The data are collected from 12 users:

\[ 8.4,\; 7.9,\; 9.1,\; 8.7,\; 8.2,\; 9.0,\; 7.8,\; 8.5,\; 8.9,\; 8.1,\; 8.6,\; 8.3 \]

Tasks:

  1. Identify the appropriate statistical test and explain why.
  2. Compute the Confidence Intervals for:
    • \(90\%\)
    • \(95\%\)
    • \(99\%\)
  3. Visualize the three intervals on a single plot.
  4. Explain how sample size and confidence level influence the interval width.

Calculation

Confidence Interval Formula for Mean (σ Unknown): \[ CI = \bar{x} \pm t_{\alpha/2, df} \cdot \frac{s}{\sqrt{n}} \]

where:

  • \(\bar{x}\): sample mean
  • \(s\): sample standard deviation
  • \(n\): sample size
  • \(df = n - 1\): degrees of freedom
  • \(t_{\alpha/2, df}\): critical value from the Student’s t-distribution
  • Standard Error (SE): \(SE = \frac{s}{\sqrt{n}}\)
  • Margin of Error (ME): \(ME = t_{\alpha/2, df} \cdot SE\)

Solution:

Data: 8.4, 7.9, 9.1, 8.7, 8.2, 9.0, 7.8, 8.5, 8.9, 8.1, 8.6, 8.3

1. Calculate Descriptive Statistics.

  • Sample size: \(n = 12\)

  • Sample mean (\(\bar{x}\)):

    \[ \bar{x} = \frac{8.4 + 7.9 + \ldots + 8.3}{12} = \frac{101.5}{12} = 8.4583 \quad (8.46) \]

  • Sample standard deviation (\(s\)):

    \[ \begin{aligned} s &= \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}} \\ &= \sqrt{\frac{(8.4-8.4583)^2 + (7.9-8.4583)^2 + \ldots + (8.3-8.4583)^2}{11}} \\ &= \sqrt{\frac{1.949167}{11}} = \sqrt{0.177197} = 0.4209 \quad (0.42) \end{aligned} \]

  • Degrees of freedom: \(df = n - 1 = 12 - 1 = 11\)

2. Calculate Standard Error (SE). \[ SE = \frac{s}{\sqrt{n}} = \frac{0.4209}{\sqrt{12}} = \frac{0.4209}{3.4641} = 0.1215 \]

3. Determine critical value \(t_{\alpha/2, df}\).

  • For 90% CI, \(\alpha=0.10\), \(t_{0.05, 11} = 1.796\)
  • For 95% CI, \(\alpha=0.05\), \(t_{0.025, 11} = 2.201\)
  • For 99% CI, \(\alpha=0.01\), \(t_{0.005, 11} = 3.106\)

4. Calculate Margin of Error (ME).

  • 90%: \(ME_{90} = 1.796 \times 0.1215 = 0.2182\)
  • 95%: \(ME_{95} = 2.201 \times 0.1215 = 0.2674\)
  • 99%: \(ME_{99} = 3.106 \times 0.1215 = 0.3774\)

5. Calculate Upper and Lower CI bounds.

  • 90% CI: \(8.4583 \pm 0.2182 \rightarrow (8.2401, 8.6765) \approx (8.24, 8.68)\)
  • 95% CI: \(8.4583 \pm 0.2674 \rightarrow (8.1909, 8.7257) \approx (8.19, 8.73)\)
  • 99% CI: \(8.4583 \pm 0.3774 \rightarrow (8.0809, 8.8357) \approx (8.08, 8.84)\)
# Visualization for Case 2
ci_data2 <- data.frame(
  Level = factor(c("90%", "95%", "99%"), levels = c("90%", "95%", "99%")),
  Mean = 8.4583,
  Lower = c(8.2401, 8.1909, 8.0809),
  Upper = c(8.6765, 8.7257, 8.8357)
)
ggplot(ci_data2, aes(x = Level, y = Mean)) +
  geom_point(size = 3, color = "darkgreen") +
  geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.1, linewidth = 1, color = "darkgreen") +
  labs(title = "Case 2: CI for Task Completion Time (σ Unknown)",
       subtitle = "n=12: Small sample size and higher confidence widen the interval.",
       x = "Confidence Level", y = "Time (minutes)") +
  theme_minimal() +
  ylim(8.0, 9.0)
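
The Case 2 intervals can also be computed straight from the raw observations, which avoids carrying hand-rounded intermediate values. The following is a minimal sketch in base R; the variable names are illustrative, and the `t.test()` call at the end is included only as an independent check of the 95% interval.

```r
# Task completion times (minutes) from Case Study 2
times <- c(8.4, 7.9, 9.1, 8.7, 8.2, 9.0, 7.8, 8.5, 8.9, 8.1, 8.6, 8.3)

n    <- length(times)
xbar <- mean(times)
s    <- sd(times)      # sample standard deviation (n - 1 in the denominator)
se   <- s / sqrt(n)    # standard error of the mean

conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided critical values from the t-distribution with n - 1 degrees of freedom
t_crit <- qt(1 - (1 - conf_levels) / 2, df = n - 1)
me <- t_crit * se

ci_t <- data.frame(
  Level = sprintf("%.0f%%", conf_levels * 100),
  t     = round(t_crit, 3),
  Lower = round(xbar - me, 4),
  Upper = round(xbar + me, 4)
)
print(ci_t)

# Cross-check the 95% interval with R's built-in one-sample t procedure
ci95_builtin <- as.numeric(t.test(times, conf.level = 0.95)$conf.int)
print(ci95_builtin)
```

The manual 95% bounds and the `t.test()` bounds should agree, since both use the same formula.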


Influence of Sample Size and Confidence Level:

  • Confidence Level: To increase certainty that our interval contains the true population mean, we must widen the interval (e.g., 99% CI is wider than 95% CI).

  • Sample Size: With a small sample (n=12), our estimate of population variability (s) is less precise. The t-distribution, used when σ is unknown, has “heavier tails” than the normal distribution, resulting in larger critical values and a larger margin of error compared to knowing σ (z-test).
    Larger samples reduce the SE (\(s/\sqrt{n}\)), narrowing the interval.
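
Both effects can be illustrated numerically. The sketch below holds the sample standard deviation fixed at the value observed in the 12 observations and recomputes the margin of error for several sample sizes. The sizes in `n_grid` other than 12 are hypothetical, and a real larger sample would of course produce its own value of `s`.

```r
times <- c(8.4, 7.9, 9.1, 8.7, 8.2, 9.0, 7.8, 8.5, 8.9, 8.1, 8.6, 8.3)
s <- sd(times)  # held fixed purely for illustration

n_grid      <- c(12, 30, 100, 400)  # 12 is the actual n; the rest are hypothetical
conf_levels <- c(0.90, 0.95, 0.99)

# Margin of error for every combination of sample size (rows) and level (columns)
half_width <- sapply(conf_levels, function(cl) {
  qt(1 - (1 - cl) / 2, df = n_grid - 1) * s / sqrt(n_grid)
})
dimnames(half_width) <- list(
  paste0("n=", n_grid),
  sprintf("%.0f%%", conf_levels * 100)
)
print(round(half_width, 4))
```

Reading down a column shows the interval narrowing as n grows; reading across a row shows it widening as the confidence level rises.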



Case Study 3

Confidence Interval for a Proportion, A/B Testing: A data science team runs an A/B test on a new Call-To-Action (CTA) button design. The experiment yields:

\[ \begin{eqnarray*} n &=& 400 \quad \text{(total users)} \\ x &=& 156 \quad \text{(users who clicked the CTA)} \end{eqnarray*} \]

Tasks:

  1. Compute the sample proportion \(\hat{p}\).
  2. Compute Confidence Intervals for the proportion at:
    • \(90\%\)
    • \(95\%\)
    • \(99\%\)
  3. Visualize and compare the three intervals.
  4. Explain how confidence level affects decision-making in product experiments.

Calculation

Confidence Interval Formula for a Proportion: \[ CI = \hat{p} \pm z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \] where:

  • \(\hat{p}\): sample proportion (\(\hat{p} = x/n\))
  • \(x\): number of “successes” in the sample
  • \(n\): total sample size
  • \(z_{\alpha/2}\): critical value from the standard normal distribution
  • Standard Error (SE): \(SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
  • Margin of Error (ME): \(ME = z_{\alpha/2} \cdot SE\)

Solution:

Given: \(n = 400\), \(x = 156\)

  1. Calculate Sample Proportion (\(\hat{p}\)): \[ \hat{p} = \frac{x}{n} = \frac{156}{400} = 0.39 \]

  2. Calculate Standard Error (SE): \[ SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.39 \times (1 - 0.39)}{400}} = \sqrt{\frac{0.39 \times 0.61}{400}} = \sqrt{\frac{0.2379}{400}} = \sqrt{0.00059475} = 0.02439 \]

  3. Determine critical value \(z_{\alpha/2}\):

    • For 90% CI, \(z_{0.05} = 1.645\)
    • For 95% CI, \(z_{0.025} = 1.960\)
    • For 99% CI, \(z_{0.005} = 2.576\)


  4. Calculate Margin of Error (ME):

    • 90%: \(ME_{90} = 1.645 \times 0.02439 = 0.04012\)
    • 95%: \(ME_{95} = 1.960 \times 0.02439 = 0.04780\)
    • 99%: \(ME_{99} = 2.576 \times 0.02439 = 0.06283\)


  5. Calculate Upper and Lower CI bounds:

    • 90% CI: \(0.39 \pm 0.0401 \rightarrow (0.3499, 0.4301) \approx (0.350, 0.430)\)
    • 95% CI: \(0.39 \pm 0.0478 \rightarrow (0.3422, 0.4378) \approx (0.342, 0.438)\)
    • 99% CI: \(0.39 \pm 0.0628 \rightarrow (0.3272, 0.4528) \approx (0.327, 0.453)\)
# Visualization for Case 3
ci_data3 <- data.frame(
  Level = factor(c("90%", "95%", "99%"), levels = c("90%", "95%", "99%")),
  Proportion = 0.39,
  Lower = c(0.350, 0.342, 0.327),
  Upper = c(0.430, 0.438, 0.453)
)
ggplot(ci_data3, aes(x = Level, y = Proportion)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.1, linewidth = 1) +
  scale_y_continuous(labels = scales::percent, limits = c(0.30, 0.47)) +
  labs(title = "Case 3: CI for Click Proportion (A/B Test)",
       x = "Confidence Level", y = "Click Proportion (CTR)") +
  theme_minimal()
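
The proportion intervals for Case 3 can be reproduced with a few lines of base R. This is a sketch of the normal-approximation (Wald) formula used above, with illustrative variable names.

```r
# Inputs from Case Study 3
n <- 400   # total users
x <- 156   # users who clicked the CTA

p_hat <- x / n
se    <- sqrt(p_hat * (1 - p_hat) / n)  # standard error of the sample proportion

conf_levels <- c(0.90, 0.95, 0.99)
me <- qnorm(1 - (1 - conf_levels) / 2) * se

ci_p <- data.frame(
  Level = sprintf("%.0f%%", conf_levels * 100),
  ME    = round(me, 4),
  Lower = round(p_hat - me, 3),
  Upper = round(p_hat + me, 3)
)
print(ci_p)
```

Note that R's built-in `prop.test()` reports a score-based interval rather than the Wald interval, so its bounds will differ slightly from these.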


Impact of Confidence Level on Product Decision-Making:

The confidence level reflects the team’s risk tolerance for drawing incorrect conclusions. In A/B testing, a 95% CI (industry standard) means we accept a 5% risk of concluding a difference exists (e.g., new design is better) when it actually doesn’t (Type I Error). If the 95% CI for the difference in proportions between two variants does not include zero, we declare a statistically significant winner. Choosing a 99% CI reduces the chance of false positives but makes detecting real differences harder (increases false negatives). The decision should consider the cost of each error type.
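
The decision rule described above (check whether the interval for the difference excludes zero) can be sketched as follows. Case 3 reports data for only one variant, so the control counts `n_a` and `x_a` below are hypothetical values invented purely for illustration; only `n_b` and `x_b` come from the case.

```r
# Variant B: observed data from Case Study 3
n_b <- 400; x_b <- 156
# Variant A (control): HYPOTHETICAL counts, not from the source data
n_a <- 400; x_a <- 130

p_a <- x_a / n_a
p_b <- x_b / n_b
diff_hat <- p_b - p_a

# Standard error of the difference between two independent proportions
se_diff <- sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

conf_levels <- c(0.90, 0.95, 0.99)
me <- qnorm(1 - (1 - conf_levels) / 2) * se_diff

ci_diff <- data.frame(
  Level = sprintf("%.0f%%", conf_levels * 100),
  Lower = round(diff_hat - me, 4),
  Upper = round(diff_hat + me, 4),
  # TRUE when zero lies outside the interval (a "significant" difference)
  Excludes_Zero = (diff_hat - me > 0) | (diff_hat + me < 0)
)
print(ci_diff)
```

With these made-up control counts the interval excludes zero only at 90%, which shows how the same experiment can lead to a different ship/no-ship call depending on the confidence level the team commits to in advance.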



Case Study 4

Precision Comparison (Z-Test vs t-Test): Two data teams measure API latency (in milliseconds) under different conditions.

\[\begin{eqnarray*} \text{Team A:} \\ n &=& 36 \quad \text{(sample size)} \\ \bar{x} &=& 210 \quad \text{(sample mean)} \\ \sigma &=& 24 \quad \text{(known population standard deviation)} \\[6pt] \text{Team B:} \\ n &=& 36 \quad \text{(sample size)} \\ \bar{x} &=& 210 \quad \text{(sample mean)} \\ s &=& 24 \quad \text{(sample standard deviation)} \end{eqnarray*}\]

Tasks

  1. Identify the statistical test used by each team.
  2. Compute Confidence Intervals for 90%, 95%, and 99%.
  3. Create a visualization comparing all intervals.
  4. Explain why the interval widths differ, even with similar data.

Calculation

Team A (σ known) uses z-test: \[ CI = \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \] Team B (σ unknown) uses t-test: \[ CI = \bar{x} \pm t_{\alpha/2, df} \cdot \frac{s}{\sqrt{n}}, \quad \text{with } df = n-1 \]

Solution:

Given for both teams: \(n = 36\), \(\bar{x} = 210\)

  • Team A: \(\sigma = 24\) (known)
  • Team B: \(s = 24\) (estimated from sample)

1. Calculate Standard Error (SE) for both teams (same). \[ SE = \frac{24}{\sqrt{36}} = \frac{24}{6} = 4 \]

2. Determine critical values.

  • Team A (z-test):
    • 90% CI: \(z_{0.05} = 1.645\)
    • 95% CI: \(z_{0.025} = 1.960\)
    • 99% CI: \(z_{0.005} = 2.576\)
  • Team B (t-test, df=35):
    • 90% CI: \(t_{0.05, 35} \approx 1.690\)
    • 95% CI: \(t_{0.025, 35} \approx 2.030\)
    • 99% CI: \(t_{0.005, 35} \approx 2.724\)

3. Calculate Margin of Error (ME).

  • Team A:
    • 90%: \(ME = 1.645 \times 4 = 6.580\)
    • 95%: \(ME = 1.960 \times 4 = 7.840\)
    • 99%: \(ME = 2.576 \times 4 = 10.304\)
  • Team B:
    • 90%: \(ME = 1.690 \times 4 = 6.760\)
    • 95%: \(ME = 2.030 \times 4 = 8.120\)
    • 99%: \(ME = 2.724 \times 4 = 10.896\)

4. Calculate Confidence Interval.

  • Team A (z-test):
    • 90% CI: \(210 \pm 6.580 \rightarrow (203.420, 216.580)\)
    • 95% CI: \(210 \pm 7.840 \rightarrow (202.160, 217.840)\)
    • 99% CI: \(210 \pm 10.304 \rightarrow (199.696, 220.304)\)
  • Team B (t-test):
    • 90% CI: \(210 \pm 6.760 \rightarrow (203.240, 216.760)\)
    • 95% CI: \(210 \pm 8.120 \rightarrow (201.880, 218.120)\)
    • 99% CI: \(210 \pm 10.896 \rightarrow (199.104, 220.896)\)
# Visualization for Case 4
ci_data4 <- data.frame(
  Team = rep(c("Team A (z, σ known)", "Team B (t, σ unknown)"), each=3),
  Level = rep(c("90%", "95%", "99%"), 2),
  Mean = 210,
  Lower = c(203.42, 202.16, 199.70, 203.24, 201.88, 199.10),
  Upper = c(216.58, 217.84, 220.30, 216.76, 218.12, 220.90)
)
ci_data4$Level <- factor(ci_data4$Level, levels = c("90%", "95%", "99%"))

ggplot(ci_data4, aes(x = Level, y = Mean, color = Team)) +
  geom_point(position = position_dodge(width=0.5)) +
  geom_errorbar(aes(ymin = Lower, ymax = Upper), width=0.2,
                position = position_dodge(width=0.5)) +
  labs(title = "Case 4: CI Precision Comparison - Z-Test vs. T-Test",
       subtitle = "Intervals for t-test are wider due to uncertainty from estimating σ.",
       x = "Confidence Level", y = "API Latency (ms)", color = "Method / Team") +
  theme_minimal()
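
The side-by-side figures for the two teams can be generated rather than typed in. The sketch below assumes the shared inputs from the case and uses illustrative variable names; `sd_value` stands for σ in Team A's calculation and for s in Team B's.

```r
# Shared inputs from Case Study 4
n        <- 36
xbar     <- 210
sd_value <- 24   # sigma for Team A (known), s for Team B (estimated)

se <- sd_value / sqrt(n)

conf_levels <- c(0.90, 0.95, 0.99)
upper_tail  <- 1 - (1 - conf_levels) / 2

me_z <- qnorm(upper_tail) * se           # Team A: standard normal
me_t <- qt(upper_tail, df = n - 1) * se  # Team B: t with 35 degrees of freedom

comparison <- data.frame(
  Level       = sprintf("%.0f%%", conf_levels * 100),
  ME_z        = round(me_z, 3),
  ME_t        = round(me_t, 3),
  Extra_Width = round(2 * (me_t - me_z), 3)  # how much wider Team B's full interval is
)
print(comparison)
```

The `Extra_Width` column quantifies the cost, in milliseconds, of estimating the standard deviation from the sample.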


Why Do Interval Widths Differ? Although the mean, sample size, and standard deviation value (24) are the same, the CI for Team B (t-test) is wider than for Team A (z-test) at every confidence level. This difference occurs because:

  1. Source of Uncertainty: Team B must estimate the population standard deviation (σ) using the sample statistic (s). This estimation introduces additional uncertainty.

  2. Sampling Distribution: To accommodate this extra uncertainty, the t-distribution used by Team B has heavier tails than the standard normal distribution used by Team A. At the same confidence level this gives larger critical values (e.g., 2.030 vs 1.960 for the 95% CI with df = 35) and therefore a larger margin of error. The gap narrows as the degrees of freedom increase.
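
How quickly the t critical value approaches the z critical value can be checked directly. In this sketch the degrees of freedom 11 and 35 correspond to Cases 2 and 4; the other values in `df_grid` are illustrative choices.

```r
df_grid <- c(5, 11, 35, 100, 1000)  # 11 and 35 match Cases 2 and 4; others are illustrative

z_crit <- qnorm(0.975)              # 95% two-sided critical value, standard normal
t_crit <- qt(0.975, df = df_grid)   # same level, t-distribution

crit_table <- data.frame(
  df            = df_grid,
  t_crit        = round(t_crit, 3),
  z_crit        = round(z_crit, 3),
  # Relative widening of the interval caused by using t instead of z
  Percent_Wider = round(100 * (t_crit / z_crit - 1), 1)
)
print(crit_table)
```

The t critical value stays above the z value for every finite df, but the difference becomes negligible for large samples.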



Case Study 5

One-Sided Confidence Interval: A Software as a Service (SaaS) company wants to ensure that at least 70% of weekly active users utilize a premium feature.

From the experiment:

\[ \begin{eqnarray*} n &=& 250 \quad \text{(total users)} \\ x &=& 185 \quad \text{(active premium users)} \end{eqnarray*} \]

Management is only interested in the lower bound of the estimate.

Tasks:

  1. Identify the type of Confidence Interval and the appropriate test.
  2. Compute the one-sided lower Confidence Interval at:
    • \(90\%\)
    • \(95\%\)
    • \(99\%\)
  3. Visualize the lower bounds for all confidence levels.
  4. Determine whether the 70% target is statistically satisfied.

Calculation

One-Sided Lower Bound Formula for a Proportion: \[ LB = \hat{p} - z_{\alpha} \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \] where:

  • \(LB\): One-sided lower bound
  • \(z_{\alpha}\): critical value from the standard normal distribution for area \(\alpha\) in the upper tail. Note this is not \(z_{\alpha/2}\).

Solution: Given: \(n = 250\), \(x = 185\)

  1. Calculate Sample Proportion (\(\hat{p}\)): \[ \hat{p} = \frac{x}{n} = \frac{185}{250} = 0.74 \]

  2. Calculate Standard Error (SE): \[ SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.74 \times 0.26}{250}} = \sqrt{\frac{0.1924}{250}} = \sqrt{0.0007696} = 0.02774 \]

  3. Determine critical value \(z_{\alpha}\) for one-sided CI:

    • For 90% lower bound, \(\alpha = 0.10\), \(z_{0.10} = 1.282\)
    • For 95% lower bound, \(\alpha = 0.05\), \(z_{0.05} = 1.645\)
    • For 99% lower bound, \(\alpha = 0.01\), \(z_{0.01} = 2.326\)


  4. Calculate Lower Bound (LB):
    • 90% LB: \(LB_{90} = 0.74 - (1.282 \times 0.02774) = 0.74 - 0.03556 = 0.70444\)
    • 95% LB: \(LB_{95} = 0.74 - (1.645 \times 0.02774) = 0.74 - 0.04563 = 0.69437\)
    • 99% LB: \(LB_{99} = 0.74 - (2.326 \times 0.02774) = 0.74 - 0.06453 = 0.67547\)
    Rounded:
    • 90% Lower Bound: 0.704 (70.4%)
    • 95% Lower Bound: 0.694 (69.4%)
    • 99% Lower Bound: 0.675 (67.5%)
# Visualization for Case 5
ci_data5 <- data.frame(
  Level = factor(c("90%", "95%", "99%"), levels = c("90%", "95%", "99%")),
  Proportion = 0.74,
  Lower_Bound = c(0.704, 0.694, 0.675)
)
ggplot(ci_data5, aes(x = Level, y = Proportion)) +
  geom_point(size=3, color="purple") +
  geom_segment(aes(xend=Level, y=Lower_Bound, yend=Proportion),
               arrow = arrow(length = unit(0.2, "cm")), linewidth=1, color="purple") +
  geom_hline(yintercept=0.70, linetype="dashed", color="red") +
  geom_text(aes(label=paste0(round(Lower_Bound*100,1),"%")),
            y=ci_data5$Lower_Bound - 0.01, size=3.5) +
  scale_y_continuous(labels=scales::percent, limits=c(0.66, 0.75)) +
  labs(title="Case 5: One-Sided Lower Bound for Premium User Proportion",
       subtitle="Arrow points from lower bound to sample proportion (0.74).\nRed line: business target 70%.",
       x="Confidence Level", y="Proportion of Active Premium Users") +
  theme_minimal()
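
The one-sided lower bounds can be reproduced in base R as sketched below. The key difference from the two-sided cases is that the whole of α sits in one tail, so the critical value is `qnorm(level)` rather than `qnorm(1 - alpha/2)`. Variable names are illustrative.

```r
# Inputs from Case Study 5
n      <- 250
x      <- 185
target <- 0.70   # management's minimum acceptable proportion

p_hat <- x / n
se    <- sqrt(p_hat * (1 - p_hat) / n)

conf_levels <- c(0.90, 0.95, 0.99)

# One-sided critical values z_alpha: all of alpha is placed in a single tail
z_alpha <- qnorm(conf_levels)
lower_bound <- p_hat - z_alpha * se

lb_table <- data.frame(
  Level        = sprintf("%.0f%%", conf_levels * 100),
  z            = round(z_alpha, 3),
  Lower_Bound  = round(lower_bound, 3),
  Meets_Target = lower_bound >= target
)
print(lb_table)
```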


Is the 70% Target Statistically Met? The conclusion depends on the confidence level (or risk tolerance) set by management:

  • At 90% confidence, the lower bound is 70.4%, which is higher than the 70% target. This means we can be 90% confident that the true proportion of users utilizing the premium feature is at least 70.4%, so the target is met.

  • At 95% confidence, the lower bound is 69.4%, which is lower than the 70% target. Therefore, we cannot state with 95% confidence that the target has been met.

  • At 99% confidence, the lower bound drops further to 67.5%, so the target is not demonstrated at this level either.

The company can claim the target is met with 90% confidence. However, if a more stringent standard (95% or 99%) is required, the current data does not provide sufficient evidence to support that claim. The final decision should consider the risk of an erroneous claim (e.g., if the true proportion is actually below 70%).
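
A natural follow-up is to ask at what confidence level the lower bound lands exactly on the 70% target. Under the same normal-approximation formula, this happens when \(z_{\alpha} = (\hat{p} - 0.70)/SE\). The sketch below solves for that level; the names `z_breakeven` and `level_breakeven` are illustrative.

```r
# Inputs from Case Study 5
n      <- 250
x      <- 185
target <- 0.70

p_hat <- x / n
se    <- sqrt(p_hat * (1 - p_hat) / n)

# The lower bound equals the target when z_alpha = (p_hat - target) / se
z_breakeven     <- (p_hat - target) / se
level_breakeven <- pnorm(z_breakeven)  # one-sided confidence level at that z

cat(sprintf("z = %.3f, break-even one-sided confidence level = %.1f%%\n",
            z_breakeven, 100 * level_breakeven))
```

The break-even level falls between 90% and 95%, consistent with the target being met at 90% confidence but not at 95%.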