Carol Dupino Pereira
NIM: 52250051
Data Science Student, ITSB
R Programming
Data Science
Statistics
Case Study 1
Confidence Interval for Mean, \(\sigma\) Known: An
e-commerce platform wants to estimate the
average number of daily transactions per user after
launching a new feature. Based on large-scale historical data, the
population standard deviation is known.
\[
\begin{eqnarray*}
\sigma &=& 3.2 \quad \text{(population standard deviation)} \\
n &=& 100 \quad \text{(sample size)} \\
\bar{x} &=& 12.6 \quad \text{(sample mean)}
\end{eqnarray*}
\]
Tasks
- Identify the appropriate statistical test and
justify your choice.
- Compute the Confidence Intervals for:
- \(90\%\)
- \(95\%\)
- \(99\%\)
- Create a comparison visualization of the three
confidence intervals.
- Interpret the results in a business analytics context.
Appropriate Statistical Test and Justification
The appropriate statistical test for constructing the confidence
interval for the population mean (\(\mu\)) is the Z-Interval for the Population
Mean.
Justification:
- Population standard deviation (\(\sigma\)) is known: this is the primary condition that dictates the use of the Z-distribution (standard normal distribution) over the \(t\)-distribution.
- Large sample size (\(n\)): with a sample size of \(n=100\) (which is \(\ge 30\)), the Central Limit Theorem ensures that the sampling distribution of the sample mean (\(\bar{x}\)) is approximately normal, even if the underlying population distribution is not perfectly normal. This further validates the use of the Z-statistic.
The formula used is: \[
\text{Confidence Interval} = \bar{x} \pm Z_{\alpha/2} \left(
\frac{\sigma}{\sqrt{n}} \right)
\]
Confidence Interval Computations
Given Parameters: \[\begin{eqnarray*}
\sigma &=& 3.2 \\
n &=& 100 \\
\bar{x} &=& 12.6
\end{eqnarray*}\]
Standard Error (SE): \[\begin{eqnarray*}\text{SE} =
\frac{\sigma}{\sqrt{n}} = \frac{3.2}{\sqrt{100}} = \frac{3.2}{10} = 0.32
\end{eqnarray*}\]
The computed confidence intervals are summarized in the table
below:
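Two-Sided Confidence Intervals Using \(Z_{\alpha/2}\)
| Confidence Level | \(Z_{\alpha/2}\) | Margin of Error | Lower Bound | Upper Bound |
|---|---|---|---|---|
| 90% | 1.645 | 0.526 | 12.074 | 13.126 |
| 95% | 1.960 | 0.627 | 11.973 | 13.227 |
| 99% | 2.576 | 0.824 | 11.776 | 13.424 |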
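The computation above can be reproduced with a short script. This is a minimal sketch using only Python's standard library; the function name `z_interval` is illustrative and not part of the original analysis.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(mean, sigma, n, confidence):
    """Two-sided z-interval for a mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value Z_{alpha/2}
    margin = z * sigma / sqrt(n)             # margin of error = Z * SE
    return z, margin, mean - margin, mean + margin


# Values from Case Study 1: x_bar = 12.6, sigma = 3.2, n = 100
for level in (0.90, 0.95, 0.99):
    z, margin, lower, upper = z_interval(12.6, 3.2, 100, level)
    print(f"{level:.0%}: Z={z:.3f}  ME={margin:.3f}  CI=({lower:.3f}, {upper:.3f})")
```

Computing the critical value with `NormalDist().inv_cdf` instead of typing it in avoids rounding differences between 1.645 and 1.6449.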
Comparison Visualization
The plot below visually compares the three confidence intervals,
demonstrating how the interval width increases as the confidence level
increases.
Interpretation in a Business Analytics Context
The confidence intervals provide a range of plausible values for the
true average number of daily transactions per user (μ) after the new
feature launch.
- 90% Confidence Interval (12.074,13.126):
- We are 90% confident that the true average number of daily
transactions per user is between 12.074 and 13.126. This is the most
precise (narrowest) estimate.
- 95% Confidence Interval (11.973,13.227):
- This is the standard interval used in most scientific and business
contexts. We are 95% confident that the true average is between 11.973
and 13.227. The margin of error is 0.627 transactions.
- 99% Confidence Interval (11.776,13.424):
- This is the most reliable (highest confidence) estimate. We are 99%
confident that the true average is between 11.776 and 13.424.
Business Insight: The key takeaway is the trade-off between
Confidence and Precision:
To be more confident (e.g., 99% confidence), the e-commerce
platform must accept a wider interval (lower precision). This means the
true average could be as low as 11.776 or as high as 13.424.
To have a more precise estimate (e.g., 90% confidence), the
platform must accept a lower confidence level.
Since the lower bound of every interval is at least 11.776 transactions, the
platform can be highly confident that the average transaction rate per user
after the new feature launch is significantly higher than, for
instance, 11.0 (if that were a benchmark). The 95% CI is a good balance,
suggesting that the true average after the launch lies between
approximately 12.0 and 13.2 daily transactions per user.
Case Study 2
Confidence Interval for Mean, \(\sigma\) Unknown: A UX
Research team analyzes task completion time (in
minutes) for a new mobile application. The data are collected
from 12 users:
\[
8.4,\; 7.9,\; 9.1,\; 8.7,\; 8.2,\; 9.0,\;
7.8,\; 8.5,\; 8.9,\; 8.1,\; 8.6,\; 8.3
\]
Tasks:
- Identify the appropriate statistical test and
explain why.
- Compute the Confidence Intervals for:
- \(90\%\)
- \(95\%\)
- \(99\%\)
- Visualize the three intervals on a single plot.
- Explain how sample size and confidence level
influence the interval width.
Identify the appropriate statistical test and explain why.
Appropriate Statistical Test: Confidence Interval for the Mean using
the \(t\)-distribution (often referred
to as a \(t\)-interval).
Explanation: The \(t\)-distribution is the appropriate choice for constructing the confidence interval for the population mean (\(\mu\)) for two primary reasons:
- Population standard deviation is unknown (\(\sigma\) unknown): when \(\sigma\) is unknown, we must use the sample standard deviation (\(s\)) as an estimate.
- Small sample size (\(n < 30\)): the sample size is \(n=12\). For small samples with an unknown population standard deviation, the \(t\)-distribution provides a more accurate model of the sampling distribution of the mean than the standard normal (\(z\)) distribution.
The formula for the confidence interval for the mean is: \[
\bar{x} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}
\]
where \(t_{\alpha/2, n-1}\) is the
critical \(t\)-value with \(n-1\) degrees of freedom.
Compute the Confidence Intervals
Sample Statistics
- Sample Size (n): 12
- Degrees of Freedom (df): 11
- Sample Mean (x̄): 8.4583 minutes
- Sample Standard Deviation (s): 0.4209 minutes
- Standard Error (s / √n): 0.1215 minutes
The computed confidence intervals for the population mean task
completion time (\(\mu\)) are:
Confidence Intervals for Task Completion Time (in Minutes)
| Confidence Level | Lower Bound | Upper Bound |
|---|---|---|
| 90% | 8.2401 | 8.6766 |
| 95% | 8.1909 | 8.7258 |
| 99% | 8.0809 | 8.8357 |
Note: The critical \(t\)-values used in the calculation were:
- \(t_{0.05,\,11}\) (90% CI): 1.7959
- \(t_{0.025,\,11}\) (95% CI): 2.2010
- \(t_{0.005,\,11}\) (99% CI): 3.1058
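As a cross-check, the sketch below recomputes the sample statistics and the three \(t\)-intervals directly from the twelve observations. It assumes only Python's standard library, so the critical values are taken from the note above, not computed; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Task completion times (minutes) for the 12 users in Case Study 2
times = [8.4, 7.9, 9.1, 8.7, 8.2, 9.0, 7.8, 8.5, 8.9, 8.1, 8.6, 8.3]

# Critical t-values for df = 11, copied from the note above
T_CRITICAL = {0.90: 1.7959, 0.95: 2.2010, 0.99: 3.1058}

n = len(times)
x_bar = mean(times)   # sample mean
s = stdev(times)      # sample standard deviation (n - 1 denominator)
se = s / sqrt(n)      # standard error of the mean

# Map each confidence level to its (lower, upper) interval
intervals = {
    level: (x_bar - t_star * se, x_bar + t_star * se)
    for level, t_star in T_CRITICAL.items()
}

print(f"n={n}  mean={x_bar:.4f}  s={s:.4f}  SE={se:.4f}")
for level, (lower, upper) in intervals.items():
    print(f"{level:.0%}: ({lower:.4f}, {upper:.4f})")
```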
Visualize the three intervals on a single plot.
The three confidence intervals are visualized below. The red dot
represents the sample mean (\(\bar{x} \approx 8.46\) minutes), and the
horizontal lines represent the interval for each confidence level.
Factors Influencing Interval Width
The width of a confidence interval is determined by the Margin of
Error (\(ME = t^* \cdot \frac{s}{\sqrt{n}}\)).
- Confidence level
  - Effect: As the confidence level increases (e.g., from 90% to 99%), the interval becomes wider.
  - Logic: To be more certain that the interval contains the true population mean, we must encompass a larger range of possible values. This is reflected in a larger critical value (\(t^*\)).
- Sample size
  - Effect: As the sample size increases, the interval becomes narrower (more precise).
  - Logic: Increasing \(n\) reduces the Standard Error (\(\frac{s}{\sqrt{n}}\)). With more data, our estimate of the mean becomes more stable and reliable, allowing for a tighter range of estimation.
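Both effects can be illustrated numerically. The sketch below is a simplification under stated assumptions: it uses a hypothetical standard deviation of 0.42 and normal critical values, which keeps the code within the standard library. With a \(t\)-interval the critical value also shrinks slightly as \(n\) grows, so the real narrowing is a little stronger than shown.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(critical_value, s, n):
    """Margin of error: critical value times the standard error s / sqrt(n)."""
    return critical_value * s / sqrt(n)


S = 0.42  # hypothetical sample standard deviation, for illustration only

# Effect of confidence level (n fixed at 100): the critical value grows.
for level in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    print(f"level={level:.0%}  ME={margin_of_error(z, S, 100):.4f}")

# Effect of sample size (critical value fixed at 1.96):
# quadrupling n halves the margin of error.
for n in (25, 100, 400):
    print(f"n={n:<4} ME={margin_of_error(1.96, S, n):.4f}")
```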
Case Study 3
Confidence Interval for a Proportion, A/B Testing: A
data science team runs an A/B test on a new
Call-To-Action (CTA) button design. The experiment yields:
\[
\begin{eqnarray*}
n &=& 400 \quad \text{(total users)} \\
x &=& 156 \quad \text{(users who clicked the CTA)}
\end{eqnarray*}
\]
Tasks:
- Compute the sample proportion \(\hat{p}\).
- Compute Confidence Intervals for the proportion at:
- \(90\%\)
- \(95\%\)
- \(99\%\)
- Visualize and compare the three intervals.
- Explain how confidence level affects decision-making in product
experiments.
The given data is: \[
\begin{eqnarray*}
n &=& 400 \quad \text{(total users)} \\
x &=& 156 \quad \text{(users who clicked the CTA)}
\end{eqnarray*}
\]
Compute the Sample Proportion (\(\hat{p}\))
The sample proportion \(\hat{p}\) is
the point estimate for the true population proportion.
\[
\hat{p} = \frac{x}{n} = \frac{156}{400} = 0.3900
\]
The sample proportion of users who clicked the new CTA design is
\(39.00\%\).
Compute Confidence Intervals
The confidence intervals (CIs) are calculated using the formula:
\[
\text{CI} = \hat{p} \pm Z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}(1 -
\hat{p})}{n}}
\]
The standard error (\(\text{SE}\))
is: \[
\text{SE} = \sqrt{\frac{0.3900(1 - 0.3900)}{400}} \approx 0.0244
\]
Confidence Intervals at Various Confidence Levels
| Confidence Level | \(Z_{\alpha/2}\) | Margin of Error | Confidence Interval |
|---|---|---|---|
| 90% | 1.6449 | 0.0401 | [0.3499, 0.4301] |
| 95% | 1.9600 | 0.0478 | [0.3422, 0.4378] |
| 99% | 2.5758 | 0.0628 | [0.3272, 0.4528] |
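The same intervals can be obtained with a few lines of code. This is a minimal sketch of the normal-approximation (Wald) interval given by the formula above; the function name `proportion_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(x, n, confidence):
    """Two-sided normal-approximation (Wald) interval for a proportion."""
    p_hat = x / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_hat - z * se, p_hat + z * se


# Values from Case Study 3: 156 clicks out of 400 users
for level in (0.90, 0.95, 0.99):
    lower, upper = proportion_interval(156, 400, level)
    print(f"{level:.0%}: [{lower:.4f}, {upper:.4f}]")
```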
Visualize and Compare the Three Intervals
The chart below visualizes how the width of the confidence interval
increases with the confidence level. The sample proportion (\(\hat{p}=0.3900\)) is the center of all
three intervals (marked by the diamond).
Explain How Confidence Level Affects Decision-Making
The confidence level directly impacts the precision and certainty of
your result, which is crucial in product experimentation like A/B
testing.
Relationship Between Confidence Level and Interval Width, Precision, and Risk of Error
| Confidence Level | Interval Width | Precision | Certainty | Risk of Error |
|---|---|---|---|---|
| Higher (99%) | Wider | Less precise | More certain | Lower risk of Type I error |
| Lower (90%) | Narrower | More precise | Less certain | Higher risk of Type I error |
- Defining a Winner (Statistical Significance):
In A/B testing, a common decision rule is to declare a “winner” if
the confidence interval of the difference between the two variants does
not include zero.
- Higher Confidence (\(\mathbf{99\%}\)):
The interval is very wide, making it harder to exclude a null
hypothesis (e.g., that the new design is no different than the old one).
It requires a much larger difference in performance to achieve
statistical significance. While this is the safest level, it often leads
to inconclusive results, requiring longer testing times.
- Lower Confidence (\(\mathbf{90\%}\)):
The interval is narrower, making it easier to achieve statistical
significance. However, this increases the risk of a Type I error (a
False Positive)—declaring the new CTA a winner when it is actually no
better, or even worse, than the original.
Most data science and product teams default to a \(95\%\) confidence level (corresponding to a
\(\alpha=0.05\) significance level).
This is considered a good balance, offering a reasonable level of
certainty (\(95\%\) sure the true value
is in the range) without requiring an excessively large sample size or
long test duration that would be needed for a \(99\%\) confidence level.
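The decision rule described above (declare a winner when the interval for the difference excludes zero) can be sketched as follows. The case study only reports the new design, so the control figures here (132 clicks out of 400 users) are hypothetical and chosen purely to illustrate how the verdict can change with the confidence level.

```python
from math import sqrt
from statistics import NormalDist


def diff_interval(x_new, n_new, x_old, n_old, confidence):
    """Normal-approximation interval for the difference p_new - p_old."""
    p_new, p_old = x_new / n_new, x_old / n_old
    se = sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_new - p_old
    return diff - z * se, diff + z * se


def excludes_zero(interval):
    """True when the whole interval lies on one side of zero."""
    lower, upper = interval
    return lower > 0 or upper < 0


# New design: 156/400 (from the case study). Control: 132/400 (HYPOTHETICAL).
for level in (0.90, 0.95, 0.99):
    ci = diff_interval(156, 400, 132, 400, level)
    verdict = "significant" if excludes_zero(ci) else "inconclusive"
    print(f"{level:.0%}: ({ci[0]:+.4f}, {ci[1]:+.4f}) -> {verdict}")
```

With these illustrative numbers, the difference is significant at 90% but inconclusive at 95% and 99%, which is the trade-off the paragraphs above describe.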
Case Study 4
Precision Comparison (Z-Test vs t-Test): Two data
teams measure API latency (in milliseconds) under
different conditions.
\[\begin{eqnarray*}
\text{Team A:} \\
n &=& 36 \quad \text{(sample size)} \\
\bar{x} &=& 210 \quad \text{(sample mean)} \\
\sigma &=& 24 \quad \text{(known population standard deviation)}
\\[6pt]
\text{Team B:} \\
n &=& 36 \quad \text{(sample size)} \\
\bar{x} &=& 210 \quad \text{(sample mean)} \\
s &=& 24 \quad \text{(sample standard deviation)}
\end{eqnarray*}\]
Tasks
- Identify the statistical test used by each team.
- Compute Confidence Intervals for 90%, 95%, and
99%.
- Create a visualization comparing all intervals.
- Explain why the interval widths differ, even with
similar data.
Statistical Test Identification
The choice of statistical test for the mean depends on whether the
population standard deviation (\(\sigma\)) is known and the sample size
(\(n\)).
Choice of Statistical Test Based on Standard Deviation Information
| Team | Standard Deviation Information | Test | Justification |
|---|---|---|---|
| Team A | σ = 24 (Known population SD) | Z-Test (or Z-interval) | Since the population standard deviation (σ) is known and the sample size (n = 36) is large (n ≥ 30), the Z-distribution is appropriate. |
| Team B | s = 24 (Sample SD) | t-Test (or t-interval) | Since the population standard deviation (σ) is unknown and only the sample standard deviation (s) is available, the t-distribution must be used. |
Confidence Interval Computation
The formula for the Confidence Interval (CI) for the population mean
(\(\mu\)) is:
Team A (Z-Interval): \[
\text{CI} = \bar{x} \pm Z^* \left(\frac{\sigma}{\sqrt{n}}\right)
\]
Team B (t-Interval): \[
\text{CI} = \bar{x} \pm t^* \left(\frac{s}{\sqrt{n}}\right)
\]
Common Parameters: \[\begin{eqnarray*}
\text{Team A:} \\
n &=& 36 \quad \text{(sample size)} \\
\bar{x} &=& 210 \quad \text{(sample mean)} \\
\sigma &=& 24 \quad \text{(known population standard deviation)}
\\[6pt]
\text{Team B:} \\
n &=& 36 \quad \text{(sample size)} \\
\bar{x} &=& 210 \quad \text{(sample mean)} \\
s &=& 24 \quad \text{(sample standard deviation)}
\\[6pt]\end{eqnarray*}\]
Standard Error (SE), identical for both teams because \(\sigma = s = 24\): \[\begin{eqnarray*}\text{SE} =
\frac{24}{\sqrt{36}} = \frac{24}{6} = 4
\end{eqnarray*}\]
Team A (Z-Interval): \(\sigma\) is known. We use the critical
Z-values (\(Z^*\)) for the specified
confidence levels.
Confidence Intervals with Margin of Error (\(Z^* \times 4\))
| Confidence Level | \(Z^*\) | Margin of Error | Confidence Interval | Width |
|---|---|---|---|---|
| 90% | 1.645 | 1.645 × 4 = 6.58 | [203.42, 216.58] | 13.16 |
| 95% | 1.960 | 1.960 × 4 = 7.84 | [202.16, 217.84] | 15.68 |
| 99% | 2.576 | 2.576 × 4 = 10.30 | [199.70, 220.30] | 20.60 |
Team B (t-Interval): \(\sigma\) is unknown. We use the critical
t-values (\(t^*\)) with degrees of
freedom \(df = n-1 = 36-1 = 35\).
Confidence Intervals Using the t-Distribution (df = 35)
| Confidence Level | \(t^*\) | Margin of Error | Confidence Interval | Width |
|---|---|---|---|---|
| 90% | 1.690 | 1.690 × 4 = 6.76 | [203.24, 216.76] | 13.52 |
| 95% | 2.030 | 2.030 × 4 = 8.12 | [201.88, 218.12] | 16.24 |
| 99% | 2.724 | 2.724 × 4 = 10.90 | [199.10, 220.90] | 21.80 |
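The two sets of intervals can be compared side by side in code. In this sketch the Z critical values are computed with the standard library, while the \(t\) critical values for df = 35 are copied from the table above (the standard library has no \(t\)-quantile function); the function name `compare_widths` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

N, X_BAR, SD = 36, 210.0, 24.0
SE = SD / sqrt(N)  # 4.0 for both teams, since sigma = s = 24

# Critical t-values for df = 35, copied from the table above
T_CRITICAL_DF35 = {0.90: 1.690, 0.95: 2.030, 0.99: 2.724}


def compare_widths():
    """Return {level: (z_width, t_width)} for Team A (Z) and Team B (t)."""
    widths = {}
    for level, t_star in T_CRITICAL_DF35.items():
        z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
        widths[level] = (2 * z_star * SE, 2 * t_star * SE)
    return widths


for level, (z_width, t_width) in compare_widths().items():
    print(f"{level:.0%}: Z width={z_width:.2f}  t width={t_width:.2f}  "
          f"extra={t_width - z_width:.2f} ms")
```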
Interval Comparison Visualization
The visualization would show that, for every confidence level, the
t-intervals (Team B) are slightly wider than the Z-intervals
(Team A). All intervals are centered at the sample mean of 210 ms. The
width of the intervals increases as the confidence level increases
(e.g., the 99% interval is widest, the 90% is narrowest).
Explanation of Interval Width Difference
The interval widths differ because of the underlying probability
distributions used: the Standard Normal (Z) Distribution versus the
Student’s t-Distribution.
\(\sigma\) Known (Team A \(\rightarrow\) Z-Test)
The Z-test is used when the population standard deviation (\(\sigma\)) is known.
Since \(\sigma\) is a fixed,
known value, the estimate of the standard error (\(\sigma/\sqrt{n}\)) is highly certain and
does not add extra variability to the analysis.
The critical \(Z^*\) values are
fixed based on the confidence level.
\(\sigma\) Unknown (Team B \(\rightarrow\) t-Test)
The t-test is used when the population standard deviation (\(\sigma\)) is unknown, and we must
substitute the sample standard deviation (\(s\)) as an estimate.
The sample standard deviation (\(s\)) is itself an estimate that varies from
sample to sample. This introduces an extra source of uncertainty into
the standard error estimate.
To account for this added uncertainty, the t-distribution has
heavier tails (more spread out) than the Z-distribution.
This results in larger critical values (\(t^* > Z^*\)) and, consequently, a larger
Margin of Error (ME) and wider confidence intervals for the t-test
compared to the Z-test at the same confidence level.
In summary: The t-test requires a wider interval (is less precise) to
achieve the same confidence level as the Z-test because it must
compensate for the additional uncertainty introduced by estimating the
population standard deviation (\(\sigma\)) with the sample standard
deviation (\(s\)).
Case Study 5
One-Sided Confidence Interval: A Software as
a Service (SaaS) company wants to ensure that at least
70% of weekly active users utilize a premium feature.
From the experiment:
\[
\begin{eqnarray*}
n &=& 250 \quad \text{(total users)} \\
x &=& 185 \quad \text{(active premium users)}
\end{eqnarray*}
\]
Management is only interested in the lower bound of
the estimate.
Tasks:
- Identify the type of Confidence Interval and the
appropriate test.
- Compute the one-sided lower Confidence Interval at:
- \(90\%\)
- \(95\%\)
- \(99\%\)
- Visualize the lower bounds for all confidence levels.
- Determine whether the 70% target is statistically
satisfied.
The given data is: \[\begin{eqnarray*}
n &=& 250 \quad \text{(total users)} \\
x &=& 185 \quad \text{(active premium users)} \\
\hat{p} &=& \frac{x}{n} = \frac{185}{250} = 0.74
\end{eqnarray*}\] The target proportion to ensure is \(p_0 = 0.70\).
The standard error of the sample proportion (\(\hat{p}\)) is: \[
SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} =
\sqrt{\frac{0.74(1-0.74)}{250}} \approx 0.0277
\]
Identify the Type of Confidence Interval and the Appropriate Test
Type of Confidence Interval: One-Sided Lower Confidence Interval for
a Population Proportion. The company is only interested in the lower
bound to ensure the feature usage is at least \(70\%\).
Appropriate Test/Method: The appropriate method is using the Z-test
for a Population Proportion (or the Normal Approximation method for
confidence intervals) because the sample size is large enough to satisfy
the normal approximation conditions (\(n\hat{p} = 185 > 10\) and \(n(1-\hat{p}) = 65 > 10\)).
Compute the One-Sided Lower Confidence Interval
The formula for the one-sided lower confidence bound is: \[
\text{Lower Bound} = \hat{p} - Z_{1-\alpha} \cdot SE
\]
One-Sided Confidence Interval (Lower Bound)
| Confidence Level | \(\alpha\) | \(Z_{1-\alpha}\) | Lower Bound |
|---|---|---|---|
| 90% | 0.10 | 1.282 | 0.7044 |
| 95% | 0.05 | 1.645 | 0.6944 |
| 99% | 0.01 | 2.326 | 0.6755 |
Detailed Results (six decimal places):
| Confidence Level | \(Z_{1-\alpha}\) | Lower Bound |
|---|---|---|
| 90% | 1.281552 | 0.704448 |
| 95% | 1.644854 | 0.694369 |
| 99% | 2.326348 | 0.675463 |
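The lower bounds and the comparison against the target can be sketched in a few lines. This is a minimal example using the normal approximation stated above; the function name `lower_bound` and the constant `TARGET` are illustrative.

```python
from math import sqrt
from statistics import NormalDist

TARGET = 0.70  # management's minimum acceptable proportion


def lower_bound(x, n, confidence):
    """One-sided lower confidence bound for a proportion (normal approx.)."""
    p_hat = x / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    z = NormalDist().inv_cdf(confidence)  # one-sided critical value Z_{1-alpha}
    return p_hat - z * se


# Values from Case Study 5: 185 premium users out of 250
for level in (0.90, 0.95, 0.99):
    bound = lower_bound(185, 250, level)
    status = "target met" if bound >= TARGET else "target not met"
    print(f"{level:.0%}: lower bound = {bound:.4f} -> {status}")
```

Note that the one-sided critical value uses the full \(\alpha\) in one tail (`inv_cdf(confidence)`), unlike the two-sided intervals in the earlier case studies, which split \(\alpha\) across both tails.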
Visualize the Lower Bounds for All Confidence Levels
The following plot illustrates the calculated lower bounds against
the \(70\%\) target. (A bar chart titled
‘One-Sided Lower Confidence Bounds for Premium Feature Usage’ is
displayed. The x-axis shows confidence levels (90%, 95%, 99%), and the
y-axis shows the lower bound. A horizontal dashed red line
indicates the target proportion of 0.70. The bars show the lower bounds,
starting at 0.704 for the 90% level.)
Determine Whether the \(70\%\) Target is Statistically Satisfied
The \(70\%\) target is statistically
satisfied at a given confidence level if the calculated Lower Bound is
\(\geq 0.70\).
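- At \(90\%\) confidence: the lower bound is \(0.7044 \geq 0.70\), so the target is satisfied.
- At \(95\%\) confidence: the lower bound is \(0.6944 < 0.70\), so the target is not satisfied.
- At \(99\%\) confidence: the lower bound is \(0.6755 < 0.70\), so the target is not satisfied.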
Summary: The company can be \(90\%\)
confident that the true proportion of weekly active users utilizing a
premium feature is at least \(70\%\).
However, they cannot make this claim at the stricter \(95\%\) or \(99\%\) confidence levels.
