Carol Dupino Pereira
NIM: 52250051
Mahasiswa Sains Data ITSB
R Programming
Data Science
Statistics
Case Study 1
Confidence Interval for Mean, \(\sigma\) Known: An
e-commerce platform wants to estimate the
average number of daily transactions per user after
launching a new feature. Based on large-scale historical data, the
population standard deviation is known.
\[
\begin{eqnarray*}
\sigma &=& 3.2 \quad \text{(population standard deviation)} \\
n &=& 100 \quad \text{(sample size)} \\
\bar{x} &=& 12.6 \quad \text{(sample mean)}
\end{eqnarray*}
\]
Tasks
- Identify the appropriate statistical test and
justify your choice.
- Compute the Confidence Intervals for:
- \(90\%\)
- \(95\%\)
- \(99\%\)
- Create a comparison visualization of the three
confidence intervals.
- Interpret the results in a business analytics context.
Appropriate
Statistical Test and Justification
The appropriate statistical test for constructing the confidence
interval for the population mean (\(\mu\)) is the Z-Interval for the Population
Mean.
Justification:
Population Standard Deviation (\(\sigma\)) is Known: This is the primary
condition that dictates the use of the Z-distribution (Standard Normal
Distribution) over the \(t\)-distribution.
Large Sample Size (\(n\)): With
a sample size of \(n=100\) (which is
\(\ge 30\)), the Central Limit Theorem
ensures that the sampling distribution of the sample mean (\(\bar{x}\)) is approximately normal, even if
the underlying population distribution is not perfectly normal. This
further validates the use of the Z-statistic.
The formula used is: \[
\text{Confidence Interval} = \bar{x} \pm Z_{\alpha/2} \left(
\frac{\sigma}{\sqrt{n}} \right)
\]
Confidence Interval
Computations
Given Parameters: \[\begin{eqnarray*}
\sigma &=& 3.2 \\
n &=& 100 \\
\bar{x} &=& 12.6
\end{eqnarray*}\]
Standard Error (SE): \[\begin{eqnarray*}\text{SE} =
\frac{\sigma}{\sqrt{n}} = \frac{3.2}{\sqrt{100}} = \frac{3.2}{10} = 0.32
\end{eqnarray*}\]
The computed confidence intervals are summarized in the table
below:
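Two-Sided Confidence Intervals Using \(Z_{\alpha/2}\)

| Confidence Level | \(Z_{\alpha/2}\) | Margin of Error | Lower Bound | Upper Bound |
|:---:|:---:|:---:|:---:|:---:|
| 90% | 1.645 | 0.526 | 12.074 | 13.126 |
| 95% | 1.960 | 0.627 | 11.973 | 13.227 |
| 99% | 2.576 | 0.824 | 11.776 | 13.424 |
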
Comparison
Visualization
The plot below visually compares the three confidence intervals,
demonstrating how the interval width increases as the confidence level
increases.
Interpretation in a
Business Analytics Context
The confidence intervals provide a range of plausible values for the
true average number of daily transactions per user (μ) after the new
feature launch.
- 90% Confidence Interval (12.074,13.126):
- We are 90% confident that the true average number of daily
transactions per user is between 12.074 and 13.126. This is the most
precise (narrowest) estimate.
- 95% Confidence Interval (11.973,13.227):
- This is the standard interval used in most scientific and business
contexts. We are 95% confident that the true average is between 11.973
and 13.227. The margin of error is 0.627 transactions.
- 99% Confidence Interval (11.776,13.424):
- This is the most reliable (highest confidence) estimate. We are 99%
confident that the true average is between 11.776 and 13.424.
Business Insight: The key takeaway is the trade-off between
Confidence and Precision:
To be more confident (e.g., 99% confidence), the e-commerce
platform must accept a wider interval (lower precision). This means the
true average could be as low as 11.776 or as high as 13.424.
To have a more precise estimate (e.g., 90% confidence), the
platform must accept a lower confidence level.
Since even the widest (99%) interval has a lower bound of 11.776
transactions, the platform can be highly confident that the average
transaction rate per user after the launch is higher than, for
instance, 11.0 (if that were a benchmark). The 95% CI is a good balance,
suggesting that the average after the launch likely lies between
approximately 12.0 and 13.2 daily transactions per user.
Case Study 2
Confidence Interval for Mean, \(\sigma\) Unknown: A UX
Research team analyzes task completion time (in
minutes) for a new mobile application. The data are collected
from 12 users:
\[
8.4,\; 7.9,\; 9.1,\; 8.7,\; 8.2,\; 9.0,\;
7.8,\; 8.5,\; 8.9,\; 8.1,\; 8.6,\; 8.3
\]
Tasks:
- Identify the appropriate statistical test and
explain why.
- Compute the Confidence Intervals for:
- \(90\%\)
- \(95\%\)
- \(99\%\)
- Visualize the three intervals on a single plot.
- Explain how sample size and confidence level
influence the interval width.
Identify the
appropriate statistical test and explain why.
Appropriate Statistical Test: Confidence Interval for the Mean using
the \(t\)-distribution (often referred
to as a \(t\)-interval).
Explanation: The \(t\)-distribution
is the appropriate choice for constructing the confidence interval for
the population mean (\(\mu\)) for two
primary reasons:
Population Standard Deviation is Unknown (\(\sigma\) unknown): When \(\sigma\) is unknown, we must use the sample
standard deviation (\(s\)) as an
estimate.
Small Sample Size (\(n <
30\)): The sample size is \(n=12\). For small samples with an unknown
population standard deviation, the \(t\)-distribution provides a more accurate
model of the sampling distribution of the mean than the standard normal
(\(z\)) distribution.
The formula for the confidence interval for the mean is: \[
\bar{x} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}
\]
where \(t_{\alpha/2, n-1}\) is the
critical \(t\)-value with \(n-1\) degrees of freedom.
Compute the
Confidence Intervals
Sample Statistics
- Sample Size (n): 12
- Degrees of Freedom (df): 11
- Sample Mean (x̄): 8.4583 minutes
- Sample Standard Deviation (s): 0.4209 minutes
- Standard Error (s / √n): 0.1215 minutes
The computed confidence intervals for the population mean task
completion time (\(\mu\)) are:
Confidence Intervals for Completion Time (in Minutes)

| Confidence Level | Lower Bound (min) | Upper Bound (min) |
|:---:|:---:|:---:|
| 90% | 8.2401 | 8.6766 |
| 95% | 8.1909 | 8.7258 |
| 99% | 8.0809 | 8.8357 |

Note: The critical t-values used in the calculation were:
- t0.05, 11 (90% CI): 1.7959
- t0.025, 11 (95% CI): 2.2010
- t0.005, 11 (99% CI): 3.1058
Visualize the three
intervals on a single plot.
The three confidence intervals are visualized below. The red dot
represents the sample mean (\(\bar{x} \approx 8.46\) minutes), and the
horizontal lines represent the interval for each confidence level.
Factors Influencing
Interval Width
The width of a confidence interval is determined by the Margin of
Error (\(ME = t^* \cdot \frac{s}{\sqrt{n}}\)).
- Confidence level
  Effect: As the confidence level increases (e.g., from 90% to 99%), the
  interval becomes wider.
  Logic: To be more certain that the interval contains the true
  population mean, we must encompass a larger range of possible values.
  This is reflected in a larger critical value (\(t^*\)).
- Sample size
  Effect: As the sample size increases, the interval becomes narrower
  (more precise).
  Logic: Increasing \(n\) reduces the
  Standard Error (\(\frac{s}{\sqrt{n}}\)). With more data, our
  estimate of the mean becomes more stable and reliable, allowing for a
  tighter range of estimation.
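
The two effects described above can be checked numerically. The following is a minimal sketch in R, assuming for illustration that the sample standard deviation stays fixed at 0.42 minutes while the sample size changes (in practice \(s\) would vary from sample to sample). The names `margin_by_n` and `margin_by_level` are illustrative only.

```r
# Assumption for illustration only: s is held fixed at 0.42 minutes
s    <- 0.42
conf <- 0.95

# Effect of the sample size at a fixed 95% confidence level
sample_sizes <- c(12, 30, 100, 400)
t_star       <- qt(1 - (1 - conf) / 2, df = sample_sizes - 1)
margin_by_n  <- t_star * s / sqrt(sample_sizes)

# Effect of the confidence level at the original sample size n = 12
conf_levels     <- c(0.90, 0.95, 0.99)
margin_by_level <- qt(1 - (1 - conf_levels) / 2, df = 11) * s / sqrt(12)

print(data.frame(n = sample_sizes, margin_of_error = round(margin_by_n, 4)))
print(data.frame(level = conf_levels, margin_of_error = round(margin_by_level, 4)))
```

The first table should shrink as \(n\) grows, and the second should grow with the confidence level, which is the trade-off described in the text.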
Case Study 3
Confidence Interval for a Proportion, A/B Testing: A
data science team runs an A/B test on a new
Call-To-Action (CTA) button design. The experiment yields:
\[
\begin{eqnarray*}
n &=& 400 \quad \text{(total users)} \\
x &=& 156 \quad \text{(users who clicked the CTA)}
\end{eqnarray*}
\]
Tasks:
- Compute the sample proportion \(\hat{p}\).
- Compute Confidence Intervals for the proportion at:
- \(90\%\)
- \(95\%\)
- \(99\%\)
- Visualize and compare the three intervals.
- Explain how confidence level affects decision-making in product
experiments.
The given data is: \[
\begin{eqnarray*}
n &=& 400 \quad \text{(total users)} \\
x &=& 156 \quad \text{(users who clicked the CTA)}
\end{eqnarray*}
\]
Compute the Sample
Proportion (\(\hat{p}\))
The sample proportion \(\hat{p}\) is
the point estimate for the true population proportion.
\[
\hat{p} = \frac{x}{n} = \frac{156}{400} = 0.3900
\]
The sample proportion of users who clicked the new CTA design is
\(39.00\%\).
Compute Confidence
Intervals
The confidence intervals (CIs) are calculated using the formula:
\[
\text{CI} = \hat{p} \pm Z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}(1 -
\hat{p})}{n}}
\]
The standard error (\(\text{SE}\))
is: \[
\text{SE} = \sqrt{\frac{0.3900(1 - 0.3900)}{400}} \approx 0.0244
\]
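Confidence Intervals for Various Confidence Levels

| Confidence Level | \(Z_{\alpha/2}\) | Margin of Error | Confidence Interval |
|:---:|:---:|:---:|:---:|
| 90% | 1.6449 | 0.0401 | [0.3499, 0.4301] |
| 95% | 1.9600 | 0.0478 | [0.3422, 0.4378] |
| 99% | 2.5758 | 0.0628 | [0.3272, 0.4528] |
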
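
As a cross-check on the table, the normal-approximation (Wald) interval can be recomputed from the raw counts. This is a minimal sketch in R using only base functions. The object name `prop_intervals` is illustrative.

```r
# Case Study 3 inputs (from the problem statement)
x <- 156   # users who clicked the CTA
n <- 400   # total users

p_hat <- x / n
se    <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion

conf_levels <- c(0.90, 0.95, 0.99)
z_crit      <- qnorm(1 - (1 - conf_levels) / 2)   # two-sided critical values

prop_intervals <- data.frame(
  level  = paste0(conf_levels * 100, "%"),
  z_crit = round(z_crit, 4),
  margin = round(z_crit * se, 4),
  lower  = round(p_hat - z_crit * se, 4),
  upper  = round(p_hat + z_crit * se, 4)
)
print(prop_intervals)
```

The Wald interval is the method used in this case study. Other interval types (for example, the Wilson interval used by `prop.test`) would give slightly different bounds.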
Visualize and Compare
the Three Intervals
The chart below visualizes how the width of the confidence interval
increases with the confidence level. The sample proportion (\(\hat{p}=0.3900\)) is the center of all
three intervals (marked by the diamond).
Explain How
Confidence Level Affects Decision-Making
The confidence level directly impacts the precision and certainty of
your result, which is crucial in product experimentation like A/B
testing.
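Relationship Between Confidence Level and Interval Width, Precision, and
Error Risk

| Confidence Level | Interval Width | Precision | Certainty | Type I Error Risk |
|:---:|:---:|:---:|:---:|:---:|
| Higher (99%) | Wider | Less precise | More certain | Lower risk of Type I error |
| Lower (90%) | Narrower | More precise | Less certain | Higher risk of Type I error |
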
- Defining a Winner (Statistical Significance):
In A/B testing, a common decision rule is to declare a “winner” if
the confidence interval of the difference between the two variants does
not include zero.
- Higher Confidence (\(\mathbf{99\%}\)):
The interval is very wide, making it harder to exclude a null
hypothesis (e.g., that the new design is no different than the old one).
It requires a much larger difference in performance to achieve
statistical significance. While this is the safest level, it often leads
to inconclusive results, requiring longer testing times.
- Lower Confidence (\(\mathbf{90\%}\)):
The interval is narrower, making it easier to achieve statistical
significance. However, this increases the risk of a Type I error (a
False Positive)—declaring the new CTA a winner when it is actually no
better, or even worse, than the original.
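
The decision rule above can be illustrated with a small sketch in R. The case study only reports the new design (156 clicks out of 400), so the control-group counts below (`x_old`, `n_old`) are hypothetical values invented purely for illustration. The interval is the usual normal-approximation interval for a difference of two independent proportions.

```r
# New CTA design (from the case study)
x_new <- 156; n_new <- 400

# HYPOTHETICAL control group, not part of the original experiment data
x_old <- 128; n_old <- 400

p_new <- x_new / n_new
p_old <- x_old / n_old
diff  <- p_new - p_old

# Standard error of the difference between two independent proportions
se_diff <- sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)

conf_levels <- c(0.90, 0.95, 0.99)
z_crit      <- qnorm(1 - (1 - conf_levels) / 2)

lower <- diff - z_crit * se_diff
upper <- diff + z_crit * se_diff

# A "winner" is declared only when the interval excludes zero
excludes_zero <- lower > 0 | upper < 0

print(data.frame(
  level = paste0(conf_levels * 100, "%"),
  lower = round(lower, 4),
  upper = round(upper, 4),
  excludes_zero = excludes_zero
))
```

With these made-up control numbers, the same observed difference counts as a win at 90% and 95% but not at 99%, which shows how the chosen confidence level alone can change the product decision.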
Most data science and product teams default to a \(95\%\) confidence level (corresponding to an
\(\alpha=0.05\) significance level).
This is considered a good balance, offering a reasonable level of
certainty (the interval procedure captures the true value in \(95\%\) of
repeated samples) without requiring the larger sample size or
longer test duration that would be needed to reach the same precision at a \(99\%\) confidence level.
Case Study 4
Precision Comparison (Z-Test vs t-Test): Two data
teams measure API latency (in milliseconds) under
different conditions.
\[\begin{eqnarray*}
\text{Team A:} \\
n &=& 36 \quad \text{(sample size)} \\
\bar{x} &=& 210 \quad \text{(sample mean)} \\
\sigma &=& 24 \quad \text{(known population standard deviation)}
\\[6pt]
\text{Team B:} \\
n &=& 36 \quad \text{(sample size)} \\
\bar{x} &=& 210 \quad \text{(sample mean)} \\
s &=& 24 \quad \text{(sample standard deviation)}
\end{eqnarray*}\]
Tasks
- Identify the statistical test used by each team.
- Compute Confidence Intervals for 90%, 95%, and
99%.
- Create a visualization comparing all intervals.
- Explain why the interval widths differ, even with
similar data.
Statistical Test
Identification
The choice of statistical test for the mean depends on whether the
population standard deviation (\(\sigma\)) is known and the sample size
(\(n\)).
Choice of Statistical Test Based on Standard Deviation Information

| Team | Standard Deviation Information | Test Used | Justification |
|:---:|:---:|:---:|:---|
| Team A | σ = 24 (Known population SD) | Z-Test (or Z-interval) | Since the population standard deviation (σ) is known and the sample size (n = 36) is large (n ≥ 30), the Z-distribution is appropriate. |
| Team B | s = 24 (Sample SD) | t-Test (or t-interval) | Since the population standard deviation (σ) is unknown and only the sample standard deviation (s) is available, the t-distribution must be used. |

Confidence Interval
Computation
The formula for the Confidence Interval (CI) for the population mean
(\(\mu\)) is:
Team A (Z-Interval): \[
\text{CI} = \bar{x} \pm Z^* \left(\frac{\sigma}{\sqrt{n}}\right)
\]
Team B (t-Interval): \[
\text{CI} = \bar{x} \pm t^* \left(\frac{s}{\sqrt{n}}\right)
\]
Common Parameters: \[\begin{eqnarray*}
\text{Team A:} \\
n &=& 36 \quad \text{(sample size)} \\
\bar{x} &=& 210 \quad \text{(sample mean)} \\
\sigma &=& 24 \quad \text{(known population standard deviation)}
\\[6pt]
\text{Team B:} \\
n &=& 36 \quad \text{(sample size)} \\
\bar{x} &=& 210 \quad \text{(sample mean)} \\
s &=& 24 \quad \text{(sample standard deviation)}
\\[6pt]\end{eqnarray*}\]
Standard Error (SE): \[\begin{eqnarray*}\text{SE} =
\frac{\sigma}{\sqrt{n}} = \frac{24}{\sqrt{36}} = \frac{24}{6} = 4
\end{eqnarray*}\]
Team A (Z-Interval): \(\sigma\) is known. We use the critical
Z-values (\(Z^*\)) for the specified
confidence levels.
Confidence Intervals with Margin of Error (Z* × 4)

| Confidence Level | \(Z^*\) | Margin of Error | Confidence Interval | Width |
|:---:|:---:|:---:|:---:|:---:|
| 90% | 1.645 | 1.645 × 4 = 6.58 | [203.42, 216.58] | 13.16 |
| 95% | 1.960 | 1.960 × 4 = 7.84 | [202.16, 217.84] | 15.68 |
| 99% | 2.576 | 2.576 × 4 = 10.30 | [199.70, 220.30] | 20.60 |

Team B (t-Interval): \(\sigma\) is unknown. We use the critical
t-values (\(t^*\)) with degrees of
freedom (\(df\)) \(=n-1=36-1=35\).
Confidence Intervals Using the t-Distribution (df = 35)

| Confidence Level | \(t^*\) | Margin of Error | Confidence Interval | Width |
|:---:|:---:|:---:|:---:|:---:|
| 90% | 1.690 | 1.690 × 4 = 6.76 | [203.24, 216.76] | 13.52 |
| 95% | 2.030 | 2.030 × 4 = 8.12 | [201.88, 218.12] | 16.24 |
| 99% | 2.724 | 2.724 × 4 = 10.90 | [199.10, 220.90] | 21.80 |

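
The two sets of intervals can be reproduced side by side. This is a minimal sketch in R, where `sd_value` stands for \(\sigma\) for Team A and \(s\) for Team B (both equal 24 in this case study). Because it uses unrounded critical values, the widths may differ from the tables in the last digit.

```r
# Shared inputs for both teams (from the problem statement)
n        <- 36
x_bar    <- 210
sd_value <- 24            # sigma for Team A, s for Team B
se       <- sd_value / sqrt(n)

conf_levels <- c(0.90, 0.95, 0.99)
upper_tail  <- 1 - (1 - conf_levels) / 2

z_star <- qnorm(upper_tail)            # Team A: standard normal
t_star <- qt(upper_tail, df = n - 1)   # Team B: Student's t with 35 df

comparison <- data.frame(
  level   = paste0(conf_levels * 100, "%"),
  z_lower = round(x_bar - z_star * se, 2),
  z_upper = round(x_bar + z_star * se, 2),
  t_lower = round(x_bar - t_star * se, 2),
  t_upper = round(x_bar + t_star * se, 2),
  z_width = round(2 * z_star * se, 2),
  t_width = round(2 * t_star * se, 2)
)
print(comparison)
```

At every confidence level the `t_width` column should exceed the `z_width` column, even though the mean, sample size, and standard deviation are the same for both teams.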
Interval Comparison
Visualization
The visualization would show that, for every confidence level:
- The t-intervals (Team B) are slightly wider than the Z-intervals
(Team A).
- All intervals are centered at the sample mean of 210 ms.
- The width of the intervals increases as the confidence level increases
(e.g., the 99% interval is widest, the 90% is narrowest).
Explanation of
Interval Width Difference
The interval widths differ because of the underlying probability
distributions used: the Standard Normal (Z) Distribution versus the
Student’s t-Distribution.
\(\sigma\) Known (Team A \(\rightarrow\) Z-Test)
The Z-test is used when the population standard deviation (\(\sigma\)) is known.
Since \(\sigma\) is a fixed,
known value, the estimate of the standard error (\(\sigma/\sqrt{n}\)) is highly certain and
does not add extra variability to the analysis.
The critical \(Z^*\) values are
fixed based on the confidence level.
\(\sigma\) Unknown (Team B \(\rightarrow\) t-Test)
The t-test is used when the population standard deviation (\(\sigma\)) is unknown, and we must
substitute the sample standard deviation (\(s\)) as an estimate.
The sample standard deviation (\(s\)) is itself an estimate that varies from
sample to sample. This introduces an extra source of uncertainty into
the standard error estimate.
To account for this added uncertainty, the t-distribution has
heavier tails (more spread out) than the Z-distribution.
This results in larger critical values (\(t > Z\)) and, consequently, a larger
Margin of Error (ME) and wider confidence intervals for the t-test
compared to the Z-test at the same confidence level.
In summary: The t-test requires a wider interval (is less precise) to
achieve the same confidence level as the Z-test because it must
compensate for the additional uncertainty introduced by estimating the
population standard deviation (\(\sigma\)) with the sample standard
deviation (\(s\)).
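
The size of this extra uncertainty depends on the degrees of freedom. As a rough illustration in R, the sketch below compares the 95% critical \(t\)-value with the corresponding \(Z\)-value for several degrees of freedom. The chosen `dfs` values are arbitrary examples, with 11 and 35 matching Case Studies 2 and 4.

```r
# 95% two-sided critical values: t for several df versus the fixed z value
dfs    <- c(5, 11, 35, 100, 1000)
t_star <- qt(0.975, df = dfs)
z_star <- qnorm(0.975)

print(data.frame(
  df            = dfs,
  t_star        = round(t_star, 4),
  excess_over_z = round(t_star - z_star, 4)
))
```

The excess over \(Z\) shrinks as the degrees of freedom grow, which is why the t-interval and Z-interval for \(n = 36\) differ only slightly.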
Case Study 5
One-Sided Confidence Interval: A Software as
a Service (SaaS) company wants to ensure that at least
70% of weekly active users utilize a premium feature.
From the experiment:
\[
\begin{eqnarray*}
n &=& 250 \quad \text{(total users)} \\
x &=& 185 \quad \text{(active premium users)}
\end{eqnarray*}
\]
Management is only interested in the lower bound of
the estimate.
Tasks:
- Identify the type of Confidence Interval and the
appropriate test.
- Compute the one-sided lower Confidence Interval at:
- \(90\%\)
- \(95\%\)
- \(99\%\)
- Visualize the lower bounds for all confidence levels.
- Determine whether the 70% target is statistically
satisfied.
The given data is: \[\begin{eqnarray*}
n &=& 250 \quad \text{(total users)} \\
x &=& 185 \quad \text{(active premium users)} \\
\hat{p} &=& \frac{x}{n} = \frac{185}{250} = 0.74
\end{eqnarray*}\] The target proportion to ensure is \(p_0 = 0.70\).
The standard error of the sample proportion (\(\hat{p}\)) is: \[
SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} =
\sqrt{\frac{0.74(1-0.74)}{250}} \approx 0.0277
\]
Identify the Type of
Confidence Interval and the Appropriate Test
Type of Confidence Interval: One-Sided Lower Confidence Interval for
a Population Proportion. The company is only interested in the lower
bound to ensure the feature usage is at least \(70\%\).
Appropriate Test/Method: The appropriate method is using the Z-test
for a Population Proportion (or the Normal Approximation method for
confidence intervals) because the sample size is large enough to satisfy
the normal approximation conditions (\(n\hat{p} = 185 > 10\) and \(n(1-\hat{p}) = 65 > 10\)).
Compute the One-Sided
Lower Confidence Interval
The formula for the one-sided lower confidence bound is: \[
\text{Lower Bound} = \hat{p} - Z_{1-\alpha} \cdot SE
\]
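One-Sided Confidence Interval (Lower Bound)

| Confidence Level | \(\alpha\) | \(Z_{1-\alpha}\) | Lower Bound |
|:---:|:---:|:---:|:---:|
| 90% | 0.10 | 1.282 | 0.7044 |
| 95% | 0.05 | 1.645 | 0.6944 |
| 99% | 0.01 | 2.326 | 0.6755 |

Detailed Results:
One-Sided Confidence Interval (Lower Bound)

| Confidence Level | \(Z_{1-\alpha}\) | Lower Bound |
|:---:|:---:|:---:|
| 90% | 1.281552 | 0.704448 |
| 95% | 1.644854 | 0.694369 |
| 99% | 2.326348 | 0.675463 |
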
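
The lower bounds can be recomputed and compared with the target in a few lines. This is a minimal sketch in R. The column name `meets_target` is illustrative, and the check uses the same rule as this case study (lower bound \(\geq p_0\)).

```r
# Case Study 5 inputs (from the problem statement)
x  <- 185    # active premium users
n  <- 250    # total users
p0 <- 0.70   # target proportion

p_hat <- x / n
se    <- sqrt(p_hat * (1 - p_hat) / n)

# Normal-approximation conditions: both counts should exceed 10
stopifnot(n * p_hat > 10, n * (1 - p_hat) > 10)

conf_levels <- c(0.90, 0.95, 0.99)
z_one_sided <- qnorm(conf_levels)          # z_{1 - alpha}: all of alpha in one tail
lower_bound <- p_hat - z_one_sided * se

result <- data.frame(
  level        = paste0(conf_levels * 100, "%"),
  z            = round(z_one_sided, 6),
  lower_bound  = round(lower_bound, 6),
  meets_target = lower_bound >= p0
)
print(result)
```

Note that the one-sided critical value at 95% (about 1.645) equals the two-sided critical value at 90%, because the full \(\alpha\) sits in a single tail.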
Visualize the Lower
Bounds for All Confidence Levels
The following plot illustrates the calculated lower bounds against
the \(70\%\) target. (Figure: bar chart titled
‘One-Sided Lower Confidence Bounds for Premium Feature Usage’. The
x-axis shows the confidence levels (90%, 95%, 99%), and the y-axis shows
the lower bound of the CI. A horizontal dashed red line indicates the
target proportion of 0.70. The bars show the lower bound at each level,
beginning with 0.704 at 90%.)
Determine Whether the
\(70\%\) Target is Statistically
Satisfied
The \(70\%\) target is statistically
satisfied at a given confidence level if the calculated Lower Bound is
\(\geq 0.70\).
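At \(90\%\) Confidence: the lower bound is 0.7044, which is \(\geq 0.70\), so the target is satisfied.
At \(95\%\) Confidence: the lower bound is 0.6944, which is \(< 0.70\), so the target is not satisfied.
At \(99\%\) Confidence: the lower bound is 0.6755, which is \(< 0.70\), so the target is not satisfied.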
Summary: The company can be \(90\%\)
confident that the true proportion of weekly active users utilizing a
premium feature is at least \(70\%\).
However, they cannot make this claim at the stricter \(95\%\) or \(99\%\) confidence levels.
---
title: "STUDY CASES"            # Main title of the document
subtitle: "Confidence Interval~ Week 13"  # Subtitle or topic for the week
author: "Carol Dupino Pereira"      # Replace with your full name
date:  "`r format(Sys.Date(), '%B %d, %Y')`" # Auto displays the current date
output:                         # Output section defines the format and layout 
  rmdformats::readthedown:      # https://github.com/juba/rmdformats
    self_contained: true        # Embeds all resources (CSS, JS, images) 
    thumbnails: true            # Displays image thumbnails in the doc
    lightbox: true              # Enables click to enlarge images
    gallery: true               # Groups images into an interactive gallery
    number_sections: true       # Automatically numbers all sections
    lib_dir: libs               # Directory where JavaScript/CSS libraries are stored
    df_print: "paged"           # Displays data frames as interactive paged tables
    code_folding: "show"        # Allows folding/unfolding R code blocks 
    code_download: yes          # Adds a button to download all R code
    css : aaaa.css
---


```{r profile, echo=FALSE}
library(htmltools)

HTML('
<div style="width: 400px; height: 250px; background: linear-gradient(135deg, #ffffff 0%, #f0f0f0 100%); border: 2px solid #2c3e50; border-radius: 15px; box-shadow: 0 8px 20px rgba(0,0,0,0.15); padding: 20px; display: flex; align-items: center; gap: 20px; font-family: Arial, sans-serif; margin: 20px auto; overflow: hidden;">
  <div style="flex-shrink: 0;">
    <img src="foto_1.jpg" style="width: 120px; height: 120px; border-radius: 10px; object-fit: cover; border: 4px solid #2c3e50; box-shadow: 0 4px 10px rgba(0,0,0,0.2);">
  </div>
  <div style="flex: 1; overflow: hidden;">
    <div style="background: #ecf0f1; border: 1px solid #bdc3c7; border-radius: 8px; padding: 8px; margin-bottom: 10px;">
      <h1 style="color: #2c3e50; font-size: 18px; margin: 0 0 5px 0; font-weight: bold;">Carol Dupino Pereira</h1>
      <h2 style="color: #34495e; font-size: 14px; margin: 0; font-weight: normal;">NIM: 52250051</h2>
    </div>
    <p style="color: #7f8c8d; font-size: 12px; margin: 0 0 15px 0;">Mahasiswa Sains Data ITSB</p>
    
    <div style="display: flex; flex-wrap: wrap; gap: 8px;">
      <span style="background: #3498db; color: white; padding: 4px 10px; border-radius: 15px; font-size: 10px; font-weight: bold;">R Programming</span>
      <span style="background: #e74c3c; color: white; padding: 4px 10px; border-radius: 15px; font-size: 10px; font-weight: bold;">Data Science</span>
      <span style="background: #2ecc71; color: white; padding: 4px 10px; border-radius: 15px; font-size: 10px; font-weight: bold;">Statistics</span>
    </div>
  </div>
</div>
')
```

---

##   Case Study 1

**Confidence Interval for Mean, $\sigma$ Known:** An **e-commerce platform** wants to estimate the **average number of daily transactions per user** after launching a new feature. Based on large-scale historical data, the **population standard deviation** is known.

$$
\begin{eqnarray*}
\sigma &=& 3.2 \quad \text{(population standard deviation)} \\
n &=& 100 \quad \text{(sample size)} \\
\bar{x} &=& 12.6 \quad \text{(sample mean)}
\end{eqnarray*}
$$

**Tasks**

1. Identify the **appropriate statistical test** and justify your choice.
2. Compute the Confidence Intervals for:
   - $90\%$
   - $95\%$
   - $99\%$
3. Create a **comparison visualization** of the three confidence intervals.
4. Interpret the results in a business analytics context.

---

###    Appropriate Statistical Test and Justification

The appropriate statistical test for constructing the confidence interval for the population mean ($\mu$) is the Z-Interval for the Population Mean.

Justification:

- Population Standard Deviation ($\sigma$) is Known: This is the primary condition that dictates the use of the Z-distribution (Standard Normal Distribution) over the $t$-distribution.

- Large Sample Size ($n$): With a sample size of $n=100$ (which is $\ge 30$), the Central Limit Theorem ensures that the sampling distribution of the sample mean ($\bar{x}$) is approximately normal, even if the underlying population distribution is not perfectly normal. This further validates the use of the Z-statistic.

The formula used is:
$$
\text{Confidence Interval} = \bar{x} \pm Z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right)
$$

###   Confidence Interval Computations

Given Parameters:
\begin{eqnarray*}
\sigma &=& 3.2 \\
n &=& 100 \\
\bar{x} &=& 12.6
\end{eqnarray*}

Standard Error (SE):
\begin{eqnarray*}\text{SE} = \frac{\sigma}{\sqrt{n}} = \frac{3.2}{\sqrt{100}} = \frac{3.2}{10} = 0.32
\end{eqnarray*}

The computed confidence intervals are summarized in the table below:


```{r,echo=FALSE,message=FALSE,warning=FALSE}
# Build the two-sided confidence interval table
ci_table_two_sided <- data.frame(
  Confidence_Level = c("90%", "95%", "99%"),
  Z_alpha_over_2 = c(1.645, 1.960, 2.576),
  Margin_of_Error = c(0.526, 0.627, 0.824),
  Lower_Bound = c(12.074, 11.973, 11.776),
  Upper_Bound = c(13.126, 13.227, 13.424)
)

# Display the table
library(knitr)

kable(
  ci_table_two_sided,
  caption = "Two-Sided Confidence Intervals Using Zα/2",
  align = "c"
)

```
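
The values in the table can also be derived rather than typed in. The following is a minimal sketch in base R that recomputes the critical values, margins of error, and bounds from the three given inputs. The object name `z_intervals` is illustrative and not part of the original report.

```r
# Case Study 1 inputs (from the problem statement)
sigma <- 3.2    # known population standard deviation
n     <- 100    # sample size
x_bar <- 12.6   # sample mean

conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided critical values z_{alpha/2} and the standard error
z_crit <- qnorm(1 - (1 - conf_levels) / 2)
se     <- sigma / sqrt(n)

z_intervals <- data.frame(
  level  = paste0(conf_levels * 100, "%"),
  z_crit = round(z_crit, 3),
  margin = round(z_crit * se, 3),
  lower  = round(x_bar - z_crit * se, 3),
  upper  = round(x_bar + z_crit * se, 3)
)
print(z_intervals)
```

Computing the table this way keeps the reported bounds consistent with the inputs if any of them change.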

###    Comparison Visualization

The plot below visually compares the three confidence intervals, demonstrating how the interval width increases as the confidence level increases.

```{r,echo=FALSE,message=FALSE,warning=FALSE}
# Load necessary libraries
library(plotly)

# Provided data
conf_levels <- c(0.90, 0.95, 0.99)
z_values <- c(1.645, 1.960, 2.576)
me_values <- c(0.526, 0.627, 0.824)
lb_values <- c(12.074, 11.973, 11.776)
ub_values <- c(13.126, 13.227, 13.424)
mean_val <- (lb_values + ub_values) / 2  # All approximately 12.6

# Define x values for distributions (standardized scale)
x <- seq(-4, 4, 0.01)

# Z-distribution (Standard Normal)
y_z <- dnorm(x)
fig_z <- plot_ly(x = x, y = y_z, type = 'scatter', mode = 'lines', name = 'Z-Distribution (Standard Normal)') %>%
  layout(title = 'Z-Distribution with Critical Values',
         xaxis = list(title = 'Z-Score'),
         yaxis = list(title = 'Density'))

# Add critical values for Z-distribution
for (i in 1:3) {
  fig_z <- fig_z %>% 
    add_trace(x = c(-z_values[i], z_values[i]), 
              y = c(0, 0), 
              type = 'scatter', 
              mode = 'markers', 
              marker = list(size = 8, color = 'red'), 
              name = paste0(conf_levels[i]*100, '% Critical Z-Values'))
}

# T-distribution (assuming df = 30 for illustration, as df is not provided)
df <- 30
y_t <- dt(x, df)
t_crit <- qt(1 - (1 - conf_levels)/2, df)  # Critical t-values
fig_t <- plot_ly(x = x, y = y_t, type = 'scatter', mode = 'lines', name = paste('T-Distribution (df =', df, ')')) %>%
  layout(title = paste('T-Distribution (df =', df, ') with Critical Values'),
         xaxis = list(title = 'T-Score'),
         yaxis = list(title = 'Density'))

# Add critical values for T-distribution
for (i in 1:3) {
  fig_t <- fig_t %>% 
    add_trace(x = c(-t_crit[i], t_crit[i]), 
              y = c(0, 0), 
              type = 'scatter', 
              mode = 'markers', 
              marker = list(size = 8, color = 'blue'), 
              name = paste0(conf_levels[i]*100, '% Critical T-Values'))
}

# Confidence Intervals visualization (as error bars on the mean)
fig_ci <- plot_ly(x = paste0(conf_levels*100, '%'), 
                  y = mean_val, 
                  type = 'scatter', 
                  mode = 'markers', 
                  error_y = list(type = 'data', array = me_values), 
                  name = 'Confidence Intervals') %>%
  layout(title = 'Confidence Intervals for Mean',
         xaxis = list(title = 'Confidence Level'),
         yaxis = list(title = 'Mean Estimate'))

# Combine into subplots: Z-dist, T-dist, and CI
fig_combined <- subplot(fig_z, fig_t, fig_ci, nrows = 3, shareX = FALSE, titleY = TRUE) %>%
  layout(title = 'Visualizations of Z-Distribution, T-Distribution, and Confidence Intervals')

# Display the plot
fig_combined
```


###   Interpretation in a Business Analytics Context

The confidence intervals provide a range of plausible values for the true average number of daily transactions per user (μ) after the new feature launch.

1. 90% Confidence Interval (12.074,13.126):

- We are 90% confident that the true average number of daily transactions per user is between 12.074 and 13.126. This is the most precise (narrowest) estimate.

2. 95% Confidence Interval (11.973,13.227):

- This is the standard interval used in most scientific and business contexts. We are 95% confident that the true average is between 11.973 and 13.227. The margin of error is 0.627 transactions.

3. 99% Confidence Interval (11.776,13.424):

- This is the most reliable (highest confidence) estimate. We are 99% confident that the true average is between 11.776 and 13.424.

Business Insight: The key takeaway is the trade-off between Confidence and Precision:

- To be more confident (e.g., 99% confidence), the e-commerce platform must accept a wider interval (lower precision). This means the true average could be as low as 11.776 or as high as 13.424.

- To have a more precise estimate (e.g., 90% confidence), the platform must accept a lower confidence level.

Since even the widest (99%) interval has a lower bound of 11.776 transactions, the platform can be highly confident that the average transaction rate per user after the launch is higher than, for instance, 11.0 (if that were a benchmark). The 95% CI is a good balance, suggesting that the average after the launch likely lies between approximately 12.0 and 13.2 daily transactions per user.

---


##   Case Study 2 

**Confidence Interval for Mean, $\sigma$ Unknown:** A **UX Research team** analyzes **task completion time (in minutes)** for a new mobile application. The data are collected from **12 users**:

$$
8.4,\; 7.9,\; 9.1,\; 8.7,\; 8.2,\; 9.0,\;
7.8,\; 8.5,\; 8.9,\; 8.1,\; 8.6,\; 8.3
$$

**Tasks:**

1. Identify the **appropriate statistical test** and explain why.
2. Compute the Confidence Intervals for:
   - $90\%$
   - $95\%$
   - $99\%$
3. Visualize the three intervals on a single plot.
4. Explain how **sample size and confidence level** influence the interval width.

---


###    Identify the appropriate statistical test and explain why.

Appropriate Statistical Test: Confidence Interval for the Mean using the $t$-distribution (often referred to as a $t$-interval).

Explanation: The $t$-distribution is the appropriate choice for constructing the confidence interval for the population mean ($\mu$) for two primary reasons:

1. Population Standard Deviation is Unknown ($\sigma$ unknown): When $\sigma$ is unknown, we must use the sample standard deviation ($s$) as an estimate.

2. Small Sample Size ($n < 30$): The sample size is $n=12$. For small samples with an unknown population standard deviation, the $t$-distribution provides a more accurate model of the sampling distribution of the mean than the standard normal ($z$) distribution.

The formula for the confidence interval for the mean is:
$$
\bar{x} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}
$$

where $t_{\alpha/2, n-1}$ is the critical $t$-value with $n-1$ degrees of freedom.

###   Compute the Confidence Intervals


<div class="stat-box">
  <h3>Sample Statistics</h3>
  <ul>
    <li><strong>Sample Size (n):</strong> 12</li>
    <li><strong>Degrees of Freedom (df):</strong> 11</li>
    <li><strong>Sample Mean (x̄):</strong> 8.4583 minutes</li>
    <li><strong>Sample Standard Deviation (s):</strong> 0.4209 minutes</li>
    <li><strong>Standard Error (s / √n):</strong> 0.1215 minutes</li>
  </ul>
</div>


The computed confidence intervals for the population mean task completion time ($\mu$) are:
```{r,echo=FALSE,message=FALSE,warning=FALSE}
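# Build the confidence interval table (in minutes) from the raw data
library(knitr)

times  <- c(8.4, 7.9, 9.1, 8.7, 8.2, 9.0, 7.8, 8.5, 8.9, 8.1, 8.6, 8.3)
conf   <- c(0.90, 0.95, 0.99)
se     <- sd(times) / sqrt(length(times))   # standard error s / sqrt(n)
t_crit <- qt(1 - (1 - conf) / 2, df = length(times) - 1)

ci_table_minutes <- data.frame(
  Confidence_Level = c("90%", "95%", "99%"),
  Lower_Bound_min = mean(times) - t_crit * se,
  Upper_Bound_min = mean(times) + t_crit * se
)

# Display the table
kable(
  ci_table_minutes,
  caption = "Confidence Intervals for Completion Time (in Minutes)",
  digits = 4,
  align = "c"
)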

```

<div style="
  border: 2px solid #2c7be5;
  background-color: #f1f6ff;
  padding: 15px;
  border-radius: 8px;
  margin-top: 15px;
  margin-bottom: 15px;
  font-size: 14px;
">
  <strong>Note:</strong><br><br>
  The critical <em>t</em>-values used in the calculation were:
  <ul style="margin-top: 8px;">
    <li>
      <em>t</em><sub>0.05, 11</sub> (90% CI): <strong>1.7959</strong>
    </li>
    <li>
      <em>t</em><sub>0.025, 11</sub> (95% CI): <strong>2.2010</strong>
    </li>
    <li>
      <em>t</em><sub>0.005, 11</sub> (99% CI): <strong>3.1058</strong>
    </li>
  </ul>
</div>
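
The sample statistics and intervals can be recomputed directly from the twelve observations, which makes the reported values easy to verify. This is a minimal sketch in base R. The object name `t_intervals` is illustrative, and `t.test()` is used only as a cross-check of the 95% interval.

```r
# Task completion times in minutes (from the problem statement)
times <- c(8.4, 7.9, 9.1, 8.7, 8.2, 9.0, 7.8, 8.5, 8.9, 8.1, 8.6, 8.3)

n     <- length(times)
x_bar <- mean(times)
s     <- sd(times)       # sample standard deviation (n - 1 denominator)
se    <- s / sqrt(n)

conf_levels <- c(0.90, 0.95, 0.99)
t_crit      <- qt(1 - (1 - conf_levels) / 2, df = n - 1)

t_intervals <- data.frame(
  level  = paste0(conf_levels * 100, "%"),
  t_crit = t_crit,
  lower  = x_bar - t_crit * se,
  upper  = x_bar + t_crit * se
)
print(t_intervals, digits = 5)

# Cross-check the 95% interval with the built-in one-sample t procedure
print(t.test(times, conf.level = 0.95)$conf.int)
```

Both approaches rely on the same assumption as the t-interval itself: with only 12 observations, the completion times should be roughly normally distributed.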

###   Visualize the three intervals on a single plot.

The three confidence intervals are visualized below. The red dot represents the sample mean ($\bar{x} \approx 8.46$ minutes), and the horizontal lines represent the interval for each confidence level.

```{r,echo=FALSE}
# Install and load necessary packages if not already installed
# install.packages("plotly")
library(plotly)

# Define the range of x values for the plots
x <- seq(-4, 4, length.out = 1000)

# Z-distribution (Standard Normal)
z_density <- dnorm(x, mean = 0, sd = 1)

# T-distributions with different degrees of freedom
t_df5 <- dt(x, df = 5)
t_df10 <- dt(x, df = 10)
t_df30 <- dt(x, df = 30)

# Create the plot using plotly
plot <- plot_ly() %>%
  # Add Z-distribution trace
  add_trace(x = x, y = z_density, type = 'scatter', mode = 'lines', 
            name = 'Z-distribution (Normal)', line = list(color = 'blue', width = 3)) %>%
  # Add T-distribution traces
  add_trace(x = x, y = t_df5, type = 'scatter', mode = 'lines', 
            name = 'T-distribution (df=5)', line = list(color = 'red', width = 2, dash = 'dash')) %>%
  add_trace(x = x, y = t_df10, type = 'scatter', mode = 'lines', 
            name = 'T-distribution (df=10)', line = list(color = 'green', width = 2, dash = 'dot')) %>%
  add_trace(x = x, y = t_df30, type = 'scatter', mode = 'lines', 
            name = 'T-distribution (df=30)', line = list(color = 'orange', width = 2, dash = 'dashdot')) %>%
  # Layout settings
  layout(title = 'Comparison of Z and T Distributions',
         xaxis = list(title = 'Value'),
         yaxis = list(title = 'Density'),
         legend = list(title = list(text = 'Distributions')))

# Display the plot
plot
```

```{r,echo=FALSE}
# Install and load necessary packages if not already installed
# install.packages("plotly")
library(plotly)

# Data for confidence intervals
confidence_levels <- c("90%", "95%", "99%")
lower_bounds <- c(8.2435, 8.1912, 8.0772)
upper_bounds <- c(8.6565, 8.7088, 8.8228)
sample_mean <- 8.45

# Parameters taken from the summary above (SE = s / sqrt(n) ≈ 0.1179, df = n - 1 = 11)
scale <- 0.1179  # Standard error (s / sqrt(n))
df <- 11  # Degrees of freedom for t-distribution

# Define x range around the mean
x <- seq(sample_mean - 4*scale, sample_mean + 4*scale, length.out = 1000)

# Z-distribution (Normal) density
z_density <- dnorm(x, mean = sample_mean, sd = scale)

# T-distribution density
t_density <- dt((x - sample_mean)/scale, df = df) / scale

# Create the plot
plot <- plot_ly()

# Add Z-distribution trace
plot <- plot %>% add_trace(x = x, y = z_density, type = 'scatter', mode = 'lines', 
                           name = 'Z-distribution (Normal)', line = list(color = 'blue', width = 2))

# Add T-distribution trace (label follows the df actually used)
plot <- plot %>% add_trace(x = x, y = t_density, type = 'scatter', mode = 'lines', 
                           name = paste0('T-distribution (df=', df, ')'), line = list(color = 'red', width = 2))

# Add shaded areas for confidence intervals (for Z-distribution)
for (i in 1:length(confidence_levels)) {
  # Find indices for shading
  idx <- x >= lower_bounds[i] & x <= upper_bounds[i]
  x_shade <- x[idx]
  y_shade <- z_density[idx]
  
  plot <- plot %>% add_trace(x = c(x_shade, rev(x_shade)), y = c(y_shade, rep(0, length(y_shade))), 
                             type = 'scatter', mode = 'lines', fill = 'tozeroy', 
                             fillcolor = 'rgba(0,0,255,0.3)', line = list(color = 'transparent'),
                             name = paste(confidence_levels[i], 'CI (Z)'), showlegend = TRUE)
}

# Add shaded areas for confidence intervals (for T-distribution)
for (i in 1:length(confidence_levels)) {
  # Find indices for shading
  idx <- x >= lower_bounds[i] & x <= upper_bounds[i]
  x_shade <- x[idx]
  y_shade <- t_density[idx]
  
  plot <- plot %>% add_trace(x = c(x_shade, rev(x_shade)), y = c(y_shade, rep(0, length(y_shade))), 
                             type = 'scatter', mode = 'lines', fill = 'tozeroy', 
                             fillcolor = 'rgba(255,0,0,0.3)', line = list(color = 'transparent'),
                             name = paste(confidence_levels[i], 'CI (T)'), showlegend = TRUE)
}

# Add vertical lines for sample mean
plot <- plot %>% add_trace(x = rep(sample_mean, 2), y = c(0, max(c(z_density, t_density))), 
                           type = 'scatter', mode = 'lines', 
                           line = list(color = 'black', width = 2, dash = 'dash'), 
                           name = 'Sample Mean (8.45 min)')

# Layout settings
plot <- plot %>% layout(title = 'Z and T Distributions with Confidence Intervals',
                        xaxis = list(title = 'Time (minutes)', range = c(min(x), max(x))),
                        yaxis = list(title = 'Density'),
                        legend = list(title = list(text = 'Legend')))

# Display the plot
plot
```

###    Factors Influencing Interval Width

The width of a confidence interval is twice the Margin of Error ($ME = t^* \cdot \frac{s}{\sqrt{n}}$), so anything that changes the margin of error changes the width.

- Confidence Level:

Effect: As the confidence level increases (e.g., from 90% to 99%), the interval becomes wider.

Logic: To be more certain that the interval contains the true population mean, we must encompass a larger range of possible values. This is reflected in a larger critical value ($t^*$).

- Sample Size ($n$):

Effect: As the sample size increases, the interval becomes narrower (more precise).

Logic: Increasing $n$ reduces the Standard Error ($\frac{s}{\sqrt{n}}$). With more data, our estimate of the mean becomes more stable and reliable, allowing for a tighter range of estimation.
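
Both effects can be checked numerically. The sketch below holds $s$ fixed at the case study's value of 0.4079 and evaluates the interval width for several confidence levels and sample sizes. The helper name `margin_of_error` and the sample sizes 48 and 192 are illustrative assumptions, not part of the original data.

```r
# Margin of error for a t-interval: t* times s / sqrt(n)
margin_of_error <- function(conf, n, s) {
  t_star <- qt(1 - (1 - conf) / 2, df = n - 1)
  t_star * s / sqrt(n)
}

s <- 0.4079   # sample standard deviation from the case study

# Effect of the confidence level (n fixed at 12)
width_by_level <- 2 * sapply(c(0.90, 0.95, 0.99), margin_of_error, n = 12, s = s)

# Effect of the sample size (95% level; 48 and 192 are hypothetical)
width_by_n <- 2 * sapply(c(12, 48, 192), function(n) margin_of_error(0.95, n, s))

print(round(width_by_level, 4))
print(round(width_by_n, 4))
```

Quadrupling $n$ halves the standard error, and the width shrinks slightly more than that because $t^*$ also falls as the degrees of freedom grow.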

---

##   Case Study 3 

**Confidence Interval for a Proportion, A/B Testing:** A data science team runs an **A/B test** on a new *Call-To-Action (CTA)* button design. The experiment yields:

$$
\begin{eqnarray*}
n &=& 400 \quad \text{(total users)} \\
x &=& 156 \quad \text{(users who clicked the CTA)}
\end{eqnarray*}
$$

**Tasks:**

1. Compute the **sample proportion** $\hat{p}$.
2. Compute Confidence Intervals for the proportion at:
   - $90\%$
   - $95\%$
   - $99\%$
3. Visualize and compare the three intervals.
4. Explain how confidence level affects decision-making in product experiments.

---

The given data is:
$$
\begin{eqnarray*}
n &=& 400 \quad \text{(total users)} \\
x &=& 156 \quad \text{(users who clicked the CTA)}
\end{eqnarray*}
$$

###    Compute the Sample Proportion ($\hat{p}$)

The sample proportion $\hat{p}$ is the point estimate for the true population proportion.

$$
\hat{p} = \frac{x}{n} = \frac{156}{400} = 0.3900
$$

The sample proportion of users who clicked the new CTA design is $39.00\%$.

###  Compute Confidence Intervals

The confidence intervals (CIs) are calculated using the formula:
$$
\text{CI} = \hat{p} \pm Z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}
$$

The standard error ($\text{SE}$) is:
$$
\text{SE} = \sqrt{\frac{0.3900(1 - 0.3900)}{400}} \approx 0.0244
$$
```{r,echo=FALSE,message=FALSE,warning=FALSE}
# Build the confidence interval table
ci_table <- data.frame(
  Confidence_Level = c("90%", "95%", "99%"),
  Z_score = c(1.6449, 1.9600, 2.5758),
  Margin_of_Error = c(0.0401, 0.0478, 0.0628),
  Confidence_Interval = c(
    "[0.3499, 0.4301]",
    "[0.3422, 0.4378]",
    "[0.3272, 0.4528]"
  )
)

# Display the table
library(knitr)

kable(
  ci_table,
  caption = "Confidence Intervals at Different Confidence Levels",
  align = "c"
)



```
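
The table values can be reproduced directly from $x$ and $n$. The sketch below applies the normal-approximation formula shown above; `wald_ci` is a helper name introduced here for illustration only.

```r
x <- 156   # users who clicked the CTA
n <- 400   # total users

p_hat <- x / n
se    <- sqrt(p_hat * (1 - p_hat) / n)

# Normal-approximation (Wald) interval at a given confidence level
wald_ci <- function(conf) {
  z  <- qnorm(1 - (1 - conf) / 2)
  me <- z * se
  c(z = z, margin = me, lower = p_hat - me, upper = p_hat + me)
}

# One row per confidence level
ci_prop <- t(sapply(c("90%" = 0.90, "95%" = 0.95, "99%" = 0.99), wald_ci))
print(round(ci_prop, 4))
```

Computing the critical values with `qnorm()` avoids carrying rounded Z-scores through the calculation.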


###  Visualize and Compare the Three Intervals

```{r,echo=FALSE,message=FALSE,warning=FALSE}
# Load library
library(plotly)

# Z-distribution data
z <- seq(-4, 4, length.out = 1000)
density <- dnorm(z)

# Build the base plot
p <- plot_ly(
  x = ~z,
  y = ~density,
  type = "scatter",
  mode = "lines",
  line = list(width = 2),
  name = "Standard Normal Distribution"
)

# Add the critical Z-value lines for each confidence level
p <- p %>%
  add_segments(
    x = -1.6449, xend = -1.6449,
    y = 0, yend = dnorm(-1.6449),
    line = list(dash = "dash", width = 2),
    name = "90% CI"
  ) %>%
  add_segments(
    x = 1.6449, xend = 1.6449,
    y = 0, yend = dnorm(1.6449),
    line = list(dash = "dash", width = 2),
    showlegend = FALSE
  ) %>%
  add_segments(
    x = -1.9600, xend = -1.9600,
    y = 0, yend = dnorm(-1.9600),
    line = list(dash = "dot", width = 2),
    name = "95% CI"
  ) %>%
  add_segments(
    x = 1.9600, xend = 1.9600,
    y = 0, yend = dnorm(1.9600),
    line = list(dash = "dot", width = 2),
    showlegend = FALSE
  ) %>%
  add_segments(
    x = -2.5758, xend = -2.5758,
    y = 0, yend = dnorm(-2.5758),
    line = list(dash = "longdash", width = 2),
    name = "99% CI"
  ) %>%
  add_segments(
    x = 2.5758, xend = 2.5758,
    y = 0, yend = dnorm(2.5758),
    line = list(dash = "longdash", width = 2),
    showlegend = FALSE
  )

# Layout
p <- p %>%
  layout(
    title = "Z Distribution with Confidence Interval Critical Values",
    xaxis = list(title = "Z-score"),
    yaxis = list(title = "Density"),
    legend = list(orientation = "h", x = 0.1, y = -0.2)
  )

# Display the plot
p

```
The chart above marks the critical Z-values for the three confidence levels on the standard normal density. The higher the confidence level, the farther the critical values lie from zero and the wider the resulting interval.
All three intervals are centered on the sample proportion ($\hat{p}=0.3900$).

###  Explain How Confidence Level Affects Decision-Making

The confidence level directly impacts the precision and certainty of your result, which is crucial in product experimentation like A/B testing.
```{r,echo=FALSE}
# Build the table relating confidence level to width, precision, and risk
ci_relation <- data.frame(
  Confidence_Level = c("Higher (99%)", "Lower (90%)"),
  Interval_Width = c("Wider", "Narrower"),
  Precision = c("Less precise", "More precise"),
  Certainty = c("More certain", "Less certain"),
  Risk_of_Type_I_Error = c(
    "Lower risk of Type I error",
    "Higher risk of Type I error"
  )
)

# Display the table
library(knitr)

kable(
  ci_relation,
  caption = "Relationship Between Confidence Level, Interval Width, Precision, and Error Risk",
  align = "c"
)

```

-  1. Defining a Winner (Statistical Significance):

In A/B testing, a common decision rule is to declare a "winner" if the confidence interval of the difference between the two variants does not include zero.

-  2. Higher Confidence ($\mathbf{99\%}$):

The interval is wider, making it harder to exclude the null value (e.g., zero difference between the new design and the old one).
It requires a larger difference in performance to achieve statistical significance. While this is the safest level, it more often leads to inconclusive results or requires longer testing times.

-  3. Lower Confidence ($\mathbf{90\%}$):

The interval is narrower, making it easier to achieve statistical significance. However, this increases the risk of a Type I error (a False Positive)—declaring the new CTA a winner when it is actually no better, or even worse, than the original.

-  4. Product Standard:

Most data science and product teams default to a $95\%$ confidence level (corresponding to an $\alpha=0.05$ significance level). This is considered a good balance: the procedure captures the true value in $95\%$ of repeated experiments, without the larger sample size or longer test duration that a $99\%$ confidence level would demand.
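
To make the trade-off concrete, the sketch below checks whether a baseline click-through rate falls outside the interval for the new design at each confidence level. The baseline of 0.34 is a hypothetical value chosen for illustration. It is not part of the experiment's data, and a full A/B analysis would instead build an interval for the difference between the two variants.

```r
p_hat <- 156 / 400                          # observed proportion for the new CTA
se    <- sqrt(p_hat * (1 - p_hat) / 400)    # standard error (normal approximation)

baseline <- 0.34   # hypothetical control click-through rate (illustration only)

conf_levels <- c(0.90, 0.95, 0.99)
z_star <- qnorm(1 - (1 - conf_levels) / 2)
lower  <- p_hat - z_star * se
upper  <- p_hat + z_star * se

decision <- data.frame(
  Confidence_Level  = paste0(conf_levels * 100, "%"),
  Lower             = round(lower, 4),
  Upper             = round(upper, 4),
  Baseline_Excluded = baseline < lower | baseline > upper
)
print(decision)
```

With this hypothetical baseline, the 90% and 95% intervals exclude it while the 99% interval does not, so the same data support a "winner" call only at the looser levels.
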
```{r,echo=FALSE}
# Install necessary packages if not already installed
# install.packages("plotly")

# Load the plotly library
library(plotly)

# Define the range of x values for the plot
x <- seq(-4, 4, length.out = 1000)

# Compute the density of the standard normal distribution (Z-distribution)
z_density <- dnorm(x, mean = 0, sd = 1)

# Compute the density of the t-distribution for different degrees of freedom
t_df_5 <- dt(x, df = 5)  # t-distribution with 5 df
t_df_10 <- dt(x, df = 10)  # t-distribution with 10 df
t_df_30 <- dt(x, df = 30)  # t-distribution with 30 df

# Create the plot using plotly
plot <- plot_ly() %>%
  # Add the Z-distribution (normal)
  add_trace(x = x, y = z_density, type = 'scatter', mode = 'lines', 
            name = 'Z-distribution (Normal)', line = list(color = 'blue', width = 3)) %>%
  # Add t-distributions with different df
  add_trace(x = x, y = t_df_5, type = 'scatter', mode = 'lines', 
            name = 't-distribution (df=5)', line = list(color = 'red', width = 2, dash = 'dash')) %>%
  add_trace(x = x, y = t_df_10, type = 'scatter', mode = 'lines', 
            name = 't-distribution (df=10)', line = list(color = 'green', width = 2, dash = 'dot')) %>%
  add_trace(x = x, y = t_df_30, type = 'scatter', mode = 'lines', 
            name = 't-distribution (df=30)', line = list(color = 'orange', width = 2, dash = 'dashdot')) %>%
  # Layout settings
  layout(title = 'Comparison of Z-distribution and t-distributions',
         xaxis = list(title = 'x'),
         yaxis = list(title = 'Density'),
         legend = list(x = 0.7, y = 0.9))

# Display the plot
plot
```

---

##   Case Study 4 

**Precision Comparison (Z-Test vs t-Test):** Two data teams measure **API latency (in milliseconds)** under different conditions.

$$
\begin{eqnarray*}
\text{Team A:} \\
n &=& 36 \quad \text{(sample size)} \\
\bar{x} &=& 210 \quad \text{(sample mean)} \\
\sigma &=& 24 \quad \text{(known population standard deviation)} \\[6pt]
\text{Team B:} \\
n &=& 36 \quad \text{(sample size)} \\
\bar{x} &=& 210 \quad \text{(sample mean)} \\
s &=& 24 \quad \text{(sample standard deviation)}
\end{eqnarray*}
$$

**Tasks**

1. Identify the statistical test used by each team.
2. Compute Confidence Intervals for **90%, 95%, and 99%**.
3. Create a visualization comparing all intervals.
4. Explain why the **interval widths differ**, even with similar data.

---


###   Statistical Test Identification

The choice of statistical test for the mean depends on whether the population standard deviation ($\sigma$) is known and the sample size ($n$).

```{r,echo=FALSE}
# Build the test selection table
test_selection <- data.frame(
  Team = c("Team A", "Team B"),
  Given_Standard_Deviation = c(
    "σ = 24 (Known population SD)",
    "s = 24 (Sample SD)"
  ),
  Test_Used = c(
    "Z-Test (or Z-interval)",
    "t-Test (or t-interval)"
  ),
  Justification = c(
    "Since the population standard deviation (σ) is known and the sample size (n = 36) is large (n ≥ 30), the Z-distribution is appropriate.",
    "Since the population standard deviation (σ) is unknown and only the sample standard deviation (s) is available, the t-distribution must be used."
  )
)

# Display the table
library(knitr)

kable(
  test_selection,
  caption = "Choice of Statistical Test Based on Standard Deviation Information",
  align = "c"
)
```
###    Confidence Interval Computation

The formula for the Confidence Interval (CI) for the population mean ($\mu$) is:

Team A (Z-Interval): $$
\text{CI} = \bar{x} \pm Z^* \left(\frac{\sigma}{\sqrt{n}}\right)
$$

Team B (t-Interval): $$
\text{CI} = \bar{x} \pm t^* \left(\frac{s}{\sqrt{n}}\right)
$$

Common Parameters:
$$
\begin{eqnarray*}
\text{Team A:} \\
n &=& 36 \quad \text{(sample size)} \\
\bar{x} &=& 210 \quad \text{(sample mean)} \\
\sigma &=& 24 \quad \text{(known population standard deviation)} \\[6pt]
\text{Team B:} \\
n &=& 36 \quad \text{(sample size)} \\
\bar{x} &=& 210 \quad \text{(sample mean)} \\
s &=& 24 \quad \text{(sample standard deviation)}
\end{eqnarray*}
$$


Standard Error (SE), identical for both teams because $\sigma = s = 24$:
$$
\text{SE} = \frac{24}{\sqrt{36}} = \frac{24}{6} = 4
$$


**Team A (Z-Interval): $\sigma$ is known.** We use the critical Z-values ($Z^*$) for the specified confidence levels.

```{r,echo=FALSE}
# Build the Z-based confidence interval table
ci_table_zstar <- data.frame(
  Confidence_Level = c("90%", "95%", "99%"),
  Z_Critical_Value = c(1.645, 1.960, 2.576),
  Margin_of_Error = c(
    "1.645 × 4 = 6.58",
    "1.960 × 4 = 7.84",
    "2.576 × 4 = 10.30"
  ),
  Confidence_Interval = c(
    "[203.42, 216.58]",
    "[202.16, 217.84]",
    "[199.70, 220.30]"
  ),
  Interval_Width = c(13.16, 15.68, 20.60)
)

# Display the table
library(knitr)

kable(
  ci_table_zstar,
  caption = "Confidence Intervals with Margin of Error (Z* × 4)",
  align = "c"
)
```

**Team B (t-Interval): $\sigma$ is unknown.** We use the critical t-values ($t^*$) with degrees of freedom $df = n - 1 = 36 - 1 = 35$.

```{r,echo=FALSE}
# Build the t-based confidence interval table
ci_table_tstar <- data.frame(
  Confidence_Level = c("90%", "95%", "99%"),
  t_Critical_Value_df35 = c(1.690, 2.030, 2.724),
  Margin_of_Error = c(
    "1.690 × 4 = 6.76",
    "2.030 × 4 = 8.12",
    "2.724 × 4 = 10.90"
  ),
  Confidence_Interval = c(
    "[203.24, 216.76]",
    "[201.88, 218.12]",
    "[199.10, 220.90]"
  ),
  Interval_Width = c(13.52, 16.24, 21.80)
)

# Display the table
library(knitr)

kable(
  ci_table_tstar,
  caption = "Confidence Intervals Using the t-Distribution (df = 35)",
  align = "c"
)

```
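
Both tables can be reproduced side by side with `qnorm()` and `qt()`. The sketch below uses unrounded critical values, so the second decimal may differ slightly from the tables, which multiply three-decimal critical values by 4. The variable name `sd_ab` is introduced here for illustration.

```r
x_bar <- 210   # sample mean (ms), both teams
n     <- 36    # sample size, both teams
sd_ab <- 24    # sigma for Team A, s for Team B (numerically equal here)
se    <- sd_ab / sqrt(n)

conf_levels <- c(0.90, 0.95, 0.99)
tail_prob   <- 1 - (1 - conf_levels) / 2

z_star <- qnorm(tail_prob)            # Team A: sigma known
t_star <- qt(tail_prob, df = n - 1)   # Team B: sigma estimated by s

comparison <- data.frame(
  Confidence_Level = paste0(conf_levels * 100, "%"),
  Z_Lower = x_bar - z_star * se, Z_Upper = x_bar + z_star * se,
  t_Lower = x_bar - t_star * se, t_Upper = x_bar + t_star * se
)
comparison$Z_Width <- comparison$Z_Upper - comparison$Z_Lower
comparison$t_Width <- comparison$t_Upper - comparison$t_Lower

print(cbind(comparison[1], round(comparison[-1], 2)))
```

Because the standard error is the same for both teams, any difference in width comes entirely from the critical values.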


###    Interval Comparison Visualization

The visualization shows that, at every confidence level:

- The t-intervals (Team B) are slightly wider than the Z-intervals (Team A).
- All intervals are centered at the sample mean of 210 ms.
- The width of the intervals increases as the confidence level increases (the 99% interval is widest, the 90% narrowest).

```{r,echo=FALSE,message=FALSE,warning=FALSE}
# Load necessary libraries
library(plotly)
library(dplyr)

# Data from the case
mean_val <- 210
n <- 36
sigma <- 24  # For Team A (Z-distribution)
s <- 24      # For Team B (t-distribution)
df <- n - 1  # 35 for t-distribution

# Confidence levels
conf_levels <- c(0.90, 0.95, 0.99)

# Z-scores and t-scores
z_scores <- qnorm((1 + conf_levels) / 2)
t_scores <- qt((1 + conf_levels) / 2, df)

# Calculate intervals
intervals <- data.frame(
  conf = rep(conf_levels, each = 2),
  type = rep(c("Z (Team A)", "t (Team B)"), times = length(conf_levels)),
  lower = numeric(length(conf_levels) * 2),
  upper = numeric(length(conf_levels) * 2)
)

for (i in 1:length(conf_levels)) {
  conf <- conf_levels[i]
  z_margin <- z_scores[i] * (sigma / sqrt(n))
  t_margin <- t_scores[i] * (s / sqrt(n))
  intervals$lower[(i-1)*2 + 1] <- mean_val - z_margin
  intervals$upper[(i-1)*2 + 1] <- mean_val + z_margin
  intervals$lower[(i-1)*2 + 2] <- mean_val - t_margin
  intervals$upper[(i-1)*2 + 2] <- mean_val + t_margin
}

# Create the interval comparison plot using Plotly
fig_intervals <- plot_ly(
  data = intervals,
  x = ~lower,
  y = ~factor(conf, levels = sort(conf_levels, decreasing = TRUE)),
  color = ~type,
  colors = c("Z (Team A)" = "blue", "t (Team B)" = "red"),
  type = "scatter",
  mode = "lines+markers",
  line = list(width = 4),
  marker = list(size = 8)
) %>%
  add_trace(
    x = ~upper,
    y = ~factor(conf, levels = sort(conf_levels, decreasing = TRUE)),
    color = ~type,
    mode = "lines+markers",
    line = list(width = 4),
    marker = list(size = 8),
    showlegend = FALSE
  ) %>%
  layout(
    title = "Confidence Interval Comparison: Z (Team A) vs t (Team B)",
    xaxis = list(title = "API Latency (ms)", range = c(195, 225)),
    yaxis = list(title = "Confidence Level"),
    annotations = list(
      list(x = mean_val, y = 0.5, text = "Sample Mean: 210 ms", showarrow = FALSE, xref = "x", yref = "paper", font = list(size = 12))
    )
  )

# Now, create a distribution plot for Z and t (sampling distributions of the mean)
x_vals <- seq(190, 230, length.out = 1000)
se <- sigma / sqrt(n)  # Standard error for both (since sigma = s = 24)
z_density <- dnorm(x_vals, mean = mean_val, sd = se)
t_density <- dt((x_vals - mean_val) / se, df) / se  # Scaled t-density

dist_data <- data.frame(
  x = rep(x_vals, 2),
  density = c(z_density, t_density),
  type = rep(c("Z Distribution (Team A)", "t Distribution (Team B)"), each = length(x_vals))
)

fig_dist <- plot_ly(
  data = dist_data,
  x = ~x,
  y = ~density,
  color = ~type,
  colors = c("Z Distribution (Team A)" = "blue", "t Distribution (Team B)" = "red"),
  type = "scatter",
  mode = "lines",
  line = list(width = 2)
) %>%
  layout(
    title = "Sampling Distributions: Z vs t",
    xaxis = list(title = "Sample Mean (ms)", range = c(190, 230)),
    yaxis = list(title = "Density")
  )

# Combine both plots into a subplot for a comprehensive visualization
fig <- subplot(fig_dist, fig_intervals, nrows = 2, shareX = FALSE, titleY = TRUE) %>%
  layout(
    title = "Distribution and Interval Comparison: Z vs t Distributions",
    showlegend = TRUE
  )

# Display the plot
fig
```

###    Explanation of Interval Width Difference

The interval widths differ because of the underlying probability distributions used: the Standard Normal (Z) Distribution versus the Student's t-Distribution.

####    $\sigma$ Known (Team A $\rightarrow$ Z-Test)

- The Z-test is used when the population standard deviation ($\sigma$) is known.

- Since $\sigma$ is a fixed, known value, the standard error ($\sigma/\sqrt{n}$) is known exactly and adds no extra variability to the analysis.

- The critical $Z^*$ values are fixed by the confidence level alone.

####    $\sigma$ Unknown (Team B $\rightarrow$ t-Test)

- The t-test is used when the population standard deviation ($\sigma$) is unknown, and we must substitute the sample standard deviation ($s$) as an estimate.

- The sample standard deviation ($s$) is itself an estimate that varies from sample to sample. This introduces an extra source of uncertainty into the standard error estimate.

- To account for this added uncertainty, the t-distribution has heavier tails (more spread out) than the Z-distribution.

- This results in larger critical values ($t^* > Z^*$) and, consequently, a larger Margin of Error (ME) and wider confidence intervals for the t-test compared to the Z-test at the same confidence level.

In summary: The t-test requires a wider interval (is less precise) to achieve the same confidence level as the Z-test because it must compensate for the additional uncertainty introduced by estimating the population standard deviation ($\sigma$) with the sample standard deviation ($s$).
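
The size of this penalty depends on the degrees of freedom. As a rough illustration, the sketch below compares the 95% critical values of the t-distribution with the Z value for several degrees of freedom. Only df = 35 comes from this case study; the other values are assumptions added to show the trend.

```r
conf   <- 0.95
z_star <- qnorm(1 - (1 - conf) / 2)

df_values <- c(5, 11, 35, 100, 1000)   # 35 is Team B's df; the rest are illustrative
t_star    <- qt(1 - (1 - conf) / 2, df = df_values)

gap <- data.frame(
  df        = df_values,
  t_star    = round(t_star, 4),
  z_star    = round(z_star, 4),
  ratio_t_z = round(t_star / z_star, 4)   # how much wider the t-interval is
)
print(gap)
```

At df = 35 the t-interval is roughly 3.6% wider than the Z-interval at the 95% level, and the ratio approaches 1 as the degrees of freedom increase.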


---

##    Case Study 5

**One-Sided Confidence Interval:** A **Software as a Service (SaaS)** company wants to ensure that **at least 70% of weekly active users** utilize a premium feature.

From the experiment:

$$
\begin{eqnarray*}
n &=& 250 \quad \text{(total users)} \\
x &=& 185 \quad \text{(active premium users)}
\end{eqnarray*}
$$

Management is only interested in the **lower bound** of the estimate.

**Tasks:**

1. Identify the **type of Confidence Interval** and the appropriate test.
2. Compute the **one-sided lower Confidence Interval** at:
   - $90\%$
   - $95\%$
   - $99\%$
3. Visualize the lower bounds for all confidence levels.
4. Determine whether the **70% target** is statistically satisfied.

---

The given data is:
$$
\begin{eqnarray*}
n &=& 250 \quad \text{(total users)} \\
x &=& 185 \quad \text{(active premium users)} \\
\hat{p} &=& \frac{x}{n} = \frac{185}{250} = 0.74
\end{eqnarray*}
$$

The target proportion to ensure is $p_0 = 0.70$. 

The standard error of the sample proportion ($\hat{p}$) is:
$$
SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.74(1-0.74)}{250}} \approx 0.0277
$$


###    Identify the Type of Confidence Interval and the Appropriate Test

 
Type of Confidence Interval: One-Sided Lower Confidence Interval for a Population Proportion. The company is only interested in the lower bound to ensure the feature usage is at least $70\%$.

Appropriate Test/Method: The appropriate method is using the Z-test for a Population Proportion (or the Normal Approximation method for confidence intervals) because the sample size is large enough to satisfy the normal approximation conditions ($n\hat{p} = 185 > 10$ and $n(1-\hat{p}) = 65 > 10$).

###    Compute the One-Sided Lower Confidence Interval

The formula for the one-sided lower confidence bound is:
$$
\text{Lower Bound} = \hat{p} - Z_{1-\alpha} \cdot SE
$$

```{r,echo=FALSE,message=FALSE,warning=FALSE}
# Build the one-sided confidence interval table
one_sided_ci <- data.frame(
  Confidence_Level = c("90%", "95%", "99%"),
  Alpha = c(0.10, 0.05, 0.01),
  Z_1_minus_Alpha = c(1.282, 1.645, 2.326),
  Lower_Bound = c(0.7044, 0.6944, 0.6755)
)

# Display the table
library(knitr)

kable(
  one_sided_ci,
  caption = "One-Sided Confidence Interval (Lower Bound)",
  align = "c"
)

```

**Detailed Results:**
```{r,echo=FALSE}
# Build the one-sided confidence interval table (higher-precision version)
one_sided_ci_precise <- data.frame(
  Confidence_Level = c("90%", "95%", "99%"),
  Z_1_minus_Alpha = c(1.281552, 1.644854, 2.326348),
  Lower_Bound_CI = c(0.704448, 0.694369, 0.675463)
)

# Display the table
library(knitr)

kable(
  one_sided_ci_precise,
  caption = "One-Sided Confidence Interval (Lower Bound)",
  digits = 6,
  align = "c"
)

```
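
The lower bounds in both tables follow from $x$ and $n$ alone. The sketch below recomputes them with `qnorm()`, checks the normal-approximation conditions first, and flags whether each bound reaches the target. The column name `Meets_Target` is an illustrative addition.

```r
x <- 185   # active premium users
n <- 250   # total users
target <- 0.70

p_hat <- x / n
se    <- sqrt(p_hat * (1 - p_hat) / n)

# Normal-approximation conditions: both expected counts should exceed 10
stopifnot(n * p_hat > 10, n * (1 - p_hat) > 10)

conf_levels <- c(0.90, 0.95, 0.99)
z_one_sided <- qnorm(conf_levels)          # Z_{1 - alpha}: all of alpha in one tail
lower_bound <- p_hat - z_one_sided * se

lower_ci <- data.frame(
  Confidence_Level = paste0(conf_levels * 100, "%"),
  Z_1_minus_Alpha  = round(z_one_sided, 4),
  Lower_Bound      = round(lower_bound, 4),
  Meets_Target     = lower_bound >= target
)
print(lower_ci)
```

Note that a one-sided bound uses `qnorm(0.95)` for the 95% level, not `qnorm(0.975)` as a two-sided interval would.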

###    Visualize the Lower Bounds for All Confidence Levels

The following plots illustrate the calculated lower bounds against the $70\%$ target. Each shows the sampling distribution of $\hat{p}$ with the three lower bounds marked.


---

```{r,echo=FALSE}
# Load necessary libraries
library(plotly)

# Data from the case study
n <- 250
x_success <- 185
p_hat <- x_success / n                  # 0.74
SE <- sqrt(p_hat * (1 - p_hat) / n)     # approx. 0.0277
target <- 0.70

# One-sided lower bounds: p_hat - z * SE
conf_levels <- c(0.90, 0.95, 0.99)
z_scores <- qnorm(conf_levels)
lower_bounds <- round(p_hat - z_scores * SE, 4)   # 0.7044, 0.6944, 0.6755

df <- n - 1  # Degrees of freedom for the t-based comparison curve

# Create data for Z-distribution (sampling distribution: normal with mean=p_hat, sd=SE)
x_z <- seq(p_hat - 4*SE, p_hat + 4*SE, length.out = 1000)
y_z <- dnorm(x_z, mean = p_hat, sd = SE)

# Create data for T-distribution (t with df, scaled to match mean and sd approx)
x_t <- seq(p_hat - 4*SE, p_hat + 4*SE, length.out = 1000)
y_t <- dt((x_t - p_hat)/SE, df) / SE  # Scale to match sd

# Create plotly figure for Z-distribution (sampling distribution)
fig_z <- plot_ly(x = x_z, y = y_z, type = 'scatter', mode = 'lines', name = 'Sampling Distribution (Z-approx)',
                 line = list(color = 'blue')) %>%
  layout(title = paste0('Sampling Distribution for Proportion (Z-approximation, p̂ = ', p_hat, ', SE = ', round(SE, 4), ')'),
         xaxis = list(title = 'Proportion'),
         yaxis = list(title = 'Density'))

# Add vertical line for the target proportion (the target lies on the x-axis)
fig_z <- fig_z %>% add_trace(x = c(target, target), y = c(0, max(y_z)), type = 'scatter', mode = 'lines', 
                             line = list(color = 'red', dash = 'dash'), name = paste0('Target: ', target))

# Add vertical lines for lower bounds
for (i in 1:length(conf_levels)) {
  fig_z <- fig_z %>% add_trace(x = c(lower_bounds[i], lower_bounds[i]), y = c(0, dnorm(lower_bounds[i], p_hat, SE)), 
                               type = 'scatter', mode = 'lines', line = list(color = 'green', dash = 'dot'),
                               name = paste0(conf_levels[i]*100, '% Lower Bound: ', lower_bounds[i]))
}

```
----


----
```{r,echo=FALSE}

# Create plotly figure for T-distribution
fig_t <- plot_ly(x = x_t, y = y_t, type = 'scatter', mode = 'lines', name = paste0('Sampling Distribution (T, df = ', df, ')'),
                 line = list(color = 'green')) %>%
  layout(title = paste0('Sampling Distribution for Proportion (T-distribution, df = ', df, ')'),
         xaxis = list(title = 'Proportion'),
         yaxis = list(title = 'Density'))

# Add vertical line for the target proportion (the target lies on the x-axis)
fig_t <- fig_t %>% add_trace(x = c(target, target), y = c(0, max(y_t)), type = 'scatter', mode = 'lines', 
                             line = list(color = 'red', dash = 'dash'), name = paste0('Target: ', target))

# Add vertical lines for lower bounds
for (i in 1:length(conf_levels)) {
  fig_t <- fig_t %>% add_trace(x = c(lower_bounds[i], lower_bounds[i]), y = c(0, dt((lower_bounds[i] - p_hat)/SE, df) / SE), 
                               type = 'scatter', mode = 'lines', line = list(color = 'green', dash = 'dot'),
                               name = paste0(conf_levels[i]*100, '% Lower Bound: ', lower_bounds[i]))
}

# Display the plots
fig_z
fig_t
```
---

###    Determine Whether the $70\%$ Target is Statistically Satisfied

The $70\%$ target is statistically satisfied at a given confidence level if the calculated Lower Bound is $\geq 0.70$.

####    At $90\%$ Confidence:

- Lower Bound: $0.7044$

- Conclusion: Statistically Satisfied. Since $0.7044 > 0.70$, we are $90\%$ confident that the true proportion of weekly active users utilizing the premium feature is at least $70.44\%$.

####    At $95\%$ Confidence:

- Lower Bound: $0.6944$

- Conclusion: NOT Statistically Satisfied. Since $0.6944 < 0.70$, we cannot be $95\%$ confident that the true proportion is at least $70\%$.

####    At $99\%$ Confidence:

- Lower Bound: $0.6755$

- Conclusion: NOT Statistically Satisfied. Since $0.6755 < 0.70$, we cannot be $99\%$ confident that the true proportion is at least $70\%$.

Summary: The company can be $90\%$ confident that the true proportion of weekly active users utilizing a premium feature is at least $70\%$. However, they cannot make this claim at the stricter $95\%$ or $99\%$ confidence levels.
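
The same conclusion can be read off a one-sided test of $H_0: p \le 0.70$ against $H_1: p > 0.70$. The sketch below is an assumption-laden cross-check rather than part of the original analysis: it reuses the standard error based on $\hat{p}$ so that it mirrors the lower bounds exactly, whereas a textbook z-test would usually compute the standard error under $p_0$.

```r
p_hat <- 185 / 250
p0    <- 0.70
se    <- sqrt(p_hat * (1 - p_hat) / 250)   # same SE as used for the lower bounds

z_stat  <- (p_hat - p0) / se
p_value <- 1 - pnorm(z_stat)               # upper-tail probability

# The lower bound at level (1 - alpha) exceeds p0 when p_value < alpha
alpha <- c(0.10, 0.05, 0.01)
meets <- p_value < alpha

print(round(c(z = z_stat, p_value = p_value), 4))
print(setNames(meets, paste0((1 - alpha) * 100, "%")))
```

The p-value lands between 0.05 and 0.10, which is why the target is met at the 90% level but not at 95% or 99%.
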
```{r,echo=FALSE,message=FALSE}
# Load library
library(plotly)

# Function to build the Z-distribution plot
plot_z_distribution <- function() {
  # Data for the standard normal distribution
  x <- seq(-4, 4, length.out = 1000)
  y <- dnorm(x, mean = 0, sd = 1)
  
  # Z-scores for the confidence levels (one-tailed)
  z_90 <- qnorm(0.90)  # ≈ 1.282
  z_95 <- qnorm(0.95)  # ≈ 1.645
  z_99 <- qnorm(0.99)  # ≈ 2.326
  
  # Build the plot with plotly
  p <- plot_ly() %>%
    add_trace(x = x, y = y, type = 'scatter', mode = 'lines', name = 'Z Distribution',
              line = list(color = 'blue')) %>%
    # Area for 90% confidence
    add_trace(x = x[x <= z_90], y = y[x <= z_90], type = 'scatter', mode = 'lines', fill = 'tozeroy',
              fillcolor = 'rgba(0, 255, 0, 0.3)', line = list(color = 'green'), name = '90% Confidence (Satisfied)') %>%
    # Area for 95% confidence
    add_trace(x = x[x <= z_95], y = y[x <= z_95], type = 'scatter', mode = 'lines', fill = 'tozeroy',
              fillcolor = 'rgba(255, 255, 0, 0.3)', line = list(color = 'yellow'), name = '95% Confidence (Not Satisfied)') %>%
    # Area for 99% confidence
    add_trace(x = x[x <= z_99], y = y[x <= z_99], type = 'scatter', mode = 'lines', fill = 'tozeroy',
              fillcolor = 'rgba(255, 0, 0, 0.3)', line = list(color = 'red'), name = '99% Confidence (Not Satisfied)') %>%
    # Vertical lines at the z-scores
    add_trace(x = c(z_90, z_90), y = c(0, dnorm(z_90)), type = 'scatter', mode = 'lines',
              line = list(color = 'green', dash = 'dash'), name = paste('Z 90%:', round(z_90, 3))) %>%
    add_trace(x = c(z_95, z_95), y = c(0, dnorm(z_95)), type = 'scatter', mode = 'lines',
              line = list(color = 'yellow', dash = 'dash'), name = paste('Z 95%:', round(z_95, 3))) %>%
    add_trace(x = c(z_99, z_99), y = c(0, dnorm(z_99)), type = 'scatter', mode = 'lines',
              line = list(color = 'red', dash = 'dash'), name = paste('Z 99%:', round(z_99, 3))) %>%
    layout(title = 'Z (Standard Normal) Distribution for the Confidence Levels',
           xaxis = list(title = 'Z-Score'),
           yaxis = list(title = 'Density'),
           annotations = list(
             list(x = z_90, y = dnorm(z_90) + 0.05, text = 'Lower Bound: 0.7044 (Satisfied)', showarrow = FALSE),
             list(x = z_95, y = dnorm(z_95) + 0.05, text = 'Lower Bound: 0.6944 (Not Satisfied)', showarrow = FALSE),
             list(x = z_99, y = dnorm(z_99) + 0.05, text = 'Lower Bound: 0.6755 (Not Satisfied)', showarrow = FALSE)
           ))
  
  return(p)
}

# Function that plots the t distribution for a given df
plot_t_distribution <- function(df = 30) {
  # Grid of x values and the t density at each point
  x <- seq(-4, 4, length.out = 1000)
  y <- dt(x, df = df)
  
  # Critical t-values for the one-tailed confidence levels
  t_90 <- qt(0.90, df = df)
  t_95 <- qt(0.95, df = df)
  t_99 <- qt(0.99, df = df)
  
  # Build the plot with plotly, starting with the density curve
  p <- plot_ly() %>%
    add_trace(x = x, y = y, type = 'scatter', mode = 'lines', name = paste0('T Distribution (df = ', df, ')'),
              line = list(color = 'purple')) %>%
    # Shaded area up to the 90% critical value
    add_trace(x = x[x <= t_90], y = y[x <= t_90], type = 'scatter', mode = 'lines', fill = 'tozeroy',
              fillcolor = 'rgba(0, 255, 0, 0.3)', line = list(color = 'green'), name = '90% Confidence (Satisfied)') %>%
    # Shaded area up to the 95% critical value
    add_trace(x = x[x <= t_95], y = y[x <= t_95], type = 'scatter', mode = 'lines', fill = 'tozeroy',
              fillcolor = 'rgba(255, 255, 0, 0.3)', line = list(color = 'yellow'), name = '95% Confidence (Not Satisfied)') %>%
    # Shaded area up to the 99% critical value
    add_trace(x = x[x <= t_99], y = y[x <= t_99], type = 'scatter', mode = 'lines', fill = 'tozeroy',
              fillcolor = 'rgba(255, 0, 0, 0.3)', line = list(color = 'red'), name = '99% Confidence (Not Satisfied)') %>%
    # Dashed vertical lines at the critical t-values
    add_trace(x = c(t_90, t_90), y = c(0, dt(t_90, df)), type = 'scatter', mode = 'lines',
              line = list(color = 'green', dash = 'dash'), name = paste('T 90%:', round(t_90, 3))) %>%
    add_trace(x = c(t_95, t_95), y = c(0, dt(t_95, df)), type = 'scatter', mode = 'lines',
              line = list(color = 'yellow', dash = 'dash'), name = paste('T 95%:', round(t_95, 3))) %>%
    add_trace(x = c(t_99, t_99), y = c(0, dt(t_99, df)), type = 'scatter', mode = 'lines',
              line = list(color = 'red', dash = 'dash'), name = paste('T 99%:', round(t_99, 3))) %>%
    layout(title = paste0('T Distribution (df = ', df, ') for the Confidence Levels'),
           xaxis = list(title = 'T-Score'),
           yaxis = list(title = 'Density'),
           annotations = list(
             list(x = t_90, y = dt(t_90, df) + 0.05, text = 'Lower Bound: 0.7044 (Satisfied)', showarrow = FALSE),
             list(x = t_95, y = dt(t_95, df) + 0.05, text = 'Lower Bound: 0.6944 (Not Satisfied)', showarrow = FALSE),
             list(x = t_99, y = dt(t_99, df) + 0.05, text = 'Lower Bound: 0.6755 (Not Satisfied)', showarrow = FALSE)
           ))
  
  return(p)
}

# Build both plots
plot_z <- plot_z_distribution()
plot_t <- plot_t_distribution(df = 30)  # Change df if the sample size is known (df = n - 1)

# Display the plots (run them one at a time or combine them)
plot_z
plot_t
```

