Members


The Vived

Rafael

Rafael Yogi Septiadi Putra

Major Data Science
Student ID 52250019
Lecturer Bakti Siregar, M. Sc., CSD.
Subject Basic Statistics

Paskalis

Paskalis Farelnata Zamasi

Major Data Science
Student ID 52250043
Lecturer Bakti Siregar, M. Sc., CSD.
Subject Basic Statistics

Aya

Risky Nurhidayah

Major Data Science
Student ID 52250030
Lecturer Bakti Siregar, M. Sc., CSD.
Subject Basic Statistics

Aidil

M. Fitrah Aidil Harahap

Major Data Science
Student ID 52250031
Lecturer Bakti Siregar, M. Sc., CSD.
Subject Basic Statistics

Dhefio

Dhefio Alim Muzakki

Major Data Science
Student ID 52250014
Lecturer Bakti Siregar, M. Sc., CSD.
Subject Basic Statistics

Summary of Basic Statistics


Chapter 1

Interpretation
The first mind map illustrates the framework for descriptive statistical analysis of customer data, focusing on three main aspects: customer characteristics, satisfaction levels, and purchasing patterns. Customer characteristics, such as age, monthly income, gender, region, and education level, are analyzed as an initial step to understand the distribution and composition of the data. Next, customer satisfaction levels are analyzed through satisfaction scores and comparisons between customer groups. In addition, customer purchasing patterns, particularly the number of monthly purchases and comparisons between card levels, are analyzed using measures of central tendency and dispersion. The analysis is purely descriptive: it does not draw inferences about the population or test hypotheses.

Chapter 2

Interpretation
The second mind map focuses on the analysis of numerical variables in customer data, namely age, monthly income, number of purchases, and satisfaction score. The analysis is conducted through statistical summaries such as mean, median, minimum, maximum, and standard deviation to describe typical values and the degree of data variation. Additionally, this mind map emphasizes the identification of initial patterns and trends, such as dominant age groups, income patterns, and purchase intensity. The analysis also includes the detection of anomalies or outliers that may affect the results of statistical summaries and the overall interpretation of the data.

Chapter 3

Interpretation
Basic visualization is the cornerstone of effective data communication. By utilizing the ggplot2 framework in R, we move away from static, rigid charting toward a “Grammar of Graphics” approach. This allows for a modular construction of plots where data, aesthetics, and geometric objects are layered to reveal hidden truths within a dataset.

The three most relevant visualizations identified (histograms, boxplots, and scatter plots) each serve a distinct analytical purpose. Histograms are essential for understanding the distribution of a single variable, showing whether the data are concentrated in one area or spread out. Boxplots display the “Five-Number Summary” (minimum, first quartile, median, third quartile, and maximum), which makes them a standard tool for comparing groups and identifying outliers that could skew further mathematical modeling. Finally, scatter plots allow for the exploration of relationships, helping us visualize whether two variables move together (correlation) or show no apparent relationship.

In conclusion, these techniques fall under the umbrella of Descriptive Statistics. They aim to solve the problem of data opacity by providing a clear, visual summary of the data’s central tendency, dispersion, and outliers. Mastering these basics ensures that any subsequent complex analysis is built on a solid, well-understood foundation of the underlying data structure.
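As a rough illustration of what a boxplot encodes, the sketch below computes a five-number summary and applies the common 1.5 × IQR rule for flagging outliers. The `scores` list is hypothetical and is not taken from the customer dataset; the quartile method (`"inclusive"`) is an assumption, and other software may use a slightly different definition.

```python
import statistics

# Hypothetical satisfaction scores; 9.5 is planted as an unusually high value.
scores = [3.0, 3.5, 4.0, 4.0, 4.5, 5.0, 5.0, 5.5, 9.5]

# Quartiles (Q1, median, Q3) using the inclusive interpolation method.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")

# Five-number summary: the values a boxplot draws.
five_number = (min(scores), q1, median, q3, max(scores))

# 1.5 * IQR fences: points outside them are drawn as outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print("Five-number summary:", five_number)
print("Outliers:", outliers)
```

With this made-up data only the planted value falls outside the fences, which is the kind of point a boxplot would draw separately from the whiskers.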

Chapter 4

Interpretation
The analysis of Central Tendency for the Customer Purchase Data highlights the fundamental importance of looking beyond simple averages. In statistics, the primary problem we solve is summarization without distortion.

For the Age variable, the balance between the Mean and Median tells us our “typical” customer profile is stable and middle-aged. However, the Total Purchase variable tells a different story. Because the Mean is nearly double the Median, we can conclude that the dataset is heavily influenced by high-value transactions. This skewness suggests that the business relies on a small number of big spenders to drive the average up, while the majority of customers (the Median and Mode) are spending significantly less.

By applying Descriptive Statistics, we transform 200 rows of chaos into a clear narrative: the store has a mature average audience but thrives on rare, high-ticket electronics purchases that shift the financial center of gravity.
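The effect of a few large transactions on the mean can be seen in a small sketch. The `purchases` values below are hypothetical (they are not the Customer Purchase Data) and were chosen so that one high-ticket purchase pulls the mean to twice the median.

```python
import statistics

# Hypothetical purchase totals; the last value stands in for a rare high-ticket sale.
purchases = [120, 150, 150, 180, 200, 220, 250, 1770]

mean_purchase = statistics.mean(purchases)      # pulled upward by the large value
median_purchase = statistics.median(purchases)  # middle of the ordered data
mode_purchase = statistics.mode(purchases)      # most frequent value

print("Mean:", mean_purchase)
print("Median:", median_purchase)
print("Mode:", mode_purchase)
print("Mean / median ratio:", mean_purchase / median_purchase)
```

A mean well above the median is the usual sign of right skew, which is why the median is often the safer description of a typical customer in this situation.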

Chapter 5

Interpretation
Statistical dispersion describes how far data values spread from the average. In the analysis of customers’ monthly income, dispersion helps explain the differences in economic conditions among customers.

Range is the simplest measure of dispersion: the difference between the highest and lowest values (Max − Min). A large range indicates a significant gap between the highest and lowest monthly incomes.

Variance is the average of the squared differences from the mean. It shows how widely the data values are spread, but it is expressed in squared units, which makes it less intuitive to interpret directly.

Standard deviation is the square root of the variance and represents the typical distance of data values from the mean, in the original units. A higher standard deviation indicates greater variability in customers’ monthly income.

Interpretation: the results show that customers’ monthly income has a high level of dispersion. The income values vary widely and are not concentrated around the mean, reflecting diverse economic backgrounds among customers.
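The three measures can be computed in a few lines. This is a minimal sketch on a hypothetical `incomes` list (in millions of rupiah), not the dashboard's data. It shows both the population variance (divide by n) and the sample variance (divide by n − 1), since the text does not say which one was used.

```python
import statistics

# Hypothetical monthly incomes in millions of rupiah.
incomes = [4.0, 6.0, 8.0, 10.0, 22.0]

# Range: Max - Min.
income_range = max(incomes) - min(incomes)

# Variance: average squared deviation from the mean.
population_variance = statistics.pvariance(incomes)  # divides by n
sample_variance = statistics.variance(incomes)       # divides by n - 1

# Standard deviation: square root of the (sample) variance, in original units.
sample_std_dev = statistics.stdev(incomes)

print("Range:", income_range)
print("Population variance:", population_variance)
print("Sample variance:", sample_variance)
print("Sample standard deviation:", round(sample_std_dev, 3))
```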

Chapter 6

Interpretation
Probability is a mathematical concept used to measure how likely an event is to occur. It helps analyze uncertainty and make predictions based on data.

Basic components of probability: there are three main components, namely an experiment, a sample space, and an event. An experiment produces an outcome, the sample space includes all possible outcomes, and an event is a specific subset of those outcomes. Probability is a numerical value between 0 and 1 that represents the likelihood of an event occurring. A probability of 0 means the event is impossible, while a probability of 1 means the event is certain. In data analysis, probability is used to identify patterns and trends within a dataset.

Determining a relevant event: a relevant event must be clearly defined and meaningful within the dataset. In this case, the selected event is a randomly chosen customer having a Gold Card, which represents an important customer category.

Probability calculation: since the data come from real observations, empirical probability is used. It is calculated by dividing the number of favorable outcomes by the total number of outcomes.

Interpretation of the result: a probability value of 0.25 indicates a 25% chance that a randomly selected customer has a Gold Card. This means that approximately 25 out of every 100 customers belong to the Gold Card category.
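The empirical probability calculation can be sketched as a simple count. The `card_levels` list below is hypothetical: it contains 20 made-up customers arranged so that the Gold Card share equals the 0.25 reported above, and it is not the actual dataset.

```python
from collections import Counter

# Hypothetical card levels for 20 customers (not the real data).
card_levels = ["Gold"] * 5 + ["Silver"] * 6 + ["Classic"] * 7 + ["Platinum"] * 2

counts = Counter(card_levels)

# Empirical probability = favorable outcomes / total outcomes.
p_gold = counts["Gold"] / len(card_levels)

print("Gold Card customers:", counts["Gold"])
print("Total customers:", len(card_levels))
print("P(Gold Card):", p_gold)
```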

Chapter 7

Interpretation
Age variable: this figure shows the results of the probability distribution analysis for the customer age variable based on descriptive statistics. The age distribution tends to be symmetrical and relatively flat (platykurtic). This is supported by three main indicators, namely the closeness of the mean (40.94) and median (41.00) values, a skewness value very close to zero (-0.0487), and a negative excess kurtosis value (-1.0997). These findings indicate that customer ages are evenly distributed around the average value, without skewness toward younger or older ages, and show a fairly wide variation in age that is not concentrated in any particular group.
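For readers who want to see where such indicators come from, here is a minimal sketch of moment-based skewness and excess kurtosis. The `ages` list is hypothetical, and the estimator used by the dashboard is not stated, so the formulas below (simple population moments, without bias correction) are an assumption and may differ slightly from the software that produced -0.0487 and -1.0997.

```python
# Hypothetical, evenly spread ages (not the customer data).
ages = [20, 30, 40, 50, 60]

n = len(ages)
mean_age = sum(ages) / n

# Central moments of order 2, 3 and 4.
m2 = sum((x - mean_age) ** 2 for x in ages) / n
m3 = sum((x - mean_age) ** 3 for x in ages) / n
m4 = sum((x - mean_age) ** 4 for x in ages) / n

# Skewness: 0 for a symmetric distribution.
skewness = m3 / m2 ** 1.5

# Excess kurtosis: 0 for a normal distribution, negative for flat (platykurtic) data.
excess_kurtosis = m4 / m2 ** 2 - 3

print("Skewness:", skewness)
print("Excess kurtosis:", round(excess_kurtosis, 4))
```

Evenly spread values like these give zero skewness and a clearly negative excess kurtosis, the same qualitative pattern the chapter reports for age.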

Chapter 8

Interpretation
This chapter explains how the 95% confidence interval for the average monthly income of customers is calculated. With a total of 735 observations, the estimated mean is Rp11,795,425.19, and the standard deviation is relatively large, indicating considerable income variation. Based on the standard error, the margin of error, and the critical value z = 1.96, the confidence interval runs from Rp11,414,386.44 to Rp12,176,463.94. Statistically, this result indicates that, with 95% confidence, the average monthly income of the customer population lies within this range. Practically, the rather wide interval reflects the heterogeneity of customer income levels.
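The interval above can be checked with simple arithmetic. The chapter does not report the standard deviation, so this sketch works backwards from the reported bounds to the margin of error, the standard error, and the standard deviation they imply. The implied values are derived here as an illustration and are not figures stated in the source.

```python
import math

# Values reported in the text.
n = 735
mean_income = 11_795_425.19
lower, upper = 11_414_386.44, 12_176_463.94
z = 1.96  # critical value for 95% confidence

# Margin of error is half the interval width; the mean should be the midpoint.
margin_of_error = (upper - lower) / 2
midpoint = (lower + upper) / 2

# ME = z * SE and SE = s / sqrt(n), so both can be recovered from the bounds.
standard_error = margin_of_error / z
implied_std_dev = standard_error * math.sqrt(n)

print("Margin of error:", round(margin_of_error, 2))
print("Midpoint:", round(midpoint, 2))
print("Implied standard error:", round(standard_error, 2))
print("Implied standard deviation:", round(implied_std_dev, 2))
```

The midpoint reproduces the reported mean, and the implied standard deviation is a little over Rp5 million, which fits the description of considerable income variation.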

Chapter 9

Interpretation
This chapter describes the application of a one-tailed hypothesis test (right-tailed t-test) to test whether the average age of customers is greater than 40 years. Using data from 735 customers, the average age was found to be approximately 41.06 years, with a t-test statistic of 2.2063. This value exceeds the critical value (approximately 1.645 for a sample this large) at a 5% significance level, leading to the rejection of the null hypothesis. This result indicates that, statistically, the average age of customers is significantly higher than 40 years. These findings have important business implications, particularly for determining marketing strategies and developing products that align better with the dominant customer age segment.

Chapter 10

Interpretation
This chapter explains the use of the Kruskal–Wallis nonparametric test to compare satisfaction levels among more than two groups or regions. This test is used when the data are not normally distributed and are measured on an ordinal or continuous scale. The process starts by formulating the null hypothesis that the medians of all groups are equal. The data are then combined and ranked, and the sum of ranks for each group is calculated to obtain the H statistic. The H value is compared to the critical Chi-Square value at a 0.05 significance level to make a decision. If H is greater than the critical value, the null hypothesis is rejected, indicating that there is a significant difference in satisfaction levels between the groups.
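The ranking procedure described above can be sketched in plain Python. The three `regions` groups are hypothetical satisfaction scores, not the dashboard's data. The helper names are illustrative, and the H formula below omits the tie correction, which is an assumption that only holds exactly when no scores are tied. The critical value 5.991 is the Chi-Square quantile for 2 degrees of freedom at α = 0.05.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1), without tie correction."""
    pooled = [x for group in groups for x in group]
    ranks = average_ranks(pooled)
    total_n = len(pooled)
    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    return 12 / (total_n * (total_n + 1)) * weighted_rank_sums - 3 * (total_n + 1)


# Hypothetical satisfaction scores for three regions.
regions = [
    [2.1, 2.5, 3.0, 3.2],
    [3.5, 3.8, 4.0, 4.1],
    [4.3, 4.5, 4.7, 4.9],
]

h_statistic = kruskal_wallis_h(regions)
critical_value = 5.991  # Chi-Square, df = k - 1 = 2, alpha = 0.05
reject_h0 = h_statistic > critical_value

print("H:", round(h_statistic, 3))
print("Reject H0:", reject_h0)
```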

Dataset

Table

All About Basic Visualizations


Pie-Chart

Interpretation
This donut chart illustrates the distribution of membership card levels among customers, revealing that Classic Card dominates with approximately 25% of users, followed by Silver Card at 20%, Gold Card at 18%, and Platinum Card at 15%. This suggests a tiered structure where lower-tier cards are more prevalent, potentially indicating broader accessibility or marketing focus on entry-level memberships. The interactive features allow users to hover for exact percentages and select slices for emphasis, enhancing data exploration.

Bar-Chart

Interpretation
This bar chart highlights the top 10 job categories by average user satisfaction score, with Government Employee leading at approximately 4.85, followed by Entrepreneur at 4.82. The data indicates that professional roles like Government Employee and Freelancer report higher satisfaction, possibly due to job stability or alignment with gym services. Interactive elements include data labels showing exact averages and color gradients for visual appeal, allowing deeper insights into job-related satisfaction trends.

Line-Chart

Interpretation
This longitudinal analysis of customer satisfaction from 2015 to 2024 reveals a remarkably resilient service delivery model characterized by a strong Mean Reversion property, where monthly ratings consistently gravitate toward the global average despite minor operational fluctuations. From a professional standpoint, the process is strictly “In-Control,” as all data points remain within the Upper and Lower Control Limits (UCL/LCL), confirming that the observed variance is merely “common cause noise” rather than systemic failure. This statistical stability provides stakeholders with empirical assurance that the gym’s quality control measures are robust, ensuring that the customer experience remains standardized and shielded from the pressures of membership growth over nearly a decade of operation.

Central Tendency

Interpretation
The shape of the distribution indicates a positive skew, suggesting that while the median purchase volume is stable, there is a significant cohort of high-frequency buyers driving the mean upward. The Rug Plot at the base shows the individual observations and confirms that the smoothed density is consistent with the actual transactional clusters. This visualization suggests that our customer base is not monolithic, but composed of distinct behavioral segments that can be targeted differently.

Statistical Dispersion

Interpretation
The Economic DNA report identifies a high-density “equilibrium” in the middle-income segment while highlighting a significant positive skew toward high-net-worth outliers. This structural duality suggests that while the mass-market core provides foundational stability, the Elite Segment (ruby-colored outliers) represents a distinct, high-value tier requiring a specialized VVIP strategy. Strategically, the organization should maintain standardized excellence for the majority while engineering bespoke, premium offerings to capture the disproportionate purchasing power at the distribution’s upper extremity.

Probability Distributions

Interpretation
The Age Distribution Analysis identifies a bimodal demographic, with primary engagement peaks among working professionals (30–40) and a secondary mature segment (50–60). Despite a stable central tendency near the early 40s, the Cumulative Distribution reveals a steady progression toward an older membership base. The density “valleys” between these peaks suggest a need to move away from broad marketing in favor of a bifurcated strategy that specifically targets high-performance careerists and longevity-focused adults.

Confidence Interval

Case Study

Do customers in the Professional job category have a higher level of satisfaction than the average customer population?

  • Objective

To estimate the average satisfaction (μ) of customers in the Professional job category based on sample data, with a 95% confidence level.

  • Sample Data & Statistics

Number of respondents (n) = 120

Mean satisfaction (x̄) = 7.85

Sample standard deviation (s) = 1.25

  • Confidence Level

Confidence Level = 95%

Significance level (α) = 0.05

Method

  • Method Used

The t-distribution is used because the population standard deviation is unknown. The general formula is:

\[ CI = \bar{x} \pm t_{\alpha/2, df} \cdot \left( \frac{s}{\sqrt{n}} \right) \]

Degrees of freedom (df) = n - 1 = 119

  • Calculation

Critical t-value: t_{0.025,119} ≈ 1.98

Standard Error (SE):

\[ SE = \frac{s}{\sqrt{n}} = \frac{1.25}{\sqrt{120}} \approx 0.114 \]

Margin of Error (ME):

\[ ME = t \cdot SE = 1.98 \times 0.114 \approx 0.23 \]

Confidence Interval:

\[ CI = \bar{x} \pm ME = 7.85 \pm 0.23 \Rightarrow [7.62,\ 8.08] \]
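The calculation above can be reproduced with a short script. This is a minimal sketch using only the numbers given in the case study. The critical value 1.98 is taken from the text rather than computed, because the Python standard library has no t-distribution quantile function.

```python
import math

# Sample statistics from the case study.
n = 120
x_bar = 7.85
s = 1.25

# Critical t-value for 95% confidence and df = 119, as quoted in the text.
t_critical = 1.98

standard_error = s / math.sqrt(n)
margin_of_error = t_critical * standard_error
ci_lower = x_bar - margin_of_error
ci_upper = x_bar + margin_of_error

print("SE:", round(standard_error, 3))
print("ME:", round(margin_of_error, 2))
print("95% CI:", (round(ci_lower, 2), round(ci_upper, 2)))
```

Keeping the unrounded values matters here: the lower bound is slightly above 7.62 before rounding, which is relevant when it is compared with a reference value of 7.62.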

Interpretation

  • Interpretation

With 95% confidence, the average customer satisfaction score for the Professional population lies between 7.62 and 8.08. This means that if this study were repeated many times, approximately 95% of the resulting intervals would contain the true average satisfaction score for the Professional population.

  • Additional Notes

The general population mean (7.62) falls right at the lower end of the interval.

This indicates that Professional customers tend to have higher satisfaction, although the evidence is borderline because the reference value sits almost exactly on the interval’s lower bound.

Statistical Inference

Case Study

Do customers in the Professional job category have a higher level of satisfaction than the average customer population?

  • Hypothesis:

\[H_0: \mu_{\text{Prof}} = \mu_{\text{Pop}} \quad \text{vs} \quad H_1: \mu_{\text{Prof}} > \mu_{\text{Pop}}\]

A one-tailed (right-tailed) test is used because the question is whether satisfaction is “higher”.

Data and Assumptions

  • Sample statistics (Professional): \(n = 120,\ \bar{x} = 7.85,\ s = 1.25.\)

  • Population mean (of the entire sample) \(\mu_{\text{Pop}} = 7.62.\)

  • Key assumptions:

    • Independence among respondents.

    • Random sample of Professional customers.

    • The t-approximation is valid because n is large enough; robust to minor deviations from normality.

Test Selection

  • Test Selection and Test Statistics

Test Used: One-Sample t-test against the reference value \(\mu_{\text{Pop}} = 7.62\).

Test Statistics:

\[t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{7.85 - 7.62}{1.25/\sqrt{120}}\]

\[s/\sqrt{n} = \frac{1.25}{\sqrt{120}} \approx 0.114, \quad t \approx \frac{0.23}{0.114} \approx 2.02\]

Degrees of freedom: \(\text{df} = n - 1 = 119\)

Test decision (α = 0.05)

  • One-tailed test: compare t with \(t_{0.95, 119} \approx 1.66\)

  • Since \(2.02 > 1.66, \text{ Reject } H_0\)

  • p-value (one-tailed): \(\approx 0.022 \to < 0.05\)

  • \(( \text{If two-tailed: } p \approx 0.045, \text{ remains significant at } 5\%).\)
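The test statistic and decision above can be re-derived with a few lines of arithmetic. This sketch uses only the figures in the case study. The one-tailed critical value 1.66 is taken from the text rather than computed, and no p-value is calculated because the standard library does not provide the t-distribution.

```python
import math

# Figures from the case study.
n = 120
x_bar = 7.85
s = 1.25
mu_0 = 7.62  # reference (population) mean

# One-tailed critical value for alpha = 0.05 and df = 119, as quoted in the text.
t_critical = 1.66

standard_error = s / math.sqrt(n)
t_statistic = (x_bar - mu_0) / standard_error
degrees_of_freedom = n - 1

# Right-tailed test: reject H0 when t exceeds the critical value.
reject_h0 = t_statistic > t_critical

print("t:", round(t_statistic, 2))
print("df:", degrees_of_freedom)
print("Reject H0:", reject_h0)
```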

Estimate

Estimated effects and practical significance

  • Cohen’s d:

\[d = \frac{\bar{x} - \mu_0}{s} = \frac{0.23}{1.25} \approx 0.18\]

Small effect: statistically significant, but the increase in satisfaction is modest in practical terms.

  • Confidence interval for the difference in means (two-tailed, 95%):

\[( \bar{x} - \mu_0 ) \pm t_{0.975, 119} \cdot \frac{s}{\sqrt{n}} = 0.23 \pm 1.98 \cdot 0.114\]

\[\approx 0.23 \pm 0.23 \implies [0.00, 0.46]\]

The lower bound is close to zero, which is consistent with a small effect.
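The effect size and the interval for the difference can be checked together. This is a small sketch based on the case-study figures, with the two-sided critical value 1.98 taken from the text. It keeps the unrounded bounds so that the position of the lower bound relative to zero is visible.

```python
import math

# Figures from the case study.
n = 120
x_bar = 7.85
s = 1.25
mu_0 = 7.62
t_critical_two_sided = 1.98  # quoted in the text for df = 119

difference = x_bar - mu_0

# Cohen's d for a one-sample comparison: mean difference in standard-deviation units.
cohens_d = difference / s

# 95% confidence interval for the difference in means.
margin_of_error = t_critical_two_sided * s / math.sqrt(n)
diff_lower = difference - margin_of_error
diff_upper = difference + margin_of_error

print("Cohen's d:", round(cohens_d, 2))
print("95% CI for the difference:", (round(diff_lower, 3), round(diff_upper, 3)))
```

Before rounding, the lower bound is a small positive number, which agrees with the two-tailed p-value being just under 0.05.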

Robustness check (brief)

  • Normality: with n = 120, the t-test is fairly robust to mild departures from normality.

  • Outliers: check the boxplot/histogram of Professional satisfaction scores; if there are strong outliers, consider a nonparametric test (discussed separately).

  • Equality of variance is not relevant for the one-sample t-test, but is important when moving on to comparing two independent groups.

Inferential

  • Inferential Conclusion

With a one-tailed test at α = 0.05, the average satisfaction of Professional customers is significantly higher than the population average (t ≈ 2.02, p ≈ 0.022). While significant, the effect size is small (d ≈ 0.18), so practical implications need to be considered in conjunction with the business context (e.g., service segmentation or loyalty programs targeting Professionals).

  • Interpretation

Mean customer satisfaction among Professionals is significantly higher than the population average. However, the effect size is relatively small (Cohen’s d ≈ 0.18), so while the result is statistically significant, the increase in satisfaction is modest in practical terms.

Nonparametric Methods

Case Study

Do customers in the Professional job category have a higher level of satisfaction than the average customer population?

  • Method Used

The method used was the Wilcoxon Signed-Rank Test, a nonparametric test used to compare the median of a sample with a specific reference value. This test was chosen because customer satisfaction data is not assumed to be normally distributed.

  • Hypothesis

\(H_0:\ \tilde{X}_{\text{Professional}} \le \tilde{X}_{\text{Population}}\)

\(H_1:\ \tilde{X}_{\text{Professional}} > \tilde{X}_{\text{Population}}\)

\(\alpha = 0.05\)

Calculation and Conclusion

  • Calculation Steps

The difference is calculated using

\[d_i = X_i - \tilde{X}_0\]

Observations with \(d_i = 0\) are excluded from the Wilcoxon Signed-Rank analysis, and the absolute differences \(|d_i|\) are then ranked. The sum of the ranks belonging to positive differences is used as the Wilcoxon test statistic.

  • Calculation Results

\[\tilde{X}_{\text{Population}} = 3.7383\]

\[\tilde{X}_{\text{Professional}} = 4.5465\]

\[W = 7\]

\[p\text{-value} = 0.3125\]

  • Test Decision

\[p\text{-value} > \alpha \Rightarrow \text{fail to reject } H_0\]

  • Conclusion

Based on the Wilcoxon Signed-Rank test, the statistical value obtained was \(W = 7\) with \(p\text{-value} = 0{.}3125\). Since \(p\text{-value} > \alpha = 0{.}05\), there is insufficient statistical evidence to state that customers in the Professional job category have a higher level of satisfaction than the overall customer population.
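The calculation steps described above can be sketched as an exact right-tailed Wilcoxon Signed-Rank test. The `sample` values are hypothetical and are not the Professional customers' scores; only the reference median 3.7383 comes from the text. The function name is illustrative, and the sketch assumes there are no tied absolute differences. The exact p-value is obtained by enumerating every possible sign assignment, which is only practical for small samples.

```python
from itertools import product


def wilcoxon_signed_rank_greater(sample, reference):
    """Return (W+, exact one-sided p-value) for H1: median > reference.

    Assumes no ties among the absolute differences.
    """
    # Step 1: differences from the reference value, dropping zeros.
    diffs = [x - reference for x in sample if x != reference]
    n = len(diffs)

    # Step 2: rank the absolute differences (1 = smallest).
    ordered = sorted(diffs, key=abs)

    # Step 3: W+ is the sum of ranks attached to positive differences.
    w_plus = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)

    # Step 4: under H0 each rank is positive or negative with equal probability,
    # so count the sign patterns whose W+ is at least as large as the observed one.
    at_least_as_large = 0
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, sign in enumerate(signs, start=1) if sign)
        if w >= w_plus:
            at_least_as_large += 1
    return w_plus, at_least_as_large / 2 ** n


# Hypothetical satisfaction scores; the reference median is the value quoted in the text.
sample = [4.2, 4.9, 3.5, 4.6]
w_plus, p_value = wilcoxon_signed_rank_greater(sample, 3.7383)

print("W+:", w_plus)
print("Exact one-sided p-value:", p_value)
print("Reject H0 at alpha = 0.05:", p_value < 0.05)
```

With so few observations, the smallest p-value the test can produce is 1/16, so a sample of this size cannot reach significance at the 5% level whatever the data look like.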