Interpretation
Basic visualization is the cornerstone of effective data communication.
Using the ggplot2 framework in R, we move away from static, rigid
charting toward a “Grammar of Graphics” approach. This allows plots to be
built modularly: data, aesthetics, and geometric objects are layered to
reveal patterns within a dataset.
The three most relevant visualizations identified (Histograms, Boxplots, and Scatter Plots) each serve a distinct analytical purpose. Histograms are essential for understanding the distribution of a single variable, showing whether the data are concentrated in one area or spread out. Boxplots display the “Five-Number Summary” (minimum, first quartile, median, third quartile, and maximum), making them a standard choice for comparing groups and identifying outliers that could skew later modeling. Finally, Scatter Plots allow for the exploration of relationships, helping us see whether two variables move together (correlation) or show no apparent association.
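To make the Five-Number Summary concrete, the following is a minimal Python sketch (the report itself works in R with ggplot2). The `values` list is hypothetical illustration data, not the report's dataset, and the 1.5 × IQR fence is the common Tukey convention for flagging outliers, assumed here rather than taken from the report.

```python
import statistics

# Hypothetical data for illustration only (not the report's dataset).
values = [12, 15, 17, 19, 21, 22, 24, 26, 30, 95]

# Quartiles with the "inclusive" method (linear interpolation between data points).
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
five_number = (min(values), q1, median, q3, max(values))

# Tukey fences: points farther than 1.5 * IQR from the quartiles are flagged.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("Five-number summary:", five_number)
print("Flagged outliers:", outliers)
```

In this sketch the single large value is flagged, which is the kind of point a boxplot would draw individually beyond the whiskers.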
In conclusion, these techniques fall under the umbrella of Descriptive Statistics. They aim to solve the problem of data opacity by providing a clear, visual summary of the data’s central tendency, dispersion, and outliers. Mastering these basics ensures that any subsequent complex analysis is built on a solid, well-understood foundation of the underlying data structure.
Interpretation
The analysis of Central Tendency for the Customer Purchase Data
highlights the fundamental importance of looking beyond simple averages.
In statistics, the primary problem we solve is summarization without
distortion.
For the Age variable, the close agreement between the Mean and Median tells us that our “typical” customer profile is stable and middle-aged. However, the Total Purchase variable tells a different story. Because the Mean is nearly double the Median, we can conclude that the distribution is right-skewed and heavily influenced by high-value transactions. This skewness suggests that a small number of big spenders pull the average up, while the majority of customers (as reflected by the Median and Mode) spend significantly less.
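The effect of a few large transactions on the Mean can be illustrated with a short Python sketch. The `purchases` list is hypothetical and was chosen only so that the Mean is about twice the Median; it is not the Customer Purchase Data.

```python
import statistics

# Hypothetical purchase amounts: most are small, two are large.
purchases = [20, 25, 25, 30, 35, 40, 45, 50, 150, 330]

mean_purchase = statistics.mean(purchases)
median_purchase = statistics.median(purchases)
mode_purchase = statistics.mode(purchases)

print("Mean:", mean_purchase)
print("Median:", median_purchase)
print("Mode:", mode_purchase)

# A Mean well above the Median is a simple signal of right skew.
print("Mean / Median ratio:", mean_purchase / median_purchase)
```

Removing the two largest values from this list would bring the Mean close to the Median, which is why the Median is the more robust description of the typical customer here.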
By applying Descriptive Statistics, we transform 200 rows of chaos into a clear narrative: the store has a mature average audience but thrives on rare, high-ticket electronics purchases that shift the financial center of gravity.
Interpretation
The shape of the distribution indicates a positive
skew, suggesting that while the median purchase volume is stable, there
is a sizable cohort of high-frequency buyers driving the mean
upward. The Rug Plot at the base shows the individual observations and
supports the estimated density, indicating that the smoothed curve follows
the actual transactional clusters. This visualization suggests that our
customer base is not monolithic, but composed of distinct behavioral
segments that can be targeted differently.
Interpretation
The Age Distribution Analysis identifies a bimodal
demographic, with primary engagement peaks among working professionals
(30–40) and a secondary mature segment (50–60). Despite a stable central
tendency near the early 40s, the Cumulative Distribution reveals a
steady progression toward an older membership base. The density
“valleys” between these peaks suggest a need to move away from broad
marketing in favor of a bifurcated strategy that specifically targets
high-performance careerists and longevity-focused adults.
Do customers in the Professional job category have a higher level of satisfaction than the average customer population?
To estimate the mean satisfaction of customers in the Professional job category (μ) from sample data, at a 95% confidence level.
Number of respondents (n) = 120
Mean satisfaction (x̄) = 7.85
Sample standard deviation (s) = 1.25
Confidence Level = 95%
Significance level (α) = 0.05
The t-distribution is used because the population standard deviation is unknown. The general formula is:
\[ CI = \bar{x} \pm t_{\alpha/2, df} \cdot \left( \frac{s}{\sqrt{n}} \right) \]
Degrees of freedom (df) = n - 1 = 119
Critical t-value: \(t_{0.025, 119} \approx 1.98\)
Standard Error (SE):
\[ SE = \frac{s}{\sqrt{n}} = \frac{1.25}{\sqrt{120}} \approx 0.114 \]
Margin of Error (ME):
\[ ME = t \cdot SE = 1.98 \times 0.114 \approx 0.23 \]
Confidence Interval:
\[ CI = \bar{x} \pm ME = 7.85 \pm 0.23 \Rightarrow [7.62,\ 8.08] \]
The mean customer satisfaction score for Professionals is estimated to lie between 7.62 and 8.08 at a 95% confidence level. This means that if this study were repeated many times, approximately 95% of the resulting intervals would contain the true mean satisfaction score for the Professional population.
The general population mean (7.62) falls right at the lower bound of the interval.
This indicates that Professional customers tend to have higher satisfaction, although on the basis of this two-sided interval alone the difference is only borderline at the 5% level.
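The interval arithmetic above can be checked with a short Python sketch. It uses only the summary statistics stated in the text and takes the critical value 1.98 as given rather than computing it; the function name `t_confidence_interval` is a hypothetical helper introduced for illustration.

```python
import math

# Summary statistics as stated in the text.
n = 120
mean = 7.85
sd = 1.25
t_crit = 1.98  # critical t-value for df = 119, taken from the text

def t_confidence_interval(mean, sd, n, t_crit):
    """Return (standard error, margin of error, (lower, upper))."""
    se = sd / math.sqrt(n)   # standard error of the mean
    me = t_crit * se         # margin of error
    return se, me, (mean - me, mean + me)

se, me, (lower, upper) = t_confidence_interval(mean, sd, n, t_crit)

print(f"SE = {se:.3f}")
print(f"ME = {me:.2f}")
print(f"95% CI = [{lower:.2f}, {upper:.2f}]")
```

Before rounding, the lower bound is slightly above 7.62, which is why the comparison with the population mean is so close to the boundary.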
Do customers in the Professional job category have a higher level of satisfaction than the average customer population?
\[H_0: \mu_{\text{Prof}} = \mu_{\text{Pop}} \quad \text{vs} \quad H_1: \mu_{\text{Prof}} > \mu_{\text{Pop}}\]
A one-tailed test is used because the question is whether satisfaction is “higher”.
Data and Assumptions
Sample statistics (Professional): \(n = 120,\ \bar{x} = 7.85,\ s = 1.25\).
Population mean (of the entire sample): \(\mu_{\text{Pop}} = 7.62\).
Key assumptions:
Independence among respondents.
Random sample of Professional customers.
The t-approximation is valid because n is large enough; robust to minor deviations from normality.
Test Used: One-Sample t-test against the reference value \(\mu_{\text{Pop}} = 7.62\).
Test statistic:
\[t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{7.85 - 7.62}{1.25/\sqrt{120}}\]
\[s/\sqrt{n} = \frac{1.25}{\sqrt{120}} \approx 0.114, \quad t \approx \frac{0.23}{0.114} \approx 2.02\]
Degrees of freedom: \(\text{df} = n - 1 = 119\)
Test decision (α = 0.05)
One-tailed test: compare t with the critical value \(t_{0.05, 119} \approx 1.66\)
Since \(2.02 > 1.66\), reject \(H_0\).
p-value (one-tailed): \(p \approx 0.022 < 0.05\)
(If two-tailed: \(p \approx 0.045\), which remains significant at the 5% level.)
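The test statistic and decision can be reproduced with a minimal Python sketch using only the standard library. The critical value 1.66 is taken from the text. The p-value is approximated by numerically integrating the Student t density with Simpson's rule, which is an illustrative shortcut assumed here, not the method used in the report; `one_sample_t`, `t_pdf`, and `upper_tail_p` are hypothetical helper names.

```python
import math

def one_sample_t(mean, mu0, sd, n):
    """t statistic for a one-sample test against the reference value mu0."""
    return (mean - mu0) / (sd / math.sqrt(n))

def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail_p(t, df, steps=2000):
    """Approximate P(T > t) using Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    tail = 0.5 - total * h / 3
    return tail if t >= 0 else 1 - tail

# Values stated in the text.
n, mean, sd, mu0 = 120, 7.85, 1.25, 7.62
df = n - 1
t_crit_one_tailed = 1.66  # from the text

t_stat = one_sample_t(mean, mu0, sd, n)
p_one_tailed = upper_tail_p(t_stat, df)
reject_h0 = t_stat > t_crit_one_tailed

print(f"t = {t_stat:.2f}, df = {df}")
print(f"one-tailed p = {p_one_tailed:.3f}")
print("Reject H0:", reject_h0)
```

The computed p-value may differ from the reported one in the third decimal place because the text rounds the t statistic before looking up the probability.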
Estimated effects and practical significance
\[d = \frac{\bar{x} - \mu_0}{s} = \frac{0.23}{1.25} \approx 0.18\]
Small effect: statistically significant, but the increase in satisfaction is modest in practical terms.
95% confidence interval for the mean difference:
\[( \bar{x} - \mu_0 ) \pm t_{0.025, 119} \cdot \frac{s}{\sqrt{n}} = 0.23 \pm 1.98 \cdot 0.114 \approx 0.23 \pm 0.23 \implies [0.00, 0.46]\]
The lower bound is close to zero, consistent with a small effect.
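As a quick check of the effect-size figures, the following Python sketch recomputes Cohen's d and the interval for the mean difference from the values stated in the text. The critical value 1.98 is taken as given, and the variable names are illustrative.

```python
import math

# Values stated in the text.
n, mean, sd, mu0 = 120, 7.85, 1.25, 7.62
t_crit_two_sided = 1.98  # from the text

diff = mean - mu0                 # observed mean difference
cohens_d = diff / sd              # standardized effect size
se = sd / math.sqrt(n)            # standard error of the mean
diff_ci = (diff - t_crit_two_sided * se, diff + t_crit_two_sided * se)

print(f"Mean difference = {diff:.2f}")
print(f"Cohen's d = {cohens_d:.2f}")
print(f"95% CI for the difference = [{diff_ci[0]:.2f}, {diff_ci[1]:.2f}]")
```

The unrounded lower bound is positive but very small, which matches the reading that the effect is real but slight.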
Robustness check (brief)
Mild non-normality: with n = 120, the t-test is quite robust.
Outliers: check the boxplot/histogram of Professional satisfaction scores; if there are strong outliers, consider a nonparametric test (discussed separately).
Equality of variance is not relevant for the one-sample t-test, but is important when moving on to comparing two independent groups.
With a one-tailed test at α = 0.05, the average satisfaction of Professional customers is significantly higher than the population average (t ≈ 2.02, p ≈ 0.022). Although the result is significant, the effect size is small (d ≈ 0.18), so practical implications need to be considered in conjunction with the business context (e.g., service segmentation or loyalty programs targeting Professionals).
In summary, mean customer satisfaction among Professionals is significantly higher than the population average. However, the effect size is small (Cohen’s d ≈ 0.18), so although the difference is statistically significant, its practical importance is limited.
Do customers in the Professional job category have a higher level of satisfaction than the average customer population?
The method used was the Wilcoxon Signed-Rank Test, a nonparametric test used to compare the median of a sample with a specific reference value. This test was chosen because customer satisfaction data is not assumed to be normally distributed.
\(H_0: \tilde{X}_{\text{Professional}} \le \tilde{X}_{\text{Population}}\)
\(H_1: \tilde{X}_{\text{Professional}} > \tilde{X}_{\text{Population}}\)
\(\alpha = 0.05\)
The difference is calculated using
\[d_i = X_i - \tilde{X}_0\]
Observations with \(d_i = 0\) are excluded from the Wilcoxon Signed-Rank analysis, and then the absolute values of \(|d_i|\) are ranked. The sum of the positive signed ranks is used as the Wilcoxon test statistic.
\[\tilde{X}_{\text{Population}} = 3.7383\]
\[\tilde{X}_{\text{Professional}} = 4.5465\]
\[W = 7\]
\[p\text{-value} = 0.3125\]
\[p\text{-value} > \alpha \Rightarrow \text{fail to reject } H_0\]
Based on the Wilcoxon Signed-Rank test, the test statistic obtained was \(W = 7\) with \(p\text{-value} = 0.3125\). Since \(p\text{-value} > \alpha = 0.05\), there is insufficient statistical evidence to state that customers in the Professional job category have a higher level of satisfaction than the overall customer population.
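The signed-rank procedure described above (compute differences, drop zeros, rank the absolute differences, sum the positive ranks) can be sketched in Python as follows. The `sample` values are hypothetical and are not the Professional satisfaction scores, so the output is not expected to match the reported W and p-value; only the reference median 3.7383 comes from the text. The exact p-value is obtained by enumerating every sign pattern, which is practical only for small samples.

```python
from itertools import product

def signed_rank_statistic(sample, reference):
    """Return (W_plus, ranks) for a one-sample Wilcoxon signed-rank test."""
    # Differences from the reference value; zero differences are excluded.
    diffs = [x - reference for x in sample if x != reference]
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # The statistic is the sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    return w_plus, ranks

def exact_upper_p(w_plus, ranks):
    """Exact one-sided p-value P(W+ >= w_plus) under H0, by enumeration."""
    count = 0
    total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        total += 1
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus - 1e-9:
            count += 1
    return count / total

# Hypothetical satisfaction scores for illustration only.
sample = [4.1, 5.0, 3.2, 4.9, 4.6]
reference_median = 3.7383  # population median stated in the text

w_plus, ranks = signed_rank_statistic(sample, reference_median)
p_value = exact_upper_p(w_plus, ranks)

print("W+ =", w_plus)
print("one-sided exact p =", p_value)
```

With only a handful of observations the exact test has few attainable p-values, so even a sample that sits mostly above the reference may not reach the 5% level.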