2024-10-09

An Intro to Statistics in Data Analytics

In a data-driven world, a Data Analyst transforms raw data into meaningful, actionable insights using statistical methods and models.

Analysts use Statistics to identify patterns, predict trends, and test hypotheses, then make informed decisions based on the data.

Statistics Is Used in Data Analytics to:

  • Test Hypotheses
  • Create Probability Distributions
  • Build Algorithms
  • Improve Business Insights

Hypothesis Testing

Using hypothesis testing, analysts can determine whether the insights derived from data are due to chance or
can be attributed to a specific cause.

Scenario:
Suppose we have quality test scores from two product lines:
one group used the traditional formulation,
and the other group used a new sustainable formula.
Given a Null Hypothesis, \(H_0\), and an Alternative Hypothesis, \(H_1\),
we can test whether the new formula significantly improves scores.

Steps in Hypothesis Testing

  1. Formulate Hypotheses
    Null Hypothesis \(H_0\): Sustainable formula does not improve test scores
    (mean difference = 0).
    Alternative Hypothesis \(H_1\): Sustainable formula improves test scores
    (mean difference > 0).

  2. Collect Data

  3. Choose a Significance Level
    Commonly used significance level: \(\alpha = 0.05\)

  4. Perform the Test
    We’ll use a one-tailed Welch two-sample t-test (the default for R's t.test) to compare the means of the two groups.

# Data
traditional <- c(78, 85, 82, 88, 75, 79, 83, 91, 77, 84)
sustainable_formula <- c(85, 89, 92, 95, 88, 90, 94, 96, 91,
    93)

# Perform t-test
t_test_result <- t.test(sustainable_formula, traditional, alternative = "greater")

# Print results
print(t_test_result)
    Welch Two Sample t-test

data:  sustainable_formula and traditional
t = 4.7259, df = 15.77, p-value = 0.0001186
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 5.735182      Inf
sample estimates:
mean of x mean of y
     91.3      82.2

Because the p-value (0.0001186) is far below the significance level \(\alpha = 0.05\),
we reject the Null Hypothesis \(H_0\). That is, there is strong evidence that the
Sustainable formula improves test scores (sample means of 91.3 versus 82.2).
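As a rough cross-check, the figures reported by the test can be reproduced by hand. The sketch below assumes the same two score vectors and the \(\alpha = 0.05\) level from step 3; the variable names (v_sus, v_trad, t_stat, df_welch, reject_h0) are illustrative and not part of the original analysis.

```r
# Same data as in the t-test above
traditional <- c(78, 85, 82, 88, 75, 79, 83, 91, 77, 84)
sustainable_formula <- c(85, 89, 92, 95, 88, 90, 94, 96, 91, 93)

alpha <- 0.05  # significance level chosen in step 3

# Variance of each group mean (s^2 / n)
v_sus <- var(sustainable_formula) / length(sustainable_formula)
v_trad <- var(traditional) / length(traditional)

# Welch t statistic: difference in means divided by its standard error
t_stat <- (mean(sustainable_formula) - mean(traditional)) / sqrt(v_sus + v_trad)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v_sus + v_trad)^2 /
    (v_sus^2 / (length(sustainable_formula) - 1) +
     v_trad^2 / (length(traditional) - 1))

# One-tailed p-value: probability of a t at least this large under H0
p_value <- pt(t_stat, df = df_welch, lower.tail = FALSE)

# Decision rule: reject H0 when the p-value falls below alpha
reject_h0 <- p_value < alpha

print(c(t = t_stat, df = df_welch, p_value = p_value))
print(reject_h0)
```

Up to rounding, t_stat, df_welch, and p_value should agree with the t, df, and p-value printed by t.test, and reject_h0 should be TRUE.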

Creating Probability Distributions

A probability distribution describes how likely each possible outcome is under specific conditions.
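To make "chances of various outcomes" concrete, one simple option is to estimate a probability directly from the sample as a proportion. The sketch below uses the two score vectors from the hypothesis test; the threshold of 85 is a hypothetical cut-off chosen only for illustration.

```r
# Same scores as in the hypothesis test
traditional <- c(78, 85, 82, 88, 75, 79, 83, 91, 77, 84)
sustainable_formula <- c(85, 89, 92, 95, 88, 90, 94, 96, 91, 93)

threshold <- 85  # hypothetical cut-off, for illustration only

# Empirical probability of scoring at or above the threshold:
# the mean of a logical vector is the share of TRUE values
p_traditional <- mean(traditional >= threshold)
p_sustainable <- mean(sustainable_formula >= threshold)

print(c(traditional = p_traditional, sustainable = p_sustainable))
```

With only 10 observations per group, these proportions are coarse estimates rather than precise probabilities.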

library(ggplot2)

# One data frame per group, labelled for the plot legend
traditional_df <- data.frame(Value = traditional, Group = "Traditional")
sustainable_formula_df <- data.frame(Value = sustainable_formula,
    Group = "Sustainable Formula")

# Kernel density estimate of the Traditional group's scores
ggplot(traditional_df, aes(x = Value, fill = Group)) +
    geom_density(alpha = 0.5) +
    labs(title = "Probability Distribution of Traditional Formula Group",
        x = "Value", y = "Density")

We can see that the probability distribution of the Traditional formula is slightly right skewed:
the scores above the mean (82.2) stretch further from it, up to 91, than the scores below it, down to 75.

# Kernel density estimate of the Sustainable Formula group's scores
ggplot(sustainable_formula_df, aes(x = Value, fill = Group)) +
    geom_density(alpha = 0.5) +
    labs(title = "Probability Distribution of Sustainable Formula Group",
        x = "Value", y = "Density")

We can see that the probability distribution of the sustainable formula is slightly left skewed:
the scores below the mean (91.3) stretch further from it, down to 85, than the scores above it, up to 96.
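The direction of skew in each group can also be checked numerically. The sketch below computes a moment-based sample skewness for both groups; sample_skewness is a hypothetical helper written for this example, not a base R function, and other skewness definitions exist.

```r
# Same scores as in the plots above
traditional <- c(78, 85, 82, 88, 75, 79, 83, 91, 77, 84)
sustainable_formula <- c(85, 89, 92, 95, 88, 90, 94, 96, 91, 93)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
# (hypothetical helper, defined only for this example)
sample_skewness <- function(x) {
    d <- x - mean(x)
    mean(d^3) / mean(d^2)^1.5
}

# Positive values indicate a longer right tail, negative a longer left tail
skew_traditional <- sample_skewness(traditional)
skew_sustainable <- sample_skewness(sustainable_formula)

print(c(traditional = skew_traditional, sustainable = skew_sustainable))
```

The values come out at roughly 0.26 for the Traditional group and roughly -0.36 for the Sustainable Formula group, so both skews are mild. With 10 observations per group they should be read as indicative only.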

Improving Business Insights

Having examined the probability distribution of each formula group,
we can overlay the two and compare them directly.

library(plotly)

# Kernel density estimates for each group
traditional_density <- density(traditional)
sustainable_formula_density <- density(sustainable_formula)

# Overlay both density curves, each filled down to the x-axis
plot_ly() %>%
    add_lines(x = ~traditional_density$x, y = ~traditional_density$y,
        name = "Traditional", line = list(color = "blue"), fill = "tozeroy",
        fillcolor = "blue") %>%
    add_lines(x = ~sustainable_formula_density$x, y = ~sustainable_formula_density$y,
        name = "Sustainable Formula", line = list(color = "red"),
        fill = "tozeroy", fillcolor = "red") %>%
    layout(title = "Probability Distributions", xaxis = list(title = "Value"),
        yaxis = list(title = "Density"))

We can see that the sustainable formula has a substantially higher mean (91.3 versus 82.2),
along with a higher median and a higher mode.
On this evidence, the company may want to consider moving forward with the new sustainable formula.
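The comparison read off the plot can be backed with a small summary table. The sketch below is one possible way to do it: summary_stats is a hypothetical helper, and the "mode" is approximated by the peak of the kernel density estimate, since no score is repeated within either group.

```r
# Same scores as in the plots above
traditional <- c(78, 85, 82, 88, 75, 79, 83, 91, 77, 84)
sustainable_formula <- c(85, 89, 92, 95, 88, 90, 94, 96, 91, 93)

# Mean, median, and location of the density peak for one group
# (hypothetical helper, defined only for this example)
summary_stats <- function(x) {
    d <- density(x)
    c(mean = mean(x),
      median = median(x),
      density_peak = d$x[which.max(d$y)])
}

# One row per group so the two formulas can be compared side by side
comparison <- rbind(summary_stats(traditional),
                    summary_stats(sustainable_formula))
rownames(comparison) <- c("Traditional", "Sustainable Formula")

print(round(comparison, 2))
```

The density peak depends on the bandwidth that density() selects, so it is best treated as an approximate location rather than an exact mode.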

Statistics in Data Analytics

Harnessing the power of statistics, a Data Analyst can conduct hypothesis
testing, create and analyze probability distributions, and formulate
business insights based on these statistical processes.