16 November 2024

Why Statistics?

  • 147 zettabytes (21 zeroes!) of data will be generated in 2024 (Duarte, 2024)
  • Without the tools to understand that data, gathering and storing it is useless
  • Statistics is “the science concerned with developing and studying methods for collecting, analyzing, interpreting and presenting empirical data” (UC Irvine, n.d.)
  • Inferential statistics allows us to draw conclusions from a sample and generalize them to the whole population
    • Population measures = Parameters and Sample measures = Statistics
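The parameter/statistic distinction can be sketched in a few lines of R. The population here is simulated, and its mean of 100, standard deviation of 15, sample size, and seed are made-up values for illustration only.

```r
# Simulated "population" (hypothetical values, for illustration only)
set.seed(1)
population <- rnorm(10000, mean = 100, sd = 15)

# Parameter: a measure computed on the whole population
mu <- mean(population)

# Statistic: the same measure computed on a sample drawn from it
sample_values <- sample(population, size = 100)
x_bar <- mean(sample_values)

# The statistic estimates the parameter but will rarely match it exactly
c(parameter = mu, statistic = x_bar)
```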

Figure: (Swain, 2023)

What is Hypothesis Testing?

  • Hypothesis = an assumption about the data. Do you think you know what the data is telling you?
  • Hypothesis Testing = the method used to (1) evaluate results and (2) determine if they are meaningful (Pace, 2023). Did you get the results because of a specific cause, or by chance?

Figure: (Srivastava, 2023)

Inference and Probability

  1. Sample the Population (Collect the Data)
  2. Analyze and Summarize
  3. Draw Conclusions by estimating the sampling error and comparing the resulting probability to alpha, the significance level chosen in advance
  4. State your conclusion, e.g., We are 90% sure the percentage of people who [do this] is within 2 percentage points of 45%.
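A minimal sketch of steps 2 to 4 for a proportion, assuming a hypothetical survey in which 760 of 1,690 respondents answered "yes". The counts are invented so that the result lands near the 45% and 2-point figures in the example conclusion, and the normal approximation used here is one common choice, not the only one.

```r
# Hypothetical survey counts (made-up numbers for illustration)
n   <- 1690
yes <- 760

# Step 2: summarize the sample
p_hat <- yes / n

# Step 3: estimate the sampling error and pick a confidence level
conf_level <- 0.90
z_crit     <- qnorm(1 - (1 - conf_level) / 2)  # critical value for 90%
std_error  <- sqrt(p_hat * (1 - p_hat) / n)    # standard error of a proportion
margin     <- z_crit * std_error               # margin of error

# Step 4: state the conclusion as an interval around the estimate
ci <- c(lower = p_hat - margin, upper = p_hat + margin)
round(c(estimate = p_hat, margin = margin, ci), 3)
```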

Figure: (Open Learning Initiative, n.d.)

Demonstration

  • Hypothesis testing can be demonstrated with the ChickWeight dataset, which ships with R and is used here in RStudio
    • 578 rows (Observations)
    • 4 columns (variables)
      • weight - numeric; body weight in grams
      • Time - numeric; number of days since birth when the measurement was made
      • Chick - ordered factor; unique identifier for each chick
      • Diet - factor; which one of 4 diets the chick received
head(ChickWeight)
Grouped Data: weight ~ Time | Chick
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1
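The row, column, and type descriptions above can be checked directly. This is a small sketch using only base R functions; the variable names `dims`, `column_types`, and `n_diets` are chosen here for illustration.

```r
# ChickWeight is part of R's built-in datasets package
data(ChickWeight)

# Number of rows (observations) and columns (variables)
dims <- dim(ChickWeight)
dims

# Primary class of each column
column_types <- sapply(ChickWeight, function(column) class(column)[1])
column_types

# Number of distinct diets
n_diets <- nlevels(ChickWeight$Diet)
n_diets
```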

Analysis of Variance (ANOVA)

\(\displaystyle F = {MS_t \over MS_e} = {Mean \; Square \; Treatment \over Mean \; Square \; Error} = {Variance \; Between \; Groups \over Variance \; Within \; Groups}\)

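  • ANOVA tests for statistically significant differences among the means of 3 or more groups
  • A high F-statistic (F) means the variation between groups is large relative to the variation within groups; the p-value, Pr(>F), tells us whether the difference is statistically significant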
anova_result = aov(weight ~ as.factor(Diet), data = ChickWeight)
summary(anova_result)
                 Df  Sum Sq Mean Sq F value   Pr(>F)    
as.factor(Diet)   3  155863   51954   10.81 6.43e-07 ***
Residuals       574 2758693    4806                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
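  • The ANOVA test on ChickWeight (summary above) has an F value of 10.81 with a p-value of 6.43e-07
  • Because the p-value is far below a typical alpha of 0.05, we reject the null hypothesis that mean weight is the same across all diets (there is a difference)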
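To connect the F formula to the summary table, the same ratio can be computed by hand. This is a sketch of the one-way calculation with variable names chosen here for illustration; it should reproduce the sums of squares and the F value that aov() reports.

```r
diet   <- ChickWeight$Diet
weight <- ChickWeight$weight

k <- nlevels(diet)     # number of groups (diets)
n <- length(weight)    # total number of observations

grand_mean  <- mean(weight)
group_means <- tapply(weight, diet, mean)
group_sizes <- tapply(weight, diet, length)

# Between-group (treatment) and within-group (error) sums of squares
ss_treatment <- sum(group_sizes * (group_means - grand_mean)^2)
ss_error     <- sum((weight - ave(weight, diet))^2)  # ave() gives each row its group mean

# Mean squares: sums of squares divided by their degrees of freedom
ms_treatment <- ss_treatment / (k - 1)
ms_error     <- ss_error / (n - k)

# F statistic and its upper-tail probability
f_value <- ms_treatment / ms_error
p_value <- pf(f_value, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

c(F = f_value, p = p_value)
```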

3D Scatter Plot

  • A scatter plot can be used to see the difference(s) indicated by the ANOVA
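The original 3D figure is not reproduced here. As a stand-in, the following base R sketch draws a 2D scatter plot of weight against time with one color per diet; the palette, labels, and title are arbitrary choices, not the ones used in the presentation.

```r
# One color per diet (arbitrary palette)
diet_colors <- c("steelblue", "darkorange", "forestgreen", "firebrick")

# Each point is one measurement; color encodes the diet
plot(ChickWeight$Time, ChickWeight$weight,
     col  = diet_colors[as.integer(ChickWeight$Diet)],
     pch  = 16,
     xlab = "Time (days since birth)",
     ylab = "Weight (g)",
     main = "Chick weight over time by diet")

legend("topleft",
       legend = paste("Diet", levels(ChickWeight$Diet)),
       col    = diet_colors,
       pch    = 16)
```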

Line Diagram

  • A line diagram can graphically represent the data in a different way.
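One way to build such a line diagram in base R is to average the weights for each combination of day and diet and draw one line per diet. This is a sketch under that assumption; the presentation's own figure may have been produced differently.

```r
# Mean weight for every (Time, Diet) combination: rows are days, columns are diets
mean_by_time <- tapply(ChickWeight$weight,
                       list(ChickWeight$Time, ChickWeight$Diet),
                       mean)
days <- as.numeric(rownames(mean_by_time))

# One line per diet
matplot(days, mean_by_time,
        type = "l", lty = 1, lwd = 2, col = 1:4,
        xlab = "Time (days since birth)",
        ylab = "Mean weight (g)",
        main = "Mean chick weight over time by diet")

legend("topleft",
       legend = paste("Diet", colnames(mean_by_time)),
       col = 1:4, lty = 1, lwd = 2)
```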

Bar Chart

  • A bar chart visually compares the categories rather than showing the trend over time.
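A bar chart of this kind can be sketched in base R by averaging weight within each diet across all measurement days. The averaging choice and the labels are assumptions made for illustration.

```r
# Mean weight per diet, pooled over all measurement days
mean_by_diet <- tapply(ChickWeight$weight, ChickWeight$Diet, mean)

# One bar per diet
barplot(mean_by_diet,
        names.arg = paste("Diet", names(mean_by_diet)),
        ylab = "Mean weight (g)",
        main = "Mean chick weight by diet")
```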

Conclusion

  • Hypothesis testing can seem daunting, especially if you just look at Kumar’s formula (2023)

\(\displaystyle z = \dfrac{\overline{x} - \mu}{\sigma / \sqrt{n}}\)

  • However, this presentation introduced you to a number of painless tools you can use to evaluate data without worrying about the math
  • The ANOVA for ChickWeight told us there was a difference, and graphing the data illustrated it
  • Whether you are numerically or visually driven, these tools can help you dissect your data and make it meaningful

References

Duarte, F. (June 13, 2024). Amount of Data Created Daily (2024). Exploding Topics. https://explodingtopics.com/blog/data-generated-per-day

Kumar, M. (November 2, 2023). Hypothesis Testing Formula, Definition, Solved Examples. PW. https://www.pw.live/exams/school/hypothesis-testing-formula/#:~:text=Hypothesis%20Testing%20Formula%3A%20z%3D%20%E2%80%8B,n%20is%20the%20sample%20size

Open Learning Initiative. (n.d.). Module 7: Linking Probability to Statistical Inference. https://courses.lumenlearning.com/wm-concepts-statistics/chapter/wim-linking-probability-to-statistical-inference/

Pace, K. (September 8, 2023). How the Field of Statistics Is Used in Data Analytics. Western Governors University. https://www.wgu.edu/blog/how-field-statistics-used-data-analytics2309.html#:~:text=Statistics%20enables%20data%20scientists%20to,on%20a%20sample%20of%20data

Srivastava, A. (October 23, 2023). Hypothesis Testing: Your Data’s Silent Judge. Medium.com. https://medium.com/@akashsri306/hypothesis-testing-your-datas-silent-judge-6b2503cbdf92

Swain, A. (March 10, 2023). The What And Why Of Hypothesis Testing. Medium.com. https://medium.com/@avijitswain11/the-what-and-why-of-hypothesis-testing-bd4f6b7f2005#:~:text=If%20we%20want%20to%20use,make%20conclusions%20about%20the%20population

UC Irvine. (n.d.). What is Statistics? Department of Statistics. https://www.stat.uci.edu/what-is-statistics/