GE142 - Course Introduction

6. Statistics and Probability

Dr Robert Batzinger
Instructor Emeritus

8/15/22

1 Rubber band Bungee

1.1 No Name corrected

1.2 Ye Huiwen, Nang Num Oo, et al.

1.3 Tony, Zen Miko, et al.

1.4 James, et al.

1.5 Nichana, Sirasin, et al.

1.6 Chev, Pam, Tan

2 Basic Statistics and Probability

2.1 Descriptive statistics

  • summarizes and organizes characteristics of a data set

  • provides an understanding of the distribution, central tendency, and variability of the data

  • uses tables or graphs to visualize the data

  • Additional details:

    • https://www.scribbr.com/statistics/descriptive-statistics/

    • https://www.investopedia.com/terms/d/descriptive_statistics.asp

2.2 Dataset

  • A collection of data that is:
    • a random sample
    • an accurate representation of the population being studied

Mercury content of fish from Florida rivers or lakes (measured as mg/kg)

\[\begin{matrix} 1.230 &1.330 &0.040 &0.044 &1.200 &0.270 \\ 0.490 &0.190 &0.830 &0.810 &0.710 &0.500 \\ 0.490 &1.160 &0.050 &0.150 &0.190 &0.770 \\ 1.080 &0.980 &0.630 &0.560 &0.410 &0.730 \\ 0.590 &0.340 &0.750 &0.870 &0.560 &0.170 \\ 0.180 &0.190 &0.040 &0.490 &1.100 &0.160 \\ 0.100 &0.210 &0.860 &0.520 &0.650 &0.270 \\ 0.940 &0.400 &0.430 &0.250 &0.270 &\\ \end{matrix}\]

2.3 Frequency distribution:

  • summarizes the frequency of each value or category of a variable.
  • Works for categorical (qualitative) and numerical (quantitative) variables.
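As a minimal sketch of a frequency distribution for a numerical variable, the mercury values above can be grouped into classes and counted. The class width of 0.25 mg/kg is an assumption chosen only for illustration; the slides do not specify one.

```python
from collections import Counter

# Mercury content (mg/kg) of the 47 fish samples listed in the dataset above.
mercury = [
    1.230, 1.330, 0.040, 0.044, 1.200, 0.270,
    0.490, 0.190, 0.830, 0.810, 0.710, 0.500,
    0.490, 1.160, 0.050, 0.150, 0.190, 0.770,
    1.080, 0.980, 0.630, 0.560, 0.410, 0.730,
    0.590, 0.340, 0.750, 0.870, 0.560, 0.170,
    0.180, 0.190, 0.040, 0.490, 1.100, 0.160,
    0.100, 0.210, 0.860, 0.520, 0.650, 0.270,
    0.940, 0.400, 0.430, 0.250, 0.270,
]

BIN_WIDTH = 0.25  # assumed class width, not taken from the slides

# Map each value to the index k of its class [k*w, (k+1)*w) and count per class.
counts = Counter(int(x // BIN_WIDTH) for x in mercury)

# Print a text histogram: one row per class, one '#' per observation.
for k in sorted(counts):
    low, high = k * BIN_WIDTH, (k + 1) * BIN_WIDTH
    print(f"{low:.2f} to <{high:.2f} mg/kg: {counts[k]:2d} {'#' * counts[k]}")
```

For a categorical variable the same `Counter` can be applied directly to the category labels, with no binning step.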

2.4 Common scales used for modeling

2.8 Visualization

Bar chart

Pie chart

Histogram

2.9 Visualization

Box Plot

Violin Plot

2.10 Basic description

Quantitative Stats

  • Mean, Median, Mode
  • Distribution

Qualitative Stats

  • Distribution/ Clusters
  • Tabulation

\[\small\begin{matrix} \text{Factors} & \text{Heavy smoker} & \text{Light smoker} & \text{Non-smoker} \\ \text{Cancer} & 20 & 9 & 5 \\ \text{Cancer-free} & 40 & 30 & 60 \\ \end{matrix}\]
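The quantitative measures listed above (mean, median, mode) can be illustrated with the mercury dataset from the "Dataset" slide. This is a sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Mercury content (mg/kg) of the 47 fish samples from the "Dataset" slide.
mercury = [
    1.230, 1.330, 0.040, 0.044, 1.200, 0.270,
    0.490, 0.190, 0.830, 0.810, 0.710, 0.500,
    0.490, 1.160, 0.050, 0.150, 0.190, 0.770,
    1.080, 0.980, 0.630, 0.560, 0.410, 0.730,
    0.590, 0.340, 0.750, 0.870, 0.560, 0.170,
    0.180, 0.190, 0.040, 0.490, 1.100, 0.160,
    0.100, 0.210, 0.860, 0.520, 0.650, 0.270,
    0.940, 0.400, 0.430, 0.250, 0.270,
]

mean = statistics.mean(mercury)        # arithmetic average
median = statistics.median(mercury)    # middle value of the sorted data
modes = statistics.multimode(mercury)  # every value tied for most frequent

print(f"n      = {len(mercury)}")
print(f"mean   = {mean:.3f} mg/kg")
print(f"median = {median:.3f} mg/kg")
print(f"modes  = {sorted(modes)}")
```

A mean that sits above the median, as it does here, suggests a distribution with a longer tail toward high values.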

2.11 Basic description

2.12 Measures of variability:

  • Range: minimum to maximum

  • Quartile values (from the sorted data): \[x[n/4], x[n/2], x[3n/4]\]

  • Standard deviation: \[\sigma = \sqrt{\sum (\bar x - x)^2 / (n - 1)}\]

  • Standard error of the mean: \[SE = \sigma/\sqrt{n}\]
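The measures of variability above can be sketched for the mercury dataset as follows. Note one assumption: `statistics.quantiles` interpolates between sorted positions, so its quartiles may differ slightly from the simple index rule on the slide.

```python
import math
import statistics

# Mercury content (mg/kg) of the 47 fish samples from the "Dataset" slide.
mercury = [
    1.230, 1.330, 0.040, 0.044, 1.200, 0.270,
    0.490, 0.190, 0.830, 0.810, 0.710, 0.500,
    0.490, 1.160, 0.050, 0.150, 0.190, 0.770,
    1.080, 0.980, 0.630, 0.560, 0.410, 0.730,
    0.590, 0.340, 0.750, 0.870, 0.560, 0.170,
    0.180, 0.190, 0.040, 0.490, 1.100, 0.160,
    0.100, 0.210, 0.860, 0.520, 0.650, 0.270,
    0.940, 0.400, 0.430, 0.250, 0.270,
]

n = len(mercury)
data_range = (min(mercury), max(mercury))         # minimum to maximum
q1, q2, q3 = statistics.quantiles(mercury, n=4)   # three quartile cut points
sd = statistics.stdev(mercury)                    # sample SD, divides by n - 1
se = sd / math.sqrt(n)                            # standard error of the mean

print(f"range     = {data_range[0]:.3f} to {data_range[1]:.3f} mg/kg")
print(f"quartiles = {q1:.3f}, {q2:.3f}, {q3:.3f} mg/kg")
print(f"sd        = {sd:.3f} mg/kg")
print(f"se        = {se:.3f} mg/kg")
```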

2.13 Comparisons of subpopulations

Quantitative

  • t-test:

\[t = \frac{\text{diff}}{SE}\]

Qualitative

  • Chi-square test:

\[\chi^2 = \sum\frac{(\text{obs} - \text{exp})^2}{\text{exp}}\]
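As a worked sketch of the chi-square formula, the smoking and cancer counts from the "Basic description" slide can be tested for independence. The expected counts assume independence (row total × column total / grand total). The 5% critical value for 2 degrees of freedom, 5.991, is the standard table value.

```python
# Observed counts: columns are heavy smoker, light smoker, non-smoker.
observed = {
    "Cancer":      [20, 9, 5],
    "Cancer-free": [40, 30, 60],
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(row_totals)

# Sum (obs - exp)^2 / exp over every cell of the table.
chi2 = 0.0
for r, row in enumerate(rows):
    for c, obs in enumerate(row):
        exp = row_totals[r] * col_totals[c] / grand_total
        chi2 += (obs - exp) ** 2 / exp

df = (len(rows) - 1) * (len(col_totals) - 1)
CRITICAL_5PCT_DF2 = 5.991  # chi-square table value for df = 2, alpha = 0.05

print(f"chi-square = {chi2:.2f} on {df} degrees of freedom")
print("significant at 5%" if chi2 > CRITICAL_5PCT_DF2 else "not significant at 5%")
```

A statistic above the critical value indicates that cancer status and smoking category are unlikely to be independent in these counts.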

3 t-test

\[\displaystyle{t=\frac{\text{difference}}{\text{standard error}}}\]

3.1 One-sample t-test

\[{\displaystyle t={\frac {{\bar {x}}-\mu _{0}}{ \frac{s}{\sqrt {n}}}} = \frac{\Delta x}{\frac{\sigma}{\sqrt{n}}}}\]

\[df = n - 1\]
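A minimal sketch of the one-sample formula follows. The sample values and the reference mean of 50 are hypothetical, chosen only to make the arithmetic easy to follow. The critical value 2.365 is the standard two-tailed 5% table value for 7 degrees of freedom.

```python
import math
import statistics

sample = [52, 48, 55, 60, 47, 53, 58, 51]  # hypothetical measurements
mu_0 = 50                                   # hypothetical reference mean

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)      # sample SD, divides by n - 1
se = s / math.sqrt(n)             # standard error of the mean

t = (x_bar - mu_0) / se
df = n - 1
T_CRITICAL_DF7 = 2.365            # two-tailed, alpha = 0.05, df = 7

print(f"t = {t:.3f} on {df} degrees of freedom")
print("significant at 5%" if abs(t) > T_CRITICAL_DF7 else "not significant at 5%")
```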

3.2 Two samples of equal size and equal variance

\[{\displaystyle t={\frac {{\bar {X}}_{1}-{\bar {X}}_{2}}{s_{p}{\sqrt {\frac {2}{n}}}}}}={\frac {{\bar {X}}_{1}-{\bar {X}}_{2}}{\sqrt {\frac {s_{1}^{2}+s_{2}^{2}}{n}}}}\]

\[{\displaystyle s_{p}={\sqrt {\frac {s_{1}^{2}+s_{2}^{2}}{2}}}}\] \[df= 2n - 2\]

3.3 Two-sample t-test (equal variance, unequal sizes)

\[t = \frac{\bar x_1 - \bar x_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\]

\[s_p = \sqrt{\frac{(n_1-1)s_1^2 +(n_2-1)s_2^2}{n_1+n_2-2}}\]

\[df = n_1 + n_2 -2\]
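The pooled two-sample formulas above can be sketched as a small function. The two groups are hypothetical data and `pooled_t` is an illustrative name, not part of any library.

```python
import math
import statistics

def pooled_t(sample_1, sample_2):
    """Two-sample t statistic using the pooled standard deviation s_p."""
    n1, n2 = len(sample_1), len(sample_2)
    var_1 = statistics.variance(sample_1)  # s1^2, divides by n1 - 1
    var_2 = statistics.variance(sample_2)  # s2^2, divides by n2 - 1
    df = n1 + n2 - 2
    s_p = math.sqrt(((n1 - 1) * var_1 + (n2 - 1) * var_2) / df)
    diff = statistics.mean(sample_1) - statistics.mean(sample_2)
    t = diff / (s_p * math.sqrt(1 / n1 + 1 / n2))
    return t, df

group_1 = [52, 48, 55, 60, 47, 53, 58, 51]  # hypothetical, n1 = 8
group_2 = [45, 50, 49, 44, 52, 48]          # hypothetical, n2 = 6

t, df = pooled_t(group_1, group_2)
print(f"t = {t:.3f} on {df} degrees of freedom")
```

When both samples have the same size n, this reduces to the equal-size formula, with the denominator becoming the square root of (s1² + s2²)/n.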

3.4 Welch’s t-test

\[t = \frac{\quad\bar x_1 - \bar x_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

\[{\displaystyle \mathrm {d.f.} ={\frac {\left({\frac {s_{1}^{2}}{n_{1}}}+{\frac {s_{2}^{2}}{n_{2}}}\right)^{2}}{{\frac {\left(s_{1}^{2}/n_{1}\right)^{2}}{n_{1}-1}}+{\frac {\left(s_{2}^{2}/n_{2}\right)^{2}}{n_{2}-1}}}}} \]
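Welch's version drops the equal-variance assumption. The sketch below follows the two formulas above term by term; the data are hypothetical and `welch_t` is an illustrative name.

```python
import math
import statistics

def welch_t(sample_1, sample_2):
    """Welch's t statistic and its approximate degrees of freedom."""
    n1, n2 = len(sample_1), len(sample_2)
    v1 = statistics.variance(sample_1) / n1  # s1^2 / n1
    v2 = statistics.variance(sample_2) / n2  # s2^2 / n2
    diff = statistics.mean(sample_1) - statistics.mean(sample_2)
    t = diff / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

group_1 = [52, 48, 55, 60, 47, 53, 58, 51]  # hypothetical, n1 = 8
group_2 = [45, 50, 49, 44, 52, 48]          # hypothetical, n2 = 6

t, df = welch_t(group_1, group_2)
print(f"t = {t:.3f} with about {df:.1f} degrees of freedom")
```

The degrees of freedom are generally not a whole number, and they never exceed the pooled value n1 + n2 - 2.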

4 Minimum sample size

\[\displaystyle \left[t = \frac{\Delta x}{\frac{\sigma}{\sqrt{n}}}\right] \quad \longrightarrow\quad \left[n = \left(\frac{t \sigma}{\Delta x}\right)^2\right]\] \[n = \left(\frac{t\sigma}{M}\right)^2\]
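A short sketch of the sample-size formula follows, reading M as the acceptable margin of error (an assumption, since the slide does not define it). The inputs are illustrative: t = 1.96 is the usual large-sample value for 95% confidence, σ = 15, and M = 5.

```python
import math

def min_sample_size(t, sigma, margin):
    """Smallest whole n satisfying n >= (t * sigma / margin)^2."""
    return math.ceil((t * sigma / margin) ** 2)

# Illustrative inputs, not taken from a real study.
n = min_sample_size(t=1.96, sigma=15, margin=5)
print(f"minimum sample size: {n}")
```

Because the critical t value itself depends on n through the degrees of freedom, a small-sample calculation would look up t for the resulting n and repeat until the answer stops changing.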

5 Statistical issues

Correlation does not equal causation

Type I and Type II error

5.1 Decision making:

  • Comparison of values on a shared standard
  • Statistically significant differences
  • Weighted scores to change emphasis
  • Backpack algorithm
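The idea of weighted scores can be sketched as follows. The options, criteria, scores, and weights are all hypothetical. The point is only that changing the weights changes which option ranks first.

```python
# Hypothetical scores (0 to 10) for two options on three criteria.
scores = {
    "Option A": {"cost": 9, "quality": 5, "speed": 6},
    "Option B": {"cost": 5, "quality": 9, "speed": 7},
}

def weighted_score(option_scores, weights):
    """Weighted average of an option's scores on a shared scale."""
    total_weight = sum(weights.values())
    return sum(option_scores[k] * w for k, w in weights.items()) / total_weight

def best_option(weights):
    """Name of the option with the highest weighted score."""
    return max(scores, key=lambda name: weighted_score(scores[name], weights))

cost_focused = {"cost": 0.6, "quality": 0.2, "speed": 0.2}
quality_focused = {"cost": 0.2, "quality": 0.6, "speed": 0.2}

print("cost-focused choice:   ", best_option(cost_focused))
print("quality-focused choice:", best_option(quality_focused))
```

The same mechanism is what makes biased weighting possible, so the weights are best fixed before the scores are known.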

5.2 Shortcomings of datasets

  • Biased data: usually due to sampling methods
  • Biased weighting: choosing weights to force an outcome
  • Shifting values over time: grading system is not standardized or permanent
  • Incomplete data about characteristics: missing key aspects of the problem
  • Incomplete understanding of the relationships between factors: factors are not independent and influence each other

5.3 Decision mechanisms:

  • Letting others decide: often a good place to start in an unknown field
  • Micromanaged, structured decision tree: allows for successful completion of a difficult process
  • Random search: helps to discover new solutions
  • Goal seeking: decisions in the flow towards an objective
  • Information-based decision making: iterative development process trying new things and evaluating the outcomes

5.4 Usefulness of statistics

6 Correlation vs Regression

  • Correlation R value
  • Goodness of fit
  • Types of relationships
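A minimal sketch of the correlation R value (Pearson's r) is shown below, computed from its definition. The three small datasets are hypothetical and `pearson_r` is an illustrative name. Squaring r gives R², a common goodness-of-fit measure for a straight-line fit.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
relationships = {
    "perfect positive": [2, 4, 6, 8, 10],
    "perfect negative": [10, 8, 6, 4, 2],
    "noisy positive":   [2, 1, 4, 3, 5],
}

for label, ys in relationships.items():
    r = pearson_r(xs, ys)
    print(f"{label:17s} r = {r:+.2f}, R^2 = {r * r:.2f}")
```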

6.1 Correlation

7 Gas prices in Thailand

8 Linear Regression


Call:
lm(formula = evo95 ~ wks)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.88969 -0.35508 -0.08585  0.39108  1.22185 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.45700    0.21288  147.77   <2e-16 ***
wks          0.35577    0.01432   24.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5163 on 23 degrees of freedom
Multiple R-squared:  0.9641,    Adjusted R-squared:  0.9625 
F-statistic: 617.3 on 1 and 23 DF,  p-value: < 2.2e-16
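The coefficient table above can be read as a fitted line, evo95 = intercept + slope × wks. The sketch below uses only numbers copied from that output. The week value passed to `predict` is an arbitrary illustration, and `predict` is a hypothetical helper name.

```python
# Estimates and standard errors copied from the lm() coefficient table.
INTERCEPT, INTERCEPT_SE = 31.45700, 0.21288
SLOPE, SLOPE_SE = 0.35577, 0.01432

def predict(wks):
    """Fitted value of evo95 for a given number of weeks."""
    return INTERCEPT + SLOPE * wks

# Each t value in the table is the estimate divided by its standard error.
t_intercept = INTERCEPT / INTERCEPT_SE
t_slope = SLOPE / SLOPE_SE

print(f"fitted evo95 at wks = 10: {predict(10):.3f}")
print(f"t value for intercept: {t_intercept:.2f}")
print(f"t value for slope:     {t_slope:.2f}")
```

With a single predictor, the F-statistic is the square of the slope's t value, which is why both lead to the same p-value in the output.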

9 Normal Population: x = 50 +/- 15

10 Uniform Population for 0-100

11 Pareto Distribution


13 Correlation vs Regression

  • Correlation R value
  • Goodness of fit
  • Types of relationships

13.1 Exam

  • GE141: 13 May 2023: 09:30-12:30, PC401
