GE142 - Course Introduction

6. Statistics and Probability

Dr Robert Batzinger
Instructor Emeritus

8/15/22

1 Rubber band Bungee

1.1 No Name corrected

1.2 Ye Huiwen, Nang Num Oo, et al.

1.3 Tony, Zen Miko, et al.

1.4 James, et al.

1.5 Nichana, Sirasin, et al.

1.6 Chev, Pam, Tan

2 Basic Statistics and Probability

2.1 Descriptive statistics

Summarizes and organizes the characteristics of a data set
Provides an understanding of the distribution, central tendency, and variability of the data
Uses tables or graphs to visualize the data

- Additional details:

-   https://www.scribbr.com/statistics/descriptive-statistics/

-   https://www.investopedia.com/terms/d/descriptive_statistics.asp

2.2 Dataset

A collection of data that is:
A random sample
An accurate representation of the population being studied

Mercury content of fish from Florida rivers or lakes (measured as mg/kg)

\[\begin{matrix} 1.230 &1.330 &0.040 &0.044 &1.200 &0.270 \\ 0.490 &0.190 &0.830 &0.810 &0.710 &0.500 \\ 0.490 &1.160 &0.050 &0.150 &0.190 &0.770 \\ 1.080 &0.980 &0.630 &0.560 &0.410 &0.730 \\ 0.590 &0.340 &0.750 &0.870 &0.560 &0.170 \\ 0.180 &0.190 &0.040 &0.490 &1.100 &0.160 \\ 0.100 &0.210 &0.860 &0.520 &0.650 &0.270 \\ 0.940 &0.400 &0.430 &0.250 &0.270 &\\ \end{matrix}\]

2.3 Frequency distribution:

summarizes the frequency of each value or category of a variable.
Works for categorical (qualitative) and numerical (quantitative) variables.
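As a rough illustration, the mercury measurements listed earlier can be grouped into a frequency distribution. The sketch below assumes a bin width of 0.25 mg/kg, which does not come from the slides and was picked only to keep the table short; `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Mercury content (mg/kg) of fish from Florida lakes, copied from the dataset above.
mercury = [
    1.230, 1.330, 0.040, 0.044, 1.200, 0.270,
    0.490, 0.190, 0.830, 0.810, 0.710, 0.500,
    0.490, 1.160, 0.050, 0.150, 0.190, 0.770,
    1.080, 0.980, 0.630, 0.560, 0.410, 0.730,
    0.590, 0.340, 0.750, 0.870, 0.560, 0.170,
    0.180, 0.190, 0.040, 0.490, 1.100, 0.160,
    0.100, 0.210, 0.860, 0.520, 0.650, 0.270,
    0.940, 0.400, 0.430, 0.250, 0.270,
]

BIN_WIDTH = 0.25  # assumed bin width in mg/kg, for illustration only


def frequency_table(values, width):
    """Count how many values fall in each bin [k*width, (k+1)*width)."""
    counts = Counter(int(v // width) for v in values)
    return {k: counts[k] for k in sorted(counts)}


table = frequency_table(mercury, BIN_WIDTH)
for k, count in table.items():
    low, high = k * BIN_WIDTH, (k + 1) * BIN_WIDTH
    # A row of '#' characters gives a quick text histogram.
    print(f"{low:.2f}-{high:.2f} mg/kg: {'#' * count} ({count})")
```

For a categorical variable the same idea applies without binning: `Counter(labels)` already yields the frequency of each category.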

2.4 Common Scales Used for Modeling

2.8 Visualization

Bar chart

Pie chart

Histogram

2.9 Visualization

Box Plot

Violin Plot

2.10 Basic description

Quantitative Stats

Mean, Median, Mode
Distribution
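A minimal sketch of the three measures of central tendency, applied to the mercury dataset shown earlier and using only Python's standard `statistics` module. Because several values tie for most frequent, the sketch reports every mode rather than a single one.

```python
import statistics

# Mercury content (mg/kg), copied from the dataset shown earlier.
mercury = [
    1.230, 1.330, 0.040, 0.044, 1.200, 0.270,
    0.490, 0.190, 0.830, 0.810, 0.710, 0.500,
    0.490, 1.160, 0.050, 0.150, 0.190, 0.770,
    1.080, 0.980, 0.630, 0.560, 0.410, 0.730,
    0.590, 0.340, 0.750, 0.870, 0.560, 0.170,
    0.180, 0.190, 0.040, 0.490, 1.100, 0.160,
    0.100, 0.210, 0.860, 0.520, 0.650, 0.270,
    0.940, 0.400, 0.430, 0.250, 0.270,
]

mean = statistics.mean(mercury)        # arithmetic average
median = statistics.median(mercury)    # middle value of the sorted data
modes = statistics.multimode(mercury)  # all values tied for most frequent

print(f"mean   = {mean:.3f} mg/kg")
print(f"median = {median:.3f} mg/kg")
print(f"modes  = {sorted(modes)}")
```

A mean noticeably above the median, as here, is a hint that the distribution is skewed toward a few high readings.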

Qualitative Stats

Distribution/ Clusters
Tabulation

\[\small\begin{matrix} \text{Factors} & \text{Heavy smoker} & \text{Light smoker} & \text{Non-smoker} \\ \text{Cancer} & 20 & 9 & 5 \\ \text{Cancer-free} & 40 & 30 & 60 \\ \end{matrix}\]

2.12 Measures of variability:

Range: minimum to maximum
Quartile values: \[x[n/4], x[n/2], x[3n/4]\]
Standard deviation: \[\sigma = \sqrt{\sum (\bar x - x)^2 / (n - 1)}\]
Standard error of the mean: \[SE = \sigma/\sqrt{n}\]
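The measures above can be worked through on a small sample. This sketch uses the first two rows of the mercury dataset (12 values) and takes the quartiles by simple position in the sorted data (n/4, n/2 and 3n/4, with 0-based indexing); statistical packages usually interpolate between neighbours, so their quartiles may differ slightly.

```python
import math

# First two rows of the mercury dataset (mg/kg), sorted for the quartile lookup.
sample = sorted([1.230, 1.330, 0.040, 0.044, 1.200, 0.270,
                 0.490, 0.190, 0.830, 0.810, 0.710, 0.500])
n = len(sample)

value_range = (sample[0], sample[-1])  # minimum to maximum

# Quartiles by position in the sorted data (rough rule, no interpolation).
quartiles = [sample[n // 4], sample[n // 2], sample[3 * n // 4]]

mean = sum(sample) / n
# Sample standard deviation: squared deviations divided by n - 1.
sd = math.sqrt(sum((mean - x) ** 2 for x in sample) / (n - 1))
# Standard error of the mean.
se = sd / math.sqrt(n)

print(f"range     = {value_range}")
print(f"quartiles = {quartiles}")
print(f"sd = {sd:.3f}, se = {se:.3f}")
```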

2.13 Comparisons of subpopulations

Quantitative

t-test:

\[t = \frac{diff}{SE}\]

Qualitative

Chi-square test:

\[\chi^2 = \sum\frac{(obs - exp)^2}{exp}\]
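The chi-square formula can be tried on the smoking and cancer counts tabulated earlier. The sketch assumes the usual test of independence, where each expected count is (row total × column total) / grand total; the slides do not say how the expected values are obtained, so that step is an assumption.

```python
# Observed counts from the smoking table: columns are heavy, light, non-smoker.
observed = [
    [20, 9, 5],    # cancer
    [40, 30, 60],  # cancer-free
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if smoking status and cancer were independent.
        exp = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - exp) ** 2 / exp

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi-square = {chi2:.2f} with {dof} degrees of freedom")
```

With 2 degrees of freedom the 5% critical value is about 5.99, so a statistic above that suggests the two factors are not independent.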

3 t-test

\[\displaystyle{t=\frac{difference}{standard\ error}}\]

3.1 t test - One sample

\[{\displaystyle t={\frac {{\bar {x}}-\mu _{0}}{ \frac{s}{\sqrt {n}}}} = \frac{\Delta x}{\frac{\sigma}{\sqrt{n}}}}\]

\[df = n - 1\]
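A small worked example of the one-sample statistic, assuming the first two rows of the mercury dataset as the sample and a hypothetical reference value of 0.5 mg/kg for \(\mu_0\) (the reference value is invented for illustration and is not a regulatory limit).

```python
import math
import statistics

# First two rows of the mercury dataset (mg/kg).
sample = [1.230, 1.330, 0.040, 0.044, 1.200, 0.270,
          0.490, 0.190, 0.830, 0.810, 0.710, 0.500]

mu0 = 0.5  # hypothetical reference mean, for illustration only

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

t_one = (x_bar - mu0) / (s / math.sqrt(n))
dof = n - 1

print(f"t = {t_one:.3f} with {dof} degrees of freedom")
```

The resulting t is then compared with a critical value from a t table for the same degrees of freedom.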

3.2 Two-sample t-test: equal sample size and equal variance

\[{\displaystyle t={\frac {{\bar {X}}_{1}-{\bar {X}}_{2}}{s_{p}{\sqrt {\frac {2}{n}}}}}}={\frac {{\bar {X}}_{1}-{\bar {X}}_{2}}{\sqrt {\frac {s_{1}^{2}+s_{2}^{2}}{n}}}}\]

\[{\displaystyle s_{p}={\sqrt {\frac {s_{1}^{2}+s_{2}^{2}}{2}}}}\]

\[df = 2n - 2\]

3.3 Two-sample t-test: unequal sample sizes, equal variance

\[t = \frac{\bar x_1 - \bar x_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\]

\[s_p = \sqrt{\frac{(n_1-1)s_1^2 +(n_2-1)s_2^2}{n_1+n_2-2}}\]

\[df = n_1 + n_2 -2\]
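The pooled formulas translate almost line for line into code. The two groups below are hypothetical measurements invented for this sketch, and `pooled_t_test` is an illustrative helper name.

```python
import math
import statistics

# Hypothetical measurements for two groups of unequal size (not course data).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.4, 4.8, 5.0, 4.3, 4.6, 4.2]


def pooled_t_test(a, b):
    """Return (t, df) for a two-sample t-test that assumes equal variances."""
    n1, n2 = len(a), len(b)
    var1, var2 = statistics.variance(a), statistics.variance(b)  # s1^2, s2^2
    dof = n1 + n2 - 2
    # Pooled standard deviation: variances weighted by their degrees of freedom.
    s_p = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / dof)
    t = (statistics.mean(a) - statistics.mean(b)) / (s_p * math.sqrt(1 / n1 + 1 / n2))
    return t, dof


t_pooled, df_pooled = pooled_t_test(group_a, group_b)
print(f"t = {t_pooled:.2f}, df = {df_pooled}")
```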

3.4 Welch’s t-test

\[t = \frac{\bar x_1 - \bar x_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

\[{\displaystyle \mathrm {d.f.} ={\frac {\left({\frac {s_{1}^{2}}{n_{1}}}+{\frac {s_{2}^{2}}{n_{2}}}\right)^{2}}{{\frac {\left(s_{1}^{2}/n_{1}\right)^{2}}{n_{1}-1}}+{\frac {\left(s_{2}^{2}/n_{2}\right)^{2}}{n_{2}-1}}}}} \]
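A sketch of Welch's version, which drops the equal-variance assumption. The data are hypothetical and `welch_t_test` is an illustrative helper name; note that the degrees of freedom are generally not a whole number.

```python
import math
import statistics

# Hypothetical measurements for two groups (not course data).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.4, 4.8, 5.0, 4.3, 4.6, 4.2]


def welch_t_test(a, b):
    """Return (t, df) for Welch's t-test (variances not assumed equal)."""
    n1, n2 = len(a), len(b)
    v1 = statistics.variance(a) / n1  # s1^2 / n1
    v2 = statistics.variance(b) / n2  # s2^2 / n2
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, dof


t_welch, df_welch = welch_t_test(group_a, group_b)
print(f"t = {t_welch:.2f}, df = {df_welch:.2f}")
```

Since the effective degrees of freedom come out lower than n1 + n2 - 2, Welch's test is the more cautious choice when the group variances look different.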

4 Minimum sample size

\[\displaystyle \left[t = \frac{\Delta x}{\frac{\sigma}{\sqrt{n}}}\right] \quad \longrightarrow\quad \left[n = \left(\frac{t \sigma}{\Delta x}\right)^2\right]\] \[n = \left(\frac{t\sigma}{M}\right)^2\]
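The rearranged formula gives a quick planning estimate. In this sketch M is read as the acceptable margin of error (the smallest difference worth detecting), which is an interpretation of the slide; the inputs t = 1.96 (roughly the two-sided 95% value for large samples), σ = 15 and M = 5 are illustrative assumptions.

```python
import math


def min_sample_size(t, sigma, margin):
    """n = (t * sigma / margin)^2, rounded up to a whole observation."""
    return math.ceil((t * sigma / margin) ** 2)


# Illustrative inputs: 95% confidence, sigma = 15, margin of error = 5.
n_wide = min_sample_size(t=1.96, sigma=15, margin=5)

# Halving the margin roughly quadruples the required sample size.
n_narrow = min_sample_size(t=1.96, sigma=15, margin=2.5)

print(n_wide, n_narrow)
```

Because n depends on the square of σ/M, tighter precision becomes expensive quickly.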

5 Statistical issues

Correlation does not equal causation

Type I and Type II errors

5.1 Decision making:

Comparison of values on a shared standard
Statistically significant differences
Weighted scores to change emphasis
Backpack algorithm
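A small sketch of how weighted scores change the emphasis of a decision. The options, criteria, scores and weights are all hypothetical; the point is only that the same scores can produce a different winner under different weights.

```python
# Hypothetical options scored 0-10 on three hypothetical criteria.
scores = {
    "Option A": {"cost": 9, "quality": 5, "speed": 6},
    "Option B": {"cost": 5, "quality": 9, "speed": 7},
}


def weighted_score(option_scores, weights):
    """Weighted average of the criterion scores."""
    total_weight = sum(weights.values())
    return sum(option_scores[c] * w for c, w in weights.items()) / total_weight


def best_option(all_scores, weights):
    """Name of the option with the highest weighted score."""
    return max(all_scores, key=lambda name: weighted_score(all_scores[name], weights))


cost_focused = {"cost": 3, "quality": 1, "speed": 1}
quality_focused = {"cost": 1, "quality": 3, "speed": 1}

print(best_option(scores, cost_focused))     # emphasis on cost
print(best_option(scores, quality_focused))  # emphasis on quality
```

The same mechanism is what makes biased weighting possible: weights chosen after the fact can force whichever outcome is wanted.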

5.2 Shortcomings of datasets

Biased data: usually due to sampling methods
Biased weighting: choosing weights to force an outcome
Shifting values over time: grading system is not standardized or permanent
Incomplete data about characteristics: missing key aspects of the problem
Incomplete understanding of the relationships between factors: factors are not independent and influence each other

5.3 Decision mechanisms:

Letting others decide: often a good place to start in an unknown field
Micromanaged, structured decision tree: allows for successful completion of a difficult process
Random search: helps to discover new solutions
Goal seeking: decisions in the flow towards an objective
Information-based decision making: iterative development process trying new things and evaluating the outcomes

5.4 Usefulness of statistics

6 Correlation vs Regression

Correlation R value
Goodness of fit
Types of relationships
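A minimal sketch of the correlation coefficient, assuming the Pearson R is meant. The paired values are hypothetical; for a straight-line fit with one predictor, the goodness of fit R² is simply the square of this R.

```python
import math

# Hypothetical paired observations (not course data).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(x, y)
r_squared = r ** 2  # goodness of fit for a simple linear relationship

print(f"R = {r:.4f}, R^2 = {r_squared:.4f}")
```

R ranges from -1 (perfect inverse relationship) through 0 (no linear relationship) to +1 (perfect direct relationship).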

6.1 Correlation

7 Gas prices in Thailand

8 Linear Regression


Call:
lm(formula = evo95 ~ wks)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.88969 -0.35508 -0.08585  0.39108  1.22185 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.45700    0.21288  147.77   <2e-16 ***
wks          0.35577    0.01432   24.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5163 on 23 degrees of freedom
Multiple R-squared:  0.9641,    Adjusted R-squared:  0.9625 
F-statistic: 617.3 on 1 and 23 DF,  p-value: < 2.2e-16
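The summary above can be checked by hand. This sketch copies the printed coefficients, recomputes each t value as estimate divided by standard error, and uses the fitted line to predict `evo95` for a given `wks`. Reading `wks` as a week count is an assumption based on the variable name, and `predict` is a hypothetical helper.

```python
# Coefficients copied from the lm() summary above.
intercept, intercept_se = 31.45700, 0.21288
slope, slope_se = 0.35577, 0.01432

# Each t value in the table is the estimate divided by its standard error.
t_intercept = intercept / intercept_se
t_slope = slope / slope_se


def predict(wks):
    """Fitted value of evo95 from the regression line."""
    return intercept + slope * wks


# With a single predictor, the correlation magnitude is the square root of R-squared.
r = 0.9641 ** 0.5

print(f"t(intercept) = {t_intercept:.2f}, t(wks) = {t_slope:.2f}")
print(f"predicted evo95 at wks = 10: {predict(10):.2f}")
print(f"|R| = {r:.3f}")
```

The slope says the fitted value rises by about 0.36 per unit of `wks`, and the very small p-values indicate that this trend is unlikely to be due to chance alone.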

9 Normal Population: x = 50 +/- 15

10 Uniform Population for 0-100

11 Pareto Distribution
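The three populations named in these headings can be simulated with Python's standard `random` module. The sketch reads "50 +/- 15" as a mean of 50 with a standard deviation of 15, and uses a Pareto shape parameter of 3.0; both are assumptions, as are the seed and the sample size.

```python
import random
import statistics

rng = random.Random(142)  # fixed seed so the run is repeatable
N = 10_000                # number of simulated values per population
PARETO_SHAPE = 3.0        # hypothetical shape parameter

populations = {
    "normal": [rng.gauss(50, 15) for _ in range(N)],      # mean 50, SD 15
    "uniform": [rng.uniform(0, 100) for _ in range(N)],   # flat between 0 and 100
    "pareto": [rng.paretovariate(PARETO_SHAPE) for _ in range(N)],  # long right tail
}

for name, data in populations.items():
    print(f"{name:8s} mean={statistics.mean(data):7.2f} "
          f"median={statistics.median(data):7.2f} "
          f"sd={statistics.stdev(data):7.2f}")
```

For the normal and uniform populations the mean and median nearly coincide, while the skewed Pareto population has a mean pulled above its median by the long tail.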

13.1 Exam

GE141: 13 May 2023: 09:30-12:30, PC401
