IT408 Data Mining - Short Course (Python)

Unit 5: Data modelling

R Batzinger

2025-07-12

Candidate models for the trend in annual corn production:

\[p=a\left(1-e^{-bt}\right)\]

\[p = a e^{-bt}\]

\[p = \ln(bt) + a\]

\[p = bt + a\]

      year Corn.Prod Forecast Model.Residual
 [1,] 2014      2.80 3.111323  -0.3113234835
 [2,] 2015      4.70 4.413861   0.2861391363
 [3,] 2016      5.20 4.959161   0.2408394798
 [4,] 2017      5.00 5.187447  -0.1874470323
 [5,] 2018      5.20 5.283018  -0.0830178455
 [6,] 2019      5.30 5.323028  -0.0230280075
 [7,] 2020      4.20 5.339778  -1.1397780279
 [8,] 2021      5.50 5.346790   0.1532096741
 [9,] 2022      5.30 5.349726  -0.0497259835
[10,] 2023      5.35 5.350955  -0.0009549793
[11,] 2024      5.30 5.351469  -0.0514694913

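A minimal Python sketch of how such a fit can be produced with scipy.optimize.curve_fit, assuming the growth-to-a-maximum form above and that t counts years since 2013 (both inferred from the forecast values, not stated in the table):

```python
import numpy as np
from scipy.optimize import curve_fit

# Corn production by year, taken from the table above
year = np.arange(2014, 2025)
prod = np.array([2.80, 4.70, 5.20, 5.00, 5.20, 5.30,
                 4.20, 5.50, 5.30, 5.35, 5.30])

# Growth-to-a-maximum model p = a(1 - e^{-bt}); assume t is
# years since 2013, which reproduces the forecast column closely
t = year - 2013

def growth_to_max(t, a, b):
    return a * (1.0 - np.exp(-b * t))

(a, b), _ = curve_fit(growth_to_max, t, prod, p0=(5.0, 0.5))
forecast = growth_to_max(t, a, b)
residual = prod - forecast

print(f"a = {a:.4f}, b = {b:.4f}")
for row in zip(year, prod, forecast, residual):
    print("%d  %5.2f  %8.6f  %+.6f" % row)
```

Note the large 2020 residual: a single bad harvest pulls the observed value well below the fitted trend.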

Common Data Analysis

R-Squared (Coefficient of Determination)

A measure of how much of the variation in the data the model explains

\[R^2 =\frac{\sum{\left(\hat{x}_i-\bar x\right)^2}}{\sum{\left(x_i - \bar x\right)^2}} = \frac{4.895}{6.172} = 0.793\]

Mean (Average)

Central value

\[\bar x =\frac{\sum{x_i}}{n}= \frac{53.85}{11} = 4.896\]

Standard Deviation

A measure of the spread of individual points about the mean

\[\sigma =\sqrt{\frac{\sum\left(x_i - \bar{x}\right)^2}{n-1}} = \sqrt{\frac{6.172}{10}}= 0.786\]

Standard error

The standard deviation of the sample mean

\[SE = \frac{\sigma}{\sqrt{n}} = \frac{0.786}{\sqrt{11}}=0.237 \]

Correlation coefficient

A measure of the linear association between the x and y variables

\[r_{xy} = \frac{\sum\left(x_i-\bar x\right)\left(y_i-\bar y\right)}{\sqrt{\sum{\left(x_i-\bar x\right)^2}}\ \sqrt{\sum{\left(y_i-\bar y\right)^2}}}\]
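
A short Python sketch computing these statistics from the corn data above; the printed values should be close to the worked examples, up to rounding:

```python
import numpy as np

prod = np.array([2.80, 4.70, 5.20, 5.00, 5.20, 5.30,
                 4.20, 5.50, 5.30, 5.35, 5.30])
forecast = np.array([3.111323, 4.413861, 4.959161, 5.187447,
                     5.283018, 5.323028, 5.339778, 5.346790,
                     5.349726, 5.350955, 5.351469])
n = len(prod)

mean = prod.sum() / n                               # 53.85 / 11 = 4.896
sd = np.sqrt(((prod - mean) ** 2).sum() / (n - 1))  # sqrt(6.172 / 10) = 0.786
se = sd / np.sqrt(n)                                # 0.786 / sqrt(11) = 0.237

# Coefficient of determination: explained variation over total variation
r2 = ((forecast - mean) ** 2).sum() / ((prod - mean) ** 2).sum()

# Correlation between observed and fitted values
r = np.corrcoef(prod, forecast)[0, 1]

print(f"mean={mean:.3f}  sd={sd:.3f}  se={se:.3f}  R2={r2:.3f}  r={r:.3f}")
```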

Basic Modelling Functions

Surge Function

\[y = ax e^{-bx}\]

Linear Function

\[y = mx + b\]

Oscillating Function with a Linear Trend

\[y = ax + b \sin(cx) + d\]

Oscillating/cyclic Function

\[y = a \sin(bx) +c\]

Exponential Decay to a Minimum

\[y = a e^{-bx} + c\]

Exponential Growth

\[y = a e^{bx} + c\]

Exponential Growth to a Maximum

\[y = a\left(1-e^{-bx}\right) + c\]

Logarithmic Function

\[y = \log (ax)\]

Normal Probability Curve

\[y=\frac{1}{\sqrt{2\pi}}\,e^{-\frac{(x-3)^2}{2}}\]

Polynomial

\[y = ax^3 + bx^2 + cx + d\]

\[y = x^4 -14x^3 +65x^2 -101x +32\]

S-Curve (Logistic) Function

\[y = \frac{1}{1+e^{-ax}}\]
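
A small Python sketch plotting several of these basic shapes side by side; the parameter values are arbitrary choices for illustration only:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.0, 10.0, 200)

# A selection of the basic model shapes listed above,
# with illustrative (not fitted) parameter values
models = {
    "surge":             lambda x: 2.0 * x * np.exp(-0.8 * x),
    "trend + cycle":     lambda x: 0.4 * x + np.sin(2.0 * x) + 2.0,
    "decay to minimum":  lambda x: 4.0 * np.exp(-0.6 * x) + 1.0,
    "growth to maximum": lambda x: 5.0 * (1.0 - np.exp(-0.7 * x)) + 0.5,
    "normal curve":      lambda x: np.exp(-0.5 * (x - 5.0) ** 2) / np.sqrt(2.0 * np.pi),
    "logistic":          lambda x: 1.0 / (1.0 + np.exp(-1.5 * (x - 5.0))),
}

fig, axes = plt.subplots(2, 3, figsize=(10, 6))
for ax, (name, f) in zip(axes.flat, models.items()):
    ax.plot(x, f(x))
    ax.set_title(name)
plt.tight_layout()
plt.show()
```

Matching the shape of the data to one of these families is usually the first step before estimating parameters.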

Abuse of Statistics

  • Assuming small differences are meaningful
  • Equating statistical significance with real-world significance
  • Neglecting to look at extremes
  • Trusting coincidence
  • Getting causation backwards
  • Forgetting to consider outside causes
  • Deceptive graphs

Biases in Data Science

  • Historical bias:

    • occurs when AI models are trained on data that reflects past societal prejudices or inequalities
    • This bias perpetuates unfair outcomes
    • requires auditing historical data for systemic inequities and adjusting datasets
  • Representation bias:

    • occurs when certain classes or demographics are underrepresented in the training data
    • results in a skewed model performance
    • requires augmenting datasets with diverse samples or reweighting to balance representation (see the reweighting sketch after this list)
  • Measurement bias:

    • comes from using inaccurate or inappropriate proxies for real-world phenomena
    • distorts model predictions
    • requires careful feature selection and validation against ground-truth data to ensure relevance
  • Algorithmic bias:

    • emerges during the training or optimization process when the model’s design or learning algorithm introduces unfairness
    • happens if optimization prioritizes accuracy for the majority group, neglecting minorities, or if the model overfits to biased patterns
    • mitigation involves incorporating fairness constraints, like adversarial training, during model development
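
A minimal sketch of one such mitigation step for representation bias, reweighting classes by inverse frequency with scikit-learn's compute_class_weight; the data here is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, illustrative data: class 1 is heavily underrepresented
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weights are inversely proportional to class frequency
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))   # approx {0: 0.556, 1: 5.0}

# Most scikit-learn estimators accept the same scheme directly,
# so minority-class errors count more during training
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Reweighting does not remove bias from the data itself; it only changes how much each class contributes to the loss, so it should be paired with auditing of how the data was collected.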