Data Analytics

Basic Data Science

Dr Robert Batzinger
Instructor Emeritus

2022-08-15

Data project workflow

  • A. Business understanding: defining the objectives and scope of the data project

  • B. Data collection: obtaining data from the available sources and exploring its characteristics, quality, and potential problems

  • C. Data preparation: making the dataset suitable for data mining

  • D. Data transform: changing the complexity of the data by adjusting its dimensionality, discreteness, representation, or normalization

  • E. Data modeling: applying algorithms and techniques to the data to create models that describe the patterns and relationships in the data

  • F. Model evaluation: assessing the quality and validity of the models and comparing them with the objectives and expectations

  • G. Knowledge deployment: applying what was learned to update business models and processes

Data Analytics

flowchart LR
  A(A. Business\n understanding) --> B(B. Data\ncollection)
  B --> C(C. Data\n preparation)
  C --> D(D. Data\n transform)
  D --> E(E. Data\n visualization)
  E --> F(F. Data\n modeling)
  F --> G{G. Model\n verification}
  G --> D
  G --> H(H. Knowledge\n deployment)

(a) Business understanding

  • Goal of the research
  • Key research question
  • Hypothesis

(b) Data collection

  • Accumulate appropriate data from different sources
  • Government sources: data.go.th
  • Open source: github.com
  • Web crawling and scraping: ngram
  • Commercial sources: Gallup.com, Statista.com
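For example, pulling a published CSV into a dataframe and taking a first look might be as simple as the sketch below (the URL and file name are placeholders, not a specific course dataset):

import pandas as pd

# Placeholder address; substitute a real file from data.go.th, GitHub, etc.
url = "https://example.org/open-data/air_quality.csv"
df = pd.read_csv(url)

# Explore the data's characteristics, quality, and potential problems
print(df.shape)         # number of rows and columns collected
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # missing values per column
print(df.describe())    # summary statistics for numeric columns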

(c) Data preparation

  • Transform and integrate the data into a unified dataset with a standard form and data representation

    • Calibrated readings
    • Standard units of measurement
    • Standard sampling rate
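A minimal pandas sketch of these steps, assuming a sensor log with made-up column names and units (Fahrenheit readings taken at irregular times):

import pandas as pd

# Assumed raw log: irregular timestamps, readings in Fahrenheit
raw = pd.DataFrame(
    {"time": pd.to_datetime(["2022-08-15 00:00", "2022-08-15 00:07",
                             "2022-08-15 00:21", "2022-08-15 00:33"]),
     "temp_f": [77.2, 78.1, 79.3, 80.0]}
).set_index("time")

# Standard unit of measurement: convert Fahrenheit to Celsius
raw["temp_c"] = (raw["temp_f"] - 32) * 5 / 9

# Standard sampling rate: resample onto a 15-minute grid and interpolate gaps
unified = raw[["temp_c"]].resample("15min").mean().interpolate()
print(unified)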

(d) Tidy the data

Create a dataset that truly represents the problem domain by resolving:

  • Missing values
  • Outliers
  • Inconsistencies
  • Duplicates
  • Noise
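A short pandas sketch of the usual clean-up passes, on made-up survey data (the column names and the outlier rule are illustrative choices):

import numpy as np
import pandas as pd

# Toy data containing a duplicate row, a missing value, and an outlier
df = pd.DataFrame({"id":     [1, 1, 2, 3, 4, 5, 6],
                   "height": [1.62, 1.62, np.nan, 1.75, 17.5, 1.68, 1.71]})

df = df.drop_duplicates()                                  # duplicates
df["height"] = df["height"].fillna(df["height"].median())  # missing values

# Outliers and inconsistencies: keep values inside 1.5 * IQR fences
q1, q3 = df["height"].quantile([0.25, 0.75])
fence = 1.5 * (q3 - q1)
df = df[df["height"].between(q1 - fence, q3 + fence)]
print(df)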

(e) Data visualization

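As a stand-in sketch, a quick matplotlib plot of the observed values against the model predictions used in the error table later in this deck:

import matplotlib.pyplot as plt

# x, observed y, and predicted y taken from the model verification table
x     = [2, 4, 6, 8, 10]
y     = [9.2, 13.6, 18.0, 22.4, 26.8]
y_hat = [10.8, 13.6, 13.6, 24.8, 27.2]

plt.scatter(x, y, label="observed")
plt.plot(x, y_hat, "--", label="model prediction")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.title("Observed values versus model predictions")
plt.show()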

(f) Data modeling

  • Clustering: grouping by commonality
  • Decision tree classification: grouping by attribute thresholds
  • Regression: modeling the relationship between attributes
  • Association: discovering relationships between items that occur together
  • Forecasting: using trends to suggest future outcomes
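A brief scikit-learn sketch of two of these techniques, regression and clustering; the library choice and the clustering points are illustrative, while the x and y values are the ones reused in the verification table:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Regression: relation between an attribute x and a response y
x = np.array([[2], [4], [6], [8], [10]])
y = np.array([9.2, 13.6, 18.0, 22.4, 26.8])
model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Clustering: grouping points by commonality (two obvious groups here)
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.1, 4.8]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print("cluster labels:", labels)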

(g) Model verification

  • Residuals
  • Mean absolute deviation
  • Standard error

Various estimates of error

\[\small\begin{matrix} x & y & \hat y & \Delta y & MAD & \chi^2 & err \\ 2 & 9.2 & 10.8 & 1.6 & 1.6 & 0.3&17\%\\ 4 & 13.6 & 13.6 & 0.0 & 0.0 & 0.0&0\%\\ 6 & 18.0 & 13.6 & -4.4 & 4.4 & 1.1&24\%\\ 8 & 22.4 & 24.8 & 2.4 & 2.4 & 0.3&11\%\\ 10 & 26.8 & 27.2 & 0.4 & 0.4& 0.0 & 1\%\\ \\ Statistic & & & &1.76 & 1.70&11\%\\ \end{matrix}\]
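These statistics can be reproduced with a few lines of NumPy; the χ² and percent-error formulas below are inferred from the rounded values in the table, so treat them as one plausible reading rather than the definitive definitions:

import numpy as np

# Observed and predicted values from the table above
y     = np.array([9.2, 13.6, 18.0, 22.4, 26.8])
y_hat = np.array([10.8, 13.6, 13.6, 24.8, 27.2])

resid = y_hat - y                    # residuals (Delta y)
mad   = np.mean(np.abs(resid))       # mean absolute deviation: 1.76
err   = np.mean(np.abs(resid) / y)   # mean percent error: about 11%
chi2  = np.sum(resid**2 / y)         # chi-square-style total: about 1.6-1.7

print(f"MAD = {mad:.2f}, chi2 = {chi2:.2f}, err = {err:.0%}")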


(h) Knowledge deployment: common apps

  • Image and speech recognition: (identifying and tracking objects in an image, or transcribing spoken and written text)

  • Natural language processing: (sentiment analysis, proofreading, chatbots, and language translation)

  • Recommender systems: (recommending products or services to users based on their past behavior, preferences, or similarities)

  • Optimized solver: (determining optimized solutions that cut costs by improving efficiency and effectiveness)

  • Anomaly detection: (spotting unusual patterns or behaviors in data, useful for detecting fraud, scams, intrusions, or failures)

  • Forecasting: (using trends to predict utilization, pending failures, and the need for preventive maintenance)
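As one tiny illustration of the anomaly-detection idea above, a z-score check on a made-up series of daily transaction counts (the values and the threshold are illustrative):

import numpy as np

# Made-up daily transaction counts; the spike on the last day is the anomaly
counts = np.array([101, 98, 103, 97, 100, 99, 102, 350])

z = (counts - counts.mean()) / counts.std()
anomalies = np.where(np.abs(z) > 2)[0]   # flag points more than 2 sigma out
print("anomalous days:", anomalies)      # -> [7]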