DATA SCIENCE FOR STATISTICIAN

1 Data Science for Statisticians: Comprehensive Course Notes

1.1 Course Overview and Objectives

1.1.1 Course Purpose

This course empowers statisticians to leverage modern data science tools and approaches, enhancing their ability to extract meaningful insights, build predictive models, and contribute to data-driven decision-making across applied domains. The focus bridges statistical theory with practical data science implementation.

1.1.2 Expected Learning Outcomes

By course completion, students will:

Data Management: Proficiently acquire, clean, transform, and manage diverse datasets (structured and unstructured) using appropriate programming languages and tools
Data Science Techniques: Implement and interpret fundamental data science techniques with theoretical understanding
Statistical Modeling: Understand theoretical underpinnings and practical applications of statistical and machine learning models
Programming Proficiency: Develop strong programming skills in Python and R
Critical Assessment: Evaluate model performance, understand limitations, and recognize ethical implications
Collaboration: Work effectively in teams using professional workflows, version control, and collaborative practices

1.2 MODULE 1: INTRODUCTION TO DATA SCIENCE AND PROGRAMMING FUNDAMENTALS

1.2.1 1.1 Overview of Data Science

Definition: Data Science is an interdisciplinary field combining statistics, computer science, domain expertise, and mathematics to extract actionable insights from data and drive decision-making.

The Data Science Workflow: 1. Problem Definition and Understanding 2. Data Collection and Acquisition 3. Data Cleaning and Preparation 4. Exploratory Data Analysis (EDA) 5. Feature Engineering 6. Model Development and Training 7. Model Evaluation and Validation 8. Deployment and Monitoring 9. Communication and Visualization

Key Distinction from Statistics: While classical statistics emphasizes inference and hypothesis testing under specific distributional assumptions, data science emphasizes prediction, pattern discovery, and handling of complex, high-dimensional data with minimal assumptions.

1.2.2 1.2 Introduction to Programming for Data Science

1.2.2.1 1.2.1 Why R and Python?

R: Statistical computing environment with extensive packages for statistical analysis, visualization, and modeling. - Advantages: Domain-specific for statistics, ggplot2 for visualization, tidyverse for data manipulation - Best for: Statistical analysis, academic research, visualization

Python: General-purpose programming language with powerful data science libraries. - Advantages: Versatility, scikit-learn for ML, large community, production deployment - Best for: Machine learning, deep learning, production systems, web applications

1.2.2.2 1.2.2 Getting Started: R Fundamentals

Basic Data Types:

# Numeric
x <- 42
y <- 3.14

# Character
name <- "Statistics"

# Logical
flag <- TRUE

# Vectors (homogeneous)
vec <- c(1, 2, 3, 4, 5)
vec_char <- c("a", "b", "c")

# Lists (heterogeneous)
my_list <- list(name = "John", age = 30, scores = c(85, 90, 78))
my_list$name

## [1] "John"

# Data frames (tabular)
df <- data.frame(
  id = 1:3,
  name = c("Alice", "Bob", "Carol"),
  score = c(95, 87, 92)
)

# Matrix
mat <- matrix(1:6, nrow = 2, ncol = 3)

Basic Operations:

# Vectorized operations
x <- c(1, 2, 3, 4, 5)
y <- x * 2  # c(2, 4, 6, 8, 10)
z <- x[x > 2]  # c(3, 4, 5)

# Functions
mean_value <- mean(x)
sd_value <- sd(x)
summary(x)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       2       3       3       4       5

# Custom function
calculate_cv <- function(x) {
  cv <- sd(x) / mean(x) * 100
  return(cv)
}

cv_result <- calculate_cv(c(10, 12, 15, 18, 20))

1.2.2.3 1.2.3 Getting Started: Python Fundamentals

Basic Data Types and Structures:

# Numeric types
x = 42
y = 3.14

# String
name = "Statistics"

# Boolean
flag = True

# Lists (mutable, heterogeneous)
lst = [1, 2, 3, "a", "b"]

# Tuples (immutable)
tup = (1, 2, 3)

# Dictionaries
my_dict = {"name": "John", "age": 30, "scores": [85, 90, 78]}
my_dict["name"]

## 'John'


# NumPy arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

# Pandas DataFrame
import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol'],
    'score': [95, 87, 92]
})

Basic Operations:

# Array operations (vectorized)
x = np.array([1, 2, 3, 4, 5])
y = x * 2  # array([ 2,  4,  6,  8, 10])
z = x[x > 2]  # array([3, 4, 5])

# Functions
mean_value = np.mean(x)
std_value = np.std(x)

# Custom function
def calculate_cv(x):
    cv = np.std(x) / np.mean(x) * 100
    return cv

cv_result = calculate_cv(np.array([10, 12, 15, 18, 20]))

# List comprehension
squares = [i**2 for i in range(1, 6)]  # [1, 4, 9, 16, 25]

1.2.3 1.3 Integrated Development Environments (IDEs)

1.2.3.1 1.3.1 RStudio

Four-pane interface: Editor, Console, Environment/History, Files/Plots/Packages/Help
Projects for organized workflows
R Markdown for dynamic reports combining code and narrative
Package management integrated

1.2.3.2 1.3.2 Jupyter Notebooks

Interactive computation environment
Browser-based interface
Mix markdown and code cells
Supports multiple kernels (Python, R, Julia)
Ideal for exploratory analysis and sharing

1.2.3.3 1.3.3 VS Code

Lightweight, extensible editor
Extensions for Python, R, Jupyter
Integrated terminal
Git integration
Good for production code development

1.2.4 1.4 Version Control with Git and GitHub

Why Version Control? - Track changes over time - Collaborate with team members - Revert to previous versions - Manage multiple branches

Basic Git Workflow:

2 Initialize repository

git init git clone https://github.com/username/repo.git

3 Make changes and stage them

git add file.py git add . # Add all changes

4 Commit changes

git commit -m “Add feature: data cleaning module”

5 Push to remote

git push origin main

6 Pull latest changes

git pull origin main

7 Create and switch branches

git checkout -b feature/new-analysis git checkout main

8 Merge branches

git merge feature/new-analysis

GitHub Best Practices:

Meaningful commit messages
Frequent, atomic commits
Pull requests for code review
.gitignore for sensitive/large files
README documentation

8.1 MODULE 2: DATA ACQUISITION, CLEANING, AND MANAGEMENT

8.1.1 2.1 Data Sources and Types

Data Sources: - Internal databases and data warehouses - APIs (Application Programming Interfaces) - Web scraping - Sensors and IoT devices - Surveys and questionnaires - Social media and web platforms - Open data repositories

Data Types: - Structured Data: Relational databases, CSV, Excel, with predefined schema - Semi-structured Data: JSON, XML, logs with partial schema - Unstructured Data: Text, images, audio, video with no predefined structure

8.1.2 2.2 Data Import and Export

8.1.2.1 2.2.1 R: Data Import and Export

# Reading CSV
df <- read.csv("employees.csv")
df_csv <- readr::read_csv("employees.csv")  # Faster, more robust

## Rows: 200 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, department, education_level
## dbl (5): emp_id, age, experience, salary, performance_score
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Reading Excel
library(readxl)
#df_excel <- read_excel("data.xlsx", sheet = 1)

# Reading from database
#library(DBI)
#library(RSQLite)
#conn <- dbConnect(SQLite(), "database.db")
#df_db <- dbGetQuery(conn, "SELECT * FROM table_name")
#dbDisconnect(conn)

# Reading JSON
library(jsonlite)
#df_json <- fromJSON("data.json")

# Writing data
#write.csv(df, "output.csv", row.names = FALSE)
#readr::write_csv(df, "output.csv")
#write_excel_csv(df, "output.csv")

8.1.2.2 2.2.2 Python: Data Import and Export

import pandas as pd
import sqlite3
import json

# Reading CSV
df = pd.read_csv('employees.csv')
df = pd.read_csv('employees.csv', sep=';', encoding='utf-8')

# Reading Excel
#df = pd.read_excel('data.xlsx', sheet_name=0)

# Reading from database
#conn = sqlite3.connect('database.db')
#df = pd.read_sql_query('SELECT * FROM table_name', conn)
#conn.close()

# Reading JSON
#df = pd.read_json('data.json')
#df = pd.read_json('data.jsonl', lines=True)

# Writing data
#df.to_csv('output.csv', index=False)
#df.to_excel('output.xlsx', index=False)
#df.to_sql('table_name', conn, if_exists='replace')

8.1.3 2.3 Data Cleaning and Preprocessing

Common Data Quality Issues: - Missing values (MCAR, MAR, MNAR) - Duplicates - Outliers - Inconsistent formatting - Type mismatches - Categorical encoding issues

8.1.3.1 2.3.1 R: Data Cleaning

# Load tidyverse for data manipulation
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Sample dataset
data <- tibble(
  id = c(1, 2, 2, 4, 5),
  name = c("John", "Jane", "Jane", NA, "Mike"),
  age = c(25, 30, 30, 35, "40"),  # Type inconsistency
  salary = c(50000, NA, 60000, 75000, 45000),
  department = c("IT", "HR", "HR", "IT", "IT"))

# Check for missing values
sum(is.na(data))

## [1] 2

colSums(is.na(data))

##         id       name        age     salary department 
##          0          1          0          1          0

naniar::vis_miss(data)  # Visualize missing patterns

# Identify duplicates
duplicate_rows <- data[duplicated(data),]
data_no_dup <- distinct(data)

# Remove rows with any NA
data_clean <- na.omit(data)

# Handle missing values - multiple strategies
# Strategy 1: Remove rows with missing values in key columns
data_clean <- data %>% filter(!is.na(name), !is.na(salary))

# Strategy 2: Impute with mean/median
data_clean <- data %>%
  mutate(salary = ifelse(is.na(salary), median(salary, na.rm = TRUE), salary))

# Strategy 3: Impute with forward fill (time series)
data_clean <- data %>%
  fill(department, .direction = "down")

# Fix type mismatches
data_clean <- data %>%
  mutate(age = as.numeric(age))

# Remove duplicates on specific columns
data_clean <- data %>% distinct(id, name, .keep_all = TRUE)

# Standardize text
data_clean <- data %>%
  mutate(
    name = str_trim(name),  # Remove leading/trailing spaces
    name = str_to_title(name),  # Title case
    department = str_to_upper(department)  # Upper case
  )

# Handle outliers (IQR method)
detect_outliers <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  return(x < lower_bound | x > upper_bound)
}

data_clean <- data %>%
  mutate(outlier = detect_outliers(salary))

8.1.3.2 2.3.2 Python: Data Cleaning

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Sample dataset
data = pd.DataFrame({
    'id': [1, 2, 2, 4, 5],
    'name': ['John', 'Jane', 'Jane', None, 'Mike'],
    'age': [25, 30, 30, 35, '40'],  # Type inconsistency
    'salary': [50000, np.nan, 60000, 75000, 45000],
    'department': ['IT', 'HR', 'HR', 'IT', 'IT']
})

# Check for missing values
data.isnull().sum()
data.isnull().sum() / len(data) * 100  # Percentage

# Visualize missing patterns
import missingno as msno
msno.matrix(data)

# Identify and remove duplicates
duplicate_mask = data.duplicated()
data_no_dup = data.drop_duplicates()
data_no_dup = data.drop_duplicates(subset=['id', 'name'])

# Remove rows with any missing values
data_clean = data.dropna()

# Remove rows with missing values in specific columns
data_clean = data.dropna(subset=['name', 'salary'])

# Impute missing values
# Strategy 1: Mean imputation
data_clean = data.copy()
data_clean['salary'].fillna(data_clean['salary'].mean(), inplace=True)

# Strategy 2: Median imputation
data_clean['salary'] = data['salary'].fillna(data['salary'].median())

# Strategy 3: Forward fill
data_clean['department'] = data['department'].fillna(method='ffill')

# Strategy 4: Using interpolation
data_clean['salary'] = data['salary'].interpolate()

# Fix type mismatches
data_clean['age'] = pd.to_numeric(data['age'], errors='coerce')

# Standardize text
data_clean['name'] = data_clean['name'].str.strip().str.title()
data_clean['department'] = data_clean['department'].str.upper()

# Handle outliers (IQR method)
def detect_outliers_iqr(x):
    Q1 = x.quantile(0.25)
    Q3 = x.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (x < lower_bound) | (x > upper_bound)

outliers = detect_outliers_iqr(data_clean['salary'])
data_no_outliers = data_clean[~outliers]

8.1.4 2.4 Data Manipulation and Transformation

8.1.4.1 2.4.1 R: Data Manipulation with dplyr

library(dplyr)
library(tidyr)

# Sample data
employees <- tibble(
  emp_id = 1:5,
  name = c("Alice", "Bob", "Carol", "David", "Eve"),
  department = c("IT", "HR", "IT", "Finance", "HR"),
  salary = c(80000, 60000, 85000, 70000, 65000),
  years = c(3, 5, 2, 7, 4)
)

# Select columns
selected <- employees %>% select(name, salary, department)
selected <- employees %>% select(starts_with("e"))  # Select by pattern

# Filter rows
it_dept <- employees %>% filter(department == "IT")
high_earners <- employees %>% filter(salary > 70000)
complex_filter <- employees %>% 
  filter(department %in% c("IT", "HR") & years > 2)

# Mutate - create/modify columns
employees_enhanced <- employees %>%
  mutate(
    salary_thousands = salary / 1000,
    seniority = if_else(years >= 5, "Senior", "Junior"),
    annual_increase = salary * 0.05
  )

# Group by and summarize
summary_stats <- employees %>%
  group_by(department) %>%
  summarise(
    count = n(),
    avg_salary = mean(salary),
    median_salary = median(salary),
    total_years = sum(years),
    .groups = 'drop'
  )

# Arrange (sort)
sorted <- employees %>% arrange(salary)  # Ascending
sorted <- employees %>% arrange(desc(salary))  # Descending

# Rename columns
renamed <- employees %>% rename(employee_name = name)

# Pivot longer (wide to long)
wide_data <- tibble(
  id = 1:3,
  Q1_sales = c(1000, 1500, 2000),
  Q2_sales = c(1200, 1600, 2100)
)

long_data <- wide_data %>%
  pivot_longer(cols = starts_with("Q"),
               names_to = "quarter",
               values_to = "sales")

# Pivot wider (long to wide)
sales_data <- tibble(
  product = c("A", "A", "B", "B"),
  quarter = c("Q1", "Q2", "Q1", "Q2"),
  revenue = c(1000, 1200, 1500, 1600)
)

wide_sales <- sales_data %>%
  pivot_wider(names_from = quarter,
              values_from = revenue)

# Join operations
departments <- tibble(
  dept_id = 1:3,
  dept_name = c("IT", "HR", "Finance")
)

employees_expanded <- employees %>%
  mutate(dept_id = match(department, c("IT", "HR", "Finance"))) %>%
  left_join(departments, by = "dept_id")

# Chain operations
result <- employees %>%
  filter(salary > 60000) %>%
  group_by(department) %>%
  summarise(avg_salary = mean(salary), count = n()) %>%
  arrange(desc(avg_salary))

8.1.4.2 2.4.2 Python: Data Manipulation with Pandas

import pandas as pd
import numpy as np

# Sample data
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
    'department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'salary': [80000, 60000, 85000, 70000, 65000],
    'years': [3, 5, 2, 7, 4]
})

# Select columns
selected = employees[['name', 'salary', 'department']]
selected = employees.loc[:, ['name', 'salary']]
selected = employees.filter(like='name')  # Filter by pattern

# Filter rows
it_dept = employees[employees['department'] == 'IT']
high_earners = employees[employees['salary'] > 70000]
complex_filter = employees[(employees['department'].isin(['IT', 'HR'])) & 
                          (employees['years'] > 2)]

# Create/modify columns (assign)
employees_enhanced = employees.assign(
    salary_thousands=employees['salary'] / 1000,
    seniority=employees['years'].apply(
        lambda x: 'Senior' if x >= 5 else 'Junior'
    ),
    annual_increase=employees['salary'] * 0.05
)

# Using .loc for direct assignment
employees_enhanced = employees.copy()
employees_enhanced['seniority'] = np.where(
    employees['years'] >= 5, 'Senior', 'Junior'
)

# Group by and aggregate
summary_stats = employees.groupby('department').agg({
    'emp_id': 'count',
    'salary': ['mean', 'median', 'min', 'max'],
    'years': 'sum'
})

# Rename aggregation columns
summary_stats = employees.groupby('department').agg(
    count=('emp_id', 'count'),
    avg_salary=('salary', 'mean'),
    median_salary=('salary', 'median'),
    total_years=('years', 'sum')
).reset_index()

# Sort (sort_values)
sorted_asc = employees.sort_values('salary')
sorted_desc = employees.sort_values('salary', ascending=False)

# Rename columns
renamed = employees.rename(columns={'name': 'employee_name'})

# Pivot long to wide (melt vs pivot_table)
wide_data = pd.DataFrame({
    'id': [1, 2, 3],
    'Q1_sales': [1000, 1500, 2000],
    'Q2_sales': [1200, 1600, 2100]
})

long_data = wide_data.melt(id_vars='id',
                           var_name='quarter',
                           value_name='sales')

# Pivot wide to long
sales_data = pd.DataFrame({
    'product': ['A', 'A', 'B', 'B'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [1000, 1200, 1500, 1600]
})

wide_sales = sales_data.pivot_table(index='product',
                                    columns='quarter',
                                    values='revenue',
                                    aggfunc='sum')

# Join operations
departments = pd.DataFrame({
    'dept_id': [1, 2, 3],
    'dept_name': ['IT', 'HR', 'Finance']
})

employees['dept_id'] = employees['department'].map(
    {'IT': 1, 'HR': 2, 'Finance': 3}
)

employees_expanded = employees.merge(departments, on='dept_id')

# Chain operations (method chaining)
result = (employees
    .query('salary > 60000')
    .groupby('department')
    .agg(avg_salary=('salary', 'mean'), count=('emp_id', 'count'))
    .reset_index()
    .sort_values('avg_salary', ascending=False)
)

8.2 MODULE 3: EXPLORATORY DATA ANALYSIS AND VISUALIZATION

8.2.1 3.1 Descriptive Statistics

Key Metrics: - Central Tendency: Mean, Median, Mode - Dispersion: Range, Variance, Standard Deviation, IQR - Distribution Shape: Skewness, Kurtosis - Relationships: Correlation, Covariance

8.2.1.1 3.1.1 R: Descriptive Statistics

library(dplyr)
library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

# Sample data
data <- tibble(
  age = c(25, 28, 30, 35, 42, 45, 50),
  income = c(45000, 52000, 58000, 65000, 75000, 82000, 95000),
  experience = c(2, 3, 4, 8, 12, 15, 20)
)

# Basic summary statistics
summary(data)

##       age            income        experience    
##  Min.   :25.00   Min.   :45000   Min.   : 2.000  
##  1st Qu.:29.00   1st Qu.:55000   1st Qu.: 3.500  
##  Median :35.00   Median :65000   Median : 8.000  
##  Mean   :36.43   Mean   :67429   Mean   : 9.143  
##  3rd Qu.:43.50   3rd Qu.:78500   3rd Qu.:13.500  
##  Max.   :50.00   Max.   :95000   Max.   :20.000

# Detailed statistics
describe(data)

##            vars n     mean       sd median  trimmed      mad   min   max range
## age           1 7    36.43     9.43     35    36.43    10.38    25    50    25
## income        2 7 67428.57 17633.84  65000 67428.57 19273.80 45000 95000 50000
## experience    3 7     9.14     6.79      8     9.14     7.41     2    20    18
##            skew kurtosis      se
## age        0.16    -1.84    3.56
## income     0.23    -1.61 6664.97
## experience 0.35    -1.67    2.57

# Custom summary function
data %>%
  summarise(
    across(everything(),
      list(
        mean = ~mean(., na.rm = TRUE),
        median = ~median(., na.rm = TRUE),
        sd = ~sd(., na.rm = TRUE),
        min = ~min(., na.rm = TRUE),
        max = ~max(., na.rm = TRUE),
        q25 = ~quantile(., 0.25, na.rm = TRUE),
        q75 = ~quantile(., 0.75, na.rm = TRUE)
      )
    )
  )

## # A tibble: 1 × 21
##   age_mean age_median age_sd age_min age_max age_q25 age_q75 income_mean
##      <dbl>      <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>       <dbl>
## 1     36.4         35   9.43      25      50      29    43.5      67429.
## # ℹ 13 more variables: income_median <dbl>, income_sd <dbl>, income_min <dbl>,
## #   income_max <dbl>, income_q25 <dbl>, income_q75 <dbl>,
## #   experience_mean <dbl>, experience_median <dbl>, experience_sd <dbl>,
## #   experience_min <dbl>, experience_max <dbl>, experience_q25 <dbl>,
## #   experience_q75 <dbl>

# Correlation matrix
cor_matrix <- cor(data)

# Correlation with significance tests
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

corr_test <- rcorr(as.matrix(data))

# Skewness and Kurtosis
#skewness(data$age)
#kurtosis(data$age)

# Group-wise summaries
data %>%
  mutate(age_group = cut(age, breaks = c(20, 35, 50, 100))) %>%
  group_by(age_group) %>%
  summarise(
    n = n(),
    mean_income = mean(income),
    sd_income = sd(income)
  )

## # A tibble: 2 × 4
##   age_group     n mean_income sd_income
##   <fct>     <int>       <dbl>     <dbl>
## 1 (20,35]       4       55000     8524.
## 2 (35,50]       3       84000    10149.

8.2.1.2 3.1.2 Python: Descriptive Statistics

import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns

# Sample data
data = pd.DataFrame({
    'age': [25, 28, 30, 35, 42, 45, 50],
    'income': [45000, 52000, 58000, 65000, 75000, 82000, 95000],
    'experience': [2, 3, 4, 8, 12, 15, 20]
})

# Basic summary statistics
data.describe()

# Detailed statistics
data.describe(percentiles=[0.25, 0.5, 0.75])

# Individual statistics
data['age'].mean()
data['age'].median()
data['age'].std()
data[['age', 'income']].mean()

# Custom summary
data.agg({
    'age': ['mean', 'median', 'std', 'min', 'max'],
    'income': ['mean', 'median', 'std', 'min', 'max']
})

# Correlation matrix
corr_matrix = data.corr()

# Correlation with p-values
from scipy.stats import pearsonr
corr, p_value = pearsonr(data['age'], data['income'])

# Skewness and Kurtosis
stats.skew(data['age'])
stats.kurtosis(data['age'])

# Group-wise summaries
data['age_group'] = pd.cut(data['age'], bins=[20, 35, 50, 100])
data.groupby('age_group').agg({
    'income': ['mean', 'std', 'count']
})

# Value counts
data['age_group'].value_counts()

8.2.2 3.2 Data Visualization Principles

Key Principles: 1. Clarity: Choose appropriate chart types for data structure 2. Simplicity: Remove clutter; emphasize important patterns 3. Color: Use purposefully; ensure accessibility 4. Context: Include titles, labels, legends, and references 5. Accuracy: Don’t distort data through inappropriate scaling

Chart Type Selection: - Univariate Continuous: Histogram, Density Plot, Box Plot - Univariate Categorical: Bar Chart, Pie Chart - Bivariate: Scatter Plot, Line Plot, Box Plot - Multivariate: Faceted plots, Color/Size encoding, Parallel coordinates

8.2.3 3.3 Static Visualizations

8.2.3.1 3.3.1 R: Static Visualizations with ggplot2

library(ggplot2)
library(dplyr)

# Sample dataset
data <- tibble(
  department = rep(c("IT", "HR", "Finance"), each = 10),
  salary = c(rnorm(10, 80000, 5000), 
             rnorm(10, 60000, 3000),
             rnorm(10, 70000, 4000)),
  experience = rep(1:10, 3)
)

# Histogram
ggplot(data, aes(x = salary)) +
  geom_histogram(binwidth = 5000, fill = "steelblue", alpha = 0.7) +
  labs(title = "Distribution of Salaries",
       x = "Salary ($)",
       y = "Frequency") +
  theme_minimal()

# Density plot
ggplot(data, aes(x = salary, fill = department)) +
  geom_density(alpha = 0.5) +
  labs(title = "Salary Distribution by Department") +
  theme_minimal()

# Box plot
ggplot(data, aes(x = department, y = salary, fill = department)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.5) +
  labs(title = "Salary Distribution by Department",
       x = "Department",
       y = "Salary ($)") +
  theme_minimal() +
  theme(legend.position = "none")

# Scatter plot
ggplot(data, aes(x = experience, y = salary, color = department)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, alpha = 0.2) +
  labs(title = "Salary vs Experience by Department",
       x = "Years of Experience",
       y = "Salary ($)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

# Bar plot
dept_summary <- data %>%
  group_by(department) %>%
  summarise(avg_salary = mean(salary))

ggplot(dept_summary, aes(x = department, y = avg_salary, fill = department)) +
  geom_col(alpha = 0.7) +
  geom_text(aes(label = round(avg_salary, -3)), vjust = -0.5) +
  labs(title = "Average Salary by Department",
       x = "Department",
       y = "Average Salary ($)") +
  theme_minimal() +
  theme(legend.position = "none")

# Faceted plots
ggplot(data, aes(x = experience, y = salary)) +
  geom_point(color = "steelblue", alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  facet_wrap(~ department) +
  labs(title = "Salary vs Experience (by Department)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

# Heatmap/Correlation plot
correlation_plot <- data %>%
  select(salary, experience) %>%
  cor() %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  pivot_longer(-rowname) %>%
  ggplot(aes(x = rowname, y = name, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2))) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", 
                       midpoint = 0, limits = c(-1, 1)) +
  theme_minimal() +
  theme(axis.title = element_blank())

8.2.3.2 3.3.2 Python: Static Visualizations with Matplotlib and Seaborn

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Sample dataset
np.random.seed(42)
data = pd.DataFrame({
    'department': np.repeat(['IT', 'HR', 'Finance'], 10),
    'salary': np.concatenate([
        np.random.normal(80000, 5000, 10),
        np.random.normal(60000, 3000, 10),
        np.random.normal(70000, 4000, 10)
    ]),
    'experience': np.tile(np.arange(1, 11), 3)
})

# Histogram
plt.figure(figsize=(10, 6))
plt.hist(data['salary'], bins=15, color='steelblue', alpha=0.7, edgecolor='black')
plt.title('Distribution of Salaries', fontsize=14, fontweight='bold')
plt.xlabel('Salary ($)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.tight_layout()
plt.show()

# Density plot
plt.figure(figsize=(10, 6))
for dept in data['department'].unique():
    data[data['department'] == dept]['salary'].plot(kind='density', label=dept, alpha=0.7)
plt.title('Salary Distribution by Department', fontsize=14, fontweight='bold')
plt.xlabel('Salary ($)', fontsize=12)
plt.legend()
plt.tight_layout()
plt.show()

# Box plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x='department', y='salary', palette='Set2')
sns.stripplot(data=data, x='department', y='salary', color='black', alpha=0.3, size=4)
plt.title('Salary Distribution by Department', fontsize=14, fontweight='bold')
plt.xlabel('Department', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.tight_layout()
plt.show()

# Scatter plot
plt.figure(figsize=(10, 6))
for dept in data['department'].unique():
    dept_data = data[data['department'] == dept]
    plt.scatter(dept_data['experience'], dept_data['salary'], label=dept, alpha=0.7, s=100)
plt.title('Salary vs Experience by Department', fontsize=14, fontweight='bold')
plt.xlabel('Years of Experience', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Bar plot
dept_summary = data.groupby('department')['salary'].mean()
plt.figure(figsize=(10, 6))
bars = plt.bar(dept_summary.index, dept_summary.values, color=['#1f77b4', '#ff7f0e', '#2ca02c'], alpha=0.7)
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'${height:,.0f}',
             ha='center', va='bottom', fontsize=11)
plt.title('Average Salary by Department', fontsize=14, fontweight='bold')
plt.xlabel('Department', fontsize=12)
plt.ylabel('Average Salary ($)', fontsize=12)
plt.tight_layout()
plt.show()

# Faceted plots
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for idx, dept in enumerate(data['department'].unique()):
    dept_data = data[data['department'] == dept]
    axes[idx].scatter(dept_data['experience'], dept_data['salary'], alpha=0.7)
    axes[idx].set_title(dept, fontsize=12)
    axes[idx].set_xlabel('Experience (years)', fontsize=10)
    axes[idx].set_ylabel('Salary ($)', fontsize=10)
    axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Correlation heatmap
plt.figure(figsize=(8, 6))
corr_matrix = data[['salary', 'experience']].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            cbar_kws={'label': 'Correlation'}, square=True)
plt.title('Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

8.2.4 3.4 Interactive Visualizations

8.2.4.1 3.4.1 R: Interactive Visualizations with Plotly and Shiny

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:Hmisc':
## 
##     subplot

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(shiny)

## 
## Attaching package: 'shiny'

## The following object is masked from 'package:jsonlite':
## 
##     validate

# Interactive scatter plot with Plotly
plot_ly(data = mtcars, 
        x = ~wt, 
        y = ~mpg,
        type = 'scatter',
        mode = 'markers',
        marker = list(size = 8, color = ~hp, colorscale = 'Viridis')) %>%
  layout(title = 'MPG vs Weight',
         xaxis = list(title = 'Weight (1000 lbs)'),
         yaxis = list(title = 'Miles per Gallon'))

# Interactive time series
library(dplyr)
time_data <- tibble(
  date = seq(as.Date('2020-01-01'), as.Date('2023-12-31'), by = 'day'),
  value = cumsum(rnorm(1461))
)

plot_ly(data = time_data, x = ~date, y = ~value, type = 'scatter', mode = 'lines') %>%
  layout(title = 'Time Series Example',
         xaxis = list(title = 'Date'),
         yaxis = list(title = 'Value'))

# Shiny app example
ui <- fluidPage(
  titlePanel("Data Visualization Dashboard"),
  sidebarLayout(
    sidebarPanel(
      selectInput("dataset", "Choose dataset:",
                  choices = c("mtcars", "iris", "chickwts")),
      selectInput("x_var", "X variable:", choices = NULL),
      selectInput("y_var", "Y variable:", choices = NULL)
    ),
    mainPanel(
      plotlyOutput("plot")
    )
  )
)

server <- function(input, output, session) {
  # Update variable choices based on dataset
  observe({
    data <- get(input$dataset)
    updateSelectInput(session, "x_var", choices = names(data))
    updateSelectInput(session, "y_var", choices = names(data))
  })
  
  output$plot <- renderPlotly({
    data <- get(input$dataset)
    plot_ly(data = data,
            x = as.formula(paste0('~', input$x_var)),
            y = as.formula(paste0('~', input$y_var)),
            type = 'scatter',
            mode = 'markers') %>%
      layout(title = paste(input$x_var, 'vs', input$y_var))
  })
}

# shinyApp(ui, server)

8.2.4.2 3.4.2 Python: Interactive Visualizations with Plotly

import plotly.graph_objects as go
import plotly.express as px
import pandas as pd

# Sample data
data = pd.DataFrame({
    'x': range(1, 11),
    'y': [10, 15, 13, 17, 20, 25, 22, 24, 26, 28]
})

# Interactive scatter plot
fig = px.scatter(data, x='x', y='y',
                title='Interactive Scatter Plot',
                labels={'x': 'X Variable', 'y': 'Y Variable'})
fig.show()

# Interactive line plot with range slider
dates = pd.date_range('2020-01-01', periods=100)
time_data = pd.DataFrame({
    'date': dates,
    'value': np.cumsum(np.random.randn(100))
})

fig = go.Figure()
fig.add_trace(go.Scatter(x=time_data['date'], y=time_data['value'],
                        mode='lines', name='Value'))
fig.update_layout(
    title='Interactive Time Series',
    xaxis_rangeslider_visible=False,
    xaxis_rangeselector=dict(
        buttons=list([
            dict(count=1, label="1m", step="month", stepmode="backward"),
            dict(count=3, label="3m", step="month", stepmode="backward"),
            dict(step="all", label="All")
        ])
    )
)
fig.show()

# Interactive box plot by category
iris = px.data.iris()
fig = px.box(iris, x='species', y='sepal_length',
            title='Iris Sepal Length by Species')
fig.show()

# Dashboard-style visualization
from plotly.subplots import make_subplots

fig = make_subplots(rows=2, cols=2, 
                    specs=[[{'type': 'scatter'}, {'type': 'bar'}],
                           [{'type': 'pie'}, {'type': 'box'}]])

# Add scatter
fig.add_trace(go.Scatter(x=[1, 2, 3], y=[4, 5, 6], name='Scatter'),
             row=1, col=1)

# Add bar
fig.add_trace(go.Bar(x=['A', 'B', 'C'], y=[1, 3, 2], name='Bar'),
             row=1, col=2)

# Add pie
fig.add_trace(go.Pie(labels=['A', 'B', 'C'], values=[1, 2, 3], name='Pie'),
             row=2, col=1)

# Add box
fig.add_trace(go.Box(y=[1, 2, 3, 4, 5], name='Box'),
             row=2, col=2)

fig.update_layout(height=800, showlegend=False, title_text="Dashboard")
fig.show()

8.2.5 3.5 Exploratory Data Analysis (EDA) Workflow

EDA Checklist: 1. Load and inspect data (dimensions, types, head/tail) 2. Check for missing values and outliers 3. Examine distributions of variables 4. Identify relationships between variables 5. Look for patterns, clusters, anomalies 6. Document findings and insights

8.2.5.1 3.5.1 R: Complete EDA Example

library(tidyverse)
library(naniar)
library(psych)

# Load data
df <- read_csv("employees.csv")

## Rows: 200 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, department, education_level
## dbl (5): emp_id, age, experience, salary, performance_score
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# 1. Basic inspection
dim(df)

## [1] 200   8

names(df)

## [1] "emp_id"            "name"              "department"       
## [4] "age"               "experience"        "salary"           
## [7] "education_level"   "performance_score"

head(df, 10)

## # A tibble: 10 × 8
##    emp_id name        department   age experience  salary education_level
##     <dbl> <chr>       <chr>      <dbl>      <dbl>   <dbl> <chr>          
##  1      1 Employee_1  Marketing     23         26  61972. Master         
##  2      2 Employee_2  Marketing     45          2  65322. Bachelor       
##  3      3 Employee_3  Sales         60         14  68593. Master         
##  4      4 Employee_4  HR            47          2  83483. PhD            
##  5      5 Employee_5  Finance       35         29  85457. Master         
##  6      6 Employee_6  Finance       45          4  63526. Bachelor       
##  7      7 Employee_7  Finance       56         17  74513. PhD            
##  8      8 Employee_8  Marketing     55          3 117845. Master         
##  9      9 Employee_9  Marketing     29         13 109551. Bachelor       
## 10     10 Employee_10 HR            45          9  83726. PhD            
## # ℹ 1 more variable: performance_score <dbl>

str(df)

## spc_tbl_ [200 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ emp_id           : num [1:200] 1 2 3 4 5 6 7 8 9 10 ...
##  $ name             : chr [1:200] "Employee_1" "Employee_2" "Employee_3" "Employee_4" ...
##  $ department       : chr [1:200] "Marketing" "Marketing" "Sales" "HR" ...
##  $ age              : num [1:200] 23 45 60 47 35 45 56 55 29 45 ...
##  $ experience       : num [1:200] 26 2 14 2 29 4 17 3 13 9 ...
##  $ salary           : num [1:200] 61972 65322 68593 83483 85457 ...
##  $ education_level  : chr [1:200] "Master" "Bachelor" "Master" "PhD" ...
##  $ performance_score: num [1:200] 1.93 1.32 4.11 1.49 3.38 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   emp_id = col_double(),
##   ..   name = col_character(),
##   ..   department = col_character(),
##   ..   age = col_double(),
##   ..   experience = col_double(),
##   ..   salary = col_double(),
##   ..   education_level = col_character(),
##   ..   performance_score = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

glimpse(df)

## Rows: 200
## Columns: 8
## $ emp_id            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ name              <chr> "Employee_1", "Employee_2", "Employee_3", "Employee_…
## $ department        <chr> "Marketing", "Marketing", "Sales", "HR", "Finance", …
## $ age               <dbl> 23, 45, 60, 47, 35, 45, 56, 55, 29, 45, 38, 48, 43, …
## $ experience        <dbl> 26, 2, 14, 2, 29, 4, 17, 3, 13, 9, 7, 15, 23, 29, 29…
## $ salary            <dbl> 61971.64, 65322.28, 68593.05, 83483.32, 85456.71, 63…
## $ education_level   <chr> "Master", "Bachelor", "Master", "PhD", "Master", "Ba…
## $ performance_score <dbl> 1.930205, 1.319862, 4.106895, 1.488140, 3.377163, 4.…

# 2. Missing values
colSums(is.na(df))

##            emp_id              name        department               age 
##                 0                 0                 0                 0 
##        experience            salary   education_level performance_score 
##                 0                 0                 0                 0

vis_miss(df)

# 3. Univariate analysis
#Numeric variables
df %>%
  select(where(is.numeric)) %>%
  describe()

## . 
## 
##  5  Variables      200  Observations
## --------------------------------------------------------------------------------
## emp_id 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##      200        0      200        1    100.5    100.5       67    10.95 
##      .10      .25      .50      .75      .90      .95 
##    20.90    50.75   100.50   150.25   180.10   190.05 
## 
## lowest :   1   2   3   4   5, highest: 196 197 198 199 200
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##      200        0       42    0.999    39.12     39.5    11.17    22.00 
##      .10      .25      .50      .75      .90      .95 
##    24.90    32.00    40.00    46.00    50.10    53.05 
## 
## lowest : 18 19 20 21 22, highest: 55 56 59 60 63
## --------------------------------------------------------------------------------
## experience 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##      200        0       29    0.998    15.24       15    9.736        2 
##      .10      .25      .50      .75      .90      .95 
##        3        8       15       22       27       28 
## 
## lowest :  1  2  3  4  5, highest: 25 26 27 28 29
## --------------------------------------------------------------------------------
## salary 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##      200        0      198        1    76714    76819    22872    43456 
##      .10      .25      .50      .75      .90      .95 
##    49317    62861    77096    90242   102610   108722 
## 
## lowest : 30000   31932.2 35036   40205.7 41140.9
## highest: 117845  120049  120819  123068  123795 
## --------------------------------------------------------------------------------
## performance_score 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##      200        0      200        1    2.921    2.919    1.347    1.256 
##      .10      .25      .50      .75      .90      .95 
##    1.406    1.902    2.920    3.970    4.526    4.774 
## 
## lowest : 1.00133 1.04171 1.05086 1.07075 1.14228
## highest: 4.90396 4.91359 4.9531  4.96635 4.96966
## --------------------------------------------------------------------------------

# Categorical variables
df %>%
  select(where(is.character)) %>%
  map_df(table)

## # A tibble: 3 × 208
##   Employee_1  Employee_10 Employee_100 Employee_101 Employee_102 Employee_103
##   <table[1d]> <table[1d]> <table[1d]>  <table[1d]>  <table[1d]>  <table[1d]> 
## 1  1           1           1            1            1            1          
## 2 NA          NA          NA           NA           NA           NA          
## 3 NA          NA          NA           NA           NA           NA          
## # ℹ 202 more variables: Employee_104 <table[1d]>, Employee_105 <table[1d]>,
## #   Employee_106 <table[1d]>, Employee_107 <table[1d]>,
## #   Employee_108 <table[1d]>, Employee_109 <table[1d]>,
## #   Employee_11 <table[1d]>, Employee_110 <table[1d]>,
## #   Employee_111 <table[1d]>, Employee_112 <table[1d]>,
## #   Employee_113 <table[1d]>, Employee_114 <table[1d]>,
## #   Employee_115 <table[1d]>, Employee_116 <table[1d]>, …

# 4. Bivariate analysis
# Numeric vs Numeric
pairs(df %>% select(where(is.numeric)))

cor(df %>% select(where(is.numeric)), use = "complete.obs")

##                         emp_id          age  experience      salary
## emp_id             1.000000000 -0.005122961 -0.10436278 -0.05412891
## age               -0.005122961  1.000000000 -0.06868697 -0.07345337
## experience        -0.104362779 -0.068686968  1.00000000 -0.04075939
## salary            -0.054128907 -0.073453371 -0.04075939  1.00000000
## performance_score  0.065627808  0.084145311 -0.06907438 -0.03085486
##                   performance_score
## emp_id                   0.06562781
## age                      0.08414531
## experience              -0.06907438
## salary                  -0.03085486
## performance_score        1.00000000

# Categorical vs Numeric
df %>%
  ggplot(aes(x = department, y = performance_score)) +
  geom_boxplot() +
  theme_minimal()

# 5. Multivariate analysis
df %>%
  select(where(is.numeric)) %>%
  cor() %>%
  corrplot::corrplot()

# 6. Outlier detection
df %>%
  select(where(is.numeric)) %>%
  map_df(function(x) {
    Q1 <- quantile(x, 0.25, na.rm = TRUE)
   Q3 <- quantile(x, 0.75, na.rm = TRUE)
    IQR <- Q3 - Q1
    outliers <- sum(x < Q1 - 1.5*IQR | x > Q3 + 1.5*IQR, na.rm = TRUE)
    return(tibble(outliers = outliers))})

## # A tibble: 5 × 1
##   outliers
##      <int>
## 1        0
## 2        0
## 3        0
## 4        0
## 5        0

# Generate EDA report
#library(DataExplorer)
#create_report(df)

8.2.5.2 3.5.2 Python: Complete EDA Example

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load data
df = pd.read_csv('employees.csv')

# 1. Basic inspection
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.head(10))
print(df.info())
print(df.dtypes)

# 2. Missing values
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)

# Visualize missing
import missingno as msno
msno.matrix(df)
plt.show()

# 3. Univariate analysis
# Numeric variables
print(df.describe())
print(df.describe(percentiles=[0.25, 0.5, 0.75, 0.95]))

# Categorical variables
for col in df.select_dtypes(include='object').columns:
    print(df[col].value_counts())

# 4. Bivariate analysis
# Correlation matrix
numeric_cols = df.select_dtypes(include=[np.number]).columns
corr_matrix = df[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()

# 5. Visualizations
# Histograms
for col in numeric_cols:
    plt.figure(figsize=(10, 4))
    plt.hist(df[col], bins=30, edgecolor='black')
    plt.title(f'Distribution of {col}')
    plt.show()

# Box plots
for col in df.select_dtypes(include='object').columns:
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df, x=col, y=numeric_cols[0])
    plt.title(f'{numeric_cols[0]} by {col}')
    plt.show()

# 6. Outlier detection
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return ((data < lower_bound) | (data > upper_bound)).sum()

outlier_counts = df[numeric_cols].apply(detect_outliers_iqr)
print("Outlier counts:\n", outlier_counts)

# 7. Statistical tests
# Normality test (Shapiro-Wilk)
for col in numeric_cols:
    stat, p_value = stats.shapiro(df[col].dropna())
    print(f"{col}: Shapiro-Wilk p-value = {p_value:.4f}")

# Generate comprehensive EDA report
#from ydata_profiling import ProfileReport
#profile = ProfileReport(df, title='EDA Report')
#profile.to_file('eda_report.html')

8.3 MODULE 4: STATISTICAL MODELING AND MACHINE LEARNING FUNDAMENTALS

8.3.1 4.1 Review of Linear Models

Ordinary Least Squares (OLS) Regression:

The linear regression model: \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i\]

Where: - \(y_i\) is the response variable - \(x_{ij}\) are predictor variables - \(\beta_j\) are regression coefficients - \(\epsilon_i\) is the error term (assumed \(\sim N(0, \sigma^2)\))

Key Assumptions: 1. Linearity: Relationship between predictors and response is linear 2. Independence: Observations are independent 3. Homoscedasticity: Constant variance of errors 4. Normality: Errors are normally distributed 5. No multicollinearity: Predictors are not highly correlated

8.3.1.1 4.1.1 R: Linear Regression

library(tidyverse)
library(broom)

# Sample data
data <- tibble(
  x1 = c(2, 3, 4, 5, 6, 7, 8, 9),
  x2 = c(1, 2, 2, 3, 4, 5, 6, 7),
  y = c(5.1, 6.2, 8.0, 8.9, 11.1, 12.8, 14.5, 16.2)
)

# Fit linear model
model <- lm(y ~ x1 + x2, data = data)

# Model summary
summary(model)

## 
## Call:
## lm(formula = y ~ x1 + x2, data = data)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
##  0.29167 -0.29167  0.29167 -0.49167  0.02500  0.04167  0.05833  0.07500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.9083     0.4827   3.954   0.0108 *
## x1            1.2167     0.3444   3.533   0.0167 *
## x2            0.4667     0.3977   1.173   0.2935  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3189 on 5 degrees of freedom
## Multiple R-squared:  0.9954, Adjusted R-squared:  0.9936 
## F-statistic: 540.5 on 2 and 5 DF,  p-value: 1.438e-06

# Extract coefficients
tidy(model)

## # A tibble: 3 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    1.91      0.483      3.95  0.0108
## 2 x1             1.22      0.344      3.53  0.0167
## 3 x2             0.467     0.398      1.17  0.293

glance(model)  # Model fit statistics

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic    p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>      <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.995         0.994 0.319      541. 0.00000144     2 -0.327  8.65  8.97
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# Predictions
predictions <- predict(model, newdata = data.frame(x1 = c(5, 6), x2 = c(3, 4)))
pred_with_ci <- predict(model, newdata = data.frame(x1 = c(5, 6), x2 = c(3, 4)),
                        interval = "confidence", level = 0.95)

# Residuals and diagnostics
residuals <- resid(model)
fitted_values <- fitted(model)

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)

par(mfrow = c(1, 1))

# Assumption checks
# Normality
shapiro.test(resid(model))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model)
## W = 0.87827, p-value = 0.1813

# Homoscedasticity
library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(model)

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 3.0706, df = 2, p-value = 0.2154

# Multicollinearity
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

vif(model)

## x1 x2 
## 49 49

# Visualization
data %>%
  mutate(predictions = predict(model)) %>%
  ggplot(aes(x = x1, y = y)) +
  geom_point(size = 3) +
  geom_line(aes(y = predictions), color = "red") +
  labs(title = "Linear Regression Fit",
       x = "X1", y = "Y") +
  theme_minimal()

8.3.1.2 4.1.2 Python: Linear Regression

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

# Sample data
data = pd.DataFrame({
    'x1': [2, 3, 4, 5, 6, 7, 8, 9],
    'x2': [1, 2, 2, 3, 4, 5, 6, 7],
    'y': [5.1, 6.2, 8.0, 8.9, 11.1, 12.8, 14.5, 16.2]
})

# Using scikit-learn
X = data[['x1', 'x2']]
y = data['y']

model = LinearRegression()
model.fit(X, y)

# Coefficients
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

# Predictions
y_pred = model.predict(X)
print("Predictions:", y_pred)

# Model evaluation
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)
print(f"MSE: {mse}, RMSE: {rmse}, R²: {r2}")

# Using statsmodels for detailed statistics
X_sm = sm.add_constant(X)
model_sm = sm.OLS(y, X_sm).fit()
print(model_sm.summary())

# Predictions with confidence intervals
predictions = model_sm.get_prediction(X_sm)
pred_summary = predictions.summary_frame(alpha=0.05)
print(pred_summary)

# Residual analysis
residuals = y - y_pred
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')

plt.subplot(1, 3, 2)
from scipy import stats
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot')

plt.subplot(1, 3, 3)
plt.hist(residuals, bins=10, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Histogram of Residuals')

plt.tight_layout()
plt.show()

# Check assumptions
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(residuals, X_sm)
print(f"Breusch-Pagan test p-value: {bp_test[1]}")

# VIF for multicollinearity
#from statsmodels.stats.outliers_influence import variance_inflation_factor
#vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
#print("VIF values:", vif)

8.3.2 4.2 Introduction to Supervised Learning

Supervised Learning: Learning from labeled data where both input features and target outputs are known.

Classification vs Regression: - Classification: Predict categorical/discrete target (classes) - Regression: Predict continuous target (numeric values)

General Workflow: 1. Data splitting (train/test/validation) 2. Feature engineering and preprocessing 3. Model training 4. Hyperparameter tuning 5. Model evaluation 6. Prediction on new data

8.3.3 4.3 Classification Algorithms

8.3.3.1 4.3.1 Logistic Regression

Model: Predicts probability of class membership using logistic function: \[P(Y = 1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots)}}\]

R Implementation:

library(tidyverse)
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

# Sample binary classification data
set.seed(42)
n <- 200
data <- tibble(
  feature1 = rnorm(n, mean = 0, sd = 1),
  feature2 = rnorm(n, mean = 0, sd = 1),
  target = factor(ifelse(0.5 + 0.8 * feature1 + 0.6 * feature2 + rnorm(n, 0, 0.5) > 0, 
                         "Class1", "Class0"))
)

# Split data
train_idx <- createDataPartition(data$target, p = 0.7, list = FALSE)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]

# Fit logistic regression
model <- glm(target ~ feature1 + feature2, 
             family = binomial(link = "logit"), 
             data = train_data)

# Predictions
pred_prob <- predict(model, newdata = test_data, type = "response")
pred_class <- ifelse(pred_prob > 0.5, "Class1", "Class0")

# Evaluation
library(caret)
confusionMatrix(factor(pred_class), test_data$target)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Class0 Class1
##     Class0     17      7
##     Class1      4     32
##                                           
##                Accuracy : 0.8167          
##                  95% CI : (0.6956, 0.9048)
##     No Information Rate : 0.65            
##     P-Value [Acc > NIR] : 0.003653        
##                                           
##                   Kappa : 0.6099          
##                                           
##  Mcnemar's Test P-Value : 0.546494        
##                                           
##             Sensitivity : 0.8095          
##             Specificity : 0.8205          
##          Pos Pred Value : 0.7083          
##          Neg Pred Value : 0.8889          
##              Prevalence : 0.3500          
##          Detection Rate : 0.2833          
##    Detection Prevalence : 0.4000          
##       Balanced Accuracy : 0.8150          
##                                           
##        'Positive' Class : Class0          
##

# ROC curve
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

roc_obj <- roc(test_data$target, pred_prob)

## Setting levels: control = Class0, case = Class1

## Setting direction: controls < cases

plot(roc_obj, main = "ROC Curve")

print(paste("AUC:", round(auc(roc_obj), 3)))

## [1] "AUC: 0.939"

Python Implementation:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, auc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample binary classification data
np.random.seed(42)
n = 200
X = np.random.randn(n, 2)
y = (0.5 + 0.8 * X[:, 0] + 0.6 * X[:, 1] + np.random.randn(n) * 0.5 > 0).astype(int)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit logistic regression
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

8.3.3.2 4.3.2 Decision Trees

Concept: Recursive partitioning of feature space through binary splits.

Advantages: Interpretable, handles non-linearity, no scaling needed Disadvantages: Prone to overfitting, unstable

R Implementation:

library(rpart)
library(rpart.plot)
library(tidyverse)
library(caret)

# Sample classification data
set.seed(42)
data <- read_csv("iris.csv")  # Using iris dataset

## Rows: 150 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): species
## dbl (4): sepal_length, sepal_width, petal_length, petal_width
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Train-test split
train_idx <- createDataPartition(data$species, p = 0.7, list = FALSE)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]

# Fit decision tree
tree_model <- rpart(species ~ ., 
                    data = train_data,
                    method = "class",
                    control = rpart.control(cp = 0.01, minsplit = 5))

# Visualize tree
rpart.plot(tree_model, main = "Classification Tree")

# Predictions
pred <- predict(tree_model, newdata = test_data, type = "class")

# Evaluation
pred <- as.factor(pred)
test_data$species <- as.factor(test_data$species)
levels(pred) <- levels(test_data$species)  # Ensure same levels

confusionMatrix(pred, test_data$species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         15         0
##   virginica       0          0        15
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9213, 1)
##     No Information Rate : 0.3333     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3333
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            1.0000           1.0000

Python Implementation:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt

# Load data
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Predictions
pred = tree.predict(X_test)

# Confusion matrix (no factor issues!)
cm = confusion_matrix(y_test, pred)
print("Confusion Matrix:\n", cm)

# Visualize
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
disp.plot()
plt.show()

8.3.3.3 4.3.3 Random Forest

Concept: Ensemble of decision trees aggregating predictions through voting/averaging.

Advantages: Reduces overfitting, handles feature importance, robust Disadvantages: Less interpretable, computationally intensive

R Implementation:

library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:psych':
## 
##     outlier

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(caret)

# Sample data
set.seed(42)
data <- iris

# Train-test split
train_idx <- createDataPartition(data$Species, p = 0.7, list = FALSE)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]

# Fit random forest
rf_model <- randomForest(Species ~ ., 
                         data = train_data,
                         ntree = 100,
                         mtry = 2)

# Predictions
pred <- predict(rf_model, newdata = test_data)

# Evaluation
confusionMatrix(pred, test_data$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         15         2
##   virginica       0          0        13
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9556          
##                  95% CI : (0.8485, 0.9946)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9333          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8667
## Specificity                 1.0000            0.9333           1.0000
## Pos Pred Value              1.0000            0.8824           1.0000
## Neg Pred Value              1.0000            1.0000           0.9375
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.2889
## Detection Prevalence        0.3333            0.3778           0.2889
## Balanced Accuracy           1.0000            0.9667           0.9333

# Feature importance
importance(rf_model)

##              MeanDecreaseGini
## Sepal.Length         7.393491
## Sepal.Width          1.979317
## Petal.Length        30.176703
## Petal.Width         29.739061

varImpPlot(rf_model, main = "Feature Importance")

Python Implementation:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, accuracy_score
import pandas as pd
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit random forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predictions
y_pred = rf_model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))

# Feature importance
feature_importance = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print(feature_importance)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.xlabel('Importance')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()

8.3.3.4 4.3.4 Support Vector Machines (SVM)

Concept: Find optimal hyperplane maximizing margin between classes.

Key Parameters: Kernel type (linear, RBF, polynomial), regularization (C), gamma

R Implementation:

library(e1071)

## 
## Attaching package: 'e1071'

## The following object is masked from 'package:Hmisc':
## 
##     impute

## The following object is masked from 'package:ggplot2':
## 
##     element

library(caret)

# Sample data
set.seed(42)
data <- iris

# Train-test split
train_idx <- createDataPartition(data$Species, p = 0.7, list = FALSE)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]

# Fit SVM with RBF kernel
svm_model <- svm(Species ~ ., 
                 data = train_data,
                 kernel = "radial",
                 cost = 1)

# Predictions
pred <- predict(svm_model, newdata = test_data)

# Evaluation
confusionMatrix(pred, test_data$Species)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         14          0         0
##   versicolor      1         14         2
##   virginica       0          1        13
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9111          
##                  95% CI : (0.7878, 0.9752)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 8.467e-16       
##                                           
##                   Kappa : 0.8667          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 0.9333            0.9333           0.8667
## Specificity                 1.0000            0.9000           0.9667
## Pos Pred Value              1.0000            0.8235           0.9286
## Neg Pred Value              0.9677            0.9643           0.9355
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3111            0.3111           0.2889
## Detection Prevalence        0.3111            0.3778           0.3111
## Balanced Accuracy           0.9667            0.9167           0.9167

Python Implementation:

#from sklearn.svm import SVC
#from sklearn.preprocessing import StandardScaler
#from sklearn.model_selection import train_test_split
#from sklearn.datasets import load_iris
#from sklearn.metrics import confusion_matrix, accuracy_score

# Load data
#iris = load_iris()
#X, y = iris.data, iris.target

# Standardize features
#scaler = StandardScaler()
#X_scaled = scaler.fit_transform(X)

# Train-test split
#X_train, X_test, y_train, y_test = train_test_split(
#    X_scaled, y, test_size=0.3, random_state=42)

# Fit SVM
#svm_model = SVC(kernel='rbf', C=1.0, gamma='scale')
#svm_model.fit(X_train, y_train)

# Predictions
#y_pred = svm_model.predict(X_test)

# Evaluation
#print("Accuracy:", accuracy_score(y_test, y_pred))
#print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

8.3.4 4.4 Regression Algorithms

8.3.4.1 4.4.1 Multiple Linear Regression

Already covered in 4.1. Key concepts: - Multiple predictors - Assumption checking important - Interpretability: \(\beta_j\) represents unit change in \(y\) per unit change in \(x_j\)

8.3.4.2 4.4.2 Ridge and Lasso Regression

Problem: Multicollinearity and overfitting in high dimensions

Ridge Regression (L2 penalty): \[\text{minimize} \quad \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p}\beta_j^2\]

Lasso Regression (L1 penalty): \[\text{minimize} \quad \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p}|\beta_j|\]

R Implementation:

library(glmnet)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.1-10

library(caret)

# Sample regression data
set.seed(42)
X <- matrix(rnorm(100 * 20), 100, 20)
y <- X[, 1] * 2 + X[, 2] * 1.5 + rnorm(100)

# Train-test split
train_idx <- 1:70
X_train <- X[train_idx, ]
y_train <- y[train_idx]
X_test <- X[-train_idx, ]
y_test <- y[-train_idx]

# Ridge regression
ridge_model <- glmnet(X_train, y_train, alpha = 0, lambda = seq(0.01, 1, 0.01))
ridge_cv <- cv.glmnet(X_train, y_train, alpha = 0)
plot(ridge_cv)

ridge_pred <- predict(ridge_model, s = ridge_cv$lambda.min, newx = X_test)

# Lasso regression
lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda = seq(0.01, 1, 0.01))
lasso_cv <- cv.glmnet(X_train, y_train, alpha = 1)
plot(lasso_cv)

lasso_pred <- predict(lasso_model, s = lasso_cv$lambda.min, newx = X_test)

# Evaluation
ridge_mse <- mean((ridge_pred - y_test)^2)
lasso_mse <- mean((lasso_pred - y_test)^2)
print(paste("Ridge MSE:", ridge_mse))

## [1] "Ridge MSE: 1.50161663606881"

print(paste("Lasso MSE:", lasso_mse))

## [1] "Lasso MSE: 1.39301262183192"

Python Implementation:

from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

# Sample regression data
np.random.seed(42)
X = np.random.randn(100, 20)
y = X[:, 0] * 2 + X[:, 1] * 1.5 + np.random.randn(100)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Ridge regression with cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-2, 3, 100), cv=5)
ridge_cv.fit(X_train, y_train)
ridge_pred = ridge_cv.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_pred)
print(f"Ridge MSE: {ridge_mse:.4f}, Best alpha: {ridge_cv.alpha_:.4f}")

# Lasso regression with cross-validation
lasso_cv = LassoCV(alphas=np.logspace(-4, 0, 100), cv=5)
lasso_cv.fit(X_train, y_train)
lasso_pred = lasso_cv.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_pred)
print(f"Lasso MSE: {lasso_mse:.4f}, Best alpha: {lasso_cv.alpha_:.4f}")

# Visualize coefficients
ridge_coef = ridge_cv.coef_
lasso_coef = lasso_cv.coef_

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.bar(range(len(ridge_coef)), ridge_coef)
plt.title('Ridge Regression Coefficients')
plt.xlabel('Feature Index')

plt.subplot(1, 2, 2)
plt.bar(range(len(lasso_coef)), lasso_coef)
plt.title('Lasso Regression Coefficients')
plt.xlabel('Feature Index')

plt.tight_layout()
plt.show()

8.3.4.3 4.4.3 Random Forest for Regression

Concept: Ensemble of regression trees Output: Average predictions from multiple trees

R Implementation:

library(randomForest)
library(tibble)

# Sample regression data
set.seed(42)
data <- tibble(
  x1 = runif(100, 0, 10),
  x2 = runif(100, 0, 10),
  y = 2 * x1 + 1.5 * x2 + rnorm(100, 0, 2))

# Train-test split
train_idx <- 1:70
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]

# Fit random forest for regression
rf_model <- randomForest(y ~ x1 + x2, 
                         data = train_data,
                         ntree = 100)

# Predictions
pred <- predict(rf_model, newdata = test_data)

# Evaluation
mse <- mean((pred - test_data$y)^2)
rmse <- sqrt(mse)
r2 <- 1 - (sum((pred - test_data$y)^2) / sum((test_data$y - mean(test_data$y))^2))
print(paste("MSE:", mse, "RMSE:", rmse, "R²:", r2))

## [1] "MSE: 8.7368963712931 RMSE: 2.95582414417588 R²: 0.7744288197865"

Python Implementation:

#from sklearn.ensemble import RandomForestRegressor
#from sklearn.model_selection import train_test_split
#from sklearn.metrics import mean_squared_error, r2_score
#import numpy as np

# Sample regression data
#np.random.seed(42)
#X = np.column_stack([
#    np.random.uniform(0, 10, 100),
#    np.random.uniform(0, 10, 100)])
#y = 2 * X[:, 0] + 1.5 * X[:, 1] + np.random.randn(100) * 2

# Train-test split
#X_train, X_test, y_train, y_test = train_test_split(
#    X, y, test_size=0.3, random_state=42)

# Fit random forest for regression
#rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
#rf_model.fit(X_train, y_train)

# Predictions
#y_pred = rf_model.predict(X_test)

# Evaluation
#mse = mean_squared_error(y_test, y_pred)
#rmse = np.sqrt(mse)
#r2 = r2_score(y_test, y_pred)
#print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}, R²: {r2:.4f}")

8.3.5 4.5 Introduction to Unsupervised Learning

Unsupervised Learning: Finding patterns in unlabeled data

Common Tasks: - Clustering: Group similar observations - Dimensionality Reduction: Reduce number of features - Anomaly Detection: Identify unusual observations

8.3.5.1 4.5.1 K-Means Clustering

Algorithm: 1. Initialize k centroids randomly 2. Assign each point to nearest centroid 3. Update centroids as mean of assigned points 4. Repeat steps 2-3 until convergence

R Implementation:

library(tidyverse)

# Sample data
set.seed(42)
data <- tibble(
  x = c(rnorm(30, 2, 0.5), rnorm(30, 6, 0.5)),
  y = c(rnorm(30, 2, 0.5), rnorm(30, 6, 0.5)))

# K-means clustering
kmeans_model <- kmeans(data, centers = 2, nstart = 25)

# Add cluster assignments
data_clustered <- data %>%
  mutate(cluster = factor(kmeans_model$cluster))

# Visualize
ggplot(data_clustered, aes(x = x, y = y, color = cluster)) +
  geom_point(size = 3) +
  geom_point(data = tibble(x = kmeans_model$centers[, 1],
                           y = kmeans_model$centers[, 2]),
             aes(color = NA), size = 5, shape = 3, color = "black") +
  labs(title = "K-Means Clustering") +
  theme_minimal()

# Elbow method for optimal k
withinss <- map_dbl(1:10, ~kmeans(data, centers = ., nstart = 25)$tot.withinss)
ggplot(tibble(k = 1:10, withinss = withinss), aes(x = k, y = withinss)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Plot", x = "Number of Clusters", y = "Within-cluster SS") +
  theme_minimal()

Python Implementation:

from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Sample data
np.random.seed(42)
X = np.vstack([
    np.random.randn(30, 2) + np.array([2, 2]),
    np.random.randn(30, 2) + np.array([6, 6])
])

# K-means clustering
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X[labels == 0, 0], X[labels == 0, 1], label='Cluster 1', alpha=0.6)
plt.scatter(X[labels == 1, 0], X[labels == 1, 1], label='Cluster 2', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           marker='X', s=300, color='red', label='Centroids')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('K-Means Clustering')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# Elbow method
inertias = []
silhouette_scores = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(alpha=0.3)
plt.show()

8.3.5.2 4.5.2 Hierarchical Clustering

Types: - Agglomerative: Bottom-up, start with individual points - Divisive: Top-down, start with all points

Linkage Methods: Single, Complete, Average, Ward

R Implementation:

# Sample data
set.seed(42)
data <- tibble(
  x = c(rnorm(20, 2, 0.5), rnorm(20, 6, 0.5)),
  y = c(rnorm(20, 2, 0.5), rnorm(20, 6, 0.5))
)

# Hierarchical clustering
dist_matrix <- dist(data)
hc_model <- hclust(dist_matrix, method = "ward.D2")

# Dendrogram
plot(hc_model, main = "Hierarchical Clustering Dendrogram")
abline(h = 3, col = "red", lty = 2)

# Cut dendrogram to get clusters
clusters <- cutree(hc_model, k = 2)
data_clustered <- data %>%
  mutate(cluster = factor(clusters))

# Visualize
ggplot(data_clustered, aes(x = x, y = y, color = cluster)) +
  geom_point(size = 3) +
  labs(title = "Hierarchical Clustering") +
  theme_minimal()

Python Implementation:

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist
import numpy as np
import matplotlib.pyplot as plt

# Sample data
np.random.seed(42)
X = np.vstack([
    np.random.randn(20, 2) + np.array([2, 2]),
    np.random.randn(20, 2) + np.array([6, 6])
])

# Hierarchical clustering
linkage_matrix = linkage(X, method='ward')

# Dendrogram
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.axhline(y=3, color='r', linestyle='--')
plt.show()

# Cut dendrogram
clusters = fcluster(linkage_matrix, t=2, criterion='maxclust')

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X[clusters == 1, 0], X[clusters == 1, 1], label='Cluster 1', alpha=0.6)
plt.scatter(X[clusters == 2, 0], X[clusters == 2, 1], label='Cluster 2', alpha=0.6)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Hierarchical Clustering Results')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

8.3.5.3 4.5.3 Principal Component Analysis (PCA)

Goal: Reduce dimensionality while preserving variance

Steps: 1. Standardize features 2. Compute covariance matrix 3. Calculate eigenvalues and eigenvectors 4. Select top components based on variance explained

R Implementation:

# Sample high-dimensional data
set.seed(42)
data <- scale(iris[, 1:4])

# PCA
pca_model <- prcomp(data, scale = FALSE, center = FALSE)

# Variance explained
variance_explained <- (pca_model$sdev^2 / sum(pca_model$sdev^2)) * 100
cumsum_variance <- cumsum(variance_explained)

# Scree plot
plot(cumsum_variance, type = "b", 
     main = "Cumulative Variance Explained",
     xlab = "Principal Component", 
     ylab = "Cumulative Variance (%)")

# PC scores and loadings
pc_scores <- pca_model$x[, 1:2]
loadings <- pca_model$rotation[, 1:2]

# Biplot
biplot(pca_model, scale = 0)

Python Implementation:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt

# Load and standardize data
iris = load_iris()
X = StandardScaler().fit_transform(iris.data)

# PCA
pca = PCA()
pc_scores = pca.fit_transform(X)

# Variance explained
variance_explained = pca.explained_variance_ratio_
cumsum_variance = np.cumsum(variance_explained)

# Scree plot
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.bar(range(1, len(variance_explained) + 1), variance_explained)
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Scree Plot')

plt.subplot(1, 2, 2)
plt.plot(range(1, len(cumsum_variance) + 1), cumsum_variance, 'bo-')
plt.xlabel('Principal Component')
plt.ylabel('Cumulative Variance Explained')
plt.title('Cumulative Variance Explained')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Biplot
plt.figure(figsize=(10, 8))
plt.scatter(pc_scores[:, 0], pc_scores[:, 1], c=iris.target, cmap='viridis', alpha=0.6)
for i, feature in enumerate(iris.feature_names):
    plt.arrow(0, 0, pca.components_[0, i]*3, pca.components_[1, i]*3,
             head_width=0.1, head_length=0.1, fc='red', ec='red')
    plt.text(pca.components_[0, i]*3.2, pca.components_[1, i]*3.2, feature)
plt.xlabel(f'PC1 ({variance_explained[0]:.1%})')
plt.ylabel(f'PC2 ({variance_explained[1]:.1%})')
plt.title('PCA Biplot')
plt.grid(alpha=0.3)
plt.show()

8.3.6 4.6 Model Evaluation and Selection

8.3.6.1 4.6.1 Performance Metrics

Classification Metrics:

Confusion Matrix: TP, TN, FP, FN
Accuracy: \(\frac{TP + TN}{TP + TN + FP + FN}\)
Precision: \(\frac{TP}{TP + FP}\) (relevance)
Recall/Sensitivity: \(\frac{TP}{TP + FN}\) (coverage)
F1-Score: \(2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
AUC-ROC: Area under ROC curve

Regression Metrics:

Mean Squared Error (MSE): \(\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\)
Root Mean Squared Error (RMSE): \(\sqrt{\text{MSE}}\)
Mean Absolute Error (MAE): \(\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\)
R-squared: \(1 - \frac{SS_{res}}{SS_{tot}}\)

8.3.6.2 4.6.2 Cross-Validation

Purpose: Estimate model performance on unseen data

Techniques: - k-Fold CV: Divide into k folds, train on k-1, test on 1 - Stratified CV: Maintain class distribution - Time Series CV: Respect temporal ordering

R Implementation:

library(caret)

# Define cross-validation strategy
ctrl <- trainControl(method = "cv", number = 5)

# Train model with cross-validation
model <- train(Species ~ ., 
               data = iris,
               method = "rf",
               trControl = ctrl)

# Model performance
model$results

##   mtry Accuracy Kappa AccuracySD    KappaSD
## 1    2     0.96  0.94 0.04346135 0.06519202
## 2    3     0.96  0.94 0.04346135 0.06519202
## 3    4     0.96  0.94 0.04346135 0.06519202

model$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 5.33%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         50          0         0        0.00
## versicolor      0         47         3        0.06
## virginica       0          5        45        0.10

Python Implementation:

from sklearn.model_selection import cross_val_score, cross_validate, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Model
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Single metric
cv_scores = cross_val_score(rf, X, y, cv=kfold, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")

# Multiple metrics
scoring = {'accuracy': 'accuracy',
           'precision': 'precision_weighted',
           'recall': 'recall_weighted',
           'f1': 'f1_weighted'}
cv_results = cross_validate(rf, X, y, cv=kfold, scoring=scoring)

for metric, scores in cv_results.items():
    if metric.startswith('test_'):
        print(f"{metric}: {scores.mean():.4f} +/- {scores.std():.4f}")

8.3.6.3 4.6.3 Hyperparameter Tuning

Grid Search: Exhaustive search over parameter grid

R Implementation:

library(caret)

# Define parameter grid
param_grid <- expand.grid(
  mtry = c(2, 3, 4),
  splitrule = "gini",
  min.node.size = c(1, 5, 10)
)

# Grid search with cross-validation
ctrl <- trainControl(method = "cv", number = 5)
model <- train(Species ~ ., 
               data = iris,
               method = "ranger",
               trControl = ctrl,
               tuneGrid = param_grid)

# Best model
print(model)

## Random Forest 
## 
## 150 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 120, 120, 120, 120, 120 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  Accuracy   Kappa
##   2      1             0.9600000  0.94 
##   2      5             0.9600000  0.94 
##   2     10             0.9666667  0.95 
##   3      1             0.9666667  0.95 
##   3      5             0.9666667  0.95 
##   3     10             0.9666667  0.95 
##   4      1             0.9666667  0.95 
##   4      5             0.9666667  0.95 
##   4     10             0.9666667  0.95 
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
##  and min.node.size = 10.

plot(model)

Python Implementation:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

# Grid search
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)

# Results
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

# Best model
best_model = grid_search.best_estimator_

8.4 MODULE 5: ADVANCED TOPICS AND APPLICATIONS

8.4.1 5.1 Time Series Analysis Basics

Key Concepts: - Temporal Dependency: Past values influence future values - Stationarity: Mean and variance constant over time - Autocorrelation: Correlation with own lags - Seasonality: Regular patterns repeating at fixed intervals

8.4.1.1 5.1.1 Exploratory Time Series Analysis

R Implementation:

library(tidyverse)
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(tseries)

# Create time series
ts_data <- ts(c(100, 102, 101, 103, 105, 104, 106, 108, 107, 110), 
              frequency = 4, start = c(2020, 1))

# Plot time series
plot(ts_data, main = "Time Series Plot")

# Decomposition
decomposed <- decompose(ts_data)
plot(decomposed)

# ACF and PACF
acf(ts_data)

pacf(ts_data)

# Stationarity test (ADF)
adf.test(ts_data)

## Warning in adf.test(ts_data): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_data
## Dickey-Fuller = -9.1652, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary

Python Implementation:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller

# Create time series
dates = pd.date_range('2020-01-01', periods=40, freq='Q')
values = [100, 102, 101, 103, 105, 104, 106, 108, 107, 110,
          115, 113, 117, 120, 118, 122, 125, 123, 127, 130,
          135, 133, 137, 140, 138, 142, 145, 143, 147, 150,
          155, 153, 157, 160, 158, 162, 165, 163, 167, 170]
ts_data = pd.Series(values, index=dates)

# Plot time series
plt.figure(figsize=(12, 4))
plt.plot(ts_data)
plt.title('Time Series')
plt.ylabel('Value')
plt.grid(alpha=0.3)
plt.show()

# Decomposition
decomposition = seasonal_decompose(ts_data, model='additive', period=4)
decomposition.plot()
plt.show()

# ACF and PACF
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(ts_data, lags=20, ax=axes[0])
plot_pacf(ts_data, lags=20, ax=axes[1])
plt.tight_layout()
plt.show()

# Stationarity test (ADF)
adf_result = adfuller(ts_data)
print(f"ADF Statistic: {adf_result[0]:.6f}")
print(f"p-value: {adf_result[1]:.6f}")

8.4.1.2 5.1.2 ARIMA Models

ARIMA(p, d, q): - p: AR order (autoregressive) - d: Differencing order - q: MA order (moving average)

R Implementation:

library(forecast)

# Auto ARIMA
auto_model <- auto.arima(ts_data)
summary(auto_model)

# Manual ARIMA
manual_model <- arima(ts_data, order = c(1, 1, 1))

# Forecasting
forecasts <- forecast(auto_model, h = 8)
plot(forecasts, main = "ARIMA Forecast")

# Model diagnostics
checkresiduals(auto_model)

Python Implementation:

from statsmodels.tsa.arima.model import ARIMA
from pmdarima import auto_arima
import matplotlib.pyplot as plt

# Auto ARIMA
auto_model = auto_arima(ts_data, seasonal=False, stepwise=True)
print(auto_model.summary())

# Manual ARIMA
arima_model = ARIMA(ts_data, order=(1, 1, 1))
fitted_model = arima_model.fit()
print(fitted_model.summary())

# Forecasting
forecast_result = fitted_model.get_forecast(steps=8)
forecast_df = forecast_result.conf_int(alpha=0.05)
forecast_mean = forecast_result.predicted_mean

# Plot
plt.figure(figsize=(12, 6))
plt.plot(ts_data, label='Observed')
plt.plot(forecast_mean.index, forecast_mean, label='Forecast', color='red')
plt.fill_between(forecast_df.index,
                 forecast_df.iloc[:, 0],
                 forecast_df.iloc[:, 1], alpha=0.3)
plt.legend()
plt.title('ARIMA Forecast')
plt.grid(alpha=0.3)
plt.show()

# Residual diagnostics
fitted_model.plot_diagnostics(figsize=(12, 8))
plt.tight_layout()
plt.show()

8.4.2 5.2 Introduction to Text Mining and NLP

Common Tasks: - Text preprocessing - Sentiment analysis - Topic modeling - Named entity recognition - Text classification

8.4.2.1 5.2.1 Text Preprocessing

R Implementation:

library(tidytext)
library(tm)

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(stringr)

# Sample text
texts <- c(
  "Machine Learning is a subset of Artificial Intelligence.",
  "Data Science involves statistics and programming.",
  "Natural Language Processing is fascinating!"
)

# Create corpus and clean
corpus <- Corpus(VectorSource(texts))
corpus <- tm_map(corpus, tolower)  # Lowercase

## Warning in tm_map.SimpleCorpus(corpus, tolower): transformation drops documents

corpus <- tm_map(corpus, removePunctuation)  # Remove punctuation

## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents

corpus <- tm_map(corpus, removeNumbers)  # Remove numbers

## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents

corpus <- tm_map(corpus, removeWords, stopwords("english"))  # Remove stopwords

## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents

corpus <- tm_map(corpus, stripWhitespace)  # Remove extra spaces

## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation drops
## documents

# Document-term matrix
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)

## <<DocumentTermMatrix (documents: 3, terms: 14)>>
## Non-/sparse entries: 14/28
## Sparsity           : 67%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs artificial data intelligence involves learning machine programming science
##    1          1    0            1        0        1       1           0       0
##    2          0    1            0        1        0       0           1       1
##    3          0    0            0        0        0       0           0       0
##     Terms
## Docs statistics subset
##    1          0      1
##    2          1      0
##    3          0      0

# Alternative with tidytext
tidy_texts <- tibble(
  id = 1:length(texts),
  text = texts
) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

## Joining with `by = join_by(word)`

Python Implementation:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
import pandas as pd

# Download required data
nltk.download('punkt')
nltk.download('stopwords')

# Sample texts
texts = [
    "Machine Learning is a subset of Artificial Intelligence.",
    "Data Science involves statistics and programming.",
    "Natural Language Processing is fascinating!"
]

def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

# Process texts
processed_texts = [preprocess_text(text) for text in texts]
print(processed_texts)

# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(texts)
print(tfidf_matrix.toarray())
print(vectorizer.get_feature_names_out())

8.4.2.2 5.2.2 Sentiment Analysis

R Implementation:

library(tidytext)

# Sentiment lexicon
sentiment_df <- tibble(
  text = c("I love this product!", 
           "This is terrible.",
           "I am neutral about this."))

# Sentiment analysis using tidytext
#sentiment_scores <- sentiment_df %>%
#  unnest_tokens(word, text) %>%
#  inner_join(get_sentiments("bing")) %>%
#  group_by(text) %>%
#  summarise(sentiment = sum(ifelse(sentiment == "positive", 1, -1)))

Python Implementation:

from textblob import TextBlob
from transformers import pipeline

# Sample texts
texts = [
    "I love this product! It's amazing.",
    "This is terrible. Worst experience ever.",
    "The product is okay, nothing special."
]

# Simple sentiment analysis using TextBlob
for text in texts:
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity
    print(f"Text: {text}")
    print(f"Polarity: {polarity:.2f}, Subjectivity: {subjectivity:.2f}\n")

# Advanced sentiment analysis with transformers
classifier = pipeline('sentiment-analysis')
for text in texts:
    result = classifier(text)
    print(f"Text: {text}")
    print(f"Sentiment: {result[0]['label']}, Score: {result[0]['score']:.4f}\n")

8.4.2.3 5.2.3 Topic Modeling

Latent Dirichlet Allocation (LDA):

R Implementation:

library(tidytext)
library(topicmodels)

# Document-term matrix (from preprocessed texts)
dtm <- DocumentTermMatrix(corpus)

# LDA model
lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))

# Topics
topics <- tidy(lda_model, matrix = "beta")
top_terms <- topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  arrange(topic, -beta)

# Document-topic distributions
doc_topics <- tidy(lda_model, matrix = "gamma")

Python Implementation:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd

# Sample documents
documents = [
    "Machine learning artificial intelligence deep learning",
    "Data science statistics programming",
    "Natural language processing text mining",
    "Computer vision image processing",
    "Big data hadoop spark"
]

# Create document-term matrix
vectorizer = CountVectorizer(stop_words='english', max_df=0.95, min_df=1)
dtm = vectorizer.fit_transform(documents)

# LDA model
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(dtm)

# Topics
feature_names = vectorizer.get_feature_names_out()
n_top_words = 5

for topic_idx, topic in enumerate(lda.components_):
    top_indices = topic.argsort()[-n_top_words:][::-1]
    top_words = [feature_names[i] for i in top_indices]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")

# Document-topic distribution
doc_topic_dist = lda.transform(dtm)
print(doc_topic_dist)

8.4.3 5.3 Ethical Considerations in Data Science

Key Issues: 1. Bias and Fairness: Models reflecting historical biases 2. Privacy: Protecting sensitive information 3. Transparency: Explainability of model decisions 4. Accountability: Responsibility for model outcomes 5. Consent: Ethical data collection 6. Disparate Impact: Differential outcomes for groups

Best Practices: - Audit models for bias across subgroups - Use differential privacy techniques - Document data provenance and quality - Implement explainability methods (SHAP, LIME) - Establish governance frameworks - Regular monitoring for model drift - Diverse team involvement in development

8.4.4 5.4 Case Studies and Real-World Applications

8.4.4.1 5.4.1 Customer Churn Prediction

Problem: Identify customers likely to leave

Data: Customer demographics, usage, support interactions

Approach: 1. EDA: Churn rate, feature distributions by churn status 2. Feature engineering: Create behavioral features 3. Model: Classification (logistic regression, random forest) 4. Evaluation: AUC-ROC, business metrics 5. Deployment: Score customer base, target interventions

8.4.4.2 5.4.2 Credit Risk Assessment

Problem: Assess probability of loan default

Data: Credit history, income, employment, loan details

Approach: 1. EDA: Default rates, feature relationships 2. Feature engineering: Ratio analysis, binning 3. Model: Logistic regression, gradient boosting 4. Evaluation: AUC, KS statistic, calibration 5. Deployment: Risk scoring, rate adjustment

8.4.4.3 5.4.3 Demand Forecasting

Problem: Predict future product demand

Data: Historical sales, seasonality, external factors

Approach: 1. EDA: Trends, patterns, outliers 2. Model: Time series (ARIMA), machine learning 3. Evaluation: RMSE, MAPE, directional accuracy 4. Deployment: Inventory planning, supply chain optimization

8.5 APPENDIX: Common Data Science Workflows

8.5.1 General Project Structure

project_name/
├── data/
│   ├── raw/
│   ├── processed/
│   └── external/
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_modeling.ipynb
├── src/
│   ├── data_loading.py
│   ├── preprocessing.py
│   ├── features.py
│   └── models.py
├── results/
│   ├── figures/
│   └── models/
├── README.md
├── requirements.txt
├── .gitignore
└── environment.yml

8.5.2 Essential Libraries

R: - tidyverse: Data manipulation and visualization - caret: Machine learning - ggplot2: Advanced visualization - rmarkdown: Report generation - shiny: Interactive apps

Python: - pandas: Data manipulation - numpy: Numerical computing - scikit-learn: Machine learning - matplotlib, seaborn: Visualization - statsmodels: Statistical modeling - jupyter: Interactive notebooks

8.6 Conclusion

This comprehensive course provides statisticians with essential tools and competencies for modern data science. Success requires practice through real datasets, collaboration with domain experts, and continuous learning as tools and techniques evolve.