DATA SCIENCE FOR STATISTICIAN
1 Data Science for Statisticians: Comprehensive Course Notes
1.1 Course Overview and Objectives
1.1.1 Course Purpose
This course empowers statisticians to leverage modern data science tools and approaches, enhancing their ability to extract meaningful insights, build predictive models, and contribute to data-driven decision-making across applied domains. The focus bridges statistical theory with practical data science implementation.
1.1.2 Expected Learning Outcomes
By course completion, students will:
- Data Management: Proficiently acquire, clean, transform, and manage diverse datasets (structured and unstructured) using appropriate programming languages and tools
- Data Science Techniques: Implement and interpret fundamental data science techniques with theoretical understanding
- Statistical Modeling: Understand theoretical underpinnings and practical applications of statistical and machine learning models
- Programming Proficiency: Develop strong programming skills in Python and R
- Critical Assessment: Evaluate model performance, understand limitations, and recognize ethical implications
- Collaboration: Work effectively in teams using professional workflows, version control, and collaborative practices
1.2 MODULE 1: INTRODUCTION TO DATA SCIENCE AND PROGRAMMING FUNDAMENTALS
1.2.1 1.1 Overview of Data Science
Definition: Data Science is an interdisciplinary field combining statistics, computer science, domain expertise, and mathematics to extract actionable insights from data and drive decision-making.
The Data Science Workflow: 1. Problem Definition and Understanding 2. Data Collection and Acquisition 3. Data Cleaning and Preparation 4. Exploratory Data Analysis (EDA) 5. Feature Engineering 6. Model Development and Training 7. Model Evaluation and Validation 8. Deployment and Monitoring 9. Communication and Visualization
Key Distinction from Statistics: While classical statistics emphasizes inference and hypothesis testing under specific distributional assumptions, data science emphasizes prediction, pattern discovery, and handling of complex, high-dimensional data with minimal assumptions.
1.2.2 1.2 Introduction to Programming for Data Science
1.2.2.1 1.2.1 Why R and Python?
R: Statistical computing environment with extensive packages for statistical analysis, visualization, and modeling. - Advantages: Domain-specific for statistics, ggplot2 for visualization, tidyverse for data manipulation - Best for: Statistical analysis, academic research, visualization
Python: General-purpose programming language with powerful data science libraries. - Advantages: Versatility, scikit-learn for ML, large community, production deployment - Best for: Machine learning, deep learning, production systems, web applications
1.2.2.2 1.2.2 Getting Started: R Fundamentals
Basic Data Types:
# Numeric
x <- 42
y <- 3.14
# Character
name <- "Statistics"
# Logical
flag <- TRUE
# Vectors (homogeneous)
vec <- c(1, 2, 3, 4, 5)
vec_char <- c("a", "b", "c")
# Lists (heterogeneous)
my_list <- list(name = "John", age = 30, scores = c(85, 90, 78))
my_list$name
## [1] "John"
# Data frames (tabular)
df <- data.frame(
id = 1:3,
name = c("Alice", "Bob", "Carol"),
score = c(95, 87, 92)
)
# Matrix
mat <- matrix(1:6, nrow = 2, ncol = 3)
Basic Operations:
# Vectorized operations
x <- c(1, 2, 3, 4, 5)
y <- x * 2 # c(2, 4, 6, 8, 10)
z <- x[x > 2] # c(3, 4, 5)
# Functions
mean_value <- mean(x)
sd_value <- sd(x)
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 2 3 3 4 5
# Custom function
calculate_cv <- function(x) {
cv <- sd(x) / mean(x) * 100
return(cv)
}
cv_result <- calculate_cv(c(10, 12, 15, 18, 20))
1.2.2.3 1.2.3 Getting Started: Python Fundamentals
Basic Data Types and Structures:
# Numeric types
x = 42
y = 3.14
# String
name = "Statistics"
# Boolean
flag = True
# Lists (mutable, heterogeneous)
lst = [1, 2, 3, "a", "b"]
# Tuples (immutable)
tup = (1, 2, 3)
# Dictionaries
my_dict = {"name": "John", "age": 30, "scores": [85, 90, 78]}
my_dict["name"]
## 'John'
# NumPy arrays
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
# Pandas DataFrame
import pandas as pd
df = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Carol'],
'score': [95, 87, 92]
})
Basic Operations:
# Array operations (vectorized)
x = np.array([1, 2, 3, 4, 5])
y = x * 2 # array([ 2, 4, 6, 8, 10])
z = x[x > 2] # array([3, 4, 5])
# Functions
mean_value = np.mean(x)
std_value = np.std(x)
# Custom function
def calculate_cv(x):
cv = np.std(x) / np.mean(x) * 100
return cv
cv_result = calculate_cv(np.array([10, 12, 15, 18, 20]))
# List comprehension
squares = [i**2 for i in range(1, 6)] # [1, 4, 9, 16, 25]
1.2.3 1.3 Integrated Development Environments (IDEs)
1.2.3.1 1.3.1 RStudio
- Four-pane interface: Editor, Console, Environment/History, Files/Plots/Packages/Help
- Projects for organized workflows
- R Markdown for dynamic reports combining code and narrative
- Package management integrated
1.2.3.2 1.3.2 Jupyter Notebooks
- Interactive computation environment
- Browser-based interface
- Mix markdown and code cells
- Supports multiple kernels (Python, R, Julia)
- Ideal for exploratory analysis and sharing
1.2.3.3 1.3.3 VS Code
- Lightweight, extensible editor
- Extensions for Python, R, Jupyter
- Integrated terminal
- Git integration
- Good for production code development
1.2.4 1.4 Version Control with Git and GitHub
Why Version Control? - Track changes over time - Collaborate with team members - Revert to previous versions - Manage multiple branches
Basic Git Workflow:
2 Initialize repository
git init git clone https://github.com/username/repo.git
3 Make changes and stage them
git add file.py git add . # Add all changes
4 Commit changes
git commit -m “Add feature: data cleaning module”
5 Push to remote
git push origin main
6 Pull latest changes
git pull origin main
7 Create and switch branches
git checkout -b feature/new-analysis git checkout main
8 Merge branches
git merge feature/new-analysis
GitHub Best Practices:
- Meaningful commit messages
- Frequent, atomic commits
- Pull requests for code review
.gitignorefor sensitive/large files- README documentation
8.1 MODULE 2: DATA ACQUISITION, CLEANING, AND MANAGEMENT
8.1.1 2.1 Data Sources and Types
Data Sources: - Internal databases and data warehouses - APIs (Application Programming Interfaces) - Web scraping - Sensors and IoT devices - Surveys and questionnaires - Social media and web platforms - Open data repositories
Data Types: - Structured Data: Relational databases, CSV, Excel, with predefined schema - Semi-structured Data: JSON, XML, logs with partial schema - Unstructured Data: Text, images, audio, video with no predefined structure
8.1.2 2.2 Data Import and Export
8.1.2.1 2.2.1 R: Data Import and Export
# Reading CSV
df <- read.csv("employees.csv")
df_csv <- readr::read_csv("employees.csv") # Faster, more robust
## Rows: 200 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, department, education_level
## dbl (5): emp_id, age, experience, salary, performance_score
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Reading Excel
library(readxl)
#df_excel <- read_excel("data.xlsx", sheet = 1)
# Reading from database
#library(DBI)
#library(RSQLite)
#conn <- dbConnect(SQLite(), "database.db")
#df_db <- dbGetQuery(conn, "SELECT * FROM table_name")
#dbDisconnect(conn)
# Reading JSON
library(jsonlite)
#df_json <- fromJSON("data.json")
# Writing data
#write.csv(df, "output.csv", row.names = FALSE)
#readr::write_csv(df, "output.csv")
#write_excel_csv(df, "output.csv")
8.1.2.2 2.2.2 Python: Data Import and Export
import pandas as pd
import sqlite3
import json
# Reading CSV
df = pd.read_csv('employees.csv')
df = pd.read_csv('employees.csv', sep=';', encoding='utf-8')
# Reading Excel
#df = pd.read_excel('data.xlsx', sheet_name=0)
# Reading from database
#conn = sqlite3.connect('database.db')
#df = pd.read_sql_query('SELECT * FROM table_name', conn)
#conn.close()
# Reading JSON
#df = pd.read_json('data.json')
#df = pd.read_json('data.jsonl', lines=True)
# Writing data
#df.to_csv('output.csv', index=False)
#df.to_excel('output.xlsx', index=False)
#df.to_sql('table_name', conn, if_exists='replace')
8.1.3 2.3 Data Cleaning and Preprocessing
Common Data Quality Issues: - Missing values (MCAR, MAR, MNAR) - Duplicates - Outliers - Inconsistent formatting - Type mismatches - Categorical encoding issues
8.1.3.1 2.3.1 R: Data Cleaning
# Load tidyverse for data manipulation
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Sample dataset
data <- tibble(
id = c(1, 2, 2, 4, 5),
name = c("John", "Jane", "Jane", NA, "Mike"),
age = c(25, 30, 30, 35, "40"), # Type inconsistency
salary = c(50000, NA, 60000, 75000, 45000),
department = c("IT", "HR", "HR", "IT", "IT"))
# Check for missing values
sum(is.na(data))
## [1] 2
colSums(is.na(data))
## id name age salary department
## 0 1 0 1 0
naniar::vis_miss(data) # Visualize missing patterns
# Identify duplicates
duplicate_rows <- data[duplicated(data),]
data_no_dup <- distinct(data)
# Remove rows with any NA
data_clean <- na.omit(data)
# Handle missing values - multiple strategies
# Strategy 1: Remove rows with missing values in key columns
data_clean <- data %>% filter(!is.na(name), !is.na(salary))
# Strategy 2: Impute with mean/median
data_clean <- data %>%
mutate(salary = ifelse(is.na(salary), median(salary, na.rm = TRUE), salary))
# Strategy 3: Impute with forward fill (time series)
data_clean <- data %>%
fill(department, .direction = "down")
# Fix type mismatches
data_clean <- data %>%
mutate(age = as.numeric(age))
# Remove duplicates on specific columns
data_clean <- data %>% distinct(id, name, .keep_all = TRUE)
# Standardize text
data_clean <- data %>%
mutate(
name = str_trim(name), # Remove leading/trailing spaces
name = str_to_title(name), # Title case
department = str_to_upper(department) # Upper case
)
# Handle outliers (IQR method)
detect_outliers <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
return(x < lower_bound | x > upper_bound)
}
data_clean <- data %>%
mutate(outlier = detect_outliers(salary))
8.1.3.2 2.3.2 Python: Data Cleaning
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Sample dataset
data = pd.DataFrame({
'id': [1, 2, 2, 4, 5],
'name': ['John', 'Jane', 'Jane', None, 'Mike'],
'age': [25, 30, 30, 35, '40'], # Type inconsistency
'salary': [50000, np.nan, 60000, 75000, 45000],
'department': ['IT', 'HR', 'HR', 'IT', 'IT']
})
# Check for missing values
data.isnull().sum()
data.isnull().sum() / len(data) * 100 # Percentage
# Visualize missing patterns
import missingno as msno
msno.matrix(data)
# Identify and remove duplicates
duplicate_mask = data.duplicated()
data_no_dup = data.drop_duplicates()
data_no_dup = data.drop_duplicates(subset=['id', 'name'])
# Remove rows with any missing values
data_clean = data.dropna()
# Remove rows with missing values in specific columns
data_clean = data.dropna(subset=['name', 'salary'])
# Impute missing values
# Strategy 1: Mean imputation
data_clean = data.copy()
data_clean['salary'].fillna(data_clean['salary'].mean(), inplace=True)
# Strategy 2: Median imputation
data_clean['salary'] = data['salary'].fillna(data['salary'].median())
# Strategy 3: Forward fill
data_clean['department'] = data['department'].fillna(method='ffill')
# Strategy 4: Using interpolation
data_clean['salary'] = data['salary'].interpolate()
# Fix type mismatches
data_clean['age'] = pd.to_numeric(data['age'], errors='coerce')
# Standardize text
data_clean['name'] = data_clean['name'].str.strip().str.title()
data_clean['department'] = data_clean['department'].str.upper()
# Handle outliers (IQR method)
def detect_outliers_iqr(x):
Q1 = x.quantile(0.25)
Q3 = x.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
return (x < lower_bound) | (x > upper_bound)
outliers = detect_outliers_iqr(data_clean['salary'])
data_no_outliers = data_clean[~outliers]
8.1.4 2.4 Data Manipulation and Transformation
8.1.4.1 2.4.1 R: Data Manipulation with dplyr
library(dplyr)
library(tidyr)
# Sample data
employees <- tibble(
emp_id = 1:5,
name = c("Alice", "Bob", "Carol", "David", "Eve"),
department = c("IT", "HR", "IT", "Finance", "HR"),
salary = c(80000, 60000, 85000, 70000, 65000),
years = c(3, 5, 2, 7, 4)
)
# Select columns
selected <- employees %>% select(name, salary, department)
selected <- employees %>% select(starts_with("e")) # Select by pattern
# Filter rows
it_dept <- employees %>% filter(department == "IT")
high_earners <- employees %>% filter(salary > 70000)
complex_filter <- employees %>%
filter(department %in% c("IT", "HR") & years > 2)
# Mutate - create/modify columns
employees_enhanced <- employees %>%
mutate(
salary_thousands = salary / 1000,
seniority = if_else(years >= 5, "Senior", "Junior"),
annual_increase = salary * 0.05
)
# Group by and summarize
summary_stats <- employees %>%
group_by(department) %>%
summarise(
count = n(),
avg_salary = mean(salary),
median_salary = median(salary),
total_years = sum(years),
.groups = 'drop'
)
# Arrange (sort)
sorted <- employees %>% arrange(salary) # Ascending
sorted <- employees %>% arrange(desc(salary)) # Descending
# Rename columns
renamed <- employees %>% rename(employee_name = name)
# Pivot longer (wide to long)
wide_data <- tibble(
id = 1:3,
Q1_sales = c(1000, 1500, 2000),
Q2_sales = c(1200, 1600, 2100)
)
long_data <- wide_data %>%
pivot_longer(cols = starts_with("Q"),
names_to = "quarter",
values_to = "sales")
# Pivot wider (long to wide)
sales_data <- tibble(
product = c("A", "A", "B", "B"),
quarter = c("Q1", "Q2", "Q1", "Q2"),
revenue = c(1000, 1200, 1500, 1600)
)
wide_sales <- sales_data %>%
pivot_wider(names_from = quarter,
values_from = revenue)
# Join operations
departments <- tibble(
dept_id = 1:3,
dept_name = c("IT", "HR", "Finance")
)
employees_expanded <- employees %>%
mutate(dept_id = match(department, c("IT", "HR", "Finance"))) %>%
left_join(departments, by = "dept_id")
# Chain operations
result <- employees %>%
filter(salary > 60000) %>%
group_by(department) %>%
summarise(avg_salary = mean(salary), count = n()) %>%
arrange(desc(avg_salary))
8.1.4.2 2.4.2 Python: Data Manipulation with Pandas
import pandas as pd
import numpy as np
# Sample data
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
'department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
'salary': [80000, 60000, 85000, 70000, 65000],
'years': [3, 5, 2, 7, 4]
})
# Select columns
selected = employees[['name', 'salary', 'department']]
selected = employees.loc[:, ['name', 'salary']]
selected = employees.filter(like='name') # Filter by pattern
# Filter rows
it_dept = employees[employees['department'] == 'IT']
high_earners = employees[employees['salary'] > 70000]
complex_filter = employees[(employees['department'].isin(['IT', 'HR'])) &
(employees['years'] > 2)]
# Create/modify columns (assign)
employees_enhanced = employees.assign(
salary_thousands=employees['salary'] / 1000,
seniority=employees['years'].apply(
lambda x: 'Senior' if x >= 5 else 'Junior'
),
annual_increase=employees['salary'] * 0.05
)
# Using .loc for direct assignment
employees_enhanced = employees.copy()
employees_enhanced['seniority'] = np.where(
employees['years'] >= 5, 'Senior', 'Junior'
)
# Group by and aggregate
summary_stats = employees.groupby('department').agg({
'emp_id': 'count',
'salary': ['mean', 'median', 'min', 'max'],
'years': 'sum'
})
# Rename aggregation columns
summary_stats = employees.groupby('department').agg(
count=('emp_id', 'count'),
avg_salary=('salary', 'mean'),
median_salary=('salary', 'median'),
total_years=('years', 'sum')
).reset_index()
# Sort (sort_values)
sorted_asc = employees.sort_values('salary')
sorted_desc = employees.sort_values('salary', ascending=False)
# Rename columns
renamed = employees.rename(columns={'name': 'employee_name'})
# Pivot long to wide (melt vs pivot_table)
wide_data = pd.DataFrame({
'id': [1, 2, 3],
'Q1_sales': [1000, 1500, 2000],
'Q2_sales': [1200, 1600, 2100]
})
long_data = wide_data.melt(id_vars='id',
var_name='quarter',
value_name='sales')
# Pivot wide to long
sales_data = pd.DataFrame({
'product': ['A', 'A', 'B', 'B'],
'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
'revenue': [1000, 1200, 1500, 1600]
})
wide_sales = sales_data.pivot_table(index='product',
columns='quarter',
values='revenue',
aggfunc='sum')
# Join operations
departments = pd.DataFrame({
'dept_id': [1, 2, 3],
'dept_name': ['IT', 'HR', 'Finance']
})
employees['dept_id'] = employees['department'].map(
{'IT': 1, 'HR': 2, 'Finance': 3}
)
employees_expanded = employees.merge(departments, on='dept_id')
# Chain operations (method chaining)
result = (employees
.query('salary > 60000')
.groupby('department')
.agg(avg_salary=('salary', 'mean'), count=('emp_id', 'count'))
.reset_index()
.sort_values('avg_salary', ascending=False)
)
8.2 MODULE 3: EXPLORATORY DATA ANALYSIS AND VISUALIZATION
8.2.1 3.1 Descriptive Statistics
Key Metrics: - Central Tendency: Mean, Median, Mode - Dispersion: Range, Variance, Standard Deviation, IQR - Distribution Shape: Skewness, Kurtosis - Relationships: Correlation, Covariance
8.2.1.1 3.1.1 R: Descriptive Statistics
library(dplyr)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# Sample data
data <- tibble(
age = c(25, 28, 30, 35, 42, 45, 50),
income = c(45000, 52000, 58000, 65000, 75000, 82000, 95000),
experience = c(2, 3, 4, 8, 12, 15, 20)
)
# Basic summary statistics
summary(data)
## age income experience
## Min. :25.00 Min. :45000 Min. : 2.000
## 1st Qu.:29.00 1st Qu.:55000 1st Qu.: 3.500
## Median :35.00 Median :65000 Median : 8.000
## Mean :36.43 Mean :67429 Mean : 9.143
## 3rd Qu.:43.50 3rd Qu.:78500 3rd Qu.:13.500
## Max. :50.00 Max. :95000 Max. :20.000
# Detailed statistics
describe(data)
## vars n mean sd median trimmed mad min max range
## age 1 7 36.43 9.43 35 36.43 10.38 25 50 25
## income 2 7 67428.57 17633.84 65000 67428.57 19273.80 45000 95000 50000
## experience 3 7 9.14 6.79 8 9.14 7.41 2 20 18
## skew kurtosis se
## age 0.16 -1.84 3.56
## income 0.23 -1.61 6664.97
## experience 0.35 -1.67 2.57
# Custom summary function
data %>%
summarise(
across(everything(),
list(
mean = ~mean(., na.rm = TRUE),
median = ~median(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
min = ~min(., na.rm = TRUE),
max = ~max(., na.rm = TRUE),
q25 = ~quantile(., 0.25, na.rm = TRUE),
q75 = ~quantile(., 0.75, na.rm = TRUE)
)
)
)
## # A tibble: 1 × 21
## age_mean age_median age_sd age_min age_max age_q25 age_q75 income_mean
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 36.4 35 9.43 25 50 29 43.5 67429.
## # ℹ 13 more variables: income_median <dbl>, income_sd <dbl>, income_min <dbl>,
## # income_max <dbl>, income_q25 <dbl>, income_q75 <dbl>,
## # experience_mean <dbl>, experience_median <dbl>, experience_sd <dbl>,
## # experience_min <dbl>, experience_max <dbl>, experience_q25 <dbl>,
## # experience_q75 <dbl>
# Correlation matrix
cor_matrix <- cor(data)
# Correlation with significance tests
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
corr_test <- rcorr(as.matrix(data))
# Skewness and Kurtosis
#skewness(data$age)
#kurtosis(data$age)
# Group-wise summaries
data %>%
mutate(age_group = cut(age, breaks = c(20, 35, 50, 100))) %>%
group_by(age_group) %>%
summarise(
n = n(),
mean_income = mean(income),
sd_income = sd(income)
)
## # A tibble: 2 × 4
## age_group n mean_income sd_income
## <fct> <int> <dbl> <dbl>
## 1 (20,35] 4 55000 8524.
## 2 (35,50] 3 84000 10149.
8.2.1.2 3.1.2 Python: Descriptive Statistics
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
# Sample data
data = pd.DataFrame({
'age': [25, 28, 30, 35, 42, 45, 50],
'income': [45000, 52000, 58000, 65000, 75000, 82000, 95000],
'experience': [2, 3, 4, 8, 12, 15, 20]
})
# Basic summary statistics
data.describe()
# Detailed statistics
data.describe(percentiles=[0.25, 0.5, 0.75])
# Individual statistics
data['age'].mean()
data['age'].median()
data['age'].std()
data[['age', 'income']].mean()
# Custom summary
data.agg({
'age': ['mean', 'median', 'std', 'min', 'max'],
'income': ['mean', 'median', 'std', 'min', 'max']
})
# Correlation matrix
corr_matrix = data.corr()
# Correlation with p-values
from scipy.stats import pearsonr
corr, p_value = pearsonr(data['age'], data['income'])
# Skewness and Kurtosis
stats.skew(data['age'])
stats.kurtosis(data['age'])
# Group-wise summaries
data['age_group'] = pd.cut(data['age'], bins=[20, 35, 50, 100])
data.groupby('age_group').agg({
'income': ['mean', 'std', 'count']
})
# Value counts
data['age_group'].value_counts()
8.2.2 3.2 Data Visualization Principles
Key Principles: 1. Clarity: Choose appropriate chart types for data structure 2. Simplicity: Remove clutter; emphasize important patterns 3. Color: Use purposefully; ensure accessibility 4. Context: Include titles, labels, legends, and references 5. Accuracy: Don’t distort data through inappropriate scaling
Chart Type Selection: - Univariate Continuous: Histogram, Density Plot, Box Plot - Univariate Categorical: Bar Chart, Pie Chart - Bivariate: Scatter Plot, Line Plot, Box Plot - Multivariate: Faceted plots, Color/Size encoding, Parallel coordinates
8.2.3 3.3 Static Visualizations
8.2.3.1 3.3.1 R: Static Visualizations with ggplot2
library(ggplot2)
library(dplyr)
# Sample dataset
data <- tibble(
department = rep(c("IT", "HR", "Finance"), each = 10),
salary = c(rnorm(10, 80000, 5000),
rnorm(10, 60000, 3000),
rnorm(10, 70000, 4000)),
experience = rep(1:10, 3)
)
# Histogram
ggplot(data, aes(x = salary)) +
geom_histogram(binwidth = 5000, fill = "steelblue", alpha = 0.7) +
labs(title = "Distribution of Salaries",
x = "Salary ($)",
y = "Frequency") +
theme_minimal()
# Density plot
ggplot(data, aes(x = salary, fill = department)) +
geom_density(alpha = 0.5) +
labs(title = "Salary Distribution by Department") +
theme_minimal()
# Box plot
ggplot(data, aes(x = department, y = salary, fill = department)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.5) +
labs(title = "Salary Distribution by Department",
x = "Department",
y = "Salary ($)") +
theme_minimal() +
theme(legend.position = "none")
# Scatter plot
ggplot(data, aes(x = experience, y = salary, color = department)) +
geom_point(size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, alpha = 0.2) +
labs(title = "Salary vs Experience by Department",
x = "Years of Experience",
y = "Salary ($)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Bar plot
dept_summary <- data %>%
group_by(department) %>%
summarise(avg_salary = mean(salary))
ggplot(dept_summary, aes(x = department, y = avg_salary, fill = department)) +
geom_col(alpha = 0.7) +
geom_text(aes(label = round(avg_salary, -3)), vjust = -0.5) +
labs(title = "Average Salary by Department",
x = "Department",
y = "Average Salary ($)") +
theme_minimal() +
theme(legend.position = "none")
# Faceted plots
ggplot(data, aes(x = experience, y = salary)) +
geom_point(color = "steelblue", alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
facet_wrap(~ department) +
labs(title = "Salary vs Experience (by Department)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Heatmap/Correlation plot
correlation_plot <- data %>%
select(salary, experience) %>%
cor() %>%
as.data.frame() %>%
rownames_to_column() %>%
pivot_longer(-rowname) %>%
ggplot(aes(x = rowname, y = name, fill = value)) +
geom_tile() +
geom_text(aes(label = round(value, 2))) +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 0, limits = c(-1, 1)) +
theme_minimal() +
theme(axis.title = element_blank())
8.2.3.2 3.3.2 Python: Static Visualizations with Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)
# Sample dataset
np.random.seed(42)
data = pd.DataFrame({
'department': np.repeat(['IT', 'HR', 'Finance'], 10),
'salary': np.concatenate([
np.random.normal(80000, 5000, 10),
np.random.normal(60000, 3000, 10),
np.random.normal(70000, 4000, 10)
]),
'experience': np.tile(np.arange(1, 11), 3)
})
# Histogram
plt.figure(figsize=(10, 6))
plt.hist(data['salary'], bins=15, color='steelblue', alpha=0.7, edgecolor='black')
plt.title('Distribution of Salaries', fontsize=14, fontweight='bold')
plt.xlabel('Salary ($)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.tight_layout()
plt.show()
# Density plot
plt.figure(figsize=(10, 6))
for dept in data['department'].unique():
data[data['department'] == dept]['salary'].plot(kind='density', label=dept, alpha=0.7)
plt.title('Salary Distribution by Department', fontsize=14, fontweight='bold')
plt.xlabel('Salary ($)', fontsize=12)
plt.legend()
plt.tight_layout()
plt.show()
# Box plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x='department', y='salary', palette='Set2')
sns.stripplot(data=data, x='department', y='salary', color='black', alpha=0.3, size=4)
plt.title('Salary Distribution by Department', fontsize=14, fontweight='bold')
plt.xlabel('Department', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.tight_layout()
plt.show()
# Scatter plot
plt.figure(figsize=(10, 6))
for dept in data['department'].unique():
dept_data = data[data['department'] == dept]
plt.scatter(dept_data['experience'], dept_data['salary'], label=dept, alpha=0.7, s=100)
plt.title('Salary vs Experience by Department', fontsize=14, fontweight='bold')
plt.xlabel('Years of Experience', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Bar plot
dept_summary = data.groupby('department')['salary'].mean()
plt.figure(figsize=(10, 6))
bars = plt.bar(dept_summary.index, dept_summary.values, color=['#1f77b4', '#ff7f0e', '#2ca02c'], alpha=0.7)
for bar in bars:
height = bar.get_height()
plt.text(bar.get_x() + bar.get_width()/2., height,
f'${height:,.0f}',
ha='center', va='bottom', fontsize=11)
plt.title('Average Salary by Department', fontsize=14, fontweight='bold')
plt.xlabel('Department', fontsize=12)
plt.ylabel('Average Salary ($)', fontsize=12)
plt.tight_layout()
plt.show()
# Faceted plots
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for idx, dept in enumerate(data['department'].unique()):
dept_data = data[data['department'] == dept]
axes[idx].scatter(dept_data['experience'], dept_data['salary'], alpha=0.7)
axes[idx].set_title(dept, fontsize=12)
axes[idx].set_xlabel('Experience (years)', fontsize=10)
axes[idx].set_ylabel('Salary ($)', fontsize=10)
axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Correlation heatmap
plt.figure(figsize=(8, 6))
corr_matrix = data[['salary', 'experience']].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
cbar_kws={'label': 'Correlation'}, square=True)
plt.title('Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
8.2.4 3.4 Interactive Visualizations
8.2.4.1 3.4.1 R: Interactive Visualizations with Plotly and Shiny
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(shiny)
##
## Attaching package: 'shiny'
## The following object is masked from 'package:jsonlite':
##
## validate
# Interactive scatter plot with Plotly
plot_ly(data = mtcars,
x = ~wt,
y = ~mpg,
type = 'scatter',
mode = 'markers',
marker = list(size = 8, color = ~hp, colorscale = 'Viridis')) %>%
layout(title = 'MPG vs Weight',
xaxis = list(title = 'Weight (1000 lbs)'),
yaxis = list(title = 'Miles per Gallon'))
# Interactive time series
library(dplyr)
time_data <- tibble(
date = seq(as.Date('2020-01-01'), as.Date('2023-12-31'), by = 'day'),
value = cumsum(rnorm(1461))
)
plot_ly(data = time_data, x = ~date, y = ~value, type = 'scatter', mode = 'lines') %>%
layout(title = 'Time Series Example',
xaxis = list(title = 'Date'),
yaxis = list(title = 'Value'))
# Shiny app example
ui <- fluidPage(
titlePanel("Data Visualization Dashboard"),
sidebarLayout(
sidebarPanel(
selectInput("dataset", "Choose dataset:",
choices = c("mtcars", "iris", "chickwts")),
selectInput("x_var", "X variable:", choices = NULL),
selectInput("y_var", "Y variable:", choices = NULL)
),
mainPanel(
plotlyOutput("plot")
)
)
)
server <- function(input, output, session) {
# Update variable choices based on dataset
observe({
data <- get(input$dataset)
updateSelectInput(session, "x_var", choices = names(data))
updateSelectInput(session, "y_var", choices = names(data))
})
output$plot <- renderPlotly({
data <- get(input$dataset)
plot_ly(data = data,
x = as.formula(paste0('~', input$x_var)),
y = as.formula(paste0('~', input$y_var)),
type = 'scatter',
mode = 'markers') %>%
layout(title = paste(input$x_var, 'vs', input$y_var))
})
}
# shinyApp(ui, server)
8.2.4.2 3.4.2 Python: Interactive Visualizations with Plotly
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
# Sample data
data = pd.DataFrame({
'x': range(1, 11),
'y': [10, 15, 13, 17, 20, 25, 22, 24, 26, 28]
})
# Interactive scatter plot
fig = px.scatter(data, x='x', y='y',
title='Interactive Scatter Plot',
labels={'x': 'X Variable', 'y': 'Y Variable'})
fig.show()
# Interactive line plot with range slider
dates = pd.date_range('2020-01-01', periods=100)
time_data = pd.DataFrame({
'date': dates,
'value': np.cumsum(np.random.randn(100))
})
fig = go.Figure()
fig.add_trace(go.Scatter(x=time_data['date'], y=time_data['value'],
mode='lines', name='Value'))
fig.update_layout(
title='Interactive Time Series',
xaxis_rangeslider_visible=False,
xaxis_rangeselector=dict(
buttons=list([
dict(count=1, label="1m", step="month", stepmode="backward"),
dict(count=3, label="3m", step="month", stepmode="backward"),
dict(step="all", label="All")
])
)
)
fig.show()
# Interactive box plot by category
iris = px.data.iris()
fig = px.box(iris, x='species', y='sepal_length',
title='Iris Sepal Length by Species')
fig.show()
# Dashboard-style visualization
from plotly.subplots import make_subplots
fig = make_subplots(rows=2, cols=2,
specs=[[{'type': 'scatter'}, {'type': 'bar'}],
[{'type': 'pie'}, {'type': 'box'}]])
# Add scatter
fig.add_trace(go.Scatter(x=[1, 2, 3], y=[4, 5, 6], name='Scatter'),
row=1, col=1)
# Add bar
fig.add_trace(go.Bar(x=['A', 'B', 'C'], y=[1, 3, 2], name='Bar'),
row=1, col=2)
# Add pie
fig.add_trace(go.Pie(labels=['A', 'B', 'C'], values=[1, 2, 3], name='Pie'),
row=2, col=1)
# Add box
fig.add_trace(go.Box(y=[1, 2, 3, 4, 5], name='Box'),
row=2, col=2)
fig.update_layout(height=800, showlegend=False, title_text="Dashboard")
fig.show()
8.2.5 3.5 Exploratory Data Analysis (EDA) Workflow
EDA Checklist: 1. Load and inspect data (dimensions, types, head/tail) 2. Check for missing values and outliers 3. Examine distributions of variables 4. Identify relationships between variables 5. Look for patterns, clusters, anomalies 6. Document findings and insights
8.2.5.1 3.5.1 R: Complete EDA Example
library(tidyverse)
library(naniar)
library(psych)
# Load data
df <- read_csv("employees.csv")
## Rows: 200 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, department, education_level
## dbl (5): emp_id, age, experience, salary, performance_score
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# 1. Basic inspection
dim(df)
## [1] 200 8
names(df)
## [1] "emp_id" "name" "department"
## [4] "age" "experience" "salary"
## [7] "education_level" "performance_score"
head(df, 10)
## # A tibble: 10 × 8
## emp_id name department age experience salary education_level
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1 Employee_1 Marketing 23 26 61972. Master
## 2 2 Employee_2 Marketing 45 2 65322. Bachelor
## 3 3 Employee_3 Sales 60 14 68593. Master
## 4 4 Employee_4 HR 47 2 83483. PhD
## 5 5 Employee_5 Finance 35 29 85457. Master
## 6 6 Employee_6 Finance 45 4 63526. Bachelor
## 7 7 Employee_7 Finance 56 17 74513. PhD
## 8 8 Employee_8 Marketing 55 3 117845. Master
## 9 9 Employee_9 Marketing 29 13 109551. Bachelor
## 10 10 Employee_10 HR 45 9 83726. PhD
## # ℹ 1 more variable: performance_score <dbl>
str(df)
## spc_tbl_ [200 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ emp_id : num [1:200] 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr [1:200] "Employee_1" "Employee_2" "Employee_3" "Employee_4" ...
## $ department : chr [1:200] "Marketing" "Marketing" "Sales" "HR" ...
## $ age : num [1:200] 23 45 60 47 35 45 56 55 29 45 ...
## $ experience : num [1:200] 26 2 14 2 29 4 17 3 13 9 ...
## $ salary : num [1:200] 61972 65322 68593 83483 85457 ...
## $ education_level : chr [1:200] "Master" "Bachelor" "Master" "PhD" ...
## $ performance_score: num [1:200] 1.93 1.32 4.11 1.49 3.38 ...
## - attr(*, "spec")=
## .. cols(
## .. emp_id = col_double(),
## .. name = col_character(),
## .. department = col_character(),
## .. age = col_double(),
## .. experience = col_double(),
## .. salary = col_double(),
## .. education_level = col_character(),
## .. performance_score = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
glimpse(df)
## Rows: 200
## Columns: 8
## $ emp_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ name <chr> "Employee_1", "Employee_2", "Employee_3", "Employee_…
## $ department <chr> "Marketing", "Marketing", "Sales", "HR", "Finance", …
## $ age <dbl> 23, 45, 60, 47, 35, 45, 56, 55, 29, 45, 38, 48, 43, …
## $ experience <dbl> 26, 2, 14, 2, 29, 4, 17, 3, 13, 9, 7, 15, 23, 29, 29…
## $ salary <dbl> 61971.64, 65322.28, 68593.05, 83483.32, 85456.71, 63…
## $ education_level <chr> "Master", "Bachelor", "Master", "PhD", "Master", "Ba…
## $ performance_score <dbl> 1.930205, 1.319862, 4.106895, 1.488140, 3.377163, 4.…
# 2. Missing values
colSums(is.na(df))
## emp_id name department age
## 0 0 0 0
## experience salary education_level performance_score
## 0 0 0 0
vis_miss(df)
# 3. Univariate analysis
#Numeric variables
df %>%
select(where(is.numeric)) %>%
describe()
## .
##
## 5 Variables 200 Observations
## --------------------------------------------------------------------------------
## emp_id
## n missing distinct Info Mean pMedian Gmd .05
## 200 0 200 1 100.5 100.5 67 10.95
## .10 .25 .50 .75 .90 .95
## 20.90 50.75 100.50 150.25 180.10 190.05
##
## lowest : 1 2 3 4 5, highest: 196 197 198 199 200
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean pMedian Gmd .05
## 200 0 42 0.999 39.12 39.5 11.17 22.00
## .10 .25 .50 .75 .90 .95
## 24.90 32.00 40.00 46.00 50.10 53.05
##
## lowest : 18 19 20 21 22, highest: 55 56 59 60 63
## --------------------------------------------------------------------------------
## experience
## n missing distinct Info Mean pMedian Gmd .05
## 200 0 29 0.998 15.24 15 9.736 2
## .10 .25 .50 .75 .90 .95
## 3 8 15 22 27 28
##
## lowest : 1 2 3 4 5, highest: 25 26 27 28 29
## --------------------------------------------------------------------------------
## salary
## n missing distinct Info Mean pMedian Gmd .05
## 200 0 198 1 76714 76819 22872 43456
## .10 .25 .50 .75 .90 .95
## 49317 62861 77096 90242 102610 108722
##
## lowest : 30000 31932.2 35036 40205.7 41140.9
## highest: 117845 120049 120819 123068 123795
## --------------------------------------------------------------------------------
## performance_score
## n missing distinct Info Mean pMedian Gmd .05
## 200 0 200 1 2.921 2.919 1.347 1.256
## .10 .25 .50 .75 .90 .95
## 1.406 1.902 2.920 3.970 4.526 4.774
##
## lowest : 1.00133 1.04171 1.05086 1.07075 1.14228
## highest: 4.90396 4.91359 4.9531 4.96635 4.96966
## --------------------------------------------------------------------------------
# Categorical variables
df %>%
select(where(is.character)) %>%
map_df(table)
## # A tibble: 3 × 208
## Employee_1 Employee_10 Employee_100 Employee_101 Employee_102 Employee_103
## <table[1d]> <table[1d]> <table[1d]> <table[1d]> <table[1d]> <table[1d]>
## 1 1 1 1 1 1 1
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## # ℹ 202 more variables: Employee_104 <table[1d]>, Employee_105 <table[1d]>,
## # Employee_106 <table[1d]>, Employee_107 <table[1d]>,
## # Employee_108 <table[1d]>, Employee_109 <table[1d]>,
## # Employee_11 <table[1d]>, Employee_110 <table[1d]>,
## # Employee_111 <table[1d]>, Employee_112 <table[1d]>,
## # Employee_113 <table[1d]>, Employee_114 <table[1d]>,
## # Employee_115 <table[1d]>, Employee_116 <table[1d]>, …
# 4. Bivariate analysis
# Numeric vs Numeric
pairs(df %>% select(where(is.numeric)))
cor(df %>% select(where(is.numeric)), use = "complete.obs")
## emp_id age experience salary
## emp_id 1.000000000 -0.005122961 -0.10436278 -0.05412891
## age -0.005122961 1.000000000 -0.06868697 -0.07345337
## experience -0.104362779 -0.068686968 1.00000000 -0.04075939
## salary -0.054128907 -0.073453371 -0.04075939 1.00000000
## performance_score 0.065627808 0.084145311 -0.06907438 -0.03085486
## performance_score
## emp_id 0.06562781
## age 0.08414531
## experience -0.06907438
## salary -0.03085486
## performance_score 1.00000000
# Categorical vs Numeric
df %>%
ggplot(aes(x = department, y = performance_score)) +
geom_boxplot() +
theme_minimal()
# 5. Multivariate analysis
df %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot::corrplot()
# 6. Outlier detection
df %>%
select(where(is.numeric)) %>%
map_df(function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
outliers <- sum(x < Q1 - 1.5*IQR | x > Q3 + 1.5*IQR, na.rm = TRUE)
return(tibble(outliers = outliers))})
## # A tibble: 5 × 1
## outliers
## <int>
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
# Generate EDA report
#library(DataExplorer)
#create_report(df)
8.2.5.2 3.5.2 Python: Complete EDA Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Load data
df = pd.read_csv('employees.csv')
# 1. Basic inspection
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.head(10))
print(df.info())
print(df.dtypes)
# 2. Missing values
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)
# Visualize missing
import missingno as msno
msno.matrix(df)
plt.show()
# 3. Univariate analysis
# Numeric variables
print(df.describe())
print(df.describe(percentiles=[0.25, 0.5, 0.75, 0.95]))
# Categorical variables
for col in df.select_dtypes(include='object').columns:
print(df[col].value_counts())
# 4. Bivariate analysis
# Correlation matrix
numeric_cols = df.select_dtypes(include=[np.number]).columns
corr_matrix = df[numeric_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()
# 5. Visualizations
# Histograms
for col in numeric_cols:
plt.figure(figsize=(10, 4))
plt.hist(df[col], bins=30, edgecolor='black')
plt.title(f'Distribution of {col}')
plt.show()
# Box plots
for col in df.select_dtypes(include='object').columns:
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x=col, y=numeric_cols[0])
plt.title(f'{numeric_cols[0]} by {col}')
plt.show()
# 6. Outlier detection
def detect_outliers_iqr(data):
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
return ((data < lower_bound) | (data > upper_bound)).sum()
outlier_counts = df[numeric_cols].apply(detect_outliers_iqr)
print("Outlier counts:\n", outlier_counts)
# 7. Statistical tests
# Normality test (Shapiro-Wilk)
for col in numeric_cols:
stat, p_value = stats.shapiro(df[col].dropna())
print(f"{col}: Shapiro-Wilk p-value = {p_value:.4f}")
# Generate comprehensive EDA report
#from ydata_profiling import ProfileReport
#profile = ProfileReport(df, title='EDA Report')
#profile.to_file('eda_report.html')
8.3 MODULE 4: STATISTICAL MODELING AND MACHINE LEARNING FUNDAMENTALS
8.3.1 4.1 Review of Linear Models
Ordinary Least Squares (OLS) Regression:
The linear regression model: \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i\]
Where: - \(y_i\) is the response variable - \(x_{ij}\) are predictor variables - \(\beta_j\) are regression coefficients - \(\epsilon_i\) is the error term (assumed \(\sim N(0, \sigma^2)\))
Key Assumptions: 1. Linearity: Relationship between predictors and response is linear 2. Independence: Observations are independent 3. Homoscedasticity: Constant variance of errors 4. Normality: Errors are normally distributed 5. No multicollinearity: Predictors are not highly correlated
8.3.1.1 4.1.1 R: Linear Regression
library(tidyverse)
library(broom)
# Sample data
data <- tibble(
x1 = c(2, 3, 4, 5, 6, 7, 8, 9),
x2 = c(1, 2, 2, 3, 4, 5, 6, 7),
y = c(5.1, 6.2, 8.0, 8.9, 11.1, 12.8, 14.5, 16.2)
)
# Fit linear model
model <- lm(y ~ x1 + x2, data = data)
# Model summary
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2, data = data)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## 0.29167 -0.29167 0.29167 -0.49167 0.02500 0.04167 0.05833 0.07500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9083 0.4827 3.954 0.0108 *
## x1 1.2167 0.3444 3.533 0.0167 *
## x2 0.4667 0.3977 1.173 0.2935
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3189 on 5 degrees of freedom
## Multiple R-squared: 0.9954, Adjusted R-squared: 0.9936
## F-statistic: 540.5 on 2 and 5 DF, p-value: 1.438e-06
# Extract coefficients
tidy(model)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.91 0.483 3.95 0.0108
## 2 x1 1.22 0.344 3.53 0.0167
## 3 x2 0.467 0.398 1.17 0.293
glance(model) # Model fit statistics
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.995 0.994 0.319 541. 0.00000144 2 -0.327 8.65 8.97
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Predictions
predictions <- predict(model, newdata = data.frame(x1 = c(5, 6), x2 = c(3, 4)))
pred_with_ci <- predict(model, newdata = data.frame(x1 = c(5, 6), x2 = c(3, 4)),
interval = "confidence", level = 0.95)
# Residuals and diagnostics
residuals <- resid(model)
fitted_values <- fitted(model)
# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
# Assumption checks
# Normality
shapiro.test(resid(model))
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.87827, p-value = 0.1813
# Homoscedasticity
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 3.0706, df = 2, p-value = 0.2154
# Multicollinearity
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(model)
## x1 x2
## 49 49
# Visualization
data %>%
mutate(predictions = predict(model)) %>%
ggplot(aes(x = x1, y = y)) +
geom_point(size = 3) +
geom_line(aes(y = predictions), color = "red") +
labs(title = "Linear Regression Fit",
x = "X1", y = "Y") +
theme_minimal()
8.3.1.2 4.1.2 Python: Linear Regression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
# Sample data
data = pd.DataFrame({
'x1': [2, 3, 4, 5, 6, 7, 8, 9],
'x2': [1, 2, 2, 3, 4, 5, 6, 7],
'y': [5.1, 6.2, 8.0, 8.9, 11.1, 12.8, 14.5, 16.2]
})
# Using scikit-learn
X = data[['x1', 'x2']]
y = data['y']
model = LinearRegression()
model.fit(X, y)
# Coefficients
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
# Predictions
y_pred = model.predict(X)
print("Predictions:", y_pred)
# Model evaluation
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)
print(f"MSE: {mse}, RMSE: {rmse}, R²: {r2}")
# Using statsmodels for detailed statistics
X_sm = sm.add_constant(X)
model_sm = sm.OLS(y, X_sm).fit()
print(model_sm.summary())
# Predictions with confidence intervals
predictions = model_sm.get_prediction(X_sm)
pred_summary = predictions.summary_frame(alpha=0.05)
print(pred_summary)
# Residual analysis
residuals = y - y_pred
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.subplot(1, 3, 2)
from scipy import stats
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.subplot(1, 3, 3)
plt.hist(residuals, bins=10, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Histogram of Residuals')
plt.tight_layout()
plt.show()
# Check assumptions
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(residuals, X_sm)
print(f"Breusch-Pagan test p-value: {bp_test[1]}")
# VIF for multicollinearity
#from statsmodels.stats.outliers_influence import variance_inflation_factor
#vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
#print("VIF values:", vif)
8.3.2 4.2 Introduction to Supervised Learning
Supervised Learning: Learning from labeled data where both input features and target outputs are known.
Classification vs Regression: - Classification: Predict categorical/discrete target (classes) - Regression: Predict continuous target (numeric values)
General Workflow: 1. Data splitting (train/test/validation) 2. Feature engineering and preprocessing 3. Model training 4. Hyperparameter tuning 5. Model evaluation 6. Prediction on new data
8.3.3 4.3 Classification Algorithms
8.3.3.1 4.3.1 Logistic Regression
Model: Predicts probability of class membership using logistic function: \[P(Y = 1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots)}}\]
R Implementation:
library(tidyverse)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Sample binary classification data
set.seed(42)
n <- 200
data <- tibble(
feature1 = rnorm(n, mean = 0, sd = 1),
feature2 = rnorm(n, mean = 0, sd = 1),
target = factor(ifelse(0.5 + 0.8 * feature1 + 0.6 * feature2 + rnorm(n, 0, 0.5) > 0,
"Class1", "Class0"))
)
# Split data
train_idx <- createDataPartition(data$target, p = 0.7, list = FALSE)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]
# Fit logistic regression
model <- glm(target ~ feature1 + feature2,
family = binomial(link = "logit"),
data = train_data)
# Predictions
pred_prob <- predict(model, newdata = test_data, type = "response")
pred_class <- ifelse(pred_prob > 0.5, "Class1", "Class0")
# Evaluation
library(caret)
confusionMatrix(factor(pred_class), test_data$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Class0 Class1
## Class0 17 7
## Class1 4 32
##
## Accuracy : 0.8167
## 95% CI : (0.6956, 0.9048)
## No Information Rate : 0.65
## P-Value [Acc > NIR] : 0.003653
##
## Kappa : 0.6099
##
## Mcnemar's Test P-Value : 0.546494
##
## Sensitivity : 0.8095
## Specificity : 0.8205
## Pos Pred Value : 0.7083
## Neg Pred Value : 0.8889
## Prevalence : 0.3500
## Detection Rate : 0.2833
## Detection Prevalence : 0.4000
## Balanced Accuracy : 0.8150
##
## 'Positive' Class : Class0
##
# ROC curve
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_obj <- roc(test_data$target, pred_prob)
## Setting levels: control = Class0, case = Class1
## Setting direction: controls < cases
plot(roc_obj, main = "ROC Curve")
print(paste("AUC:", round(auc(roc_obj), 3)))
## [1] "AUC: 0.939"
Python Implementation:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, auc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Sample binary classification data
np.random.seed(42)
n = 200
X = np.random.randn(n, 2)
y = (0.5 + 0.8 * X[:, 0] + 0.6 * X[:, 1] + np.random.randn(n) * 0.5 > 0).astype(int)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit logistic regression
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
# Evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
8.3.3.2 4.3.2 Decision Trees
Concept: Recursive partitioning of feature space through binary splits.
Advantages: Interpretable, handles non-linearity, no scaling needed Disadvantages: Prone to overfitting, unstable
R Implementation:
library(rpart)
library(rpart.plot)
library(tidyverse)
library(caret)
# Sample classification data
set.seed(42)
data <- read_csv("iris.csv") # Using iris dataset
## Rows: 150 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): species
## dbl (4): sepal_length, sepal_width, petal_length, petal_width
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Train-test split
train_idx <- createDataPartition(data$species, p = 0.7, list = FALSE)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]
# Fit decision tree
tree_model <- rpart(species ~ .,
data = train_data,
method = "class",
control = rpart.control(cp = 0.01, minsplit = 5))
# Visualize tree
rpart.plot(tree_model, main = "Classification Tree")
# Predictions
pred <- predict(tree_model, newdata = test_data, type = "class")
# Evaluation
pred <- as.factor(pred)
test_data$species <- as.factor(test_data$species)
levels(pred) <- levels(test_data$species) # Ensure same levels
confusionMatrix(pred, test_data$species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 15 0
## virginica 0 0 15
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9213, 1)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.3333
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 1.0000 1.0000
Python Implementation:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
# Load data
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
# Predictions
pred = tree.predict(X_test)
# Confusion matrix (no factor issues!)
cm = confusion_matrix(y_test, pred)
print("Confusion Matrix:\n", cm)
# Visualize
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
disp.plot()
plt.show()
8.3.3.3 4.3.3 Random Forest
Concept: Ensemble of decision trees aggregating predictions through voting/averaging.
Advantages: Reduces overfitting, handles feature importance, robust Disadvantages: Less interpretable, computationally intensive
R Implementation:
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
##
## outlier
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
# Sample data
set.seed(42)
data <- iris
# Train-test split
train_idx <- createDataPartition(data$Species, p = 0.7, list = FALSE)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]
# Fit random forest
rf_model <- randomForest(Species ~ .,
data = train_data,
ntree = 100,
mtry = 2)
# Predictions
pred <- predict(rf_model, newdata = test_data)
# Evaluation
confusionMatrix(pred, test_data$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 15 2
## virginica 0 0 13
##
## Overall Statistics
##
## Accuracy : 0.9556
## 95% CI : (0.8485, 0.9946)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9333
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.8667
## Specificity 1.0000 0.9333 1.0000
## Pos Pred Value 1.0000 0.8824 1.0000
## Neg Pred Value 1.0000 1.0000 0.9375
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.2889
## Detection Prevalence 0.3333 0.3778 0.2889
## Balanced Accuracy 1.0000 0.9667 0.9333
# Feature importance
importance(rf_model)
## MeanDecreaseGini
## Sepal.Length 7.393491
## Sepal.Width 1.979317
## Petal.Length 30.176703
## Petal.Width 29.739061
varImpPlot(rf_model, main = "Feature Importance")
Python Implementation:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, accuracy_score
import pandas as pd
import matplotlib.pyplot as plt
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit random forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Predictions
y_pred = rf_model.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
# Feature importance
feature_importance = pd.DataFrame({
'feature': iris.feature_names,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print(feature_importance)
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.xlabel('Importance')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()
8.3.3.4 4.3.4 Support Vector Machines (SVM)
Concept: Find optimal hyperplane maximizing margin between classes.
Key Parameters: Kernel type (linear, RBF, polynomial), regularization (C), gamma
R Implementation:
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
## The following object is masked from 'package:ggplot2':
##
## element
library(caret)
# Sample data
set.seed(42)
data <- iris
# Train-test split
train_idx <- createDataPartition(data$Species, p = 0.7, list = FALSE)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]
# Fit SVM with RBF kernel
svm_model <- svm(Species ~ .,
data = train_data,
kernel = "radial",
cost = 1)
# Predictions
pred <- predict(svm_model, newdata = test_data)
# Evaluation
confusionMatrix(pred, test_data$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 14 0 0
## versicolor 1 14 2
## virginica 0 1 13
##
## Overall Statistics
##
## Accuracy : 0.9111
## 95% CI : (0.7878, 0.9752)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 8.467e-16
##
## Kappa : 0.8667
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 0.9333 0.9333 0.8667
## Specificity 1.0000 0.9000 0.9667
## Pos Pred Value 1.0000 0.8235 0.9286
## Neg Pred Value 0.9677 0.9643 0.9355
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3111 0.3111 0.2889
## Detection Prevalence 0.3111 0.3778 0.3111
## Balanced Accuracy 0.9667 0.9167 0.9167
Python Implementation:
#from sklearn.svm import SVC
#from sklearn.preprocessing import StandardScaler
#from sklearn.model_selection import train_test_split
#from sklearn.datasets import load_iris
#from sklearn.metrics import confusion_matrix, accuracy_score
# Load data
#iris = load_iris()
#X, y = iris.data, iris.target
# Standardize features
#scaler = StandardScaler()
#X_scaled = scaler.fit_transform(X)
# Train-test split
#X_train, X_test, y_train, y_test = train_test_split(
# X_scaled, y, test_size=0.3, random_state=42)
# Fit SVM
#svm_model = SVC(kernel='rbf', C=1.0, gamma='scale')
#svm_model.fit(X_train, y_train)
# Predictions
#y_pred = svm_model.predict(X_test)
# Evaluation
#print("Accuracy:", accuracy_score(y_test, y_pred))
#print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
8.3.4 4.4 Regression Algorithms
8.3.4.1 4.4.1 Multiple Linear Regression
Already covered in 4.1. Key concepts: - Multiple predictors - Assumption checking important - Interpretability: \(\beta_j\) represents unit change in \(y\) per unit change in \(x_j\)
8.3.4.2 4.4.2 Ridge and Lasso Regression
Problem: Multicollinearity and overfitting in high dimensions
Ridge Regression (L2 penalty): \[\text{minimize} \quad \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p}\beta_j^2\]
Lasso Regression (L1 penalty): \[\text{minimize} \quad \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p}|\beta_j|\]
R Implementation:
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-10
library(caret)
# Sample regression data
set.seed(42)
X <- matrix(rnorm(100 * 20), 100, 20)
y <- X[, 1] * 2 + X[, 2] * 1.5 + rnorm(100)
# Train-test split
train_idx <- 1:70
X_train <- X[train_idx, ]
y_train <- y[train_idx]
X_test <- X[-train_idx, ]
y_test <- y[-train_idx]
# Ridge regression
ridge_model <- glmnet(X_train, y_train, alpha = 0, lambda = seq(0.01, 1, 0.01))
ridge_cv <- cv.glmnet(X_train, y_train, alpha = 0)
plot(ridge_cv)
ridge_pred <- predict(ridge_model, s = ridge_cv$lambda.min, newx = X_test)
# Lasso regression
lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda = seq(0.01, 1, 0.01))
lasso_cv <- cv.glmnet(X_train, y_train, alpha = 1)
plot(lasso_cv)
lasso_pred <- predict(lasso_model, s = lasso_cv$lambda.min, newx = X_test)
# Evaluation
ridge_mse <- mean((ridge_pred - y_test)^2)
lasso_mse <- mean((lasso_pred - y_test)^2)
print(paste("Ridge MSE:", ridge_mse))
## [1] "Ridge MSE: 1.50161663606881"
print(paste("Lasso MSE:", lasso_mse))
## [1] "Lasso MSE: 1.39301262183192"
Python Implementation:
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt
# Sample regression data
np.random.seed(42)
X = np.random.randn(100, 20)
y = X[:, 0] * 2 + X[:, 1] * 1.5 + np.random.randn(100)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Ridge regression with cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-2, 3, 100), cv=5)
ridge_cv.fit(X_train, y_train)
ridge_pred = ridge_cv.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_pred)
print(f"Ridge MSE: {ridge_mse:.4f}, Best alpha: {ridge_cv.alpha_:.4f}")
# Lasso regression with cross-validation
lasso_cv = LassoCV(alphas=np.logspace(-4, 0, 100), cv=5)
lasso_cv.fit(X_train, y_train)
lasso_pred = lasso_cv.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_pred)
print(f"Lasso MSE: {lasso_mse:.4f}, Best alpha: {lasso_cv.alpha_:.4f}")
# Visualize coefficients
ridge_coef = ridge_cv.coef_
lasso_coef = lasso_cv.coef_
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.bar(range(len(ridge_coef)), ridge_coef)
plt.title('Ridge Regression Coefficients')
plt.xlabel('Feature Index')
plt.subplot(1, 2, 2)
plt.bar(range(len(lasso_coef)), lasso_coef)
plt.title('Lasso Regression Coefficients')
plt.xlabel('Feature Index')
plt.tight_layout()
plt.show()
8.3.4.3 4.4.3 Random Forest for Regression
Concept: Ensemble of regression trees Output: Average predictions from multiple trees
R Implementation:
library(randomForest)
library(tibble)
# Sample regression data
set.seed(42)
data <- tibble(
x1 = runif(100, 0, 10),
x2 = runif(100, 0, 10),
y = 2 * x1 + 1.5 * x2 + rnorm(100, 0, 2))
# Train-test split
train_idx <- 1:70
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]
# Fit random forest for regression
rf_model <- randomForest(y ~ x1 + x2,
data = train_data,
ntree = 100)
# Predictions
pred <- predict(rf_model, newdata = test_data)
# Evaluation
mse <- mean((pred - test_data$y)^2)
rmse <- sqrt(mse)
r2 <- 1 - (sum((pred - test_data$y)^2) / sum((test_data$y - mean(test_data$y))^2))
print(paste("MSE:", mse, "RMSE:", rmse, "R²:", r2))
## [1] "MSE: 8.7368963712931 RMSE: 2.95582414417588 R²: 0.7744288197865"
Python Implementation:
#from sklearn.ensemble import RandomForestRegressor
#from sklearn.model_selection import train_test_split
#from sklearn.metrics import mean_squared_error, r2_score
#import numpy as np
# Sample regression data
#np.random.seed(42)
#X = np.column_stack([
# np.random.uniform(0, 10, 100),
# np.random.uniform(0, 10, 100)])
#y = 2 * X[:, 0] + 1.5 * X[:, 1] + np.random.randn(100) * 2
# Train-test split
#X_train, X_test, y_train, y_test = train_test_split(
# X, y, test_size=0.3, random_state=42)
# Fit random forest for regression
#rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
#rf_model.fit(X_train, y_train)
# Predictions
#y_pred = rf_model.predict(X_test)
# Evaluation
#mse = mean_squared_error(y_test, y_pred)
#rmse = np.sqrt(mse)
#r2 = r2_score(y_test, y_pred)
#print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}, R²: {r2:.4f}")
8.3.5 4.5 Introduction to Unsupervised Learning
Unsupervised Learning: Finding patterns in unlabeled data
Common Tasks: - Clustering: Group similar observations - Dimensionality Reduction: Reduce number of features - Anomaly Detection: Identify unusual observations
8.3.5.1 4.5.1 K-Means Clustering
Algorithm: 1. Initialize k centroids randomly 2. Assign each point to nearest centroid 3. Update centroids as mean of assigned points 4. Repeat steps 2-3 until convergence
R Implementation:
library(tidyverse)
# Sample data
set.seed(42)
data <- tibble(
x = c(rnorm(30, 2, 0.5), rnorm(30, 6, 0.5)),
y = c(rnorm(30, 2, 0.5), rnorm(30, 6, 0.5)))
# K-means clustering
kmeans_model <- kmeans(data, centers = 2, nstart = 25)
# Add cluster assignments
data_clustered <- data %>%
mutate(cluster = factor(kmeans_model$cluster))
# Visualize
ggplot(data_clustered, aes(x = x, y = y, color = cluster)) +
geom_point(size = 3) +
geom_point(data = tibble(x = kmeans_model$centers[, 1],
y = kmeans_model$centers[, 2]),
aes(color = NA), size = 5, shape = 3, color = "black") +
labs(title = "K-Means Clustering") +
theme_minimal()
# Elbow method for optimal k
withinss <- map_dbl(1:10, ~kmeans(data, centers = ., nstart = 25)$tot.withinss)
ggplot(tibble(k = 1:10, withinss = withinss), aes(x = k, y = withinss)) +
geom_line() +
geom_point() +
labs(title = "Elbow Plot", x = "Number of Clusters", y = "Within-cluster SS") +
theme_minimal()
Python Implementation:
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
# Sample data
np.random.seed(42)
X = np.vstack([
np.random.randn(30, 2) + np.array([2, 2]),
np.random.randn(30, 2) + np.array([6, 6])
])
# K-means clustering
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X[labels == 0, 0], X[labels == 0, 1], label='Cluster 1', alpha=0.6)
plt.scatter(X[labels == 1, 0], X[labels == 1, 1], label='Cluster 2', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
marker='X', s=300, color='red', label='Centroids')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('K-Means Clustering')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
# Elbow method
inertias = []
silhouette_scores = []
K_range = range(1, 11)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(alpha=0.3)
plt.show()
8.3.5.2 4.5.2 Hierarchical Clustering
Types: - Agglomerative: Bottom-up, start with individual points - Divisive: Top-down, start with all points
Linkage Methods: Single, Complete, Average, Ward
R Implementation:
# Sample data
set.seed(42)
data <- tibble(
x = c(rnorm(20, 2, 0.5), rnorm(20, 6, 0.5)),
y = c(rnorm(20, 2, 0.5), rnorm(20, 6, 0.5))
)
# Hierarchical clustering
dist_matrix <- dist(data)
hc_model <- hclust(dist_matrix, method = "ward.D2")
# Dendrogram
plot(hc_model, main = "Hierarchical Clustering Dendrogram")
abline(h = 3, col = "red", lty = 2)
# Cut dendrogram to get clusters
clusters <- cutree(hc_model, k = 2)
data_clustered <- data %>%
mutate(cluster = factor(clusters))
# Visualize
ggplot(data_clustered, aes(x = x, y = y, color = cluster)) +
geom_point(size = 3) +
labs(title = "Hierarchical Clustering") +
theme_minimal()
Python Implementation:
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist
import numpy as np
import matplotlib.pyplot as plt
# Sample data
np.random.seed(42)
X = np.vstack([
np.random.randn(20, 2) + np.array([2, 2]),
np.random.randn(20, 2) + np.array([6, 6])
])
# Hierarchical clustering
linkage_matrix = linkage(X, method='ward')
# Dendrogram
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.axhline(y=3, color='r', linestyle='--')
plt.show()
# Cut dendrogram
clusters = fcluster(linkage_matrix, t=2, criterion='maxclust')
# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X[clusters == 1, 0], X[clusters == 1, 1], label='Cluster 1', alpha=0.6)
plt.scatter(X[clusters == 2, 0], X[clusters == 2, 1], label='Cluster 2', alpha=0.6)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Hierarchical Clustering Results')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
8.3.5.3 4.5.3 Principal Component Analysis (PCA)
Goal: Reduce dimensionality while preserving variance
Steps: 1. Standardize features 2. Compute covariance matrix 3. Calculate eigenvalues and eigenvectors 4. Select top components based on variance explained
R Implementation:
# Sample high-dimensional data
set.seed(42)
data <- scale(iris[, 1:4])
# PCA
pca_model <- prcomp(data, scale = FALSE, center = FALSE)
# Variance explained
variance_explained <- (pca_model$sdev^2 / sum(pca_model$sdev^2)) * 100
cumsum_variance <- cumsum(variance_explained)
# Scree plot
plot(cumsum_variance, type = "b",
main = "Cumulative Variance Explained",
xlab = "Principal Component",
ylab = "Cumulative Variance (%)")
# PC scores and loadings
pc_scores <- pca_model$x[, 1:2]
loadings <- pca_model$rotation[, 1:2]
# Biplot
biplot(pca_model, scale = 0)
Python Implementation:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
# Load and standardize data
iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
# PCA
pca = PCA()
pc_scores = pca.fit_transform(X)
# Variance explained
variance_explained = pca.explained_variance_ratio_
cumsum_variance = np.cumsum(variance_explained)
# Scree plot
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.bar(range(1, len(variance_explained) + 1), variance_explained)
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Scree Plot')
plt.subplot(1, 2, 2)
plt.plot(range(1, len(cumsum_variance) + 1), cumsum_variance, 'bo-')
plt.xlabel('Principal Component')
plt.ylabel('Cumulative Variance Explained')
plt.title('Cumulative Variance Explained')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Biplot
plt.figure(figsize=(10, 8))
plt.scatter(pc_scores[:, 0], pc_scores[:, 1], c=iris.target, cmap='viridis', alpha=0.6)
for i, feature in enumerate(iris.feature_names):
plt.arrow(0, 0, pca.components_[0, i]*3, pca.components_[1, i]*3,
head_width=0.1, head_length=0.1, fc='red', ec='red')
plt.text(pca.components_[0, i]*3.2, pca.components_[1, i]*3.2, feature)
plt.xlabel(f'PC1 ({variance_explained[0]:.1%})')
plt.ylabel(f'PC2 ({variance_explained[1]:.1%})')
plt.title('PCA Biplot')
plt.grid(alpha=0.3)
plt.show()
8.3.6 4.6 Model Evaluation and Selection
8.3.6.1 4.6.1 Performance Metrics
Classification Metrics:
- Confusion Matrix: TP, TN, FP, FN
- Accuracy: \(\frac{TP + TN}{TP + TN + FP + FN}\)
- Precision: \(\frac{TP}{TP + FP}\) (relevance)
- Recall/Sensitivity: \(\frac{TP}{TP + FN}\) (coverage)
- F1-Score: \(2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
- AUC-ROC: Area under ROC curve
Regression Metrics:
- Mean Squared Error (MSE): \(\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\)
- Root Mean Squared Error (RMSE): \(\sqrt{\text{MSE}}\)
- Mean Absolute Error (MAE): \(\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\)
- R-squared: \(1 - \frac{SS_{res}}{SS_{tot}}\)
8.3.6.2 4.6.2 Cross-Validation
Purpose: Estimate model performance on unseen data
Techniques: - k-Fold CV: Divide into k folds, train on k-1, test on 1 - Stratified CV: Maintain class distribution - Time Series CV: Respect temporal ordering
R Implementation:
library(caret)
# Define cross-validation strategy
ctrl <- trainControl(method = "cv", number = 5)
# Train model with cross-validation
model <- train(Species ~ .,
data = iris,
method = "rf",
trControl = ctrl)
# Model performance
model$results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.96 0.94 0.04346135 0.06519202
## 2 3 0.96 0.94 0.04346135 0.06519202
## 3 4 0.96 0.94 0.04346135 0.06519202
model$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.33%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 50 0 0 0.00
## versicolor 0 47 3 0.06
## virginica 0 5 45 0.10
Python Implementation:
from sklearn.model_selection import cross_val_score, cross_validate, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# K-Fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# Model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Single metric
cv_scores = cross_val_score(rf, X, y, cv=kfold, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
# Multiple metrics
scoring = {'accuracy': 'accuracy',
'precision': 'precision_weighted',
'recall': 'recall_weighted',
'f1': 'f1_weighted'}
cv_results = cross_validate(rf, X, y, cv=kfold, scoring=scoring)
for metric, scores in cv_results.items():
if metric.startswith('test_'):
print(f"{metric}: {scores.mean():.4f} +/- {scores.std():.4f}")
8.3.6.3 4.6.3 Hyperparameter Tuning
Grid Search: Exhaustive search over parameter grid
R Implementation:
library(caret)
# Define parameter grid
param_grid <- expand.grid(
mtry = c(2, 3, 4),
splitrule = "gini",
min.node.size = c(1, 5, 10)
)
# Grid search with cross-validation
ctrl <- trainControl(method = "cv", number = 5)
model <- train(Species ~ .,
data = iris,
method = "ranger",
trControl = ctrl,
tuneGrid = param_grid)
# Best model
print(model)
## Random Forest
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 120, 120, 120, 120, 120
## Resampling results across tuning parameters:
##
## mtry min.node.size Accuracy Kappa
## 2 1 0.9600000 0.94
## 2 5 0.9600000 0.94
## 2 10 0.9666667 0.95
## 3 1 0.9666667 0.95
## 3 5 0.9666667 0.95
## 3 10 0.9666667 0.95
## 4 1 0.9666667 0.95
## 4 5 0.9666667 0.95
## 4 10 0.9666667 0.95
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
## and min.node.size = 10.
plot(model)
Python Implementation:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15],
'min_samples_split': [2, 5, 10]
}
# Grid search
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)
# Results
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
# Best model
best_model = grid_search.best_estimator_
8.4 MODULE 5: ADVANCED TOPICS AND APPLICATIONS
8.4.1 5.1 Time Series Analysis Basics
Key Concepts: - Temporal Dependency: Past values influence future values - Stationarity: Mean and variance constant over time - Autocorrelation: Correlation with own lags - Seasonality: Regular patterns repeating at fixed intervals
8.4.1.1 5.1.1 Exploratory Time Series Analysis
R Implementation:
library(tidyverse)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(tseries)
# Create time series
ts_data <- ts(c(100, 102, 101, 103, 105, 104, 106, 108, 107, 110),
frequency = 4, start = c(2020, 1))
# Plot time series
plot(ts_data, main = "Time Series Plot")
# Decomposition
decomposed <- decompose(ts_data)
plot(decomposed)
# ACF and PACF
acf(ts_data)
pacf(ts_data)
# Stationarity test (ADF)
adf.test(ts_data)
## Warning in adf.test(ts_data): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: ts_data
## Dickey-Fuller = -9.1652, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
Python Implementation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
# Create time series
dates = pd.date_range('2020-01-01', periods=40, freq='Q')
values = [100, 102, 101, 103, 105, 104, 106, 108, 107, 110,
115, 113, 117, 120, 118, 122, 125, 123, 127, 130,
135, 133, 137, 140, 138, 142, 145, 143, 147, 150,
155, 153, 157, 160, 158, 162, 165, 163, 167, 170]
ts_data = pd.Series(values, index=dates)
# Plot time series
plt.figure(figsize=(12, 4))
plt.plot(ts_data)
plt.title('Time Series')
plt.ylabel('Value')
plt.grid(alpha=0.3)
plt.show()
# Decomposition
decomposition = seasonal_decompose(ts_data, model='additive', period=4)
decomposition.plot()
plt.show()
# ACF and PACF
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(ts_data, lags=20, ax=axes[0])
plot_pacf(ts_data, lags=20, ax=axes[1])
plt.tight_layout()
plt.show()
# Stationarity test (ADF)
adf_result = adfuller(ts_data)
print(f"ADF Statistic: {adf_result[0]:.6f}")
print(f"p-value: {adf_result[1]:.6f}")
8.4.1.2 5.1.2 ARIMA Models
ARIMA(p, d, q): - p: AR order (autoregressive) - d: Differencing order - q: MA order (moving average)
R Implementation:
library(forecast)
# Auto ARIMA
auto_model <- auto.arima(ts_data)
summary(auto_model)
# Manual ARIMA
manual_model <- arima(ts_data, order = c(1, 1, 1))
# Forecasting
forecasts <- forecast(auto_model, h = 8)
plot(forecasts, main = "ARIMA Forecast")
# Model diagnostics
checkresiduals(auto_model)
Python Implementation:
from statsmodels.tsa.arima.model import ARIMA
from pmdarima import auto_arima
import matplotlib.pyplot as plt
# Auto ARIMA
auto_model = auto_arima(ts_data, seasonal=False, stepwise=True)
print(auto_model.summary())
# Manual ARIMA
arima_model = ARIMA(ts_data, order=(1, 1, 1))
fitted_model = arima_model.fit()
print(fitted_model.summary())
# Forecasting
forecast_result = fitted_model.get_forecast(steps=8)
forecast_df = forecast_result.conf_int(alpha=0.05)
forecast_mean = forecast_result.predicted_mean
# Plot
plt.figure(figsize=(12, 6))
plt.plot(ts_data, label='Observed')
plt.plot(forecast_mean.index, forecast_mean, label='Forecast', color='red')
plt.fill_between(forecast_df.index,
forecast_df.iloc[:, 0],
forecast_df.iloc[:, 1], alpha=0.3)
plt.legend()
plt.title('ARIMA Forecast')
plt.grid(alpha=0.3)
plt.show()
# Residual diagnostics
fitted_model.plot_diagnostics(figsize=(12, 8))
plt.tight_layout()
plt.show()
8.4.2 5.2 Introduction to Text Mining and NLP
Common Tasks: - Text preprocessing - Sentiment analysis - Topic modeling - Named entity recognition - Text classification
8.4.2.1 5.2.1 Text Preprocessing
R Implementation:
library(tidytext)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(stringr)
# Sample text
texts <- c(
"Machine Learning is a subset of Artificial Intelligence.",
"Data Science involves statistics and programming.",
"Natural Language Processing is fascinating!"
)
# Create corpus and clean
corpus <- Corpus(VectorSource(texts))
corpus <- tm_map(corpus, tolower) # Lowercase
## Warning in tm_map.SimpleCorpus(corpus, tolower): transformation drops documents
corpus <- tm_map(corpus, removePunctuation) # Remove punctuation
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
corpus <- tm_map(corpus, removeNumbers) # Remove numbers
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus <- tm_map(corpus, removeWords, stopwords("english")) # Remove stopwords
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents
corpus <- tm_map(corpus, stripWhitespace) # Remove extra spaces
## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation drops
## documents
# Document-term matrix
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
## <<DocumentTermMatrix (documents: 3, terms: 14)>>
## Non-/sparse entries: 14/28
## Sparsity : 67%
## Maximal term length: 12
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs artificial data intelligence involves learning machine programming science
## 1 1 0 1 0 1 1 0 0
## 2 0 1 0 1 0 0 1 1
## 3 0 0 0 0 0 0 0 0
## Terms
## Docs statistics subset
## 1 0 1
## 2 1 0
## 3 0 0
# Alternative with tidytext
tidy_texts <- tibble(
id = 1:length(texts),
text = texts
) %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
Python Implementation:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
import pandas as pd
# Download required data
nltk.download('punkt')
nltk.download('stopwords')
# Sample texts
texts = [
"Machine Learning is a subset of Artificial Intelligence.",
"Data Science involves statistics and programming.",
"Natural Language Processing is fascinating!"
]
def preprocess_text(text):
# Lowercase
text = text.lower()
# Remove punctuation
text = re.sub(r'[^\w\s]', '', text)
# Remove numbers
text = re.sub(r'\d+', '', text)
# Tokenize
tokens = word_tokenize(text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]
return tokens
# Process texts
processed_texts = [preprocess_text(text) for text in texts]
print(processed_texts)
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(texts)
print(tfidf_matrix.toarray())
print(vectorizer.get_feature_names_out())
8.4.2.2 5.2.2 Sentiment Analysis
R Implementation:
library(tidytext)
# Sentiment lexicon
sentiment_df <- tibble(
text = c("I love this product!",
"This is terrible.",
"I am neutral about this."))
# Sentiment analysis using tidytext
#sentiment_scores <- sentiment_df %>%
# unnest_tokens(word, text) %>%
# inner_join(get_sentiments("bing")) %>%
# group_by(text) %>%
# summarise(sentiment = sum(ifelse(sentiment == "positive", 1, -1)))
Python Implementation:
from textblob import TextBlob
from transformers import pipeline
# Sample texts
texts = [
"I love this product! It's amazing.",
"This is terrible. Worst experience ever.",
"The product is okay, nothing special."
]
# Simple sentiment analysis using TextBlob
for text in texts:
blob = TextBlob(text)
polarity = blob.sentiment.polarity
subjectivity = blob.sentiment.subjectivity
print(f"Text: {text}")
print(f"Polarity: {polarity:.2f}, Subjectivity: {subjectivity:.2f}\n")
# Advanced sentiment analysis with transformers
classifier = pipeline('sentiment-analysis')
for text in texts:
result = classifier(text)
print(f"Text: {text}")
print(f"Sentiment: {result[0]['label']}, Score: {result[0]['score']:.4f}\n")
8.4.2.3 5.2.3 Topic Modeling
Latent Dirichlet Allocation (LDA):
R Implementation:
library(tidytext)
library(topicmodels)
# Document-term matrix (from preprocessed texts)
dtm <- DocumentTermMatrix(corpus)
# LDA model
lda_model <- LDA(dtm, k = 2, control = list(seed = 1234))
# Topics
topics <- tidy(lda_model, matrix = "beta")
top_terms <- topics %>%
group_by(topic) %>%
top_n(5, beta) %>%
arrange(topic, -beta)
# Document-topic distributions
doc_topics <- tidy(lda_model, matrix = "gamma")
Python Implementation:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd
# Sample documents
documents = [
"Machine learning artificial intelligence deep learning",
"Data science statistics programming",
"Natural language processing text mining",
"Computer vision image processing",
"Big data hadoop spark"
]
# Create document-term matrix
vectorizer = CountVectorizer(stop_words='english', max_df=0.95, min_df=1)
dtm = vectorizer.fit_transform(documents)
# LDA model
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(dtm)
# Topics
feature_names = vectorizer.get_feature_names_out()
n_top_words = 5
for topic_idx, topic in enumerate(lda.components_):
top_indices = topic.argsort()[-n_top_words:][::-1]
top_words = [feature_names[i] for i in top_indices]
print(f"Topic {topic_idx}: {', '.join(top_words)}")
# Document-topic distribution
doc_topic_dist = lda.transform(dtm)
print(doc_topic_dist)
8.4.3 5.3 Ethical Considerations in Data Science
Key Issues: 1. Bias and Fairness: Models reflecting historical biases 2. Privacy: Protecting sensitive information 3. Transparency: Explainability of model decisions 4. Accountability: Responsibility for model outcomes 5. Consent: Ethical data collection 6. Disparate Impact: Differential outcomes for groups
Best Practices: - Audit models for bias across subgroups - Use differential privacy techniques - Document data provenance and quality - Implement explainability methods (SHAP, LIME) - Establish governance frameworks - Regular monitoring for model drift - Diverse team involvement in development
8.4.4 5.4 Case Studies and Real-World Applications
8.4.4.1 5.4.1 Customer Churn Prediction
Problem: Identify customers likely to leave
Data: Customer demographics, usage, support interactions
Approach: 1. EDA: Churn rate, feature distributions by churn status 2. Feature engineering: Create behavioral features 3. Model: Classification (logistic regression, random forest) 4. Evaluation: AUC-ROC, business metrics 5. Deployment: Score customer base, target interventions
8.4.4.2 5.4.2 Credit Risk Assessment
Problem: Assess probability of loan default
Data: Credit history, income, employment, loan details
Approach: 1. EDA: Default rates, feature relationships 2. Feature engineering: Ratio analysis, binning 3. Model: Logistic regression, gradient boosting 4. Evaluation: AUC, KS statistic, calibration 5. Deployment: Risk scoring, rate adjustment
8.4.4.3 5.4.3 Demand Forecasting
Problem: Predict future product demand
Data: Historical sales, seasonality, external factors
Approach: 1. EDA: Trends, patterns, outliers 2. Model: Time series (ARIMA), machine learning 3. Evaluation: RMSE, MAPE, directional accuracy 4. Deployment: Inventory planning, supply chain optimization
8.5 APPENDIX: Common Data Science Workflows
8.5.1 General Project Structure
project_name/
├── data/
│ ├── raw/
│ ├── processed/
│ └── external/
├── notebooks/
│ ├── 01_eda.ipynb
│ ├── 02_feature_engineering.ipynb
│ └── 03_modeling.ipynb
├── src/
│ ├── data_loading.py
│ ├── preprocessing.py
│ ├── features.py
│ └── models.py
├── results/
│ ├── figures/
│ └── models/
├── README.md
├── requirements.txt
├── .gitignore
└── environment.yml
8.5.2 Essential Libraries
R:
- tidyverse: Data manipulation and visualization
- caret: Machine learning
- ggplot2: Advanced visualization
- rmarkdown: Report generation
- shiny: Interactive apps
Python:
- pandas: Data manipulation
- numpy: Numerical computing
- scikit-learn: Machine learning
- matplotlib, seaborn: Visualization
- statsmodels: Statistical modeling
- jupyter: Interactive notebooks
8.6 Conclusion
This comprehensive course provides statisticians with essential tools and competencies for modern data science. Success requires practice through real datasets, collaboration with domain experts, and continuous learning as tools and techniques evolve.