0. Overview

A data scientist typically spends their day collecting, cleaning, analyzing large datasets to identify patterns and trends, building predictive models using machine learning algorithms, creating data visualizations to present insights, and collaborating with stakeholders to understand business needs and translate data into actionable solutions; essentially, they are problem solvers who use data to answer complex questions and inform decision-making.

Key tasks of a data scientist:

Data Acquisition and Cleaning: Gathering data from various sources, cleaning and pre-processing it to ensure quality and consistency for analysis.
Exploratory Data Analysis (EDA): Examining data to understand its characteristics, identify outliers, and visualize relationships between variables.
Feature Engineering: Creating new features or transforming existing ones to improve the performance of machine learning models.
Model Building: Selecting appropriate machine learning algorithms (like regression, classification, clustering) and training them on the data to make predictions.
Model Evaluation: Assessing the accuracy and performance of models using various metrics and techniques.
Data Visualization: Creating charts, graphs, and dashboards to effectively communicate insights from the data to stakeholders.
Reporting and Communication: Writing reports, presentations, and documenting findings in a clear and concise manner for non-technical audiences.

Tools used by data scientists:

Programming languages: Python, R
Data analysis libraries: Pandas, NumPy
Machine learning libraries: Scikit-learn, TensorFlow, PyTorch
Data visualization libraries: Matplotlib, Seaborn, Tableau Databases and SQL

The following sections will demonstrate the procedures using a dataset. We will use R software.

1. Data Acquisition and Cleaning

First, we load the dataset and inspect it for missing values and outliers.

# Load necessary libraries
library(MASS)
library(ggplot2)
library(dplyr)

# Load Boston dataset
data(Boston)

# Check the first few rows of the dataset
head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

# Check for missing values
sum(is.na(Boston))

## [1] 0

# Check for summary statistics
summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

# Removing outliers (for simplicity, we'll use Z-scores)
z_scores <- scale(Boston[, -14])  # Ignore 'medv' column (target)
outliers <- apply(z_scores, 1, function(x) any(abs(x) > 4))  # Z-scores > 4 or < -4 are considered outliers

# Clean the dataset by removing rows with outliers
Boston_cleaned <- Boston[!outliers, ]

# Check the cleaned dataset
summary(Boston_cleaned)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08100   1st Qu.:  0.00   1st Qu.: 5.13   1st Qu.:0.00000  
##  Median : 0.24980   Median :  0.00   Median : 8.56   Median :0.00000  
##  Mean   : 2.84819   Mean   : 11.52   Mean   :11.04   Mean   :0.07014  
##  3rd Qu.: 3.24233   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :37.66190   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4485   1st Qu.:5.888   1st Qu.: 44.05   1st Qu.: 2.111  
##  Median :0.5380   Median :6.211   Median : 76.70   Median : 3.272  
##  Mean   :0.5530   Mean   :6.293   Mean   : 68.18   Mean   : 3.827  
##  3rd Qu.:0.6240   3rd Qu.:6.627   3rd Qu.: 93.85   3rd Qu.: 5.215  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.15   1st Qu.:375.95  
##  Median : 5.000   Median :330.0   Median :19.00   Median :391.50  
##  Mean   : 9.347   Mean   :404.6   Mean   :18.43   Mean   :358.44  
##  3rd Qu.:16.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.22  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.60  
##  1st Qu.: 6.91   1st Qu.:17.20  
##  Median :11.28   Median :21.40  
##  Mean   :12.50   Mean   :22.73  
##  3rd Qu.:16.62   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

2. Exploratory Data Analysis (EDA)

We explore the relationships between variables and visualize the distribution of house prices.

# Visualize distribution of house prices
ggplot(Boston_cleaned, aes(x = medv)) +
  geom_histogram(binwidth = 1, fill = 'blue', color = 'black', alpha = 0.7) +
  ggtitle('Distribution of House Prices') +
  xlab('Price') + ylab('Frequency')

# Scatter plot between 'RM' (average number of rooms) and 'medv' (price)
ggplot(Boston_cleaned, aes(x = rm, y = medv)) +
  geom_point() +
  ggtitle('Price vs. Average Number of Rooms (RM)') +
  xlab('Average Rooms per Dwelling') + ylab('Price')

3. Feature Engineering

We’ll create a new feature, such as the square of LSTAT (percentage of lower status population), and scale the data.

# Create new feature
Boston_cleaned$LSTAT_squared <- Boston_cleaned$lstat^2

# Scale the features
scaled_data <- scale(Boston_cleaned[, -14])  # Scaling all features except 'medv'
scaled_data <- as.data.frame(scaled_data)

# Add target variable back to the scaled data
scaled_data$medv <- Boston_cleaned$medv

# Check the first few rows of scaled data
head(scaled_data)

##         crim         zn      indus       chas        nox        rm        age
## 1 -0.5344262  0.2762431 -1.2727974 -0.2743718 -0.1297572 0.4032830 -0.1058898
## 2 -0.5304789 -0.4914596 -0.5787377 -0.2743718 -0.7258678 0.1831287  0.3809912
## 3 -0.5304827 -0.4914596 -0.5787377 -0.2743718 -0.7258678 1.2753230 -0.2515988
## 4 -0.5295274 -0.4914596 -1.2917529 -0.2743718 -0.8208999 1.0079927 -0.7953418
## 5 -0.5226295 -0.4914596 -1.2917529 -0.2743718 -0.8208999 1.2209992 -0.4968162
## 6 -0.5300013 -0.4914596 -1.2917529 -0.2743718 -0.8208999 0.1959948 -0.3368918
##         dis        rad       tax    ptratio     black      lstat LSTAT_squared
## 1 0.1252263 -0.9709024 -0.650800 -1.4427946 0.4351862 -1.0737889    -0.7921688
## 2 0.5422368 -0.8545806 -0.974339 -0.2907937 0.4351862 -0.4796323    -0.5341294
## 3 0.5422368 -0.8545806 -0.974339 -0.2907937 0.3891331 -1.2094736    -0.8297704
## 4 1.0628937 -0.7382588 -1.094168  0.1239267 0.4095006 -1.3651541    -0.8631451
## 5 1.0628937 -0.7382588 -1.094168  0.1239267 0.4351862 -1.0237997    -0.7763168
## 6 1.0628937 -0.7382588 -1.094168  0.1239267 0.4037298 -1.0409389    -0.7818730
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

4. Model Building

Now, we’ll build a linear regression model to predict house prices (medv).

# Split the data into 80% training and 20% testing sets
set.seed(123)
train_indices <- sample(1:nrow(scaled_data), size = 0.8 * nrow(scaled_data))
train_data <- scaled_data[train_indices, ]
test_data <- scaled_data[-train_indices, ]

# Train the linear regression model
model <- lm(medv ~ ., data = train_data)

# View the model summary
summary(model)

## 
## Call:
## lm(formula = medv ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6404 -2.7509 -0.4435  1.9946 24.2461 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    22.69255    0.21553 105.290  < 2e-16 ***
## crim           -1.33826    0.40244  -3.325 0.000968 ***
## zn              0.63706    0.32699   1.948 0.052114 .  
## indus           0.01613    0.44055   0.037 0.970809    
## chas            0.80152    0.21095   3.800 0.000168 ***
## nox            -1.62298    0.45385  -3.576 0.000393 ***
## rm              2.02033    0.30477   6.629 1.15e-10 ***
## age             0.72244    0.39516   1.828 0.068291 .  
## dis            -2.72284    0.43270  -6.293 8.52e-10 ***
## rad             2.98217    0.62469   4.774 2.57e-06 ***
## tax            -1.86065    0.65400  -2.845 0.004679 ** 
## ptratio        -1.52642    0.28308  -5.392 1.22e-07 ***
## black           0.86686    0.26977   3.213 0.001423 ** 
## lstat         -12.99666    1.03659 -12.538  < 2e-16 ***
## LSTAT_squared   8.49898    0.92177   9.220  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.283 on 384 degrees of freedom
## Multiple R-squared:  0.7855, Adjusted R-squared:  0.7776 
## F-statistic: 100.4 on 14 and 384 DF,  p-value: < 2.2e-16

5. Model Evaluation

Next, we evaluate the model using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared.

# Predict on the test data
predictions <- predict(model, newdata = test_data)

# Calculate evaluation metrics
mae <- mean(abs(predictions - test_data$medv))
rmse <- sqrt(mean((predictions - test_data$medv)^2))
r2 <- 1 - sum((predictions - test_data$medv)^2) / sum((test_data$medv - mean(test_data$medv))^2)

# Print evaluation metrics
cat("Mean Absolute Error (MAE):", mae, "\n")

## Mean Absolute Error (MAE): 3.113301

cat("Root Mean Squared Error (RMSE):", rmse, "\n")

## Root Mean Squared Error (RMSE): 4.376915

cat("R-squared:", r2, "\n")

## R-squared: 0.771992

6. Data Visualization

We’ll visualize the model’s performance and the residuals.

# Plot Actual vs Predicted Prices
ggplot(data = data.frame(actual = test_data$medv, predicted = predictions), aes(x = actual, y = predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = 'red') +
  ggtitle('Actual vs Predicted House Prices') +
  xlab('Actual Prices') + ylab('Predicted Prices')

# Residual Plot
residuals <- test_data$medv - predictions
ggplot(data = data.frame(predicted = predictions, residuals = residuals), aes(x = predicted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = 'red') +
  ggtitle('Residuals Plot') +
  xlab('Predicted Prices') + ylab('Residuals')

7. Reporting and Communication

Report Summary:

Objective: Predict house prices based on features like the number of rooms, crime rate, and other variables.
Key Insights:
- The average number of rooms (RM) is strongly correlated with house prices.
- The new feature LSTAT_squared may reveal a non-linear relationship.
Model Performance:
- The model achieved an R-squared value of 0.73, meaning 73% of the variance in house prices is explained by the features.
- The Mean Absolute Error (MAE) was $3,100, meaning, on average, the model’s predictions were off by $3,100.
Recommendation: Further feature engineering or testing with more complex models (e.g., Random Forests or Gradient Boosting) may improve performance.

Guidlines for Data Mining Projects

2025-01-24