0. Overview

A data scientist typically spends their day collecting, cleaning, analyzing large datasets to identify patterns and trends, building predictive models using machine learning algorithms, creating data visualizations to present insights, and collaborating with stakeholders to understand business needs and translate data into actionable solutions; essentially, they are problem solvers who use data to answer complex questions and inform decision-making.

Key tasks of a data scientist:

Tools used by data scientists:

The following sections will demonstrate the procedures using a dataset. We will use R software.

1. Data Acquisition and Cleaning

First, we load the dataset and inspect it for missing values and outliers.

# Load necessary libraries
library(MASS)
library(ggplot2)
library(dplyr)

# Load Boston dataset
data(Boston)

# Check the first few rows of the dataset
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# Check for missing values
sum(is.na(Boston))
## [1] 0
# Check for summary statistics
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# Removing outliers (for simplicity, we'll use Z-scores)
z_scores <- scale(Boston[, -14])  # Ignore 'medv' column (target)
outliers <- apply(z_scores, 1, function(x) any(abs(x) > 4))  # Z-scores > 4 or < -4 are considered outliers

# Clean the dataset by removing rows with outliers
Boston_cleaned <- Boston[!outliers, ]

# Check the cleaned dataset
summary(Boston_cleaned)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08100   1st Qu.:  0.00   1st Qu.: 5.13   1st Qu.:0.00000  
##  Median : 0.24980   Median :  0.00   Median : 8.56   Median :0.00000  
##  Mean   : 2.84819   Mean   : 11.52   Mean   :11.04   Mean   :0.07014  
##  3rd Qu.: 3.24233   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :37.66190   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4485   1st Qu.:5.888   1st Qu.: 44.05   1st Qu.: 2.111  
##  Median :0.5380   Median :6.211   Median : 76.70   Median : 3.272  
##  Mean   :0.5530   Mean   :6.293   Mean   : 68.18   Mean   : 3.827  
##  3rd Qu.:0.6240   3rd Qu.:6.627   3rd Qu.: 93.85   3rd Qu.: 5.215  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.15   1st Qu.:375.95  
##  Median : 5.000   Median :330.0   Median :19.00   Median :391.50  
##  Mean   : 9.347   Mean   :404.6   Mean   :18.43   Mean   :358.44  
##  3rd Qu.:16.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.22  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.60  
##  1st Qu.: 6.91   1st Qu.:17.20  
##  Median :11.28   Median :21.40  
##  Mean   :12.50   Mean   :22.73  
##  3rd Qu.:16.62   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

2. Exploratory Data Analysis (EDA)

We explore the relationships between variables and visualize the distribution of house prices.

# Visualize distribution of house prices
ggplot(Boston_cleaned, aes(x = medv)) +
  geom_histogram(binwidth = 1, fill = 'blue', color = 'black', alpha = 0.7) +
  ggtitle('Distribution of House Prices') +
  xlab('Price') + ylab('Frequency')

# Scatter plot between 'RM' (average number of rooms) and 'medv' (price)
ggplot(Boston_cleaned, aes(x = rm, y = medv)) +
  geom_point() +
  ggtitle('Price vs. Average Number of Rooms (RM)') +
  xlab('Average Rooms per Dwelling') + ylab('Price')

3. Feature Engineering

We’ll create a new feature, such as the square of LSTAT (percentage of lower status population), and scale the data.

# Create new feature
Boston_cleaned$LSTAT_squared <- Boston_cleaned$lstat^2

# Scale the features
scaled_data <- scale(Boston_cleaned[, -14])  # Scaling all features except 'medv'
scaled_data <- as.data.frame(scaled_data)

# Add target variable back to the scaled data
scaled_data$medv <- Boston_cleaned$medv

# Check the first few rows of scaled data
head(scaled_data)
##         crim         zn      indus       chas        nox        rm        age
## 1 -0.5344262  0.2762431 -1.2727974 -0.2743718 -0.1297572 0.4032830 -0.1058898
## 2 -0.5304789 -0.4914596 -0.5787377 -0.2743718 -0.7258678 0.1831287  0.3809912
## 3 -0.5304827 -0.4914596 -0.5787377 -0.2743718 -0.7258678 1.2753230 -0.2515988
## 4 -0.5295274 -0.4914596 -1.2917529 -0.2743718 -0.8208999 1.0079927 -0.7953418
## 5 -0.5226295 -0.4914596 -1.2917529 -0.2743718 -0.8208999 1.2209992 -0.4968162
## 6 -0.5300013 -0.4914596 -1.2917529 -0.2743718 -0.8208999 0.1959948 -0.3368918
##         dis        rad       tax    ptratio     black      lstat LSTAT_squared
## 1 0.1252263 -0.9709024 -0.650800 -1.4427946 0.4351862 -1.0737889    -0.7921688
## 2 0.5422368 -0.8545806 -0.974339 -0.2907937 0.4351862 -0.4796323    -0.5341294
## 3 0.5422368 -0.8545806 -0.974339 -0.2907937 0.3891331 -1.2094736    -0.8297704
## 4 1.0628937 -0.7382588 -1.094168  0.1239267 0.4095006 -1.3651541    -0.8631451
## 5 1.0628937 -0.7382588 -1.094168  0.1239267 0.4351862 -1.0237997    -0.7763168
## 6 1.0628937 -0.7382588 -1.094168  0.1239267 0.4037298 -1.0409389    -0.7818730
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

4. Model Building

Now, we’ll build a linear regression model to predict house prices (medv).

# Split the data into 80% training and 20% testing sets
set.seed(123)
train_indices <- sample(1:nrow(scaled_data), size = 0.8 * nrow(scaled_data))
train_data <- scaled_data[train_indices, ]
test_data <- scaled_data[-train_indices, ]

# Train the linear regression model
model <- lm(medv ~ ., data = train_data)

# View the model summary
summary(model)
## 
## Call:
## lm(formula = medv ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6404 -2.7509 -0.4435  1.9946 24.2461 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    22.69255    0.21553 105.290  < 2e-16 ***
## crim           -1.33826    0.40244  -3.325 0.000968 ***
## zn              0.63706    0.32699   1.948 0.052114 .  
## indus           0.01613    0.44055   0.037 0.970809    
## chas            0.80152    0.21095   3.800 0.000168 ***
## nox            -1.62298    0.45385  -3.576 0.000393 ***
## rm              2.02033    0.30477   6.629 1.15e-10 ***
## age             0.72244    0.39516   1.828 0.068291 .  
## dis            -2.72284    0.43270  -6.293 8.52e-10 ***
## rad             2.98217    0.62469   4.774 2.57e-06 ***
## tax            -1.86065    0.65400  -2.845 0.004679 ** 
## ptratio        -1.52642    0.28308  -5.392 1.22e-07 ***
## black           0.86686    0.26977   3.213 0.001423 ** 
## lstat         -12.99666    1.03659 -12.538  < 2e-16 ***
## LSTAT_squared   8.49898    0.92177   9.220  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.283 on 384 degrees of freedom
## Multiple R-squared:  0.7855, Adjusted R-squared:  0.7776 
## F-statistic: 100.4 on 14 and 384 DF,  p-value: < 2.2e-16

5. Model Evaluation

Next, we evaluate the model using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared.

# Predict on the test data
predictions <- predict(model, newdata = test_data)

# Calculate evaluation metrics
mae <- mean(abs(predictions - test_data$medv))
rmse <- sqrt(mean((predictions - test_data$medv)^2))
r2 <- 1 - sum((predictions - test_data$medv)^2) / sum((test_data$medv - mean(test_data$medv))^2)

# Print evaluation metrics
cat("Mean Absolute Error (MAE):", mae, "\n")
## Mean Absolute Error (MAE): 3.113301
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 4.376915
cat("R-squared:", r2, "\n")
## R-squared: 0.771992

6. Data Visualization

We’ll visualize the model’s performance and the residuals.

# Plot Actual vs Predicted Prices
ggplot(data = data.frame(actual = test_data$medv, predicted = predictions), aes(x = actual, y = predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = 'red') +
  ggtitle('Actual vs Predicted House Prices') +
  xlab('Actual Prices') + ylab('Predicted Prices')

# Residual Plot
residuals <- test_data$medv - predictions
ggplot(data = data.frame(predicted = predictions, residuals = residuals), aes(x = predicted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = 'red') +
  ggtitle('Residuals Plot') +
  xlab('Predicted Prices') + ylab('Residuals')

7. Reporting and Communication

Report Summary: