A data scientist typically spends their day collecting, cleaning, analyzing large datasets to identify patterns and trends, building predictive models using machine learning algorithms, creating data visualizations to present insights, and collaborating with stakeholders to understand business needs and translate data into actionable solutions; essentially, they are problem solvers who use data to answer complex questions and inform decision-making.
Key tasks of a data scientist:
Tools used by data scientists:
The following sections will demonstrate the procedures using a dataset. We will use R software.
First, we load the dataset and inspect it for missing values and outliers.
# Load necessary libraries
library(MASS)
library(ggplot2)
library(dplyr)
# Load Boston dataset
data(Boston)
# Check the first few rows of the dataset
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# Check for missing values
sum(is.na(Boston))
## [1] 0
# Check for summary statistics
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# Removing outliers (for simplicity, we'll use Z-scores)
z_scores <- scale(Boston[, -14]) # Ignore 'medv' column (target)
outliers <- apply(z_scores, 1, function(x) any(abs(x) > 4)) # Z-scores > 4 or < -4 are considered outliers
# Clean the dataset by removing rows with outliers
Boston_cleaned <- Boston[!outliers, ]
# Check the cleaned dataset
summary(Boston_cleaned)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08100 1st Qu.: 0.00 1st Qu.: 5.13 1st Qu.:0.00000
## Median : 0.24980 Median : 0.00 Median : 8.56 Median :0.00000
## Mean : 2.84819 Mean : 11.52 Mean :11.04 Mean :0.07014
## 3rd Qu.: 3.24233 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :37.66190 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4485 1st Qu.:5.888 1st Qu.: 44.05 1st Qu.: 2.111
## Median :0.5380 Median :6.211 Median : 76.70 Median : 3.272
## Mean :0.5530 Mean :6.293 Mean : 68.18 Mean : 3.827
## 3rd Qu.:0.6240 3rd Qu.:6.627 3rd Qu.: 93.85 3rd Qu.: 5.215
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.15 1st Qu.:375.95
## Median : 5.000 Median :330.0 Median :19.00 Median :391.50
## Mean : 9.347 Mean :404.6 Mean :18.43 Mean :358.44
## 3rd Qu.:16.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.22
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.60
## 1st Qu.: 6.91 1st Qu.:17.20
## Median :11.28 Median :21.40
## Mean :12.50 Mean :22.73
## 3rd Qu.:16.62 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
We explore the relationships between variables and visualize the distribution of house prices.
# Visualize distribution of house prices
ggplot(Boston_cleaned, aes(x = medv)) +
geom_histogram(binwidth = 1, fill = 'blue', color = 'black', alpha = 0.7) +
ggtitle('Distribution of House Prices') +
xlab('Price') + ylab('Frequency')
# Scatter plot between 'RM' (average number of rooms) and 'medv' (price)
ggplot(Boston_cleaned, aes(x = rm, y = medv)) +
geom_point() +
ggtitle('Price vs. Average Number of Rooms (RM)') +
xlab('Average Rooms per Dwelling') + ylab('Price')
We’ll create a new feature, such as the square of LSTAT (percentage of lower status population), and scale the data.
# Create new feature
Boston_cleaned$LSTAT_squared <- Boston_cleaned$lstat^2
# Scale the features
scaled_data <- scale(Boston_cleaned[, -14]) # Scaling all features except 'medv'
scaled_data <- as.data.frame(scaled_data)
# Add target variable back to the scaled data
scaled_data$medv <- Boston_cleaned$medv
# Check the first few rows of scaled data
head(scaled_data)
## crim zn indus chas nox rm age
## 1 -0.5344262 0.2762431 -1.2727974 -0.2743718 -0.1297572 0.4032830 -0.1058898
## 2 -0.5304789 -0.4914596 -0.5787377 -0.2743718 -0.7258678 0.1831287 0.3809912
## 3 -0.5304827 -0.4914596 -0.5787377 -0.2743718 -0.7258678 1.2753230 -0.2515988
## 4 -0.5295274 -0.4914596 -1.2917529 -0.2743718 -0.8208999 1.0079927 -0.7953418
## 5 -0.5226295 -0.4914596 -1.2917529 -0.2743718 -0.8208999 1.2209992 -0.4968162
## 6 -0.5300013 -0.4914596 -1.2917529 -0.2743718 -0.8208999 0.1959948 -0.3368918
## dis rad tax ptratio black lstat LSTAT_squared
## 1 0.1252263 -0.9709024 -0.650800 -1.4427946 0.4351862 -1.0737889 -0.7921688
## 2 0.5422368 -0.8545806 -0.974339 -0.2907937 0.4351862 -0.4796323 -0.5341294
## 3 0.5422368 -0.8545806 -0.974339 -0.2907937 0.3891331 -1.2094736 -0.8297704
## 4 1.0628937 -0.7382588 -1.094168 0.1239267 0.4095006 -1.3651541 -0.8631451
## 5 1.0628937 -0.7382588 -1.094168 0.1239267 0.4351862 -1.0237997 -0.7763168
## 6 1.0628937 -0.7382588 -1.094168 0.1239267 0.4037298 -1.0409389 -0.7818730
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
Now, we’ll build a linear regression model to predict house prices (medv).
# Split the data into 80% training and 20% testing sets
set.seed(123)
train_indices <- sample(1:nrow(scaled_data), size = 0.8 * nrow(scaled_data))
train_data <- scaled_data[train_indices, ]
test_data <- scaled_data[-train_indices, ]
# Train the linear regression model
model <- lm(medv ~ ., data = train_data)
# View the model summary
summary(model)
##
## Call:
## lm(formula = medv ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6404 -2.7509 -0.4435 1.9946 24.2461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.69255 0.21553 105.290 < 2e-16 ***
## crim -1.33826 0.40244 -3.325 0.000968 ***
## zn 0.63706 0.32699 1.948 0.052114 .
## indus 0.01613 0.44055 0.037 0.970809
## chas 0.80152 0.21095 3.800 0.000168 ***
## nox -1.62298 0.45385 -3.576 0.000393 ***
## rm 2.02033 0.30477 6.629 1.15e-10 ***
## age 0.72244 0.39516 1.828 0.068291 .
## dis -2.72284 0.43270 -6.293 8.52e-10 ***
## rad 2.98217 0.62469 4.774 2.57e-06 ***
## tax -1.86065 0.65400 -2.845 0.004679 **
## ptratio -1.52642 0.28308 -5.392 1.22e-07 ***
## black 0.86686 0.26977 3.213 0.001423 **
## lstat -12.99666 1.03659 -12.538 < 2e-16 ***
## LSTAT_squared 8.49898 0.92177 9.220 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.283 on 384 degrees of freedom
## Multiple R-squared: 0.7855, Adjusted R-squared: 0.7776
## F-statistic: 100.4 on 14 and 384 DF, p-value: < 2.2e-16
Next, we evaluate the model using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared.
# Predict on the test data
predictions <- predict(model, newdata = test_data)
# Calculate evaluation metrics
mae <- mean(abs(predictions - test_data$medv))
rmse <- sqrt(mean((predictions - test_data$medv)^2))
r2 <- 1 - sum((predictions - test_data$medv)^2) / sum((test_data$medv - mean(test_data$medv))^2)
# Print evaluation metrics
cat("Mean Absolute Error (MAE):", mae, "\n")
## Mean Absolute Error (MAE): 3.113301
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 4.376915
cat("R-squared:", r2, "\n")
## R-squared: 0.771992
We’ll visualize the model’s performance and the residuals.
# Plot Actual vs Predicted Prices
ggplot(data = data.frame(actual = test_data$medv, predicted = predictions), aes(x = actual, y = predicted)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = 'red') +
ggtitle('Actual vs Predicted House Prices') +
xlab('Actual Prices') + ylab('Predicted Prices')
# Residual Plot
residuals <- test_data$medv - predictions
ggplot(data = data.frame(predicted = predictions, residuals = residuals), aes(x = predicted, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = 'red') +
ggtitle('Residuals Plot') +
xlab('Predicted Prices') + ylab('Residuals')
Report Summary: