Data Science Workflow Using Statistical and Machine Learning Techniques

  1. Load the Data: Begin by loading the dataset “budget_data.csv” into R to start the analysis.
  2. Check for Missing Values: Use the summary() function, which reports the number of NA values in each numeric column, to check for missing values in the dataset. Any missing values must be handled (for example, by removing or imputing the affected rows) before proceeding.
  3. Exploratory Data Analysis: Perform exploratory data analysis (EDA) to gain insights into the dataset. This may involve visualizations, summary statistics, and understanding the distribution of variables.
  4. Preprocessing: Prepare the data for modeling by performing any necessary preprocessing steps such as feature scaling, encoding categorical variables, and handling outliers.
  5. Train-Test Split: Split the dataset into training and testing sets. The training set will be used to train the linear regression model, while the testing set will be used to evaluate its performance.
  6. Model Training: Train a linear regression model using the training data. Use the lm() function in R to fit the model to the data.
  7. Model Evaluation: Evaluate the performance of the trained model using the testing data. This may involve calculating metrics such as mean squared error (MSE), R-squared, and plotting the residuals.
  8. Interpretation: Interpret the results of the linear regression model. Identify significant predictors, assess the model’s goodness of fit, and draw conclusions based on the analysis.
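
Steps 2 and 5 to 7 above can be sketched as follows. This is a minimal illustration, not the analysis performed later in this document: the data frame `toy` is synthetic (made-up values standing in for budget_data.csv), and the 80/20 split ratio is an assumption.

```r
# Synthetic stand-in for budget_data.csv (values are made up for illustration)
set.seed(42)
toy <- data.frame(Year = 2001:2040)
toy$TotalBudget <- 1e6 + 50000 * (toy$Year - 2001) + rnorm(nrow(toy), sd = 20000)

# Step 2: count missing values per column
na_counts <- colSums(is.na(toy))

# Step 5: random 80/20 train-test split (the ratio is an assumption)
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Step 6: fit the linear model on the training rows only
fit <- lm(TotalBudget ~ Year, data = train)

# Step 7: evaluate on the held-out rows
pred <- predict(fit, newdata = test)
mse  <- mean((test$TotalBudget - pred)^2)
rsq  <- 1 - sum((test$TotalBudget - pred)^2) /
            sum((test$TotalBudget - mean(test$TotalBudget))^2)
```

For yearly data, a chronological split (train on earlier years, test on later ones) may be more appropriate than a random one; the random split here simply mirrors step 5 as written.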

R Demonstration

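We demonstrate two techniques on the “budget_data.csv” dataset: linear regression and random forest regression. The dataset contains ten yearly observations (2015 to 2024), and both models are fitted to all of them; no test set is held out in this demonstration.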

# Load the necessary libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Load the data
budget_data <- read.csv("budget_data.csv")

# See data
summary(budget_data)
##       Year       TotalBudget     
##  Min.   :2015   Min.   :1000000  
##  1st Qu.:2017   1st Qu.:1350000  
##  Median :2020   Median :1650000  
##  Mean   :2020   Mean   :1610000  
##  3rd Qu.:2022   3rd Qu.:1875000  
##  Max.   :2024   Max.   :2100000

Linear Regression Model (LRM)

Linear regression models a response as a linear function of one or more predictors. In R, the lm() function fits the model, and summary() reports the coefficient estimates, their standard errors and p-values, and R-squared. Here the model regresses TotalBudget on Year, so the Year coefficient estimates the average change in budget per year.

# Train linear regression model
model <- lm(TotalBudget ~ Year, data = budget_data)

# Get the summary
summary(model)
## 
## Call:
## lm(formula = TotalBudget ~ Year, data = budget_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78182 -21364   -909  26364  67273 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -237058182   10166815  -23.32 1.22e-08 ***
## Year            118182       5034   23.48 1.15e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45730 on 8 degrees of freedom
## Multiple R-squared:  0.9857, Adjusted R-squared:  0.9839 
## F-statistic: 551.1 on 1 and 8 DF,  p-value: 1.153e-08
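
In the output above, the Year coefficient (118182) is the estimated budget increase per year, and the R-squared of 0.9857 indicates that the linear trend accounts for most of the variation in the ten observations. The sketch below shows one way to pull such quantities out of a fitted model programmatically. It uses a made-up data frame `toy`, not budget_data.csv, so the numbers it produces are illustrative only.

```r
# Made-up yearly series used only for illustration
toy <- data.frame(
  Year = 2015:2024,
  TotalBudget = c(100, 112, 119, 133, 138, 152, 161, 168, 181, 190)
)
fit <- lm(TotalBudget ~ Year, data = toy)

# Estimated change in TotalBudget per year
slope <- coef(fit)[["Year"]]

# 95% confidence interval for the slope (1 x 2 matrix: lower, upper)
slope_ci <- confint(fit, "Year", level = 0.95)

# Goodness of fit and significance of the Year term
r2      <- summary(fit)$r.squared
p_value <- summary(fit)$coefficients["Year", "Pr(>|t|)"]
```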

LRM Prediction

# Predict next year's budget with the linear regression model
next_year <- data.frame(Year = 2025)  # Assuming we want to predict for the year 2025
predicted_budget <- predict(model, newdata = next_year)

LRM Result

predicted_budget
##       1 
## 2260000
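
A point prediction says nothing about its uncertainty. For a model fitted with lm(), predict() can also return a prediction interval through its `interval` argument. The sketch below illustrates this on a made-up data frame `toy` (an assumption, not the budget data); the 95% level is likewise an illustrative choice.

```r
# Made-up yearly series used only for illustration
toy <- data.frame(
  Year = 2015:2024,
  TotalBudget = c(100, 112, 119, 133, 138, 152, 161, 168, 181, 190)
)
fit <- lm(TotalBudget ~ Year, data = toy)

# Point prediction plus a 95% prediction interval for 2025
next_year <- data.frame(Year = 2025)
pred_int <- predict(fit, newdata = next_year,
                    interval = "prediction", level = 0.95)

# pred_int is a one-row matrix with columns "fit", "lwr" and "upr"
print(pred_int)
```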

Visualizing Data for LRM

# Visualization for linear regression model
# Add the model's fitted value for each observed year
# (assigning the single 2025 prediction would draw a flat line)
budget_data$PredictedBudget <- predict(model, newdata = budget_data)

# Plot the data for linear regression
ggplot(budget_data, aes(x = Year)) +
  geom_point(aes(y = TotalBudget), color = "blue", size = 3) +  # Actual budget data points
  geom_line(aes(y = PredictedBudget), color = "red") +  # Predicted budget line
  labs(title = "Actual vs Predicted Budget Over Time",
       x = "Year",
       y = "Total Budget") +
  theme_minimal()

Random Forest Model (RFM)

Random forest is an ensemble machine learning algorithm for classification and regression that averages the predictions of many decision trees. In R it is implemented in the randomForest package. It can capture nonlinear relationships and provides feature importance measures. One limitation matters for this dataset: each tree predicts an average of observed response values, so the forest cannot predict outside the range of budgets seen in training.

# Train Random Forest model (results vary between runs unless set.seed() is called first)
model_rf <- randomForest(TotalBudget ~ Year, data = budget_data)

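# summary() on a randomForest object only lists its components;
# print(model_rf) reports the mean of squared residuals and % variance explained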
summary(model_rf)
##                 Length Class  Mode     
## call              3    -none- call     
## type              1    -none- character
## predicted        10    -none- numeric  
## mse             500    -none- numeric  
## rsq             500    -none- numeric  
## oob.times        10    -none- numeric  
## importance        1    -none- numeric  
## importanceSD      0    -none- NULL     
## localImportance   0    -none- NULL     
## proximity         0    -none- NULL     
## ntree             1    -none- numeric  
## mtry              1    -none- numeric  
## forest           11    -none- list     
## coefs             0    -none- NULL     
## y                10    -none- numeric  
## test              0    -none- NULL     
## inbag             0    -none- NULL     
## terms             3    terms  call
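
The summary() output above lists only the components stored in the fitted object. The sketch below shows one way to read error estimates from those components. It assumes a made-up data frame `toy` in place of budget_data.csv and an arbitrary seed; `mse` and `rsq` hold one out-of-bag value per tree, so their last elements describe the full forest.

```r
library(randomForest)

# Made-up yearly series used only for illustration
toy <- data.frame(
  Year = 2015:2024,
  TotalBudget = c(100, 112, 119, 133, 138, 152, 161, 168, 181, 190)
)

set.seed(1)  # arbitrary seed so the forest is reproducible
rf <- randomForest(TotalBudget ~ Year, data = toy)

# Prints the mean of squared residuals and % variance explained
print(rf)

# Out-of-bag error after all trees have been grown
oob_mse <- tail(rf$mse, 1)
oob_rsq <- tail(rf$rsq, 1)
```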

RFM Prediction

# Predict next year's budget with the Random Forest model
predicted_budget_rf <- predict(model_rf, newdata = next_year)

RFM Result

predicted_budget_rf
##       1 
## 1953337
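
The random forest prediction for 2025 is lower than the largest observed budget (2100000) even though the data trend upward. This is expected, because tree-based models do not extrapolate. The sketch below illustrates the behavior on a made-up data frame `toy` (an assumption, not the budget data): for any year beyond the training range the forest returns the same value, which never exceeds the training maximum, while the linear model keeps following the trend.

```r
library(randomForest)

# Made-up yearly series used only for illustration
toy <- data.frame(
  Year = 2015:2024,
  TotalBudget = c(100, 112, 119, 133, 138, 152, 161, 168, 181, 190)
)

set.seed(1)  # arbitrary seed so the forest is reproducible
rf  <- randomForest(TotalBudget ~ Year, data = toy)
fit <- lm(TotalBudget ~ Year, data = toy)

# Years outside the training range
future <- data.frame(Year = c(2025, 2030, 2040))

rf_pred <- predict(rf, newdata = future)   # constant, bounded by max(TotalBudget)
lm_pred <- predict(fit, newdata = future)  # continues the fitted trend

print(data.frame(Year = future$Year, rf = rf_pred, lm = lm_pred))
```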

Visualizing Data for RFM

# Visualization for Random Forest model
# Add the model's prediction for each observed year
# (assigning the single 2025 prediction would draw a flat line)
budget_data$PredictedBudget_rf <- predict(model_rf, newdata = budget_data)

# Plot the data for Random Forest
ggplot(budget_data, aes(x = Year)) +
  geom_point(aes(y = TotalBudget), color = "blue", size = 3) +  # Actual budget data points
  geom_line(aes(y = PredictedBudget_rf), color = "red") +  # Predicted budget line
  labs(title = "Actual vs Predicted Budget Over Time (Random Forest)",
       x = "Year",
       y = "Total Budget") +
  theme_minimal()

Interpretation

The two models give noticeably different predictions for 2025: 2,260,000 from linear regression and 1,953,337 from random forest. Much of the gap comes from how each model treats a year outside the training range. Linear regression extends the fitted trend, whereas the random forest prediction is bounded by the largest budget in the data (2,100,000). One simple way to reconcile the two is to average them, which gives a single estimate that sits between the trend-based and the data-bounded prediction. The average is not guaranteed to be more accurate than either model alone, but it avoids relying entirely on one model's assumptions. The code below computes the average and draws a bar plot that places the two individual predictions next to it.

# Predicted values from linear regression and random forest models
linear_regression_prediction <- predicted_budget
random_forest_prediction <- predicted_budget_rf

# Calculate the average
average_prediction <- (linear_regression_prediction + random_forest_prediction) / 2

# Final predicted value
print(average_prediction)
##       1 
## 2106668
# Create a data frame with predictions
predictions <- data.frame(Model = c("Linear Regression", "Random Forest", "Average"),
                          Prediction = c(linear_regression_prediction, random_forest_prediction, average_prediction))

# Round the predicted values to remove decimal places
predictions$Prediction <- round(predictions$Prediction)

# Plot the predictions
ggplot(predictions, aes(x = Model, y = Prediction)) +
  geom_col(fill = "skyblue", width = 0.5) +
  geom_text(aes(label = Prediction), vjust = -0.5, color = "black", size = 4) +  # Add labels on bars
  labs(title = "Comparison of Predicted Values",
       x = "Model",
       y = "Predicted Value") +
  theme_minimal()
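
One way to check whether averaging helps is a small backtest: hold out the most recent year, refit both models on the earlier years, and compare each prediction with the held-out value. The sketch below does this on a made-up data frame `toy` (an assumption, not the budget data) with an arbitrary seed. A single held-out year is weak evidence, so this is an illustration of the procedure rather than a verdict on either model.

```r
library(randomForest)

# Made-up yearly series used only for illustration
toy <- data.frame(
  Year = 2015:2024,
  TotalBudget = c(100, 112, 119, 133, 138, 152, 161, 168, 181, 190)
)

# Hold out the most recent year
train <- toy[toy$Year < 2024, ]
test  <- toy[toy$Year == 2024, ]

set.seed(1)  # arbitrary seed so the forest is reproducible
fit_lm <- lm(TotalBudget ~ Year, data = train)
fit_rf <- randomForest(TotalBudget ~ Year, data = train)

# Predictions for the held-out year, plus their simple average
preds <- c(lm = unname(predict(fit_lm, newdata = test)),
           rf = unname(predict(fit_rf, newdata = test)))
preds["average"] <- mean(preds[c("lm", "rf")])

# Absolute error of each approach on the held-out year
abs_error <- abs(preds - test$TotalBudget)
print(abs_error)
```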