Part A: Introduction

In the realm of real estate taxation, property assessments in Cook County, Illinois, serve as the foundation for establishing the taxable value of properties. Conducted by the Cook County Assessor’s Office, these assessments draw on factors such as property size, location, and recent sale prices to determine valuations. Understanding this process is essential for applying machine learning and AI techniques to improve property valuation and taxation systems.

Mean Assessed Value Over Time with Confidence Interval

The graph below illustrates the mean assessed property values in Cook County, Illinois, for the years 2021 to 2023, along with a shaded variability band. Sales data are filtered to exclude entries with a sale price below $10,000.

The solid line represents the mean assessed value for each year, while the shaded area spans one standard deviation above and below the mean (a dispersion band rather than a formal confidence interval). This visualization shows the overall trend in property values over time and highlights the variability associated with the assessments.
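
The filtering step itself is not shown in the code below; as a minimal sketch, assuming the combined assessment and sales data live in a data frame named ratios with a SALE_PRICE column (both names are assumptions based on later code):

library(dplyr)

# Drop very low-value transactions (under $10,000), which are unlikely to be representative sales
ratios <- ratios %>%
  filter(SALE_PRICE >= 10000)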

library(dplyr)
library(ggplot2)

# Calculate mean and standard deviation of assessed value per year
mean_values <- ratios %>%
  group_by(TAX_YEAR) %>%
  summarise(mean_value = mean(ASSESSED_VALUE, na.rm = TRUE),
            sd_value = sd(ASSESSED_VALUE, na.rm = TRUE))

# Convert TAX_YEAR to integer
mean_values$TAX_YEAR <- as.integer(mean_values$TAX_YEAR)

# Create a ggplot object with interval plot for mean assessed value
ggplot(mean_values, aes(x = TAX_YEAR, y = mean_value)) +
  geom_line() +
  geom_ribbon(aes(ymin = mean_value - sd_value, ymax = mean_value + sd_value), alpha = 0.3) +  # Shade the mean +/- 1 SD band
  labs(x = "Year", y = "Mean Assessed Value", title = "Mean Assessed Value Over Time with Confidence Interval") +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma)  # Disable scientific notation for y-axis labels

Selected Cook County Townships: Lyons & Stickney

For this semester, I have been assigned to Group 21, which includes the townships of Lyons (21) and Stickney (36). These townships provide various services to residents, including property assessment, road maintenance, and assistance programs. They are part of the broader township system in Illinois, which plays a role in local governance and service provision.

Lyons Township (Township 21) and Stickney Township (Township 36) are both located in Cook County, Illinois. Lyons Township is situated to the west of Chicago and includes several suburbs such as La Grange, Western Springs, and Burr Ridge.

Stickney Township, on the other hand, is located to the southwest of Chicago and includes areas such as Stickney and Forest View.

Mean Assessed Property Value by Selected Cook County Townships Over Time (2021 to 2023)

This graph presents the mean assessed property value for Lyons Township (Township 21) and Stickney Township (Township 36) over the years 2021 to 2023. Each township is represented by a different color, allowing for easy comparison of trends over time.

# Convert TAX_YEAR to numeric
ratios$TAX_YEAR <- as.numeric(ratios$TAX_YEAR)

# Calculate mean and standard deviation of assessed value per township and year
mean_values <- ratios %>%
  filter(township_code %in% c("21", "36")) %>%
  group_by(TAX_YEAR, township_name) %>%
  summarise(mean_value = mean(ASSESSED_VALUE, na.rm = TRUE),
            sd_value = sd(ASSESSED_VALUE, na.rm = TRUE))

# Create a ggplot object with a line plot of mean assessed value by township
ggplot(mean_values, aes(x = TAX_YEAR, y = mean_value, group = township_name, color = as.factor(township_name))) +
  geom_line() +
  labs(x = "Year", y = "Mean Assessed Value", title = "Mean Assessed Value of Lyons & Stickney Townships Over Time") +
  theme_minimal() +
  scale_color_discrete(name = "Township Name") +
  scale_y_continuous(labels = scales::comma) +  # Prevent scientific notation on the y-axis
  scale_x_continuous(breaks = seq(2021, 2023, by = 1), labels = seq(2021, 2023, by = 1))  # Show whole-year x-axis labels

Mean Assessed Value Individually: Lyons & Stickney Townships (2021 to 2023)

This table summarizes the mean assessed property values and standard deviations for Lyons and Stickney townships in Cook County, Illinois, from 2021 to 2023.

# Aggregate mean assessed values by township and year
mean_values_agg <- mean_values %>%
  group_by(TAX_YEAR, township_name) %>%
  summarise(mean_value = mean(mean_value, na.rm = TRUE),
            sd_value = mean(sd_value, na.rm = TRUE))

# Pivot the aggregated data to create one row per year
mean_values_table <- mean_values_agg %>%
  pivot_wider(names_from = township_name, values_from = c(mean_value, sd_value))

# Column names keep the pivot_wider defaults (e.g., mean_value_Lyons, sd_value_Stickney)

# Printing the table
knitr::kable(mean_values_table, caption = "Mean Assessed Values: Lyons & Stickney Townships (2021 to 2023)")
Mean Assessed Values: Lyons & Stickney Townships (2021 to 2023)
TAX_YEAR mean_value_Lyons mean_value_Stickney sd_value_Lyons sd_value_Stickney
2021 396971.9 165957.1 283826.2 48464.01
2022 395338.7 173376.6 308066.7 66833.20
2023 485755.3 224647.6 335041.9 71394.94

Mean Assessed Property Values Combined: Lyons and Stickney Townships (2021 to 2023)

This table presents the mean assessed property values and their standard deviations for Lyons and Stickney townships combined, aggregated by year from 2021 to 2023. The table has a single row; each column holds either the combined mean value or the combined standard deviation for a specific year.

For instance:

  • The column mean_value_2021 holds the combined mean assessed property value for 2021, computed as the unweighted average of the Lyons and Stickney township means.
  • The columns mean_value_2022 and mean_value_2023 hold the corresponding combined means for 2022 and 2023.

The standard deviation columns follow the same pattern, with sd_value_2021, sd_value_2022, and sd_value_2023 holding the averaged standard deviations for the respective years.

# Aggregate mean assessed values by township and year
mean_values_agg <- mean_values %>%
  group_by(TAX_YEAR, township_name) %>%
  summarise(mean_value = mean(mean_value, na.rm = TRUE),
            sd_value = mean(sd_value, na.rm = TRUE)) %>%
  group_by(TAX_YEAR) %>%
  summarise(across(c(mean_value, sd_value), ~ mean(.x, na.rm = TRUE)))

# Pivot the aggregated data to create one row per year
mean_values_table <- mean_values_agg %>%
  pivot_wider(names_from = TAX_YEAR, values_from = c(mean_value, sd_value))

# Printing the table
knitr::kable(mean_values_table, caption = "Mean Assessed Value for Lyons & Stickney Townships Combined (2021 to 2023)")
Mean Assessed Value for Lyons & Stickney Townships Combined (2021 to 2023)
mean_value_2021 mean_value_2022 mean_value_2023 sd_value_2021 sd_value_2022 sd_value_2023
281464.5 284357.7 355201.4 166145.1 187450 203218.4

Sale Price & Assessed Value

The chart below shows the sale price and assessed value of all properties in Lyons Township (Township 21) and Stickney Township (Township 36) from 2021 to 2023.

# Sale Price & Assessed Value for Stickney and Lyons
output_stickney_lyons[[2]]

IAAO Arm’s Length Standard

Using the cmfproperty package, the IAAO arm’s length standard was applied to the data. The graph below presents a conservative picture of the housing market in Lyons Township (Township 21) and Stickney Township (Township 36).

# IAAO Arm's Length Standard for Stickney and Lyons
output_stickney_lyons[[1]]

Assessment Accuracy

This data visualization shows the assessment accuracy of homes in Stickney and Lyons Townships combined. Assessment accuracy is calculated as the ratio of the median assessed value to the median sale price.

The boxplot provides a graphical summary of the distribution of assessment accuracy across the combined townships, allowing for the identification of potential outliers and the assessment accuracy’s overall variability. This visualization helps in understanding the consistency and reliability of property assessments in Stickney and Lyons Townships.
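
The stats data frame used in the plot below is not constructed in this section. A minimal sketch of one possible construction, assuming ratios_stickney_lyons contains ASSESSED_VALUE, SALE_PRICE, SALE_YEAR, and township_name columns (these names are assumptions), with one accuracy ratio per township-year so the boxplot summarizes their spread:

library(dplyr)

# Hypothetical construction of `stats`: median assessed value over median sale price per township and year
stats <- ratios_stickney_lyons %>%
  group_by(township_name, SALE_YEAR) %>%
  summarise(
    assessment_accuracy = median(ASSESSED_VALUE, na.rm = TRUE) /
      median(SALE_PRICE, na.rm = TRUE),
    .groups = "drop"
  )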

# Create a boxplot to visualize assessment accuracy
ggplot(stats, aes(x = "Combined Townships", y = assessment_accuracy)) +
  geom_boxplot() +
  labs(title = "Assessment Accuracy of Stickney and Lyons Townships Combined",
       x = "Townships", y = "Assessment Accuracy")

Homes Available from 2021 to 2023

The graph illustrates the number of home sales recorded in Stickney and Lyons Townships combined from 2021 to 2023, which serves as a proxy for the number of homes available in the local real estate market each year.

# Calculate the count of homes available each year
homes_count <- sales %>%
  filter(township_code %in% c("21", "36")) %>%
  group_by(year) %>%
  summarise(total_homes = n())

# Plot the number of homes available over the years
ggplot(homes_count, aes(x = year, y = total_homes)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Homes Available in Stickney and Lyons Townships Combined",
       x = "Year", y = "Number of Homes Available") +
  theme_minimal()

Part B: Modeling Overassessment

In real estate taxation, accurately assessing property values is crucial; however, disparities between assessed values and market prices can lead to overassessment. Modeling overassessment involves using statistical methods to identify properties likely to be overvalued. By employing advanced techniques like machine learning, we aim to develop models to improve assessment accuracy and reduce overassessment.

Visualization of Extreme Overassessment

The graph illustrates the rate of extreme overassessment by sale price decile for 2022 sales in Lyons and Stickney Townships: among properties already classified as overassessed (ratio above 0.5), it shows the share whose assessment ratio exceeds 1, i.e., whose assessed value is greater than their sale price.

plot2 <- ratios_stickney_lyons %>%
  filter(SALE_YEAR == 2022) %>%
  group_by(sale_bin = ntile(SALE_PRICE, 10)) %>%
  summarize(`Percent Over 50%` = length(pin[RATIO > 0.5]) / n(),
            `Percent Over 100%` = length(pin[RATIO > 1]) / length(pin[RATIO > 0.5]))

ggplot(plot2, aes(x = sale_bin, y = `Percent Over 100%`)) +
  geom_point() + geom_line() +
  labs(x = 'Sale Decile', title = 'Rate of Overassessment in Lyons & Stickney Townships by Sale Price Decile',
       y = 'Share of Extremely Overassessed Properties') +
  scale_y_continuous(labels = scales::percent)

Count of Overassessed Properties by Sale Price Range

Next, we present a table displaying the count of overassessed properties by sale price range. Sale prices are grouped into buckets of $200,000, from 0 up to the maximum sale price observed in the dataset, and a property is counted as overassessed when its ratio exceeds 0.5.

# Set options to display numbers without scientific notation
options(scipen = 999)

# Define the buckets for sale prices 
price_buckets <- seq(0, max(ratios_stickney_lyons$SALE_PRICE), by = 200000)

# Group the data into buckets and count the number of overassessed properties in each bucket
overassessed_counts <- ratios_stickney_lyons %>%
  filter(RATIO > 0.5) %>%
  mutate(price_bucket = cut(SALE_PRICE, breaks = price_buckets, include.lowest = TRUE, labels = paste0(price_buckets[-length(price_buckets)], "-", price_buckets[-1]))) %>%
  count(price_bucket) %>%
  arrange(price_bucket)

# Create a table of overassessed property counts
overassessed_table <- overassessed_counts %>%
  rename("Sale Price Range" = price_bucket, "Count of Overassessed Properties" = n)

# Printing the overassessed table
knitr::kable(overassessed_table, caption = "Count of Overassessed Properties by Sale Price Range: Lyons & Stickney Townships")
Count of Overassessed Properties by Sale Price Range: Lyons & Stickney Townships
Sale Price Range Count of Overassessed Properties
0-200000 379
200000-400000 1967
400000-600000 640
600000-800000 399
800000-1000000 214
1000000-1200000 110
1200000-1400000 72
1400000-1600000 36
1600000-1800000 24
1800000-2000000 18
2000000-2200000 7
2200000-2400000 6
2400000-2600000 2
2600000-2800000 2
2800000-3000000 2
3000000-3200000 2

Adding a New Categorical Variable

We add a new categorical variable ‘class’ to indicate whether a property is overassessed based on the RATIO variable: properties with RATIO > 0.5 are labeled overassessed, and those with RATIO <= 0.5 are labeled not overassessed.
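
A minimal sketch of this step, assuming the ratios_stickney_lyons data frame; the label 'Overassessed' is referenced in the modeling code below, while the negative label used here is an assumption:

library(dplyr)

# Label each property from its assessment ratio, using the 0.5 threshold defined above
ratios_stickney_lyons <- ratios_stickney_lyons %>%
  mutate(class = factor(if_else(RATIO > 0.5, "Overassessed", "Not Overassessed"),
                        levels = c("Overassessed", "Not Overassessed")))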

Model Preparation and Fitting

We prepare the data and fit a predictive model for classifying properties as overassessed or not overassessed using random forest classification. This involves data preprocessing steps, cross-validation, model specification, and fitting.

Model Formula

The formula for preprocessing in the overassessment model is given by: class ~ log(ASSESSED_VALUE) + char_bldg_sf + char_yrblt + char_bldg_sf:log(ASSESSED_VALUE) + char_yrblt:log(ASSESSED_VALUE) + Imputed(char_bldg_sf, char_yrblt)

Where:

  • \(\log(\text{ASSESSED\_VALUE})\) represents the logarithmic transformation of ASSESSED_VALUE.

  • \(\text{char\_bldg\_sf}\) and \(\text{char\_yrblt}\) are the main effects for building square footage and year built.

  • \(\text{char\_bldg\_sf}:\log(\text{ASSESSED\_VALUE})\) and \(\text{char\_yrblt}:\log(\text{ASSESSED\_VALUE})\) are the interaction terms between the log-transformed assessed value and building square footage and year built, respectively.

  • \(\text{Imputed}(\text{char\_bldg\_sf}, \text{char\_yrblt})\) indicates that missing values in char_bldg_sf and char_yrblt are imputed, typically with median values.

# Filter data for 2022 sales in Lyons and Stickney Townships
joined_ratios <- ratios_stickney_lyons %>%
  filter(SALE_YEAR == 2022, township_name %in% c("Lyons", "Stickney")) 

# Create a recipe for preprocessing
class_recipe <- recipe(class ~ 
                         char_bldg_sf + char_yrblt + ASSESSED_VALUE, 
                       data = joined_ratios) %>%
  step_log(ASSESSED_VALUE) %>%
  step_impute_median(char_bldg_sf, char_yrblt) %>%
  step_interact(~ ASSESSED_VALUE:char_bldg_sf + ASSESSED_VALUE:char_yrblt) %>%
  prep()

# Apply the prepped recipe to the modeling data
prepped_recipe <- bake(class_recipe, new_data = joined_ratios)

# Set up 10-fold cross-validation folds (created here for resampling-based evaluation; not used in the code shown below)
folds <- rsample::vfold_cv(prepped_recipe, v = 10)

# Define the classification model
class_model <- 
  rand_forest(trees = 500) %>%
  set_mode('classification') %>%
  set_engine('ranger')

# Fit the model using the preprocessed data
fitted_model <- fit(class_model, data = prepped_recipe, formula = class ~ .)

# Generate a summary of the recipe
class_recipe %>% summary() %>% view()

# Show the preprocessed data
bake(class_recipe, joined_ratios) %>% view()

Analysis of Sale Price Bins and Model Performance

The table below presents an analysis of sale price bins and the performance of our predictive model in identifying overassessed properties. The data is divided into 10 bins based on sale price deciles. Each bin represents a range of sale prices, with Bin 1 containing the lowest sale prices and Bin 10 containing the highest.

Interpretation of Results

  • Average Sale Price (avg_sp): This column displays the average sale price within each bin. It provides insight into the distribution of sale prices across the dataset.

  • Share of Correct Predictions (share_correct): This metric indicates the proportion of correctly predicted outcomes by our model within each bin. A higher value suggests that the model performs well in accurately classifying properties as overassessed or not overassessed within that price range.

  • Share of Overassessed Properties (share_over): This metric is the proportion of properties in each bin whose class label is Overassessed, i.e., the actual rate of overassessment within that price range. It provides the base rate against which the model’s predictions can be judged.

Key Observations

  1. Share of Correct Predictions: The share of correct predictions varies across price bins. The lowest-priced bins (Bins 1 to 3) show the highest accuracy, accuracy dips for mid-range properties (Bins 5 and 6), and it partially recovers in the higher-priced bins. This suggests that the model struggles most with classification in the middle of the price distribution.

  2. Share of Overassessed Properties: The proportion of overassessed properties also varies across price bins. The share is highest at the low and high ends of the price distribution and lowest for mid-range properties (Bins 5 and 6), mirroring where the model’s accuracy dips.

joined_ratios <- ratios_stickney_lyons %>%
  filter(SALE_YEAR == 2022, township_name %in% c("Lyons", "Stickney")) 
# Generate predictions on the training data (no separate hold-out set is used here)
my_predictions <- predict(fitted_model, new_data = prepped_recipe)

# Some initial views on our model using training data:
joined_ratios %>%
  mutate(pred = my_predictions$.pred_class,
         bin = ntile(SALE_PRICE, 10)) %>%
  group_by(bin) %>%
  summarize(avg_sp = mean(SALE_PRICE),
            share_correct = sum(class == pred) / n(),
            share_over = sum(class == 'Overassessed') / n())  %>%
  kableExtra::kable() %>%
  kableExtra::kable_material(c("striped", "hover"))
bin avg_sp share_correct share_over
1 171256.2 0.9707602 0.9064327
2 222655.0 1.0000000 0.8830409
3 253886.8 0.9941520 0.8713450
4 287023.7 0.9532164 0.8596491
5 322643.2 0.9239766 0.7836257
6 368252.6 0.8771930 0.7777778
7 448700.3 0.9181287 0.8947368
8 577073.3 0.9529412 0.9352941
9 776896.6 0.9176471 0.9117647
10 1417962.4 0.9235294 0.8529412

Model Evaluation and Performance Metrics

Classification Metrics

This code evaluates the performance of the classification model for identifying overassessed properties using the predictions generated above. It calculates classification metrics such as accuracy, precision, recall, and F1-score to assess the model’s effectiveness.

Based on the provided metrics, the performance of the classification model is as follows:

  • Recall (Sensitivity): The model demonstrates a high ability to correctly identify overassessed properties out of all actual positive cases. This indicates that the model is effective in minimizing false negatives.

  • Specificity: While the model’s ability to correctly identify non-overassessed properties is moderate, there is room for improvement. Improving specificity will reduce the number of false positives.

  • Precision: The majority of the model’s predictions of overassessed properties are accurate, suggesting that when the model predicts an overassessment, it is likely to be correct.

  • Accuracy: Overall, the model performs well in predicting both positive and negative cases. This metric provides a general sense of the model’s overall correctness across all classifications.

  • F-measure: The F1-score indicates a good balance between precision and recall, reflecting the model’s overall performance. This measure is particularly useful when the class distribution is uneven.
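
The code that builds metrics_table is not shown above. As a minimal sketch, the metrics could be computed with yardstick from the predictions above, assuming the class labels 'Overassessed' and 'Not Overassessed' (the negative label is an assumption) and that 'Overassessed' is treated as the event level:

library(dplyr)
library(yardstick)

# Assemble truth and estimate as factors with the event class listed first (labels assumed)
eval_df <- tibble(
  truth    = factor(joined_ratios$class, levels = c("Overassessed", "Not Overassessed")),
  estimate = factor(my_predictions$.pred_class, levels = c("Overassessed", "Not Overassessed"))
)

# Compute the classification metrics reported in the table below
metrics_table <- tibble(
  Metric = c("Recall (Sensitivity)", "Specificity", "Precision", "Accuracy", "F-measure"),
  Estimate = c(
    recall_vec(eval_df$truth, eval_df$estimate),
    spec_vec(eval_df$truth, eval_df$estimate),
    precision_vec(eval_df$truth, eval_df$estimate),
    accuracy_vec(eval_df$truth, eval_df$estimate),
    f_meas_vec(eval_df$truth, eval_df$estimate)
  )
)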

metrics_table
Metric Estimate
Recall (Sensitivity) 0.9966239
Specificity 0.5929204
Precision 0.9413265
Accuracy 0.9431752
F-measure 0.9681863

ROC Curve

The code generates an ROC curve to visualize the model’s ability to discriminate between true positives and false positives across different probability thresholds.

# Generate ROC curve
# Extract predicted probabilities for the "Overassessed" class
# (hard class labels cannot produce a meaningful ROC curve, so request type = "prob";
#  this assumes the positive class label is "Overassessed")
library(pROC)
predicted_probs <- predict(fitted_model, new_data = prepped_recipe, type = "prob")$.pred_Overassessed

# Compute ROC curve
roc_curve <- roc(as.numeric(joined_ratios$class == "Overassessed"), predicted_probs)

# Plot ROC curve
plot(roc_curve, col = "blue", main = "ROC Curve", xlab = "False Positive Rate", ylab = "True Positive Rate")

Analysis of Overassessment in Lyons and Stickney Townships in 2022

The analysis revealed significant insights into the overassessment of properties within these regions:

  • Count of Overassessed Properties: The majority of overassessed properties fall within mid-range sale prices (roughly $200,000 to $400,000), largely because most sales occur in that range; in terms of rates, the share of overassessed properties is highest at the low and high ends of the price distribution.

  • Model Performance:

    • The random forest classification model effectively identifies overassessed properties across various price ranges.
    • Accuracy dips slightly for mid-range properties, but the model consistently identifies overassessed properties across price ranges. This consistency is crucial for ensuring fairness and accuracy in property assessments.
  • Classification Metrics:

    • The model demonstrates high recall and precision, indicating its effectiveness in identifying overassessed properties with minimal false negatives and a high rate of correct positive predictions.
    • However, there’s room for improvement in specificity. Enhancing specificity would reduce the number of false positives, making the model more reliable in identifying properties that are not overassessed.

These findings highlight the strengths and areas for improvement in our approach to identifying overassessed properties in Lyons and Stickney Townships for the year 2022. By focusing on mid-range sale prices and enhancing the model’s specificity, we can improve the accuracy and fairness of property assessments.

Part C: Modeling Predictive Assessment

Generating my own assessments for homes in Lyons and Stickney Townships is very similar to modeling overassessment. Initially, I will use a similar recipe and formula, replacing the categorical target ‘class’ with the continuous target ‘ASSESSED_VALUE’.

Training and testing occur on 2021 and 2022 sales with a 90/10 split based on time. Predictions are then made for 2023 and compared against the baseline of actual assessed values. The workflow is the same as before, except that xgboost requires us to bake the data first.

Model Description and Preprocessing Steps

Model Type: Gradient-boosted tree regression using XGBoost. Objective: Predict the assessed value of properties in Lyons and Stickney Townships based on their property characteristics.

Create your workflow

Begin by setting up a workflow for building a model to generate 2023 assessments using market information.

# Use the df 'ratios_stickney_lyons', as it includes sales data from 2021 and 2022 in the Lyons and Stickney Townships
# Step 1: Data Preparation
# Split data into training and testing sets
time_split <- rsample::initial_time_split(
  ratios_stickney_lyons %>% filter(SALE_YEAR < 2023) %>%
    arrange(sale_date), # Using data before 2023 for training
  0.9)  # 90% of data for training, 10% for testing

train <- training(time_split)
test <- testing(time_split)

Model Description & Preprocessing Steps

The model aims to predict the assessed value of properties in Lyons and Stickney Townships based on their building square footage (char_bldg_sf), year built (char_yrblt), township code (township_code), and neighborhood code (nbhd_code). By preprocessing the data with log transformation, converting categorical variables, and imputing missing values, the model is prepared to effectively capture the relationships between these predictors and the assessed values.

Preprocessing Steps:

  • Log Transformation: Log transformation is applied to the char_bldg_sf variable to handle skewness and make its distribution more symmetrical.

  • Convert to Numeric: The nbhd_code variable is converted to numeric format using the step_mutate function, as it was originally of factor type.

  • Missing Value Imputation: Median imputation is performed on all predictors to handle missing values. This ensures that there are no missing values in the dataset, which is necessary for training the model.

  • Dummy Variable Creation: step_dummy(all_nominal(), one_hot = TRUE) one-hot encodes any remaining categorical predictors. In this dataset, township_code and nbhd_code end up numeric after the mutate step, so they pass through unchanged; the step is kept so the recipe remains robust if factor-typed predictors are added later.

Model Formula \[ \text{ASSESSED\_VALUE} \sim \text{char\_bldg\_sf} + \text{char\_yrblt} + \text{township\_code} + \text{nbhd\_code} \]

# Step 2: Model Selection
# Define the regression model (e.g., XGBoost)
reg_model <- boost_tree(trees = 200) %>%
  set_mode("regression") %>%
  set_engine("xgboost")
# Step 3: Recipe Creation
# Create a recipe for preprocessing, estimated on the training data
reg_recipe <- recipe(ASSESSED_VALUE ~ char_bldg_sf + char_yrblt + township_code + nbhd_code, data = train) %>%
  step_log(char_bldg_sf) %>%  # Log transformation for building square footage
  step_mutate(nbhd_code = as.numeric(nbhd_code)) %>%  # Convert nbhd_code to numeric
  step_impute_median(all_predictors()) %>%  # Impute missing values with the median
  step_dummy(all_nominal(), one_hot = TRUE) %>%  # One-hot encode any remaining categorical predictors
  prep()

# Quick check: apply the prepped recipe to the training data
prepped_data <- bake(reg_recipe, new_data = train)

Create testing/training data and evaluate

Split the data into training and testing sets, train the model, and evaluate its performance using numeric metrics such as RMSE and MAPE.

# Step 4: Training and Testing
# Preprocess training data using the recipe
prepped_train <- reg_recipe %>% bake(train)

# Train the model
fitted_model <- reg_model %>% fit(ASSESSED_VALUE ~ ., data = prepped_train)

# Preprocess testing data using the recipe
prepped_test <- reg_recipe %>% bake(test)
# Display sample of prepped testing data
prepped_test_sample <- head(prepped_test)
# Format assessed_value in prepped_test_sample to be in $ with 0 decimal points and commas
prepped_test_sample$ASSESSED_VALUE <- scales::dollar(prepped_test_sample$ASSESSED_VALUE, accuracy = 1)

# Generate the HTML table with the updated prepped_test_sample
prepped_test_table <- prepped_test_sample %>%
  kableExtra::kable(format = "html", align = "l") %>%
  kableExtra::kable_material(c("striped", "hover")) %>%
  kableExtra::add_header_above(c("Sample of Prepped Testing Data" = 5))

# Print the updated table
prepped_test_table
Sample of Prepped Testing Data
char_bldg_sf char_yrblt township_code nbhd_code ASSESSED_VALUE
7.302496 2001 36 36022 $230,560
7.347944 1957 36 36050 $178,890
7.160069 1946 36 36050 $167,350
7.788212 1968 21 21071 $538,970
6.883463 1959 36 36021 $133,620
7.564757 1951 21 21111 $244,700

RMSE and MAPE

Root Mean Squared Error (RMSE): This metric measures the typical deviation of the predicted values from the actual values, with larger errors weighted more heavily. In this case, the RMSE is approximately $78,527.81; lower values indicate a better fit, so the model’s predictions are typically off by roughly that amount.

Mean Absolute Percentage Error (MAPE): This metric is the average of the absolute percentage differences between the predicted and actual values. The MAPE of 0.1776 means that, on average, the model’s predictions deviate from the actual assessed values by approximately 17.76%; lower values indicate better accuracy.
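
Formally, with \(y_i\) denoting the actual assessed value, \(\hat{y}_i\) the prediction, and \(n\) the number of test properties, the two metrics computed in the code below are:

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \qquad \text{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right| \]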

# Make predictions on testing data
predictions <- predict(fitted_model, new_data = prepped_test)

# Evaluate model performance using metrics like RMSE and MAPE
rmse <- sqrt(mean((predictions$.pred - prepped_test$ASSESSED_VALUE)^2))
mape <- mean(abs((predictions$.pred - prepped_test$ASSESSED_VALUE) / prepped_test$ASSESSED_VALUE))
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
cat("Mean Absolute Percentage Error (MAPE):", mape, "\n")

# Round RMSE to 2 decimal points and format as dollars
rmse_formatted <- sprintf("$%.2f", rmse)

# Round MAPE to 2 decimal points and format as percentage
mape_formatted <- sprintf("%.2f%%", mape * 100)

# Create a data frame with formatted metrics
evaluation_metrics <- data.frame(
  Metric = c("Root Mean Squared Error (RMSE)", "Mean Absolute Percentage Error (MAPE)"),
  Value = c(rmse_formatted, mape_formatted)
)

# Generate the HTML table
evaluation_metrics_table <- evaluation_metrics %>%
  kableExtra::kable(format = "html", align = "l") %>%
  kableExtra::kable_material()
# Print the table
evaluation_metrics_table
Metric Value
Root Mean Squared Error (RMSE) $78527.81
Mean Absolute Percentage Error (MAPE) 17.76%