Introduction

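This project analyzes housing data in R to identify factors that influence house prices. It uses two datasets: the Ames housing data (ames, attached with tidymodels through the modeldata package) for descriptive summaries, and a simulated 200-row sample (housing) for the modeling walkthrough. The project includes data cleaning, exploratory data analysis, visualizations, and predictive modeling.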

Load Packages

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.5.0 ──
✔ broom        1.0.12     ✔ rsample      1.3.2 
✔ dials        1.4.3      ✔ tailor       0.1.0 
✔ infer        1.1.0      ✔ tune         2.1.0 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.5.0      ✔ workflowsets 1.1.1 
✔ recipes      1.3.2      ✔ yardstick    1.4.0 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()

library(corrplot)

corrplot 0.95 loaded

Advanced Data Manipulation Using dplyr

ames |>
  mutate(
    Price_Per_Sqft = Sale_Price / Gr_Liv_Area,
    # Age at time of sale; the Ames sales span 2006-2010, so a fixed 2025 would overstate it
    House_Age = Year_Sold - Year_Built
  ) |>
  group_by(Bedroom_AbvGr) |>
  summarize(
    Avg_Price = mean(Sale_Price),
    Avg_Area = mean(Gr_Liv_Area),
    Avg_Price_Per_Sqft = mean(Price_Per_Sqft),
    Count = n()
  ) |>
  arrange(desc(Avg_Price))

# A tibble: 8 × 5
  Bedroom_AbvGr Avg_Price Avg_Area Avg_Price_Per_Sqft Count
          <int>     <dbl>    <dbl>              <dbl> <int>
1             0   218495.    1364               155.      8
2             4   216357.    2044.              102.    400
3             5   206244.    2351.               86.1    48
4             8   200000     3395                58.9     1
5             1   183017.    1155.              151.    112
6             3   179712.    1481.              122.   1597
7             2   162168.    1224.              130.    743
8             6   159702.    2127.               74.3    21
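
The ranking above is sensitive to very small groups: the top row rests on 8 houses and the 8-bedroom row on a single sale. A minimal sketch of one way to guard against that is shown below. The threshold min_group_size is a hypothetical value chosen only for illustration, and the median is added as an assumption that a skew-resistant summary is useful here.

```r
library(dplyr)

# Load the same Ames data that tidymodels attaches via modeldata
data(ames, package = "modeldata")

# Hypothetical cutoff, not part of the original analysis
min_group_size <- 30

bedroom_summary <- ames |>
  group_by(Bedroom_AbvGr) |>
  summarize(
    Avg_Price = mean(Sale_Price),
    Median_Price = median(Sale_Price),  # less affected by a few expensive homes
    Count = n()
  ) |>
  filter(Count >= min_group_size) |>    # drop groups too small to compare
  arrange(desc(Avg_Price))

bedroom_summary
```

Rows that fall below the cutoff are removed before sorting, so the ordering reflects only groups with enough sales to compare.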

Create Sample Housing Dataset

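# Simulated data: every column is drawn independently, so any relationship
# that appears between columns later is sampling noise, not a housing effect.
set.seed(123)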

housing <- tibble(
  Sale_Price = round(rnorm(200, mean = 180000, sd = 50000)),
  Gr_Liv_Area = round(rnorm(200, mean = 1500, sd = 400)),
  Overall_Qual = sample(1:10, 200, replace = TRUE),
  Garage_Cars = sample(0:4, 200, replace = TRUE),
  Year_Built = sample(1990:2023, 200, replace = TRUE)
)

glimpse(housing)

Rows: 200
Columns: 5
$ Sale_Price   <dbl> 151976, 168491, 257935, 183525, 186464, 265753, 203046, 1…
$ Gr_Liv_Area  <dbl> 2380, 2025, 1394, 1717, 1334, 1310, 1185, 1262, 2160, 147…
$ Overall_Qual <int> 9, 8, 7, 9, 3, 1, 7, 3, 7, 10, 3, 6, 7, 4, 10, 1, 8, 5, 3…
$ Garage_Cars  <int> 2, 0, 0, 0, 0, 2, 0, 1, 2, 3, 3, 2, 3, 1, 2, 3, 2, 3, 4, …
$ Year_Built   <int> 2008, 2014, 2006, 1998, 2012, 2020, 2022, 2005, 1997, 199…

Data Cleaning

sum(is.na(housing))

[1] 0

housing <- housing |>
  drop_na()
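
The single total from sum(is.na()) does not say which columns are affected, and drop_na() with no arguments removes a row if any column is missing. The sketch below shows per-column counts and a targeted drop. It uses toy_housing, a hypothetical four-row table with deliberately missing values, because the simulated housing data has none.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with missing values, for illustration only
toy_housing <- tibble(
  Sale_Price = c(150000, NA, 210000, 185000),
  Gr_Liv_Area = c(1200, 1500, NA, 1750),
  Garage_Cars = c(1L, 2L, 2L, NA)
)

# Count missing values in each column separately
na_counts <- toy_housing |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Drop a row only when one of the named columns is missing
cleaned <- toy_housing |>
  tidyr::drop_na(Sale_Price, Gr_Liv_Area)

na_counts
cleaned
```

Here the row whose only gap is Garage_Cars is kept, since that column is not named in the drop_na() call.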

Exploratory Data Analysis

Summary Statistics

housing |>
  summarize(
    avg_price = mean(Sale_Price),
    max_price = max(Sale_Price),
    min_price = min(Sale_Price)
  )

# A tibble: 1 × 3
  avg_price max_price min_price
      <dbl>     <dbl>     <dbl>
1   179571.    342052     64542

Quality-Level Analysis Using dplyr

housing |>
  group_by(Overall_Qual) |>
  summarize(
    average_price = mean(Sale_Price),
    average_area = mean(Gr_Liv_Area),
    houses = n()
  )

# A tibble: 10 × 4
   Overall_Qual average_price average_area houses
          <int>         <dbl>        <dbl>  <int>
 1            1       187978.        1583.     16
 2            2       172552.        1290.     22
 3            3       173797.        1569.     21
 4            4       196524.        1514.     16
 5            5       170890.        1328.     20
 6            6       190235.        1498.     15
 7            7       179290.        1489.     25
 8            8       175654.        1603.     24
 9            9       171627.        1734.     17
10           10       184560.        1595.     24

Data Visualization

House Price Distribution

ggplot(housing, aes(x = Sale_Price)) +
  geom_histogram(fill = "skyblue", bins = 20) +
  labs(
    title = "Distribution of House Prices",
    x = "Sale Price",
    y = "Count"
  )

Living Area vs Sale Price

ggplot(housing, aes(x = Gr_Liv_Area, y = Sale_Price)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Living Area vs Sale Price",
    x = "Living Area",
    y = "Sale Price"
  )

`geom_smooth()` using formula = 'y ~ x'

Correlation Analysis

numeric_data <- housing |>
  select(Sale_Price, Gr_Liv_Area, Overall_Qual, Garage_Cars)

cor_matrix <- cor(numeric_data)

corrplot(cor_matrix, method = "color")
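
A color plot alone makes it hard to read exact values. The sketch below prints the rounded matrix and the largest off-diagonal correlation. It rebuilds the simulated housing data so that it runs on its own. Since the columns are generated independently, correlations close to zero would be expected, but the code reports the values rather than assuming them.

```r
library(tibble)

# Rebuild the simulated dataset from the section above
set.seed(123)
housing <- tibble(
  Sale_Price = round(rnorm(200, mean = 180000, sd = 50000)),
  Gr_Liv_Area = round(rnorm(200, mean = 1500, sd = 400)),
  Overall_Qual = sample(1:10, 200, replace = TRUE),
  Garage_Cars = sample(0:4, 200, replace = TRUE),
  Year_Built = sample(1990:2023, 200, replace = TRUE)
)

cor_matrix <- cor(
  housing[, c("Sale_Price", "Gr_Liv_Area", "Overall_Qual", "Garage_Cars")]
)

# Numeric view of the same matrix that corrplot() colors
round(cor_matrix, 2)

# Strongest pairwise relationship, ignoring the diagonal
max_abs_cor <- max(abs(cor_matrix[upper.tri(cor_matrix)]))
max_abs_cor
```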

Train-Test Split

split_data <- initial_split(housing, prop = 0.8)

train_data <- training(split_data)

test_data <- testing(split_data)
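
The split above depends on the random number state left over from earlier code, so inserting or removing any random call upstream changes which rows land in each set. A minimal sketch of a split that is reproducible on its own is shown below. The seed value 2024 is arbitrary, and stratifying on Sale_Price is an assumption, not something the original analysis does.

```r
library(tibble)
library(rsample)

# Rebuild the simulated dataset from the section above
set.seed(123)
housing <- tibble(
  Sale_Price = round(rnorm(200, mean = 180000, sd = 50000)),
  Gr_Liv_Area = round(rnorm(200, mean = 1500, sd = 400)),
  Overall_Qual = sample(1:10, 200, replace = TRUE),
  Garage_Cars = sample(0:4, 200, replace = TRUE),
  Year_Built = sample(1990:2023, 200, replace = TRUE)
)

# Seed set immediately before the split (arbitrary value)
set.seed(2024)

# strata bins Sale_Price so both sets cover a similar price range
split_data <- initial_split(housing, prop = 0.8, strata = Sale_Price)
train_data <- training(split_data)
test_data <- testing(split_data)

c(train = nrow(train_data), test = nrow(test_data))
```

With stratification the 80/20 proportion is applied within each price bin, so the final row counts can differ slightly from an exact 160/40 split.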

Linear Regression Model

model <- linear_reg() |>
  set_engine("lm")

model_fit <- model |> 
  fit(
    Sale_Price ~ Gr_Liv_Area + Overall_Qual + Garage_Cars,
    data = train_data
  )

model_fit

parsnip model object


Call:
stats::lm(formula = Sale_Price ~ Gr_Liv_Area + Overall_Qual + 
    Garage_Cars, data = data)

Coefficients:
 (Intercept)   Gr_Liv_Area  Overall_Qual   Garage_Cars  
   1.624e+05     4.661e+00     5.369e+02     3.338e+03
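
The printed coefficients give point estimates only. To judge whether any of them is distinguishable from zero, the underlying lm object can be pulled out and summarized with broom. The sketch below rebuilds the data and model so it runs on its own; it is one possible way to inspect the fit, not a step from the original analysis.

```r
library(tidymodels)

# Rebuild the simulated data, split, and model from the sections above
set.seed(123)
housing <- tibble(
  Sale_Price = round(rnorm(200, mean = 180000, sd = 50000)),
  Gr_Liv_Area = round(rnorm(200, mean = 1500, sd = 400)),
  Overall_Qual = sample(1:10, 200, replace = TRUE),
  Garage_Cars = sample(0:4, 200, replace = TRUE),
  Year_Built = sample(1990:2023, 200, replace = TRUE)
)
split_data <- initial_split(housing, prop = 0.8)
train_data <- training(split_data)

model_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(Sale_Price ~ Gr_Liv_Area + Overall_Qual + Garage_Cars, data = train_data)

# Extract the underlying lm object from the parsnip wrapper
lm_fit <- extract_fit_engine(model_fit)

# One row per term: estimate, standard error, p-value, confidence interval
coef_table <- tidy(lm_fit, conf.int = TRUE)

# One-row model summary: R-squared, residual standard error, and more
fit_stats <- glance(lm_fit)

coef_table
fit_stats |> select(r.squared, adj.r.squared, sigma)
```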

Predictions

predictions <- predict(model_fit, test_data) |>
  bind_cols(test_data)

head(predictions)

# A tibble: 6 × 6
    .pred Sale_Price Gr_Liv_Area Overall_Qual Garage_Cars Year_Built
    <dbl>      <dbl>       <dbl>        <int>       <int>      <int>
1 176085.     168491        2025            8           0       2014
2 185413.     200039        1993            7           3       2008
3 179537.     152208        1103           10           2       2005
4 172417.     187669        1213            2           1       1993
5 177328.     223907        1205            5           2       2001
6 188231.     248430        1536           10           4       1994

Actual vs Predicted Prices

ggplot(predictions,
       aes(x = Sale_Price,
           y = .pred)) +
  geom_point(color = "darkgreen") +
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "red"
  ) +
  labs(
    title = "Actual vs Predicted Prices",
    x = "Actual Prices",
    y = "Predicted Prices"
  )
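
The plot gives a visual check, but no numeric error measure is computed anywhere in the report. A minimal sketch using yardstick is shown below. The comparison against a mean-only baseline is an added assumption about what a useful reference point is; the column name .baseline is hypothetical.

```r
library(tidymodels)

# Rebuild the simulated data, split, model, and predictions from above
set.seed(123)
housing <- tibble(
  Sale_Price = round(rnorm(200, mean = 180000, sd = 50000)),
  Gr_Liv_Area = round(rnorm(200, mean = 1500, sd = 400)),
  Overall_Qual = sample(1:10, 200, replace = TRUE),
  Garage_Cars = sample(0:4, 200, replace = TRUE),
  Year_Built = sample(1990:2023, 200, replace = TRUE)
)
split_data <- initial_split(housing, prop = 0.8)
train_data <- training(split_data)
test_data <- testing(split_data)

model_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(Sale_Price ~ Gr_Liv_Area + Overall_Qual + Garage_Cars, data = train_data)

predictions <- predict(model_fit, test_data) |>
  bind_cols(test_data)

# RMSE, R-squared, and MAE on the held-out test set
test_metrics <- predictions |>
  metrics(truth = Sale_Price, estimate = .pred)

# Baseline: predict the training-set mean price for every test house
baseline_rmse <- predictions |>
  mutate(.baseline = mean(train_data$Sale_Price)) |>
  rmse(truth = Sale_Price, estimate = .baseline)

test_metrics
baseline_rmse
```

If the model's RMSE is not clearly below the baseline RMSE, the predictors are adding little beyond the average price.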

Descriptive Statistics

ames |>
  summarize(
    Mean_Price = mean(Sale_Price),
    Median_Price = median(Sale_Price),
    Max_Price = max(Sale_Price),
    Min_Price = min(Sale_Price),
    Avg_Living_Area = mean(Gr_Liv_Area),
    Avg_Garage_Cars = mean(Garage_Cars)
  )

# A tibble: 1 × 6
  Mean_Price Median_Price Max_Price Min_Price Avg_Living_Area Avg_Garage_Cars
       <dbl>        <dbl>     <int>     <int>           <dbl>           <dbl>
1    180796.       160000    755000     12789           1500.            1.77

Garage Capacity vs Sale Price

ggplot(ames,
       aes(x = Garage_Cars,
           y = Sale_Price)) +
  geom_point(color = "purple") +
  geom_smooth(method = "lm",
              color = "black") +
  labs(
    title = "Garage Capacity vs Sale Price",
    x = "Garage Cars",
    y = "Sale Price"
  )

`geom_smooth()` using formula = 'y ~ x'

Recipes and Workflows

recipe_model <-
  recipe(
    Sale_Price ~ Gr_Liv_Area + Garage_Cars,
    data = train_data
  ) |>
  step_normalize(all_numeric_predictors())

workflow_model <-
  workflow() |>
  add_recipe(recipe_model) |>
  add_model(model)

workflow_fit <-
  workflow_model |>
  fit(data = train_data)

workflow_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept)  Gr_Liv_Area  Garage_Cars  
     179072         1956         4869
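
Because step_normalize() centers and scales the predictors, these coefficients are on a different scale from the earlier model: each one is the change in predicted price for a one-standard-deviation increase in that predictor, and the intercept equals the mean training price. The sketch below rebuilds the workflow so it runs on its own, checks the intercept, and predicts on the test set; the recipe applies the training-set means and standard deviations to new data automatically.

```r
library(tidymodels)

# Rebuild the simulated data and split from the sections above
set.seed(123)
housing <- tibble(
  Sale_Price = round(rnorm(200, mean = 180000, sd = 50000)),
  Gr_Liv_Area = round(rnorm(200, mean = 1500, sd = 400)),
  Overall_Qual = sample(1:10, 200, replace = TRUE),
  Garage_Cars = sample(0:4, 200, replace = TRUE),
  Year_Built = sample(1990:2023, 200, replace = TRUE)
)
split_data <- initial_split(housing, prop = 0.8)
train_data <- training(split_data)
test_data <- testing(split_data)

recipe_model <- recipe(Sale_Price ~ Gr_Liv_Area + Garage_Cars, data = train_data) |>
  step_normalize(all_numeric_predictors())

workflow_fit <- workflow() |>
  add_recipe(recipe_model) |>
  add_model(linear_reg() |> set_engine("lm")) |>
  fit(data = train_data)

# Coefficients per standard deviation of each predictor
coefs <- workflow_fit |>
  extract_fit_parsnip() |>
  tidy()

# With centered predictors, the intercept is the mean training price
intercept <- coefs$estimate[coefs$term == "(Intercept)"]
all.equal(intercept, mean(train_data$Sale_Price), tolerance = 1e-6)

# Preprocessing is applied to new data inside predict()
workflow_predictions <- predict(workflow_fit, new_data = test_data) |>
  bind_cols(test_data)

head(workflow_predictions)
```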

Conclusion

This project applied data analysis and predictive modeling techniques in R to residential housing data to identify factors that affect house prices. The analysis covered data cleaning, feature engineering (price per square foot and house age), descriptive statistics, exploratory visualizations, and grouped summaries with dplyr.

Tools from the tidymodels ecosystem (parsnip, recipes, and workflows) were combined into a structured predictive modeling pipeline. The linear regression model shows how living area, overall quality, and garage capacity enter a sale price prediction. Because the modeling dataset is simulated with independently drawn columns, its coefficients illustrate the mechanics of the pipeline rather than real market effects.

The project also highlighted the importance of storytelling and data visualization in communicating analytical insights effectively. Overall, this analysis demonstrates practical applications of predictive analytics and machine learning techniques for real estate decision-making and housing price estimation.

Project Story and Interpretation

This project focuses on understanding the factors that influence residential house prices using predictive analytics in R. The analysis begins with data preparation, cleaning, and transformation techniques to improve the quality of the dataset and create meaningful variables for analysis.

Several exploratory visualizations were created to examine relationships between housing features and sale price, covering living area and garage capacity, while grouped summaries compare prices across bedroom counts. In the Ames summary, average price does not rise steadily with bedroom count: four-bedroom homes average about 216,000, while six-bedroom homes average about 160,000, and the highest and lowest counts rest on very few sales. House age is computed as a feature but is not yet examined in any chart or summary.

Advanced data manipulation techniques using dplyr were applied to summarize housing trends and calculate important statistics such as average sale price, average living area, and price per square foot. Additional descriptive statistics were used to better understand the distribution and behavior of the housing data.

The predictive modeling section uses tidymodels, recipes, workflows, and linear regression techniques to build a machine learning pipeline for house price prediction. Recipes were used for preprocessing and normalization, while workflows combined preprocessing and modeling into a structured analytical process.

The final model shows how a fitted regression can be used to estimate house prices from housing features. The Actual vs Predicted plot compares model predictions with observed prices on the test set: points near the red reference line (slope 1, intercept 0) are accurate predictions, and the spread around that line gives a visual sense of prediction error.