This project analyzes housing data using R to identify important factors influencing house prices. The project includes data cleaning, exploratory data analysis, visualizations, and predictive modeling.
Load Packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.3 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
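The plots that follow refer to a data frame named housing, but the step that creates it is not shown. A minimal sketch of that step, assuming the data match the Ames housing data shipped with the modeldata package (an assumption based on the column names Sale_Price and Gr_Liv_Area, not something this report states):

```r
library(tidyverse)

# Assumption: `housing` is the Ames data from the modeldata package.
# If the data live in a file instead, replace these two lines with a
# readr::read_csv() call on that file.
data(ames, package = "modeldata")
housing <- as_tibble(ames)

# Quick structural check of the two columns used in the plots
housing %>%
  select(Sale_Price, Gr_Liv_Area) %>%
  glimpse()
```

If the real source is different, only the two lines that create housing need to change; the plotting code depends only on the column names.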
Distribution of House Prices
ggplot(housing, aes(x = Sale_Price)) +
  geom_histogram(fill = "skyblue", bins = 20) +
  labs(
    title = "Distribution of House Prices",
    x = "Sale Price",
    y = "Count"
  )
Living Area vs Sale Price
ggplot(housing, aes(x = Gr_Liv_Area, y = Sale_Price)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Living Area vs Sale Price",
    x = "Living Area",
    y = "Sale Price"
  )
This project applied data analytics and predictive modeling in R to residential housing data in order to identify the factors that affect house prices. The analysis covered data cleaning, feature engineering, descriptive statistics, exploratory visualizations, and data manipulation with dplyr.
The tidymodels framework, specifically its recipes, workflows, and parsnip packages, was used to build a structured predictive modeling pipeline. The linear regression model showed how variables such as living area, garage capacity, and other housing characteristics contribute to predicted sale price.
The project also highlighted the importance of storytelling and data visualization in communicating analytical insights effectively. Overall, this analysis demonstrates practical applications of predictive analytics and machine learning techniques for real estate decision-making and housing price estimation.
Project Story and Interpretation
This project focuses on understanding the factors that influence residential house prices using predictive analytics in R. The analysis begins with data preparation, cleaning, and transformation techniques to improve the quality of the dataset and create meaningful variables for analysis.
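The preparation step described above could look roughly like the following. This is a sketch under stated assumptions: the data are taken to be the Ames data from modeldata, and the derived columns House_Age and Price_Per_SqFt are illustrative names chosen here, not names confirmed by the report.

```r
library(dplyr)
library(tidyr)

# Assumption: the Ames data from modeldata stand in for the project's data
data(ames, package = "modeldata")

housing_clean <- ames %>%
  as_tibble() %>%
  distinct() %>%                                   # drop exact duplicate rows
  drop_na(Sale_Price, Gr_Liv_Area) %>%             # require the key fields
  filter(Sale_Price > 0, Gr_Liv_Area > 0) %>%      # guard against invalid values
  mutate(
    House_Age      = Year_Sold - Year_Built,       # hypothetical derived feature
    Price_Per_SqFt = Sale_Price / Gr_Liv_Area      # hypothetical derived feature
  )

housing_clean %>%
  select(Sale_Price, Gr_Liv_Area, House_Age, Price_Per_SqFt) %>%
  head()
```

Filtering out non-positive prices and areas before dividing keeps Price_Per_SqFt finite.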
Several exploratory visualizations were created to examine relationships between housing features and sale price. Living area, garage capacity, number of bedrooms, and house age all showed a visible relationship with sale price, although plots of this kind show association rather than causation. The charts help explain market trends and provide business insight into the housing characteristics that track with price.
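One way to put numbers next to the visual impressions is to compute the correlation of each feature with sale price. The sketch below assumes the Ames data from modeldata and its column names (Garage_Cars, Bedroom_AbvGr, Year_Built, Year_Sold); House_Age is a derived column introduced here for illustration.

```r
library(dplyr)

# Assumption: the Ames data from modeldata stand in for the project's data
data(ames, package = "modeldata")

price_cor <- ames %>%
  mutate(House_Age = Year_Sold - Year_Built) %>%
  summarise(across(
    c(Gr_Liv_Area, Garage_Cars, Bedroom_AbvGr, House_Age),
    # Pearson correlation of each feature with Sale_Price
    ~ cor(.x, Sale_Price, use = "complete.obs")
  ))

price_cor
```

Pearson correlation only captures linear association, so it complements the scatter plots rather than replacing them.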
dplyr was used to summarize housing trends and to calculate statistics such as average sale price, average living area, and price per square foot. Additional descriptive statistics were used to better understand the distribution of the housing data.
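A summary of this kind can be sketched with group_by() and summarise(). Grouping by Neighborhood is an assumption made for illustration, as are the Ames data from modeldata and the output column names.

```r
library(dplyr)

# Assumption: the Ames data from modeldata stand in for the project's data
data(ames, package = "modeldata")

neighborhood_summary <- ames %>%
  mutate(Price_Per_SqFt = Sale_Price / Gr_Liv_Area) %>%
  group_by(Neighborhood) %>%                 # hypothetical grouping variable
  summarise(
    n_sales            = n(),
    avg_sale_price     = mean(Sale_Price),
    avg_living_area    = mean(Gr_Liv_Area),
    avg_price_per_sqft = mean(Price_Per_SqFt),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_sale_price))              # most expensive groups first

neighborhood_summary
```

Keeping n_sales in the output makes it easier to spot groups whose averages rest on only a handful of sales.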
The predictive modeling section uses the tidymodels framework to build a pipeline for house price prediction with linear regression. A recipe handles preprocessing and normalization, and a workflow bundles that recipe with the model specification so that both are fitted and applied together.
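The pipeline described above can be sketched as follows. The predictor set, the 80/20 split, the seed, and the use of the Ames data from modeldata are all assumptions made for illustration; the report does not show the exact model formula.

```r
library(tidymodels)

# Assumption: the Ames data from modeldata stand in for the project's data
data(ames, package = "modeldata")
housing <- ames %>% mutate(House_Age = Year_Sold - Year_Built)

# Hold out a test set so the model can be evaluated on unseen rows
set.seed(123)
housing_split <- initial_split(housing, prop = 0.8)
housing_train <- training(housing_split)
housing_test  <- testing(housing_split)

# Recipe: declare the model formula and normalize numeric predictors
price_recipe <- recipe(
  Sale_Price ~ Gr_Liv_Area + Garage_Cars + Bedroom_AbvGr + House_Age,
  data = housing_train
) %>%
  step_normalize(all_numeric_predictors())

# Model specification: ordinary least squares via the "lm" engine
lm_spec <- linear_reg() %>% set_engine("lm")

# Workflow: bundle preprocessing and model, then fit on the training set
price_workflow <- workflow() %>%
  add_recipe(price_recipe) %>%
  add_model(lm_spec)

price_fit <- fit(price_workflow, data = housing_train)

tidy(price_fit)  # coefficient table for the fitted linear model
```

Because the recipe lives inside the workflow, the normalization statistics are estimated on the training data only and reused automatically when predicting on new data.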
The final model demonstrates how predictive analytics can be used to estimate house prices based on important housing features. The Actual vs Predicted visualization shows the relationship between model predictions and real housing prices, helping evaluate model performance and prediction accuracy.
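An Actual vs Predicted plot of this kind, together with standard regression metrics, could be produced along these lines. This is a self-contained sketch with a deliberately small model; the two predictors, the split, and the Ames data from modeldata are assumptions, not the report's actual configuration.

```r
library(tidymodels)

# Assumption: the Ames data from modeldata stand in for the project's data
data(ames, package = "modeldata")

set.seed(123)
housing_split <- initial_split(ames, prop = 0.8)

# Compact workflow: formula preprocessor plus a linear regression model
fit_wf <- workflow() %>%
  add_formula(Sale_Price ~ Gr_Liv_Area + Garage_Cars) %>%
  add_model(linear_reg()) %>%
  fit(data = training(housing_split))

# Attach predictions (.pred) to the held-out rows
results <- augment(fit_wf, new_data = testing(housing_split))

# RMSE, R-squared, and MAE on the test set
metrics(results, truth = Sale_Price, estimate = .pred)

# Points on the dashed line are perfect predictions
ggplot(results, aes(x = Sale_Price, y = .pred)) +
  geom_point(alpha = 0.4) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Actual vs Predicted Sale Price",
    x = "Actual Sale Price",
    y = "Predicted Sale Price"
  )
```

Points far from the dashed line mark homes the model prices poorly, which is often a useful starting point for deciding which features to add next.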