Portfolio
I’m Mani Sharath Chandra, currently pursuing a Master’s in Data Science and Analytics. After earning my bachelor’s degree in Computer Science and Engineering, I worked at a product-based AI company. During that time, I realized my skills and knowledge in modeling and AI were somewhat limited. Now, as part of my master’s program, I have strengthened my analytical abilities by learning regression methods. To take my skills further, I plan to apply these techniques to a real-world problem: I’ve chosen a data set on house prices to implement a Generalized Linear Model (GLM). My goal is to predict house prices and create visualizations to better understand and communicate these predictions.
Objective
The primary objective of this project is to develop a robust understanding of linear regression analysis through its application in predicting real estate prices. This involves comprehensively exploring and preprocessing a data set, selecting relevant features, and applying statistical models to make informed predictions. The project aims to bridge theoretical knowledge with practical application, allowing students to grasp the complexities and nuances of real-world data analysis. Through this endeavor, students are expected to enhance their analytical skills, gain proficiency in using statistical software, and foster an ability to interpret and communicate their findings effectively. Additionally, this project seeks to cultivate critical thinking by challenging students to assess the accuracy and reliability of their models, understand the limitations of their analysis, and explore potential improvements. Ultimately, the project serves as a platform for students to apply statistical concepts and methodologies in a meaningful context, preparing them for future research or professional roles that require data-driven decision-making.
Project Overview
The project titled “Using Linear Regression to Predict the Prices of Houses” aims to build a statistical model that can predict residential property sale prices with a significant degree of accuracy. Utilizing a rich data set, the project applies linear regression, a foundational technique in statistical modeling and machine learning, which assumes a linear relationship between independent variables (such as square footage, number of bedrooms, and other house features) and the dependent variable (the sale price of the house). The data set includes a variety of features from lot size and neighborhood to physical attributes like overall quality and year built, providing a comprehensive set of factors that are hypothesized to influence house pricing. The analysis begins with a meticulous exploratory data examination to understand the underlying distributions and relationships, followed by rigorous data preprocessing to handle missing values, outliers, and categorical variables. After establishing a clean and informative data set, the project progresses through feature selection to identify the most impactful predictors, model fitting using the linear regression algorithm, and model validation to assess predictive power and accuracy. Ultimately, the endeavor not only enhances predictive modeling techniques but also offers valuable insights into the real estate market, aiding buyers, sellers, and investors in making informed decisions based on key property characteristics. The project exemplifies the synthesis of statistical theory with practical application, underpinned by the robust use of R programming to manage, analyze, and model complex data sets.
Methods of Approach
In this project, several methodological steps are employed to construct a reliable linear regression model capable of predicting house prices. The journey begins with exploratory data analysis (EDA), a crucial phase in which visual and statistical tools are used to develop a deep understanding of the dataset. This includes summarizing key statistics, visualizing data distributions, and identifying potential relationships between variables. Following EDA, data preprocessing takes place: a multifaceted process involving the treatment of missing data through imputation or removal, and the detection and handling of outliers so that extreme values do not skew the model.
Subsequent to the preprocessing, feature selection is undertaken, which is a technique used to select those variables that contribute most to the prediction of the sale price, effectively reducing the model’s complexity and improving its interpretability. The core of the project revolves around fitting a linear regression model, a statistical technique that estimates the relationships between the dependent variable (house price) and one or more independent variables (house characteristics). The model assumes a linear relationship between these variables and finds the line of best fit by minimizing the sum of the squares of the differences between the observed values and the values predicted by the model.
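To make the least-squares idea concrete, here is a minimal illustrative sketch (not part of the original analysis) showing that, for a single predictor, the closed-form least-squares estimates match what R’s lm() computes; the simulated x and y below are hypothetical stand-ins for a house feature and its sale price.
# Illustration only: least squares minimizes the sum of squared residuals.
# For one predictor, the closed-form solution matches lm().
set.seed(1)
x <- rnorm(100, mean = 1500, sd = 400)         # hypothetical living area
y <- 50000 + 100 * x + rnorm(100, sd = 20000)  # hypothetical sale price
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(b0, b1)
coef(lm(y ~ x))  # agrees with c(b0, b1) up to numerical precision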
Finally, the model’s performance is validated using metrics such as R-squared, which indicates the proportion of the variance in the dependent variable that is predictable from the independent variables, and the Mean Squared Error (MSE), which measures the average squared difference between predicted and observed sale prices. The model is fine-tuned through iteration, where the steps from feature selection to validation may be repeated to refine the model’s predictive power. All of these methods are conducted within the R programming environment, leveraging its comprehensive suite of packages and functions for statistical analysis and modeling, ensuring a robust and systematic approach to predicting house prices.
Importing required libraries
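The chunk that attaches these packages is not shown in the rendered output; based on the startup messages below, a plausible reconstruction is (the call order is inferred from the masking messages):
library(MASS)        # statistical modeling utilities (loaded before tidyverse)
library(tidyverse)   # data manipulation (dplyr) and visualization (ggplot2)
library(tidymodels)  # modeling framework: rsample, parsnip, yardstick, ...
library(GGally)      # ggplot2 extensions (e.g., pairwise plots)
library(caret)       # model training helpers (attaches lattice)
library(corrplot)    # correlation matrix visualization
library(Metrics)     # performance metrics: rmse, mae, ...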
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.2.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.2.0
## ✔ recipes 1.0.8
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
##
## The following object is masked from 'package:purrr':
##
## lift
## corrplot 0.92 loaded
##
## Attaching package: 'Metrics'
##
## The following objects are masked from 'package:caret':
##
## precision, recall
##
## The following objects are masked from 'package:yardstick':
##
## accuracy, mae, mape, mase, precision, recall, rmse, smape
The libraries above serve different purposes in R: ggplot2 and GGally are used for data visualization; MASS, caret, and tidymodels for statistical modeling; tidyverse and dplyr for data manipulation; corrplot for correlation matrix visualization; and Metrics for model performance metrics.
Loading dataset
About the dataset: it contains 1460 observations (rows) of 81 variables (columns).
The main factors considered in the data set are SalePrice, OverallQual, MSSubClass, YearBuilt, TotalBsmtSF, GrLivArea, and GarageCars.
Data Exploration: It is the process of examining and analyzing a dataset to understand its structure, patterns, and relationships between variables. It involves summarizing the main characteristics of the data, identifying trends or anomalies, and gaining insights that can inform further analysis or decision-making.
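The loading chunk is likewise not shown in the rendered output; a minimal reconstruction that would produce the output below, assuming the data come from the Kaggle “House Prices: Advanced Regression Techniques” training file (the file name train.csv is an assumption):
# Load the dataset (file name is an assumption)
data <- read.csv("train.csv", stringsAsFactors = FALSE)
# Explore the data: per-column summaries, the target, a key predictor,
# and the overall structure (1460 obs. of 81 variables)
summary(data)
summary(data$SalePrice)
summary(data$OverallQual)
str(data)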
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 6.099 7.000 10.000
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
data_clean <- na.omit(data[c("SalePrice", "OverallQual")])
model <- lm(SalePrice ~ OverallQual, data = data_clean)
This removes all rows from the ‘data’ data frame that contain missing values (NA) in either the ‘SalePrice’ or ‘OverallQual’ columns, creating a new data frame ‘data_clean’ that contains only complete cases for these two variables, and then fits a simple linear regression of SalePrice on OverallQual. The result is a cleaner data set, free of incomplete records, that can be used for further analysis or modeling.
Distribution of Sale Prices
ggplot(data, aes(x = SalePrice)) +
geom_histogram(bins = 30, fill = "#AFEEEE", color = "black") +
theme_minimal() +
labs(title = "Distribution of Sale Prices", x = "Sale Price", y = "Frequency")The highest frequency of sale prices concentrated in a particular range. This suggests that there is a common sale price range where most of the houses fall. The sale prices have a wide range, extending from lower values to several higher values, indicating variability in the housing prices. The width of the histogram shows that there is a considerable spread in sale prices, which can be attributed to the different sizes, qualities, and locations of the houses, among other factors.The histogram is right-skewed, meaning there are a few houses with sale prices significantly higher than the typical price range. These could be luxury or larger houses that are not as common as the moderately priced ones.
Scatterplot of Overall Quality and Sale Price
ggplot(data, aes(x = OverallQual, y = SalePrice)) +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(title = "Sale Price vs Overall Quality",
x = "Overall Quality",
y = "Sale Price") As the overall quality rating increases, there is a general trend of increasing sale price. This suggests that higher quality houses tend to sell for more. there is also considerable variability at each level of overall quality. This indicates that other factors in addition to overall quality likely affect the sale price and that there’s a range within which prices vary for homes of similar quality. The scatter plot reveals potential outliers, particularly at higher quality levels where some homes have much higher sale prices than the rest. These could be exceptional properties or could indicate data errors or special features not captured by the overall quality variable alone.
Correlation Heatmap
A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. The value of a correlation coefficient ranges from -1 to 1. A coefficient close to 1 implies a strong positive correlation: as one variable increases, the other tends to also increase. A coefficient close to -1 indicates a strong negative correlation: as one variable increases, the other tends to decrease. A coefficient around 0 implies little to no linear relationship between the variables. A correlation matrix can be visualized through a heatmap, which makes it easier to spot relationships by using color coding to represent the strength and direction of the correlations. This method is widely used in exploratory data analysis, providing a quick overview of how different variables relate to each other. In the code below, the dataset is subset to include only a selection of columns believed to be significant in predicting SalePrice. The columns selected are OverallQual (overall material and finish quality), MSSubClass (the type of dwelling involved in the sale), YearBuilt (original construction date), TotalBsmtSF (total square feet of basement area), GrLivArea (above-grade living area in square feet), and GarageCars (size of the garage in car capacity). These variables are thought to be influential in the real estate market and could provide insights into factors that affect house prices.
# Subset the data to include only the main factors
main_factors <- data[, c("SalePrice", "OverallQual", "MSSubClass", "YearBuilt", "TotalBsmtSF", "GrLivArea", "GarageCars")]
# Calculate the correlation matrix for the subset
cor_matrix <- cor(main_factors, use = "complete.obs")
# Create the correlation heatmap
corrplot(cor_matrix, method = "color",
tl.col = "black",
tl.srt = 45,
tl.cex = 0.8)
This correlation heatmap clearly illustrates how certain features of a house are related to its sale price. Notably, the overall quality of a house strongly correlates with its sale price, aligning with the intuitive notion that higher quality translates into higher value. Similarly, practical areas of a home such as the basement size (TotalBsmtSF), the above-ground living area (GrLivArea), and the garage capacity (GarageCars) also show significant positive correlations with sale price, suggesting they’re key factors buyers consider when evaluating a property.
# Visualize relationship between LotArea and SalePrice
ggplot(data, aes(x = LotArea, y = SalePrice)) +
geom_point() +
geom_smooth(method = "lm", color = "blue") +
xlim(0, 50000) # Limiting LotArea for better visualization
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 11 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values (`geom_point()`).
This scatter plot with a fitted regression line illustrates the relationship between ‘LotArea’ and ‘SalePrice’. The plot indicates a positive correlation: as ‘LotArea’ increases, ‘SalePrice’ also tends to increase. The regression line summarizes this relationship, and the shaded area around the line represents a confidence interval, suggesting where the true regression line is likely to fall at a given level of confidence.
Splitting data sets into train and test data.
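The splitting chunk itself does not appear in the rendered output, although train_data and test_data are used in later chunks. A minimal sketch using rsample (attached with tidymodels); the 80/20 proportion and the seed are assumptions:
# Reproducible train/test split (proportion and seed are assumptions)
set.seed(123)
data_split <- initial_split(data, prop = 0.8)
train_data <- training(data_split)
test_data  <- testing(data_split)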
1. Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation
Probability theory is the mathematical foundation underlying statistical modeling, enabling us to quantify uncertainty, make predictions about unknown events, and draw inferences from sample data about a population. In statistical modeling, probability provides the framework for understanding how likely various outcomes are, given a set of assumptions and observed data. Statistical inference involves drawing conclusions about populations from sample data. It typically includes estimating population parameters (like means and variances), testing hypotheses, and making predictions. Probability theory underpins inference by allowing us to calculate the likelihood of observing our sample data under different assumptions about the population. For example, if we assume that a dataset follows a normal distribution, probability theory lets us calculate how likely we are to observe a sample mean within a certain range. This calculation can inform our confidence in the sample mean as an estimate of the population mean.
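As a small illustration of this idea (not part of the original analysis), a 95% confidence interval for the mean sale price follows directly from the sampling distribution of the sample mean:
# 95% confidence interval for the mean SalePrice (illustration only)
x  <- data$SalePrice
n  <- length(x)
se <- sd(x) / sqrt(n)                        # standard error of the mean
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se
# Equivalent shortcut: t.test(x)$conf.int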
MLE is a method used to estimate the parameters of a statistical model. The “likelihood” refers to the probability of observing the sample data given a set of parameters. In MLE, we choose the parameters that maximize this likelihood, under the assumption that the sample we have is the most probable representation of the underlying population. The logic behind MLE is intuitive: it finds the parameter values that make the observed data most likely. This approach is widely used across various types of statistical models, including both simple linear regression and more complex models.
# Simple linear regression using MLE
model <- lm(SalePrice ~ OverallQual, data = data)
# Summarize the model to see MLE estimates of parameters
summary(model)
##
## Call:
## lm(formula = SalePrice ~ OverallQual, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -198152 -29409 -1845 21463 396848
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -96206.1 5756.4 -16.71 <2e-16 ***
## OverallQual 45435.8 920.4 49.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48620 on 1458 degrees of freedom
## Multiple R-squared: 0.6257, Adjusted R-squared: 0.6254
## F-statistic: 2437 on 1 and 1458 DF, p-value: < 2.2e-16
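For a linear model with Gaussian errors, the least-squares estimates are also the maximum likelihood estimates. A quick illustrative check: the Gaussian log-likelihood computed by hand at the fitted values agrees with R’s logLik():
# Manual Gaussian log-likelihood of the fitted model (illustration)
res    <- residuals(model)
n      <- length(res)
sigma2 <- sum(res^2) / n  # MLE of the error variance (RSS / n)
manual <- sum(dnorm(res, mean = 0, sd = sqrt(sigma2), log = TRUE))
c(manual = manual, logLik = as.numeric(logLik(model)))  # should agree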
2. Determine and Apply the Appropriate Generalized Linear Model
This section demonstrates the application of a Generalized Linear Model (GLM) to predict a continuous outcome, specifically the sale price of houses (SalePrice), using predictors such as lot area (LotArea), overall quality (OverallQual), and year built (YearBuilt). This approach models the relationship between the target variable and selected features using a linear regression framework, a special case of the GLM tailored to continuous outcomes. The glm function in R is used with the Gaussian family and an identity link function, aligning with the assumptions of linear regression. The GLM approach for predicting house prices offers a flexible framework for exploring linear relationships between a continuous target variable and a set of predictors. The choice of predictors should be guided by both theoretical reasoning and exploratory data analysis, while the interpretation of the model summary offers insights into the dynamics influencing house prices.
# Applying a GLM with a Gaussian family
glm_model <- glm(SalePrice ~ LotArea + OverallQual + YearBuilt, data = data, family = gaussian())
# Model summary
summary(glm_model)
##
## Call:
## glm(formula = SalePrice ~ LotArea + OverallQual + YearBuilt,
## family = gaussian(), data = data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.898e+05 9.218e+04 -7.483 1.25e-13 ***
## LotArea 1.494e+00 1.211e-01 12.334 < 2e-16 ***
## OverallQual 4.044e+04 1.066e+03 37.940 < 2e-16 ***
## YearBuilt 3.086e+02 4.854e+01 6.358 2.72e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2101496473)
##
## Null deviance: 9.2079e+12 on 1459 degrees of freedom
## Residual deviance: 3.0598e+12 on 1456 degrees of freedom
## AIC: 35490
##
## Number of Fisher Scoring iterations: 2
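Since a Gaussian GLM with the identity link is ordinary linear regression, its coefficients should match an lm() fit of the same formula; a quick sanity check (illustrative):
# A Gaussian GLM with identity link reproduces ordinary least squares
lm_check <- lm(SalePrice ~ LotArea + OverallQual + YearBuilt, data = data)
all.equal(coef(glm_model), coef(lm_check))  # TRUE up to numerical tolerance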
3. Conduct Model Selection for a Set of Candidate Models
Conducting model selection involves comparing different statistical models to choose the best one for predicting your target variable, based on criteria like simplicity, predictive performance, and the underlying assumptions of each model. In the context of predicting house prices, you might have several candidate models that include different subsets of predictors or use different forms of the predictors (e.g., polynomial terms, interaction terms). Model selection is an iterative and critical step in the modeling process: by carefully considering different models and using objective criteria for comparison, you can refine your approach based on empirical evidence.
The R script I’ve crafted serves as a practical application of theoretical concepts learned in class, specifically around model building and selection using generalized linear models (GLMs) to predict house prices. In the script, I construct three different models with varying levels of complexity, starting with simpler models and progressively adding more predictors. model2 incorporates LotArea and OverallQual as predictors, while model3 further includes YearBuilt and YearRemodAdd, allowing me to explore how the inclusion of additional variables affects the model’s performance.
I then employ the Akaike Information Criterion (AIC) to evaluate and compare these models. The AIC helps me understand which model achieves the best balance between accuracy and complexity, providing a quantitative measure to guide my selection. This process not only reinforces my understanding of statistical methods but also enhances my analytical skills by applying these methods to real-world data. It’s an invaluable exercise in seeing firsthand how adding more variables can impact a model’s ability to predict outcomes accurately, preparing me for more advanced studies or professional tasks in data analysis and modeling.
# Fit additional models
model2 <- glm(SalePrice ~ LotArea + OverallQual, data = data, family = gaussian())
model3 <- glm(SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd, data = data, family = gaussian())
# Calculate AIC for each model
aic_values <- sapply(list(model, model2, model3), AIC)
names(aic_values) <- c("model", "model2", "model3")
# Calculate RMSE for each model
predictions_model1 <- predict(model, newdata = test_data)
predictions_model2 <- predict(model2, newdata = test_data)
predictions_model3 <- predict(model3, newdata = test_data)
rmse_values <- c(
model1 = rmse(test_data$SalePrice, predictions_model1),
model2 = rmse(test_data$SalePrice, predictions_model2),
model3 = rmse(test_data$SalePrice, predictions_model3)
)
# Print AIC values
print(aic_values)
## model model2 model3
## 35659.49 35527.52 35474.54
# Print RMSE values
print(rmse_values)
## model1 model2 model3
## 46840.06 43672.18 42526.54
# Identify the model with the lowest AIC and RMSE
best_model_aic <- names(which.min(aic_values))
best_model_rmse <- names(which.min(rmse_values))
# Output the best models
cat("The best model based on AIC is:", best_model_aic, "\n")## The best model based on AIC is: model3
## The best model based on RMSE is: model3
AIC is a measure of the relative quality of statistical models for a given set of data. A lower AIC value indicates a better fit of the model to the data when balancing goodness-of-fit against model complexity. RMSE measures how accurately the model predicts the response variable, with a lower RMSE indicating better predictive accuracy. The output shows that model3 has the lowest Akaike Information Criterion (AIC) and Root Mean Squared Error (RMSE) values among the three models, suggesting it is the best model both in terms of fit to the data and predictive accuracy.
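To connect these values to the definition, AIC = 2k - 2 log-likelihood, where k counts the estimated parameters (the regression coefficients plus the error variance for a Gaussian model); an illustrative check:
# AIC computed from its definition for model3 (illustration)
k <- length(coef(model3)) + 1  # coefficients plus the error variance
aic_manual <- 2 * k - 2 * as.numeric(logLik(model3))
c(manual = aic_manual, built_in = AIC(model3))  # should agree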
4. Communicate the Results
As a statistics student diving into the practical application of statistical concepts, the task of using R to analyze the relationship between house prices and various predictors offers a rich educational experience. The process of examining model coefficients through coef(summary(glm_model)) is particularly instructive, as it allows me to quantitatively assess how features such as OverallQual influence SalePrice. This analysis not only deepens my understanding of the impact of each feature within the model but also provides a foundation for predicting outcomes and testing hypotheses in real-world contexts. Furthermore, visualizing these relationships using ggplot2 enhances my ability to interpret and present data effectively, showcasing the practical importance of visual data exploration in revealing complex relationships and trends within the dataset.
The application of a generalized linear model (GLM) and the integration of a smoothed trend line in the scatter plot using geom_smooth() with a GLM further solidify my grasp of statistical modeling techniques. This hands-on approach helps me understand how different model specifications, such as choosing a Gaussian family for continuous outcomes like SalePrice, can be tailored to analyze specific types of data. Moreover, crafting these visual and numerical outputs provides invaluable practice in communicating statistical findings, a critical skill for any aspiring data scientist. This exercise not only reinforces technical skills but also enhances my ability to articulate complex analyses clearly and effectively, preparing me for future roles that require adept data analysis and clear communication.
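The coefficient table below is the output of the call discussed above:
# Inspect estimates, standard errors, t values, and p-values
coef(summary(glm_model))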
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.897716e+05 92179.695326 -7.482902 1.252075e-13
## LotArea 1.493851e+00 0.121115 12.334161 2.563956e-33
## OverallQual 4.043791e+04 1065.833576 37.940173 1.327982e-219
## YearBuilt 3.086028e+02 48.535679 6.358266 2.724419e-10
# Visualization
library(ggplot2)
ggplot(data, aes(x = OverallQual, y = SalePrice)) +
geom_point(aes(color = factor(YearBuilt)), alpha = 0.5) +
geom_smooth(method = "glm", method.args = list(family = gaussian()), se = FALSE, color = "red") +
labs(title = "Sale Price vs. Overall Quality")## `geom_smooth()` using formula = 'y ~ x'
5. Use R to Fit and Assess Statistical Models
The below snippet is used to train and evaluate a generalized linear model (GLM) for predicting house prices based on the features LotArea, OverallQual, and YearBuilt. The process includes splitting the data into training and testing sets to ensure the model can be evaluated on unseen data. After fitting the model on the training set, predictions are made on the testing set, and the model’s performance is assessed using the Mean Squared Error (MSE), which quantifies the average prediction error squared. This procedure helps validate the accuracy of the model and checks its generalizability to new data.
# Fit model on training data
fit_model <- glm(SalePrice ~ LotArea + OverallQual + YearBuilt, data = train_data, family = gaussian())
# Predict on test data
predictions <- predict(fit_model, newdata = test_data)
# Assessing performance using Mean Squared Error (MSE)
mse <- mean((predictions - test_data$SalePrice)^2)
print(paste("MeanSquaredError:", mse))## [1] "MeanSquaredError: 1847446755.33903"
Conclusion
Throughout the project, a rich dataset featuring a wide array of house characteristics was carefully explored and processed. Initial data analysis involved summarizing the dataset, scrutinizing the distribution of the target variable ‘SalePrice’, and ensuring data quality by handling missing values and potential outliers. This foundational step was critical for establishing a reliable dataset for modeling. Feature selection pinpointed influential variables such as ‘LotArea’, ‘OverallQual’, and ‘YearBuilt’, which were then utilized to fit a Generalized Linear Model (GLM). The model assumed a Gaussian distribution, a natural choice for a continuous outcome variable like house prices. Diagnostic plots, particularly the residual plot, offered a visual assessment of the model fit, revealing whether the assumptions of linear regression were satisfied. To refine the model, additional predictors and transformations were considered, with the Akaike Information Criterion (AIC) serving as a statistical guide to compare and select the best-performing model. The chosen model was then subjected to a critical evaluation of its predictive power on a testing set, using the Mean Squared Error (MSE) as a measure of accuracy. The Multiple R-squared value of 0.6257 from the simple single-predictor model indicates that approximately 62.57% of the variability in sale prices can be explained by overall quality alone. Throughout the project, the methods and statistical tests applied demonstrate that the overall quality of a house, its size (both lot and living area), and its age are significant predictors of its sale price. The models provide valuable insights into the factors that contribute to the value of a house and underscore the importance of using statistical evidence to guide real estate price predictions. With an iterative approach to model building, selection, and validation, the project exemplifies how linear regression can be a powerful tool for understanding complex real estate market dynamics. However, the residual analysis suggests that there is still unexplained variability that could be addressed with additional predictors or a more complex model.
Reflection
Throughout our course, I did more than just show up, turn in assignments, and interact on Teams. I often took part in online class discussions, asking questions that helped me and my classmates learn better and made our classes more interactive. My enthusiasm didn’t stay within the classes: I really enjoyed the small competitions we had, where I could use what we learned in a fun and cooperative way. These activities improved my problem-solving and teamwork skills. I also spent time on GitHub, where I helped and learned from others, gave useful advice, and learned how to use the site well, which helped all of us learn more together. In short, I fully participated in many parts of our course, from online discussions and competitions to assignments and helping on GitHub. I aimed to improve both my own skills and the learning of everyone in our course group, truly showing what it means to take part actively.