Portfolio

I’m Mani Sharath Chandra, currently pursuing a Master’s in Data Science and Analytics. After earning my bachelor’s degree in Computer Science and Engineering, I worked at a product-based AI company. During that time, I realized my skills and knowledge in modeling and AI tasks were somewhat limited. Now, as part of my master’s program, I have enhanced my analytical abilities through learning regression methods. To take my skills further, I plan to apply these techniques to a real-world problem. I’ve chosen a data set on house prices to implement a Generalized Linear Model (GLM). My goal is to predict changes in house prices and create visualizations to better understand and communicate these predictions.

Objective

The primary objective of this project is to develop a robust understanding of linear regression analysis through its application in predicting real estate prices. This involves comprehensively exploring and preprocessing a data set, selecting relevant features, and applying statistical models to make informed predictions. The project aims to bridge theoretical knowledge with practical application, allowing students to grasp the complexities and nuances of real-world data analysis. Through this endeavor, students are expected to enhance their analytical skills, gain proficiency in using statistical software, and foster an ability to interpret and communicate their findings effectively. Additionally, this project seeks to cultivate critical thinking by challenging students to assess the accuracy and reliability of their models, understand the limitations of their analysis, and explore potential improvements. Ultimately, the project serves as a platform for students to apply statistical concepts and methodologies in a meaningful context, preparing them for future research or professional roles that require data-driven decision-making.

Project Overview

The project titled “Using Linear Regression to Predict the Prices of Houses” aims to build a statistical model that can predict residential property sale prices with a significant degree of accuracy. Utilizing a rich data set, the project applies linear regression, a foundational technique in statistical modeling and machine learning, which presumes a linear relationship between independent variables (such as square footage, number of bedrooms, and other house features) and the dependent variable (the sale price of the house). The data set includes a variety of features from lot size and neighborhood to physical attributes like overall quality and year built, providing a comprehensive set of factors that are hypothesized to influence house pricing. The analysis begins with a meticulous exploratory data examination to understand the underlying distributions and relationships, followed by rigorous data preprocessing to handle missing values, outliers, and categorical variables. After establishing a clean and informative data set, the project progresses through feature selection to identify the most impactful predictors, model fitting using the linear regression algorithm, and model validation to assess predictive power and accuracy. Ultimately, the endeavor not only enhances predictive modeling techniques but also offers valuable insights into the real estate market, aiding buyers, sellers, and investors in making informed decisions based on key property characteristics. The project exemplifies the synthesis of statistical theory with practical application, underlined by the robust use of R programming to manage, analyze, and model complex data sets.

Methods of Approach

In this project, several methodological steps are employed to construct a reliable linear regression model capable of predicting house prices. The work begins with exploratory data analysis (EDA), a crucial phase in which visual and statistical tools are used to build a thorough understanding of the dataset. This includes summarizing key statistics, visualizing data distributions, and identifying potential relationships between variables. Following EDA, data preprocessing takes place. It involves treating missing data through imputation or removal, and detecting and handling outliers so that they do not skew the model.
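As a minimal sketch of these two preprocessing steps, the base R snippet below imputes missing values with the median and flags outliers with the 1.5 × IQR rule. The vector `lot_frontage` is a hypothetical stand-in for a numeric column, and both the median imputation and the IQR rule are illustrative choices, not necessarily the ones used later in this project.

```r
# Hypothetical toy values standing in for a numeric column with gaps
lot_frontage <- c(65, 80, NA, 60, 84, 85, 75, NA, 51, 313)

# Median imputation: replace each NA with the median of the observed values
med <- median(lot_frontage, na.rm = TRUE)
imputed <- ifelse(is.na(lot_frontage), med, lot_frontage)

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q <- unname(quantile(imputed, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- imputed < q[1] - fence | imputed > q[2] + fence

# Values flagged as outliers
imputed[is_outlier]
```

Whether a flagged value should be removed, capped, or kept is a judgment call that depends on whether it reflects a data error or a genuinely unusual house.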

Subsequent to the preprocessing, feature selection is undertaken, which is a technique used to select those variables that contribute most to the prediction of the sale price, effectively reducing the model’s complexity and improving its interpretability. The core of the project revolves around fitting a linear regression model, a statistical technique that estimates the relationships between the dependent variable (house price) and one or more independent variables (house characteristics). The model assumes a linear relationship between these variables and finds the line of best fit by minimizing the sum of the squares of the differences between the observed values and the values predicted by the model.
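The least-squares idea described above can be sketched on simulated data. With a single predictor, the line of best fit has a closed-form solution, and `lm()` should return the same coefficients. The variables `quality` and `price` and the numbers used to generate them are hypothetical and serve only as an illustration.

```r
set.seed(1)
# Simulated, hypothetical data: price depends linearly on quality plus noise
quality <- sample(1:10, 200, replace = TRUE)
price   <- 50000 + 30000 * quality + rnorm(200, sd = 20000)

# Closed-form least-squares estimates for one predictor
slope     <- cov(quality, price) / var(quality)
intercept <- mean(price) - slope * mean(quality)

# lm() minimizes the same sum of squared residuals
fit <- lm(price ~ quality)

# The two rows should agree up to floating-point error
rbind(closed_form = c(intercept, slope), lm = coef(fit))
```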

Finally, the model’s performance is validated using metrics such as R-squared, which indicates the proportion of the variance in the dependent variable that is explained by the independent variables, and Mean Squared Error (MSE), which is the average of the squared differences between predicted and observed sale prices. The model is fine-tuned through iteration: the steps from feature selection to validation may be repeated to refine the model’s predictive power. All of these methods are carried out in the R programming environment, using its packages and functions for statistical analysis and modeling.
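To make these two metrics concrete, here is a small sketch that computes them by hand on simulated data and compares R-squared with the value reported by `summary()`. The variables `x` and `y` are hypothetical stand-ins for a predictor and the sale price.

```r
set.seed(2)
# Simulated, hypothetical predictor and response
x <- runif(100, 500, 3000)
y <- 20000 + 100 * x + rnorm(100, sd = 25000)

fit  <- lm(y ~ x)
pred <- fitted(fit)

# MSE: average squared difference between observed and predicted values
mse  <- mean((y - pred)^2)
# RMSE: square root of MSE, expressed in the units of the response
rmse <- sqrt(mse)
# R-squared: 1 minus residual sum of squares over total sum of squares
r2   <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2)

c(MSE = mse, RMSE = rmse, R2 = r2, R2_from_summary = summary(fit)$r.squared)
```

These are in-sample values. Metrics computed on held-out data are usually a better guide to predictive accuracy.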

Importing required libraries

library(ggplot2)
library(MASS)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ dplyr::select()   masks MASS::select()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(dplyr)
library(corrplot)
## corrplot 0.92 loaded
library(Metrics)
## 
## Attaching package: 'Metrics'
## 
## The following objects are masked from 'package:caret':
## 
##     precision, recall
## 
## The following objects are masked from 'package:yardstick':
## 
##     accuracy, mae, mape, mase, precision, recall, rmse, smape

I used the libraries above, which serve different purposes in R. ggplot2 and GGally are used for data visualization. MASS, caret, and tidymodels are used for statistical modeling, while tidyverse and dplyr are used for data manipulation. corrplot is used for visualizing correlation matrices, and Metrics provides model performance metrics.

Loading dataset

data <- read.csv("C:\\Users\\smsha\\OneDrive - Grand Valley State University\\Documents\\STA-631\\train.csv")

About Dataset: This dataset contains 1460 observations (rows) of 81 variables (columns).

The main factors considered in the data set are SalePrice, OverallQual, MSSubClass, YearBuilt, TotalBsmtSF, GrLivArea, and GarageCars.

Data Exploration: It is the process of examining and analyzing a dataset to understand its structure, patterns, and relationships between variables. It involves summarizing the main characteristics of the data, identifying trends or anomalies, and gaining insights that can inform further analysis or decision-making.

summary(data)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 
summary(data$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
summary(data$OverallQual)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.000   6.099   7.000  10.000
str(data)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
data_clean <- na.omit(data[c("SalePrice", "OverallQual")])
model <- lm(SalePrice ~ OverallQual, data = data_clean)

The first line keeps only the ‘SalePrice’ and ‘OverallQual’ columns of the ‘data’ data frame and removes every row with a missing value (NA) in either column, creating a new data frame ‘data_clean’ that contains only complete cases for these two variables. The summary output above lists no NA's for either column, so in this data set no rows are actually dropped, but the step guards against incomplete records. The second line fits a simple linear regression of ‘SalePrice’ on ‘OverallQual’ using ‘data_clean’.
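Before dropping rows, it can help to count how many values are missing in each column. The sketch below does this on a small hypothetical data frame, `toy`, which borrows three column names from the project data but contains made-up values.

```r
# Hypothetical small data frame; the values are made up for illustration
toy <- data.frame(
  SalePrice   = c(208500, 181500, NA, 140000, 250000),
  OverallQual = c(7, 6, 7, NA, 8),
  LotFrontage = c(65, 80, 68, 60, NA)
)

# Count missing values per column before removing anything
colSums(is.na(toy))

# Keep only rows that are complete in the two modeling columns
toy_clean <- na.omit(toy[c("SalePrice", "OverallQual")])

# Number of rows dropped by na.omit()
nrow(toy) - nrow(toy_clean)
```

Selecting the modeling columns first matters: calling `na.omit()` on the whole data frame would also drop rows whose only missing value is in a column the model never uses.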

Distribution of Sale Prices

ggplot(data, aes(x = SalePrice)) +
  geom_histogram(bins = 30, fill = "#AFEEEE", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Sale Prices", x = "Sale Price", y = "Frequency")

Sale prices are concentrated in a particular range: according to the summary above, half of the houses sold for between 129,975 and 214,000 (the first and third quartiles). The prices nevertheless span a wide range, from 34,900 to 755,000, indicating considerable variability. This spread can be attributed to the different sizes, qualities, and locations of the houses, among other factors. The histogram is right-skewed, meaning a few houses have sale prices well above the typical range, which is consistent with the mean (180,921) exceeding the median (163,000). These could be luxury or larger houses that are less common than the moderately priced ones.

Scatterplot of Overall Quality and Sale Price

ggplot(data, aes(x = OverallQual, y = SalePrice)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Sale Price vs Overall Quality",
       x = "Overall Quality",
       y = "Sale Price") 

As the overall quality rating increases, sale price generally increases as well, which suggests that higher-quality houses tend to sell for more. There is also considerable variability at each level of overall quality. This indicates that factors other than overall quality affect the sale price, so prices vary within a range for homes of similar quality. The scatter plot also reveals potential outliers, particularly at higher quality levels, where some homes have much higher sale prices than the rest. These could be exceptional properties, data errors, or homes with special features not captured by the overall quality variable alone.

Correlation Heatmap

A correlation matrix is a table of correlation coefficients in which each cell shows the correlation between two variables. A correlation coefficient ranges from -1 to 1. A coefficient close to 1 implies a strong positive correlation: as one variable increases, the other tends to increase too. A coefficient close to -1 indicates a strong negative correlation: as one variable increases, the other tends to decrease. A coefficient around 0 implies little to no linear relationship between the variables. A correlation matrix can be visualized as a heatmap, which uses color coding to represent the strength and direction of the correlations and so makes relationships easier to spot. This method is widely used in exploratory data analysis because it gives a quick overview of how the variables relate to each other.

In the code below, the dataset is subset to a selection of columns that are believed to be important for predicting SalePrice. The selected columns are OverallQual (overall material and finish quality), MSSubClass (the type of dwelling involved in the sale), YearBuilt (original construction date), TotalBsmtSF (total basement area in square feet), GrLivArea (above-grade (ground) living area in square feet), and GarageCars (garage size in car capacity). These variables are thought to be influential in the real estate market and could provide insight into the factors that affect house prices.

# Subset the data to include only the main factors
main_factors <- data[, c("SalePrice", "OverallQual", "MSSubClass", "YearBuilt", "TotalBsmtSF", "GrLivArea", "GarageCars")]

# Calculate the correlation matrix for the subset
cor_matrix <- cor(main_factors, use = "complete.obs")

# Create the correlation heatmap
corrplot(cor_matrix, method = "color",
         tl.col = "black",      
         tl.srt = 45,         
         tl.cex = 0.8)

This correlation heatmap shows how certain features of a house are related to its sale price. Notably, the overall quality of a house correlates strongly with its sale price, in line with the intuition that higher quality translates into higher value. Similarly, the basement size (TotalBsmtSF), the above-ground living area (GrLivArea), and the garage capacity (GarageCars) show clear positive correlations with sale price, suggesting that they are key factors buyers consider when evaluating a property.
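The numbers behind a heatmap like this are Pearson correlation coefficients. As a small illustration on simulated data, the sketch below computes one coefficient by hand and checks it against `cor()`. The variables `area` and `price` are hypothetical stand-ins for two numeric columns.

```r
set.seed(3)
# Simulated, hypothetical stand-ins for two numeric columns
area  <- runif(150, 500, 3000)
price <- 30000 + 90 * area + rnorm(150, sd = 40000)

# Pearson correlation: covariance scaled by both standard deviations
r_manual <- cov(area, price) / (sd(area) * sd(price))

# cor() returns the same value for two vectors
r_builtin <- cor(area, price)

# On a data frame, cor() returns the full matrix with 1s on the diagonal
cor_matrix <- cor(data.frame(area, price))
cor_matrix
```

Adding `addCoef.col = "black"` to the `corrplot()` call is one way to print the coefficients inside the cells, which makes the heatmap easier to read precisely.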

# Visualize relationship between LotArea and SalePrice
ggplot(data, aes(x = LotArea, y = SalePrice)) + 
  geom_point() + 
  geom_smooth(method = "lm", color = "blue") +
  xlim(0, 50000) # Limiting LotArea for better visualization
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 11 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values (`geom_point()`).

This scatter plot with a fitted regression line illustrates the relationship between ‘LotArea’ and ‘SalePrice’. The plot indicates a positive correlation: as ‘LotArea’ increases, ‘SalePrice’ also tends to increase. The regression line summarizes this relationship, and the shaded band around it is the 95% confidence interval for the fitted line, which geom_smooth() draws by default. Because xlim() removes the 11 observations with ‘LotArea’ above 50,000 (see the warnings above), those points are excluded from both the plot and the fitted line.

Splitting data sets into train and test data.

# Split data into training and testing sets
set.seed(123) # for reproducibility
training_indices <- sample(1:nrow(data), 0.8 * nrow(data))
train_data <- data[training_indices, ]
test_data <- data[-training_indices, ]

1. Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation

Probability theory is the mathematical foundation underlying statistical modeling, enabling us to quantify uncertainty, make predictions about unknown events, and draw inferences from sample data about a population. In statistical modeling, probability provides the framework for understanding how likely various outcomes are, given a set of assumptions and observed data.

Statistical inference involves drawing conclusions about populations from sample data. It typically includes estimating population parameters (like means and variances), testing hypotheses, and making predictions. Probability theory underpins inference by allowing us to calculate the likelihood of observing our sample data under different assumptions about the population.

For example, if we assume that a dataset follows a normal distribution, probability theory lets us calculate how likely we are to observe a sample mean within a certain range. This calculation can inform our confidence in the sample mean as an estimate of the population mean.

Maximum likelihood estimation (MLE) is a method used to estimate the parameters of a statistical model. The “likelihood” is the probability (or probability density, for continuous data) of the observed sample, viewed as a function of the parameters. In MLE, we choose the parameter values that maximize this likelihood, that is, the values under which the observed data are most probable. This approach is widely used across various types of statistical models, including both simple linear regression and more complex models. For linear regression with normally distributed errors, the maximum likelihood estimates of the coefficients coincide with the ordinary least squares estimates, which is what lm() computes.

# Simple linear regression; lm() fits by least squares, which matches
# the MLE of the coefficients when the errors are assumed normal
model <- lm(SalePrice ~ OverallQual, data = data)

# Summarize the model to see the coefficient estimates
summary(model)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -198152  -29409   -1845   21463  396848 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -96206.1     5756.4  -16.71   <2e-16 ***
## OverallQual  45435.8      920.4   49.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48620 on 1458 degrees of freedom
## Multiple R-squared:  0.6257, Adjusted R-squared:  0.6254 
## F-statistic:  2437 on 1 and 1458 DF,  p-value: < 2.2e-16
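To make the link between least squares and maximum likelihood concrete, the Gaussian log-likelihood can be maximized numerically and compared with lm(). This is a minimal sketch on simulated data, not the housing data: the variables `x` and `y`, the true coefficients, and the helper `neg_loglik` are all assumptions introduced for illustration.

```r
# Simulated data (hypothetical; not the housing data)
set.seed(42)
n <- 500
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)

# Negative Gaussian log-likelihood for y = b0 + b1 * x + error.
# sigma is parameterised on the log scale so the optimiser cannot
# step to a non-positive standard deviation.
neg_loglik <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}

# Numerical maximum likelihood with a general-purpose optimiser
mle <- optim(
  par = c(mean(y), 0, log(sd(y))),
  fn = neg_loglik,
  method = "BFGS",
  control = list(reltol = 1e-12, maxit = 1000)
)

# Least-squares fit for comparison
ols <- lm(y ~ x)
print(rbind(MLE = mle$par[1:2], OLS = unname(coef(ols))))

# The MLE of sigma divides the residual sum of squares by n, whereas
# summary(lm) reports a residual standard error that divides by n - 2
sigma_mle <- exp(mle$par[3])
sigma_ols <- summary(ols)$sigma
print(c(sigma_mle = sigma_mle, sigma_ols = sigma_ols))
```

The two rows of coefficients should agree up to the optimiser's numerical tolerance, while the two estimates of sigma differ slightly because of the different divisors.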

2. Determine and Apply the Appropriate Generalized Linear Model

This section demonstrates the application of a Generalized Linear Model (GLM) to predict a continuous outcome, specifically the sale price of houses (SalePrice), using predictors such as lot area (LotArea), overall quality (OverallQual), and year built (YearBuilt). This approach models the relationship between the target variable and selected features using a linear regression framework, which is a special case of the GLM for continuous outcomes. The glm function in R is used with the Gaussian family and its default identity link, which reproduces ordinary linear regression. The GLM approach for predicting house prices offers a flexible framework for exploring linear relationships between a continuous target variable and a set of predictors. The choice of predictors should be guided by both theoretical reasoning and exploratory data analysis, while the interpretation of the model summary offers insights into the dynamics influencing house prices.

# Applying a GLM with a Gaussian family
glm_model <- glm(SalePrice ~ LotArea + OverallQual + YearBuilt, data = data, family = gaussian())

# Model summary
summary(glm_model)
## 
## Call:
## glm(formula = SalePrice ~ LotArea + OverallQual + YearBuilt, 
##     family = gaussian(), data = data)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.898e+05  9.218e+04  -7.483 1.25e-13 ***
## LotArea      1.494e+00  1.211e-01  12.334  < 2e-16 ***
## OverallQual  4.044e+04  1.066e+03  37.940  < 2e-16 ***
## YearBuilt    3.086e+02  4.854e+01   6.358 2.72e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2101496473)
## 
##     Null deviance: 9.2079e+12  on 1459  degrees of freedom
## Residual deviance: 3.0598e+12  on 1456  degrees of freedom
## AIC: 35490
## 
## Number of Fisher Scoring iterations: 2
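The claim that a Gaussian GLM with an identity link is the same model as ordinary linear regression can be checked directly. The sketch below uses simulated data (the data frame `sim` and its columns `area`, `qual`, and `price` are invented for illustration) and compares glm() with lm() on coefficients, dispersion, deviance, and AIC.

```r
# Simulated data (hypothetical; not the housing data)
set.seed(7)
n <- 300
sim <- data.frame(
  area = runif(n, 500, 3000),
  qual = sample(1:10, n, replace = TRUE)
)
sim$price <- 10000 + 50 * sim$area + 15000 * sim$qual + rnorm(n, sd = 20000)

# The same linear predictor fitted two ways
fit_glm <- glm(price ~ area + qual, data = sim, family = gaussian(link = "identity"))
fit_lm  <- lm(price ~ area + qual, data = sim)

# Coefficients agree up to numerical tolerance
same_coef <- isTRUE(all.equal(coef(fit_glm), coef(fit_lm)))

# The GLM dispersion parameter is the squared residual standard error
disp <- summary(fit_glm)$dispersion
rse2 <- summary(fit_lm)$sigma^2

# The residual deviance of a Gaussian GLM is the residual sum of squares
dev <- deviance(fit_glm)
rss <- sum(residuals(fit_lm)^2)

print(c(same_coef = same_coef, dispersion = disp, sigma_squared = rse2))
print(c(deviance = dev, rss = rss, AIC_glm = AIC(fit_glm), AIC_lm = AIC(fit_lm)))
```

This is consistent with the summary above, where the dispersion parameter (2101496473) equals the residual deviance divided by its 1456 degrees of freedom.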

3. Conduct Model Selection for a Set of Candidate Models

Conducting model selection involves comparing different statistical models to choose the best one for predicting your target variable, based on criteria like simplicity, predictive performance, and the underlying assumptions of each model. In the context of predicting house prices, you might have several candidate models that include different subsets of predictors or use different forms of the predictors (e.g., polynomial terms, interaction terms). Model selection is an iterative and critical step in the modeling process: by carefully considering different models and using objective criteria for comparison, you can refine your approach based on empirical evidence.

The R script I’ve written serves as a practical application of theoretical concepts learned in class, specifically around model building and selection using generalized linear models (GLMs) to predict house prices. In the script, I compare three models of increasing complexity. The first, model, is the single-predictor regression on OverallQual fitted in section 1. model2 uses LotArea and OverallQual as predictors, while model3 further includes YearBuilt and YearRemodAdd, allowing me to explore how the inclusion of additional variables affects the model’s performance.

I then employ the Akaike Information Criterion (AIC) to evaluate and compare these models. The AIC helps me understand which model achieves the best balance between accuracy and complexity, providing a quantitative measure to guide my selection. This process not only reinforces my understanding of statistical methods but also enhances my analytical skills by applying these methods to real-world data. It’s an invaluable exercise in seeing firsthand how adding more variables can impact a model’s ability to predict outcomes accurately, preparing me for more advanced studies or professional tasks in data analysis and modeling.

# Fit additional models 
model2 <- glm(SalePrice ~ LotArea + OverallQual, data = data, family = gaussian())
model3 <- glm(SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd, data = data, family = gaussian())

# Calculate AIC for each model 
aic_values <- sapply(list(model, model2, model3), AIC)
names(aic_values) <- c("model", "model2", "model3")

# Calculate RMSE for each model
predictions_model1 <- predict(model, newdata = test_data)
predictions_model2 <- predict(model2, newdata = test_data)
predictions_model3 <- predict(model3, newdata = test_data)

rmse_values <- c(
  model1 = rmse(test_data$SalePrice, predictions_model1),
  model2 = rmse(test_data$SalePrice, predictions_model2),
  model3 = rmse(test_data$SalePrice, predictions_model3)
)

# Print AIC values
print(aic_values)
##    model   model2   model3 
## 35659.49 35527.52 35474.54
# Print RMSE values
print(rmse_values)
##   model1   model2   model3 
## 46840.06 43672.18 42526.54
# Identify the model with the lowest AIC and RMSE
best_model_aic <- names(which.min(aic_values))
best_model_rmse <- names(which.min(rmse_values))

# Output the best models
cat("The best model based on AIC is:", best_model_aic, "\n")
## The best model based on AIC is: model3
cat("The best model based on RMSE is:", best_model_rmse, "\n")
## The best model based on RMSE is: model3

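AIC is a measure of the relative quality of statistical models for a given set of data. It is defined as 2k - 2 log(L), where k is the number of estimated parameters and L is the maximized likelihood, so a lower AIC value indicates a better balance between goodness-of-fit and model complexity. RMSE is a measure of how accurately the model predicts the response variable, with a lower RMSE indicating better predictive accuracy. The output indicates that model3 has the lowest AIC and RMSE values among the three models, suggesting it is the best of the three both in terms of fit to the data and predictive accuracy. Two caveats apply. First, the three models were fitted on the full data, which includes the test rows, so the RMSE values here are not strictly out-of-sample; section 5 refits a model on train_data only. Second, the same first model appears as "model" in the AIC output and as "model1" in the RMSE output.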
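One way to keep the RMSE comparison strictly out-of-sample is to fit every candidate on the training rows only and score it on the held-out rows. The following is a minimal sketch on simulated data, not a rerun of the analysis above: the data frame `sim`, its columns, and the helper `rmse_base` (a base R stand-in for a packaged rmse function) are assumptions introduced for illustration.

```r
# Simulated data (hypothetical): y depends on x1 and x2, but not on noise
set.seed(123)
n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 5 + 3 * sim$x1 + 2 * sim$x2 + rnorm(n)

# 80/20 split
idx   <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[idx, ]
test  <- sim[-idx, ]

# Root mean squared error in base R
rmse_base <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Candidate models, all fitted on the training rows only
candidates <- list(
  m1 = glm(y ~ x1, data = train, family = gaussian()),
  m2 = glm(y ~ x1 + x2, data = train, family = gaussian()),
  m3 = glm(y ~ x1 + x2 + noise, data = train, family = gaussian())
)

# AIC from the training fit, RMSE from the held-out rows
comparison <- data.frame(
  AIC  = sapply(candidates, AIC),
  RMSE = sapply(candidates, function(m) rmse_base(test$y, predict(m, newdata = test)))
)
print(comparison)

# AIC by hand for m2: k counts the coefficients plus the error variance
k <- length(coef(candidates$m2)) + 1
aic_manual <- 2 * k - 2 * as.numeric(logLik(candidates$m2))
print(c(manual = aic_manual, builtin = AIC(candidates$m2)))
```

In this setup the third model adds a predictor that is unrelated to the response, which is the kind of extra complexity the AIC penalty is meant to discourage.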

4. Communicate the Results

As a statistics student applying these concepts in practice, I found that using R to analyze the relationship between house prices and various predictors was a rich learning experience. Examining the model coefficients through coef(summary(glm_model)) is particularly instructive, as it allows me to quantify how features such as OverallQual influence SalePrice. For example, holding LotArea and YearBuilt fixed, a one-point increase in OverallQual is associated with an increase of about 40,438 in the predicted SalePrice, each additional unit of LotArea with about 1.49, and each later year of construction with about 309. This analysis deepens my understanding of the impact of each feature within the model and provides a foundation for predicting outcomes and testing hypotheses in real-world contexts. Visualizing these relationships using ggplot2 also helps me interpret and present the data effectively.

Applying a generalized linear model (GLM) and adding a fitted trend line to the scatter plot using geom_smooth() with a GLM further solidify my grasp of statistical modeling techniques. With a Gaussian family and identity link, this trend line is a straight least-squares line. This hands-on approach helps me understand how different model specifications, such as choosing a Gaussian family for continuous outcomes like SalePrice, can be tailored to analyze specific types of data. Moreover, producing these visual and numerical outputs is valuable practice in communicating statistical findings, a critical skill for any aspiring data scientist.

# Model coefficients
coef(summary(glm_model))
##                  Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept) -6.897716e+05 92179.695326 -7.482902  1.252075e-13
## LotArea      1.493851e+00     0.121115 12.334161  2.563956e-33
## OverallQual  4.043791e+04  1065.833576 37.940173 1.327982e-219
## YearBuilt    3.086028e+02    48.535679  6.358266  2.724419e-10
# Visualization
library(ggplot2)
ggplot(data, aes(x = OverallQual, y = SalePrice)) +
  # Map YearBuilt as a continuous colour scale; factor(YearBuilt) would
  # create one legend entry per construction year
  geom_point(aes(color = YearBuilt), alpha = 0.5) +
  geom_smooth(method = "glm", method.args = list(family = gaussian()), se = FALSE, color = "red") +
  labs(title = "Sale Price vs. Overall Quality")
## `geom_smooth()` using formula = 'y ~ x'

5. Use R to Fit and Assess Statistical Models

The snippet below trains and evaluates a generalized linear model (GLM) for predicting house prices based on the features LotArea, OverallQual, and YearBuilt. It uses the training and testing sets created earlier so that the model can be evaluated on unseen data. After fitting the model on the training set, predictions are made on the testing set, and the model’s performance is assessed using the Mean Squared Error (MSE), which is the average squared prediction error. This procedure helps validate the accuracy of the model and checks its generalizability to new data.

# Fit model on training data
fit_model <- glm(SalePrice ~ LotArea + OverallQual + YearBuilt, data = train_data, family = gaussian())

# Predict on test data
predictions <- predict(fit_model, newdata = test_data)

# Assessing performance using Mean Squared Error (MSE)
mse <- mean((predictions - test_data$SalePrice)^2)
print(paste("MeanSquaredError:", mse))
## [1] "MeanSquaredError: 1847446755.33903"
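Because MSE is expressed in squared units of SalePrice, it is often easier to interpret on the original scale as a root mean squared error, alongside other holdout metrics. The helper below is a hypothetical sketch (the function `holdout_metrics` and the toy vectors are not part of the original analysis); the last lines simply take the square root of the MSE printed above.

```r
# Hypothetical helper (not part of the original analysis): common holdout metrics
holdout_metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)
  c(
    MSE  = mse,
    RMSE = sqrt(mse),
    MAE  = mean(abs(err)),
    R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2)
  )
}

# Toy vectors, invented for illustration
actual    <- c(1, 2, 3, 4)
predicted <- c(1, 2, 3, 6)
toy <- holdout_metrics(actual, predicted)
print(toy)

# RMSE implied by the MSE reported above, on the scale of SalePrice
rmse_reported <- sqrt(1847446755.33903)
print(rmse_reported)
```

With the objects from the snippet above, the call would be holdout_metrics(test_data$SalePrice, predictions). The reported MSE corresponds to a typical prediction error of roughly 43,000 on the scale of SalePrice.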

Conclusion

Throughout the project, a rich dataset featuring a wide array of house characteristics was carefully explored and processed. Initial data analysis involved summarizing the dataset, scrutinizing the distribution of the target variable ‘SalePrice’, and ensuring data quality by handling missing values and potential outliers. This foundational step was critical for establishing a reliable dataset for modeling.

Feature selection pinpointed influential variables such as ‘LotArea’, ‘OverallQual’, and ‘YearBuilt’, which were then used to fit a Generalized Linear Model (GLM). The model assumed a Gaussian distribution, a natural choice for a continuous outcome variable like house prices. Diagnostic plots, particularly the residual plot, offered a visual assessment of the model fit, revealing whether the assumptions of linear regression were satisfied. To refine the model, additional predictors and transformations were considered, with the Akaike Information Criterion (AIC) serving as a statistical guide to compare and select the best-performing model. A model fitted on the training set was then evaluated on the testing set, using the Mean Squared Error (MSE) as a measure of accuracy.

For the single-predictor regression on OverallQual, the Multiple R-squared value of 0.6257 indicates that approximately 62.57% of the variability in sale prices can be explained by overall quality alone. Adding LotArea and YearBuilt reduces the deviance from 9.2079e+12 (null) to 3.0598e+12 (residual), which corresponds to about 67% of the variability explained. The fitted models show that the overall quality of a house, its lot area, and its age are statistically significant predictors of its sale price, and the exploratory analysis points to living area as another strongly related feature. The models provide valuable insights into the factors that contribute to the value of a house and underscore the importance of using statistical evidence to guide real estate price predictions. With an iterative approach to model building, selection, and validation, the project shows how linear regression can be a useful tool for understanding real estate market dynamics. However, the residual analysis suggests that there is still unexplained variability that could be addressed with additional predictors or a more complex model.

Reflection

Throughout our course, I did more than just show up, turn in assignments, and interact on Teams. I often took part in online class discussions, asking questions that helped me and my classmates learn better. This also made our classes more interactive. My excitement didn’t just stay within the classes. I really enjoyed the small competitions we had, where I could use what we learned in a fun and cooperative way. These activities improved my problem-solving and teamwork skills. I also spent time on GitHub, where I helped and learned from others. I gave useful advice and learned how to use the site well, which helped all of us learn more together. In short, I fully participated in many parts of our course, from online discussions and competitions to assignments and helping on GitHub. I aimed to improve both my own skills and the learning of everyone in our course group, truly showing what it means to take part actively.