About Report

This RMarkdown file contains the final report of the data analysis done for this project on building a linear regression model for the txhousing data set. It contains analysis such as data exploration, computing summary statistics, creating plots, and building the predictive model. The final report was completed on Mon Apr 7 16:41:42 2025.

The analysis seeks to answer two questions:
1. Is there a relationship between the number of sales and the total value of sales?
2. How much of the variation in the total value of sales can be explained by the number of sales?

The interest of the analysis is to get a sense of the house sale prices, including total sales value and the number of sales, across different cities.

To achieve the aim of this project, we will use the txhousing data set:
* It contains information about the housing market in Texas, provided by the TAMU Real Estate Center (https://www.recenter.tamu.edu/).
* It has 8,602 rows and 9 variables.
* We will work mainly with two variables: “sales”, the number of sales, as the x-variable, and “volume”, the total value of sales, as the y-variable.

Task One: Import packages and dataset

In this task, you will import the required packages and data for this project

## Importing required packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(ggpubr)
library(broom)
library(ggfortify)
library(skimr)

## Import the built-in R data set
data("txhousing")

## View and check the dimension of the data set
View(txhousing)
dim(txhousing)
## [1] 8602    9
## Check the column names for the data set
names(txhousing)
## [1] "city"      "year"      "month"     "sales"     "volume"    "median"   
## [7] "listings"  "inventory" "date"

Task Two: Use R functions to describe the data

In this task, you will learn how to explore and clean the data using R functions

## Take a peek using the head and tail functions
head(txhousing)
## # A tibble: 6 × 9
##   city     year month sales   volume median listings inventory  date
##   <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
## 1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
## 2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
## 3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
## 4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
## 5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
## 6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
tail(txhousing)
## # A tibble: 6 × 9
##   city           year month sales   volume median listings inventory  date
##   <chr>         <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
## 1 Wichita Falls  2015     2   100 11646765  94000      795       6.8 2015.
## 2 Wichita Falls  2015     3   152 16716584  89200      818       6.8 2015.
## 3 Wichita Falls  2015     4   129 15482194 105300      760       6.4 2015.
## 4 Wichita Falls  2015     5   174 19188181 100000      776       6.4 2015.
## 5 Wichita Falls  2015     6   143 18820752 118800      770       6.2 2015.
## 6 Wichita Falls  2015     7   172 23850905 116700      811       6.5 2016.
## Check the internal structure of the data frame
glimpse(txhousing)
## Rows: 8,602
## Columns: 9
## $ city      <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abil…
## $ year      <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
## $ month     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, …
## $ sales     <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100, 92, 75, …
## $ volume    <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 13910000, 1263…
## $ median    <dbl> 71400, 58700, 58100, 68600, 67300, 66900, 73500, 75000, 6450…
## $ listings  <dbl> 701, 746, 784, 785, 794, 780, 742, 765, 771, 764, 721, 658, …
## $ inventory <dbl> 6.3, 6.6, 6.8, 6.9, 6.8, 6.6, 6.2, 6.4, 6.5, 6.6, 6.2, 5.7, …
## $ date      <dbl> 2000.000, 2000.083, 2000.167, 2000.250, 2000.333, 2000.417, …
## Create a broad overview of a data set
summary(txhousing)
##      city                year          month            sales       
##  Length:8602        Min.   :2000   Min.   : 1.000   Min.   :   6.0  
##  Class :character   1st Qu.:2003   1st Qu.: 3.000   1st Qu.:  86.0  
##  Mode  :character   Median :2007   Median : 6.000   Median : 169.0  
##                     Mean   :2007   Mean   : 6.406   Mean   : 549.6  
##                     3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.: 467.0  
##                     Max.   :2015   Max.   :12.000   Max.   :8945.0  
##                                                     NA's   :568     
##      volume              median          listings       inventory     
##  Min.   :8.350e+05   Min.   : 50000   Min.   :    0   Min.   : 0.000  
##  1st Qu.:1.084e+07   1st Qu.:100000   1st Qu.:  682   1st Qu.: 4.900  
##  Median :2.299e+07   Median :123800   Median : 1283   Median : 6.200  
##  Mean   :1.069e+08   Mean   :128131   Mean   : 3217   Mean   : 7.175  
##  3rd Qu.:7.512e+07   3rd Qu.:150000   3rd Qu.: 2954   3rd Qu.: 8.150  
##  Max.   :2.568e+09   Max.   :304200   Max.   :43107   Max.   :55.900  
##  NA's   :568         NA's   :616      NA's   :1424    NA's   :1467    
##       date     
##  Min.   :2000  
##  1st Qu.:2004  
##  Median :2008  
##  Mean   :2008  
##  3rd Qu.:2012  
##  Max.   :2016  
## 
skim(txhousing)
Data summary
Name txhousing
Number of rows 8602
Number of columns 9
_______________________
Column type frequency:
character 1
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
city 0 1 4 21 0 46 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2007.30 4.50 2000 2003.00 2007.00 2011.00 2015.0 ▇▆▆▆▅
month 0 1.00 6.41 3.44 1 3.00 6.00 9.00 12.0 ▇▅▅▅▇
sales 568 0.93 549.56 1110.74 6 86.00 169.00 467.00 8945.0 ▇▁▁▁▁
volume 568 0.93 106858620.78 244933668.97 835000 10840000.00 22986824.00 75121388.75 2568156780.0 ▇▁▁▁▁
median 616 0.93 128131.44 37359.58 50000 100000.00 123800.00 150000.00 304200.0 ▅▇▃▁▁
listings 1424 0.83 3216.90 5968.33 0 682.00 1283.00 2953.75 43107.0 ▇▁▁▁▁
inventory 1467 0.83 7.17 4.61 0 4.90 6.20 8.15 55.9 ▇▁▁▁▁
date 0 1.00 2007.75 4.50 2000 2003.83 2007.75 2011.67 2015.5 ▇▇▇▇▇
## Drop the missing values in sales, volume, median
tx_data <- txhousing %>%
  drop_na(sales, volume, median)

## Create the age variable
tx_data$age <- 2023 - tx_data$year

## Create a broad overview of a data set
skim(tx_data)
Data summary
Name tx_data
Number of rows 7985
Number of columns 10
_______________________
Column type frequency:
character 1
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
city 0 1 4 21 0 46 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2007.63 4.44 2000 2004.00 2008.00 2011.00 2015.0 ▇▇▇▇▆
month 0 1.00 6.40 3.44 1 3.00 6.00 9.00 12.0 ▇▆▅▅▇
sales 0 1.00 552.40 1113.54 6 87.00 170.00 470.00 8945.0 ▇▁▁▁▁
volume 0 1.00 107448783.10 245566897.28 835000 10904804.00 23170000.00 76002315.00 2568156780.0 ▇▁▁▁▁
median 0 1.00 128135.28 37360.34 50000 100000.00 123800.00 150000.00 304200.0 ▅▇▃▁▁
listings 817 0.90 3220.01 5971.82 0 682.00 1286.00 2954.25 43107.0 ▇▁▁▁▁
inventory 859 0.89 7.17 4.61 0 4.90 6.20 8.10 55.9 ▇▁▁▁▁
date 0 1.00 2008.08 4.44 2000 2004.33 2008.25 2011.92 2015.5 ▆▇▇▇▇
age 0 1.00 15.37 4.44 8 12.00 15.00 19.00 23.0 ▇▇▆▆▆
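
The skim output above reports 8,602 rows before cleaning and 7,985 after. As a minimal sketch of how that drop can be cross-checked, the block below counts the rows with a missing value in any of the three modelling columns and compares that count with what drop_na() removes. The names incomplete and tx_check are illustrative and not part of the original report.

```r
library(ggplot2)  # provides the txhousing data set
library(tidyr)    # provides drop_na()

data("txhousing", package = "ggplot2")

# Flag rows with a missing value in any of the three modelling columns
incomplete <- !complete.cases(txhousing[, c("sales", "volume", "median")])

# drop_na() restricted to those columns should remove exactly the flagged rows
tx_check <- drop_na(txhousing, sales, volume, median)

sum(incomplete)                    # rows flagged as incomplete
nrow(txhousing) - nrow(tx_check)   # rows actually removed by drop_na()
```

The two counts should agree. Note that listings and inventory still contain missing values afterwards, because they were not passed to drop_na().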

Task Three: Create data visualization using ggplot

In this task, you will learn how to create a scatter plot to visualize the variables for model building

## Find the correlation between the variables
cor(tx_data$sales, tx_data$volume)
## [1] 0.981039
## Plot a scatter plot for the variables with sales on the x-axis
## and volume on the y-axis
ggplot(tx_data, aes(x = sales, y = volume)) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
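
The skim histograms for sales and volume are strongly right-skewed, so most points in the scatter plot above crowd into the lower-left corner. One optional way to spread them out, shown here only as a sketch and not something the report itself does, is to draw the same plot on log10 axes. The object name p_log and the axis labels are illustrative.

```r
library(ggplot2)
library(tidyr)

tx_data <- drop_na(txhousing, sales, volume, median)

# Log axes are only valid for strictly positive values; after dropping the
# missing rows both variables are positive (the skim minimum for sales is 6).
p_log <- ggplot(tx_data, aes(x = sales, y = volume)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Number of sales (log10 scale)",
       y = "Total value of sales (log10 scale)")

p_log
```

This changes only how the data are displayed; the correlation and the models in the following tasks are still computed on the original scale.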

Task Four: Load and describe a dataset

Data Source: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

## Import the house sales data
house_sales <- read_csv("house_sales_prices.csv")
## Rows: 1460 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Street, CentralAir, SaleType, SaleCondition
## dbl (11): LotFrontage, LotArea, YearBuilt, TotalBsmtSF, BedroomAbvGr, Kitche...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Get a glimpse of the data set
glimpse(house_sales)
## Rows: 1,460
## Columns: 15
## $ LotFrontage   <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, …
## $ LotArea       <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612…
## $ YearBuilt     <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19…
## $ TotalBsmtSF   <dbl> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10…
## $ BedroomAbvGr  <dbl> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,…
## $ KitchenAbvGr  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ TotRmsAbvGrd  <dbl> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6…
## $ Street        <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
## $ CentralAir    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ GarageCars    <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,…
## $ GarageArea    <dbl> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7…
## $ YrSold        <dbl> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20…
## $ SaleType      <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W…
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal", "Norm…
## $ SalePrice     <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000, …
## Create a broad overview of a data set
summary(house_sales)
##   LotFrontage        LotArea         YearBuilt     TotalBsmtSF    
##  Min.   : 21.00   Min.   :  1300   Min.   :1872   Min.   :   0.0  
##  1st Qu.: 59.00   1st Qu.:  7554   1st Qu.:1954   1st Qu.: 795.8  
##  Median : 69.00   Median :  9478   Median :1973   Median : 991.5  
##  Mean   : 70.05   Mean   : 10517   Mean   :1971   Mean   :1057.4  
##  3rd Qu.: 80.00   3rd Qu.: 11602   3rd Qu.:2000   3rd Qu.:1298.2  
##  Max.   :313.00   Max.   :215245   Max.   :2010   Max.   :6110.0  
##  NA's   :259                                                      
##   BedroomAbvGr    KitchenAbvGr    TotRmsAbvGrd       Street         
##  Min.   :0.000   Min.   :0.000   Min.   : 2.000   Length:1460       
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 5.000   Class :character  
##  Median :3.000   Median :1.000   Median : 6.000   Mode  :character  
##  Mean   :2.866   Mean   :1.047   Mean   : 6.518                     
##  3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 7.000                     
##  Max.   :8.000   Max.   :3.000   Max.   :14.000                     
##                                                                     
##   CentralAir          GarageCars      GarageArea         YrSold    
##  Length:1460        Min.   :0.000   Min.   :   0.0   Min.   :2006  
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   1st Qu.:2007  
##  Mode  :character   Median :2.000   Median : 480.0   Median :2008  
##                     Mean   :1.767   Mean   : 473.0   Mean   :2008  
##                     3rd Qu.:2.000   3rd Qu.: 576.0   3rd Qu.:2009  
##                     Max.   :4.000   Max.   :1418.0   Max.   :2010  
##                                                                    
##    SaleType         SaleCondition        SalePrice     
##  Length:1460        Length:1460        Min.   : 34900  
##  Class :character   Class :character   1st Qu.:129975  
##  Mode  :character   Mode  :character   Median :163000  
##                                        Mean   :180921  
##                                        3rd Qu.:214000  
##                                        Max.   :755000  
## 
skim(house_sales)
Data summary
Name house_sales
Number of rows 1460
Number of columns 15
_______________________
Column type frequency:
character 4
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Street 0 1 4 4 0 2 0
CentralAir 0 1 1 1 0 2 0
SaleType 0 1 2 5 0 9 0
SaleCondition 0 1 6 7 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
LotFrontage 259 0.82 70.05 24.28 21 59.00 69.0 80.00 313 ▇▃▁▁▁
LotArea 0 1.00 10516.83 9981.26 1300 7553.50 9478.5 11601.50 215245 ▇▁▁▁▁
YearBuilt 0 1.00 1971.27 30.20 1872 1954.00 1973.0 2000.00 2010 ▁▂▃▆▇
TotalBsmtSF 0 1.00 1057.43 438.71 0 795.75 991.5 1298.25 6110 ▇▃▁▁▁
BedroomAbvGr 0 1.00 2.87 0.82 0 2.00 3.0 3.00 8 ▁▇▂▁▁
KitchenAbvGr 0 1.00 1.05 0.22 0 1.00 1.0 1.00 3 ▁▇▁▁▁
TotRmsAbvGrd 0 1.00 6.52 1.63 2 5.00 6.0 7.00 14 ▂▇▇▁▁
GarageCars 0 1.00 1.77 0.75 0 1.00 2.0 2.00 4 ▁▃▇▂▁
GarageArea 0 1.00 472.98 213.80 0 334.50 480.0 576.00 1418 ▂▇▃▁▁
YrSold 0 1.00 2007.82 1.33 2006 2007.00 2008.0 2009.00 2010 ▇▇▇▇▅
SalePrice 0 1.00 180921.20 79442.50 34900 129975.00 163000.0 214000.00 755000 ▇▅▁▁▁

Task Five: Build a simple regression model

In this task, you will build a simple linear regression with one dependent and one independent variable and interpret the results

The linear model equation can be written as follows: volume = b0 + b1 * sales

## Create a simple linear regression model using the variables
simple_model <- lm(volume ~ sales, data = tx_data)
summary(simple_model)
## 
## Call:
## lm(formula = volume ~ sales, data = tx_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -323981922   -7019959    1652053    7709296  674388390 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.206e+07  5.946e+05  -20.28   <2e-16 ***
## sales        2.163e+05  4.784e+02  452.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47600000 on 7983 degrees of freedom
## Multiple R-squared:  0.9624, Adjusted R-squared:  0.9624 
## F-statistic: 2.045e+05 on 1 and 7983 DF,  p-value: < 2.2e-16
## Plot the regression line for the model
ggplot(tx_data, aes(x = sales, y = volume)) +
  geom_point() +
  stat_smooth(method = lm) +
  labs(title = "Regression of Volume on Sales for Housing Sales in TX",
       x = "Sales", y = "Volume") +
  scale_y_continuous(labels = scales::comma)
## `geom_smooth()` using formula = 'y ~ x'

From the output above:

The estimated regression line equation can be written as follows: volume = -12060071 + 216346*sales
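
The second research question asks how much of the variation in volume is explained by sales. In a simple linear regression with one predictor, R-squared equals the square of the Pearson correlation, so the correlation from Task Three (0.981039) and the R-squared above (0.9624) are two views of the same quantity. The sketch below rebuilds tx_data so it can run on its own; r and r_squared are illustrative names.

```r
library(ggplot2)
library(tidyr)

tx_data <- drop_na(txhousing, sales, volume, median)
simple_model <- lm(volume ~ sales, data = tx_data)

r <- cor(tx_data$sales, tx_data$volume)          # Pearson correlation
r_squared <- summary(simple_model)$r.squared     # R-squared from the fitted model

r^2                          # squared correlation
r_squared                    # should match the line above
all.equal(r^2, r_squared)

coef(simple_model)           # unrounded intercept (b0) and slope (b1)
```

Here coef() returns the unrounded estimates behind the equation written above.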

Task Six: Perform diagnostic checks on fitted model

In this task, you will use diagnostic plots to check whether the assumptions of linear regression model are satisfied

Assumptions:

Linear regression makes several assumptions about the data, such as:
* Linearity of the data: the relationship between the predictor (x) and the outcome (y) is assumed to be linear.
* Independence of the residual error terms.
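
Diagnostic plots are read by eye, so it can help to pair them with formal tests. The lmtest package is loaded in Task One but not otherwise used in this report; the sketch below is one possible way to apply it to the two assumptions listed above, not the report's own method. resettest() is Ramsey's RESET test for functional form and dwtest() is the Durbin-Watson test for first-order autocorrelation. The result names are illustrative.

```r
library(ggplot2)
library(tidyr)
library(lmtest)

tx_data <- drop_na(txhousing, sales, volume, median)
simple_model <- lm(volume ~ sales, data = tx_data)

# Linearity: RESET test (H0: adding powers of the fitted values does not improve the fit)
reset_result <- resettest(simple_model)

# Independence: Durbin-Watson test (H0: no first-order autocorrelation in the residuals)
dw_result <- dwtest(simple_model)

reset_result
dw_result
```

The Durbin-Watson statistic depends on row order. txhousing is sorted by city and then by date, so the test here mostly reflects month-to-month dependence within each city.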

## Plotting the fitted model

plot(simple_model)

## Return the first diagnostic plot for the model

#plot(simple_model, which = 1)
#plot(simple_model, which = 2)
#plot(simple_model, which = 3)

## Create all four plots at once

## autoplot() returns four panels, each with its own title and axis labels;
## adding labs() with `+` would apply one title and one pair of axis labels to every panel
autoplot(simple_model) +
  theme_minimal()

Task Seven: Perform model fit assessment

In this task, you will learn how to assess how well the model fits and the significance of the predictor variable

## Assess the summary of the fitted model

summary(simple_model)
## 
## Call:
## lm(formula = volume ~ sales, data = tx_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -323981922   -7019959    1652053    7709296  674388390 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.206e+07  5.946e+05  -20.28   <2e-16 ***
## sales        2.163e+05  4.784e+02  452.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47600000 on 7983 degrees of freedom
## Multiple R-squared:  0.9624, Adjusted R-squared:  0.9624 
## F-statistic: 2.045e+05 on 1 and 7983 DF,  p-value: < 2.2e-16
## Calculate the confidence interval for the coefficients

confint(simple_model)
##                   2.5 %      97.5 %
## (Intercept) -13225619.6 -10894523.0
## sales          215408.6    217284.1
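
The intervals returned by confint() follow the usual formula: estimate ± t critical value × standard error, with the residual degrees of freedom of the model (7983 here). The sketch below reproduces them by hand to make the calculation explicit; est, se, t_crit and manual_ci are illustrative names.

```r
library(ggplot2)
library(tidyr)

tx_data <- drop_na(txhousing, sales, volume, median)
simple_model <- lm(volume ~ sales, data = tx_data)

coef_table <- coef(summary(simple_model))   # matrix of estimates, SEs, t values, p values
est <- coef_table[, "Estimate"]
se  <- coef_table[, "Std. Error"]

# Two-sided 95% interval: 2.5% in each tail of the t distribution
t_crit <- qt(0.975, df = df.residual(simple_model))

manual_ci <- cbind(lower = est - t_crit * se,
                   upper = est + t_crit * se)

manual_ci
confint(simple_model)   # should agree with manual_ci
```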

Task Nine: Build a simple regression model with transformation

In this task, you will build a simple linear regression model using a log transformation of the dependent variable

## Build a log transformed regression model

log_model <- lm(log10(volume) ~ sales, data = tx_data)


## Return the summary of the model

summary(log_model)
## 
## Call:
## lm(formula = log10(volume) ~ sales, data = tx_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.70214 -0.28613 -0.00098  0.33364  0.85241 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.249e+00  5.198e-03  1394.6   <2e-16 ***
## sales       4.296e-04  4.182e-06   102.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4161 on 7983 degrees of freedom
## Multiple R-squared:  0.5693, Adjusted R-squared:  0.5693 
## F-statistic: 1.055e+04 on 1 and 7983 DF,  p-value: < 2.2e-16
## Return the first diagnostic plot for the model

#plot(log_model, which = 1)
#plot(log_model, which = 2)
#plot(log_model, which = 3)


confint(log_model)
##                    2.5 %       97.5 %
## (Intercept) 7.2392377311 7.2596178191
## sales       0.0004214485 0.0004378449
## autoplot() returns four panels, each with its own title and axis labels;
## adding labs() with `+` would apply one title and one pair of axis labels to every panel
autoplot(log_model) +
  theme_minimal()
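
Two points help when reading the log model. First, its R-squared (0.5693) measures variation in log10(volume), so it is not directly comparable with the 0.9624 of the untransformed model. Second, its coefficients and predictions are on the log10 scale and need to be raised to the power of 10 to get back to the original scale. The sketch below shows that back-transformation, reusing the three sales values from Task Ten; new_sales, pred_log and pred_volume are illustrative names.

```r
library(ggplot2)
library(tidyr)

tx_data <- drop_na(txhousing, sales, volume, median)
log_model <- lm(log10(volume) ~ sales, data = tx_data)

# Slope on the original scale: the factor by which predicted volume is
# multiplied for each additional sale
unname(10^coef(log_model)["sales"])

# Predictions are returned on the log10 scale ...
new_sales <- data.frame(sales = c(210, 27, 140))
pred_log <- predict(log_model, newdata = new_sales)

# ... so raise 10 to that power to recover the original scale
pred_volume <- 10^pred_log
pred_volume
```

Back-transformed predictions are always positive. If the errors are roughly symmetric on the log scale, they are closer to a typical (median) volume than to a mean.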

Task Ten: Make predictions using the fitted model

In this task, you will learn how to check for metrics from the fitted model and make predictions given new values of the independent variable

## Find the fitted values of the simple regression model

fitted_values <- predict(simple_model)
#head(fitted_values, 3)

## Return the model metrics


model_metrics <- augment(simple_model)
#model_metrics
  
## Predict new values using the model

predict(simple_model, newdata = data.frame(sales = c(210, 27, 140)))
##        1        2        3 
## 33372661 -6218720 18228417
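
The prediction for sales = 27 is negative, which cannot be a real sales value. This follows from the negative intercept: with the estimated line volume = -12060071 + 216346*sales, any input below about 56 sales gives a negative prediction. Point predictions also carry no sense of uncertainty. As a sketch, predict() can return 95% prediction intervals alongside the fitted values; new_sales and pred_int are illustrative names.

```r
library(ggplot2)
library(tidyr)

tx_data <- drop_na(txhousing, sales, volume, median)
simple_model <- lm(volume ~ sales, data = tx_data)

new_sales <- data.frame(sales = c(210, 27, 140))

# interval = "prediction" adds lower (lwr) and upper (upr) bounds for a new observation
pred_int <- predict(simple_model, newdata = new_sales,
                    interval = "prediction", level = 0.95)
pred_int
```

The intervals are wide relative to the predictions at these small sales counts, which matches the residual standard error of about 47.6 million reported in the model summary.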

Task Eleven: Multiple Regression

In this task, you will build a multiple regression model with one dependent variable and three independent variables and interpret the results

## Build the multiple regression model with volume as the y variable and sales, median and age as the x variables

multiple_reg <- lm(volume ~ sales + median + age, data = tx_data)


## This prints the result of the model
multiple_reg
## 
## Call:
## lm(formula = volume ~ sales + median + age, data = tx_data)
## 
## Coefficients:
## (Intercept)        sales       median          age  
##  -3.262e+07    2.117e+05    3.992e+02   -1.823e+06
## Check the summary of the multiple regression model
summary(multiple_reg)
## 
## Call:
## lm(formula = volume ~ sales + median + age, data = tx_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -282127704  -13143251    -595883   13529089  660234903 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.262e+07  3.515e+06  -9.281   <2e-16 ***
## sales        2.117e+05  4.747e+02 445.937   <2e-16 ***
## median       3.992e+02  1.633e+01  24.445   <2e-16 ***
## age         -1.823e+06  1.289e+05 -14.142   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43410000 on 7981 degrees of freedom
## Multiple R-squared:  0.9688, Adjusted R-squared:  0.9688 
## F-statistic: 8.252e+04 on 3 and 7981 DF,  p-value: < 2.2e-16
## Plot the fitted multiple regression model
autoplot(multiple_reg)
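
Because the simple model is nested inside the multiple model, the two can be compared with an F test using anova(). This is a sketch of one way to check whether adding median and age improves the fit beyond sales alone; it is not a step the report performs. model_cmp is an illustrative name.

```r
library(ggplot2)
library(tidyr)

tx_data <- drop_na(txhousing, sales, volume, median)
tx_data$age <- 2023 - tx_data$year   # same definition as in Task Two

simple_model <- lm(volume ~ sales, data = tx_data)
multiple_reg <- lm(volume ~ sales + median + age, data = tx_data)

# F test of H0: the coefficients of median and age are both zero
model_cmp <- anova(simple_model, multiple_reg)
model_cmp

# Adjusted R-squared of each model, side by side
c(simple   = summary(simple_model)$adj.r.squared,
  multiple = summary(multiple_reg)$adj.r.squared)
```

With nearly 8,000 rows, even a small gain in fit can be highly significant, so the adjusted R-squared values (0.9624 and 0.9688 in the summaries above) are worth reading alongside the p-value.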

Task Twelve: Create a model to predict house prices in Iowa

In this task, you will build a model from scratch to predict house prices using variables in a data set

## Create a broad overview of the data set
skim(house_sales)
Data summary
Name house_sales
Number of rows 1460
Number of columns 15
_______________________
Column type frequency:
character 4
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Street 0 1 4 4 0 2 0
CentralAir 0 1 1 1 0 2 0
SaleType 0 1 2 5 0 9 0
SaleCondition 0 1 6 7 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
LotFrontage 259 0.82 70.05 24.28 21 59.00 69.0 80.00 313 ▇▃▁▁▁
LotArea 0 1.00 10516.83 9981.26 1300 7553.50 9478.5 11601.50 215245 ▇▁▁▁▁
YearBuilt 0 1.00 1971.27 30.20 1872 1954.00 1973.0 2000.00 2010 ▁▂▃▆▇
TotalBsmtSF 0 1.00 1057.43 438.71 0 795.75 991.5 1298.25 6110 ▇▃▁▁▁
BedroomAbvGr 0 1.00 2.87 0.82 0 2.00 3.0 3.00 8 ▁▇▂▁▁
KitchenAbvGr 0 1.00 1.05 0.22 0 1.00 1.0 1.00 3 ▁▇▁▁▁
TotRmsAbvGrd 0 1.00 6.52 1.63 2 5.00 6.0 7.00 14 ▂▇▇▁▁
GarageCars 0 1.00 1.77 0.75 0 1.00 2.0 2.00 4 ▁▃▇▂▁
GarageArea 0 1.00 472.98 213.80 0 334.50 480.0 576.00 1418 ▂▇▃▁▁
YrSold 0 1.00 2007.82 1.33 2006 2007.00 2008.0 2009.00 2010 ▇▇▇▇▅
SalePrice 0 1.00 180921.20 79442.50 34900 129975.00 163000.0 214000.00 755000 ▇▅▁▁▁
## Drop the missing values in the LotFrontage variable

house_sales <- house_sales %>%
  drop_na(LotFrontage)


## Build the multiple regression model 

house_reg <- lm(SalePrice ~ ., data = house_sales)

## Check the summary of the multiple regression model

summary(house_reg)
## 
## Call:
## lm(formula = SalePrice ~ ., data = house_sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -483816  -23581   -2878   19681  394505 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -3.539e+06  2.014e+06  -1.758  0.07907 .  
## LotFrontage          -2.056e+01  6.546e+01  -0.314  0.75347    
## LotArea               8.656e-01  1.868e-01   4.635 3.98e-06 ***
## YearBuilt             4.636e+02  5.583e+01   8.305 2.73e-16 ***
## TotalBsmtSF           4.726e+01  3.779e+00  12.505  < 2e-16 ***
## BedroomAbvGr         -1.334e+04  2.299e+03  -5.800 8.52e-09 ***
## KitchenAbvGr         -5.403e+04  6.326e+03  -8.542  < 2e-16 ***
## TotRmsAbvGrd          2.267e+04  1.274e+03  17.793  < 2e-16 ***
## StreetPave            4.373e+04  2.101e+04   2.082  0.03760 *  
## CentralAirY           4.828e+03  5.667e+03   0.852  0.39442    
## GarageCars            2.110e+04  3.913e+03   5.393 8.36e-08 ***
## GarageArea            1.587e+01  1.370e+01   1.159  0.24669    
## YrSold                1.280e+03  1.004e+03   1.275  0.20245    
## SaleTypeCon           7.289e+04  3.313e+04   2.200  0.02802 *  
## SaleTypeConLD         1.959e+04  1.852e+04   1.058  0.29037    
## SaleTypeConLI         2.987e+04  2.405e+04   1.242  0.21462    
## SaleTypeConLw         3.950e+04  2.205e+04   1.791  0.07347 .  
## SaleTypeCWD           7.830e+04  2.423e+04   3.231  0.00127 ** 
## SaleTypeNew           7.085e+04  2.837e+04   2.498  0.01264 *  
## SaleTypeOth           3.766e+04  2.772e+04   1.359  0.17455    
## SaleTypeWD            2.762e+04  8.548e+03   3.232  0.00126 ** 
## SaleConditionAdjLand  1.814e+04  2.345e+04   0.774  0.43931    
## SaleConditionAlloca   4.724e+03  1.592e+04   0.297  0.76673    
## SaleConditionFamily  -2.546e+04  1.201e+04  -2.120  0.03422 *  
## SaleConditionNormal   4.647e+03  5.549e+03   0.837  0.40253    
## SaleConditionPartial -1.808e+04  2.717e+04  -0.665  0.50602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45110 on 1175 degrees of freedom
## Multiple R-squared:  0.7135, Adjusted R-squared:  0.7074 
## F-statistic:   117 on 25 and 1175 DF,  p-value: < 2.2e-16
## Perform diagnostic plots of the fitted multiple regression model

## autoplot() returns four panels, each with its own title and axis labels;
## adding labs() with `+` would apply one title and one pair of axis labels to every panel
autoplot(house_reg) +
  theme_minimal()
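
Coefficient names such as StreetPave, CentralAirY and SaleTypeWD in the summary above come from the way lm() handles character columns. Each one is converted to a factor and expanded into 0/1 indicator columns, with the alphabetically first level left out as the reference. That is why SaleType, with 9 unique values, contributes 8 coefficients. The sketch below shows the mechanism on a small made-up data frame; the toy values are hypothetical and are not taken from house_sales_prices.csv.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  SalePrice  = c(200000, 150000, 180000, 120000),
  LotArea    = c(9000, 7000, 8500, 6000),
  CentralAir = c("Y", "N", "Y", "N")
)

# model.matrix() builds the design matrix that lm() would use for this formula
design <- model.matrix(SalePrice ~ ., data = toy)

colnames(design)   # "N" is the reference level, so only CentralAirY appears
design
```

Each indicator coefficient is read as the expected difference in SalePrice relative to the reference level, holding the other predictors fixed.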