This RMarkdown file contains the final report of the data analysis for this project, which builds a linear regression model for the txhousing data set. It covers data exploration, summary statistics, plots, and the predictive model. The final report was completed on Mon Apr 7 16:41:42 2025.
The analysis seeks to answer two questions:
1. Is there a relationship between the number of sales and the total value of sales?
2. How much of the variation in the total value of sales can be explained by the number of sales?
The interest of the analysis is to get a sense of the house sale prices, including total sales value and the number of sales, across different cities.
To achieve the aim of this project, we will use the txhousing data set:
* It contains information about the housing market in Texas, provided by the TAMU Real Estate Center (https://www.recenter.tamu.edu/).
* It has 8,602 rows and 9 variables.
* We will work mainly with two variables: “sales”, the number of sales, as the x-variable, and “volume”, the total value of sales, as the y-variable.
In this task, you will import the required packages and data for this project
## Importing required packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(ggpubr)
library(broom)
library(ggfortify)
library(skimr)
## Import the built-in R data set
data("txhousing")
## View and check the dimension of the data set
View(txhousing)
dim(txhousing)
## [1] 8602 9
## Check the column names for the data set
names(txhousing)
## [1] "city" "year" "month" "sales" "volume" "median"
## [7] "listings" "inventory" "date"
In this task, you will learn how to explore and clean the data using R functions
## Take a peek using the head and tail functions
head(txhousing)
## # A tibble: 6 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
tail(txhousing)
## # A tibble: 6 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Wichita Falls 2015 2 100 11646765 94000 795 6.8 2015.
## 2 Wichita Falls 2015 3 152 16716584 89200 818 6.8 2015.
## 3 Wichita Falls 2015 4 129 15482194 105300 760 6.4 2015.
## 4 Wichita Falls 2015 5 174 19188181 100000 776 6.4 2015.
## 5 Wichita Falls 2015 6 143 18820752 118800 770 6.2 2015.
## 6 Wichita Falls 2015 7 172 23850905 116700 811 6.5 2016.
## Check the internal structure of the data frame
glimpse(txhousing)
## Rows: 8,602
## Columns: 9
## $ city <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abil…
## $ year <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
## $ month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, …
## $ sales <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100, 92, 75, …
## $ volume <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 13910000, 1263…
## $ median <dbl> 71400, 58700, 58100, 68600, 67300, 66900, 73500, 75000, 6450…
## $ listings <dbl> 701, 746, 784, 785, 794, 780, 742, 765, 771, 764, 721, 658, …
## $ inventory <dbl> 6.3, 6.6, 6.8, 6.9, 6.8, 6.6, 6.2, 6.4, 6.5, 6.6, 6.2, 5.7, …
## $ date <dbl> 2000.000, 2000.083, 2000.167, 2000.250, 2000.333, 2000.417, …
## Create a broad overview of a data set
summary(txhousing)
## city year month sales
## Length:8602 Min. :2000 Min. : 1.000 Min. : 6.0
## Class :character 1st Qu.:2003 1st Qu.: 3.000 1st Qu.: 86.0
## Mode :character Median :2007 Median : 6.000 Median : 169.0
## Mean :2007 Mean : 6.406 Mean : 549.6
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.: 467.0
## Max. :2015 Max. :12.000 Max. :8945.0
## NA's :568
## volume median listings inventory
## Min. :8.350e+05 Min. : 50000 Min. : 0 Min. : 0.000
## 1st Qu.:1.084e+07 1st Qu.:100000 1st Qu.: 682 1st Qu.: 4.900
## Median :2.299e+07 Median :123800 Median : 1283 Median : 6.200
## Mean :1.069e+08 Mean :128131 Mean : 3217 Mean : 7.175
## 3rd Qu.:7.512e+07 3rd Qu.:150000 3rd Qu.: 2954 3rd Qu.: 8.150
## Max. :2.568e+09 Max. :304200 Max. :43107 Max. :55.900
## NA's :568 NA's :616 NA's :1424 NA's :1467
## date
## Min. :2000
## 1st Qu.:2004
## Median :2008
## Mean :2008
## 3rd Qu.:2012
## Max. :2016
##
skim(txhousing)
| Name | txhousing |
| Number of rows | 8602 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| city | 0 | 1 | 4 | 21 | 0 | 46 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2007.30 | 4.50 | 2000 | 2003.00 | 2007.00 | 2011.00 | 2015.0 | ▇▆▆▆▅ |
| month | 0 | 1.00 | 6.41 | 3.44 | 1 | 3.00 | 6.00 | 9.00 | 12.0 | ▇▅▅▅▇ |
| sales | 568 | 0.93 | 549.56 | 1110.74 | 6 | 86.00 | 169.00 | 467.00 | 8945.0 | ▇▁▁▁▁ |
| volume | 568 | 0.93 | 106858620.78 | 244933668.97 | 835000 | 10840000.00 | 22986824.00 | 75121388.75 | 2568156780.0 | ▇▁▁▁▁ |
| median | 616 | 0.93 | 128131.44 | 37359.58 | 50000 | 100000.00 | 123800.00 | 150000.00 | 304200.0 | ▅▇▃▁▁ |
| listings | 1424 | 0.83 | 3216.90 | 5968.33 | 0 | 682.00 | 1283.00 | 2953.75 | 43107.0 | ▇▁▁▁▁ |
| inventory | 1467 | 0.83 | 7.17 | 4.61 | 0 | 4.90 | 6.20 | 8.15 | 55.9 | ▇▁▁▁▁ |
| date | 0 | 1.00 | 2007.75 | 4.50 | 2000 | 2003.83 | 2007.75 | 2011.67 | 2015.5 | ▇▇▇▇▇ |
## Drop the missing values in sales, volume, median
tx_data <- txhousing %>%
drop_na(sales, volume, median)
## Create the age variable
tx_data$age <- 2023 - tx_data$year
## Create a broad overview of a data set
skim(tx_data)
| Name | tx_data |
| Number of rows | 7985 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| city | 0 | 1 | 4 | 21 | 0 | 46 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2007.63 | 4.44 | 2000 | 2004.00 | 2008.00 | 2011.00 | 2015.0 | ▇▇▇▇▆ |
| month | 0 | 1.00 | 6.40 | 3.44 | 1 | 3.00 | 6.00 | 9.00 | 12.0 | ▇▆▅▅▇ |
| sales | 0 | 1.00 | 552.40 | 1113.54 | 6 | 87.00 | 170.00 | 470.00 | 8945.0 | ▇▁▁▁▁ |
| volume | 0 | 1.00 | 107448783.10 | 245566897.28 | 835000 | 10904804.00 | 23170000.00 | 76002315.00 | 2568156780.0 | ▇▁▁▁▁ |
| median | 0 | 1.00 | 128135.28 | 37360.34 | 50000 | 100000.00 | 123800.00 | 150000.00 | 304200.0 | ▅▇▃▁▁ |
| listings | 817 | 0.90 | 3220.01 | 5971.82 | 0 | 682.00 | 1286.00 | 2954.25 | 43107.0 | ▇▁▁▁▁ |
| inventory | 859 | 0.89 | 7.17 | 4.61 | 0 | 4.90 | 6.20 | 8.10 | 55.9 | ▇▁▁▁▁ |
| date | 0 | 1.00 | 2008.08 | 4.44 | 2000 | 2004.33 | 2008.25 | 2011.92 | 2015.5 | ▆▇▇▇▇ |
| age | 0 | 1.00 | 15.37 | 4.44 | 8 | 12.00 | 15.00 | 19.00 | 23.0 | ▇▇▆▆▆ |
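The cleaning step above can be checked directly. The following is a minimal sketch, assuming the `ggplot2` and `tidyr` packages are installed; `na_counts` and `rows_dropped` are names introduced here for illustration. It counts the missing values per modelling variable and the number of rows removed, which should line up with the drop from 8,602 to 7,985 rows reported by `skim()`.

```r
library(ggplot2)  # ships the txhousing data set
library(tidyr)    # provides drop_na()

# Count missing values per column before cleaning
na_counts <- colSums(is.na(txhousing))
print(na_counts[c("sales", "volume", "median")])

# Keep only rows where all three modelling variables are present
tx_data <- drop_na(txhousing, sales, volume, median)

# Number of rows removed by the filter
rows_dropped <- nrow(txhousing) - nrow(tx_data)
print(rows_dropped)
```

Because a row is dropped when any of the three variables is missing, the number of rows removed can be larger than any single column's missing count.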
In this task, you will learn how to create a scatter plot to visualize the variables for model building
## Find the correlation between the variables
cor(tx_data$sales, tx_data$volume)
## [1] 0.981039
## Plot a scatter plot for the variables with sales on the x-axis
## volume on the y-axis
ggplot(tx_data, aes(x = sales, y = volume)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
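The correlation above already speaks to the second research question: in a regression with a single predictor, the share of variance explained (R-squared) equals the squared Pearson correlation, so a correlation of about 0.981 corresponds to roughly 0.962. The sketch below illustrates this under the same cleaning steps; `fit` and `r_squared` are names introduced here, not part of the original analysis.

```r
library(ggplot2)  # ships the txhousing data set
library(tidyr)    # provides drop_na()

tx_data <- drop_na(txhousing, sales, volume, median)

# Pearson correlation between the two variables
r <- cor(tx_data$sales, tx_data$volume)

# R-squared from a one-predictor linear model
fit <- lm(volume ~ sales, data = tx_data)
r_squared <- summary(fit)$r.squared

# For a one-predictor model these two numbers agree
print(c(r_squared_from_cor = r^2, r_squared_from_lm = r_squared))
```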
Data Source: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
## Import the house sales data
house_sales <- read_csv("house_sales_prices.csv")
## Rows: 1460 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Street, CentralAir, SaleType, SaleCondition
## dbl (11): LotFrontage, LotArea, YearBuilt, TotalBsmtSF, BedroomAbvGr, Kitche...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Get a glimpse of the data set
glimpse(house_sales)
## Rows: 1,460
## Columns: 15
## $ LotFrontage <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, …
## $ LotArea <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612…
## $ YearBuilt <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19…
## $ TotalBsmtSF <dbl> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10…
## $ BedroomAbvGr <dbl> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,…
## $ KitchenAbvGr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ TotRmsAbvGrd <dbl> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6…
## $ Street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
## $ CentralAir <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ GarageCars <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,…
## $ GarageArea <dbl> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7…
## $ YrSold <dbl> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20…
## $ SaleType <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W…
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal", "Norm…
## $ SalePrice <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000, …
## Create a broad overview of a data set
summary(house_sales)
## LotFrontage LotArea YearBuilt TotalBsmtSF
## Min. : 21.00 Min. : 1300 Min. :1872 Min. : 0.0
## 1st Qu.: 59.00 1st Qu.: 7554 1st Qu.:1954 1st Qu.: 795.8
## Median : 69.00 Median : 9478 Median :1973 Median : 991.5
## Mean : 70.05 Mean : 10517 Mean :1971 Mean :1057.4
## 3rd Qu.: 80.00 3rd Qu.: 11602 3rd Qu.:2000 3rd Qu.:1298.2
## Max. :313.00 Max. :215245 Max. :2010 Max. :6110.0
## NA's :259
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Street
## Min. :0.000 Min. :0.000 Min. : 2.000 Length:1460
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000 Class :character
## Median :3.000 Median :1.000 Median : 6.000 Mode :character
## Mean :2.866 Mean :1.047 Mean : 6.518
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :8.000 Max. :3.000 Max. :14.000
##
## CentralAir GarageCars GarageArea YrSold
## Length:1460 Min. :0.000 Min. : 0.0 Min. :2006
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 1st Qu.:2007
## Mode :character Median :2.000 Median : 480.0 Median :2008
## Mean :1.767 Mean : 473.0 Mean :2008
## 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:2009
## Max. :4.000 Max. :1418.0 Max. :2010
##
## SaleType SaleCondition SalePrice
## Length:1460 Length:1460 Min. : 34900
## Class :character Class :character 1st Qu.:129975
## Mode :character Mode :character Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
skim(house_sales)
| Name | house_sales |
| Number of rows | 1460 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Street | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
| CentralAir | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| SaleType | 0 | 1 | 2 | 5 | 0 | 9 | 0 |
| SaleCondition | 0 | 1 | 6 | 7 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| LotFrontage | 259 | 0.82 | 70.05 | 24.28 | 21 | 59.00 | 69.0 | 80.00 | 313 | ▇▃▁▁▁ |
| LotArea | 0 | 1.00 | 10516.83 | 9981.26 | 1300 | 7553.50 | 9478.5 | 11601.50 | 215245 | ▇▁▁▁▁ |
| YearBuilt | 0 | 1.00 | 1971.27 | 30.20 | 1872 | 1954.00 | 1973.0 | 2000.00 | 2010 | ▁▂▃▆▇ |
| TotalBsmtSF | 0 | 1.00 | 1057.43 | 438.71 | 0 | 795.75 | 991.5 | 1298.25 | 6110 | ▇▃▁▁▁ |
| BedroomAbvGr | 0 | 1.00 | 2.87 | 0.82 | 0 | 2.00 | 3.0 | 3.00 | 8 | ▁▇▂▁▁ |
| KitchenAbvGr | 0 | 1.00 | 1.05 | 0.22 | 0 | 1.00 | 1.0 | 1.00 | 3 | ▁▇▁▁▁ |
| TotRmsAbvGrd | 0 | 1.00 | 6.52 | 1.63 | 2 | 5.00 | 6.0 | 7.00 | 14 | ▂▇▇▁▁ |
| GarageCars | 0 | 1.00 | 1.77 | 0.75 | 0 | 1.00 | 2.0 | 2.00 | 4 | ▁▃▇▂▁ |
| GarageArea | 0 | 1.00 | 472.98 | 213.80 | 0 | 334.50 | 480.0 | 576.00 | 1418 | ▂▇▃▁▁ |
| YrSold | 0 | 1.00 | 2007.82 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 | ▇▇▇▇▅ |
| SalePrice | 0 | 1.00 | 180921.20 | 79442.50 | 34900 | 129975.00 | 163000.0 | 214000.00 | 755000 | ▇▅▁▁▁ |
In this task, you will build a simple linear regression with one dependent and one independent variable and interpret the results
The linear model equation can be written as follows: volume = b0 + b1 * sales
## Create a simple linear regression model using the variables
simple_model <- lm(volume ~ sales, data = tx_data)
summary(simple_model)
##
## Call:
## lm(formula = volume ~ sales, data = tx_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -323981922 -7019959 1652053 7709296 674388390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.206e+07 5.946e+05 -20.28 <2e-16 ***
## sales 2.163e+05 4.784e+02 452.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47600000 on 7983 degrees of freedom
## Multiple R-squared: 0.9624, Adjusted R-squared: 0.9624
## F-statistic: 2.045e+05 on 1 and 7983 DF, p-value: < 2.2e-16
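The coefficient estimates in the summary can be reproduced from the closed-form least-squares formulas for a single predictor: the slope is cov(x, y) / var(x), and the intercept is mean(y) - slope * mean(x). This is a sketch for illustration; `x`, `y`, `b0`, `b1`, and `fit` are names introduced here.

```r
library(ggplot2)  # ships the txhousing data set
library(tidyr)    # provides drop_na()

tx_data <- drop_na(txhousing, sales, volume, median)

x <- tx_data$sales
y <- tx_data$volume

# Closed-form ordinary least squares estimates for one predictor
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# Compare with the estimates returned by lm()
fit <- lm(volume ~ sales, data = tx_data)
print(rbind(manual = c(b0, b1), lm = unname(coef(fit))))
```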
## Plot the regression line for the model
ggplot(tx_data, aes(x = sales, y = volume)) +
geom_point() +
stat_smooth(method = lm) +
labs(title = "Regression of Volume on Sales for Housing Sales in TX",
x = "Sales", y = "Volume") +
scale_y_continuous(labels = scales::comma)
## `geom_smooth()` using formula = 'y ~ x'
From the output above:
The estimated regression line equation can be written as follows: volume = -12060071 + 216346*sales
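The intercept (b0) is -12060071. It is the predicted total value of sales when the number of sales is zero. A negative total value is not meaningful in practice: zero sales lies outside the observed data (the minimum number of sales is 6), so the intercept should be read as a fitting constant rather than as a loss.
The regression coefficient for the variable sales (b1) is 216346: for every one-unit increase in the number of sales, the predicted total value of sales (volume) increases by about 216346.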
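The interpretation of the slope can be verified numerically: two predictions that differ by exactly one sale differ by b1. The sketch below also prints the observed range of sales, which shows why the intercept is an extrapolation. The names `fit`, `b`, and `pred` are introduced here for illustration.

```r
library(ggplot2)  # ships the txhousing data set
library(tidyr)    # provides drop_na()

tx_data <- drop_na(txhousing, sales, volume, median)
fit <- lm(volume ~ sales, data = tx_data)
b <- coef(fit)

# Predictions for 100 and 101 sales differ by exactly the slope
pred <- predict(fit, newdata = data.frame(sales = c(100, 101)))
print(diff(pred))
print(b["sales"])

# Zero sales is outside the observed range, so the intercept is an extrapolation
print(range(tx_data$sales))
```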
In this task, you will use diagnostic plots to check whether the assumptions of linear regression model are satisfied
Assumptions:
Linear regression makes several assumptions about the data:
* Linearity of the data: the relationship between the predictor (x) and the outcome (y) is assumed to be linear.
* Normality of residuals: the residual errors are assumed to be normally distributed.
* Homogeneity of residual variance: the residuals are assumed to have a constant variance (homoscedasticity).
* Independence of the residual error terms.
## Plotting the fitted model
plot(simple_model)
## Return individual diagnostic plots for the model (uncomment as needed)
#plot(simple_model, which = 1)
#plot(simple_model, which = 2)
#plot(simple_model, which = 3)
## Create all four plots at once
autoplot(simple_model) +
labs(title = "Diagnostic plots for the fitted model",
x = "Fitted values", y = "Residuals") +
theme_minimal()
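The visual diagnostics can be complemented with formal tests from the `lmtest` package, which this report loads. The sketch below is one possible approach, not part of the original analysis: `bptest()` runs a Breusch-Pagan test (null hypothesis: constant residual variance) and `dwtest()` runs a Durbin-Watson test (null hypothesis: no first-order autocorrelation in the residuals). The names `fit`, `bp`, and `dw` are introduced here.

```r
library(ggplot2)  # ships the txhousing data set
library(tidyr)    # provides drop_na()
library(lmtest)   # provides bptest() and dwtest()

tx_data <- drop_na(txhousing, sales, volume, median)
fit <- lm(volume ~ sales, data = tx_data)

# Breusch-Pagan test: a small p-value suggests non-constant variance
bp <- bptest(fit)
print(bp)

# Durbin-Watson test: a small p-value suggests autocorrelated residuals
dw <- dwtest(fit)
print(dw)
```

The Durbin-Watson statistic depends on row order. The rows of txhousing are ordered by city and then by date, so the test compares consecutive months within a city for most pairs of rows. With close to 8,000 observations, even small departures from the assumptions can produce small p-values, so the plots remain useful for judging how large a violation is.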
In this task, you will learn how to assess how well the model fits and the significance of the predictor variable.
## Assess the summary of the fitted model
summary(simple_model)
##
## Call:
## lm(formula = volume ~ sales, data = tx_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -323981922 -7019959 1652053 7709296 674388390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.206e+07 5.946e+05 -20.28 <2e-16 ***
## sales 2.163e+05 4.784e+02 452.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47600000 on 7983 degrees of freedom
## Multiple R-squared: 0.9624, Adjusted R-squared: 0.9624
## F-statistic: 2.045e+05 on 1 and 7983 DF, p-value: < 2.2e-16
## Calculate the confidence interval for the coefficients
confint(simple_model)
## 2.5 % 97.5 %
## (Intercept) -13225619.6 -10894523.0
## sales 215408.6 217284.1
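The intervals returned by `confint()` follow the usual formula: estimate ± t-critical value × standard error, with the residual degrees of freedom of the model. The sketch below rebuilds them by hand; `est`, `se`, `t_crit`, and `manual_ci` are names introduced here for illustration.

```r
library(ggplot2)  # ships the txhousing data set
library(tidyr)    # provides drop_na()

tx_data <- drop_na(txhousing, sales, volume, median)
fit <- lm(volume ~ sales, data = tx_data)

# Estimates and standard errors from the coefficient table
coef_table <- coef(summary(fit))
est <- coef_table[, "Estimate"]
se  <- coef_table[, "Std. Error"]

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df = df.residual(fit))

manual_ci <- cbind(lower = est - t_crit * se,
                   upper = est + t_crit * se)
print(manual_ci)
print(confint(fit))
```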
In this task, you will build a simple linear regression model using a square root or log transformation. The model below applies a log10 transformation to the dependent variable, volume.
## Build a log transformed regression model
log_model <- lm(log10(volume) ~ sales, data = tx_data)
## Return the summary of the model
summary(log_model)
##
## Call:
## lm(formula = log10(volume) ~ sales, data = tx_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.70214 -0.28613 -0.00098 0.33364 0.85241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.249e+00 5.198e-03 1394.6 <2e-16 ***
## sales 4.296e-04 4.182e-06 102.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4161 on 7983 degrees of freedom
## Multiple R-squared: 0.5693, Adjusted R-squared: 0.5693
## F-statistic: 1.055e+04 on 1 and 7983 DF, p-value: < 2.2e-16
## Return individual diagnostic plots for the model (uncomment as needed)
#plot(log_model, which = 1)
#plot(log_model, which = 2)
#plot(log_model, which = 3)
confint(log_model)
## 2.5 % 97.5 %
## (Intercept) 7.2392377311 7.2596178191
## sales 0.0004214485 0.0004378449
autoplot(log_model) +
labs(title = "Diagnostic plots for the fitted model",
x = "Fitted values", y = "Residuals") +
theme_minimal()
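Because this model predicts log10(volume), its predictions have to be raised to the power of 10 to return to the original scale, and its slope describes a multiplicative change in volume per additional sale. The sketch below illustrates both points; `log_fit`, `new_sales`, `pred_volume`, and `multiplier` are names introduced here, and the sales values are the ones used for prediction elsewhere in this report.

```r
library(ggplot2)  # ships the txhousing data set
library(tidyr)    # provides drop_na()

tx_data <- drop_na(txhousing, sales, volume, median)
log_fit <- lm(log10(volume) ~ sales, data = tx_data)

new_sales <- data.frame(sales = c(210, 27, 140))

# Predictions on the log10 scale, then back-transformed to the volume scale
pred_log10  <- predict(log_fit, newdata = new_sales)
pred_volume <- 10^pred_log10
print(pred_volume)

# Each additional sale multiplies the predicted volume by this factor
multiplier <- 10^coef(log_fit)["sales"]
print(multiplier)
```

Back-transformed predictions are always positive, unlike those of the untransformed model. The R-squared of this model is computed on the log10 scale, so it is not directly comparable with the R-squared of the model fitted to volume itself.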
In this task, you will learn how to check for metrics from the fitted model and make predictions given new values of the independent variable
## Find the fitted values of the simple regression model
fitted_values <- predict(simple_model)
#head(fitted_values, 3)
## Return the model metrics
model_metrics <- augment(simple_model)
#model_metrics
## Predict new values using the model
predict(simple_model, newdata = data.frame(sales = c(210, 27, 140)))
## 1 2 3
## 33372661 -6218720 18228417
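Point predictions carry uncertainty, and the negative value predicted for 27 sales shows that the straight line is unreliable at low sales counts. One way to quantify the uncertainty is to request prediction intervals from `predict()`. This is a sketch, with `fit`, `new_sales`, and `pred_int` introduced here for illustration.

```r
library(ggplot2)  # ships the txhousing data set
library(tidyr)    # provides drop_na()

tx_data <- drop_na(txhousing, sales, volume, median)
fit <- lm(volume ~ sales, data = tx_data)

new_sales <- data.frame(sales = c(210, 27, 140))

# 95% prediction intervals: columns are fit, lwr, upr
pred_int <- predict(fit, newdata = new_sales,
                    interval = "prediction", level = 0.95)
print(cbind(new_sales, pred_int))
```

A prediction interval covers a single new observation and is therefore wider than a confidence interval for the mean response, which `interval = "confidence"` would return.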
In this task, you will build a multiple regression model with one dependent variable and three independent variables and interpret the results
## Build the multiple regression model with volume as the y variable and sales, median and age as the x variables
multiple_reg <- lm(volume ~ sales + median + age, data = tx_data)
## This prints the result of the model
multiple_reg
##
## Call:
## lm(formula = volume ~ sales + median + age, data = tx_data)
##
## Coefficients:
## (Intercept) sales median age
## -3.262e+07 2.117e+05 3.992e+02 -1.823e+06
## Check the summary of the multiple regression model
summary(multiple_reg)
##
## Call:
## lm(formula = volume ~ sales + median + age, data = tx_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -282127704 -13143251 -595883 13529089 660234903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.262e+07 3.515e+06 -9.281 <2e-16 ***
## sales 2.117e+05 4.747e+02 445.937 <2e-16 ***
## median 3.992e+02 1.633e+01 24.445 <2e-16 ***
## age -1.823e+06 1.289e+05 -14.142 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43410000 on 7981 degrees of freedom
## Multiple R-squared: 0.9688, Adjusted R-squared: 0.9688
## F-statistic: 8.252e+04 on 3 and 7981 DF, p-value: < 2.2e-16
## Plot the fitted multiple regression model
autoplot(multiple_reg)
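Since the simple model is nested inside the multiple regression model, the two can be compared with an F-test using `anova()`, alongside their adjusted R-squared values. The sketch below is one possible comparison and not part of the original analysis; `simple_fit`, `multi_fit`, and `cmp` are names introduced here.

```r
library(ggplot2)  # ships the txhousing data set
library(tidyr)    # provides drop_na()

tx_data <- drop_na(txhousing, sales, volume, median)
tx_data$age <- 2023 - tx_data$year

simple_fit <- lm(volume ~ sales, data = tx_data)
multi_fit  <- lm(volume ~ sales + median + age, data = tx_data)

# F-test: do median and age jointly improve on the sales-only model?
cmp <- anova(simple_fit, multi_fit)
print(cmp)

# Adjusted R-squared penalises the extra predictors
print(c(simple   = summary(simple_fit)$adj.r.squared,
        multiple = summary(multi_fit)$adj.r.squared))
```

Both models are fitted to the same rows, which `anova()` requires, because the rows with missing sales, volume, or median were dropped beforehand.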
In this task, you will build a model from scratch to predict house prices using variables in a data set
## Create a broad overview of the data set
skim(house_sales)
| Name | house_sales |
| Number of rows | 1460 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Street | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
| CentralAir | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| SaleType | 0 | 1 | 2 | 5 | 0 | 9 | 0 |
| SaleCondition | 0 | 1 | 6 | 7 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| LotFrontage | 259 | 0.82 | 70.05 | 24.28 | 21 | 59.00 | 69.0 | 80.00 | 313 | ▇▃▁▁▁ |
| LotArea | 0 | 1.00 | 10516.83 | 9981.26 | 1300 | 7553.50 | 9478.5 | 11601.50 | 215245 | ▇▁▁▁▁ |
| YearBuilt | 0 | 1.00 | 1971.27 | 30.20 | 1872 | 1954.00 | 1973.0 | 2000.00 | 2010 | ▁▂▃▆▇ |
| TotalBsmtSF | 0 | 1.00 | 1057.43 | 438.71 | 0 | 795.75 | 991.5 | 1298.25 | 6110 | ▇▃▁▁▁ |
| BedroomAbvGr | 0 | 1.00 | 2.87 | 0.82 | 0 | 2.00 | 3.0 | 3.00 | 8 | ▁▇▂▁▁ |
| KitchenAbvGr | 0 | 1.00 | 1.05 | 0.22 | 0 | 1.00 | 1.0 | 1.00 | 3 | ▁▇▁▁▁ |
| TotRmsAbvGrd | 0 | 1.00 | 6.52 | 1.63 | 2 | 5.00 | 6.0 | 7.00 | 14 | ▂▇▇▁▁ |
| GarageCars | 0 | 1.00 | 1.77 | 0.75 | 0 | 1.00 | 2.0 | 2.00 | 4 | ▁▃▇▂▁ |
| GarageArea | 0 | 1.00 | 472.98 | 213.80 | 0 | 334.50 | 480.0 | 576.00 | 1418 | ▂▇▃▁▁ |
| YrSold | 0 | 1.00 | 2007.82 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 | ▇▇▇▇▅ |
| SalePrice | 0 | 1.00 | 180921.20 | 79442.50 | 34900 | 129975.00 | 163000.0 | 214000.00 | 755000 | ▇▅▁▁▁ |
## Drop the missing values in the LotFrontage variable
house_sales <- house_sales %>%
drop_na(LotFrontage)
## Build the multiple regression model
house_reg <- lm(SalePrice ~ ., data = house_sales)
## Check the summary of the multiple regression model
summary(house_reg)
##
## Call:
## lm(formula = SalePrice ~ ., data = house_sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -483816 -23581 -2878 19681 394505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.539e+06 2.014e+06 -1.758 0.07907 .
## LotFrontage -2.056e+01 6.546e+01 -0.314 0.75347
## LotArea 8.656e-01 1.868e-01 4.635 3.98e-06 ***
## YearBuilt 4.636e+02 5.583e+01 8.305 2.73e-16 ***
## TotalBsmtSF 4.726e+01 3.779e+00 12.505 < 2e-16 ***
## BedroomAbvGr -1.334e+04 2.299e+03 -5.800 8.52e-09 ***
## KitchenAbvGr -5.403e+04 6.326e+03 -8.542 < 2e-16 ***
## TotRmsAbvGrd 2.267e+04 1.274e+03 17.793 < 2e-16 ***
## StreetPave 4.373e+04 2.101e+04 2.082 0.03760 *
## CentralAirY 4.828e+03 5.667e+03 0.852 0.39442
## GarageCars 2.110e+04 3.913e+03 5.393 8.36e-08 ***
## GarageArea 1.587e+01 1.370e+01 1.159 0.24669
## YrSold 1.280e+03 1.004e+03 1.275 0.20245
## SaleTypeCon 7.289e+04 3.313e+04 2.200 0.02802 *
## SaleTypeConLD 1.959e+04 1.852e+04 1.058 0.29037
## SaleTypeConLI 2.987e+04 2.405e+04 1.242 0.21462
## SaleTypeConLw 3.950e+04 2.205e+04 1.791 0.07347 .
## SaleTypeCWD 7.830e+04 2.423e+04 3.231 0.00127 **
## SaleTypeNew 7.085e+04 2.837e+04 2.498 0.01264 *
## SaleTypeOth 3.766e+04 2.772e+04 1.359 0.17455
## SaleTypeWD 2.762e+04 8.548e+03 3.232 0.00126 **
## SaleConditionAdjLand 1.814e+04 2.345e+04 0.774 0.43931
## SaleConditionAlloca 4.724e+03 1.592e+04 0.297 0.76673
## SaleConditionFamily -2.546e+04 1.201e+04 -2.120 0.03422 *
## SaleConditionNormal 4.647e+03 5.549e+03 0.837 0.40253
## SaleConditionPartial -1.808e+04 2.717e+04 -0.665 0.50602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45110 on 1175 degrees of freedom
## Multiple R-squared: 0.7135, Adjusted R-squared: 0.7074
## F-statistic: 117 on 25 and 1175 DF, p-value: < 2.2e-16
## Perform diagnostic plots of the fitted multiple regression model
autoplot(house_reg) +
labs(title = "Diagnostic plots for the fitted model",
x = "Fitted values", y = "Residuals") +
theme_minimal()
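Coefficient names such as StreetPave and SaleTypeCon in the summary come from how `lm()` handles character columns: each is treated as a factor and expanded into indicator (dummy) columns, with the alphabetically first level used as the reference. The sketch below shows this on a small made-up data frame; `toy` and all of its values, including the "Grvl" level, are hypothetical and are not taken from the house sales file.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  SalePrice = c(120000, 150000, 180000, 210000, 160000, 200000),
  LotArea   = c(7000, 8000, 9500, 11000, 8500, 10000),
  Street    = c("Grvl", "Pave", "Pave", "Pave", "Grvl", "Pave")
)

# The design matrix that lm(SalePrice ~ ., data = toy) would use
X <- model.matrix(SalePrice ~ ., data = toy)
print(colnames(X))
print(X)
```

The StreetPave column is 1 for rows with a "Pave" street and 0 otherwise, so its coefficient is the estimated difference from the reference level, holding the other predictors fixed.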