Predictive Analysis With Regression

About Report

This RMarkdown file contains the final report of the data analysis done for this project on building a linear regression model for the txhousing data set. It contains analysis such as data exploration, computing summary statistics, creating plots, and building the predictive model. The final report was completed on Mon Apr 7 16:41:42 2025.

The analysis seeks to answer two questions:
1. Is there a relationship between the number of sales and the total value of sales?
2. How much of the variation in the total value of sales can be explained by the number of sales?

The interest of the analysis is to get a sense of the house sale prices, including total sales value and the number of sales, across different cities.

To achieve the aim of this project, we will use the txhousing data set:
* It contains information about the housing market in Texas provided by the TAMU real estate center (https://www.recenter.tamu.edu/).
* It has 8,602 rows and 9 variables.
* We will work mainly with two variables: “sales”, the number of sales, as the x-variable, and “volume”, the total value of sales, as the y-variable.

Task One: Import packages and dataset

In this task, you will import the required packages and data for this project

## Importing required packages
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lmtest)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(ggpubr)
library(broom)
library(ggfortify)
library(skimr)

## Import the built-in R data set
data("txhousing")

## View and check the dimension of the data set
View(txhousing)
dim(txhousing)

## [1] 8602    9

## Check the column names for the data set
names(txhousing)

## [1] "city"      "year"      "month"     "sales"     "volume"    "median"   
## [7] "listings"  "inventory" "date"

Task Two: Use R functions to describe the data

In this task, you will learn how to explore and clean the data using R functions

## Take a peek using the head and tail functions
head(txhousing)

## # A tibble: 6 × 9
##   city     year month sales   volume median listings inventory  date
##   <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
## 1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
## 2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
## 3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
## 4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
## 5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
## 6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.

tail(txhousing)

## # A tibble: 6 × 9
##   city           year month sales   volume median listings inventory  date
##   <chr>         <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
## 1 Wichita Falls  2015     2   100 11646765  94000      795       6.8 2015.
## 2 Wichita Falls  2015     3   152 16716584  89200      818       6.8 2015.
## 3 Wichita Falls  2015     4   129 15482194 105300      760       6.4 2015.
## 4 Wichita Falls  2015     5   174 19188181 100000      776       6.4 2015.
## 5 Wichita Falls  2015     6   143 18820752 118800      770       6.2 2015.
## 6 Wichita Falls  2015     7   172 23850905 116700      811       6.5 2016.

## Check the internal structure of the data frame
glimpse(txhousing)

## Rows: 8,602
## Columns: 9
## $ city      <chr> "Abilene", "Abilene", "Abilene", "Abilene", "Abilene", "Abil…
## $ year      <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, …
## $ month     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, …
## $ sales     <dbl> 72, 98, 130, 98, 141, 156, 152, 131, 104, 101, 100, 92, 75, …
## $ volume    <dbl> 5380000, 6505000, 9285000, 9730000, 10590000, 13910000, 1263…
## $ median    <dbl> 71400, 58700, 58100, 68600, 67300, 66900, 73500, 75000, 6450…
## $ listings  <dbl> 701, 746, 784, 785, 794, 780, 742, 765, 771, 764, 721, 658, …
## $ inventory <dbl> 6.3, 6.6, 6.8, 6.9, 6.8, 6.6, 6.2, 6.4, 6.5, 6.6, 6.2, 5.7, …
## $ date      <dbl> 2000.000, 2000.083, 2000.167, 2000.250, 2000.333, 2000.417, …

## Create a broad overview of a data set
summary(txhousing)

##      city                year          month            sales       
##  Length:8602        Min.   :2000   Min.   : 1.000   Min.   :   6.0  
##  Class :character   1st Qu.:2003   1st Qu.: 3.000   1st Qu.:  86.0  
##  Mode  :character   Median :2007   Median : 6.000   Median : 169.0  
##                     Mean   :2007   Mean   : 6.406   Mean   : 549.6  
##                     3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.: 467.0  
##                     Max.   :2015   Max.   :12.000   Max.   :8945.0  
##                                                     NA's   :568     
##      volume              median          listings       inventory     
##  Min.   :8.350e+05   Min.   : 50000   Min.   :    0   Min.   : 0.000  
##  1st Qu.:1.084e+07   1st Qu.:100000   1st Qu.:  682   1st Qu.: 4.900  
##  Median :2.299e+07   Median :123800   Median : 1283   Median : 6.200  
##  Mean   :1.069e+08   Mean   :128131   Mean   : 3217   Mean   : 7.175  
##  3rd Qu.:7.512e+07   3rd Qu.:150000   3rd Qu.: 2954   3rd Qu.: 8.150  
##  Max.   :2.568e+09   Max.   :304200   Max.   :43107   Max.   :55.900  
##  NA's   :568         NA's   :616      NA's   :1424    NA's   :1467    
##       date     
##  Min.   :2000  
##  1st Qu.:2004  
##  Median :2008  
##  Mean   :2008  
##  3rd Qu.:2012  
##  Max.   :2016  
##

skim(txhousing)

Data summary
Name	txhousing
Number of rows	8602
Number of columns	9
_______________________
Column type frequency:
character	1
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
city	0	1	4	21	0	46	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2007.30	4.50	2000	2003.00	2007.00	2011.00	2015.0	▇▆▆▆▅
month	0	1.00	6.41	3.44	1	3.00	6.00	9.00	12.0	▇▅▅▅▇
sales	568	0.93	549.56	1110.74	6	86.00	169.00	467.00	8945.0	▇▁▁▁▁
volume	568	0.93	106858620.78	244933668.97	835000	10840000.00	22986824.00	75121388.75	2568156780.0	▇▁▁▁▁
median	616	0.93	128131.44	37359.58	50000	100000.00	123800.00	150000.00	304200.0	▅▇▃▁▁
listings	1424	0.83	3216.90	5968.33	0	682.00	1283.00	2953.75	43107.0	▇▁▁▁▁
inventory	1467	0.83	7.17	4.61	0	4.90	6.20	8.15	55.9	▇▁▁▁▁
date	0	1.00	2007.75	4.50	2000	2003.83	2007.75	2011.67	2015.5	▇▇▇▇▇

## Drop the missing values in sales, volume, median
tx_data <- txhousing %>%
  drop_na(sales, volume, median)

## Create the age variable (years between each record's year and 2023)
tx_data$age <- 2023 - tx_data$year

## Create a broad overview of a data set
skim(tx_data)

Data summary
Name	tx_data
Number of rows	7985
Number of columns	10
_______________________
Column type frequency:
character	1
numeric	9
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
city	0	1	4	21	0	46	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2007.63	4.44	2000	2004.00	2008.00	2011.00	2015.0	▇▇▇▇▆
month	0	1.00	6.40	3.44	1	3.00	6.00	9.00	12.0	▇▆▅▅▇
sales	0	1.00	552.40	1113.54	6	87.00	170.00	470.00	8945.0	▇▁▁▁▁
volume	0	1.00	107448783.10	245566897.28	835000	10904804.00	23170000.00	76002315.00	2568156780.0	▇▁▁▁▁
median	0	1.00	128135.28	37360.34	50000	100000.00	123800.00	150000.00	304200.0	▅▇▃▁▁
listings	817	0.90	3220.01	5971.82	0	682.00	1286.00	2954.25	43107.0	▇▁▁▁▁
inventory	859	0.89	7.17	4.61	0	4.90	6.20	8.10	55.9	▇▁▁▁▁
date	0	1.00	2008.08	4.44	2000	2004.33	2008.25	2011.92	2015.5	▆▇▇▇▇
age	0	1.00	15.37	4.44	8	12.00	15.00	19.00	23.0	▇▇▆▆▆
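
The cleaning step in this task drops rows only when one of the named columns is missing, which is why listings and inventory still show missing values in the second skim() output. The sketch below illustrates that behaviour on a small made-up data frame, using base R's complete.cases() as a stand-in for drop_na(); the toy data and the names toy, keep, and toy_clean are assumptions for illustration only.

```r
# Made-up data: one row is complete, the others each miss one value
toy <- data.frame(
  city     = c("A", "B", "C", "D"),
  sales    = c(10, NA, 30, 40),
  volume   = c(100, 200, NA, 400),
  listings = c(5, 6, 7, NA)
)

# Keep rows where sales AND volume are both present;
# listings is not checked, so its NA survives
keep <- complete.cases(toy[, c("sales", "volume")])
toy_clean <- toy[keep, ]

toy_clean
```

Rows B and C are removed, while row D is kept even though its listings value is missing.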

Task Three: Create data visualization using ggplot

In this task, you will learn how to create a scatter plot to visualize the variables for model building

## Find the correlation between the variables
cor(tx_data$sales, tx_data$volume)

## [1] 0.981039

## Plot a scatter plot for the variables with sales on the x-axis
## volume on the y-axis
ggplot(tx_data, aes(x = sales, y = volume)) +
         geom_point() +
         geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Task Four: Load and describe a dataset

Data Source: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

## Import the house sales data
house_sales <- read_csv("house_sales_prices.csv")

## Rows: 1460 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Street, CentralAir, SaleType, SaleCondition
## dbl (11): LotFrontage, LotArea, YearBuilt, TotalBsmtSF, BedroomAbvGr, Kitche...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## Get a glimpse of the data set
glimpse(house_sales)

## Rows: 1,460
## Columns: 15
## $ LotFrontage   <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, …
## $ LotArea       <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612…
## $ YearBuilt     <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19…
## $ TotalBsmtSF   <dbl> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10…
## $ BedroomAbvGr  <dbl> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,…
## $ KitchenAbvGr  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ TotRmsAbvGrd  <dbl> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6…
## $ Street        <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
## $ CentralAir    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ GarageCars    <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,…
## $ GarageArea    <dbl> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7…
## $ YrSold        <dbl> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20…
## $ SaleType      <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W…
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal", "Norm…
## $ SalePrice     <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000, …

## Create a broad overview of a data set
summary(house_sales)

##   LotFrontage        LotArea         YearBuilt     TotalBsmtSF    
##  Min.   : 21.00   Min.   :  1300   Min.   :1872   Min.   :   0.0  
##  1st Qu.: 59.00   1st Qu.:  7554   1st Qu.:1954   1st Qu.: 795.8  
##  Median : 69.00   Median :  9478   Median :1973   Median : 991.5  
##  Mean   : 70.05   Mean   : 10517   Mean   :1971   Mean   :1057.4  
##  3rd Qu.: 80.00   3rd Qu.: 11602   3rd Qu.:2000   3rd Qu.:1298.2  
##  Max.   :313.00   Max.   :215245   Max.   :2010   Max.   :6110.0  
##  NA's   :259                                                      
##   BedroomAbvGr    KitchenAbvGr    TotRmsAbvGrd       Street         
##  Min.   :0.000   Min.   :0.000   Min.   : 2.000   Length:1460       
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 5.000   Class :character  
##  Median :3.000   Median :1.000   Median : 6.000   Mode  :character  
##  Mean   :2.866   Mean   :1.047   Mean   : 6.518                     
##  3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 7.000                     
##  Max.   :8.000   Max.   :3.000   Max.   :14.000                     
##                                                                     
##   CentralAir          GarageCars      GarageArea         YrSold    
##  Length:1460        Min.   :0.000   Min.   :   0.0   Min.   :2006  
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   1st Qu.:2007  
##  Mode  :character   Median :2.000   Median : 480.0   Median :2008  
##                     Mean   :1.767   Mean   : 473.0   Mean   :2008  
##                     3rd Qu.:2.000   3rd Qu.: 576.0   3rd Qu.:2009  
##                     Max.   :4.000   Max.   :1418.0   Max.   :2010  
##                                                                    
##    SaleType         SaleCondition        SalePrice     
##  Length:1460        Length:1460        Min.   : 34900  
##  Class :character   Class :character   1st Qu.:129975  
##  Mode  :character   Mode  :character   Median :163000  
##                                        Mean   :180921  
##                                        3rd Qu.:214000  
##                                        Max.   :755000  
##

skim(house_sales)

Data summary
Name	house_sales
Number of rows	1460
Number of columns	15
_______________________
Column type frequency:
character	4
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Street	1	4	4	2
CentralAir	1	1	1	2
SaleType	1	2	5	9
SaleCondition	1	6	7	6

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
LotFrontage	259	0.82	70.05	24.28	21	59.00	69.0	80.00	313	▇▃▁▁▁
LotArea	0	1.00	10516.83	9981.26	1300	7553.50	9478.5	11601.50	215245	▇▁▁▁▁
YearBuilt	0	1.00	1971.27	30.20	1872	1954.00	1973.0	2000.00	2010	▁▂▃▆▇
TotalBsmtSF	0	1.00	1057.43	438.71	0	795.75	991.5	1298.25	6110	▇▃▁▁▁
BedroomAbvGr	0	1.00	2.87	0.82	0	2.00	3.0	3.00	8	▁▇▂▁▁
KitchenAbvGr	0	1.00	1.05	0.22	0	1.00	1.0	1.00	3	▁▇▁▁▁
TotRmsAbvGrd	0	1.00	6.52	1.63	2	5.00	6.0	7.00	14	▂▇▇▁▁
GarageCars	0	1.00	1.77	0.75	0	1.00	2.0	2.00	4	▁▃▇▂▁
GarageArea	0	1.00	472.98	213.80	0	334.50	480.0	576.00	1418	▂▇▃▁▁
YrSold	0	1.00	2007.82	1.33	2006	2007.00	2008.0	2009.00	2010	▇▇▇▇▅
SalePrice	0	1.00	180921.20	79442.50	34900	129975.00	163000.0	214000.00	755000	▇▅▁▁▁

Task Five: Build a simple regression model

In this task, you will build a simple linear regression with one dependent and one independent variable and interpret the results

The linear model equation can be written as follows: volume = b0 + b1 * sales

b0 and b1 are known as the regression beta coefficients or parameters:
b0 is the intercept of the regression line, that is, the predicted value of volume when sales = 0.
b1 is the slope of the regression line, that is, the change in predicted volume for each additional sale.
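
With a single predictor, the least-squares estimates have a closed form: the slope is the correlation times the ratio of standard deviations, and the intercept makes the line pass through the two means. The sketch below applies those formulas to the summary statistics printed earlier in this report (the skim() output for tx_data and the cor() result). Because those inputs are rounded, the results should land close to, but not exactly on, what lm() reports.

```r
# Summary statistics copied from the skim(tx_data) and cor() output above
mean_sales  <- 552.40
sd_sales    <- 1113.54
mean_volume <- 107448783.10
sd_volume   <- 245566897.28
r           <- 0.981039

# Closed-form least-squares estimates for one predictor
b1 <- r * sd_volume / sd_sales        # slope
b0 <- mean_volume - b1 * mean_sales   # intercept

round(c(intercept = b0, slope = b1))
```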

## Create a simple linear regression model using the variables
simple_model <- lm(volume ~ sales, data = tx_data)
summary(simple_model)

## 
## Call:
## lm(formula = volume ~ sales, data = tx_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -323981922   -7019959    1652053    7709296  674388390 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.206e+07  5.946e+05  -20.28   <2e-16 ***
## sales        2.163e+05  4.784e+02  452.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47600000 on 7983 degrees of freedom
## Multiple R-squared:  0.9624, Adjusted R-squared:  0.9624 
## F-statistic: 2.045e+05 on 1 and 7983 DF,  p-value: < 2.2e-16

## Plot the regression line for the model
ggplot(tx_data, aes(x = sales, y = volume)) +
  geom_point() +
  stat_smooth(method = lm) +
  labs(title = "Regression of Volume on Number of Housing Sales in TX",
       x = "Sales", y = "Volume") +
  scale_y_continuous(labels = scales::comma)

## `geom_smooth()` using formula = 'y ~ x'

From the output above:

The estimated regression line equation can be written as follows: volume = -12060071 + 216346*sales
The intercept (b0) is -12060071. It is the predicted total value of sales when the number of sales is zero. A negative total value is not meaningful in practice; the smallest observed number of sales is 6, so the intercept is an extrapolation outside the range of the data.
The regression beta coefficient for the variable sales (b1) is 216346: for every one-unit increase in the number of sales, the predicted volume increases by 216346.

Task Six: Perform diagnostic checks on fitted model

In this task, you will use diagnostic plots to check whether the assumptions of linear regression model are satisfied

Assumptions:

Linear regression makes several assumptions about the data, such as:
* Linearity of the data: The relationship between the predictor (x) and the outcome (y) is assumed to be linear.
* Normality of residuals: The residual errors are assumed to be normally distributed.
* Homogeneity of residuals variance: The residuals are assumed to have a constant variance (homoscedasticity).
* Independence of residuals error terms.
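
The constant-variance assumption can also be checked numerically alongside the diagnostic plots. The sketch below is a hand-rolled version of the studentized Breusch-Pagan test in base R, run on simulated data whose error spread grows with x; the simulated data and all object names are assumptions for illustration, not part of the txhousing analysis. The lmtest package loaded in Task One offers a packaged version of this test as bptest().

```r
set.seed(42)

# Simulated data with heteroscedastic errors (spread grows with x)
n <- 500
x <- runif(n, 1, 100)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Test statistic: n * R-squared of the auxiliary regression,
# compared with a chi-squared distribution (df = number of predictors)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(statistic = lm_stat, p_value = p_value)
```

A small p-value suggests that the residual variance changes with the predictor, which is what a funnel shape in the residuals-versus-fitted plot would show visually.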

## Plotting the fitted model

plot(simple_model)

## Return the first diagnostic plot for the model

#plot(simple_model, which = 1)
#plot(simple_model, which = 2)
#plot(simple_model, which = 3)

## Create all four plots at once

## Keep the default titles and axis labels so each panel retains its own labels
autoplot(simple_model) +
  theme_minimal()

Task Seven: Perform model fit assessment

In this task, you will learn how to assess how well the model fits and the significance of the predictor variable

## Assess the summary of the fitted model

summary(simple_model)

## 
## Call:
## lm(formula = volume ~ sales, data = tx_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -323981922   -7019959    1652053    7709296  674388390 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.206e+07  5.946e+05  -20.28   <2e-16 ***
## sales        2.163e+05  4.784e+02  452.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47600000 on 7983 degrees of freedom
## Multiple R-squared:  0.9624, Adjusted R-squared:  0.9624 
## F-statistic: 2.045e+05 on 1 and 7983 DF,  p-value: < 2.2e-16

## Calculate the confidence interval for the coefficients

confint(simple_model)

##                   2.5 %      97.5 %
## (Intercept) -13225619.6 -10894523.0
## sales          215408.6    217284.1
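
Two of the numbers above can be reproduced by hand, which helps in reading them. In a one-predictor model, R-squared equals the squared correlation between the two variables, and a 95% confidence interval for a coefficient is the estimate plus or minus a t quantile times its standard error. The sketch below uses values copied from this report's output; since they are rounded, the results should agree with summary() and confint() only approximately.

```r
# Values copied from the output above
r   <- 0.981039   # cor(tx_data$sales, tx_data$volume)
est <- 216346     # slope estimate for sales
se  <- 478.4      # its standard error
df  <- 7983       # residual degrees of freedom

# Share of variation in volume explained by sales
r_squared <- r^2

# 95% confidence interval for the slope
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se

r_squared
ci
```

The squared correlation comes out at about 0.96, so roughly 96% of the variation in total sales value is explained by the number of sales.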

Task Nine: Build a simple regression model with transformation

In this task, you will build a simple linear regression model using a log (base 10) transformation of the dependent variable

## Build a log transformed regression model

log_model <- lm(log10(volume) ~ sales, data = tx_data)


## Return the summary of the model

summary(log_model)

## 
## Call:
## lm(formula = log10(volume) ~ sales, data = tx_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.70214 -0.28613 -0.00098  0.33364  0.85241 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.249e+00  5.198e-03  1394.6   <2e-16 ***
## sales       4.296e-04  4.182e-06   102.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4161 on 7983 degrees of freedom
## Multiple R-squared:  0.5693, Adjusted R-squared:  0.5693 
## F-statistic: 1.055e+04 on 1 and 7983 DF,  p-value: < 2.2e-16
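
Because the response is log10(volume), the slope acts multiplicatively on the original scale: one additional sale multiplies the predicted volume by 10^b1. The sketch below works this through with the rounded coefficients printed above and back-transforms one prediction; the input of 210 sales is simply an illustrative value.

```r
# Rounded coefficients copied from summary(log_model)
b0 <- 7.249
b1 <- 4.296e-04

# Multiplicative effect of one additional sale, as a percentage change
growth_per_sale <- 10^b1
pct_change <- (growth_per_sale - 1) * 100

# Back-transform a prediction for 210 sales to the original scale
pred_210 <- 10^(b0 + b1 * 210)

pct_change
pred_210
```

The percentage change per additional sale is just under 0.1%. Note that the R-squared of this model describes variation in log10(volume), so it is not directly comparable with the R-squared of the untransformed model.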

## Return the first diagnostic plot for the model

#plot(log_model, which = 1)
#plot(log_model, which = 2)
#plot(log_model, which = 3)


confint(log_model)

##                    2.5 %       97.5 %
## (Intercept) 7.2392377311 7.2596178191
## sales       0.0004214485 0.0004378449

## Keep the default titles and axis labels so each panel retains its own labels
autoplot(log_model) +
  theme_minimal()

Task Ten: Make predictions using the fitted model

In this task, you will learn how to check for metrics from the fitted model and make predictions given new values of the independent variable

## Find the fitted values of the simple regression model

fitted_values <- predict.lm(simple_model)
#head(fitted_values, 3)

## Return the model metrics


model_metrics <- augment(simple_model)
#model_metrics
  
## Predict new values using the model

predict(simple_model, newdata = data.frame(sales = c(210, 27, 140)))

##        1        2        3 
## 33372661 -6218720 18228417
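
The second prediction above is negative, which is not a plausible total sales value and shows the limits of a straight line at low sales counts. Point predictions also say nothing about their uncertainty. As a minimal sketch, predict() can return prediction intervals through its interval argument; the example below uses a small simulated data set so that it runs on its own, and the names toy, toy_model, and new_sales are assumptions for illustration.

```r
set.seed(1)

# Simulated stand-in for a sales/volume relationship
toy <- data.frame(sales = 1:50)
toy$volume <- 1000 + 200 * toy$sales + rnorm(50, sd = 500)
toy_model <- lm(volume ~ sales, data = toy)

# Point predictions with 95% prediction intervals for new values
new_sales <- data.frame(sales = c(10, 25, 40))
pred <- predict(toy_model, newdata = new_sales,
                interval = "prediction", level = 0.95)

pred
```

The result is a matrix with columns fit, lwr, and upr. The same call should work with simple_model and the newdata used above.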

Task Eleven: Multiple Regression

In this task, you will build a multiple regression model with one dependent variable and three independent variables and interpret the results

## Build the multiple regression model with volume as the y variable and sales, median and age as the x variables

multiple_reg <- lm(volume ~ sales + median + age, data = tx_data)


## This prints the result of the model
multiple_reg

## 
## Call:
## lm(formula = volume ~ sales + median + age, data = tx_data)
## 
## Coefficients:
## (Intercept)        sales       median          age  
##  -3.262e+07    2.117e+05    3.992e+02   -1.823e+06

## Check the summary of the multiple regression model
summary(multiple_reg)

## 
## Call:
## lm(formula = volume ~ sales + median + age, data = tx_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -282127704  -13143251    -595883   13529089  660234903 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.262e+07  3.515e+06  -9.281   <2e-16 ***
## sales        2.117e+05  4.747e+02 445.937   <2e-16 ***
## median       3.992e+02  1.633e+01  24.445   <2e-16 ***
## age         -1.823e+06  1.289e+05 -14.142   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43410000 on 7981 degrees of freedom
## Multiple R-squared:  0.9688, Adjusted R-squared:  0.9688 
## F-statistic: 8.252e+04 on 3 and 7981 DF,  p-value: < 2.2e-16
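
Each coefficient in a multiple regression is the change in predicted volume for a one-unit change in that variable while the others are held fixed. A hand-computed prediction makes this concrete. The sketch below uses the rounded coefficients printed above; the input values (210 sales, a median price of 100000, an age of 15) are hypothetical and chosen only for illustration.

```r
# Rounded coefficients copied from summary(multiple_reg)
coefs <- c(intercept = -3.262e+07,
           sales     =  2.117e+05,
           median    =  3.992e+02,
           age       = -1.823e+06)

# Hypothetical inputs (assumed for illustration); the leading 1 multiplies the intercept
inputs <- c(intercept = 1, sales = 210, median = 100000, age = 15)

# Prediction is the sum of coefficient * input
pred <- sum(coefs * inputs)
pred

# Holding median and age fixed, one more sale adds the sales coefficient
pred_plus_one <- sum(coefs * (inputs + c(0, 1, 0, 0)))
pred_plus_one - pred
```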

## Plot the fitted multiple regression model
autoplot(multiple_reg)

Task Twelve: Create a model to predict house prices in Iowa

In this task, you will build a model from scratch to predict house prices using variables in a data set

## Create a broad overview of the data set
skim(house_sales)

Data summary
Name	house_sales
Number of rows	1460
Number of columns	15
_______________________
Column type frequency:
character	4
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Street	1	4	4	2
CentralAir	1	1	1	2
SaleType	1	2	5	9
SaleCondition	1	6	7	6

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
LotFrontage	259	0.82	70.05	24.28	21	59.00	69.0	80.00	313	▇▃▁▁▁
LotArea	0	1.00	10516.83	9981.26	1300	7553.50	9478.5	11601.50	215245	▇▁▁▁▁
YearBuilt	0	1.00	1971.27	30.20	1872	1954.00	1973.0	2000.00	2010	▁▂▃▆▇
TotalBsmtSF	0	1.00	1057.43	438.71	0	795.75	991.5	1298.25	6110	▇▃▁▁▁
BedroomAbvGr	0	1.00	2.87	0.82	0	2.00	3.0	3.00	8	▁▇▂▁▁
KitchenAbvGr	0	1.00	1.05	0.22	0	1.00	1.0	1.00	3	▁▇▁▁▁
TotRmsAbvGrd	0	1.00	6.52	1.63	2	5.00	6.0	7.00	14	▂▇▇▁▁
GarageCars	0	1.00	1.77	0.75	0	1.00	2.0	2.00	4	▁▃▇▂▁
GarageArea	0	1.00	472.98	213.80	0	334.50	480.0	576.00	1418	▂▇▃▁▁
YrSold	0	1.00	2007.82	1.33	2006	2007.00	2008.0	2009.00	2010	▇▇▇▇▅
SalePrice	0	1.00	180921.20	79442.50	34900	129975.00	163000.0	214000.00	755000	▇▅▁▁▁

## Drop the missing values in the LotFrontage variable

house_sales <- house_sales %>%
  drop_na(LotFrontage)


## Build the multiple regression model 

house_reg <- lm(SalePrice ~ ., data = house_sales)

## Check the summary of the multiple regression model

summary(house_reg)

## 
## Call:
## lm(formula = SalePrice ~ ., data = house_sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -483816  -23581   -2878   19681  394505 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -3.539e+06  2.014e+06  -1.758  0.07907 .  
## LotFrontage          -2.056e+01  6.546e+01  -0.314  0.75347    
## LotArea               8.656e-01  1.868e-01   4.635 3.98e-06 ***
## YearBuilt             4.636e+02  5.583e+01   8.305 2.73e-16 ***
## TotalBsmtSF           4.726e+01  3.779e+00  12.505  < 2e-16 ***
## BedroomAbvGr         -1.334e+04  2.299e+03  -5.800 8.52e-09 ***
## KitchenAbvGr         -5.403e+04  6.326e+03  -8.542  < 2e-16 ***
## TotRmsAbvGrd          2.267e+04  1.274e+03  17.793  < 2e-16 ***
## StreetPave            4.373e+04  2.101e+04   2.082  0.03760 *  
## CentralAirY           4.828e+03  5.667e+03   0.852  0.39442    
## GarageCars            2.110e+04  3.913e+03   5.393 8.36e-08 ***
## GarageArea            1.587e+01  1.370e+01   1.159  0.24669    
## YrSold                1.280e+03  1.004e+03   1.275  0.20245    
## SaleTypeCon           7.289e+04  3.313e+04   2.200  0.02802 *  
## SaleTypeConLD         1.959e+04  1.852e+04   1.058  0.29037    
## SaleTypeConLI         2.987e+04  2.405e+04   1.242  0.21462    
## SaleTypeConLw         3.950e+04  2.205e+04   1.791  0.07347 .  
## SaleTypeCWD           7.830e+04  2.423e+04   3.231  0.00127 ** 
## SaleTypeNew           7.085e+04  2.837e+04   2.498  0.01264 *  
## SaleTypeOth           3.766e+04  2.772e+04   1.359  0.17455    
## SaleTypeWD            2.762e+04  8.548e+03   3.232  0.00126 ** 
## SaleConditionAdjLand  1.814e+04  2.345e+04   0.774  0.43931    
## SaleConditionAlloca   4.724e+03  1.592e+04   0.297  0.76673    
## SaleConditionFamily  -2.546e+04  1.201e+04  -2.120  0.03422 *  
## SaleConditionNormal   4.647e+03  5.549e+03   0.837  0.40253    
## SaleConditionPartial -1.808e+04  2.717e+04  -0.665  0.50602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45110 on 1175 degrees of freedom
## Multiple R-squared:  0.7135, Adjusted R-squared:  0.7074 
## F-statistic:   117 on 25 and 1175 DF,  p-value: < 2.2e-16
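
With 25 estimated slopes, adjusted R-squared is the more informative fit measure here, because it penalizes the number of predictors. As a rough check, it can be recomputed from the printed R-squared, the number of observations, and the number of predictors. The helper adj_r2() below is a hypothetical name introduced for this sketch; the sample sizes are derived from the residual degrees of freedom reported in this document.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# House price model: 1175 residual df + 25 slopes + 1 intercept = 1201 rows
house_adj <- adj_r2(0.7135, n = 1201, p = 25)

# Texas multiple regression: 7981 residual df + 3 slopes + 1 intercept = 7985 rows
tx_adj <- adj_r2(0.9688, n = 7985, p = 3)

round(c(house = house_adj, tx = tx_adj), 4)
```

Both values should agree with the adjusted R-squared figures in the corresponding summary() outputs to about four decimal places.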

## Perform diagnostic plots of the fitted multiple regression model

## Keep the default titles and axis labels so each panel retains its own labels
autoplot(house_reg) +
  theme_minimal()

Peter Thompson

2025-04-07
