# Import the data set
bd <- read.csv("C:\\Users\\Silva\\Downloads\\real_estate_data.csv")
# Load libraries
#library(psych)
library(tidyverse)
library(ggplot2)
library(corrplot)
library(gmodels)
library(effects)
library(stargazer)
library(olsrr)
library(kableExtra)
library(jtools)
library(fastmap)
library(dlookr)
library(Hmisc)
library(naniar)
library(glmnet)
library(caret)
#Identify missing values
missing_values = colSums(is.na(bd))
missing_values
## medv cmedv crim zn indus chas nox rm age dis
## 0 0 0 0 0 0 0 0 0 0
## rad tax ptratio b lstat
## 0 0 0 0 0
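naniar is loaded above; as a quick visual confirmation that nothing is missing, the per-variable missing counts could also be plotted. A minimal sketch:
# Bar chart of missing values per variable (all bars should sit at zero here)
gg_miss_var(bd)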
#Display data set structure
str(bd)
## 'data.frame': 506 obs. of 15 variables:
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## $ cmedv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : int 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ b : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
# Include descriptive statistics (mean, median, standard deviation, minimum, maximum)
summary(bd)
## medv cmedv crim zn
## Min. : 5.00 Min. : 5.00 Min. : 0.00632 Min. : 0.00
## 1st Qu.:17.02 1st Qu.:17.02 1st Qu.: 0.08205 1st Qu.: 0.00
## Median :21.20 Median :21.20 Median : 0.25651 Median : 0.00
## Mean :22.53 Mean :22.53 Mean : 3.61352 Mean : 11.36
## 3rd Qu.:25.00 3rd Qu.:25.00 3rd Qu.: 3.67708 3rd Qu.: 12.50
## Max. :50.00 Max. :50.00 Max. :88.97620 Max. :100.00
## indus chas nox rm
## Min. : 0.46 Min. :0.00000 Min. :0.3850 Min. :3.561
## 1st Qu.: 5.19 1st Qu.:0.00000 1st Qu.:0.4490 1st Qu.:5.886
## Median : 9.69 Median :0.00000 Median :0.5380 Median :6.208
## Mean :11.14 Mean :0.06917 Mean :0.5547 Mean :6.285
## 3rd Qu.:18.10 3rd Qu.:0.00000 3rd Qu.:0.6240 3rd Qu.:6.623
## Max. :27.74 Max. :1.00000 Max. :0.8710 Max. :8.780
## age dis rad tax
## Min. : 2.90 Min. : 1.130 Min. : 1.000 Min. :187.0
## 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0
## Median : 77.50 Median : 3.207 Median : 5.000 Median :330.0
## Mean : 68.57 Mean : 3.795 Mean : 9.549 Mean :408.2
## 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0
## Max. :100.00 Max. :12.127 Max. :24.000 Max. :711.0
## ptratio b lstat
## Min. :12.60 Min. : 0.32 Min. : 1.73
## 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95
## Median :19.05 Median :391.44 Median :11.36
## Mean :18.46 Mean :356.67 Mean :12.65
## 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95
## Max. :22.00 Max. :396.90 Max. :37.97
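summary() does not report the standard deviation mentioned in the comment above; a minimal sketch to add it:
# Standard deviation of each variable
sapply(bd, sd)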
# Transform variables if required
#bd$chas = as.factor(bd$chas)
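dlookr is loaded above; a quick diagnosis of variable types and unique values (a sketch) can help decide whether any variables, such as the chas dummy, should be converted to factors:
# Overview of type, missing count and number of unique values per variable
diagnose(bd)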
Build at least two pairwise graphs between the dependent variable and the independent variables
# Histogram NOX (graph 1)
hist1 = ggplot(data = bd, aes(x = nox)) +
  geom_histogram(bins = 10, fill = "lightblue", color = "black", boundary = 15) +
  labs(title = "Distribution of nitric oxides concentration (NOX)", x = "NOX concentration", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))
hist1
# Histogram CRIME RATE (graph 2)
hist2 = ggplot(data = bd, aes(x = crim)) +
  geom_histogram(bins = 10, fill = "purple", color = "black", boundary = 15) +
  labs(title = "Distribution of crime rate", x = "Crime rate", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))
hist2
# Histogram DISTANCE (graph 3)
hist3 = ggplot(data = bd, aes(x = dis)) +
  geom_histogram(bins = 10, fill = "red", color = "black", boundary = 15) +
  labs(title = "Distribution of distance to five Boston employment centers", x = "Distance", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))
hist3
# Histogram Rooms (graph 4)
hist4 = ggplot(data = bd, aes(x = rm)) +
  geom_histogram(bins = 10, fill = "pink", color = "black", boundary = 15) +
  labs(title = "Distribution of number of rooms", x = "Number of rooms", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))
hist4
# Display a histogram of the dependent variable (graph 5)
ggplot(data = bd, aes(x = medv)) +
  geom_histogram(bins = 10, fill = "lightblue", color = "black", boundary = 15) +
  labs(title = "Frequency of home values (USD 1,000s)", x = "Home value (USD 1,000s)", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))
# Display a histogram of the dependent variable transformed with the natural logarithm (graph 6)
ggplot(data = bd, aes(x = log(medv))) +
  geom_histogram(bins = 10, fill = "lightblue", color = "black", boundary = 15) +
  labs(title = "Frequency of log home values", x = "log(home value)", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5))
# Display a correlation plot (graph 7)
correlation_matrix <- cor(bd,use = "everything")
corrplot(correlation_matrix, method = "color", type = "upper", tl.cex = 0.9)
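To complement the plot, the correlation of each variable with the dependent variable can be listed directly; a minimal sketch:
# Correlation of every variable with medv, strongest first
sort(correlation_matrix[, "medv"], decreasing = TRUE)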
H0: Having a school nearby has no impact on house values.
H1: Having a school nearby has a significant impact on house values.
H0: A high crime rate has no significant impact on house values.
H1: A high crime rate has a significant impact on house values.
H0: Distance to the employment centers has no impact on house prices.
H1: Distance to the employment centers has an impact on house prices.
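Before fitting the regressions, these hypotheses can be screened with simple bivariate tests, assuming chas is the proximity dummy referred to above; a minimal sketch:
# Difference in mean home value between the two chas groups
t.test(medv ~ chas, data = bd)
# Correlation of home value with crime rate and with distance
cor.test(bd$medv, bd$crim)
cor.test(bd$medv, bd$dis)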
Estimate 3 different linear regression models
## Model 1: Multiple linear regression
modelol <- lm(medv ~ indus + nox + rad + chas + tax + rm + crim + dis + lstat, data = bd)
summary(modelol)
##
## Call:
## lm(formula = medv ~ indus + nox + rad + chas + tax + rm + crim +
## dis + lstat, data = bd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.922 -3.164 -1.019 2.128 27.880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.037486 3.996667 3.512 0.000485 ***
## indus -0.102369 0.065751 -1.557 0.120127
## nox -9.128940 3.797284 -2.404 0.016579 *
## rad 0.135520 0.069856 1.940 0.052947 .
## chas 3.602529 0.934492 3.855 0.000131 ***
## tax -0.010006 0.004004 -2.499 0.012790 *
## rm 4.603069 0.431139 10.677 < 2e-16 ***
## crim -0.095412 0.035470 -2.690 0.007388 **
## dis -1.071326 0.184055 -5.821 1.05e-08 ***
## lstat -0.575164 0.051430 -11.184 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.186 on 496 degrees of freedom
## Multiple R-squared: 0.6878, Adjusted R-squared: 0.6821
## F-statistic: 121.4 on 9 and 496 DF, p-value: < 2.2e-16
## Model 2: Linear model (logarithmic)
modelo2 <- lm(log(medv) ~ indus + nox + rad + chas + tax + rm + crim + dis + lstat, data = bd)
summary(modelo2)
##
## Call:
## lm(formula = log(medv) ~ indus + nox + rad + chas + tax + rm +
## crim + dis + lstat, data = bd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75215 -0.11530 -0.02392 0.11428 0.85862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2723333 0.1583533 20.665 < 2e-16 ***
## indus -0.0019455 0.0026051 -0.747 0.455539
## nox -0.4302928 0.1504535 -2.860 0.004415 **
## rad 0.0079011 0.0027678 2.855 0.004489 **
## chas 0.1369810 0.0370258 3.700 0.000240 ***
## tax -0.0005753 0.0001587 -3.626 0.000318 ***
## rm 0.1187548 0.0170823 6.952 1.14e-11 ***
## crim -0.0100107 0.0014054 -7.123 3.73e-12 ***
## dis -0.0386950 0.0072925 -5.306 1.69e-07 ***
## lstat -0.0308904 0.0020377 -15.159 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2055 on 496 degrees of freedom
## Multiple R-squared: 0.7519, Adjusted R-squared: 0.7474
## F-statistic: 167 on 9 and 496 DF, p-value: < 2.2e-16
## Model 3: polynomial – multiple linear regression
polymodel = lm(log(medv) ~ indus + nox + rad + chas + rm + I(rm^2) + dis + tax + crim + lstat, data=bd)
summary(polymodel)
##
## Call:
## lm(formula = log(medv) ~ indus + nox + rad + chas + rm + I(rm^2) +
## dis + tax + crim + lstat, data = bd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96456 -0.10702 -0.00925 0.09530 0.88754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.8502983 0.4015860 17.058 < 2e-16 ***
## indus 0.0006635 0.0024114 0.275 0.783318
## nox -0.4980196 0.1385524 -3.594 0.000358 ***
## rad 0.0076340 0.0025457 2.999 0.002847 **
## chas 0.1183804 0.0341080 3.471 0.000564 ***
## rm -1.0108149 0.1191905 -8.481 2.60e-16 ***
## I(rm^2) 0.0876424 0.0091672 9.560 < 2e-16 ***
## dis -0.0300090 0.0067681 -4.434 1.14e-05 ***
## tax -0.0005642 0.0001459 -3.867 0.000125 ***
## crim -0.0108652 0.0012956 -8.386 5.26e-16 ***
## lstat -0.0313507 0.0018747 -16.723 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.189 on 495 degrees of freedom
## Multiple R-squared: 0.7905, Adjusted R-squared: 0.7863
## F-statistic: 186.8 on 10 and 495 DF, p-value: < 2.2e-16
# Model comparison
stargazer(modelol,modelo2,polymodel,type="text",title="OLS Regression Results",single.row=TRUE,ci=FALSE,ci.level=0.9)
##
## OLS Regression Results
## ===============================================================================================
## Dependent variable:
## ---------------------------------------------------------------------------
## medv log(medv)
## (1) (2) (3)
## -----------------------------------------------------------------------------------------------
## indus -0.102 (0.066) -0.002 (0.003) 0.001 (0.002)
## nox -9.129** (3.797) -0.430*** (0.150) -0.498*** (0.139)
## rad 0.136* (0.070) 0.008*** (0.003) 0.008*** (0.003)
## chas 3.603*** (0.934) 0.137*** (0.037) 0.118*** (0.034)
## tax -0.010** (0.004) -0.001*** (0.0002) -0.001*** (0.0001)
## rm 4.603*** (0.431) 0.119*** (0.017) -1.011*** (0.119)
## I(rm2) 0.088*** (0.009)
## crim -0.095*** (0.035) -0.010*** (0.001) -0.011*** (0.001)
## dis -1.071*** (0.184) -0.039*** (0.007) -0.030*** (0.007)
## lstat -0.575*** (0.051) -0.031*** (0.002) -0.031*** (0.002)
## Constant 14.037*** (3.997) 3.272*** (0.158) 6.850*** (0.402)
## -----------------------------------------------------------------------------------------------
## Observations 506 506 506
## R2 0.688 0.752 0.791
## Adjusted R2 0.682 0.747 0.786
## Residual Std. Error 5.186 (df = 496) 0.205 (df = 496) 0.189 (df = 495)
## F Statistic 121.396*** (df = 9; 496) 166.981*** (df = 9; 496) 186.814*** (df = 10; 495)
## ===============================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
# Model graphs (graph 8)
par(mfrow = c(1,2))
plot(x=predict(modelol),y=bd$medv,
xlab='Predicted values',ylab='Observed values',
main='Model 1')
abline(a=0,b=1,col="blue")
plot(x=predict(modelo2),y=log(bd$medv),
 xlab='Predicted values',ylab='Observed values (log scale)',
 main='Model 2')
abline(a=0,b=1,col="blue")
# Effect plots (graph 9)
effect_plot(polymodel,pred=rm,interval=TRUE,main='Model 3')
## Using data bd from global environment. This could cause incorrect results
## if bd has been altered since the model was fit. You can manually provide
## the data to the "data =" argument.
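As the message above suggests, the data can be supplied explicitly so the plot does not depend on the global environment; a minimal sketch:
# Same effect plot, with the model data passed explicitly
effect_plot(polymodel, pred = rm, interval = TRUE, data = bd)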
# Compare goodness of fit (AIC) across the three regression models
# Model 1 - AIC
AIC(modelol)
## [1] 3113.49
# Model 2 - AIC
AIC(modelo2)
## [1] -153.6378
# Model 3 - AIC
AIC(polymodel)
## [1] -237.3781
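The three AIC values can also be collected in a single table for a side-by-side comparison; a minimal sketch:
# Degrees of freedom and AIC for all three models in one data frame
AIC(modelol, modelo2, polymodel)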
# Lasso regression model
## Creation of data samples for train & test
set.seed(123)
training.samples<-bd$medv %>%
createDataPartition(p=0.75,list=FALSE)
train.data<-bd[training.samples, ]
test.data<-bd[-training.samples, ]
selected_model = lm(log(medv) ~ indus + nox + rad + chas + rm + I(rm^2) + dis + tax + crim + lstat, data=bd)
summary(selected_model)
##
## Call:
## lm(formula = log(medv) ~ indus + nox + rad + chas + rm + I(rm^2) +
## dis + tax + crim + lstat, data = bd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96456 -0.10702 -0.00925 0.09530 0.88754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.8502983 0.4015860 17.058 < 2e-16 ***
## indus 0.0006635 0.0024114 0.275 0.783318
## nox -0.4980196 0.1385524 -3.594 0.000358 ***
## rad 0.0076340 0.0025457 2.999 0.002847 **
## chas 0.1183804 0.0341080 3.471 0.000564 ***
## rm -1.0108149 0.1191905 -8.481 2.60e-16 ***
## I(rm^2) 0.0876424 0.0091672 9.560 < 2e-16 ***
## dis -0.0300090 0.0067681 -4.434 1.14e-05 ***
## tax -0.0005642 0.0001459 -3.867 0.000125 ***
## crim -0.0108652 0.0012956 -8.386 5.26e-16 ***
## lstat -0.0313507 0.0018747 -16.723 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.189 on 495 degrees of freedom
## Multiple R-squared: 0.7905, Adjusted R-squared: 0.7863
## F-statistic: 186.8 on 10 and 495 DF, p-value: < 2.2e-16
RMSE(selected_model$fitted.values,test.data$medv)
## Warning in pred - obs: longer object length is not a multiple of shorter
## object length
## [1] 23.18747
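The warning above arises because selected_model was fitted on the full data set, so its fitted values cover all rows and are on the log scale, while test.data$medv has fewer rows and is on the original scale. A sketch of how the test-set RMSE could be computed on the original scale instead, using a hypothetical object name selected_model_train for the refit on the training sample:
# Refit the selected specification on the training data only
selected_model_train <- lm(log(medv) ~ indus + nox + rad + chas + rm + I(rm^2) +
                             dis + tax + crim + lstat, data = train.data)
# Predict on the held-out test data and back-transform from the log scale
pred_test <- exp(predict(selected_model_train, newdata = test.data))
RMSE(pred_test, test.data$medv)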
# Creation of the lasso model
x = model.matrix(log(medv) ~ indus + nox + rad + chas + rm + I(rm^2) + dis + tax + crim + lstat, train.data)[,-1]
y = train.data$medv
set.seed(123)
cv.lasso<-cv.glmnet(x,y,alpha=1)
cv.lasso$lambda.min
## [1] 0.002175466
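The cross-validation error curve behind this choice of lambda can be inspected directly; a minimal sketch:
# CV error as a function of log(lambda), with lambda.min and lambda.1se marked
plot(cv.lasso)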
lassomodel<-glmnet(x,y,alpha=1,lambda=cv.lasso$lambda.min)
coef(lassomodel)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 113.02252202
## indus -0.02689376
## nox -11.19359043
## rad 0.07845893
## chas 1.55776739
## rm -27.20726443
## I(rm^2) 2.47732516
## dis -0.63931869
## tax -0.01046509
## crim -0.10082665
## lstat -0.49493861
x.test<-model.matrix(log(medv) ~ indus + nox + rad + chas + rm + I(rm^2) + dis + tax + crim + lstat,test.data)[,-1]
lassopredictions <- lassomodel %>% predict(x.test) %>% as.vector()
data.frame(
RMSE = RMSE(lassopredictions, test.data$medv),
Rsquare = R2(lassopredictions, test.data$medv))
## RMSE Rsquare
## 1 6.190498 0.7101066
# Lasso model graph
# Helper: label each coefficient path at the right-hand edge of the plot
lbs_fun <- function(fit, offset_x = 1, ...) {
  L <- length(fit$lambda)
  x <- log(fit$lambda[L]) + offset_x
  y <- fit$beta[, L]
  labs <- names(y)
  text(x, y, labels = labs, ...)
}
lasso<-glmnet(scale(x),y,alpha=1)
plot(lasso,xvar="lambda",label=T)
lbs_fun(lasso)
abline(v=cv.lasso$lambda.min,col="red",lty=2)
abline(v=cv.lasso$lambda.1se,col="blue",lty=2)
The coefficient of determination, R-squared (R²), is a statistical measure of the goodness of fit of a model to a given set of data: it indicates how much of the variation in the dependent variable is explained by the independent variables in the regression. In this case, model 3 has the highest R².
According to Statology (2022), AIC is a statistical measure used to evaluate the goodness of fit of a linear regression model. It is based on the principle of parsimony, which states that simpler models are generally better than more complex ones. There is no AIC value that can be considered “good” or “bad” on its own, because AIC is only used to compare regression models: the model with the lowest AIC offers the best fit, and in this case that is model 3. The absolute value of AIC is not important.
Of these two criteria, we give more weight to AIC than to R² when choosing between the models.
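As a sanity check of the definition (AIC = 2k - 2(log-likelihood), with k the number of estimated parameters including the error variance), the value reported for model 3 can be reproduced from its log-likelihood; a minimal sketch:
# Reconstruct AIC for model 3 from its log-likelihood
ll <- logLik(polymodel)
2 * attr(ll, "df") - 2 * as.numeric(ll)  # should match AIC(polymodel)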
Interpretation and conclusions of model 3
In model 3 we transformed the dependent variable to its natural logarithm (log(medv)) so that the skewed distribution of the variable does not bias the regression estimates.
All the variables used to estimate the model turned out to be significant at up to the 99% confidence level, except “indus”, which is the proportion of non-retail business acres per town.
The rooms variable (rm) has the largest percentage impact in the model and on home values. Its linear coefficient is negative, which taken on its own would mean that home values fall as the number of rooms increases.
The impact of rm is not linear, since its square is also significant in the model and points to a positive effect, the opposite of the untransformed term; the turning point is sketched below. This pattern could already be glimpsed during the exploration of the data in graph 4.
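The point at which the estimated effect of rm switches from negative to positive follows from the two coefficients: the minimum of b1*rm + b2*rm^2 is at rm = -b1 / (2*b2). A minimal sketch:
# Turning point of the quadratic effect of rm in model 3
b <- coef(polymodel)
-b["rm"] / (2 * b["I(rm^2)"])  # roughly 5.8 rooms with the estimates above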
Variables with a significant negative impact: nox, rm (linear term), dis, tax, crim and lstat.
Variables with a significant positive impact: rad, chas and rm².
In the lasso regression model, the variables that remained influential throughout the regularization path were rm², lstat, rm and tax, making them the most important variables in the model.
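One way to see which variables survive stronger shrinkage is to look at the coefficients at the larger, more conservative lambda.1se from the cross-validation above; a minimal sketch:
# Coefficients at lambda.1se; variables shrunk exactly to zero are dropped
coef(cv.lasso, s = "lambda.1se")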
How linear regression analysis can contribute to improving predictive analytics
With linear regression analysis we can model the relationship between a dependent variable (y) and independent variables (x) by fitting a linear equation to the observed data.
Once a linear regression model has been built and validated, it can be used to make predictions: given new values of the independent variables, the model provides predictions of the dependent variable. This technique can help in forecasting sales, predicting customer behavior, or estimating housing prices. It uses the linear relationships between a dependent variable (target) and one or more independent variables (predictors) to predict future values of the target; a small worked example is sketched below.
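As an illustration, the selected model above can produce a prediction for a new, entirely hypothetical house (the predictor values below are invented for the example):
# Hypothetical new observation; values chosen only to illustrate predict()
new_house <- data.frame(indus = 10, nox = 0.5, rad = 5, chas = 0, rm = 6.5,
                        dis = 4, tax = 300, crim = 0.2, lstat = 10)
# The model predicts log(medv), so exponentiate to return to USD 1,000s
exp(predict(selected_model, newdata = new_house))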
An advantage of using linear regression for predictive analytics is that it is flexible and adaptable: it can be used to model different types of relationships, such as linear, polynomial, logarithmic, exponential, or inverse.
(IBM 2022)
IBM Documentation. (2022, November 3). Ibm.com. https://www.ibm.com/docs/en/db2-warehouse?topic=procedures-linear-regression#:~:text=Linear%20regression%20is%20the%20most,the%20future%20of%20the%20target
What are the advantages and disadvantages of using linear regression for predictive analytics? (2023). Linkedin.com. https://www.linkedin.com/advice/1/what-advantages-disadvantages-using-linear-1e#:~:text=It%20is%20a%20statistical%20technique,expectancy%20based%20on%20health%20factors