Equipo:
A00834241 | Regina Rodríguez Chávez A00833617 | Yessica Acosta Blancheth A01275763 | Eli Gabriel Hernández Medina A00833172 | Genaro Rodríguez Alcántara
The first step is load the libraries
library(foreign)
library(dplyr) # data manipulation
library(forcats) # to work with categorical variables
library(ggplot2) # data visualization
library(readr) # read specific csv files
library(janitor) # data exploration and cleaning
library(Hmisc) # several useful functions for data analysis
library(psych) # functions for multivariate analysis
library(naniar) # summaries and visualization of missing values NA's
library(dlookr) # summaries and visualization of missing values NA's
library(corrplot) # correlation plots
library(jtools) # presentation of regression analysis
library(lmtest) # diagnostic checks - linear regression analysis
library(car) # diagnostic checks - linear regression analysis
library(olsrr) # diagnostic checks - linear regression analysis
library(naniar) # identifying missing values
#library(funModeling) # create frequency tables
library(stargazer) # create publication quality tables
library(effects) # displays for linear and other regression models
library(tidyverse) # collection of R packages designed for data scienc
library(caret) # Classification and Regression Training
library(glmnet)
library(ggplot2)
library(car)
library(lmtest)
It is necessary to load the database to begin the analysis.
#file.choose()
dataset <- read.csv("/Users/gabrielmedina/Downloads/Real_estate_data (1).csv")
It is important to start analyzing the structure of the database, as well as the type of data held in each variable, in order to fully understand the data we are working with.
#dataset
str(dataset)
## 'data.frame': 506 obs. of 15 variables:
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## $ cmedv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : int 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ b : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
Analyzing the structure of the database, we can realize that the chas variable is a binary variable of factor type, which is in an integer type, therefore it is necessary to transform the data.
dataset$chas <- as.factor(dataset$chas)
sum(is.na(dataset))
## [1] 0
Since no null data were found, it is important to have an idea of the main values of each variable through a descriptive statistical analysis.
summary(dataset)
## medv cmedv crim zn
## Min. : 5.00 Min. : 5.00 Min. : 0.00632 Min. : 0.00
## 1st Qu.:17.02 1st Qu.:17.02 1st Qu.: 0.08205 1st Qu.: 0.00
## Median :21.20 Median :21.20 Median : 0.25651 Median : 0.00
## Mean :22.53 Mean :22.53 Mean : 3.61352 Mean : 11.36
## 3rd Qu.:25.00 3rd Qu.:25.00 3rd Qu.: 3.67708 3rd Qu.: 12.50
## Max. :50.00 Max. :50.00 Max. :88.97620 Max. :100.00
## indus chas nox rm age
## Min. : 0.46 0:471 Min. :0.3850 Min. :3.561 Min. : 2.90
## 1st Qu.: 5.19 1: 35 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02
## Median : 9.69 Median :0.5380 Median :6.208 Median : 77.50
## Mean :11.14 Mean :0.5547 Mean :6.285 Mean : 68.57
## 3rd Qu.:18.10 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08
## Max. :27.74 Max. :0.8710 Max. :8.780 Max. :100.00
## dis rad tax ptratio
## Min. : 1.130 Min. : 1.000 Min. :187.0 Min. :12.60
## 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40
## Median : 3.207 Median : 5.000 Median :330.0 Median :19.05
## Mean : 3.795 Mean : 9.549 Mean :408.2 Mean :18.46
## 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20
## Max. :12.127 Max. :24.000 Max. :711.0 Max. :22.00
## b lstat
## Min. : 0.32 Min. : 1.73
## 1st Qu.:375.38 1st Qu.: 6.95
## Median :391.44 Median :11.36
## Mean :356.67 Mean :12.65
## 3rd Qu.:396.23 3rd Qu.:16.95
## Max. :396.90 Max. :37.97
statistics_dt = summary(dataset)
mean_value <- statistics_dt[2, ]
median_value <- statistics_dt[3, ]
standard_deviation <- statistics_dt[4, ]
minimum_value <- statistics_dt[5, ]
maximum_value <- statistics_dt[6, ]
# Print the descriptive statistics
cat("Mean:", mean_value, "\n")
## Mean: 1st Qu.:17.02 1st Qu.:17.02 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1: 35 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95
cat("Median:", median_value, "\n")
## Median: Median :21.20 Median :21.20 Median : 0.25651 Median : 0.00 Median : 9.69 NA Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000 Median :330.0 Median :19.05 Median :391.44 Median :11.36
cat("Standard Deviation:", standard_deviation, "\n")
## Standard Deviation: Mean :22.53 Mean :22.53 Mean : 3.61352 Mean : 11.36 Mean :11.14 NA Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65
cat("Minimum:", minimum_value, "\n")
## Minimum: 3rd Qu.:25.00 3rd Qu.:25.00 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 NA 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95
cat("Maximum:", maximum_value, "\n")
## Maximum: Max. :50.00 Max. :50.00 Max. :88.97620 Max. :100.00 Max. :27.74 NA Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97
The results of the descriptive statistics reveal valuable information about the variables in my data set. Regarding the target variable “medv” (average value of homes), it is observed that its values range between 5.00 and 50.00, with a median of 21.20 and a slightly higher mean of 22.53. This suggests a relatively wide dispersion of house prices. When examining other variables, such as “crim” (crime rate) and “nox” (concentration of nitrogen oxides), wide variations in their values are found, indicating significant differences in levels of crime and air quality in the studied areas. Additionally, the variable “chas” is mostly categorical, with most records having a value of 0.
On the other hand, statistics highlight key characteristics of the predictor variables. For example, “rm” (average number of rooms per dwelling) has values that vary between 3,561 and 8,780, with a median of 6,208, suggesting variability in the size of dwellings. The variable “tax” shows wide variation, with values ranging between 187.0 and 711.0. Furthermore, “lstat” (percentage of population of lower status) presents values that vary from 1.73 to 37.97, indicating the diversity in the socioeconomic conditions of the studied areas. These summary statistics provide an overview of the distribution and variability of the variables, which is essential for better understanding the data set and planning subsequent analyses.
Since no abnormality was found in the data, a graphical analysis is carried out.
ggplot(data = dataset, aes(x = crim, y = medv)) +
geom_point(color = "blue", alpha = 0.6) +
labs(x = "Crime Rate (crim)", y = "Median Value (medv)") +
ggtitle("Scatter Plot: Median Value vs. Crime Rate") +
theme_minimal()
ggplot(data = dataset, aes(x = zn, y = medv)) +
geom_point(color = "red", alpha = 0.6) +
labs(x = "Proportion of Residential Land Zoned (zn)", y = "Median Value (medv)") +
ggtitle("Scatter Plot: Median Value vs. Proportion of Residential Land Zoned") +
theme_minimal()
ggplot(data = dataset, aes(x = indus, y = medv)) +
geom_point(color = "green", alpha = 0.6) +
labs(x = "Industrial Zone Proportion (indus)", y = "Median Value (medv)") +
ggtitle("Scatter Plot: Median Value vs. Industrial Zone Proportion") +
theme_minimal()
ggplot(data = dataset, aes(x = nox, y = medv)) +
geom_point(color = "blue", alpha = 0.6) +
labs(x = "Nitrogen Oxides Concentration (nox)", y = "Median Value (medv)") +
ggtitle("Scatter Plot: Median Value vs. Nitrogen Oxides Concentration") +
theme_minimal()
ggplot(data = dataset, aes(x = rm, y = medv)) +
geom_point(color = "red", alpha = 0.6) +
labs(x = "Average Number of Rooms (rm)", y = "Median Value (medv)") +
ggtitle("Scatter Plot: Median Value vs. Average Number of Rooms") +
theme_minimal()
ggplot(data = dataset, aes(x = age, y = medv)) +
geom_line(color = "green", alpha = 0.6) +
labs(x = "Proportion of Owner-Occupied Units Built Before 1940 (age)", y = "Median Value (medv)") +
ggtitle("Scatter Plot: Median Value vs. Proportion of Owner-Occupied Units Built Before 1940") +
theme_minimal()
ggplot(data = dataset, aes(x = dis, y = medv)) +
geom_line(color = "purple", alpha = 0.6) +
labs(x = "Weighted Distances to Employment Centers (dis)", y = "Median Value (medv)") +
ggtitle("Scatter Plot: Median Value vs. Weighted Distances to Employment Centers") +
theme_minimal()
ggplot(data = dataset, aes(x = factor(rad), y = medv)) +
geom_boxplot(fill = "skyblue") +
labs(x = "Accessibility to Radial Highways (rad)", y = "Median Value (medv)") +
ggtitle("Boxplot: Median Value vs. Accessibility to Radial Highways") +
theme_minimal()
ggplot(data = dataset, aes(x = factor(tax), y = medv)) +
geom_boxplot(fill = "lightgreen") +
labs(x = "Property Tax Rate (tax)", y = "Median Value (medv)") +
ggtitle("Boxplot: Median Value vs. Property Tax Rate") +
theme_minimal()
ggplot(data = dataset, aes(x = ptratio, y = medv)) +
geom_point(color = "blue", fill = "lightblue") +
labs(x = "Pupil-Teacher Ratio (ptratio)", y = "Median Value (medv)") +
ggtitle("Density Plot: Median Value vs. Pupil-Teacher Ratio") +
theme_minimal()
ggplot(data = dataset, aes(x = b, y = medv)) +
geom_point(color = "purple", alpha = 0.6) +
labs(x = "Proportion of Residents of African American Descent (b)", y = "Median Value (medv)") +
ggtitle("Scatter Plot: Median Value vs. Proportion of Residents of African American Descent (b)") +
theme_minimal()
ggplot(data = dataset, aes(x = lstat, y = medv)) +
geom_point(color = "orange", alpha = 0.6) +
labs(x = "Percentage of Lower Status Population (lstat)", y = "Median Value (medv)") +
ggtitle("Scatter Plot: Median Value vs. Percentage of Lower Status Population (lstat)") +
theme_minimal()
The previous graphs give us an idea of the correlation relationships
between each independent variable and the dependent variable. This does
not mean that they have a direct causal relationship, but they are a
sign that they have the potential to be statistically significant, so
some variables would be interesting to include in linear regression
analyses, for example, zn,nox, rm or ISTAT.
hist(dataset$medv)
hist(dataset$crim)
hist(dataset$zn)
hist(dataset$indus)
hist(dataset$nox)
hist(dataset$rm)
hist(dataset$dis)
hist(dataset$rad)
hist(dataset$tax)
hist(dataset$ptratio)
hist(dataset$b)
hist(dataset$lstat)
Thanks to the histograms, we can see the distribution of the data for each variable, where we see that there are several such as: crim, zn, indus, chas, nox, dix, rad, tax, ptratio, b and lstat that do not follow a normal distribution, therefore it will be interesting to apply quadratic or logarithmic transformations to these variables in our subsequent linear regression analyses.
To better understand the correlation relationships between the variables, a diagram and a correlation matrix will be made. It is very important to analyze this to avoid multiculinality problems.
dataset_excl_chas <- dataset[, !colnames(dataset) %in% c("chas")]
cor_matrix <- cor(dataset_excl_chas)
corrplot(cor_matrix, type = "upper", order = "hclust", addCoef.col = "black")
cor_matrix
## medv cmedv crim zn indus nox
## medv 1.0000000 0.9984759 -0.3883046 0.3604453 -0.4837252 -0.4273208
## cmedv 0.9984759 1.0000000 -0.3895824 0.3603862 -0.4847544 -0.4293002
## crim -0.3883046 -0.3895824 1.0000000 -0.2004692 0.4065834 0.4209717
## zn 0.3604453 0.3603862 -0.2004692 1.0000000 -0.5338282 -0.5166037
## indus -0.4837252 -0.4847544 0.4065834 -0.5338282 1.0000000 0.7636514
## nox -0.4273208 -0.4293002 0.4209717 -0.5166037 0.7636514 1.0000000
## rm 0.6953599 0.6963038 -0.2192467 0.3119906 -0.3916759 -0.3021882
## age -0.3769546 -0.3779989 0.3527343 -0.5695373 0.6447785 0.7314701
## dis 0.2499287 0.2493148 -0.3796701 0.6644082 -0.7080270 -0.7692301
## rad -0.3816262 -0.3847656 0.6255051 -0.3119478 0.5951293 0.6114406
## tax -0.4685359 -0.4719788 0.5827643 -0.3145633 0.7207602 0.6680232
## ptratio -0.5077867 -0.5056546 0.2899456 -0.3916785 0.3832476 0.1889327
## b 0.3334608 0.3348608 -0.3850639 0.1755203 -0.3569765 -0.3800506
## lstat -0.7376627 -0.7408360 0.4556215 -0.4129946 0.6037997 0.5908789
## rm age dis rad tax ptratio
## medv 0.6953599 -0.3769546 0.2499287 -0.3816262 -0.4685359 -0.5077867
## cmedv 0.6963038 -0.3779989 0.2493148 -0.3847656 -0.4719788 -0.5056546
## crim -0.2192467 0.3527343 -0.3796701 0.6255051 0.5827643 0.2899456
## zn 0.3119906 -0.5695373 0.6644082 -0.3119478 -0.3145633 -0.3916785
## indus -0.3916759 0.6447785 -0.7080270 0.5951293 0.7207602 0.3832476
## nox -0.3021882 0.7314701 -0.7692301 0.6114406 0.6680232 0.1889327
## rm 1.0000000 -0.2402649 0.2052462 -0.2098467 -0.2920478 -0.3555015
## age -0.2402649 1.0000000 -0.7478805 0.4560225 0.5064556 0.2615150
## dis 0.2052462 -0.7478805 1.0000000 -0.4945879 -0.5344316 -0.2324705
## rad -0.2098467 0.4560225 -0.4945879 1.0000000 0.9102282 0.4647412
## tax -0.2920478 0.5064556 -0.5344316 0.9102282 1.0000000 0.4608530
## ptratio -0.3555015 0.2615150 -0.2324705 0.4647412 0.4608530 1.0000000
## b 0.1280686 -0.2735340 0.2915117 -0.4444128 -0.4418080 -0.1773833
## lstat -0.6138083 0.6023385 -0.4969958 0.4886763 0.5439934 0.3740443
## b lstat
## medv 0.3334608 -0.7376627
## cmedv 0.3348608 -0.7408360
## crim -0.3850639 0.4556215
## zn 0.1755203 -0.4129946
## indus -0.3569765 0.6037997
## nox -0.3800506 0.5908789
## rm 0.1280686 -0.6138083
## age -0.2735340 0.6023385
## dis 0.2915117 -0.4969958
## rad -0.4444128 0.4886763
## tax -0.4418080 0.5439934
## ptratio -0.1773833 0.3740443
## b 1.0000000 -0.3660869
## lstat -0.3660869 1.0000000
The correlation analysis between the independent variables and the dependent variable “medv” shows several interesting results. A strong positive correlation is observed between “medv” and “cmedv,” indicating that both variables are highly related. Furthermore, “rm” (number of rooms) also presents a significant positive correlation with “medv,” suggesting that as the number of rooms increases, the median home value tends to increase.
On the other hand, there is a notable negative correlation between “medv” and “lstat” (percentage of population with low socioeconomic status), indicating that as the percentage of population with low socioeconomic status decreases, the median value of the housing tends to increase. It is important to mention that multicollinearity is also observed between some independent variables, such as “nox” and “indus,” which suggests that these variables are correlated with each other.
With the exploratory analysis of the data, the variables that are believed to have the best potential to model the situation will be chosen…
Likewise, for models 2 and 3, transformations will be used through the conclusions found in the histograms and other insights found in the exploratory analysis of the data.
dmodel1<-lm(medv ~ rm+nox+lstat+crim+tax+indus,data=dataset)
summary(dmodel1)
##
## Call:
## lm(formula = medv ~ rm + nox + lstat + crim + tax + indus, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.782 -3.553 -1.137 1.765 30.464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.481981 3.299447 -0.752 0.45226
## rm 5.213899 0.444913 11.719 < 2e-16 ***
## nox 3.850583 3.474274 1.108 0.26826
## lstat -0.560857 0.053869 -10.412 < 2e-16 ***
## crim -0.059567 0.035764 -1.666 0.09643 .
## tax -0.006609 0.002392 -2.763 0.00593 **
## indus 0.010882 0.063367 0.172 0.86372
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.459 on 499 degrees of freedom
## Multiple R-squared: 0.6519, Adjusted R-squared: 0.6477
## F-statistic: 155.8 on 6 and 499 DF, p-value: < 2.2e-16
The model has a moderate ability to explain variations in house prices, with an adjusted R-squared of approximately 0.648.
Among the independent variables, “rm” (average number of rooms per household) and “lstat” (percentage of population with low status) are statistically significant. A one-unit increase in the average number of rooms is associated with an increase of about 5,214 units in the housing price, while a one-unit increase in the percentage of the population with low status is associated with a decrease of about 0.561 units in the price of housing.
The variables “tax” (property tax), “nox” (concentration of nitrogen oxides), “crim” (crime rate) and “indus” (proportion of acres of non-retail businesses per city) are not statistically significant in this model.
The model as a whole has a statistically significant F value, indicating that at least one of the independent variables is relevant in predicting house prices.
The intercept is not statistically significant in this context.
In summary, the model suggests that the average number of rooms and the percentage of the population with low status are important factors in predicting housing prices, while other variables do not have a significant impact on the model. The model as a whole is capable of explaining a considerable part of the variability in house prices.
Although the model shows a relatively good fit, it is necessary to perform diagnostic tests to detect problems such as multiculinality, heterocerasticity or normality of residuals.
# Multicollinearity
vif(dmodel1)
## rm nox lstat crim tax indus
## 1.656208 2.746972 2.507990 1.603871 2.753383 3.202960
Since no value exceeds the threshold of 10, multiculionality is not observed, that is, there is no correlation between the independent variables.
# Heteroscedasticity
bptest(dmodel1)
##
## studentized Breusch-Pagan test
##
## data: dmodel1
## BP = 33.193, df = 6, p-value = 9.628e-06
The Breusch-Pagan test shows an extremely low p value (9.628e-06), less than P=0.05, which indicates that we must reject the null hypothesis of homoscedasticity in model dmodel1, concluding that there is evidence of heteroskedasticity in the model.
# Normality of Residuals
qqnorm(dmodel1$residuals)
qqline(dmodel1$residuals)
shapiro.test(dmodel1$residuals)
##
## Shapiro-Wilk normality test
##
## data: dmodel1$residuals
## W = 0.87497, p-value < 2.2e-16
The Shapiro-Wilk normality test yields a p value very close to zero (< 2.2e-16), which indicates that we must reject the null hypothesis that the residuals of model dmodel1 follow a normal distribution, concluding that the residuals do not They are normally distributed.
dmodel2<-lm(log(medv) ~ rm+nox+lstat+crim+tax+indus,data=dataset)
summary(dmodel2)
##
## Call:
## lm(formula = log(medv) ~ rm + nox + lstat + crim + tax + indus,
## data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74079 -0.13175 -0.01677 0.10572 0.93981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.626e+00 1.305e-01 20.118 < 2e-16 ***
## rm 1.432e-01 1.760e-02 8.136 3.26e-15 ***
## nox 5.965e-02 1.374e-01 0.434 0.664477
## lstat -3.031e-02 2.131e-03 -14.223 < 2e-16 ***
## crim -8.308e-03 1.415e-03 -5.872 7.87e-09 ***
## tax -3.153e-04 9.461e-05 -3.332 0.000926 ***
## indus 1.576e-03 2.507e-03 0.629 0.529846
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.216 on 499 degrees of freedom
## Multiple R-squared: 0.7242, Adjusted R-squared: 0.7209
## F-statistic: 218.4 on 6 and 499 DF, p-value: < 2.2e-16
The model achieves a moderate ability to explain variations in the logarithm of house prices, with an adjusted R-squared of approximately 0.721.
Of the independent variables, “rm” (average number of rooms per home) and “lstat” (percentage of population with low status) are statistically significant. A one-unit increase in the average number of bedrooms is associated with an increase in the log house price by approximately 0.315 units, while a one-unit increase in the percentage of the population with low status is associated with a decrease in the log of the house price by approximately 0.047 units.
The variables “nox” (concentration of nitrogen oxides), “crim” (crime rate), “tax” (property tax), and “indus” (proportion of non-retail business acres per city) are not statistically significant in this model.
The model as a whole has a statistically significant F value, indicating that at least one of the independent variables is relevant in predicting the logarithm of house prices.
The intercept is not statistically significant in this context.
In summary, the second model suggests that the average number of rooms and the percentage of population with low status are important factors in predicting the logarithm of house prices, while other variables do not have a significant impact on the model. The model as a whole is able to explain a considerable part of the variability in the logarithm of house prices.
# Multicollinearity
vif(dmodel2)
## rm nox lstat crim tax indus
## 1.656208 2.746972 2.507990 1.603871 2.753383 3.202960
Since no value exceeds the threshold of 10, multiculionality is not observed, that is, there is no correlation between the independent variables.
# Heteroscedasticity
bptest(dmodel2)
##
## studentized Breusch-Pagan test
##
## data: dmodel2
## BP = 48.152, df = 6, p-value = 1.102e-08
The studentized Breusch-Pagan test yields an extremely low p-value (1.518e-07), which leads us to strongly reject the null hypothesis of homoscedasticity in the dmodel2 model. With such a low p value, we can strongly conclude that heteroskedasticity exists in the model residuals. The significance level is much lower than 0.05, which reinforces this conclusion.
# Normality of Residuals
qqnorm(dmodel2$residuals)
qqline(dmodel2$residuals)
shapiro.test(dmodel2$residuals)
##
## Shapiro-Wilk normality test
##
## data: dmodel2$residuals
## W = 0.94569, p-value = 1.164e-12
The Shapiro-Wilk normality test shows an extremely low p-value (1.164e-12), less tah p=0.05. Indicating that the residuals from the dmodel2 model do not follow a normal distribution. Therefore, we can strongly reject the null hypothesis that the residuals are normally distributed. This result suggests that the model does not meet the assumption of normality of errors.
dmodel3<-lm(log(medv) ~ +I(cmedv^2)+nox+rm+nox+lstat+crim,data=dataset)
summary(dmodel3)
##
## Call:
## lm(formula = log(medv) ~ +I(cmedv^2) + nox + rm + nox + lstat +
## crim, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66621 -0.06049 0.00662 0.07079 0.49828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.332e+00 8.588e-02 38.796 < 2e-16 ***
## I(cmedv^2) 4.820e-04 1.724e-05 27.954 < 2e-16 ***
## nox -1.851e-01 6.669e-02 -2.776 0.00571 **
## rm -3.294e-02 1.267e-02 -2.599 0.00962 **
## lstat -1.880e-02 1.396e-03 -13.459 < 2e-16 ***
## crim -9.702e-03 8.128e-04 -11.937 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1363 on 500 degrees of freedom
## Multiple R-squared: 0.8899, Adjusted R-squared: 0.8888
## F-statistic: 807.9 on 5 and 500 DF, p-value: < 2.2e-16
The model has a high capacity to explain variations in the logarithm of house prices, with an adjusted R-squared of approximately 0.8898.
Of the independent variables, “I(cmedv^2)” (a quadratic transformation of the average house price), “nox” (concentration of nitrogen oxides), “rm” (average number of rooms per house), “lstat” (percentage of population with low status) and “crim” (crime rate) are statistically significant.
The coefficient of “I(cmedv^2)” is positive, suggesting a positive quadratic relationship between the transformed variable and the logarithm of house prices.
The intercept is statistically significant and has a positive value, indicating a base value in the logarithm of house prices.
Overall, the model as a whole is highly significant, with a statistically significant F value.
The model has a good fit to the data, with a low residual standard error of approximately 0.1363.
# Multicollinearity
vif(dmodel3)
## I(cmedv^2) nox rm lstat crim
## 2.206569 1.622443 2.154363 2.701969 1.328018
Since no value exceeds the threshold of 10, multiculionality is not observed, that is, there is no correlation between the independent variables.
# Heteroscedasticity
bptest(dmodel3)
##
## studentized Breusch-Pagan test
##
## data: dmodel3
## BP = 107.81, df = 5, p-value < 2.2e-16
El test de Breusch-Pagan arroja un valor p muy cercano a cero (p-value < 2.2e-16), lo que indica una fuerte evidencia en contra de la hipótesis nula de homocedasticidad. Por lo tanto, podemos concluir que existe heterocedasticidad en el modelo dmodel3, lo que sugiere que la varianza de los errores no es constante en todas las observaciones y que podría haber problemas de dispersión en el modelo.
# Normality of Residuals
qqnorm(dmodel3$residuals)
qqline(dmodel3$residuals)
shapiro.test(dmodel3$residuals)
##
## Shapiro-Wilk normality test
##
## data: dmodel3$residuals
## W = 0.93175, p-value = 1.972e-14
The Shapiro-Wilk normality test yields an extremely low p-value (p-value = 1.972e-14), indicating strong evidence against the null hypothesis that the residuals follow a normal distribution. Therefore, we can conclude that the residuals do not follow a normal distribution, suggesting that the assumption of normality of the errors in the model may not be adequately met. It is important to keep this non-normality in mind when interpreting model results and considering possible adjustments or transformations of the data if necessary.
AIC(dmodel1)
## [1] 3162.478
AIC(dmodel2)
## [1] -106.17
AIC(dmodel3)
## [1] -572.6356
The models presented problems of normality and heterocerasticity, which is why it is important to continue making data transformations and apply other regularization methods to achieve a more precise and reliable model for predictions. On the other hand, because most of the variables are statistically significant, have a higher r-squared value, do not present multiculinality and have a lower AIC value, model 3 is chosen as the best option to fit our needs. data. It is also important to mention that heterocerasticity is a problem inherent to the nature of the data, since data from one point in time from various US states are being studied, therefore the variances will differ when studying many groups of data with characteristics different.
The model has a high capacity to explain variations in the logarithm of house prices, with an adjusted R-squared of approximately 0.8898.
Of the independent variables, “I(cmedv^2)” (a quadratic transformation of the average house price), “nox” (concentration of nitrogen oxides), “rm” (average number of rooms per house), “lstat” (percentage of population with low status) and “crim” (crime rate) are statistically significant.
The coefficient of “I(cmedv^2)” is positive, suggesting a positive quadratic relationship between the transformed variable and the logarithm of house prices.
The intercept is statistically significant and has a positive value, indicating a base value in the logarithm of house prices.
Overall, the model as a whole is highly significant, with a statistically significant F value.
The model has a good fit to the data, with a low residual standard error of approximately 0.1363.
Coefficient interpretations:
# LASSO regression via glmnet package can only take numerical observations. Then, the dataset is transformed to model.matrix() format.
# Independent variables
x<-model.matrix(log(medv) ~ rm+nox+lstat+crim,dataset)[,-1] ### OLS model specification
# x<-model.matrix(Weekly_Sales~.,train.data)[,-1] ### matrix of independent variables X's
y<-dataset$medv ### dependent variable
# In estimating LASSO regression it is important to define the lambda that minimizes the prediction error rate.
# Cross-validation ensures that every data / observation from the original dataset (datains) has a chance of appearing in train and test datasets.
# Find the best lambda using cross-validation.
set.seed(123)
cv.lasso<-cv.glmnet(x,y,alpha=1) # alpha = 1 for LASSO
# Display the best lambda value
cv.lasso$lambda.min ### lambda: a numeric value defining the amount of shrinkage. Why min? the higher the value of ?? , the more penalization there is
## [1] 0.1494561
# Fit the final model on the training data
lassomodel<-glmnet(x,y,alpha=1,lambda=cv.lasso$lambda.min)
# Display regression coefficients
coef(lassomodel)
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -1.74433078
## rm 5.06940392
## nox .
## lstat -0.57351659
## crim -0.09007037
# Make predictions on the test data
x.test<-model.matrix(log(medv) ~ rm+nox+lstat+crim,dataset)[,-1] ### OLS model specification
# x.test<-model.matrix(Weekly_Sales~.,test.data)[,-1]
lassopredictions <- lassomodel %>% predict(x.test) %>% as.vector()
### visualizing lasso regression results
lbs_fun <- function(fit, offset_x=1, ...) {
L <- length(fit$lambda)
x <- log(fit$lambda[L])+ offset_x
y <- fit$beta[, L]
labs <- names(y)
text(x, y, labels=labs, ...)
}
lasso<-glmnet(scale(x),y,alpha=1)
plot(lasso,xvar="lambda",label=T)
lbs_fun(lasso)
abline(v=cv.lasso$lambda.min,col="red",lty=2)
abline(v=cv.lasso$lambda.1se,col="blue",lty=2)
summary(lassomodel)
## Length Class Mode
## a0 1 -none- numeric
## beta 4 dgCMatrix S4
## df 1 -none- numeric
## dim 2 -none- numeric
## lambda 1 -none- numeric
## dev.ratio 1 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
Since we are working with a database that collects information on the value of homes at a specific point in time, they are cross-sectional data and therefore do not have serial autocorrelation.
In summary, thanks to the exploratory analysis of the data, we had a clear idea of the data we had and the transformations necessary to achieve a robust model. Model 3 stands out as a model with a very good fit to the data, thanks to good values of AIC and R squared, and it was also a model that had many variables with statistical significance and that helped us corroborate our hypotheses. However, despite reducing heterocerasticity with data transformations and with the lasso model, it is important to continue applying other transformation techniques to avoid this problem, although as already mentioned, it is important to highlight that heterocerasticity is a natural problem of The data we are working with are cross-sectional data from several US states at a given point in time, so the variances will hardly be homogeneous. Regarding the insights found, the great impact that the crime rate has on homes is highlighted, since this variable was statistically significant, with a high and negative coefficient, so an increase in crime leads to a considerable loss in the value of the property. the homes, which is why it is important to guarantee the security of the neighborhoods and thus be able to increase the value of the homes. On the other hand, air quality, as thought, turned out to have a negative significance, which highlights the tendency of people to live in quiet places, with excellent air and a very good quality of life. However, something interesting was seeing that the number of rooms turned out to have a negative relationship with the value of the homes, something very peculiar and that could be related to the fact that perhaps people value other more important factors more than the number of rooms.
Likewise, the value of homes is a variable that depends on many other independent variables of those studied here, it will be interesting to complement this analysis with the economic projections of the areas in the future, the stores that are around, the distances to the cities, etc.
Regarding the application of linear regression in the context of predictive analytics, this approach appears to be a valuable tool. Linear regression models provide an understanding of how various characteristics affect property values and are therefore essential for making informed real estate decisions. Furthermore, the identification of significant variables, such as air quality (nox) and crime rate (crim), can provide crucial information for the planning and development of future real estate projects. This predictive analytics capability can significantly improve efficiency and effectiveness in property investment and management, supporting the continued importance of linear regression in analysis and decision making in the real estate market.