This exploratory work is a first exercise in using R and RStudio for data analysis. The chosen dataset follows the assignment guidelines, and we aim to put into practice the knowledge acquired in the course.
The analysis of the Housing dataset aims to shed light on the main factors that increase the value of a home.
Inspect and prepare the dataset for analysis.
Analyze the variables of interest to determine their influence on the increase in home prices.
The Housing dataset contains 545 observations and 13 variables, listed below:
library(dplyr)
library(DescTools)
library(ggplot2)
library(moments)
library(corrplot)
library(car)
housing = read.csv("Housing.csv", header = TRUE, sep=",")
head(housing, 8)
## price area bedrooms bathrooms stories mainroad guestroom basement
## 1 13300000 7420 4 2 3 yes no no
## 2 12250000 8960 4 4 4 yes no no
## 3 12250000 9960 3 2 2 yes no yes
## 4 12215000 7500 4 2 2 yes no yes
## 5 11410000 7420 4 1 2 yes yes yes
## 6 10850000 7500 3 3 1 yes no yes
## 7 10150000 8580 4 3 4 yes no no
## 8 10150000 16200 5 3 2 yes no no
## hotwaterheating airconditioning parking prefarea furnishingstatus
## 1 no yes 2 yes furnished
## 2 no yes 3 no furnished
## 3 no no 2 yes semi-furnished
## 4 no yes 3 yes furnished
## 5 no yes 2 no furnished
## 6 no yes 2 yes semi-furnished
## 7 no yes 2 yes semi-furnished
## 8 no no 0 no unfurnished
tail(housing, 10)
## price area bedrooms bathrooms stories mainroad guestroom basement
## 536 2100000 3360 2 1 1 yes no no
## 537 1960000 3420 5 1 2 no no no
## 538 1890000 1700 3 1 2 yes no no
## 539 1890000 3649 2 1 1 yes no no
## 540 1855000 2990 2 1 1 no no no
## 541 1820000 3000 2 1 1 yes no yes
## 542 1767150 2400 3 1 1 no no no
## 543 1750000 3620 2 1 1 yes no no
## 544 1750000 2910 3 1 1 no no no
## 545 1750000 3850 3 1 2 yes no no
## hotwaterheating airconditioning parking prefarea furnishingstatus
## 536 no no 1 no unfurnished
## 537 no no 0 no unfurnished
## 538 no no 0 no unfurnished
## 539 no no 0 no unfurnished
## 540 no no 1 no unfurnished
## 541 no no 2 no unfurnished
## 542 no no 0 no semi-furnished
## 543 no no 0 no unfurnished
## 544 no no 0 no furnished
## 545 no no 0 no unfurnished
dim(housing)
## [1] 545 13
class(housing)
## [1] "data.frame"
str(housing)
## 'data.frame': 545 obs. of 13 variables:
## $ price : int 13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
## $ area : int 7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
## $ bedrooms : int 4 4 3 4 4 3 4 5 4 3 ...
## $ bathrooms : int 2 4 2 2 1 3 3 3 1 2 ...
## $ stories : int 3 4 2 2 2 1 4 2 2 4 ...
## $ mainroad : chr "yes" "yes" "yes" "yes" ...
## $ guestroom : chr "no" "no" "no" "no" ...
## $ basement : chr "no" "no" "yes" "yes" ...
## $ hotwaterheating : chr "no" "no" "no" "no" ...
## $ airconditioning : chr "yes" "yes" "no" "yes" ...
## $ parking : int 2 3 2 3 2 2 2 0 2 1 ...
## $ prefarea : chr "yes" "no" "yes" "yes" ...
## $ furnishingstatus: chr "furnished" "furnished" "semi-furnished" "furnished" ...
sum(complete.cases(housing))
## [1] 545
summary(housing)
## price area bedrooms bathrooms
## Min. : 1750000 Min. : 1650 Min. :1.000 Min. :1.000
## 1st Qu.: 3430000 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000
## Median : 4340000 Median : 4600 Median :3.000 Median :1.000
## Mean : 4766729 Mean : 5151 Mean :2.965 Mean :1.286
## 3rd Qu.: 5740000 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :13300000 Max. :16200 Max. :6.000 Max. :4.000
## stories mainroad guestroom basement
## Min. :1.000 Length:545 Length:545 Length:545
## 1st Qu.:1.000 Class :character Class :character Class :character
## Median :2.000 Mode :character Mode :character Mode :character
## Mean :1.806
## 3rd Qu.:2.000
## Max. :4.000
## hotwaterheating airconditioning parking prefarea
## Length:545 Length:545 Min. :0.0000 Length:545
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.6936
## 3rd Qu.:1.0000
## Max. :3.0000
## furnishingstatus
## Length:545
## Class :character
## Mode :character
##
##
##
We convert the categorical variables (and bedrooms) to factors so they are easier to handle. The variables bathrooms, stories, and parking are converted with as.numeric.
housing$bedrooms <- factor(housing$bedrooms)
housing$bathrooms <- as.numeric(housing$bathrooms)
housing$stories <- as.numeric(housing$stories)
housing$mainroad <- factor(housing$mainroad)
housing$guestroom <- factor(housing$guestroom)
housing$basement <- factor(housing$basement)
housing$hotwaterheating <- factor(housing$hotwaterheating)
housing$airconditioning <- factor(housing$airconditioning)
housing$parking <- as.numeric(housing$parking)
housing$prefarea <- factor(housing$prefarea)
housing$furnishingstatus <- factor(housing$furnishingstatus)
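The same conversion can be written more compactly by listing the columns once. The following is a minimal sketch on a small made-up data frame; `toy`, `factor_cols`, and `numeric_cols` are hypothetical names used only for illustration and are not part of the original analysis.

```r
# Toy data frame standing in for housing (values are made up for illustration)
toy <- data.frame(
  bedrooms = c(3L, 4L, 2L),
  parking  = c(1L, 0L, 2L),
  mainroad = c("yes", "no", "yes"),
  furnishingstatus = c("furnished", "unfurnished", "semi-furnished"),
  stringsAsFactors = FALSE
)

# Hypothetical helper vectors naming the columns to convert
factor_cols  <- c("bedrooms", "mainroad", "furnishingstatus")
numeric_cols <- c("parking")

# Apply factor() / as.numeric() to every listed column in one step
toy[factor_cols]  <- lapply(toy[factor_cols], factor)
toy[numeric_cols] <- lapply(toy[numeric_cols], as.numeric)

str(toy)
```

Listing the columns in one place makes it harder to convert the wrong variable by copy and paste.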
We check the structure:
str(housing)
## 'data.frame': 545 obs. of 13 variables:
## $ price : int 13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
## $ area : int 7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
## $ bedrooms : Factor w/ 6 levels "1","2","3","4",..: 4 4 3 4 4 3 4 5 4 3 ...
## $ bathrooms : num 2 4 2 2 1 3 3 3 1 2 ...
## $ stories : num 3 4 2 2 2 1 4 2 2 4 ...
## $ mainroad : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ guestroom : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 2 2 ...
## $ basement : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 1 ...
## $ hotwaterheating : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ airconditioning : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ parking : num 2 3 2 3 2 2 2 0 2 1 ...
## $ prefarea : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 1 2 2 ...
## $ furnishingstatus: Factor w/ 3 levels "furnished","semi-furnished",..: 1 1 2 1 1 2 2 3 1 3 ...
summary(housing)
## price area bedrooms bathrooms stories
## Min. : 1750000 Min. : 1650 1: 2 Min. :1.000 Min. :1.000
## 1st Qu.: 3430000 1st Qu.: 3600 2:136 1st Qu.:1.000 1st Qu.:1.000
## Median : 4340000 Median : 4600 3:300 Median :1.000 Median :2.000
## Mean : 4766729 Mean : 5151 4: 95 Mean :1.286 Mean :1.806
## 3rd Qu.: 5740000 3rd Qu.: 6360 5: 10 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :13300000 Max. :16200 6: 2 Max. :4.000 Max. :4.000
## mainroad guestroom basement hotwaterheating airconditioning parking
## no : 77 no :448 no :354 no :520 no :373 Min. :0.0000
## yes:468 yes: 97 yes:191 yes: 25 yes:172 1st Qu.:0.0000
## Median :0.0000
## Mean :0.6936
## 3rd Qu.:1.0000
## Max. :3.0000
## prefarea furnishingstatus
## no :417 furnished :140
## yes:128 semi-furnished:227
## unfurnished :178
##
##
##
str(housing$price)
## int [1:545] 13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
class(housing$price)
## [1] "integer"
summary(housing$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1750000 3430000 4340000 4766729 5740000 13300000
The summary shows that price is a quantitative variable; although it is stored as an integer, we treat it as continuous.
mean(housing$price)
## [1] 4766729
median(housing$price)
## [1] 4340000
var(housing$price)
## [1] 3.498544e+12
sd(housing$price)
## [1] 1870440
IQR(housing$price)
## [1] 2310000
range(housing$price)
## [1] 1750000 13300000
quantile(housing$price)
## 0% 25% 50% 75% 100%
## 1750000 3430000 4340000 5740000 13300000
This tells us that the middle 50% of our data lie between 3,430,000 and 5,740,000. Given how far the maximum price (13,300,000) is from the third quartile, we probably have outliers and a right skew.
To confirm this, we compute shape measures.
skewness(housing$price)
## [1] 1.2089
kurtosis(housing$price)
## [1] 4.931205
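To make these two numbers easier to read, here is a minimal sketch of the moment-based definitions of skewness and (non-excess) kurtosis, which we assume are the ones used by the `moments` package. The helper names `skew_manual` and `kurt_manual` and the toy vectors are hypothetical.

```r
# Moment-based skewness: third central moment over variance^(3/2)
skew_manual <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Moment-based (non-excess) kurtosis: fourth central moment over variance^2
kurt_manual <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

sym   <- c(1, 2, 3, 4, 5)       # symmetric toy vector
right <- c(1, 1, 2, 2, 3, 10)   # toy vector with a long right tail

skew_manual(sym)     # zero for a symmetric sample
skew_manual(right)   # positive for a right-skewed sample
kurt_manual(sym)
```

Read this way, the positive skewness of 1.21 reported for price is consistent with a right-skewed distribution, and a kurtosis of 4.93, above the value of 3 of a normal distribution, points to heavier tails.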
We prepare the elements needed to build a frequency table.
k <- nclass.Sturges(housing$price) # Number of classes
ac <- (max(housing$price)-min(housing$price))/k # Class width
bins <- seq(min(housing$price), max(housing$price), by = ac) # Break points from the minimum to the maximum
precio.clases <- cut(housing$price, breaks = bins, include.lowest=TRUE, dig.lab = 8)
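As a quick check of the steps above, the number of classes and the class width can be reproduced by hand. This sketch assumes `nclass.Sturges` applies the usual Sturges rule, ceiling(log2(n) + 1), and uses only the sample size and the price range reported in this document.

```r
n         <- 545        # number of observations
price_min <- 1750000    # minimum price reported by summary()
price_max <- 13300000   # maximum price reported by summary()

# Sturges rule for the number of classes
k <- ceiling(log2(n) + 1)

# Class width: price range split into k equal intervals
width <- (price_max - price_min) / k

c(classes = k, width = width)
```

The result, 11 classes of width 1,050,000, agrees with the intervals printed in the frequency table.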
We now build the frequency table:
dist.freq <- table(precio.clases)
dist.freq
## precio.clases
## [1750000,2800000] (2800000,3850000] (3850000,4900000] (4900000,5950000]
## 53 152 140 83
## (5950000,7000000] (7000000,8050000] (8050000,9100000] (9100000,10150000]
## 53 28 21 9
## (10150000,11200000] (11200000,12250000] (12250000,13300000]
## 1 4 1
transform(dist.freq,
rel.freq=prop.table(Freq),
cum.freq=cumsum(Freq),
cum.rel.freq=cumsum(prop.table(Freq)))
## precio.clases Freq rel.freq cum.freq cum.rel.freq
## 1 [1750000,2800000] 53 0.097247706 53 0.09724771
## 2 (2800000,3850000] 152 0.278899083 205 0.37614679
## 3 (3850000,4900000] 140 0.256880734 345 0.63302752
## 4 (4900000,5950000] 83 0.152293578 428 0.78532110
## 5 (5950000,7000000] 53 0.097247706 481 0.88256881
## 6 (7000000,8050000] 28 0.051376147 509 0.93394495
## 7 (8050000,9100000] 21 0.038532110 530 0.97247706
## 8 (9100000,10150000] 9 0.016513761 539 0.98899083
## 9 (10150000,11200000] 1 0.001834862 540 0.99082569
## 10 (11200000,12250000] 4 0.007339450 544 0.99816514
## 11 (12250000,13300000] 1 0.001834862 545 1.00000000
We now draw the histogram and the boxplot.
hist(housing$price, main = "Histograma Precio", xlab="Precio",col = "darkblue")
boxplot(housing$price, horizontal = FALSE,col = "white",border=4)
points (mean(housing$price), col = 3, pch = 19)
Housing prices range from $1,750,000 to $13,300,000. The middle 50% of the homes for sale fall between $3,430,000 and $5,740,000. We identified outliers that could be affecting the model; these values will be handled to improve it. The distribution of the variable is right-skewed.
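One common way to make "outlier" concrete is Tukey's 1.5 × IQR rule, which is also the default whisker range of `boxplot()`. The sketch below applies it to the quartiles reported earlier; the variable names are hypothetical, and the boxplot itself uses hinges, so its whiskers may differ slightly.

```r
q1 <- 3430000   # first quartile of price
q3 <- 5740000   # third quartile of price

iqr <- q3 - q1                   # matches IQR(housing$price) = 2310000
upper_fence <- q3 + 1.5 * iqr    # prices above this are flagged as high outliers
lower_fence <- q1 - 1.5 * iqr    # prices below this are flagged as low outliers

c(lower = lower_fence, upper = upper_fence)
```

Under this rule the upper fence is 9,205,000, while the lower fence is negative, so only high prices can be flagged. That asymmetry matches the right skew noted above.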
We now analyze the categorical variables:
table1::table1(~mainroad, data = housing, na.rm = TRUE, digits = 1, format.number = TRUE)
plot(housing$mainroad, xlab = "mainroad")
| | Overall (N=545) |
|---|---|
| mainroad | |
| no | 77 (14.1%) |
| yes | 468 (85.9%) |
table1::table1(~guestroom, data = housing, na.rm = TRUE, digits = 1, format.number = TRUE)
plot(housing$guestroom, xlab = "guestroom")
| | Overall (N=545) |
|---|---|
| guestroom | |
| no | 448 (82.2%) |
| yes | 97 (17.8%) |
table1::table1(~basement, data = housing, na.rm = TRUE, digits = 1, format.number = TRUE)
plot(housing$basement, xlab = "basement")
| | Overall (N=545) |
|---|---|
| basement | |
| no | 354 (65.0%) |
| yes | 191 (35.0%) |
table1::table1(~airconditioning, data = housing, na.rm = TRUE, digits = 1, format.number = TRUE)
plot(housing$airconditioning, xlab = "airconditioning")
| | Overall (N=545) |
|---|---|
| airconditioning | |
| no | 373 (68.4%) |
| yes | 172 (31.6%) |
table1::table1(~hotwaterheating, data = housing, na.rm = TRUE, digits = 1, format.number = TRUE)
plot(housing$hotwaterheating, xlab = "hotwaterheating")
| | Overall (N=545) |
|---|---|
| hotwaterheating | |
| no | 520 (95.4%) |
| yes | 25 (4.6%) |
table1::table1(~prefarea, data = housing, na.rm = TRUE, digits = 1, format.number = TRUE)
plot(housing$prefarea, xlab = "prefarea")
| | Overall (N=545) |
|---|---|
| prefarea | |
| no | 417 (76.5%) |
| yes | 128 (23.5%) |
table1::table1(~furnishingstatus, data = housing, na.rm = TRUE, digits = 1, format.number = TRUE)
plot(housing$furnishingstatus, xlab = "furnishing")
| | Overall (N=545) |
|---|---|
| furnishingstatus | |
| furnished | 140 (25.7%) |
| semi-furnished | 227 (41.7%) |
| unfurnished | 178 (32.7%) |
From this analysis we conclude that the presence of services and amenities appears to be associated with a higher sale price.
To check this, we run t.test on 3 variables, taking H0: the mean price is the same in both groups, against H1: the mean prices differ.
t.test(price~mainroad,data = housing)
##
## Welch Two Sample t-test
##
## data: price by mainroad
## t = -11.853, df = 210.68, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -1857781 -1327964
## sample estimates:
## mean in group no mean in group yes
## 3398905 4991777
The p-value (< 2.2e-16) is below the 0.05 significance level, so we reject H0: price and mainroad are dependent.
t.test(price~guestroom,data = housing)
##
## Welch Two Sample t-test
##
## data: price by guestroom
## t = -6.3847, df = 146.28, p-value = 2.144e-09
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -1634767.3 -861935.3
## sample estimates:
## mean in group no mean in group yes
## 4544546 5792897
The p-value (2.144e-09) is below the 0.05 significance level, so we reject H0: price and guestroom are dependent.
t.test(price~prefarea,data = housing)
##
## Welch Two Sample t-test
##
## data: price by prefarea
## t = -7.4923, df = 187.47, p-value = 2.592e-12
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -1836514 -1070980
## sample estimates:
## mean in group no mean in group yes
## 4425299 5879046
The p-value (2.592e-12) is below the 0.05 significance level, so we reject H0: price and prefarea are dependent.
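The three tests above are Welch two-sample t-tests, which do not assume equal variances in the two groups. As a minimal sketch of what `t.test` computes, the statistic and the degrees of freedom can be rebuilt by hand. The data below are simulated (hypothetical group sizes, means, and standard deviations), not the Housing dataset.

```r
# Simulated prices for two groups; values are made up for illustration only
set.seed(42)
no  <- rnorm(80,  mean = 3.4e6, sd = 1.2e6)
yes <- rnorm(400, mean = 5.0e6, sd = 1.9e6)

# Squared standard error of each group mean
se2_no  <- var(no)  / length(no)
se2_yes <- var(yes) / length(yes)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat   <- (mean(no) - mean(yes)) / sqrt(se2_no + se2_yes)
df_welch <- (se2_no + se2_yes)^2 /
  (se2_no^2 / (length(no) - 1) + se2_yes^2 / (length(yes) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df_welch)

# Reference result from the built-in test
res <- t.test(no, yes)
c(manual_t = t_stat, builtin_t = unname(res$statistic))
```

The non-integer degrees of freedom in the outputs above (for example 210.68) come from this Welch-Satterthwaite approximation.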
str(housing)
## 'data.frame': 545 obs. of 13 variables:
## $ price : int 13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
## $ area : int 7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
## $ bedrooms : Factor w/ 6 levels "1","2","3","4",..: 4 4 3 4 4 3 4 5 4 3 ...
## $ bathrooms : num 2 4 2 2 1 3 3 3 1 2 ...
## $ stories : num 3 4 2 2 2 1 4 2 2 4 ...
## $ mainroad : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ guestroom : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 2 2 ...
## $ basement : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 1 ...
## $ hotwaterheating : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ airconditioning : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 1 2 2 ...
## $ parking : num 2 3 2 3 2 2 2 0 2 1 ...
## $ prefarea : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 1 2 2 ...
## $ furnishingstatus: Factor w/ 3 levels "furnished","semi-furnished",..: 1 1 2 1 1 2 2 3 1 3 ...
freq_bedrooms <- table(housing$bedrooms)
freq_bedrooms
##
## 1 2 3 4 5 6
## 2 136 300 95 10 2
transform(freq_bedrooms,
rel.freq=prop.table(Freq),
cum.freq=cumsum(Freq),
cum.rel.freq=cumsum(prop.table(Freq)))
## Var1 Freq rel.freq cum.freq cum.rel.freq
## 1 1 2 0.003669725 2 0.003669725
## 2 2 136 0.249541284 138 0.253211009
## 3 3 300 0.550458716 438 0.803669725
## 4 4 95 0.174311927 533 0.977981651
## 5 5 10 0.018348624 543 0.996330275
## 6 6 2 0.003669725 545 1.000000000
barplot(freq_bedrooms, xlab = "Bedrooms", col = "#10454F", main = "Bedrooms")
freq_bathrooms <- table(housing$bathrooms)
freq_bathrooms
##
## 1 2 3 4
## 401 133 10 1
transform(freq_bathrooms,
rel.freq=prop.table(Freq),
cum.freq=cumsum(Freq),
cum.rel.freq=cumsum(prop.table(Freq)))
## Var1 Freq rel.freq cum.freq cum.rel.freq
## 1 1 401 0.735779817 401 0.7357798
## 2 2 133 0.244036697 534 0.9798165
## 3 3 10 0.018348624 544 0.9981651
## 4 4 1 0.001834862 545 1.0000000
barplot(freq_bathrooms, xlab = "Bathrooms", col = "#10454F", main = "Bathrooms")
freq_stories <- table(housing$stories)
freq_stories
##
## 1 2 3 4
## 227 238 39 41
transform(freq_stories,
rel.freq=prop.table(Freq),
cum.freq=cumsum(Freq),
cum.rel.freq=cumsum(prop.table(Freq)))
## Var1 Freq rel.freq cum.freq cum.rel.freq
## 1 1 227 0.41651376 227 0.4165138
## 2 2 238 0.43669725 465 0.8532110
## 3 3 39 0.07155963 504 0.9247706
## 4 4 41 0.07522936 545 1.0000000
barplot(freq_stories, xlab = "Stories", col = "#10454F", main = "Stories")
freq_parking <- table(housing$parking)
freq_parking
##
## 0 1 2 3
## 299 126 108 12
transform(freq_parking,
rel.freq=prop.table(Freq),
cum.freq=cumsum(Freq),
cum.rel.freq=cumsum(prop.table(Freq)))
## Var1 Freq rel.freq cum.freq cum.rel.freq
## 1 0 299 0.54862385 299 0.5486239
## 2 1 126 0.23119266 425 0.7798165
## 3 2 108 0.19816514 533 0.9779817
## 4 3 12 0.02201835 545 1.0000000
barplot(freq_parking, xlab = "Parking", col = "#10454F", main = "Parking")
Analyzing their relationship with price:
Price varies as the number of services and rooms available in the home increases. Outliers are still present.
Objective: identify correlations between at least two variables and pose a regression problem.
housing.select<-select(housing,price,area)
round(cor(housing.select),4)
## price area
## price 1.000 0.536
## area 0.536 1.000
pairs(~ price+area,
data=housing, gap=.4, cex.labels=1.5)
corrplot(cor(housing.select))
The observed correlation between price and area is positive and moderate (r = 0.536).
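To put the correlation in terms of explained variation: in a simple regression of price on area alone, R² equals the squared correlation coefficient. A short worked calculation with the value reported above (variable names are hypothetical):

```r
r <- 0.536               # correlation between price and area reported by cor()

r_squared <- r^2         # share of price variance explained by area alone
round(100 * r_squared, 1)
```

Area on its own would therefore account for roughly 29% of the variation in price, which motivates adding further predictors.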
attach(housing)
model1 <- lm(price ~ area+bedrooms+bathrooms+stories+parking+mainroad+guestroom
+basement+hotwaterheating+airconditioning+prefarea+furnishingstatus)
summary(model1)
##
## Call:
## lm(formula = price ~ area + bedrooms + bathrooms + stories +
## parking + mainroad + guestroom + basement + hotwaterheating +
## airconditioning + prefarea + furnishingstatus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2603715 -652407 -78857 515376 5210924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 357446.85 775559.37 0.461 0.645068
## area 244.65 24.48 9.992 < 2e-16 ***
## bedrooms2 -114030.93 766523.74 -0.149 0.881797
## bedrooms3 44379.14 767625.38 0.058 0.953919
## bedrooms4 95121.74 776261.47 0.123 0.902519
## bedrooms5 225794.30 840612.20 0.269 0.788337
## bedrooms6 725852.04 1075980.80 0.675 0.500228
## bathrooms 995395.42 104634.07 9.513 < 2e-16 ***
## stories 447940.57 66210.11 6.765 3.54e-11 ***
## parking 277396.50 58975.46 4.704 3.27e-06 ***
## mainroadyes 420192.91 143761.60 2.923 0.003617 **
## guestroomyes 302160.05 132199.82 2.286 0.022672 *
## basementyes 348044.60 111709.52 3.116 0.001936 **
## hotwaterheatingyes 855162.11 225019.54 3.800 0.000161 ***
## airconditioningyes 865512.39 108940.97 7.945 1.19e-14 ***
## prefareayes 646125.01 117161.83 5.515 5.48e-08 ***
## furnishingstatussemi-furnished -42608.78 117741.15 -0.362 0.717583
## furnishingstatusunfurnished -412029.93 126841.54 -3.248 0.001235 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1071000 on 527 degrees of freedom
## Multiple R-squared: 0.6822, Adjusted R-squared: 0.6719
## F-statistic: 66.54 on 17 and 527 DF, p-value: < 2.2e-16
anova(model1)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## area 1 5.4678e+14 5.4678e+14 476.3753 < 2.2e-16 ***
## bedrooms 5 1.8368e+14 3.6736e+13 32.0062 < 2.2e-16 ***
## bathrooms 1 2.2437e+14 2.2437e+14 195.4817 < 2.2e-16 ***
## stories 1 7.4811e+13 7.4811e+13 65.1784 4.671e-15 ***
## parking 1 4.6539e+13 4.6539e+13 40.5467 4.183e-10 ***
## mainroad 1 2.4823e+13 2.4823e+13 21.6268 4.192e-06 ***
## guestroom 1 3.3227e+13 3.3227e+13 28.9489 1.120e-07 ***
## basement 1 2.6272e+13 2.6272e+13 22.8895 2.231e-06 ***
## hotwaterheating 1 7.2006e+12 7.2006e+12 6.2734 0.012557 *
## airconditioning 1 7.8517e+13 7.8517e+13 68.4068 1.092e-15 ***
## prefarea 1 3.5577e+13 3.5577e+13 30.9964 4.123e-08 ***
## furnishingstatus 2 1.6523e+13 8.2616e+12 7.1979 0.000824 ***
## Residuals 527 6.0489e+14 1.1478e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R² gives the percentage of the variation in price explained by the model: here, 68.22% of the variation in price is explained by model1.
In addition, at the 95% confidence level, none of the individual bedrooms coefficients is significant in the model.
We refit the model without the bedrooms variable:
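The R² and adjusted R² values can be recovered from the ANOVA table, since R² = 1 − SS_residual / SS_total. The sketch below uses only the rounded sums of squares printed by `anova(model1)`; the variable names are hypothetical.

```r
# Sum Sq column from anova(model1), in the order printed
ss_terms <- c(area = 5.4678e14, bedrooms = 1.8368e14, bathrooms = 2.2437e14,
              stories = 7.4811e13, parking = 4.6539e13, mainroad = 2.4823e13,
              guestroom = 3.3227e13, basement = 2.6272e13,
              hotwaterheating = 7.2006e12, airconditioning = 7.8517e13,
              prefarea = 3.5577e13, furnishingstatus = 1.6523e13)
ss_resid <- 6.0489e14   # residual sum of squares
n        <- 545         # observations
df_resid <- 527         # residual degrees of freedom

ss_total <- sum(ss_terms) + ss_resid

# R-squared and adjusted R-squared from their definitions
r2     <- 1 - ss_resid / ss_total
adj_r2 <- 1 - (ss_resid / df_resid) / (ss_total / (n - 1))

round(c(r2 = r2, adj_r2 = adj_r2), 4)
```

Up to rounding of the printed sums of squares, this reproduces the 0.6822 and 0.6719 shown in `summary(model1)`.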
model2 <- update(model1,~.-bedrooms)
summary(model2)
##
## Call:
## lm(formula = price ~ area + bathrooms + stories + parking + mainroad +
## guestroom + basement + hotwaterheating + airconditioning +
## prefarea + furnishingstatus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2640857 -681531 -67385 495051 5210247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 268518.21 222746.65 1.205 0.228551
## area 247.17 24.25 10.194 < 2e-16 ***
## bathrooms 1024077.94 100906.12 10.149 < 2e-16 ***
## stories 486503.39 60160.58 8.087 4.16e-15 ***
## parking 283577.52 58464.89 4.850 1.62e-06 ***
## mainroadyes 393735.49 141352.60 2.785 0.005535 **
## guestroomyes 295836.20 131862.32 2.244 0.025274 *
## basementyes 374040.80 109394.16 3.419 0.000676 ***
## hotwaterheatingyes 861783.03 223431.02 3.857 0.000129 ***
## airconditioningyes 863270.43 108501.89 7.956 1.07e-14 ***
## prefareayes 655385.53 115819.75 5.659 2.50e-08 ***
## furnishingstatussemi-furnished -44100.39 116729.66 -0.378 0.705730
## furnishingstatusunfurnished -417991.06 126315.89 -3.309 0.000999 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1069000 on 532 degrees of freedom
## Multiple R-squared: 0.6803, Adjusted R-squared: 0.6731
## F-statistic: 94.34 on 12 and 532 DF, p-value: < 2.2e-16
anova(model2)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## area 1 5.4678e+14 5.4678e+14 478.079 < 2.2e-16 ***
## bathrooms 1 3.3838e+14 3.3838e+14 295.861 < 2.2e-16 ***
## stories 1 1.2531e+14 1.2531e+14 109.564 < 2.2e-16 ***
## parking 1 5.2007e+13 5.2007e+13 45.473 4.043e-11 ***
## mainroad 1 2.1928e+13 2.1928e+13 19.173 1.438e-05 ***
## guestroom 1 3.6165e+13 3.6165e+13 31.621 3.030e-08 ***
## basement 1 3.3160e+13 3.3160e+13 28.994 1.091e-07 ***
## hotwaterheating 1 7.2019e+12 7.2019e+12 6.297 0.0123900 *
## airconditioning 1 7.9306e+13 7.9306e+13 69.341 7.048e-16 ***
## prefarea 1 3.7419e+13 3.7419e+13 32.717 1.780e-08 ***
## furnishingstatus 2 1.7110e+13 8.5549e+12 7.480 0.0006256 ***
## Residuals 532 6.0845e+14 1.1437e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dropping bedrooms does not meaningfully improve the fit (R² of 0.6803 versus 0.6822; adjusted R² of 0.6731 versus 0.6719), so we keep model1.
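Because model2 is nested in model1, the two can also be compared with a partial F-test, which is what `anova(model2, model1)` would report on the fitted objects. The following sketch rebuilds that statistic from the rounded residual sums of squares printed above; the variable names are hypothetical.

```r
rss_full <- 6.0489e14; df_full <- 527   # model1 (with bedrooms)
rss_red  <- 6.0845e14; df_red  <- 532   # model2 (without bedrooms)

# Partial F statistic for the 5 bedrooms coefficients taken jointly
f_stat <- ((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)

# Upper-tail p-value under F(5, 527)
p_value <- pf(f_stat, df_red - df_full, df_full, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

With these rounded inputs the F statistic is about 0.62 and the p-value is well above 0.05, which is consistent with bedrooms adding little once the other predictors are in the model.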
vif(model1)
## GVIF Df GVIF^(1/(2*Df))
## area 1.337976 1 1.156709
## bedrooms 1.641134 5 1.050786
## bathrooms 1.310094 1 1.144593
## stories 1.563568 1 1.250427
## parking 1.223704 1 1.106211
## mainroad 1.190594 1 1.091143
## guestroom 1.214095 1 1.101860
## basement 1.348833 1 1.161393
## hotwaterheating 1.052265 1 1.025800
## airconditioning 1.217197 1 1.103267
## prefarea 1.171277 1 1.082255
## furnishingstatus 1.132983 2 1.031706
No collinearity is detected among the variables: every GVIF^(1/(2*Df)) value is close to 1.
Removing outliers to improve the model:
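The last column of the `vif()` output rescales GVIF so that terms with several degrees of freedom (such as the 5 dummy variables of bedrooms) are comparable to single-coefficient terms. A short check of that column from the printed GVIF values (the vector names are hypothetical):

```r
# GVIF and Df as printed by vif(model1) for three of the terms
gvif <- c(area = 1.337976, bedrooms = 1.641134, furnishingstatus = 1.132983)
df   <- c(area = 1,        bedrooms = 5,        furnishingstatus = 2)

# Adjusted value: GVIF^(1 / (2 * Df))
adj <- gvif^(1 / (2 * df))
round(adj, 6)
```

Squaring the adjusted value gives a number on the usual VIF scale; a common rule of thumb only starts to worry about values above 5 or 10, far from what is seen here.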
residuos=rstudent(model1)
summary(residuos)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.489156 -0.621834 -0.074590 0.000397 0.483670 5.047234
plot(residuos)
Removing observations whose studentized residual exceeds 2.5:
outliers = which(residuos>2.5)
outliers
## 1 3 4 5 14 15 16 21 28
## 1 3 4 5 14 15 16 21 28
housing.outliers= housing[-outliers,]
nrow(housing.outliers)
## [1] 536
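The filter above only looks at large positive residuals. A two-sided variant uses `abs()`; since the smallest studentized residual reported here is about -2.49, both filters select the same 9 rows in this dataset. The sketch below shows the two-sided version on simulated data (all names and values are hypothetical) and also guards against the case where nothing is flagged, because indexing with `-integer(0)` would drop every row.

```r
# Simulated regression data with one planted outlier (illustration only)
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
y[10] <- y[10] + 15
sim <- data.frame(x, y)

fit <- lm(y ~ x, data = sim)
rs  <- rstudent(fit)

# Flag observations with a large studentized residual in either direction
flagged <- which(abs(rs) > 2.5)

# Drop flagged rows only if there are any
clean <- if (length(flagged) > 0) sim[-flagged, ] else sim

fit_clean <- lm(y ~ x, data = clean)
nrow(clean)
```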
model3 <- lm(price ~ area+bathrooms+stories+mainroad+guestroom
+basement+hotwaterheating+airconditioning+parking+prefarea+furnishingstatus,
data = housing.outliers)
summary(model3)
##
## Call:
## lm(formula = price ~ area + bathrooms + stories + mainroad +
## guestroom + basement + hotwaterheating + airconditioning +
## parking + prefarea + furnishingstatus, data = housing.outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2552754 -609397 -31035 518760 2751079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 370300.80 194333.73 1.905 0.057265 .
## area 232.38 21.27 10.927 < 2e-16 ***
## bathrooms 1007014.90 88516.23 11.377 < 2e-16 ***
## stories 466204.87 52381.33 8.900 < 2e-16 ***
## mainroadyes 403684.31 122773.50 3.288 0.001077 **
## guestroomyes 389659.78 115363.42 3.378 0.000785 ***
## basementyes 298016.71 96220.46 3.097 0.002058 **
## hotwaterheatingyes 664822.11 201168.27 3.305 0.001016 **
## airconditioningyes 890033.41 94696.16 9.399 < 2e-16 ***
## parking 198370.31 51521.16 3.850 0.000133 ***
## prefareayes 603498.11 101808.02 5.928 5.58e-09 ***
## furnishingstatussemi-furnished -16248.06 102482.08 -0.159 0.874088
## furnishingstatusunfurnished -344189.54 110107.64 -3.126 0.001871 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 928000 on 523 degrees of freedom
## Multiple R-squared: 0.7153, Adjusted R-squared: 0.7088
## F-statistic: 109.5 on 12 and 523 DF, p-value: < 2.2e-16
The new model improves R² by about 3 points: it now explains 71.5% of the variation.
Only 9 records were removed from the dataset, about 1.7% of the total.
apalancamiento=hatvalues(model3)
plot(apalancamiento)
puntosapa=which(apalancamiento>.08)
puntosapa
## 8 67
## 4 58
housing.outliers.heat = housing.outliers[-puntosapa,]
nrow(housing.outliers.heat)
## [1] 534
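The 0.08 leverage cutoff used above can be compared with common rules of thumb based on the average hat value p/n, where p is the number of estimated coefficients. This is only a sketch under the assumption that model3 has 13 coefficients (including the intercept) fitted on 536 rows, as its summary indicates; the variable names are hypothetical.

```r
p <- 13    # coefficients in model3, including the intercept
n <- 536   # rows in housing.outliers

mean_leverage <- p / n       # hat values always sum to p
cutoff_2x     <- 2 * p / n   # a commonly used threshold
cutoff_3x     <- 3 * p / n   # a stricter, also common threshold

round(c(mean = mean_leverage, two_x = cutoff_2x, three_x = cutoff_3x), 4)
```

The chosen cutoff of 0.08 is slightly above three times the average leverage, so it is on the conservative side and flags only the most extreme points.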
model4 <- lm(price ~ area+bathrooms+stories+parking+mainroad+guestroom
+basement+hotwaterheating+airconditioning+prefarea,
data = housing.outliers.heat)
summary(model4)
##
## Call:
## lm(formula = price ~ area + bathrooms + stories + parking + mainroad +
## guestroom + basement + hotwaterheating + airconditioning +
## prefarea, data = housing.outliers.heat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2724277 -566034 -55210 535405 2844063
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.948e+05 1.705e+05 1.142 0.253958
## area 2.219e+02 2.261e+01 9.814 < 2e-16 ***
## bathrooms 1.003e+06 9.034e+04 11.099 < 2e-16 ***
## stories 4.789e+05 5.286e+04 9.061 < 2e-16 ***
## parking 2.318e+05 5.230e+04 4.433 1.13e-05 ***
## mainroadyes 4.432e+05 1.235e+05 3.588 0.000364 ***
## guestroomyes 4.180e+05 1.167e+05 3.582 0.000373 ***
## basementyes 3.279e+05 9.720e+04 3.373 0.000799 ***
## hotwaterheatingyes 6.843e+05 2.072e+05 3.302 0.001025 **
## airconditioningyes 9.175e+05 9.558e+04 9.600 < 2e-16 ***
## prefareayes 6.234e+05 1.031e+05 6.046 2.82e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 936900 on 523 degrees of freedom
## Multiple R-squared: 0.7032, Adjusted R-squared: 0.6975
## F-statistic: 123.9 on 10 and 523 DF, p-value: < 2.2e-16
model4 does not improve the fit (R² of 0.7032 versus 0.7153 for model3), so we keep model3.
Regression is a parametric statistical model, which means it involves parameters associated with a specific distribution.
The errors are assumed to follow a normal distribution; that is, the error distribution is expected to be normal at every value of the predictors.
As for endogeneity, the error term, which represents the variables not modeled explicitly, must not be correlated with the explanatory variables.
StanRes2 <- rstandard(model3)
qqnorm(StanRes2)
qqline(StanRes2)
Graphically, the residuals appear to be normally distributed: the points fall along a 45-degree line, and deviations are seen only at the extremes.
ks.test(StanRes2,"pnorm")
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: StanRes2
## D = 0.046186, p-value = 0.203
## alternative hypothesis: two-sided
hist(StanRes2)
H0: the variable follows a normal distribution.
H1: the variable does not follow a normal distribution.
Since the p-value (0.203) is greater than the 0.05 significance level, we fail to reject H0 and treat the residuals as normally distributed.
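A complementary check that is often run alongside the Kolmogorov-Smirnov test is the Shapiro-Wilk test (`shapiro.test` in base R). The sketch below applies both to simulated values standing in for standardized residuals; the data and variable names are hypothetical, so the p-values say nothing about model3 itself.

```r
# Simulated stand-in for standardized residuals (illustration only)
set.seed(123)
sim_res <- rnorm(200)

sw <- shapiro.test(sim_res)        # Shapiro-Wilk normality test
ks <- ks.test(sim_res, "pnorm")    # Kolmogorov-Smirnov against N(0, 1)

alpha <- 0.05
# TRUE means we fail to reject normality at the chosen level
c(shapiro = sw$p.value, ks = ks$p.value) > alpha
```

In both tests the decision rule is the same as above: a p-value greater than the significance level means we fail to reject H0.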
Our fitted regression equation is: price = 370300.80 + 232.38 area + 1007014.90 bathrooms + 466204.87 stories + 403684.31 mainroadyes + 389659.78 guestroomyes + 298016.71 basementyes + 664822.11 hotwaterheatingyes + 890033.41 airconditioningyes + 198370.31 parking + 603498.11 prefareayes - 16248.06 furnishingstatussemi-furnished - 344189.54 furnishingstatusunfurnished
Once the validations are done, we can use the model to make predictions.
data <- data.frame(
area= c(7510,5760,3500,8520),
bathrooms= c(2,2,2,1),
stories= c(2,4,2,1),
mainroad=c("yes","yes","yes","yes"),
guestroom=c("no","yes","no","no"),
basement=c("yes","no","no","no"),
hotwaterheating=c("no","no","yes","no"),
airconditioning=c("yes","yes","no","yes"),
parking=c(3,1,2,2),
prefarea=c("yes","yes","no","no"),
furnishingstatus=c("furnished","furnished","furnished","furnished")
)
predict(model3,newdata= data, interval= "confidence", level=0.95)
## fit lwr upr
## 1 7852236 7526055 8178417
## 2 8072888 7734623 8411154
## 3 5595307 5144288 6046326
## 4 5513832 5234351 5793313
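The first fitted value can be reproduced by hand from the coefficients of model3, which shows how the dummy variables enter the equation (1 for "yes", 0 for "no", and 0 for both furnishing dummies because "furnished" is the reference level). This sketch uses the rounded coefficients printed by `summary(model3)`, so it matches `predict` only up to rounding; `coefs` and `x1` are hypothetical names.

```r
# Rounded coefficients from summary(model3)
coefs <- c(intercept = 370300.80, area = 232.38, bathrooms = 1007014.90,
           stories = 466204.87, mainroadyes = 403684.31,
           guestroomyes = 389659.78, basementyes = 298016.71,
           hotwaterheatingyes = 664822.11, airconditioningyes = 890033.41,
           parking = 198370.31, prefareayes = 603498.11,
           `furnishingstatussemi-furnished` = -16248.06,
           furnishingstatusunfurnished = -344189.54)

# First row of the new data, in the same order as coefs:
# area 7510, 2 bathrooms, 2 stories, mainroad yes, guestroom no, basement yes,
# hotwaterheating no, airconditioning yes, parking 3, prefarea yes, furnished
x1 <- c(1, 7510, 2, 2, 1, 0, 1, 0, 1, 3, 1, 0, 0)

pred <- sum(coefs * x1)
pred
```

The result is within a few tens of currency units of the 7,852,236 returned by `predict`. Note that `interval = "confidence"` bounds the mean price of homes with these characteristics; `interval = "prediction"` would give the wider interval for an individual home.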
The model identifies a significant influence of the analyzed variables on home prices.
These results can help clarify which specific characteristics contribute to a higher property value, and real estate agents, buyers, or sellers could take them into account when making informed decisions about transactions, since the predictions rely on model3, which explains 71.5% of the variation in price.
It is important to keep in mind that these conclusions derive from the analysis of this specific dataset and should be considered in the context of that particular data.