In this R Markdown session, I will examine the characteristics of the “USA Real Estate Dataset” available on the Kaggle website. In addition, an AIC (Akaike Information Criterion) model will be created to see if there is a potential regression model for predicting a house’s price based on the variables available.
The dataset used can be found on the Kaggle website, an online platform for data scientists containing free datasets and code collaboration. According to the author of the dataset, the dataset is scraped/updated weekly from https://www.realtor.com. Below is the link for the dataset:
https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset.
The data contains eight variables used to explore the effect of various factors on test scores. The variables are:
First, the dataset is loaded into R and saved as “realestate”.
realestate = read.csv("~/R datasets/realtor-data.csv", header=TRUE)
Now that the dataset is loaded into R, the next step is to view the data and see if it’s clean for analysis.
#Inspect the data frame.
head(realestate)
## status price bed bath acre_lot
## 1 for_sale 105000 3 2 0.12
## 2 for_sale 80000 4 2 0.08
## 3 for_sale 67000 2 1 0.15
## 4 for_sale 145000 4 2 0.10
## 5 for_sale 65000 6 2 0.05
## 6 for_sale 179000 4 3 0.46
## full_address
## 1 Sector Yahuecas Titulo # V84, Adjuntas, PR, 00601
## 2 Km 78 9 Carr # 135, Adjuntas, PR, 00601
## 3 556G 556-G 16 St, Juana Diaz, PR, 00795
## 4 R5 Comunidad El Paraso Calle De Oro R-5 Ponce, Ponce, PR, 00731
## 5 14 Navarro, Mayaguez, PR, 00680
## 6 Bo Calabazas San Sebastian, San Sebastian, PR, 00612
## street city state
## 1 Sector Yahuecas Titulo # V84 Adjuntas Puerto Rico
## 2 Km 78 9 Carr # 135 Adjuntas Puerto Rico
## 3 556G 556-G 16 St Juana Diaz Puerto Rico
## 4 R5 Comunidad El Paraso Calle De Oro R-5 Ponce Ponce Puerto Rico
## 5 14 Navarro Mayaguez Puerto Rico
## 6 Bo Calabazas San Sebastian San Sebastian Puerto Rico
## zip_code house_size sold_date
## 1 601 920
## 2 601 1527
## 3 795 748
## 4 731 1800
## 5 680 NA
## 6 612 2520
#View the column names.
colnames(realestate)
## [1] "status" "price" "bed" "bath" "acre_lot"
## [6] "full_address" "street" "city" "state" "zip_code"
## [11] "house_size" "sold_date"
#View summary of the data frame.
summary(realestate)
## status price bed bath
## Length:923159 Min. : 0 Min. : 1.00 Min. : 1.00
## Class :character 1st Qu.: 269000 1st Qu.: 2.00 1st Qu.: 1.00
## Mode :character Median : 475000 Median : 3.00 Median : 2.00
## Mean : 884123 Mean : 3.33 Mean : 2.49
## 3rd Qu.: 839900 3rd Qu.: 4.00 3rd Qu.: 3.00
## Max. :875000000 Max. :123.00 Max. :198.00
## NA's :71 NA's :131703 NA's :115192
## acre_lot full_address street city
## Min. : 0.00 Length:923159 Length:923159 Length:923159
## 1st Qu.: 0.11 Class :character Class :character Class :character
## Median : 0.29 Mode :character Mode :character Mode :character
## Mean : 17.08
## 3rd Qu.: 1.15
## Max. :100000.00
## NA's :273623
## state zip_code house_size sold_date
## Length:923159 Min. : 601 Min. : 100 Length:923159
## Class :character 1st Qu.: 2919 1st Qu.: 1130 Class :character
## Mode :character Median : 7004 Median : 1651 Mode :character
## Mean : 6590 Mean : 2142
## 3rd Qu.:10001 3rd Qu.: 2499
## Max. :99999 Max. :1450112
## NA's :205 NA's :297843
#View data types in the data frame.
str(realestate)
## 'data.frame': 923159 obs. of 12 variables:
## $ status : chr "for_sale" "for_sale" "for_sale" "for_sale" ...
## $ price : num 105000 80000 67000 145000 65000 179000 50000 71600 100000 300000 ...
## $ bed : num 3 4 2 4 6 4 3 3 2 5 ...
## $ bath : num 2 2 1 2 2 3 1 2 1 3 ...
## $ acre_lot : num 0.12 0.08 0.15 0.1 0.05 0.46 0.2 0.08 0.09 7.46 ...
## $ full_address: chr "Sector Yahuecas Titulo # V84, Adjuntas, PR, 00601" "Km 78 9 Carr # 135, Adjuntas, PR, 00601" "556G 556-G 16 St, Juana Diaz, PR, 00795" "R5 Comunidad El Paraso Calle De Oro R-5 Ponce, Ponce, PR, 00731" ...
## $ street : chr "Sector Yahuecas Titulo # V84" "Km 78 9 Carr # 135" "556G 556-G 16 St" "R5 Comunidad El Paraso Calle De Oro R-5 Ponce" ...
## $ city : chr "Adjuntas" "Adjuntas" "Juana Diaz" "Ponce" ...
## $ state : chr "Puerto Rico" "Puerto Rico" "Puerto Rico" "Puerto Rico" ...
## $ zip_code : num 601 601 795 731 680 612 639 731 730 670 ...
## $ house_size : num 920 1527 748 1800 NA ...
## $ sold_date : chr "" "" "" "" ...
This is a very large dataset, containing over 900,000 entries. To perform any kind of analysis, the data will need to be cleaned and condensed.
To begin data cleaning, a quick summary is performed to see the frequency for the number of listings in each state, the number of bathrooms, and the number of bedrooms in the dataset.
realestate%>%group_by(state)%>%summarize(n = n())
## # A tibble: 18 × 2
## state n
## <chr> <int>
## 1 Connecticut 91644
## 2 Delaware 2135
## 3 Georgia 50
## 4 Maine 36650
## 5 Massachusetts 175248
## 6 New Hampshire 51394
## 7 New Jersey 233722
## 8 New York 211778
## 9 Pennsylvania 17148
## 10 Puerto Rico 24679
## 11 Rhode Island 29596
## 12 South Carolina 25
## 13 Tennessee 20
## 14 Vermont 46460
## 15 Virgin Islands 2573
## 16 Virginia 31
## 17 West Virginia 3
## 18 Wyoming 3
realestate%>%group_by(bath)%>%summarize(n = n())
## # A tibble: 40 × 2
## bath n
## <dbl> <int>
## 1 1 202814
## 2 2 283430
## 3 3 191476
## 4 4 71766
## 5 5 27896
## 6 6 14014
## 7 7 6303
## 8 8 4097
## 9 9 2207
## 10 10 1369
## # … with 30 more rows
realestate%>%group_by(bed)%>%summarize(n = n())
## # A tibble: 44 × 2
## bed n
## <dbl> <int>
## 1 1 85598
## 2 2 174040
## 3 3 248225
## 4 4 150281
## 5 5 59474
## 6 6 36737
## 7 7 12581
## 8 8 10417
## 9 9 5499
## 10 10 2836
## # … with 34 more rows
Next, I want to check the range of the number of bedrooms and bathrooms.
range(realestate$bath, na.rm = TRUE)
## [1] 1 198
range(realestate$bed, na.rm = TRUE)
## [1] 1 123
Next, I check the amount of missing values in the dataset.
#Print the total number of missing values in the data frame.
sum(is.na(realestate))
## [1] 818637
As seen from the previous code, there are a lot of missing values in the dataset. Also, the range for the number of bedrooms and bathrooms are enormous, from 1 to well above 100. Since most listing have no more than 6 bedrooms and bathrooms, I elected to narrow the dataset to listings with no more than 6 bedrooms and no more than 6 bathrooms.
real_estate <- realestate%>%filter(!is.na(price), !is.na(acre_lot), !is.na(house_size), !is.na(zip_code))%>%filter(bed <= 6, bath <= 6, na.rm = TRUE)
summary(real_estate)
## status price bed bath
## Length:389835 Min. : 500 Min. :1.000 Min. :1.000
## Class :character 1st Qu.: 275000 1st Qu.:3.000 1st Qu.:2.000
## Mode :character Median : 445000 Median :3.000 Median :2.000
## Mean : 650146 Mean :3.426 Mean :2.446
## 3rd Qu.: 738000 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :29900000 Max. :6.000 Max. :6.000
## acre_lot full_address street city
## Min. :0.00e+00 Length:389835 Length:389835 Length:389835
## 1st Qu.:1.20e-01 Class :character Class :character Class :character
## Median :2.70e-01 Mode :character Mode :character Mode :character
## Mean :9.86e+00
## 3rd Qu.:9.00e-01
## Max. :1.00e+05
## state zip_code house_size sold_date
## Length:389835 Min. : 601 Min. : 122 Length:389835
## Class :character 1st Qu.: 2559 1st Qu.: 1296 Class :character
## Mode :character Median : 5860 Median : 1800 Mode :character
## Mean : 5892 Mean : 2141
## 3rd Qu.: 8080 3rd Qu.: 2552
## Max. :95652 Max. :1450112
sum(is.na(real_estate))
## [1] 0
Now the dataset has been updated to where there are no missing values and no listing has more than 6 bedrooms or 6 bathrooms. The number of entries in the dataset is now 389,835.
There are a couple of wide-ranging and large values in the dataset. Both price and house size can be large and not normally distributed. To start, the price and house size variables have log values run and compared to the distributions of their original values:
lprice = log(real_estate$price)
lsqft = log(real_estate$house_size)
par(mfrow = c(1,2))
hist(real_estate$price, prob=TRUE, xlab = "Sales Price", main="")
lines(density(real_estate$price), col="blue")
rug(real_estate$price)
hist(lprice, prob = TRUE, xlab = "Log price", main = "")
lines(density(lprice), col = "red")
rug(lprice)
The price variable is heavily skewed to the right, which means the price variable is not normally distributed. However, the log-transformed price variable helps normally distribute the data. The same is done for the house size variable to see if it’s beneficial to keep the log-transformed house size variable:
par(mfrow = c(1,2))
hist(real_estate$house_size, prob = TRUE, xlab = "House Size", main = "")
lines(density(real_estate$house_size), col = "blue")
rug(real_estate$house_size)
hist(lsqft, prob = TRUE, xlab = "Log House Size", main = "")
lines(density(lsqft), col = "red")
rug(lsqft)
Like the price variable, the house size variable is also heavily skewed to the right, which means the data is not normally distributed. Meanwhile, the log-transformed house size variable seems to normally distributed the data. Going forward, the log-transformed variables will be beneficial in creating models for the data.
Next, a correlation plot is done to view the relationship between log=transformed price, log-transformed house size (now lsqft), number of beds, and number of baths.
attach(real_estate)
par(mfrow = c(1,2))
#A correlation plot represented by circles.
re_circvars = data.frame(bed, bath, lprice, lsqft)
circ = cor(re_circvars)
corrplot(circ)
#A correlation plot represented by numbers.
re_numvars = data.frame(bed, bath, lprice, lsqft)
num_cor = cor(re_numvars)
corrplot(num_cor, method = "number")
detach(real_estate)
From the correlation plots, there doesn’t seem to be a high correlation between any of the variables. Normally, a correlation of 0.80 or higher indicates a strong relationship. From what can be seen, none of the relationships examined have a strong correlation.
Next, a model selection is performed to determine the best regression model in evaluating a house’s price. One method in doing so is using AIC (Akaike Information Criterion). In short, AIC estimates the likelihood of a model to predict a dependent variable.
First, a new data frame is created, keeping 7 variables from the original dataset.
attach(real_estate)
#Creates a new data frame.
re = data.frame(lprice, bed, bath, acre_lot, state, zip_code, lsqft)
#Column names in the data frame.
names(re) = c("lprice", "bed", "bath", "acre_lot", "state", "zip_code", "lsqft")
detach(real_estate)
Now, an AIC model is run, trying to fit the best linear regression model for predicting price. A separate model could be run for each state. In this example, I chose to run an AIC model for houses in the state of Massachusetts. The log-transformed variables for price and house size were used in this model since they are normally distributed.
#Data frame with houses located in the state of Massachusetts.
re_mass <- re%>%filter(state == "Massachusetts")
#Regression model created to predict price.
fit_model <- lm(lprice ~ bed+bath+lsqft+acre_lot+zip_code, data = re_mass)
#Creates AIC model for the regression model above.
stepAIC(fit_model, direction = "both", k = log(92590), trace = 0)
##
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot + zip_code,
## data = re_mass)
##
## Coefficients:
## (Intercept) bed bath lsqft acre_lot zip_code
## 8.1945547 -0.0972895 0.2469284 0.4989785 0.0001889 0.0005893
#A fit line is created for the regression model.
fit <- lm(formula = lprice ~ bed+bath+lsqft+acre_lot+zip_code, data = re_mass)
#A regression subsets fit.
reg_fit = regsubsets(lprice ~ bed+bath+lsqft+acre_lot+zip_code, data = re_mass, nvmax = 5)
#Plots the summary of the regression subsets fit, showing the Cp fit.
plot(summary(reg_fit)$cp, xlab = "Number of Variables", ylab = "Cp")
#Displays a summary of the fit and reg fit lines.
summary(fit)
##
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot + zip_code,
## data = re_mass)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3638 -0.2493 0.0120 0.2584 2.7483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.195e+00 3.783e-02 216.60 <2e-16 ***
## bed -9.729e-02 1.883e-03 -51.67 <2e-16 ***
## bath 2.469e-01 2.250e-03 109.77 <2e-16 ***
## lsqft 4.990e-01 5.812e-03 85.86 <2e-16 ***
## acre_lot 1.889e-04 1.751e-05 10.79 <2e-16 ***
## zip_code 5.893e-04 3.117e-06 189.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4698 on 92584 degrees of freedom
## Multiple R-squared: 0.584, Adjusted R-squared: 0.5839
## F-statistic: 2.599e+04 on 5 and 92584 DF, p-value: < 2.2e-16
summary(reg_fit)
## Subset selection object
## Call: regsubsets.formula(lprice ~ bed + bath + lsqft + acre_lot + zip_code,
## data = re_mass, nvmax = 5)
## 5 Variables (and intercept)
## Forced in Forced out
## bed FALSE FALSE
## bath FALSE FALSE
## lsqft FALSE FALSE
## acre_lot FALSE FALSE
## zip_code FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## bed bath lsqft acre_lot zip_code
## 1 ( 1 ) " " "*" " " " " " "
## 2 ( 1 ) " " "*" " " " " "*"
## 3 ( 1 ) " " "*" "*" " " "*"
## 4 ( 1 ) "*" "*" "*" " " "*"
## 5 ( 1 ) "*" "*" "*" "*" "*"
From the AIC model’s algorithm, the selection shows which variables are statistically significant, as well as which variables fit the model when using anywhere between 1 and 5 variables. The graph showing Mallow’s \(C_p\) helps determine how many variables are best for the regression model. When looking at the visual, wherever there isn’t a noticeable difference in the \(C_p\) value helps determine the amount of variables to use in the regression model. In the case for the Massachusetts model, the \(C_p\) plot shows two variables can be used for the regression model to predict the house price for homes in Massachusetts.
The summary will show which variables to use for the regression model formula. Looking at the selection algorithm, the stars denote which variables to use. From the summary, the two variables to use for predicting the price for a home in Massachusetts are the number of baths and the zip code of the property (location).
How confident can we feel about this model? One way to tell is the \(R^2\) value shown in the printed summary. \(R^2\), or coefficient of determination, is the proportion of variance of the dependent variable that is explained by the regression model. Another way to describe \(R^2\) is it measures how well the linear regression model fits the data. The summary shows the \(R^2\) value = 0.5839. So, roughly 58% of the variability can be explained by the regression model. Normally, a \(R^2\) value of 0.80 or higher would have us be more confident in the model. With a value of just under 60%, this regression model may not be the best in predicting the price of a house in Massachusetts with the variables given.
To see if the same regression model can be used for another state, an AIC model is run for the houses located in the state of Vermont.
re_vt <- re%>%filter(state == "Vermont")
fit_model_vt <- lm(lprice ~ bed+bath+lsqft+acre_lot+zip_code, data = re_vt)
stepAIC(fit_model_vt, direction = "both", k = log(45307), trace = 0)
##
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot + zip_code,
## data = re_vt)
##
## Coefficients:
## (Intercept) bed bath lsqft acre_lot zip_code
## 7.366e+00 -1.167e-01 2.589e-01 6.722e-01 3.646e-03 9.264e-06
layout(matrix(1:2, nrow = 2))
fit_vt <- lm(formula = lprice ~ bed+bath+lsqft+acre_lot+zip_code, data = re_vt)
reg_fit_vt = regsubsets(lprice ~ bed+bath+lsqft+acre_lot+zip_code, data = re_vt, nvmax = 5)
plot(summary(reg_fit_vt)$cp, xlab = "Number of Variables", ylab = "Cp")
summary(fit_vt)
##
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot + zip_code,
## data = re_vt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1534 -0.3261 0.0051 0.3417 1.5196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.366e+00 7.819e-02 94.208 < 2e-16 ***
## bed -1.167e-01 4.214e-03 -27.698 < 2e-16 ***
## bath 2.589e-01 4.912e-03 52.719 < 2e-16 ***
## lsqft 6.722e-01 1.172e-02 57.368 < 2e-16 ***
## acre_lot 3.646e-03 8.701e-05 41.904 < 2e-16 ***
## zip_code 9.264e-06 1.429e-06 6.482 9.23e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5331 on 20525 degrees of freedom
## Multiple R-squared: 0.5203, Adjusted R-squared: 0.5202
## F-statistic: 4453 on 5 and 20525 DF, p-value: < 2.2e-16
summary(reg_fit_vt)
## Subset selection object
## Call: regsubsets.formula(lprice ~ bed + bath + lsqft + acre_lot + zip_code,
## data = re_vt, nvmax = 5)
## 5 Variables (and intercept)
## Forced in Forced out
## bed FALSE FALSE
## bath FALSE FALSE
## lsqft FALSE FALSE
## acre_lot FALSE FALSE
## zip_code FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## bed bath lsqft acre_lot zip_code
## 1 ( 1 ) " " " " "*" " " " "
## 2 ( 1 ) " " "*" "*" " " " "
## 3 ( 1 ) " " "*" "*" "*" " "
## 4 ( 1 ) "*" "*" "*" "*" " "
## 5 ( 1 ) "*" "*" "*" "*" "*"
From the graph displaying Mallow’s \(C_p\), it appears three variables can be used in the regression model for house prices in Vermont. The summary shows the three variables to use are number of baths, the square footage of the house (house size), and the size of the lot (in acres). The \(R^2\) value = 0.5202, which means roughly 52% of the variability can be explained by the regression model. Like the Massachusetts regression model, the \(R^2\) value does not make me feel too confident in the regression model chosen for house prices in Vermont.
An AIC model can also be used for each zip code in the dataset.
First, a count table is created for the top 10 zip codes (per count) for the Massachusetts and Vermont data frames.
#Creates a table displaying the 10 most common zip codes in the Massachusetts data.
zip_counts_mass <- re_mass%>%group_by(zip_code)%>%summarize(count = n())%>%arrange(desc(count))
head(zip_counts_mass,10)
## # A tibble: 10 × 2
## zip_code count
## <dbl> <int>
## 1 1201 1471
## 2 2127 1248
## 3 2116 942
## 4 2128 921
## 5 2649 831
## 6 2151 824
## 7 2118 808
## 8 1040 771
## 9 2145 771
## 10 2135 728
zip_counts_vt <- re_vt%>%group_by(zip_code)%>%summarize(count = n())%>%arrange(desc(count))
head(zip_counts_vt,10)
## # A tibble: 10 × 2
## zip_code count
## <dbl> <int>
## 1 5701 804
## 2 5641 743
## 3 5156 595
## 4 5819 564
## 5 5301 556
## 6 5149 516
## 7 5091 401
## 8 5143 374
## 9 5363 361
## 10 5735 340
The Massachusetts zip code with the most listings is 1201. An AIC model is run for a subset for houses located in the 1201 zip code:
re_mass_1201 <- re%>%filter(state == "Massachusetts", zip_code == 1201)
#Regression model created to predict price.
fit_model_1201 <- lm(lprice ~ bed+bath+lsqft+acre_lot, data = re_mass_1201)
#Creates AIC model for the regression model above.
stepAIC(fit_model_1201, direction = "both", k = log(1471), trace = 0)
##
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_mass_1201)
##
## Coefficients:
## (Intercept) bed bath lsqft acre_lot
## 8.28205 -0.08702 0.05289 0.57210 0.18488
#A fit line is created for the regression model.
fit_1201 <- lm(formula = lprice ~ bed+bath+lsqft+acre_lot, data = re_mass_1201)
#A regression subsets fit.
reg_fit_1201 = regsubsets(lprice ~ bed+bath+lsqft+acre_lot, data = re_mass_1201, nvmax = 4)
#Plots the summary of the regression subsets fit, showing the Cp fit.
plot(summary(reg_fit_1201)$cp, xlab = "Number of Variables", ylab = "Cp")
#Displays a summary of the fit and reg fit lines.
summary(fit_1201)
##
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_mass_1201)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14165 -0.30797 0.01352 0.25594 1.26985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.28205 0.35401 23.395 < 2e-16 ***
## bed -0.08702 0.01541 -5.649 1.94e-08 ***
## bath 0.05289 0.01761 3.003 0.00272 **
## lsqft 0.57210 0.05537 10.332 < 2e-16 ***
## acre_lot 0.18488 0.01289 14.344 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.435 on 1466 degrees of freedom
## Multiple R-squared: 0.3394, Adjusted R-squared: 0.3376
## F-statistic: 188.3 on 4 and 1466 DF, p-value: < 2.2e-16
summary(reg_fit_1201)
## Subset selection object
## Call: regsubsets.formula(lprice ~ bed + bath + lsqft + acre_lot, data = re_mass_1201,
## nvmax = 4)
## 4 Variables (and intercept)
## Forced in Forced out
## bed FALSE FALSE
## bath FALSE FALSE
## lsqft FALSE FALSE
## acre_lot FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
## bed bath lsqft acre_lot
## 1 ( 1 ) " " " " " " "*"
## 2 ( 1 ) " " " " "*" "*"
## 3 ( 1 ) "*" " " "*" "*"
## 4 ( 1 ) "*" "*" "*" "*"
From the graph displaying Mallow’s \(C_p\), it appears two variables can be used in the regression model for house prices located in the 1201 zip code. The summary shows the two variables to use are size of the lot (in acres) and the square footage of the house (house size). The \(R^2\) value = 0.3376, which means roughly 34% of the variability can be explained by the regression model.
The zip code with the second most listings in Massachusetts was 2127. An AIC model is run for a subset for houses located in the 2127 zip code:
re_mass_2127 <- re%>%filter(state == "Massachusetts", zip_code == 2127)
#Regression model created to predict price.
fit_model_2127 <- lm(lprice ~ bed+bath+lsqft+acre_lot, data = re_mass_2127)
#Creates AIC model for the regression model above.
stepAIC(fit_model_2127, direction = "both", k = log(1248), trace = 0)
##
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_mass_2127)
##
## Coefficients:
## (Intercept) bed bath lsqft acre_lot
## 8.15168 -0.04655 0.06228 0.77212 1.31708
#A fit line is created for the regression model.
fit_2127 <- lm(formula = lprice ~ bed+bath+lsqft+acre_lot, data = re_mass_2127)
#A regression subsets fit.
reg_fit_2127 = regsubsets(lprice ~ bed+bath+lsqft+acre_lot, data = re_mass_2127, nvmax = 4)
#Plots the summary of the regression subsets fit, showing the Cp fit.
plot(summary(reg_fit_2127)$cp, xlab = "Number of Variables", ylab = "Cp")
#Displays a summary of the fit and reg fit lines.
summary(fit_2127)
##
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_mass_2127)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39745 -0.10127 -0.01633 0.12865 0.60908
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.15168 0.16802 48.516 < 2e-16 ***
## bed -0.04655 0.01063 -4.380 1.28e-05 ***
## bath 0.06228 0.01075 5.795 8.66e-09 ***
## lsqft 0.77212 0.02722 28.362 < 2e-16 ***
## acre_lot 1.31708 0.09010 14.618 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1976 on 1243 degrees of freedom
## Multiple R-squared: 0.8295, Adjusted R-squared: 0.8289
## F-statistic: 1512 on 4 and 1243 DF, p-value: < 2.2e-16
summary(reg_fit_2127)
## Subset selection object
## Call: regsubsets.formula(lprice ~ bed + bath + lsqft + acre_lot, data = re_mass_2127,
## nvmax = 4)
## 4 Variables (and intercept)
## Forced in Forced out
## bed FALSE FALSE
## bath FALSE FALSE
## lsqft FALSE FALSE
## acre_lot FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
## bed bath lsqft acre_lot
## 1 ( 1 ) " " " " "*" " "
## 2 ( 1 ) " " " " "*" "*"
## 3 ( 1 ) " " "*" "*" "*"
## 4 ( 1 ) "*" "*" "*" "*"
From the graph displaying Mallow’s \(C_p\), it appears three variables can be used in the regression model for house prices located in the 2127 zip code. The summary shows the three variables to use are square footage of the house (house size), size of the lot (in acres) and number of baths. The \(R^2\) value = 0.8289, which means roughly 83% of the variability can be explained by the regression model. So far, this has been our most confident model in predicting house prices in any location.
From the table depicting the 10 most common zip codes in Vermont, #1 on the list was zip code 5701. An AIC model is run for a subset for houses located in the 5701 zip code:
re_vt_5701 <- re%>%filter(state == "Vermont", zip_code == 5701)
#Regression model created to predict price.
fit_model_5701 <- lm(lprice ~ bed+bath+lsqft+acre_lot, data = re_vt_5701)
#Creates AIC model for the regression model above.
stepAIC(fit_model_5701, direction = "both", k = log(804), trace = 0)
##
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_vt_5701)
##
## Coefficients:
## (Intercept) bed bath lsqft acre_lot
## 6.40510 -0.23082 0.32138 0.79387 0.02869
#A fit line is created for the regression model.
fit_5701 <- lm(formula = lprice ~ bed+bath+lsqft+acre_lot, data = re_vt_5701)
#A regression subsets fit.
reg_fit_5701 = regsubsets(lprice ~ bed+bath+lsqft+acre_lot, data = re_vt_5701, nvmax = 4)
#Plots the summary of the regression subsets fit, showing the Cp fit.
plot(summary(reg_fit_5701)$cp, xlab = "Number of Variables", ylab = "Cp")
#Displays a summary of the fit and reg fit lines.
summary(fit_5701)
##
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_vt_5701)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.48289 -0.33821 -0.05551 0.32716 1.36356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.405104 0.575934 11.121 <2e-16 ***
## bed -0.230818 0.020510 -11.254 <2e-16 ***
## bath 0.321381 0.035286 9.108 <2e-16 ***
## lsqft 0.793867 0.084003 9.450 <2e-16 ***
## acre_lot 0.028686 0.002762 10.387 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5741 on 799 degrees of freedom
## Multiple R-squared: 0.3726, Adjusted R-squared: 0.3694
## F-statistic: 118.6 on 4 and 799 DF, p-value: < 2.2e-16
summary(reg_fit_5701)
## Subset selection object
## Call: regsubsets.formula(lprice ~ bed + bath + lsqft + acre_lot, data = re_vt_5701,
## nvmax = 4)
## 4 Variables (and intercept)
## Forced in Forced out
## bed FALSE FALSE
## bath FALSE FALSE
## lsqft FALSE FALSE
## acre_lot FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
## bed bath lsqft acre_lot
## 1 ( 1 ) " " "*" " " " "
## 2 ( 1 ) " " "*" " " "*"
## 3 ( 1 ) "*" " " "*" "*"
## 4 ( 1 ) "*" "*" "*" "*"
From the graph displaying Mallow’s \(C_p\), it appears three variables can be used in the regression model for house prices located in the 2127 zip code. The summary shows the three variables to use are number of baths, size of the lot (in acres) and square footage of the house (house size). The \(R^2\) value = 0.3694, which means roughly 37% of the variability can be explained by the regression model.
The zip code with the second most listings in Vermont was 5641. An AIC model is run for a subset for houses located in the 5641 zip code:
re_vt_5641 <- re%>%filter(state == "Vermont", zip_code == 5641)
#Regression model created to predict price.
fit_model_5641 <- lm(lprice ~ bed+bath+lsqft+acre_lot, data = re_vt_5641)
#Creates AIC model for the regression model above.
stepAIC(fit_model_5641, direction = "both", k = log(743), trace = 0)
##
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_vt_5641)
##
## Coefficients:
## (Intercept) bed bath lsqft acre_lot
## 4.783177 -0.126191 0.048912 1.059461 0.003516
#A fit line is created for the regression model.
fit_5641 <- lm(formula = lprice ~ bed+bath+lsqft+acre_lot, data = re_vt_5641)
#A regression subsets fit.
reg_fit_5641 = regsubsets(lprice ~ bed+bath+lsqft+acre_lot, data = re_vt_5641, nvmax = 4)
#Plots the summary of the regression subsets fit, showing the Cp fit.
plot(summary(reg_fit_5641)$cp, xlab = "Number of Variables", ylab = "Cp")
#Displays a summary of the fit and reg fit lines.
summary(fit_5641)
##
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_vt_5641)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07102 -0.22701 0.04006 0.28046 0.69098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7831770 0.4237630 11.287 <2e-16 ***
## bed -0.1261915 0.0131710 -9.581 <2e-16 ***
## bath 0.0489125 0.0183645 2.663 0.0079 **
## lsqft 1.0594612 0.0613718 17.263 <2e-16 ***
## acre_lot 0.0035156 0.0001878 18.724 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3681 on 738 degrees of freedom
## Multiple R-squared: 0.6115, Adjusted R-squared: 0.6094
## F-statistic: 290.4 on 4 and 738 DF, p-value: < 2.2e-16
summary(reg_fit_5641)
## Subset selection object
## Call: regsubsets.formula(lprice ~ bed + bath + lsqft + acre_lot, data = re_vt_5641,
## nvmax = 4)
## 4 Variables (and intercept)
## Forced in Forced out
## bed FALSE FALSE
## bath FALSE FALSE
## lsqft FALSE FALSE
## acre_lot FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
## bed bath lsqft acre_lot
## 1 ( 1 ) " " " " "*" " "
## 2 ( 1 ) " " " " "*" "*"
## 3 ( 1 ) "*" " " "*" "*"
## 4 ( 1 ) "*" "*" "*" "*"
From the graph displaying Mallow’s \(C_p\), it appears two variables can be used in the regression model for house prices located in the 2127 zip code. The summary shows the two variables to use are square footage of the house (house size) and size of the lot (in acres). The \(R^2\) value = 0.6094, which means roughly 61% of the variability can be explained by the regression model.
In this session, exploratory data cleaning techniques and AIC modeling were demonstrated using the “USA Real Estate Dataset” dataset from Kaggle.
First, the dataset was inspected and checked for missing values. A range was done to view the range for the number of bedrooms and bathrooms. From there, a subset was created to remove missing values, as well as keeping the listings which contained no more than 6 bedrooms and no more than 6 bathrooms. The distribution was viewed two variables were log-transformed (price and house size) to have the data be normally distributed.
Next, a correlation plot was created to see the relationship among four variables (log-transformed price and house size, number of beds, number of baths). There doesn’t appear to be any strong correlations among the selected variables, which could impact the search for a best fit regression model.
Various AIC models are created in order to determine the best linear regression model for predicting price. In particular, AIC models were created for houses listed in Massachusetts and Vermont, as well as specific zip codes within Massachusetts and Vermont to determine if the formula for predicting price changes for each location. From the models, it appears types and number of variables can impact the price prediction for each location. For example, the two variables to predict house price in Massachusetts were number of baths and the zip code of the property, while the three variables to predict house prices in Vermont were number of baths, square footage of the house (house size), and size of the lot (in acres).
Overall, I only felt confident in one AIC model, which was the model for predicting price in the 2127 zip code of Massachusetts. This model had a \(R^2\) value of roughly 83%, which means about 83% of the variability can be explained by the regression model. All other AIC models had no higher than a 0.60 \(R^2\) value.
I believe that more factors added in to the model will help predict a more accurate price range for the data. For example, factors such as whether a house has a pool, solar, a heating and cooling system, or a garage (to name a few) can be help create a better prediction model.
Thank you to the reader for viewing my work.