In this R Markdown session, I will examine the characteristics of the “USA Real Estate Dataset” available on the Kaggle website. In addition, an AIC (Akaike Information Criterion) model will be created to see if there is a potential regression model for predicting a house’s price based on the variables available.

About the Dataset

The dataset used can be found on the Kaggle website, an online platform for data scientists containing free datasets and code collaboration. According to the author of the dataset, the dataset is scraped/updated weekly from https://www.realtor.com. Below is the link for the dataset:

https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset.

The data contains eight variables used to explore the effect of various factors on test scores. The variables are:

Status - if house is for sale or off the market
Price - house price
Bed - Number of bedrooms
Acre Lot - Size of property in acres
Full Address - Full address of the property
Street - Name of street of the property
City - Name of the city where the property is located
State - Name of the state where the property is located
Zip Code - Zip Code where the property is located
House Size - Size of the property (in square feet)
Sold Date - Date the property was sold

Exploratory Data Analysis

First, the dataset is loaded into R and saved as “realestate”.

realestate = read.csv("~/R datasets/realtor-data.csv", header=TRUE)

Now that the dataset is loaded into R, the next step is to view the data and see if it’s clean for analysis.

#Inspect the data frame.
head(realestate)

##     status  price bed bath acre_lot
## 1 for_sale 105000   3    2     0.12
## 2 for_sale  80000   4    2     0.08
## 3 for_sale  67000   2    1     0.15
## 4 for_sale 145000   4    2     0.10
## 5 for_sale  65000   6    2     0.05
## 6 for_sale 179000   4    3     0.46
##                                                      full_address
## 1               Sector Yahuecas Titulo # V84, Adjuntas, PR, 00601
## 2                         Km 78 9 Carr # 135, Adjuntas, PR, 00601
## 3                         556G 556-G 16 St, Juana Diaz, PR, 00795
## 4 R5 Comunidad El Paraso Calle De Oro R-5 Ponce, Ponce, PR, 00731
## 5                                 14 Navarro, Mayaguez, PR, 00680
## 6            Bo Calabazas San Sebastian, San Sebastian, PR, 00612
##                                          street          city       state
## 1                  Sector Yahuecas Titulo # V84      Adjuntas Puerto Rico
## 2                            Km 78 9 Carr # 135      Adjuntas Puerto Rico
## 3                              556G 556-G 16 St    Juana Diaz Puerto Rico
## 4 R5 Comunidad El Paraso Calle De Oro R-5 Ponce         Ponce Puerto Rico
## 5                                    14 Navarro      Mayaguez Puerto Rico
## 6                    Bo Calabazas San Sebastian San Sebastian Puerto Rico
##   zip_code house_size sold_date
## 1      601        920          
## 2      601       1527          
## 3      795        748          
## 4      731       1800          
## 5      680         NA          
## 6      612       2520

#View the column names.
colnames(realestate)

##  [1] "status"       "price"        "bed"          "bath"         "acre_lot"    
##  [6] "full_address" "street"       "city"         "state"        "zip_code"    
## [11] "house_size"   "sold_date"

#View summary of the data frame.
summary(realestate)

##     status              price                bed              bath       
##  Length:923159      Min.   :        0   Min.   :  1.00   Min.   :  1.00  
##  Class :character   1st Qu.:   269000   1st Qu.:  2.00   1st Qu.:  1.00  
##  Mode  :character   Median :   475000   Median :  3.00   Median :  2.00  
##                     Mean   :   884123   Mean   :  3.33   Mean   :  2.49  
##                     3rd Qu.:   839900   3rd Qu.:  4.00   3rd Qu.:  3.00  
##                     Max.   :875000000   Max.   :123.00   Max.   :198.00  
##                     NA's   :71          NA's   :131703   NA's   :115192  
##     acre_lot         full_address          street              city          
##  Min.   :     0.00   Length:923159      Length:923159      Length:923159     
##  1st Qu.:     0.11   Class :character   Class :character   Class :character  
##  Median :     0.29   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :    17.08                                                           
##  3rd Qu.:     1.15                                                           
##  Max.   :100000.00                                                           
##  NA's   :273623                                                              
##     state              zip_code       house_size       sold_date        
##  Length:923159      Min.   :  601   Min.   :    100   Length:923159     
##  Class :character   1st Qu.: 2919   1st Qu.:   1130   Class :character  
##  Mode  :character   Median : 7004   Median :   1651   Mode  :character  
##                     Mean   : 6590   Mean   :   2142                     
##                     3rd Qu.:10001   3rd Qu.:   2499                     
##                     Max.   :99999   Max.   :1450112                     
##                     NA's   :205     NA's   :297843

#View data types in the data frame.
str(realestate)

## 'data.frame':    923159 obs. of  12 variables:
##  $ status      : chr  "for_sale" "for_sale" "for_sale" "for_sale" ...
##  $ price       : num  105000 80000 67000 145000 65000 179000 50000 71600 100000 300000 ...
##  $ bed         : num  3 4 2 4 6 4 3 3 2 5 ...
##  $ bath        : num  2 2 1 2 2 3 1 2 1 3 ...
##  $ acre_lot    : num  0.12 0.08 0.15 0.1 0.05 0.46 0.2 0.08 0.09 7.46 ...
##  $ full_address: chr  "Sector Yahuecas Titulo # V84, Adjuntas, PR, 00601" "Km 78 9 Carr # 135, Adjuntas, PR, 00601" "556G 556-G 16 St, Juana Diaz, PR, 00795" "R5 Comunidad El Paraso Calle De Oro R-5 Ponce, Ponce, PR, 00731" ...
##  $ street      : chr  "Sector Yahuecas Titulo # V84" "Km 78 9 Carr # 135" "556G 556-G 16 St" "R5 Comunidad El Paraso Calle De Oro R-5 Ponce" ...
##  $ city        : chr  "Adjuntas" "Adjuntas" "Juana Diaz" "Ponce" ...
##  $ state       : chr  "Puerto Rico" "Puerto Rico" "Puerto Rico" "Puerto Rico" ...
##  $ zip_code    : num  601 601 795 731 680 612 639 731 730 670 ...
##  $ house_size  : num  920 1527 748 1800 NA ...
##  $ sold_date   : chr  "" "" "" "" ...

This is a very large dataset, containing over 900,000 entries. To perform any kind of analysis, the data will need to be cleaned and condensed.

Data Cleaning

To begin data cleaning, a quick summary is performed to see the frequency for the number of listings in each state, the number of bathrooms, and the number of bedrooms in the dataset.

realestate%>%group_by(state)%>%summarize(n = n())

## # A tibble: 18 × 2
##    state               n
##    <chr>           <int>
##  1 Connecticut     91644
##  2 Delaware         2135
##  3 Georgia            50
##  4 Maine           36650
##  5 Massachusetts  175248
##  6 New Hampshire   51394
##  7 New Jersey     233722
##  8 New York       211778
##  9 Pennsylvania    17148
## 10 Puerto Rico     24679
## 11 Rhode Island    29596
## 12 South Carolina     25
## 13 Tennessee          20
## 14 Vermont         46460
## 15 Virgin Islands   2573
## 16 Virginia           31
## 17 West Virginia       3
## 18 Wyoming             3

realestate%>%group_by(bath)%>%summarize(n = n())

## # A tibble: 40 × 2
##     bath      n
##    <dbl>  <int>
##  1     1 202814
##  2     2 283430
##  3     3 191476
##  4     4  71766
##  5     5  27896
##  6     6  14014
##  7     7   6303
##  8     8   4097
##  9     9   2207
## 10    10   1369
## # … with 30 more rows

realestate%>%group_by(bed)%>%summarize(n = n())

## # A tibble: 44 × 2
##      bed      n
##    <dbl>  <int>
##  1     1  85598
##  2     2 174040
##  3     3 248225
##  4     4 150281
##  5     5  59474
##  6     6  36737
##  7     7  12581
##  8     8  10417
##  9     9   5499
## 10    10   2836
## # … with 34 more rows

Next, I want to check the range of the number of bedrooms and bathrooms.

range(realestate$bath, na.rm = TRUE)

## [1]   1 198

range(realestate$bed, na.rm = TRUE)

## [1]   1 123

Next, I check the amount of missing values in the dataset.

#Print the total number of missing values in the data frame.
sum(is.na(realestate))

## [1] 818637

As seen from the previous code, there are a lot of missing values in the dataset. Also, the range for the number of bedrooms and bathrooms are enormous, from 1 to well above 100. Since most listing have no more than 6 bedrooms and bathrooms, I elected to narrow the dataset to listings with no more than 6 bedrooms and no more than 6 bathrooms.

real_estate <- realestate%>%filter(!is.na(price), !is.na(acre_lot), !is.na(house_size), !is.na(zip_code))%>%filter(bed <= 6, bath <= 6, na.rm = TRUE)

summary(real_estate)

##     status              price               bed             bath      
##  Length:389835      Min.   :     500   Min.   :1.000   Min.   :1.000  
##  Class :character   1st Qu.:  275000   1st Qu.:3.000   1st Qu.:2.000  
##  Mode  :character   Median :  445000   Median :3.000   Median :2.000  
##                     Mean   :  650146   Mean   :3.426   Mean   :2.446  
##                     3rd Qu.:  738000   3rd Qu.:4.000   3rd Qu.:3.000  
##                     Max.   :29900000   Max.   :6.000   Max.   :6.000  
##     acre_lot        full_address          street              city          
##  Min.   :0.00e+00   Length:389835      Length:389835      Length:389835     
##  1st Qu.:1.20e-01   Class :character   Class :character   Class :character  
##  Median :2.70e-01   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :9.86e+00                                                           
##  3rd Qu.:9.00e-01                                                           
##  Max.   :1.00e+05                                                           
##     state              zip_code       house_size       sold_date        
##  Length:389835      Min.   :  601   Min.   :    122   Length:389835     
##  Class :character   1st Qu.: 2559   1st Qu.:   1296   Class :character  
##  Mode  :character   Median : 5860   Median :   1800   Mode  :character  
##                     Mean   : 5892   Mean   :   2141                     
##                     3rd Qu.: 8080   3rd Qu.:   2552                     
##                     Max.   :95652   Max.   :1450112

sum(is.na(real_estate))

## [1] 0

Now the dataset has been updated to where there are no missing values and no listing has more than 6 bedrooms or 6 bathrooms. The number of entries in the dataset is now 389,835.

Examining the Distribution of the Data

There are a couple of wide-ranging and large values in the dataset. Both price and house size can be large and not normally distributed. To start, the price and house size variables have log values run and compared to the distributions of their original values:

lprice = log(real_estate$price)
lsqft = log(real_estate$house_size)

par(mfrow = c(1,2))

hist(real_estate$price, prob=TRUE, xlab = "Sales Price", main="")
lines(density(real_estate$price), col="blue")
rug(real_estate$price)
    
hist(lprice, prob = TRUE, xlab = "Log price", main = "")
lines(density(lprice), col = "red")
rug(lprice)

The price variable is heavily skewed to the right, which means the price variable is not normally distributed. However, the log-transformed price variable helps normally distribute the data. The same is done for the house size variable to see if it’s beneficial to keep the log-transformed house size variable:

par(mfrow = c(1,2))

hist(real_estate$house_size, prob = TRUE, xlab = "House Size", main = "")
lines(density(real_estate$house_size), col = "blue")
rug(real_estate$house_size)

hist(lsqft, prob = TRUE, xlab = "Log House Size", main = "")
lines(density(lsqft), col = "red")
rug(lsqft)

Like the price variable, the house size variable is also heavily skewed to the right, which means the data is not normally distributed. Meanwhile, the log-transformed house size variable seems to normally distributed the data. Going forward, the log-transformed variables will be beneficial in creating models for the data.

Examinining the Correlation of the Data

Next, a correlation plot is done to view the relationship between log=transformed price, log-transformed house size (now lsqft), number of beds, and number of baths.

attach(real_estate)
par(mfrow = c(1,2))
#A correlation plot represented by circles.
re_circvars = data.frame(bed, bath, lprice, lsqft)
circ = cor(re_circvars)
corrplot(circ)
#A correlation plot represented by numbers.
re_numvars = data.frame(bed, bath, lprice, lsqft)
num_cor = cor(re_numvars)
corrplot(num_cor, method = "number")

detach(real_estate)

From the correlation plots, there doesn’t seem to be a high correlation between any of the variables. Normally, a correlation of 0.80 or higher indicates a strong relationship. From what can be seen, none of the relationships examined have a strong correlation.

AIC Model for Real Estate

Next, a model selection is performed to determine the best regression model in evaluating a house’s price. One method in doing so is using AIC (Akaike Information Criterion). In short, AIC estimates the likelihood of a model to predict a dependent variable.

First, a new data frame is created, keeping 7 variables from the original dataset.

attach(real_estate)
#Creates a new data frame.
re = data.frame(lprice, bed, bath, acre_lot, state, zip_code, lsqft)
#Column names in the data frame. 
names(re) = c("lprice", "bed", "bath", "acre_lot", "state", "zip_code", "lsqft")

detach(real_estate)

AIC Model for Massachusetts

Now, an AIC model is run, trying to fit the best linear regression model for predicting price. A separate model could be run for each state. In this example, I chose to run an AIC model for houses in the state of Massachusetts. The log-transformed variables for price and house size were used in this model since they are normally distributed.

#Data frame with houses located in the state of Massachusetts.
re_mass <- re%>%filter(state == "Massachusetts")
#Regression model created to predict price.
fit_model <- lm(lprice ~ bed+bath+lsqft+acre_lot+zip_code, data = re_mass)
#Creates AIC model for the regression model above. 
stepAIC(fit_model, direction = "both", k = log(92590), trace = 0)

## 
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot + zip_code, 
##     data = re_mass)
## 
## Coefficients:
## (Intercept)          bed         bath        lsqft     acre_lot     zip_code  
##   8.1945547   -0.0972895    0.2469284    0.4989785    0.0001889    0.0005893

#A fit line is created for the regression model.
fit <- lm(formula = lprice ~ bed+bath+lsqft+acre_lot+zip_code, data = re_mass)
#A regression subsets fit.
reg_fit = regsubsets(lprice ~ bed+bath+lsqft+acre_lot+zip_code, data = re_mass, nvmax = 5)
#Plots the summary of the regression subsets fit, showing the Cp fit. 
plot(summary(reg_fit)$cp, xlab = "Number of Variables", ylab = "Cp")

#Displays a summary of the fit and reg fit lines. 
summary(fit)

## 
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot + zip_code, 
##     data = re_mass)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3638 -0.2493  0.0120  0.2584  2.7483 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.195e+00  3.783e-02  216.60   <2e-16 ***
## bed         -9.729e-02  1.883e-03  -51.67   <2e-16 ***
## bath         2.469e-01  2.250e-03  109.77   <2e-16 ***
## lsqft        4.990e-01  5.812e-03   85.86   <2e-16 ***
## acre_lot     1.889e-04  1.751e-05   10.79   <2e-16 ***
## zip_code     5.893e-04  3.117e-06  189.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4698 on 92584 degrees of freedom
## Multiple R-squared:  0.584,  Adjusted R-squared:  0.5839 
## F-statistic: 2.599e+04 on 5 and 92584 DF,  p-value: < 2.2e-16

summary(reg_fit)

## Subset selection object
## Call: regsubsets.formula(lprice ~ bed + bath + lsqft + acre_lot + zip_code, 
##     data = re_mass, nvmax = 5)
## 5 Variables  (and intercept)
##          Forced in Forced out
## bed          FALSE      FALSE
## bath         FALSE      FALSE
## lsqft        FALSE      FALSE
## acre_lot     FALSE      FALSE
## zip_code     FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
##          bed bath lsqft acre_lot zip_code
## 1  ( 1 ) " " "*"  " "   " "      " "     
## 2  ( 1 ) " " "*"  " "   " "      "*"     
## 3  ( 1 ) " " "*"  "*"   " "      "*"     
## 4  ( 1 ) "*" "*"  "*"   " "      "*"     
## 5  ( 1 ) "*" "*"  "*"   "*"      "*"

From the AIC model’s algorithm, the selection shows which variables are statistically significant, as well as which variables fit the model when using anywhere between 1 and 5 variables. The graph showing Mallow’s \(C_p\) helps determine how many variables are best for the regression model. When looking at the visual, wherever there isn’t a noticeable difference in the \(C_p\) value helps determine the amount of variables to use in the regression model. In the case for the Massachusetts model, the \(C_p\) plot shows two variables can be used for the regression model to predict the house price for homes in Massachusetts.

The summary will show which variables to use for the regression model formula. Looking at the selection algorithm, the stars denote which variables to use. From the summary, the two variables to use for predicting the price for a home in Massachusetts are the number of baths and the zip code of the property (location).

How confident can we feel about this model? One way to tell is the \(R^2\) value shown in the printed summary. \(R^2\), or coefficient of determination, is the proportion of variance of the dependent variable that is explained by the regression model. Another way to describe \(R^2\) is it measures how well the linear regression model fits the data. The summary shows the \(R^2\) value = 0.5839. So, roughly 58% of the variability can be explained by the regression model. Normally, a \(R^2\) value of 0.80 or higher would have us be more confident in the model. With a value of just under 60%, this regression model may not be the best in predicting the price of a house in Massachusetts with the variables given.

AIC Model for Vermont

To see if the same regression model can be used for another state, an AIC model is run for the houses located in the state of Vermont.

re_vt <- re%>%filter(state == "Vermont")

fit_model_vt <- lm(lprice ~ bed+bath+lsqft+acre_lot+zip_code, data = re_vt)

stepAIC(fit_model_vt, direction = "both", k = log(45307), trace = 0)

## 
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot + zip_code, 
##     data = re_vt)
## 
## Coefficients:
## (Intercept)          bed         bath        lsqft     acre_lot     zip_code  
##   7.366e+00   -1.167e-01    2.589e-01    6.722e-01    3.646e-03    9.264e-06

layout(matrix(1:2, nrow = 2))
fit_vt <- lm(formula = lprice ~ bed+bath+lsqft+acre_lot+zip_code, data = re_vt)
reg_fit_vt = regsubsets(lprice ~ bed+bath+lsqft+acre_lot+zip_code, data = re_vt, nvmax = 5)

plot(summary(reg_fit_vt)$cp, xlab = "Number of Variables", ylab = "Cp")

summary(fit_vt)

## 
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot + zip_code, 
##     data = re_vt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1534 -0.3261  0.0051  0.3417  1.5196 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.366e+00  7.819e-02  94.208  < 2e-16 ***
## bed         -1.167e-01  4.214e-03 -27.698  < 2e-16 ***
## bath         2.589e-01  4.912e-03  52.719  < 2e-16 ***
## lsqft        6.722e-01  1.172e-02  57.368  < 2e-16 ***
## acre_lot     3.646e-03  8.701e-05  41.904  < 2e-16 ***
## zip_code     9.264e-06  1.429e-06   6.482 9.23e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5331 on 20525 degrees of freedom
## Multiple R-squared:  0.5203, Adjusted R-squared:  0.5202 
## F-statistic:  4453 on 5 and 20525 DF,  p-value: < 2.2e-16

summary(reg_fit_vt)

## Subset selection object
## Call: regsubsets.formula(lprice ~ bed + bath + lsqft + acre_lot + zip_code, 
##     data = re_vt, nvmax = 5)
## 5 Variables  (and intercept)
##          Forced in Forced out
## bed          FALSE      FALSE
## bath         FALSE      FALSE
## lsqft        FALSE      FALSE
## acre_lot     FALSE      FALSE
## zip_code     FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
##          bed bath lsqft acre_lot zip_code
## 1  ( 1 ) " " " "  "*"   " "      " "     
## 2  ( 1 ) " " "*"  "*"   " "      " "     
## 3  ( 1 ) " " "*"  "*"   "*"      " "     
## 4  ( 1 ) "*" "*"  "*"   "*"      " "     
## 5  ( 1 ) "*" "*"  "*"   "*"      "*"

From the graph displaying Mallow’s \(C_p\), it appears three variables can be used in the regression model for house prices in Vermont. The summary shows the three variables to use are number of baths, the square footage of the house (house size), and the size of the lot (in acres). The \(R^2\) value = 0.5202, which means roughly 52% of the variability can be explained by the regression model. Like the Massachusetts regression model, the \(R^2\) value does not make me feel too confident in the regression model chosen for house prices in Vermont.

AIC Model for Individual Zip Codes in Massachusetts and Vermont

An AIC model can also be used for each zip code in the dataset.

First, a count table is created for the top 10 zip codes (per count) for the Massachusetts and Vermont data frames.

#Creates a table displaying the 10 most common zip codes in the Massachusetts data. 
zip_counts_mass <- re_mass%>%group_by(zip_code)%>%summarize(count = n())%>%arrange(desc(count))
head(zip_counts_mass,10)

## # A tibble: 10 × 2
##    zip_code count
##       <dbl> <int>
##  1     1201  1471
##  2     2127  1248
##  3     2116   942
##  4     2128   921
##  5     2649   831
##  6     2151   824
##  7     2118   808
##  8     1040   771
##  9     2145   771
## 10     2135   728

zip_counts_vt <- re_vt%>%group_by(zip_code)%>%summarize(count = n())%>%arrange(desc(count))
head(zip_counts_vt,10)

## # A tibble: 10 × 2
##    zip_code count
##       <dbl> <int>
##  1     5701   804
##  2     5641   743
##  3     5156   595
##  4     5819   564
##  5     5301   556
##  6     5149   516
##  7     5091   401
##  8     5143   374
##  9     5363   361
## 10     5735   340

The Massachusetts zip code with the most listings is 1201. An AIC model is run for a subset for houses located in the 1201 zip code:

re_mass_1201 <- re%>%filter(state == "Massachusetts", zip_code == 1201)
#Regression model created to predict price.
fit_model_1201 <- lm(lprice ~ bed+bath+lsqft+acre_lot, data = re_mass_1201)
#Creates AIC model for the regression model above. 
stepAIC(fit_model_1201, direction = "both", k = log(1471), trace = 0)

## 
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_mass_1201)
## 
## Coefficients:
## (Intercept)          bed         bath        lsqft     acre_lot  
##     8.28205     -0.08702      0.05289      0.57210      0.18488

#A fit line is created for the regression model.
fit_1201 <- lm(formula = lprice ~ bed+bath+lsqft+acre_lot, data = re_mass_1201)
#A regression subsets fit.
reg_fit_1201 = regsubsets(lprice ~ bed+bath+lsqft+acre_lot, data = re_mass_1201, nvmax = 4)
#Plots the summary of the regression subsets fit, showing the Cp fit. 
plot(summary(reg_fit_1201)$cp, xlab = "Number of Variables", ylab = "Cp")

#Displays a summary of the fit and reg fit lines. 
summary(fit_1201)

## 
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_mass_1201)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14165 -0.30797  0.01352  0.25594  1.26985 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.28205    0.35401  23.395  < 2e-16 ***
## bed         -0.08702    0.01541  -5.649 1.94e-08 ***
## bath         0.05289    0.01761   3.003  0.00272 ** 
## lsqft        0.57210    0.05537  10.332  < 2e-16 ***
## acre_lot     0.18488    0.01289  14.344  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.435 on 1466 degrees of freedom
## Multiple R-squared:  0.3394, Adjusted R-squared:  0.3376 
## F-statistic: 188.3 on 4 and 1466 DF,  p-value: < 2.2e-16

summary(reg_fit_1201)

## Subset selection object
## Call: regsubsets.formula(lprice ~ bed + bath + lsqft + acre_lot, data = re_mass_1201, 
##     nvmax = 4)
## 4 Variables  (and intercept)
##          Forced in Forced out
## bed          FALSE      FALSE
## bath         FALSE      FALSE
## lsqft        FALSE      FALSE
## acre_lot     FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
##          bed bath lsqft acre_lot
## 1  ( 1 ) " " " "  " "   "*"     
## 2  ( 1 ) " " " "  "*"   "*"     
## 3  ( 1 ) "*" " "  "*"   "*"     
## 4  ( 1 ) "*" "*"  "*"   "*"

From the graph displaying Mallow’s \(C_p\), it appears two variables can be used in the regression model for house prices located in the 1201 zip code. The summary shows the two variables to use are size of the lot (in acres) and the square footage of the house (house size). The \(R^2\) value = 0.3376, which means roughly 34% of the variability can be explained by the regression model.

The zip code with the second most listings in Massachusetts was 2127. An AIC model is run for a subset for houses located in the 2127 zip code:

re_mass_2127 <- re%>%filter(state == "Massachusetts", zip_code == 2127)
#Regression model created to predict price.
fit_model_2127 <- lm(lprice ~ bed+bath+lsqft+acre_lot, data = re_mass_2127)
#Creates AIC model for the regression model above. 
stepAIC(fit_model_2127, direction = "both", k = log(1248), trace = 0)

## 
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_mass_2127)
## 
## Coefficients:
## (Intercept)          bed         bath        lsqft     acre_lot  
##     8.15168     -0.04655      0.06228      0.77212      1.31708

#A fit line is created for the regression model.
fit_2127 <- lm(formula = lprice ~ bed+bath+lsqft+acre_lot, data = re_mass_2127)
#A regression subsets fit.
reg_fit_2127 = regsubsets(lprice ~ bed+bath+lsqft+acre_lot, data = re_mass_2127, nvmax = 4)
#Plots the summary of the regression subsets fit, showing the Cp fit. 
plot(summary(reg_fit_2127)$cp, xlab = "Number of Variables", ylab = "Cp")

#Displays a summary of the fit and reg fit lines. 
summary(fit_2127)

## 
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_mass_2127)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39745 -0.10127 -0.01633  0.12865  0.60908 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.15168    0.16802  48.516  < 2e-16 ***
## bed         -0.04655    0.01063  -4.380 1.28e-05 ***
## bath         0.06228    0.01075   5.795 8.66e-09 ***
## lsqft        0.77212    0.02722  28.362  < 2e-16 ***
## acre_lot     1.31708    0.09010  14.618  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1976 on 1243 degrees of freedom
## Multiple R-squared:  0.8295, Adjusted R-squared:  0.8289 
## F-statistic:  1512 on 4 and 1243 DF,  p-value: < 2.2e-16

summary(reg_fit_2127)

## Subset selection object
## Call: regsubsets.formula(lprice ~ bed + bath + lsqft + acre_lot, data = re_mass_2127, 
##     nvmax = 4)
## 4 Variables  (and intercept)
##          Forced in Forced out
## bed          FALSE      FALSE
## bath         FALSE      FALSE
## lsqft        FALSE      FALSE
## acre_lot     FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
##          bed bath lsqft acre_lot
## 1  ( 1 ) " " " "  "*"   " "     
## 2  ( 1 ) " " " "  "*"   "*"     
## 3  ( 1 ) " " "*"  "*"   "*"     
## 4  ( 1 ) "*" "*"  "*"   "*"

From the graph displaying Mallow’s \(C_p\), it appears three variables can be used in the regression model for house prices located in the 2127 zip code. The summary shows the three variables to use are square footage of the house (house size), size of the lot (in acres) and number of baths. The \(R^2\) value = 0.8289, which means roughly 83% of the variability can be explained by the regression model. So far, this has been our most confident model in predicting house prices in any location.

From the table depicting the 10 most common zip codes in Vermont, #1 on the list was zip code 5701. An AIC model is run for a subset for houses located in the 5701 zip code:

re_vt_5701 <- re%>%filter(state == "Vermont", zip_code == 5701)
#Regression model created to predict price.
fit_model_5701 <- lm(lprice ~ bed+bath+lsqft+acre_lot, data = re_vt_5701)
#Creates AIC model for the regression model above. 
stepAIC(fit_model_5701, direction = "both", k = log(804), trace = 0)

## 
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_vt_5701)
## 
## Coefficients:
## (Intercept)          bed         bath        lsqft     acre_lot  
##     6.40510     -0.23082      0.32138      0.79387      0.02869

#A fit line is created for the regression model.
fit_5701 <- lm(formula = lprice ~ bed+bath+lsqft+acre_lot, data = re_vt_5701)
#A regression subsets fit.
reg_fit_5701 = regsubsets(lprice ~ bed+bath+lsqft+acre_lot, data = re_vt_5701, nvmax = 4)
#Plots the summary of the regression subsets fit, showing the Cp fit. 
plot(summary(reg_fit_5701)$cp, xlab = "Number of Variables", ylab = "Cp")

#Displays a summary of the fit and reg fit lines. 
summary(fit_5701)

## 
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_vt_5701)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48289 -0.33821 -0.05551  0.32716  1.36356 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.405104   0.575934  11.121   <2e-16 ***
## bed         -0.230818   0.020510 -11.254   <2e-16 ***
## bath         0.321381   0.035286   9.108   <2e-16 ***
## lsqft        0.793867   0.084003   9.450   <2e-16 ***
## acre_lot     0.028686   0.002762  10.387   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5741 on 799 degrees of freedom
## Multiple R-squared:  0.3726, Adjusted R-squared:  0.3694 
## F-statistic: 118.6 on 4 and 799 DF,  p-value: < 2.2e-16

summary(reg_fit_5701)

## Subset selection object
## Call: regsubsets.formula(lprice ~ bed + bath + lsqft + acre_lot, data = re_vt_5701, 
##     nvmax = 4)
## 4 Variables  (and intercept)
##          Forced in Forced out
## bed          FALSE      FALSE
## bath         FALSE      FALSE
## lsqft        FALSE      FALSE
## acre_lot     FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
##          bed bath lsqft acre_lot
## 1  ( 1 ) " " "*"  " "   " "     
## 2  ( 1 ) " " "*"  " "   "*"     
## 3  ( 1 ) "*" " "  "*"   "*"     
## 4  ( 1 ) "*" "*"  "*"   "*"

From the graph displaying Mallow’s \(C_p\), it appears three variables can be used in the regression model for house prices located in the 2127 zip code. The summary shows the three variables to use are number of baths, size of the lot (in acres) and square footage of the house (house size). The \(R^2\) value = 0.3694, which means roughly 37% of the variability can be explained by the regression model.

The zip code with the second most listings in Vermont was 5641. An AIC model is run for a subset for houses located in the 5641 zip code:

re_vt_5641 <- re%>%filter(state == "Vermont", zip_code == 5641)
#Regression model created to predict price.
fit_model_5641 <- lm(lprice ~ bed+bath+lsqft+acre_lot, data = re_vt_5641)
#Creates AIC model for the regression model above. 
stepAIC(fit_model_5641, direction = "both", k = log(743), trace = 0)

## 
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_vt_5641)
## 
## Coefficients:
## (Intercept)          bed         bath        lsqft     acre_lot  
##    4.783177    -0.126191     0.048912     1.059461     0.003516

#A fit line is created for the regression model.
fit_5641 <- lm(formula = lprice ~ bed+bath+lsqft+acre_lot, data = re_vt_5641)
#A regression subsets fit.
reg_fit_5641 = regsubsets(lprice ~ bed+bath+lsqft+acre_lot, data = re_vt_5641, nvmax = 4)
#Plots the summary of the regression subsets fit, showing the Cp fit. 
plot(summary(reg_fit_5641)$cp, xlab = "Number of Variables", ylab = "Cp")

#Displays a summary of the fit and reg fit lines. 
summary(fit_5641)

## 
## Call:
## lm(formula = lprice ~ bed + bath + lsqft + acre_lot, data = re_vt_5641)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07102 -0.22701  0.04006  0.28046  0.69098 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.7831770  0.4237630  11.287   <2e-16 ***
## bed         -0.1261915  0.0131710  -9.581   <2e-16 ***
## bath         0.0489125  0.0183645   2.663   0.0079 ** 
## lsqft        1.0594612  0.0613718  17.263   <2e-16 ***
## acre_lot     0.0035156  0.0001878  18.724   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3681 on 738 degrees of freedom
## Multiple R-squared:  0.6115, Adjusted R-squared:  0.6094 
## F-statistic: 290.4 on 4 and 738 DF,  p-value: < 2.2e-16

summary(reg_fit_5641)

## Subset selection object
## Call: regsubsets.formula(lprice ~ bed + bath + lsqft + acre_lot, data = re_vt_5641, 
##     nvmax = 4)
## 4 Variables  (and intercept)
##          Forced in Forced out
## bed          FALSE      FALSE
## bath         FALSE      FALSE
## lsqft        FALSE      FALSE
## acre_lot     FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
##          bed bath lsqft acre_lot
## 1  ( 1 ) " " " "  "*"   " "     
## 2  ( 1 ) " " " "  "*"   "*"     
## 3  ( 1 ) "*" " "  "*"   "*"     
## 4  ( 1 ) "*" "*"  "*"   "*"

From the graph displaying Mallow’s \(C_p\), it appears two variables can be used in the regression model for house prices located in the 2127 zip code. The summary shows the two variables to use are square footage of the house (house size) and size of the lot (in acres). The \(R^2\) value = 0.6094, which means roughly 61% of the variability can be explained by the regression model.

Conclusion

In this session, exploratory data cleaning techniques and AIC modeling were demonstrated using the “USA Real Estate Dataset” dataset from Kaggle.

First, the dataset was inspected and checked for missing values. A range was done to view the range for the number of bedrooms and bathrooms. From there, a subset was created to remove missing values, as well as keeping the listings which contained no more than 6 bedrooms and no more than 6 bathrooms. The distribution was viewed two variables were log-transformed (price and house size) to have the data be normally distributed.

Next, a correlation plot was created to see the relationship among four variables (log-transformed price and house size, number of beds, number of baths). There doesn’t appear to be any strong correlations among the selected variables, which could impact the search for a best fit regression model.

Various AIC models are created in order to determine the best linear regression model for predicting price. In particular, AIC models were created for houses listed in Massachusetts and Vermont, as well as specific zip codes within Massachusetts and Vermont to determine if the formula for predicting price changes for each location. From the models, it appears types and number of variables can impact the price prediction for each location. For example, the two variables to predict house price in Massachusetts were number of baths and the zip code of the property, while the three variables to predict house prices in Vermont were number of baths, square footage of the house (house size), and size of the lot (in acres).

Overall, I only felt confident in one AIC model, which was the model for predicting price in the 2127 zip code of Massachusetts. This model had a \(R^2\) value of roughly 83%, which means about 83% of the variability can be explained by the regression model. All other AIC models had no higher than a 0.60 \(R^2\) value.

I believe that more factors added in to the model will help predict a more accurate price range for the data. For example, factors such as whether a house has a pool, solar, a heating and cooling system, or a garage (to name a few) can be help create a better prediction model.

Thank you to the reader for viewing my work.

Determine the Best Regression Model for Houses in the United States

Travis Gubbe

October 4, 2022