Machine learning linear regression model to predict condominium sale prices and answer these questions: - How well does the size of a condominium (measured in gross square feet) explain or predict sale price across New York City as a whole? - How well does the size of a condominium explain or predict sale price for each individual borough?
library(readr)
Warning messages:
1: In readChar(file, size, TRUE) : truncating string with embedded nuls
2: In readChar(file, size, TRUE) : truncating string with embedded nuls
3: In readChar(file, size, TRUE) : truncating string with embedded nuls
4: In readChar(file, size, TRUE) : truncating string with embedded nuls
library(stringr)
library(ggplot2)
library(dplyr)
library(purrr)
library(tidyverse)
library(ggplot2)
library(tidyr)
library(broom)
Import the data as a data frame and rename the columns.
setwd("C:/Users/Ana/Desktop/Data Analytics/CSV Files")
NYC_property_sales <- read_csv("NYC_property_sales.csv")
Parsed with column specification:
cols(
.default = col_double(),
borough = [31mcol_character()[39m,
neighborhood = [31mcol_character()[39m,
building_class_category = [31mcol_character()[39m,
tax_class_at_present = [31mcol_character()[39m,
building_class_at_present = [31mcol_character()[39m,
address = [31mcol_character()[39m,
apartment_number = [31mcol_character()[39m,
building_class_at_time_of_sale = [31mcol_character()[39m,
sale_date = [34mcol_datetime(format = "")[39m
)
See spec(...) for full column specifications.
#head(NYC_property_sales, 100)
#filter to only include the building class "condominiums with elevators"
NYC_condos <- NYC_property_sales %>%
filter(building_class_at_time_of_sale == "R4")
head(NYC_condos)
Generate a scatterplot using the NYC_condos dataframe with gross_square_feet on the x-axis and sale_price on the y-axis
ggplot(data = NYC_condos,
aes(x = gross_square_feet, y = sale_price, color = borough)) +
geom_point(alpha = 0.5) +
scale_y_continuous(label = scales::comma) +
xlim(0, 12000) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Condominium Sale Price vs Floor Area", x = "Condominium size (sq ft)", y = "Sale Price ($)") +
theme(panel.background = element_rect(fill = "white"))
Manually set the limits of the plot
#set the limits of the plot to x = max $5,000, y = max $20,000,000
ggplot(data = NYC_condos,
aes(x = gross_square_feet, y = sale_price, color = borough)) +
geom_point(alpha = 0.5) +
scale_y_continuous(label = scales::comma, limits = c(0, 20000000)) +
xlim(0, 5000) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Condominium Sale Price vs Floor Area", x = "Condominium size (sq ft)", y = "Sale Price ($)") +
theme(panel.background = element_rect(fill = "white"))
Split the plot by borough
#facet wrap by borough
ggplot(data = NYC_condos,
aes(x = gross_square_feet, y = sale_price)) +
geom_point(alpha = 0.5) +
scale_y_continuous(label = scales::comma) +
#xlim(0, 5000) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Condominium Sale Price vs Floor Area", x = "Condominium size (sq ft)", y = "Sale Price ($)") +
theme(panel.background = element_rect(fill = "white")) +
facet_wrap(~borough, scales = "free", ncol = 2)
The plots above show that there is likely a correlation between gross_square_feet and sale_price. However it also appears that the regression line is affected by outliers. Next step is to remove some of the outlier values that we know are incorrect.
Then re-plot the scatter plots
#remove some erronous data
NYC_condos_original <- NYC_condos
#remove entries from that correspond to sale_prices > $200 million. One is the price of multiple units and one is for the most expensive apartment ever sold. As this is not particularly representative of the general data, it will be removed as it may cause the line of best fit to be skewed away from the majority of the data.
NYC_condos_2 <- NYC_condos %>%
filter(sale_price < 200000000)
nrow(NYC_condos_2)
[1] 8094
#filter out rows that correspond to multiple unit sales
NYC_condos_3 <- NYC_condos_2 %>%
group_by(sale_price, sale_date) %>%
filter(n() <= 2) %>%
arrange(desc(sale_price))
head(NYC_condos_3)
nrow(NYC_condos_3)
[1] 7945
ggplot(data = NYC_condos_3,
aes(x = gross_square_feet, y = sale_price)) +
geom_point(alpha = 0.5) +
scale_y_continuous(label = scales::comma) +
#xlim(0, 5000) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Condominium Sale Price vs Floor Area", x = "Condominium size (sq ft)", y = "Sale Price ($)") +
theme(panel.background = element_rect(fill = "white")) +
facet_wrap(~borough, scales = "free", ncol = 2)
#explore other potentially erronous data
NYC_condos_explore <- NYC_condos_3 %>%
filter(borough == "Manhattan", sale_price > 60000000, gross_square_feet < 2500)
#NYC_condos_explore
#even though there are some further data that could be errors, there is no way from telling from the data set provided. Therefore, they will be left in the data set. For example an apartment in Manhattan with 862 sq ft but sale price $74,400,000 is unlikely to be correct, but there is no way of telling if it is or not.
Create linear regression models
#create linear regression models
NYC_condos_lm_original <- lm(sale_price ~ gross_square_feet, data = NYC_condos_original)
NYC_condos_lm <- lm(sale_price ~ gross_square_feet, data = NYC_condos_3)
summary(NYC_condos_lm_original)
Call:
lm(formula = sale_price ~ gross_square_feet, data = NYC_condos_original)
Residuals:
Min 1Q Median 3Q Max
-79533546 -1211195 -860173 -251714 211550415
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 940683.39 57703.86 16.30 <2e-16 ***
gross_square_feet 1192.72 19.43 61.39 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4745000 on 8094 degrees of freedom
Multiple R-squared: 0.3177, Adjusted R-squared: 0.3176
F-statistic: 3769 on 1 and 8094 DF, p-value: < 2.2e-16
summary(NYC_condos_lm)
Call:
lm(formula = sale_price ~ gross_square_feet, data = NYC_condos_3)
Residuals:
Min 1Q Median 3Q Max
-17373592 -767805 57079 722027 73491983
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.419e+06 4.871e+04 -49.65 <2e-16 ***
gross_square_feet 3.859e+03 3.423e+01 112.75 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2442000 on 7943 degrees of freedom
Multiple R-squared: 0.6155, Adjusted R-squared: 0.6154
F-statistic: 1.271e+04 on 1 and 7943 DF, p-value: < 2.2e-16
A bivariate linear regression model has been carried out on the sale price and condominium size data. The hypothesis is that there is a relationship between sale price and condominium size. The null hypothesis is that there is no relationship between sale price and condominium size.
The first thing to look at is the value of β1 i.e. the gradient of the regression line. The gradient in the original dataset is 1193. The gradient in the cleaned dataset is 3859. Both of these values are far from zero so it would suggest there is a relationship however there are other statistics that need to be considered before making a final conclusion.
The t-statistic is the number of standard deviations that β1 is from 0. A large value of the t-statistic suggests that we can reject the null hypothesis. The t-statistic is 61 in the original dataset and 113 in the cleaned data set. This means that in both datasets, the t-statistic suggests the null hypothesis can be rejected.
We can also look at the p-value, which is the probability of observing any value (of β1) equal-to or larger than t if the null hypothesis (H0) is true.The p-value is the probability of the result occurring by chance. So a p-value of 0.9 means that the occurance is very likely to happen by chance and therefore is not down to the predictor variable. Therefore the null hypothesis stands. However, in both these datasets the p-value is neglidgible which means that is it incredibly unlikely that the response we see here is by chance. Therefore, we can reject the null hypothesis that there is no correlation between the predictor and response variable. i.e. it is extremely unlikely that the relationship between condominium size and sale price observed in this dataset is due to random chance.
RSE is the average amount that the response variable measurements deviate from the true regression line. A lower value for RSE means an overall better fit. The RSE is 4,745,000 in the original dataset and 2,442,000 after data cleaning.
R^2 is a measure of the proportion of the variability in our response which can be explained by the predictor variable. R^2 is a value between 0 and 1. A value nearer to 1 means a better fit. The adjusted R^2 value was 0.32 in the original dataset and 0.62 in the cleaned dataset. This means that 62% of the variability in our response variable can be explained by the predictor variable.
The standard error of the regression provides the absolute measure of the typical distance that the data points fall from the regression line. SE is in the units of the dependent variable.
confint(NYC_condos_lm_original)
2.5 % 97.5 %
(Intercept) 827568.990 1053797.787
gross_square_feet 1154.636 1230.802
confint(NYC_condos_lm)
2.5 % 97.5 %
(Intercept) -2514141.303 -2323159.698
gross_square_feet 3792.148 3926.339
Lastly, the confidence intervals for the original (uncleaned dataset) is 1155 - 1231 $ per square foot. The confidence intervals for the cleanded dataset is 3792 - 3926 $ per square foot. The confidence intervals for the original dataset is lower. This is likely due to the high value low floor area condominiums that were included in the original dataset but were removed in the cleaned dataset.
Next, nest the dataframe and fit models to the dataframe. Apply the broom() functions tidy() to each nested model to extract the model coefficients.
NYC_nested <- NYC_condos_3 %>%
group_by(borough) %>%
nest() %>%
mutate(linear_model = map(.x = data,
.f = ~lm(sale_price ~ gross_square_feet,
data = .))) %>%
mutate(tidy_coefficients = map(.x = linear_model,
.f = tidy,
conf.int = TRUE))
#print nested dataframe
NYC_nested
#print data for the Bronx
NYC_nested$data[[3]]
summary(NYC_nested$linear_model[[3]])
Call:
lm(formula = sale_price ~ gross_square_feet, data = .)
Residuals:
Min 1Q Median 3Q Max
-797569 -63973 -14647 30202 1493512
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -254011.87 24667.81 -10.30 <2e-16 ***
gross_square_feet 648.96 29.57 21.94 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 151500 on 328 degrees of freedom
Multiple R-squared: 0.5948, Adjusted R-squared: 0.5936
F-statistic: 481.5 on 1 and 328 DF, p-value: < 2.2e-16
Extract the tidy model coefficents
NYC_coefficients <- NYC_nested %>%
select(borough, tidy_coefficients) %>%
unnest(cols = tidy_coefficients)
NYC_slope <- NYC_coefficients %>%
filter(term == "gross_square_feet") %>%
arrange(estimate)
NYC_slope
#extract the nested linear model object for one of the boroughs and use the summary() function to inspect the results
Again, nest the dataframe and fit models to the dataframe. This time, apply the broom() function glance() to each nested model to extract the summary statistics.
NYC_nested_2 <- NYC_condos_3 %>%
group_by(borough) %>%
nest() %>%
mutate(linear_model = map(.x = data,
.f = ~lm(sale_price ~ gross_square_feet,
data = .))) %>%
mutate(tidy_summary_stats = map(.x = linear_model,
.f = glance))
#print nested dataframe
NYC_nested_2
#print data for the Bronx
NYC_nested_2$data[[3]]
NA
Extract the summary statistics.
NYC_summary_stats <- NYC_nested_2 %>%
select(borough, tidy_summary_stats) %>%
unnest(cols = tidy_summary_stats)
NYC_summary_stats
NA
The analysis has shown that, overall, we can say that sale_price can be predicted by gross_square_area. For each borough the t-statistic is sufficiently high and the p_value sufficiently low that we can say that there is a relationship between sale_price and gross_square_area. Some of the boroughs have regression lines that fit the data better than others. This is demonstrated by the adjusted R-squared value. A higher value means that more of the variability in the response variable can be attributed to the predictor variable. This is the case for Manhattan. However the fit for Queens is less good as the r-squared value is only 34%. This is evidenced in the scatter plots as you can see that the points for these two boroughs are spread further from the regression line.