This document provides R code and analysis to explore the relationships between various property features, school performance, pricing, and sales dynamics within a Property_Schools dataset.
First, we need to load the necessary R libraries and the dataset:
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## corrplot 0.92 loaded
##
## Loading required package: viridisLite
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 16104 Columns: 85
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (46): Address, Street Name, City/Location, School Name, Tax ID, Bed Room...
## dbl (32): UniqueID, Zip Code, Close Price, Tax Rate, Tax Year, Baths Full, B...
## lgl (7): HOA Mandatory, Maintenance Fee Yes No, Management Co YN, Pool Area...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 85
## UniqueID Address `Street Name` `City/Location` `Zip Code` `School Name`
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 18577356 185 Capetown Capetown Conroe 77356 Montgomery I…
## 2 1177382 11 LANSDOWNE LANSDOWNE The Woodlands 77382 Conroe Indep…
## 3 NA 203 N Vershir… Vershire Magnolia 77354 <NA>
## 4 NA 27037 Renly R… Renly Ridge Magnolia 77354 <NA>
## 5 NA 119 Keechie C… Keechie Creek Conroe 77304 <NA>
## 6 2377380 23 Olmstead R… Olmstead The Woodlands 77380 Conroe Indep…
## # ℹ 79 more variables: `Close Price` <dbl>, `Tax ID` <chr>, `Tax Rate` <dbl>,
## # `Tax Year` <dbl>, `Baths Full` <dbl>, `Baths Half` <dbl>,
## # `Baths Total` <dbl>, `Bed Room Description` <chr>, Bedrooms <dbl>,
## # `Beds Max` <dbl>, `Building SqFt` <dbl>, CDOM <dbl>, `Close Date` <chr>,
## # Country <chr>, County <chr>, `Current Price` <dbl>, `Current $/SqFt` <dbl>,
## # `(Current $/SqFT) / (List $/SqFt)` <dbl>, `Garage Carport` <chr>,
## # `Garage Desc` <chr>, `Heat System` <chr>, `HOA Mandatory` <lgl>, …
## UniqueID Address Street Name City/Location
## Min. :2.774e+05 Length:16104 Length:16104 Length:16104
## 1st Qu.:2.527e+08 Class :character Class :character Class :character
## Median :6.618e+08 Mode :character Mode :character Mode :character
## Mean : Inf
## 3rd Qu.:1.724e+09
## Max. : Inf
## NA's :6743
## Zip Code School Name Close Price Tax ID
## Min. :77002 Length:16104 Min. : 1250 Length:16104
## 1st Qu.:77316 Class :character 1st Qu.: 2050 Class :character
## Median :77386 Mode :character Median : 2400 Mode :character
## Mean :77346 Mean : 41033
## 3rd Qu.:77459 3rd Qu.: 2900
## Max. :77598 Max. :600000
## NA's :3 NA's :4057
## Tax Rate Tax Year Baths Full Baths Half
## Min. : 0.233 Min. :2011 Min. :2.000 Min. :0.0000
## 1st Qu.: 2.580 1st Qu.:2021 1st Qu.:2.000 1st Qu.:0.0000
## Median : 2.914 Median :2021 Median :2.000 Median :1.0000
## Mean : 3.451 Mean :2021 Mean :2.368 Mean :0.5794
## 3rd Qu.: 3.200 3rd Qu.:2021 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :2513.000 Max. :3312 Max. :5.000 Max. :5.0000
## NA's :11504 NA's :12159
## Baths Total Bed Room Description Bedrooms Beds Max
## Min. :2.000 Length:16104 Min. :3.000 Min. : 0.000
## 1st Qu.:2.000 Class :character 1st Qu.:3.000 1st Qu.: 3.000
## Median :2.100 Mode :character Median :4.000 Median : 4.000
## Mean :2.426 Mean :3.693 Mean : 3.845
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.: 4.000
## Max. :5.100 Max. :8.000 Max. :32.000
## NA's :12434
## Building SqFt CDOM Close Date Country
## Min. :1700 Min. : -7.00 Length:16104 Length:16104
## 1st Qu.:1967 1st Qu.: 9.00 Class :character Class :character
## Median :2264 Median : 21.00 Mode :character Mode :character
## Mean :2365 Mean : 34.08
## 3rd Qu.:2654 3rd Qu.: 42.00
## Max. :4000 Max. :687.00
##
## County Current Price Current $/SqFt
## Length:16104 Min. : 1250 Min. : 0.4883
## Class :character 1st Qu.: 2166 1st Qu.: 0.9758
## Mode :character Median : 2695 Median : 1.1350
## Mean :131044 Mean : 55.3024
## 3rd Qu.:325000 3rd Qu.:144.0497
## Max. :600000 Max. :317.1892
##
## (Current $/SqFT) / (List $/SqFt) Garage Carport Garage Desc
## Min. :0.6412 Length:16104 Length:16104
## 1st Qu.:1.0000 Class :character Class :character
## Median :1.0000 Mode :character Mode :character
## Mean :0.9994
## 3rd Qu.:1.0000
## Max. :2.5714
##
## Heat System HOA Mandatory HOA Phone HOA Website
## Length:16104 Mode :logical Length:16104 Length:16104
## Class :character FALSE:411 Class :character Class :character
## Mode :character TRUE :4828 Mode :character Mode :character
## NA's :10865
##
##
##
## Legal Legal Subdivision List Price List $/SqFt
## Length:16104 Length:16104 Min. : 1250 Min. : 0.49
## Class :character Class :character 1st Qu.: 2175 1st Qu.: 0.98
## Mode :character Mode :character Median : 2695 Median : 1.13
## Mean :131469 Mean : 55.48
## 3rd Qu.:325000 3rd Qu.:144.34
## Max. :690201 Max. :317.56
##
## Lot Description Lot Price Sq Ft List Lot Price Sq Ft Sold Lot Size
## Length:16104 Min. : 0.00 Min. : 0.000 Min. : 0
## Class :character 1st Qu.: 0.33 1st Qu.: 0.320 1st Qu.: 5352
## Mode :character Median : 0.42 Median : 0.370 Median : 6386
## Mean : 27.86 Mean : 6.735 Mean : 6821
## 3rd Qu.: 46.14 3rd Qu.: 0.480 3rd Qu.: 7674
## Max. :12355.36 Max. :341.490 Max. :701721
## NA's :2707 NA's :5926 NA's :2668
## Lot Size Source Maint Fee Amt Maint Fee Includes Maint Fee Pay Schedule
## Length:16104 Min. : 0.0 Length:16104 Length:16104
## Class :character 1st Qu.: 500.0 Class :character Class :character
## Mode :character Median : 770.0 Mode :character Mode :character
## Mean : 808.1
## 3rd Qu.:1050.0
## Max. :3317.0
## NA's :11307
## Maintenance Fee Yes No Management Co YN Management Company Market Area
## Mode:logical Mode :logical Length:16104 Length:16104
## NA's:16104 FALSE:6622 Class :character Class :character
## TRUE :4240 Mode :character Mode :character
## NA's :5242
##
##
##
## Master Planned Community Original List Price Other Mandatory Fee
## Length:16104 Min. : 2 Length:16104
## Class :character 1st Qu.: 2200 Class :character
## Mode :character Median : 2715 Mode :character
## Mean : 135034
## 3rd Qu.: 330000
## Max. :4150000
## NA's :1
## Pool Area Pool Private Pool Private Desc Price Sale
## Mode :logical Mode :logical Length:16104 Min. : 0
## FALSE:5840 FALSE:16104 Class :character 1st Qu.: 340000
## TRUE :8277 Mode :character Median : 434990
## NA's :1987 Mean : 463654
## 3rd Qu.: 540000
## Max. :4300000
## NA's :15915
## Property Description Property Type School District School District KW
## Length:16104 Length:16104 Length:16104 Length:16104
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## School Elementary EM (lower) EM - District EM - Great Schools
## Length:16104 Length:16104 Length:16104 Min. : 0.000
## Class :character Class :character Class :character 1st Qu.: 4.000
## Mode :character Mode :character Mode :character Median : 5.000
## Mean : 5.632
## 3rd Qu.: 7.000
## Max. :10.000
## NA's :6730
## EM - Score School Middle School High Section Num
## Min. : 0.00 Length:16104 Length:16104 Length:16104
## 1st Qu.:100.00 Class :character Class :character Class :character
## Median :100.00 Mode :character Mode :character Mode :character
## Mean : 97.92
## 3rd Qu.:100.00
## Max. :100.00
## NA's :6730
## Sold Terms SP/LP Ratio Sales Velocity State Or Province
## Length:16104 Min. :0.641 Mode:logical Length:16104
## Class :character 1st Qu.:1.000 NA's:16104 Class :character
## Mode :character Median :1.000 Mode :character
## Mean :0.999
## 3rd Qu.:1.000
## Max. :2.571
## NA's :4057
## Status Stories Street Dir Prefix Street Dir Suffix
## Length:16104 Min. :1.000 Length:16104 Length:16104
## Class :character 1st Qu.:1.000 Class :character Class :character
## Mode :character Median :2.000 Mode :character Mode :character
## Mean :1.686
## 3rd Qu.:2.000
## Max. :4.000
## NA's :713
## Street Number Street Suffix Street Surface Subdivision
## Min. : 0 Length:16104 Length:16104 Length:16104
## 1st Qu.: 2913 Class :character Class :character Class :character
## Median : 7811 Mode :character Mode :character Mode :character
## Mean :10702
## 3rd Qu.:18106
## Max. :42561
## NA's :24
## Tax Amount Utilities Utility District Water Sewer
## Min. : 0 Length:16104 Mode :logical Length:16104
## 1st Qu.: 5505 Class :character FALSE:983 Class :character
## Median : 6906 Mode :character TRUE :4699 Mode :character
## Mean : 6806 NA's :10422
## 3rd Qu.: 8645
## Max. :23307
## NA's :13473
## Year Built Year Built Src Zip Code + 4 Welcome Version
## Min. : 2000 Length:16104 Min. : 1 Min. :1.000
## 1st Qu.: 2006 Class :character 1st Qu.:2266 1st Qu.:1.000
## Median : 2014 Mode :character Median :3828 Median :3.000
## Mean : 2015 Mean :4079 Mean :2.931
## 3rd Qu.: 2021 3rd Qu.:5619 3rd Qu.:4.000
## Max. :20222 Max. :9865 Max. :5.000
## NA's :5278
This section explores how school performance ratings influence property pricing and how quickly properties sell. Indicators for property selling performance include: CDOM (cumulative days on the market), listing price, and sales velocity.
Analysis: We’ll examine the relationship between school performance (EM - Great Schools) and Listing Price. If school performance is numeric (0 - 10), we can calculate a correlation coefficient. If it’s categorical (e.g., A, B, C), we’ll use box plots to compare price distributions across different rating categories. We can also change the class of the school performance variable from numeric to categorical and construct a boxplot.
## Warning: Removed 6730 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning in viridisLite::viridis(256, alpha, begin, end, direction, option):
## Option 'YlGnBu' does not exist. Defaulting to 'viridis'.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6730 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: size.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: Removed 6730 rows containing missing values or values outside the scale range
## (`geom_point()`).
The boxplot suggests that schools with higher performance have a greater median (50th percentile) list price and properties located in lower performing school zones have a lower list price. There could be other factors that influence the list price such as the building square footage, number of bedrooms, and number of bathrooms.
Analysis: We’ll investigate the relationship between School Performance Ratings and CDOM (Cumulative Days on Market). A lower CDOM indicates faster sales velocity where sales velocity is calculated by taking the list price and dividing by the CDOM.
## Correlation between School Performance Ratings and CDOM: -0.05094555
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Single-Family Homes in high and low performing school zones sell faster. Higher performing school zones have higher quality homes and lower performing school zones have lower listing prices.
#Question 3: Is there a premium on Listing Price for properties within top-rated school districts, even when controlling for other features like Building Square Footage and Number of Bedrooms/Bathrooms? We can generate a linear regression model to test for influence on the features on the list price of the properties. One observation is that the list price is not normally distributed suggesting their might be higher variability at the tails of the distribution.
##
## --- Regression Model: Listing Price Premium by School Performance ---
##
## Call:
## lm(formula = `List Price` ~ `EM - Great Schools` + `Building SqFt` +
## Bedrooms + `Baths Full`, data = df_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -348140 -125212 -93388 201863 512913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33045.040 13609.253 -2.428 0.0152 *
## `EM - Great Schools` -9222.467 967.042 -9.537 < 2e-16 ***
## `Building SqFt` 26.943 5.188 5.194 2.1e-07 ***
## Bedrooms -870.997 3751.254 -0.232 0.8164
## `Baths Full` 66868.912 4379.326 15.269 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 187500 on 9369 degrees of freedom
## Multiple R-squared: 0.05887, Adjusted R-squared: 0.05847
## F-statistic: 146.5 on 4 and 9369 DF, p-value: < 2.2e-16
## Warning: package 'rgl' was built under R version 4.4.1
## Warning in rgl.planes(a = coef(model_price_premium)[2], b = coef(model_price_premium)[3], : 'rgl.planes' is deprecated.
## Use 'planes3d' instead.
## See help("Deprecated")
The regression model shows the theoretical quantiles are not normally distributed and the variability of the distribution is greater at the lower fitted values.
We can filter out the poor performing schools and with the remaining high performing schools, we can evaludate the CDOM and Listing Price for properties with and without a pool.
##
## Attaching package: 'scales'
## The following object is masked from 'package:viridis':
##
## viridis_pal
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation:
## colour.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation:
## colour.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
# Observation: Our results indicate that properties with pool areas have
a much higher listing price (on average). Interestingly, as the school
performance increases, the probability that a property will have a pool
also increases suggesting a positive correlation between school
performance and quality in amenities. For properties that have a pool,
the distribution is highly skewed as the the third quartile (top of the
box and whisker plot with teal color) is much higher in the listing
price compared to the median (50th percentile) and 25th percentile. This
suggests that there might be other factors that influence the list price
such as the number of bedrooms, bathrooms, and building square
footage.