This dataset includes the covariates listed and described in the table below. There are 18 covariates and 158 observations. The data were collected manually from Zillow, with the search filtered to recently sold homes within Cincinnati city limits. Originally there were over 300 observations, most of which came from a shared Excel file to which classmates contributed. While working through the data, however, it became apparent that the shared entries contained substantial data entry errors. The shared data were therefore removed from the dataset to keep the analysis as accurate as possible.
| Variable | Type | Description |
|---|---|---|
| Zip | Character | Zip Code in which the Home is located |
| Price | Double | Final Sales Price of the Home |
| # Bed | Double | Number of Bedrooms |
| # Bath | Double | Number of Bathrooms (0.5 indicates half-bath) |
| Square Footage | Double | Square Footage of the Home |
| Year Built | Character | Year in which the Home was first built |
| Lot Size | Double | Square Footage of the Property on which the Home sits |
| Heating | Character | Method of Heating, such as Gas, Electric, Baseboard, None, or Other |
| Cooling | Character | Method of Cooling, such as Central, Wall Units, None, or Other |
| Parking Spaces | Double | Number of Parking Spaces in the Garage. Zero if no garage. |
| Garage | Character | Type of Garage: Attached, Detached, Carport, or None |
| Basement | Character | Status of the Basement: Finished, Partially Finished, Unfinished, or None |
| Stories | Double | Number of Floors in the Home |
| Pool | Boolean/Numeric | Does the Home have a private Pool; 1 if Yes, 0 if no |
| Style | Character | Style of Home, such as Historical, Traditional, Transitional, Contemporary, etc. |
| Fireplace | Boolean/Numeric | Does the Home have working fireplaces; 1 if Yes, 0 if no |
| Average Median Home Value in Zip | Double | The Average of the Median Home Values for each Zip Code. Taken from Zillow’s monthly data |
| Years Old | Double | Number of Years since being built. [ 2019 - Year Built ] |
The data cover sale dates from 2017-10-25 to 2019-11-04. There are 158 observations in the dataset, spanning 32 ZIP codes.
The characteristics captured for each sold home are listed again below; they are explained in the prior tab:
## [1] "Address" "zip" "sold_price"
## [4] "sold_date" "num_bed" "num_bath"
## [7] "sqr_ft" "year_built" "lot_size_ft"
## [10] "heating" "cooling" "parking_spaces"
## [13] "garage" "basement_status" "num_stories"
## [16] "Pool?" "type_and_style" "Fireplace"
## [19] "med_val" "age"
Below are some summary statistics of the houses’ sale prices and physical sizes:
| Summary Statistics Table of Home Prices and Size | zillow (N = 158) |
|---|---|
| **Final Sale Price** | |
| min | 16,000 |
| median (Q1, Q3) | 198,500.00 (138,000.00, 466,875.00) |
| mean (sd) | 364,384.72 ± 433,636.37 |
| max | 2,700,000 |
| Missing | 0 |
| **Square Footage of Home** | |
| min | 792 |
| median (Q1, Q3) | 1,783.50 (1,325.50, 2,297.75) |
| mean (sd) | 2,056.81 ± 1,070.53 |
| max | 7,900 |
| Missing | 0 |
| **Square Footage of Lot** | |
| min | 784 |
| median (Q1, Q3) | 7,623.00 (4,356.00, 11,761.15) |
| mean (sd) | 13,666.79 ± 24,528.67 |
| max | 217,800 |
| Missing | 0 |
The mean sale price of the homes in the sample is $364,385, with prices ranging from $16,000 to $2,700,000. The median is much lower, at $198,500.
The mean home size is 2,057 square feet, and the mean lot size is 13,667 square feet (approximately 0.31 acres).
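The acreage figure follows from the definition of an acre as 43,560 square feet. A minimal Python sketch of the conversion, using the mean lot size from the summary table (the function and variable names are illustrative, not part of the original analysis):

```python
# One acre is defined as exactly 43,560 square feet.
SQ_FT_PER_ACRE = 43_560


def sq_ft_to_acres(square_feet: float) -> float:
    """Convert an area in square feet to acres."""
    return square_feet / SQ_FT_PER_ACRE


# Mean lot size taken from the summary statistics table.
mean_lot_sq_ft = 13_666.79
mean_lot_acres = sq_ft_to_acres(mean_lot_sq_ft)

print(f"Mean lot size: {mean_lot_acres:.2f} acres")
```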
The same measures, broken down by ZIP code:
| Zip Code | Number Sold in ZIP | Median Home Value per Sq. Foot | Closing Price | Square Footage | Lot Size (Sq Ft) |
|---|---|---|---|---|---|
| 45202 | 26 | $15,318,792 | $593,154 | 2,566 | 13,233 |
| 45204 | 4 | NA | $128,750 | 1,764 | 4,628 |
| 45205 | 7 | $390,145 | $63,057 | 1,346 | 23,330 |
| 45206 | 7 | $2,403,527 | $307,693 | 2,378 | 6,957 |
| 45208 | 18 | $11,793,601 | $896,119 | 2,984 | 17,634 |
| 45209 | 7 | $2,201,414 | $364,059 | 1,591 | 6,091 |
| 45211 | 6 | $765,462 | $143,833 | 1,694 | 10,186 |
| 45212 | 1 | $109,784 | $80,000 | 1,065 | 2,744 |
| 45213 | 3 | $531,630 | $251,423 | 1,432 | 7,826 |
| 45214 | 1 | NA | $298,000 | 2,162 | 1,991 |
| 45215 | 1 | $236,263 | $419,000 | 2,305 | 20,909 |
| 45216 | 1 | $136,052 | $103,500 | 2,408 | 6,969 |
| 45217 | 1 | $133,600 | $205,000 | 1,920 | 11,326 |
| 45220 | 2 | $714,390 | $400,000 | 3,131 | 49,179 |
| 45223 | 6 | NA | $204,083 | 1,535 | 5,690 |
| 45224 | 3 | $457,047 | $195,050 | 1,910 | 14,562 |
| 45225 | 1 | NA | $80,025 | 2,189 | 2,613 |
| 45226 | 3 | $1,466,112 | $515,833 | 2,534 | 5,343 |
| 45227 | 9 | $1,906,193 | $229,948 | 1,671 | 8,156 |
| 45230 | 8 | $1,377,085 | $138,738 | 1,430 | 11,630 |
| 45231 | 1 | $66,990 | $94,000 | 792 | 6,011 |
| 45233 | 2 | $241,444 | $110,500 | 1,114 | 7,732 |
| 45237 | 5 | $641,660 | $138,080 | 1,789 | 6,917 |
| 45238 | 9 | $1,145,375 | $101,367 | 1,454 | 7,952 |
| 45239 | 4 | $599,162 | $134,125 | 1,742 | 8,832 |
| 45240 | 1 | $193,875 | $173,655 | 2,250 | 9,191 |
| 45243 | 3 | $2,902,303 | $1,216,847 | 4,662 | 61,565 |
| 45244 | 2 | $672,359 | $373,500 | 2,542 | 22,216 |
| 45245 | 11 | $1,812,507 | $167,245 | 1,414 | 17,874 |
| 45248 | 1 | $160,973 | $244,000 | 1,521 | 17,293 |
| 45249 | 2 | $696,377 | $327,250 | 2,411 | 25,613 |
| 45255 | 2 | $408,294 | $229,500 | 1,631 | 34,260 |
Look for Outliers

The center, shape, and spread of the data:
As shown in the histogram and boxplots below, the data are strongly skewed to the right. This is expected: prices and sizes cannot be negative, so the values are bounded below by zero but have no upper bound, and a small number of very large or expensive homes stretch the right tail.
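The skew can also be quantified rather than only read off a plot. The sketch below is a rough illustration, not the original analysis: it uses the 32 values from the Closing Price column of the per-ZIP table above (not the 158 individual sales), compares the mean with the median, and computes the moment coefficient of skewness. The function name `moment_skewness` is my own.

```python
from statistics import mean, median, pstdev

# Closing Price column of the per-ZIP table (one value per ZIP code).
closing_price_by_zip = [
    593_154, 128_750, 63_057, 307_693, 896_119, 364_059, 143_833, 80_000,
    251_423, 298_000, 419_000, 103_500, 205_000, 400_000, 204_083, 195_050,
    80_025, 515_833, 229_948, 138_738, 94_000, 110_500, 138_080, 101_367,
    134_125, 173_655, 1_216_847, 373_500, 167_245, 244_000, 327_250, 229_500,
]


def moment_skewness(values):
    """Moment coefficient of skewness: mean of cubed z-scores.

    Positive values indicate a longer right tail.
    """
    center = mean(values)
    spread = pstdev(values)
    return sum(((v - center) / spread) ** 3 for v in values) / len(values)


price_mean = mean(closing_price_by_zip)
price_median = median(closing_price_by_zip)
price_skew = moment_skewness(closing_price_by_zip)

print(f"mean = {price_mean:,.0f}, median = {price_median:,.0f}")
print(f"skewness = {price_skew:.2f}")
```

A mean well above the median, together with a positive skewness coefficient, is the numerical counterpart of the long right tail in the histogram.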
Below are some boxplots to see the spread of the data.
This bar graph shows the number of homes built throughout the years.
Here we can see the average home price by zip code to identify the general price of homes by neighborhood.
Here we can look for correlation between covariates using a pairs plot.
So far in my exploration and modeling attempts, I have tried transformations of the response variable (square root, inverse, etc.), transformations of some covariates (for example, squaring age, because older homes may also be valued higher; this may or may not be true, but a scatterplot of age and sold_price hints at it), and tests of how the model improves or worsens when certain interactions are included. As I have not yet determined my final model for the project, this question cannot yet be fully answered.
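To make the effect of these transformations concrete, here is a small Python sketch. It applies each candidate transformation to the five-number summary of sale price from the summary table and measures how much longer the upper tail is than the lower tail. The log transformation is included as a common additional option, which is my assumption and not something the analysis above reports trying; the helper names are illustrative.

```python
import math

# Five-number summary of sale price from the summary table:
# min, Q1, median, Q3, max.
price_summary = [16_000, 138_000, 198_500, 466_875, 2_700_000]

# Candidate response transformations. "log" is an added assumption.
response_transforms = {
    "identity": lambda y: y,
    "sqrt": math.sqrt,
    "inverse": lambda y: 1.0 / y,
    "log": math.log,
}


def upper_to_lower_tail_ratio(values):
    """(max - median) / (median - min); values above 1 mean a longer upper tail."""
    ordered = sorted(values)  # the inverse transform reverses the order
    low, mid, high = ordered[0], ordered[len(ordered) // 2], ordered[-1]
    return (high - mid) / (mid - low)


tail_ratios = {
    name: upper_to_lower_tail_ratio([transform(p) for p in price_summary])
    for name, transform in response_transforms.items()
}


def quadratic_age_terms(age):
    """Return the linear and squared age terms for a model with a curved age effect."""
    return age, age ** 2


for name, ratio in tail_ratios.items():
    print(f"{name:>8}: upper/lower tail ratio = {ratio:.2f}")
```

A ratio close to 1 suggests a roughly symmetric spread around the median on that scale, which is one informal way to compare candidate transformations before fitting anything.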
In the models I have created thus far, the QQ plots indicate that the distribution of the residuals is symmetric but has very heavy tails. I am struggling to find a model that captures these extreme points. I think outliers also contribute to this struggle, because prices range from $16,000 to $2,700,000.
I am also noticing that, with most models I have tried, the variance is not constant: the residual plot is cone-shaped.
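A cone shape can be checked numerically as well as visually. The sketch below is a rough diagnostic under stated assumptions, not a formal test such as Breusch-Pagan: it fits a one-predictor least-squares line, orders the residuals by fitted value, and compares the residual spread in the upper half with the lower half. The data are made-up toy values, not the Zillow data, and all names are illustrative.

```python
from statistics import mean, pstdev


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residual_spread_ratio(x, y):
    """Residual spread for high fitted values divided by spread for low ones.

    Values well above 1 suggest a cone that opens to the right.
    """
    intercept, slope = fit_simple_ols(x, y)
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    ordered = [r for _, r in sorted(zip(fitted, residuals))]
    half = len(ordered) // 2
    return pstdev(ordered[half:]) / pstdev(ordered[:half])


# Hypothetical toy data: a deterministic alternating "noise" term.
x = list(range(1, 21))
y_cone = [50 + 100 * xi + 10 * xi * (-1) ** xi for xi in x]  # noise grows with x
y_flat = [50 + 100 * xi + 10 * (-1) ** xi for xi in x]       # noise is constant

cone_ratio = residual_spread_ratio(x, y_cone)
flat_ratio = residual_spread_ratio(x, y_flat)

print(f"growing noise:  spread ratio = {cone_ratio:.2f}")
print(f"constant noise: spread ratio = {flat_ratio:.2f}")
```

The same ratio could be recomputed after transforming the response to see whether the spread has evened out.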
We can reduce the unequal variance by transforming the response variable. I am not yet sure how best to capture the extreme points. I plan to use a hybrid of forward selection and backward elimination to guide me to a good model.
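As a rough illustration of the selection step, the sketch below implements only the forward half of such a procedure in plain Python; a hybrid would also try dropping already-selected covariates after each addition. Using AIC as the criterion is my assumption, and the covariate values are coded toy numbers, not the Zillow data. The covariate names are borrowed from the dataset purely as labels.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares for OLS of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(design[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve_linear_system(xtx, xty)
    return sum((y[i] - sum(b * v for b, v in zip(beta, design[i]))) ** 2 for i in range(n))


def aic(rss_value, n, n_params):
    """Gaussian AIC up to an additive constant."""
    return n * math.log(rss_value / n) + 2 * n_params


def forward_select(candidates, y):
    """Greedily add the covariate that lowers AIC most; stop when none does."""
    n = len(y)
    selected = []
    best_aic = aic(rss([], y), n, 1)
    while True:
        trials = []
        for name in candidates:
            if name in selected:
                continue
            cols = [candidates[s] for s in selected + [name]]
            trials.append((aic(rss(cols, y), n, len(cols) + 1), name))
        if not trials:
            break
        trial_aic, name = min(trials)
        if trial_aic >= best_aic:
            break
        selected.append(name)
        best_aic = trial_aic
    return selected


# Hypothetical coded toy covariates (values of +1/-1), not real observations.
candidates = {
    "sqr_ft": [1, -1, 1, -1, 1, -1, 1, -1],
    "age": [1, 1, -1, -1, 1, 1, -1, -1],
    "num_stories": [1, 1, 1, 1, -1, -1, -1, -1],
}
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
# The toy response depends on sqr_ft and age only.
y = [3 + 2 * s + 1 * a + e for s, a, e in zip(candidates["sqr_ft"], candidates["age"], noise)]

chosen = forward_select(candidates, y)
print(chosen)
```

In this toy setup the third covariate is unrelated to the response, so adding it leaves the residual sum of squares essentially unchanged and the AIC penalty keeps it out of the model.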