1. In your data set, how many covariates do you have? What are they? How many observations do you have? How did you collect them?

This dataset includes the covariates listed and described in the table below. There are 18 covariates and 158 observations. The data were collected manually from Zillow, with the search filtered to recently sold homes within Cincinnati city limits. The dataset originally held over 300 observations, because it drew on a shared Excel file that included classmates' contributions. However, as I worked through the data, it became apparent that the shared entries contained substantial data entry errors. The shared data was therefore removed from the dataset so that the analysis would be as accurate as possible.

| Variable | Type | Description |
|---|---|---|
| Zip | Character | Zip Code in which the Home is located |
| Price | Double | Final Sales Price of the Home |
| # Bed | Double | Number of Bedrooms |
| # Bath | Double | Number of Bathrooms (0.5 indicates half-bath) |
| Square Footage | Double | Square Footage of the Home |
| Year Built | Character | Year in which the Home was first built |
| Lot Size | Double | Square Footage of the Property on which the Home sits |
| Heating | Character | Method of Heating, such as Gas, Electric, Baseboard, None, or Other |
| Cooling | Character | Method of Cooling, such as Central, Wall Units, None, or Other |
| Parking Spaces | Double | Number of Parking Spaces in the Garage. Zero if no garage. |
| Garage | Character | Type of Garage: Attached, Detached, Carport, or None |
| Basement | Character | Status of the Basement: Finished, Partially Finished, Unfinished, or None |
| Stories | Double | Number of Floors in the Home |
| Pool | Boolean/Numeric | Does the Home have a private Pool; 1 if Yes, 0 if No |
| Style | Character | Style of Home, such as Historical, Traditional, Transitional, Contemporary, etc. |
| Fireplace | Boolean/Numeric | Does the Home have working fireplaces; 1 if Yes, 0 if No |
| Average Median Home Value in Zip | Double | The Average of the Median Home Values for each Zip Code. Taken from Zillow’s monthly data |
| Years Old | Double | Number of Years since being built. [ 2019 - Year Built ] |
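
The derived Years Old variable follows directly from Year Built. Below is a minimal Python sketch of that calculation, assuming Year Built is stored as a four-digit character string as the table indicates. The function name `years_old` and the example year are illustrative and do not come from the dataset.

```python
REFERENCE_YEAR = 2019  # reference year given in the variable table


def years_old(year_built: str, reference_year: int = REFERENCE_YEAR) -> int:
    """Convert a character Year Built value into an age in years."""
    return reference_year - int(year_built.strip())


# Hypothetical example value, not a row from the dataset
print(years_old("1925"))  # 94
```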


2. Include a short exploratory data analysis of the collected data, i.e., figures and tables.

The data cover sale dates from 2017-10-25 to 2019-11-04. The dataset contains 158 observations spanning 32 ZIP codes.

The characteristics captured for each sold home are listed again below; they are explained in the variable table under question 1:

##  [1] "Address"         "zip"             "sold_price"     
##  [4] "sold_date"       "num_bed"         "num_bath"       
##  [7] "sqr_ft"          "year_built"      "lot_size_ft"    
## [10] "heating"         "cooling"         "parking_spaces" 
## [13] "garage"          "basement_status" "num_stories"    
## [16] "Pool?"           "type_and_style"  "Fireplace"      
## [19] "med_val"         "age"

Below are some summary statistics of the houses’ sale prices and physical size:

Summary Statistics Table of Home Prices and Size

| Variable | Statistic | zillow (N = 158) |
|---|---|---|
| Final Sale Price | min | 16,000 |
| | median (Q1, Q3) | 198,500.00 (138,000.00, 466,875.00) |
| | mean (sd) | 364,384.72 ± 433,636.37 |
| | max | 2,700,000 |
| | Missing | 0 |
| Square Footage of Home | min | 792 |
| | median (Q1, Q3) | 1,783.50 (1,325.50, 2,297.75) |
| | mean (sd) | 2,056.81 ± 1,070.53 |
| | max | 7,900 |
| | Missing | 0 |
| Square Footage of Lot | min | 784 |
| | median (Q1, Q3) | 7,623.00 (4,356.00, 11,761.15) |
| | mean (sd) | 13,666.79 ± 24,528.67 |
| | max | 217,800 |
| | Missing | 0 |

The mean sale price of the homes in the sample is $364,385 (median $198,500), but prices range from $16,000 to $2,700,000.

The mean home size in the sample is 2,057 square feet, and the homes sit on a mean lot size of 13,667 square feet (approximately 0.31 acres).
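
The acre figure comes from dividing the lot size in square feet by 43,560, the number of square feet in one acre. A short Python sketch of the conversion, using the mean lot size from the summary table; the function and variable names are illustrative.

```python
SQ_FT_PER_ACRE = 43_560  # one acre is defined as 43,560 square feet


def sq_ft_to_acres(sq_ft: float) -> float:
    """Convert an area in square feet to acres."""
    return sq_ft / SQ_FT_PER_ACRE


mean_lot_sq_ft = 13_666.79  # mean lot size from the summary statistics table
mean_lot_acres = sq_ft_to_acres(mean_lot_sq_ft)
print(round(mean_lot_acres, 2))  # 0.31
```

The same conversion puts the largest lot in the sample (217,800 square feet) at exactly 5 acres.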

The same measures, broken down by ZIP code:

Home Sizes and Values by Zip Code

| Zip Code | Number Sold in ZIP | Median Home Value per Sq. Foot | Closing Price | Square Footage | Lot Size (Sq Ft) |
|---|---|---|---|---|---|
| 45202 | 26 | $15,318,792 | $593,154 | 2,566 | 13,233 |
| 45204 | 4 | NA | $128,750 | 1,764 | 4,628 |
| 45205 | 7 | $390,145 | $63,057 | 1,346 | 23,330 |
| 45206 | 7 | $2,403,527 | $307,693 | 2,378 | 6,957 |
| 45208 | 18 | $11,793,601 | $896,119 | 2,984 | 17,634 |
| 45209 | 7 | $2,201,414 | $364,059 | 1,591 | 6,091 |
| 45211 | 6 | $765,462 | $143,833 | 1,694 | 10,186 |
| 45212 | 1 | $109,784 | $80,000 | 1,065 | 2,744 |
| 45213 | 3 | $531,630 | $251,423 | 1,432 | 7,826 |
| 45214 | 1 | NA | $298,000 | 2,162 | 1,991 |
| 45215 | 1 | $236,263 | $419,000 | 2,305 | 20,909 |
| 45216 | 1 | $136,052 | $103,500 | 2,408 | 6,969 |
| 45217 | 1 | $133,600 | $205,000 | 1,920 | 11,326 |
| 45220 | 2 | $714,390 | $400,000 | 3,131 | 49,179 |
| 45223 | 6 | NA | $204,083 | 1,535 | 5,690 |
| 45224 | 3 | $457,047 | $195,050 | 1,910 | 14,562 |
| 45225 | 1 | NA | $80,025 | 2,189 | 2,613 |
| 45226 | 3 | $1,466,112 | $515,833 | 2,534 | 5,343 |
| 45227 | 9 | $1,906,193 | $229,948 | 1,671 | 8,156 |
| 45230 | 8 | $1,377,085 | $138,738 | 1,430 | 11,630 |
| 45231 | 1 | $66,990 | $94,000 | 792 | 6,011 |
| 45233 | 2 | $241,444 | $110,500 | 1,114 | 7,732 |
| 45237 | 5 | $641,660 | $138,080 | 1,789 | 6,917 |
| 45238 | 9 | $1,145,375 | $101,367 | 1,454 | 7,952 |
| 45239 | 4 | $599,162 | $134,125 | 1,742 | 8,832 |
| 45240 | 1 | $193,875 | $173,655 | 2,250 | 9,191 |
| 45243 | 3 | $2,902,303 | $1,216,847 | 4,662 | 61,565 |
| 45244 | 2 | $672,359 | $373,500 | 2,542 | 22,216 |
| 45245 | 11 | $1,812,507 | $167,245 | 1,414 | 17,874 |
| 45248 | 1 | $160,973 | $244,000 | 1,521 | 17,293 |
| 45249 | 2 | $696,377 | $327,250 | 2,411 | 25,613 |
| 45255 | 2 | $408,294 | $229,500 | 1,631 | 34,260 |

Look for Outliers

The center, shape, and spread of the data:

As shown in the histogram and boxplots below, our data are strongly skewed to the right. This is intuitive: prices and sizes cannot be negative, so the distributions are bounded below by zero but have no upper bound.
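
The right skew can also be read off the summary statistics table, where each mean sits well above its median. As a rough numerical check, the Python sketch below computes Pearson's second skewness coefficient, 3 × (mean − median) / sd, from the values in that table. This is a swapped-in summary measure, not something computed in the original analysis, and the function name is illustrative.

```python
def pearson_median_skewness(mean: float, median: float, sd: float) -> float:
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    return 3 * (mean - median) / sd


# (mean, median, sd) copied from the summary statistics table
summary = {
    "sold_price": (364_384.72, 198_500.00, 433_636.37),
    "sqr_ft": (2_056.81, 1_783.50, 1_070.53),
    "lot_size_ft": (13_666.79, 7_623.00, 24_528.67),
}

skew = {name: pearson_median_skewness(*stats) for name, stats in summary.items()}
for name, value in skew.items():
    print(f"{name}: {value:.2f}")
```

A positive coefficient for all three variables is consistent with right skew, though the plots remain the primary evidence.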

Below are some boxplots to see the spread of the data.

This bar graph shows the number of homes built throughout the years.

Here we can see the average home price by zip code to identify the general price of homes by neighborhood.

Here we can look for correlation between covariates using a pair plot.


3. What kind of models are you using? What techniques are you using, for example, indicator variables, polynomial regression, transformation and so on?

So far in my exploration and modeling attempts, I have transformed the response variable (square root, inverse, etc.), transformed some covariates, and tested how the model improves or worsens when certain interactions are included. For example, I squared age because older homes also appear to be valued higher; this may or may not be true, but it is slightly indicated by a scatterplot of age against sold_price. As I have not yet determined my final model for the project, this question cannot yet be fully answered.
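
To make these transformations concrete, here is a minimal Python sketch of how the transformed response and the expanded covariates might be built for one home. It is an illustration only: the function names are hypothetical, the age-by-square-footage interaction is just one example of an interaction that could be tested, and the example home is not a row from the dataset.

```python
import math


def transform_response(sold_price: float) -> dict:
    """Candidate response transformations: square root and inverse."""
    return {
        "sqrt": math.sqrt(sold_price),
        "inverse": 1.0 / sold_price,
    }


def expand_covariates(age: float, sqr_ft: float) -> dict:
    """Add a squared age term and one illustrative interaction term."""
    return {
        "age": age,
        "age_sq": age ** 2,            # lets the age effect curve upward for old homes
        "sqr_ft": sqr_ft,
        "age_x_sqr_ft": age * sqr_ft,  # hypothetical interaction, chosen for illustration
    }


# Hypothetical home, not a row from the dataset
print(transform_response(250_000.0))
print(expand_covariates(age=94, sqr_ft=1_800))
```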


4. What are the potential problems/issues in your model? For example, skewness, nonnormality, nonlinearity, multicollinearity, heteroscedasticity, dummy variables, outliers and/or the data simply has very weak signal?

In the models I have fit thus far, the QQ plots indicate that the residual distribution is symmetric but has very heavy tails. I am struggling to find a model that captures these extreme points. I think outliers also contribute to this difficulty, because prices range from $16,000 to $2,700,000.

I am also noticing that, in most models I have tried, the variance is not constant (heteroscedasticity): the residual plot is cone-shaped.
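
A cone-shaped residual plot means the residual spread grows with the fitted value. One informal way to put a number on that, loosely in the spirit of the Goldfeld-Quandt test but not a formal test, is to compare the residual variance in the upper half of the fitted values with that in the lower half. The Python sketch below does this on synthetic data constructed to have a cone shape; the data and the function name `spread_ratio` are assumptions for illustration, not results from the Zillow models.

```python
from statistics import pvariance


def spread_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values divided by the lower half."""
    pairs = sorted(zip(fitted, residuals))  # order residuals by fitted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]
    return pvariance(high) / pvariance(low)


# Synthetic illustration only (not the Zillow data): residuals whose
# magnitude grows in proportion to the fitted value, i.e. a cone shape.
fitted = [100_000 + 10_000 * i for i in range(40)]
residuals = [0.1 * f * (1 if i % 2 == 0 else -1) for i, f in enumerate(fitted)]

print(spread_ratio(fitted, residuals) > 1)  # True
```

A ratio near 1 would suggest roughly constant variance, while a ratio well above 1 matches the cone shape described above.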


5. What kind of remedies are you proposing to use to solve the potential issues?

Transforming the response variable can reduce the unequal variance. I am not yet sure how best to capture the extreme points. I plan to use a hybrid of forward selection and backward elimination to guide me to a good model.
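
As a rough illustration of why a response transformation can help, the Python sketch below compares how far apart the smallest and largest sale prices sit on the raw, square-root, and log scales. The log transformation is an added candidate not named above, and the helper name is illustrative. Compressing the range does not by itself guarantee constant variance, so the residual plots would still need to be rechecked after refitting.

```python
import math

min_price, max_price = 16_000.0, 2_700_000.0  # observed range of sold_price


def max_to_min_ratio(transform) -> float:
    """Ratio of the transformed maximum price to the transformed minimum price."""
    return transform(max_price) / transform(min_price)


ratios = {
    "raw": max_to_min_ratio(lambda p: p),
    "sqrt": max_to_min_ratio(math.sqrt),
    "log": max_to_min_ratio(math.log),
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

On the raw scale the most expensive home is 168.75 times the price of the cheapest; both transformations shrink that gap considerably, with the log scale shrinking it the most.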