Data Preparation

# load data
library(tidyverse)
library(Hmisc)
getHdata(boston)
head(boston)

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

What factors influence median home value and can it be predicted?

Cases

What are the cases, and how many are there?

Each case represents a section of a Boston-area city. There are 506 cases.

Data collection

Describe the method of data collection.

The data was originally used for the paper “Hedonic Housing Prices and the Demand for Clean Air”. It is unclear how it was collected, but the authors credit several people on the first page for providing the data. Apparently it came from a number of sources, such as city records and environmental research.

Type of study

What type of study is this (observational/experiment)?

This was an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The data was obtained from http://biostat.mc.vanderbilt.edu/DataSets

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable is median home value (the value variable, recorded in $1000s). It is numerical.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variables I summarize in the next section (rooms, distance, highway, and crime) are subject to change as the project progresses, but I believe they make a good starting place. I think every variable in the dataset offers promise, so I might try most of them out at least once. I will try to avoid blending true categoricals with continuous variables (unless that’s covered in Ch 8) because my only experience with ANCOVA models is from a SAS course I took through work.
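For reference, here is a minimal sketch of what blending a categorical predictor with a continuous one (an ANCOVA-style model) might look like in R. It uses simulated data rather than the boston dataset, and the access factor, its three levels, and all of the coefficients are made up purely for illustration.

```r
# Simulated stand-in data; nothing here comes from the real boston dataset.
set.seed(42)
n <- 200
sim <- data.frame(
  rooms  = rnorm(n, mean = 6.3, sd = 0.7),
  # Hypothetical three-level categorical predictor
  access = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                  levels = c("low", "mid", "high"))
)

# Made-up effects: each access level shifts the intercept, the slope is shared.
shift <- c(low = 0, mid = 2, high = 5)
sim$value <- -30 + 8 * sim$rooms +
  unname(shift[as.character(sim$access)]) +
  rnorm(n, sd = 3)

# lm() dummy-codes the factor automatically, with "low" as the baseline level.
fit <- lm(value ~ rooms + access, data = sim)
print(coef(fit))
```

The fitted model has one shared slope for rooms and a separate intercept shift for each non-baseline level of access. Whether that structure suits any variable in the real data is something I would still need to check.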

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

describe(boston$value)
## boston$value : Median value of owner-occupied homes / $1000 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0      229        1    22.53    9.778    10.20    12.75 
##      .25      .50      .75      .90      .95 
##    17.03    21.20    25.00    34.80    43.40 
## 
## lowest :  5.0  5.6  6.3  7.0  7.2, highest: 46.7 48.3 48.5 48.8 50.0
boston %>%
  ggplot() + geom_histogram(aes(x = value), color = "black", fill = "darkblue")

describe(boston$rooms)
## boston$rooms : Average no. rooms per dwelling 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0      446        1    6.285   0.7515    5.314    5.593 
##      .25      .50      .75      .90      .95 
##    5.886    6.209    6.623    7.151    7.588 
## 
## lowest : 3.561 3.863 4.138 4.368 4.519, highest: 8.375 8.398 8.704 8.725 8.780
boston %>%
  ggplot() + geom_histogram(aes(x = rooms), color = "black", fill = "darkgreen")

describe(boston$distance)
## boston$distance : Weighted distances to five Boston employment centers 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0      412        1    3.795    2.298    1.462    1.628 
##      .25      .50      .75      .90      .95 
##    2.100    3.207    5.188    6.817    7.828 
## 
## lowest :  1.1296  1.1370  1.1691  1.1742  1.1781
## highest:  9.2203  9.2229 10.5857 10.7103 12.1265
boston %>% 
  ggplot() + geom_histogram(aes(x = distance), color = "black", fill = "purple")

table(boston$highway)
## 
##   1   2   3   4   5   6   7   8  24 
##  20  24  38 110 115  26  17  24 132
describe(boston$crime)
## boston$crime : Per capita crime rate 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      506        0      504        1    3.614    5.794  0.02791  0.03820 
##      .25      .50      .75      .90      .95 
##  0.08205  0.25651  3.67708 10.75300 15.78915 
## 
## lowest :  0.00632  0.00906  0.01096  0.01301  0.01311
## highest: 45.74610 51.13580 67.92080 73.53410 88.97620
boston %>%
  ggplot() + geom_histogram(aes(x = crime), color = "black", fill = "orange")

boston %>% 
  mutate(
    crime_sqrt = sqrt(crime)
  ) %>%
  ggplot() + geom_histogram(aes(x = crime_sqrt), color = "black", fill = "orange")

The crime variable is heavily right-skewed (its mean of 3.614 is far above its median of 0.25651), so I believe taking the square root makes it a better predictor variable. It will likely take some work to find the right transformation.
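One way to compare candidate transformations is to compute the sample skewness before and after each one. The sketch below does this on simulated right-skewed data, not on the real crime column. The lognormal parameters and the skewness() helper are my own assumptions for illustration. The log transform is included as an alternative because every crime value in the summary above is strictly positive.

```r
# Simulated right-skewed stand-in for crime; the parameters are arbitrary.
set.seed(1)
crime_sim <- rlnorm(506, meanlog = -1, sdlog = 2)

# Hypothetical helper: moment-based sample skewness
# (third central moment divided by the variance raised to the 3/2 power).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Skewness of the raw values and of two candidate transformations.
skew_by_transform <- c(
  raw  = skewness(crime_sim),
  sqrt = skewness(sqrt(crime_sim)),
  log  = skewness(log(crime_sim))
)
print(round(skew_by_transform, 2))
```

A value closer to zero indicates a more symmetric distribution. On the real data, the same comparison could be run by replacing crime_sim with boston$crime.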