Complete the following tasks:
- Save this .qmd file in the folder you have designated on your computer for LM1
- Download and save the WestRoxbury.csv file into the same location as this file
NOTE
1 - The code for each TABLE is written to stand alone: the data are read in fresh from the dataset at the start of each table (the exception is Tables 2.12-2.14, which run together). This way you can run just that chunk, without running anything before it, to get your answer. As a result, the code may differ somewhat from the textbook.
2 - Also note that we will always read in the .csv file rather than use the mlba package from the textbook authors.
Practice showing different subsets of the data and some summary statistics.
(This is partial code displayed in text. You may want to add in all of the other code for practice.)
TOTAL.VALUE      TAX             LOT.SQFT        YR.BUILT       GROSS.AREA
Min.   : 105.0   Min.   : 1320   Min.   :  997   Min.   :   0   Min.   : 821
1st Qu.: 325.1   1st Qu.: 4090   1st Qu.: 4772   1st Qu.:1920   1st Qu.:2347
Median : 375.9   Median : 4728   Median : 5683   Median :1935   Median :2700
Mean   : 392.7   Mean   : 4939   Mean   : 6278   Mean   :1937   Mean   :2925
3rd Qu.: 438.8   3rd Qu.: 5520   3rd Qu.: 7022   3rd Qu.:1955   3rd Qu.:3239
Max.   :1217.8   Max.   :15319   Max.   :46411   Max.   :2011   Max.   :8154
LIVING.AREA    FLOORS          ROOMS            BEDROOMS       FULL.BATH
Min.   : 504   Min.   :1.000   Min.   : 3.000   Min.   :1.00   Min.   :1.000
1st Qu.:1308   1st Qu.:1.000   1st Qu.: 6.000   1st Qu.:3.00   1st Qu.:1.000
Median :1548   Median :2.000   Median : 7.000   Median :3.00   Median :1.000
Mean   :1657   Mean   :1.684   Mean   : 6.995   Mean   :3.23   Mean   :1.297
3rd Qu.:1874   3rd Qu.:2.000   3rd Qu.: 8.000   3rd Qu.:4.00   3rd Qu.:2.000
Max.   :5289   Max.   :3.000   Max.   :14.000   Max.   :9.00   Max.   :5.000
HALF.BATH        KITCHEN         FIREPLACE        REMODEL
Min.   :0.0000   Min.   :1.000   Min.   :0.0000   Length:5802
1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.0000   Class :character
Median :1.0000   Median :1.000   Median :1.0000   Mode  :character
Mean   :0.6139   Mean   :1.015   Mean   :0.7399
3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.0000
Max.   :3.0000   Max.   :2.000   Max.   :4.0000
TABLE 2.4
Sampling in R
Random sample of 5 observations
s <- sample(row.names(housing.df), 5)
housing.df[s, ]
rm(housing.df, upsampled.df) # remove (clear out) datasets from the environment; we will recreate them in the next step
TABLE 2.5
Reviewing Variables in R
Using the pipe operator to combine steps – read in the .csv and transform REMODEL to a factor all in one step, rather than in separate steps.
Read the pipe operator as “AND THEN”. For example, we would read this code as: create the housing.df dataframe by reading in the WestRoxbury.csv file AND THEN convert the REMODEL variable into a factor.
(Ctrl-Shift-M is the RStudio shortcut for inserting the pipe operator.)
Note that tidyverse is a collection of R packages. The mutate command is part of the dplyr package. We will use dplyr extensively in LM 2 for data manipulation and munging.
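As a minimal sketch of the "AND THEN" reading, the snippet below performs the same conversion two ways: in separate steps and in one piped step. It is an illustration under assumptions, not the course code: toy.df and its values are made up so the block runs without WestRoxbury.csv, and it uses base R's native pipe |> (R 4.1 or later) with transform() in place of %>% and mutate() so that no packages are required.

```r
# Toy stand-in for WestRoxbury.csv (hypothetical values, illustration only)
toy.df <- data.frame(
  TOTAL.VALUE = c(344.2, 412.6, 330.1, 498.6),
  REMODEL = c("None", "Recent", "None", "Old"),
  stringsAsFactors = FALSE
)

# Separate steps: copy the data, then convert REMODEL to a factor
step.df <- toy.df
step.df$REMODEL <- factor(step.df$REMODEL)

# One piped step: take toy.df AND THEN convert REMODEL to a factor
piped.df <- toy.df |> transform(REMODEL = factor(REMODEL))

# Both routes give a factor with the same levels
class(piped.df$REMODEL)
levels(piped.df$REMODEL)
levels(step.df$REMODEL)
```

With the tidyverse loaded, the piped step has the same shape: read.csv('WestRoxbury.csv') %>% mutate(REMODEL=factor(REMODEL)).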
In this section, we transform a categorical variable into dummy variables.
The first part replicates the textbook by creating only 2 dummy variables for a 3-level categorical variable AND deleting the original text variable. This is all you need to use the variables in a model – you just need to remember what level you excluded, because that is your reference level and important for interpretation.
I prefer keeping all three levels and the original variable in my dataset so that I have the option of which to use as my reference category. This is doable if there are few categoricals in the dataset.
We use the dummy_cols function from the fastDummies package to do this. It’s fast and efficient.
I include the traditional, longer ifelse method of creating dummy variables as a contrast. See why dummy_cols() is more efficient?
rm(housing.df)
# replicate but keep all columns and the original variable (these options default to FALSE, so I omit them)
housing.df <- dummy_cols(read.csv('WestRoxbury.csv'))
str(housing.df)
rm(housing.df)
# create the dummy variables using an if/else statement (the long, tedious method :))
housing.df <- read.csv('WestRoxbury.csv')
table(housing.df$REMODEL)
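The long method can be sketched as one ifelse() call per level. This is an assumed reconstruction rather than the original chunk: toy.df is made-up data, its three level names are assumptions for illustration, and the column names REMODEL_None, REMODEL_Old, and REMODEL_Recent are chosen to mirror the column_level naming pattern that dummy_cols() produces.

```r
# Toy stand-in for the REMODEL column (hypothetical values)
toy.df <- data.frame(
  REMODEL = c("None", "Recent", "None", "Old", "Recent"),
  stringsAsFactors = FALSE
)

# How many observations fall in each level?
table(toy.df$REMODEL)

# One ifelse() per level: 1 if the row has that level, 0 otherwise
toy.df$REMODEL_None   <- ifelse(toy.df$REMODEL == "None", 1, 0)
toy.df$REMODEL_Old    <- ifelse(toy.df$REMODEL == "Old", 1, 0)
toy.df$REMODEL_Recent <- ifelse(toy.df$REMODEL == "Recent", 1, 0)

# Sanity check: each row should have exactly one dummy switched on
dummy.cols <- c("REMODEL_None", "REMODEL_Old", "REMODEL_Recent")
rowSums(toy.df[, dummy.cols])

toy.df
```

Every additional level needs another hand-written line, which is the contrast with the single dummy_cols() call.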
In this section we want to deal with missing values.
It turns out that our original dataset has no missing values. So, we have to run some code to randomly create some missing values so that we can practice dealing with them.
housing.df <- dummy_cols(read.csv('WestRoxbury.csv'))
# check for missing values in our dataset
nbrna <- sum(is.na(housing.df$BEDROOMS))
cat("There are", nbrna, "observations with missing values for BEDROOMS. \n")
There are 0 observations with missing values for BEDROOMS.
# So let's randomly replace 10 values for BEDROOMS to missing
rows.to.missing <- sample(row.names(housing.df), 10)
housing.df[rows.to.missing,]$BEDROOMS <- NA
nbrna <- sum(is.na(housing.df$BEDROOMS))
cat("There are", nbrna, "observations with missing values for BEDROOMS. \n\n")
There are 10 observations with missing values for BEDROOMS.
summary(housing.df$BEDROOMS)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  1.000   3.000   3.000   3.231   4.000   9.000      10
# Now impute the missing values with the median of the remaining non-missing values
housing.df <- housing.df %>%
  replace_na(list(BEDROOMS = median(housing.df$BEDROOMS, na.rm = TRUE)))
nbrna <- sum(is.na(housing.df$BEDROOMS))
cat("There are", nbrna, "observations with missing values for BEDROOMS. \n\n")
There are 0 observations with missing values for BEDROOMS.
summary(housing.df$BEDROOMS)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    3.00    3.00    3.23    4.00    9.00
rm(housing.df, nbrna, rows.to.missing) # clean up the environment by name
TABLE 2.9
Data Partitioning in R
The text walks you through manual creation of partitions using random sampling.
This is useful to understand.
However, we will typically use the easier method from the caret package as shown in the last section of the code.
In this example, we partition into 60% training and 40% holdout sets.
library(caret)
housing.df <- read.csv('WestRoxbury.csv') %>%
  mutate(REMODEL = factor(REMODEL))
set.seed(1)
index <- caret::createDataPartition(housing.df$TOTAL.VALUE, p = 0.6, list = FALSE)
train.df <- housing.df[index, ]
holdout.df <- housing.df[-index, ]
rows <- nrow(housing.df) # nrow() counts number of observations
cat("The original dataset has", rows, "observations. \n\n")
The original dataset has 5802 observations.
trows <- nrow(train.df)
cat("The 60% training partition has", trows, "observations. \n\n")
rm(list=ls()) #clear out environment globally -- everything :)
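The manual approach mentioned above can be sketched in base R with sample(). This is an illustration under assumptions: toy.df is simulated data standing in for the housing data, and the object names are hypothetical. Unlike createDataPartition(), which samples within groups of the outcome so that its distribution is similar across partitions, plain sample() draws rows uniformly at random.

```r
set.seed(1)  # make the random split reproducible

# Simulated stand-in for the housing data (hypothetical values)
toy.df <- data.frame(
  TOTAL.VALUE = round(runif(100, min = 100, max = 1200), 1),
  ROOMS = sample(3:14, 100, replace = TRUE)
)

# Draw 60% of the row positions at random for training
n.train    <- round(0.6 * nrow(toy.df))
train.rows <- sample(seq_len(nrow(toy.df)), size = n.train)
train.df   <- toy.df[train.rows, ]

# Every row not drawn for training goes to the holdout set
holdout.df <- toy.df[-train.rows, ]

cat("Training:", nrow(train.df), "rows; holdout:", nrow(holdout.df), "rows\n")
```

Because the holdout set is defined as the rows not drawn for training, the two partitions cannot overlap and together cover the whole dataset.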
TABLE 2.11
Cleaning and Preprocessing Data
This is just a small taste of preprocessing. We will do much more in Week 2.
In this textbook example, the instructions are to keep only observations without missing values, remove the TAX variable, convert the REMODEL variable from a character to a factor variable and create categorical dummies for the second and third levels.
Use dplyr from the tidyverse to do this all in one step!
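The steps described above can be sketched as follows. This is a base R illustration under assumptions, not the dplyr chunk itself: toy.df is made-up data, the native pipe |> requires R 4.1 or later, and the dummy column names (REMODELOld, REMODELRecent) come from model.matrix(), which by default drops the first factor level as the reference.

```r
# Toy stand-in for the housing data, with one missing value (hypothetical)
toy.df <- data.frame(
  TOTAL.VALUE = c(344.2, 412.6, 330.1, 498.6, 331.5),
  TAX = c(4330, 5190, 4152, 6272, 4170),
  BEDROOMS = c(3, 4, NA, 5, 3),
  REMODEL = c("None", "Recent", "None", "Old", "None"),
  stringsAsFactors = FALSE
)

clean.df <- toy.df |>
  na.omit() |>                          # keep only rows without missing values
  subset(select = -TAX) |>              # remove the TAX variable
  transform(REMODEL = factor(REMODEL))  # character -> factor

# Dummies for the second and third levels; the first level ("None") is the
# reference. Column 1 of model.matrix() is the intercept, so it is dropped.
dummies  <- model.matrix(~ REMODEL, data = clean.df)[, -1]
clean.df <- cbind(clean.df, dummies)

str(clean.df)
```

In this sketch the dummy variables are added in a second statement. The original REMODEL column is kept, so the reference level stays visible.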
# get fitted values and residuals and place into a dataframe
train.res <- data.frame(actual = train.df$TOTAL.VALUE,
                        predicted = reg$fitted.values,
                        residuals = reg$residuals)
head(train.res, 10) # look at first 10 obs