Introduction

The main goals of this assignment are to review your knowledge of R and to learn a bit about R Notebooks. This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. This document has three purposes:

  1. to demonstrate the use of RStudio’s Notebook capability
  2. to illustrate/review some useful R packages for preparing and plotting data
  3. to explain the requirements of the first R exercise. You should be able to use this Notebook (Rmd) file as a guide to complete the assignment as well.

This document first uses the BostonHousing.csv data file from Chapter 3 of our text to demonstrate a series of tasks and code chunks that you can adapt for a different dataset: WestRoxbury.csv.

After examining the RMD file that created this document, you will complete an assignment using the WestRoxbury.csv data file from Chapter 2.

Your Assignment

After reviewing these examples you’ll work with the WestRoxbury.csv data from Chapter 2 to answer 5 questions shown here. This is an individual assignment; if you need help, talk to the TAs, but not to classmates.

Your assignment is to create an attractive, insightful, professional-quality report using an R Notebook in which you:

  1. Read in the data
  2. Create a titled and labeled scatterplot of Tax vs. Total Value, including a fitted line (simple regression with lm)
  3. Report the formula that West Roxbury uses to compute the tax assessment on a property, AND explain why the Chapter 2 multiple regression model removed TAX from the model specification.
  4. Create a side-by-side boxplot display showing variation in Total Value by the number of full bathrooms in a home (HINT: treat the number of bathrooms as a factor)
  5. Comment on your impression of how property value varies with the number of bathrooms in a house.

Note that your comments and written explanations are as important as the correctness of the two simple graphs. (See syllabus for elaboration and further guidance on report writing.)

One bit of background may be helpful: In Massachusetts, cities and towns raise much of their operating revenues by taxing real estate. Typically, the tax on a property is a fixed percentage of the market valuation. In the WestRoxbury dataset, the TAX column reflects the annual tax owed on each piece of real estate. In other words, TAX is a linear function of the property value.
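To see what "TAX is a linear function of the property value" implies for a fitted regression, here is a minimal sketch on made-up numbers. The data frame toy, its column names, and the 1.5% rate are hypothetical placeholders chosen for illustration; they are not taken from WestRoxbury.csv.

```r
# Hypothetical data: four property values (in $000s) and a made-up tax rate.
toy <- data.frame(value = c(250, 320, 410, 575))
toy$tax <- 0.015 * toy$value * 1000   # tax in dollars; the 1.5% rate is an assumption

# Simple regression of tax on value
fit <- lm(tax ~ value, data = toy)
coef(fit)   # intercept is essentially 0; slope is dollars of tax per $000 of value
```

When one column is an exact linear function of another, the fitted line passes through every point, so the slope and intercept recover the underlying formula.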

R Markdown and Notebooks

Before going too far into this exercise, please review the online materials at R Markdown. The pages at that site explain the general ideas, and also show specific syntax and commands for formatted text and for code chunks.

I highly recommend using RStudio’s “cheatsheets” for help with Markdown, Notebooks, ggplot2, dplyr and other useful packages. See the Help menu in RStudio (Help > Cheat Sheets).

Unlike the textbook, which loads packages as needed within a script, I recommend loading packages (with the library command) early on in a script. Either way works, but placing the calls together allows a reader to find them easily.

Since library often generates messages and warnings that won’t add anything to the finished document, I recommend adding warning = FALSE and message = FALSE to the chunk options:

library(tidyverse)   # loads a number of helpful Hadley Wickham packages, including the three below
library(ggplot2)     # way better than Base plotting
library(readr)       # reads csv files as "tibbles"
library(tidyr)       # newer replacement for package Reshape
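The chunk options mentioned above go in the chunk header, for example {r setup, warning = FALSE, message = FALSE}. As an alternative sketch (not something the textbook does), the same options can be set once for all later chunks with knitr's opts_chunk$set. This assumes the knitr package is installed, which knitting requires anyway.

```r
# Apply these options to every later chunk in the notebook,
# instead of repeating them in each chunk header.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```

Setting the options per chunk keeps the choice visible next to the code it affects; setting them globally is shorter but hides warnings everywhere, so use it with some care.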

Boston Housing Example (adapted from Section 3.3 of our text)

Use this as a guide for YOUR notebook code

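In Chapter 3, Example 2 is about Amtrak ridership. We’ll reproduce the Chapter 3 code chunks here, and add some more commands to illustrate why and how we might wish to “tidy” and plot the data. NOTE: In an R Markdown file, a bare file name is looked up in the folder that contains the Rmd file, not in your console’s working directory. If the csv file is stored anywhere else, you must specify the entire file path when reading it.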

The BostonHousing data is published at the University of California, Irvine Machine Learning Repository and also available at our text’s publisher website; the original publication source is given in a footnote on p. 57, and each observation is one census tract in Boston.

All variables (columns) in the table are defined in Table 3.1 on p. 58 of the text.

ALSO: here we demonstrate the readr command read_csv and compare the resulting dataframe to the more conventional results of read.csv. The readr package is part of the “tidyverse” family of packages. Think of these as updates and improvements to some of the older packages.

housing.df <- read.csv("BostonHousing.csv")

head(housing.df,9)  #  top 9 rows of data, as in Table 3.2

Now let’s see the structure and a glimpse of the tibble:

housing.tbl <- read_csv("BostonHousing.csv")
## Parsed with column specification:
## cols(
##   CRIM = col_double(),
##   ZN = col_double(),
##   INDUS = col_double(),
##   CHAS = col_double(),
##   NOX = col_double(),
##   RM = col_double(),
##   AGE = col_double(),
##   DIS = col_double(),
##   RAD = col_double(),
##   TAX = col_double(),
##   PTRATIO = col_double(),
##   LSTAT = col_double(),
##   MEDV = col_double(),
##   `CAT. MEDV` = col_double()
## )
head(housing.tbl,9) #  Note additional metadata
# glimpse comes from the tibble package (re-exported by dplyr); it shows the structure of a tibble
glimpse(housing.tbl)  
## Observations: 506
## Variables: 14
## $ CRIM        <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0...
## $ ZN          <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, ...
## $ INDUS       <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7...
## $ CHAS        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NOX         <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524,...
## $ RM          <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172,...
## $ AGE         <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, ...
## $ DIS         <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605,...
## $ RAD         <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4...
## $ TAX         <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, ...
## $ PTRATIO     <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 1...
## $ LSTAT       <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93...
## $ MEDV        <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 1...
## $ `CAT. MEDV` <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
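To make the read.csv versus read_csv comparison concrete, here is a small self-contained sketch. It writes a tiny made-up csv to a temporary file (the two data rows are placeholders, not a copy of BostonHousing.csv) and reads it both ways. The names toy.df and toy.tbl are hypothetical.

```r
library(readr)

# Write a tiny made-up csv to a temporary file; only the header matters here.
tmp <- tempfile(fileext = ".csv")
writeLines(c("MEDV,CAT. MEDV", "24.0,0", "34.7,1"), tmp)

toy.df  <- read.csv(tmp)                       # base R: returns a data.frame
toy.tbl <- read_csv(tmp, col_types = cols())   # readr: returns a tibble; cols() guesses types quietly

names(toy.df)    # "MEDV" "CAT..MEDV": the space is replaced to make a syntactic name
names(toy.tbl)   # "MEDV" "CAT. MEDV": header kept as written, hence the backticks
class(toy.df)
class(toy.tbl)
```

The practical consequence is that the last column is referred to as CAT..MEDV in the data frame but as `CAT. MEDV` (with backticks) in the tibble.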

In the chapter, Figures 3.1 and 3.2 show some simple graphs, with code for both the base R plotting functions and ggplot2. Here we just use ggplot2, which I will encourage us to adopt as a standard for most of the course. This code chunk is adapted from the textbook, first creating a scatterplot similar to the one in the upper right of Figure 3.1. For added information, I’ve colored the points to indicate which tracts border the Charles River.

# Create the plot object, p, in layers and then display
p <- ggplot(housing.tbl, aes(x=LSTAT, y=MEDV, color=as.factor(CHAS))) +
     geom_point(alpha = 0.7)  # alpha controls the transparency of the points 
p + ggtitle("Boston Median Value \nby Percent Low Income & Proximity to River")

We might also add a smoother to the plot:

p + geom_smooth() +
     ggtitle("Boston Median Value \nby Percent Low Income & Proximity to River")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
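The smoother above defaults to loess. For a straight fitted line from a simple regression, one option is geom_smooth(method = "lm"). The sketch below uses R's built-in mtcars data as a stand-in so that it runs without any csv file; the variable choices, the object name p.lm, and the labels are illustrative only.

```r
library(ggplot2)

# mtcars ships with R; wt (weight) and mpg stand in for the variables of interest.
p.lm <- ggplot(mtcars, aes(x = wt, y = mpg)) +
     geom_point(alpha = 0.7) +
     geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # straight-line fit, no confidence band
     ggtitle("Fuel Economy by Vehicle Weight") +
     xlab("Weight (1000 lbs)") +
     ylab("Miles per Gallon")
p.lm

# Intercept and slope of the same line, estimated with lm directly
coef(lm(mpg ~ wt, data = mtcars))
```

Calling lm separately is useful when you need to report the fitted equation in your text, since the plot alone does not display the coefficients.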

In Figure 3.2, we find two distributional graphs of MEDV. We start with a histogram:

# named medv.hist rather than hist to avoid reusing the name of base R's hist() function
medv.hist <- ggplot(housing.tbl) +
     geom_histogram(aes(x=MEDV), fill="darkblue", binwidth = 5) +
     ggtitle("Median Values of Boston Housing") + 
     xlab("Median Home Value (000s)")
medv.hist

And then, the side-by-side boxplots of Median Values for properties that do and do not bound the Charles River. For improved labels, let’s create a new factor for the second variable, and assign descriptive level names:

housing.tbl$river <- factor(housing.tbl$CHAS, labels=c("No", "Yes"))
bp <- ggplot(housing.tbl) +
     geom_boxplot(aes(x=river, y=MEDV)) + 
     xlab("Does Tract Bound the Charles River?") +
     ylab("Median Value (000s)") +
     ggtitle("Does Bordering the River Affect Housing Value?")
bp
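To see what the factor() call above does before any plotting, here is a small base R sketch on a made-up 0/1 vector (the values are hypothetical, not taken from the data). The same idea applies to any count that you want treated as categories rather than as a number.

```r
# Hypothetical 0/1 indicator, like CHAS
chas <- c(0, 1, 1, 0, 0)

# Labels are matched to the sorted unique values: 0 -> "No", 1 -> "Yes"
river <- factor(chas, labels = c("No", "Yes"))
river
levels(river)
table(river)   # counts per category

# Without labels, the numbers themselves become the category names
factor(c(2, 1, 3, 2))
```

With a factor on the x axis, geom_boxplot draws one box per level, which is what produces the side-by-side display.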

Conclusion

There you are: an example of an R Notebook that weaves together text, graphs, and R code using tidy tibbles and ggplot2.

After writing and testing the code, choose the Knit button to knit it to a Word document. You then have an input file and an output file, your RMD and your Word document, which you may edit to make an insightful and communicative report. Zip all files together into a zipfile named “last name, first name, exercise 1.zip” and submit it via upload to LATTE.