My first R exercise for BUS 212

Introduction

The main goals of this assignment are to review your knowledge of R and to learn a bit about R Notebooks. This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. This document has four purposes:

to demonstrate the use of RStudio’s Notebook capability
to illustrate and review some useful R packages for previewing and plotting data, as well as performing multiple regression
to reproduce and modify the given analysis with new data
to contemplate the modeling pitfall known as Target Leakage

Use this Notebook (Rmd) file as a guide to complete the assignment. You may use Base R or Tidyverse for your data management.

This document first uses the BostonHousing.csv data file from Chapter 3 of our text to demonstrate a series of tasks and code chunks. You can then adapt those chunks to complete the same tasks for a different dataset: WestRoxbury.csv.

Your Assignment

This is an individual assignment; if you need help, talk to the TAs, but not to classmates.

Your assignment is to create an attractive, insightful, professional-quality report using an R Notebook in which you:

Read in the data
Create a titled and labeled scatterplot
Create a boxplot showing variation in a target variable by some categorical variable (factor).
Fit your best regression to the Boston Housing data with MEDV as the target variable. Try using all other variables as predictors. Show diagnostic plots of your residuals.
Fit your best regression to the West Roxbury data with TOTAL VALUE as the target variable. Try using all other variables as predictors. Show diagnostic plots of your residuals.
Explain why one of your regression models should exclude TAX from the predictors.
Interpret how property value relates to a variety of predictors.

Note that your comments and written explanations are as important as the correctness of the two simple graphs. (See syllabus for elaboration and further guidance on report writing.)

R Markdown and Notebooks

Before going too far into this exercise, please review the online materials at R Markdown. The pages at that site explain the general ideas, and also show specific syntax and commands for formatted text and for code chunks.

I highly recommend using RStudio’s “cheatsheets” for help with Markdown, Notebooks, ggplot2, dplyr and other useful packages. See the Help menu in RStudio (Help > Cheat Sheets).

Unlike the textbook, which loads packages as needed within a script, I recommend loading packages (with the library command) early in a script. Either way works, but placing them all together allows a reader to find them easily.

Since library often generates messages and warnings that add nothing to the finished document, I recommend setting warning = FALSE and message = FALSE in the chunk options:

library(ggplot2)     
# put other libraries here
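
As a minimal sketch of those chunk options: the comment below shows what the opening line of the chunk would contain in the Rmd source, and knitr::opts_chunk$set() applies the same options to every later chunk. The chunk label setup and the object name chunk_opts are my own illustrative choices, and the guard around knitr is there only so the lines also run outside a notebook.

```r
# In the .Rmd source, the options sit in the chunk header, for example:
#   {r setup, warning = FALSE, message = FALSE}
# ("setup" is an illustrative label, not something the assignment requires)

# Options to apply to every later chunk in the notebook
chunk_opts <- list(warning = FALSE, message = FALSE)

# knitr renders the notebook; the guard lets this run where knitr is absent
if (requireNamespace("knitr", quietly = TRUE)) {
  do.call(knitr::opts_chunk$set, chunk_opts)
}
```

Setting the options once near the top keeps the individual chunk headers short; setting them in a single chunk header affects only that chunk.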

Boston Housing Example (adapted from Section 3.3 of our text)

The BostonHousing data is published at the University of California, Irvine Machine Learning Repository and is also available on our text’s publisher website. The original publication source is given in a footnote on p. 57. Each observation is one census tract in Boston.

All variables (columns) in the Boston Housing data are defined in Table 3.1 on p. 58 of the text. The West Roxbury dataset is documented in Table 2.1 on p. 23.

housing.df <- read.csv("BostonHousing.csv")
#housing.df <- read.csv("WestRoxbury.csv")

head(housing.df,9)  #  top 9 rows of data, as in Table 3.2
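
Beyond head, a few base R calls give a quick preview of size, types, and missing values. This is only a sketch: so that it runs without the CSV file, it uses MASS::Boston as a stand-in, which I believe comes from the same original source but has lower-case column names, an extra column, and no CAT..MEDV. With the course file, skip the first two lines and keep your read.csv call.

```r
# Stand-in data (assumption): MASS::Boston instead of BostonHousing.csv
housing.df <- MASS::Boston
names(housing.df) <- toupper(names(housing.df))  # match the upper-case names used here

dim(housing.df)             # number of rows (tracts) and columns (variables)
str(housing.df)             # type and first few values of every column
summary(housing.df$MEDV)    # five-number summary plus mean of the target
colSums(is.na(housing.df))  # count of missing values per column
```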

In the chapter, Figures 3.1 and 3.2 show some simple graphs, with code for both the base R plotting functions and the ggplot method from library ggplot2. Here we use only ggplot2, which I encourage us to adopt as a standard for most of the course. The scatterplot chunk further below is adapted from the textbook and creates a scatterplot similar to the one in the upper right of Figure 3.1. For added information, I’ve colored the points to indicate which tracts border the Charles River.

In Figure 3.2, we find two distributional graphs of MEDV. We start with a histogram:

# fill and binwidth are arguments of geom_histogram(); ggplot() does not apply them
hist <- ggplot(data=housing.df, aes(x=MEDV)) +
     geom_histogram(fill="darkblue", binwidth = 5) +
     ggtitle("Median Values of Boston Housing") + 
     xlab("Median Home Value (000s)")
hist

Let’s look at boxplots of Median Values for properties that do and do not bound the Charles River. For improved labels, let’s create a new factor for the second variable, and assign descriptive level names:

housing.df$river <- factor(housing.df$CHAS, labels=c("No", "Yes"))
bp <- ggplot(housing.df) +
     geom_boxplot(aes(x=river, y=MEDV)) + 
     xlab("Does Tract Bound the Charles River?") +
     ylab("Median Value (000s)") +
     ggtitle("Does Bordering the River Affect Housing Value with statistical significance?")
bp
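
A boxplot shows the difference between the two groups but cannot by itself answer the significance question in the title. One possible check, offered as a sketch and not as the required method, is a Welch two-sample t-test on MEDV by river. So that the block runs without the CSV file, it uses MASS::Boston as a stand-in (an assumption); the name tt is illustrative.

```r
# Stand-in data (assumption): MASS::Boston instead of BostonHousing.csv
housing.df <- MASS::Boston
names(housing.df) <- toupper(names(housing.df))
housing.df$river <- factor(housing.df$CHAS, labels = c("No", "Yes"))

# Welch two-sample t-test: does mean MEDV differ between the two groups?
tt <- t.test(MEDV ~ river, data = housing.df)
tt

# Group means, for reading alongside the boxplot
aggregate(MEDV ~ river, data = housing.df, FUN = mean)
```

This test ignores every other variable. The CHAS coefficient in the multiple regression addresses the same question while holding the other predictors fixed.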

# Create the plot object, p, in layers and then display
p <- ggplot(housing.df, aes(x=LSTAT, y=MEDV, color=as.factor(CHAS))) +
     geom_point(alpha = 0.7)  # alpha controls the transparency of the points 
p + ggtitle("Boston Median Value \nby Percent Low Income & Proximity to River")

We may benefit from adding a smoother to the plot:

p + geom_smooth(method=loess) +
     ggtitle("Boston Median Value by Percent Low Income & Proximity to River\nDoes Charles proximity make a difference?")

## `geom_smooth()` using formula 'y ~ x'

#Model some regressions, starting with Boston Housing data and ending with West Roxbury data.
mr <- lm( MEDV ~ . , data = housing.df)     # Boston Housing
#mr <- lm( TOTAL.VALUE ~ . , data = housing.df)     # West Roxbury
summary(mr)

## 
## Call:
## lm(formula = MEDV ~ ., data = housing.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8156 -1.9975 -0.2335  1.6757 16.0932 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.954458   3.816870  11.254  < 2e-16 ***
## CRIM         -0.129678   0.025517  -5.082 5.32e-07 ***
## ZN           -0.005113   0.011103  -0.460 0.645396    
## INDUS         0.114290   0.048362   2.363 0.018506 *  
## CHAS          2.359846   0.673138   3.506 0.000497 ***
## NOX         -15.362403   2.983384  -5.149 3.79e-07 ***
## RM            1.058350   0.354782   2.983 0.002995 ** 
## AGE          -0.006162   0.010319  -0.597 0.550689    
## DIS          -0.733482   0.161312  -4.547 6.86e-06 ***
## RAD           0.205249   0.051933   3.952 8.88e-05 ***
## TAX          -0.009369   0.002944  -3.182 0.001554 ** 
## PTRATIO      -0.558002   0.104307  -5.350 1.35e-07 ***
## LSTAT        -0.478377   0.039373 -12.150  < 2e-16 ***
## CAT..MEDV    11.813994   0.647596  18.243  < 2e-16 ***
## riverYes            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.709 on 492 degrees of freedom
## Multiple R-squared:  0.8415, Adjusted R-squared:  0.8373 
## F-statistic: 200.9 on 13 and 492 DF,  p-value: < 2.2e-16
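
Two rows of this output deserve a second look. The NA row for riverYes appears because river is only a relabeled copy of CHAS, and CAT..MEDV is computed from MEDV itself, so keeping it lets the target leak into the predictors. The sketch below drops both and then draws the standard residual diagnostic plots. It rests on assumptions: MASS::Boston stands in for the CSV file (it carries one column the text's file lacks), CAT..MEDV is rebuilt as MEDV above 30 following the text's definition, and mr2 is an illustrative name.

```r
# Stand-in data (assumption): MASS::Boston instead of BostonHousing.csv
housing.df <- MASS::Boston
names(housing.df) <- toupper(names(housing.df))
# Assumption: CAT..MEDV flags tracts whose MEDV exceeds 30
housing.df$CAT..MEDV <- as.integer(housing.df$MEDV > 30)
housing.df$river <- factor(housing.df$CHAS, labels = c("No", "Yes"))

# Drop the leaked column and the redundant copy of CHAS from the predictors
mr2 <- lm(MEDV ~ . - CAT..MEDV - river, data = housing.df)
summary(mr2)

# Residual diagnostics from plot.lm, arranged in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(mr2)
par(old_par)  # restore the previous plotting layout
```

For the West Roxbury data the same pattern applies with TAX as the column to drop, for example TOTAL.VALUE ~ . - TAX.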

Conclusion

There you are: an example of an R Notebook that weaves together text, graphs, and R code using ggplot2 and linear (multiple) regression.

After writing and testing the code, choose the Knit button to knit it to a Word document. You then have an input file and an output file (your Rmd and your Word document), which you may edit to make an insightful and communicative report. Zip all files together into a zip file named “last name, first name, exercise 1.zip” and upload it to LATTE.


Prof. Kamis

Fall 2021
