The main goals of this assignment are to review your knowledge of R and to learn a bit about R Notebooks. This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. This document has three purposes:
Use this Notebook (Rmd) file as a guide to complete the assignment. You may use Base R or Tidyverse for your data management.
This document first uses the BostonHousing.csv data file from Chapter 3 of our text, and demonstrates a series of tasks and code chunks that you can use to complete the following tasks for a different dataset: WestRoxbury.csv.
This is an individual assignment; if you need help, talk to the TAs, but not to classmates.
Your assignment is to create an attractive, insightful, professional-quality report using an R Notebook in which you:
Note that your comments and written explanations are as important as the correctness of the two simple graphs. (See syllabus for elaboration and further guidance on report writing.)
Before going too far into this exercise, please review the online materials at R Markdown. The pages at that site explain the general ideas, and also show specific syntax and commands for formatted text and for code chunks.
I highly recommend using RStudio’s “cheatsheets” for help with Markdown, Notebooks, ggplot2, dplyr and other useful packages. See the Help menu:
Unlike the textbook, which loads packages as needed within a script, I recommend loading packages (with the library command) early in a script. Either way works, but placing them all together allows a reader to find them easily.
Since library often generates messages and warnings that won’t add anything to the finished document, I recommend adding warning = FALSE and message = FALSE to the chunk options:
library(ggplot2)
# put other libraries here
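As a rough sketch of where those options go: in the .Rmd source they sit inside the chunk header, and a similar effect is available in plain R through suppressPackageStartupMessages(). The chunk label load-packages below is a hypothetical name chosen for illustration.

```r
# In the .Rmd source, the options go in the chunk header, for example:
#   {r load-packages, warning = FALSE, message = FALSE}
# ("load-packages" is a hypothetical chunk label.)

# Outside a notebook, startup messages can be silenced one package at a time:
suppressPackageStartupMessages(library(ggplot2))
```

The chunk options hide messages and warnings only in the knitted output; the code itself still runs as usual.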
Boston Housing Example (adapted from Section 3.3 of our text)
The BostonHousing data is published at the University of California, Irvine Machine Learning Repository and is also available at our text’s publisher website. The original publication source is given in a footnote on p. 57. Each observation is one census tract in Boston.
All variables (columns) in the Boston Housing data are defined in Table 3.1 on p. 58 of the text. The West Roxbury dataset is documented in Table 2.1 on p. 23.
housing.df <- read.csv("BostonHousing.csv")
#housing.df <- read.csv("WestRoxbury.csv")
head(housing.df,9) # top 9 rows of data, as in Table 3.2
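After reading a file, it can help to check how the column names came through. The sketch below assumes the CSV header spells one column as "CAT. MEDV" (with a space); read.csv() would then rename it to CAT..MEDV, which is the name that appears in the regression output later in this document. The tiny file and its values are made up for illustration.

```r
# Write a tiny stand-in CSV (made-up values) to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("MEDV,CAT. MEDV", "24.5,0", "31.2,1"), tmp)

# Default behaviour: check.names = TRUE makes names syntactically valid,
# so the space in "CAT. MEDV" becomes a second dot
toy.df <- read.csv(tmp)
names(toy.df)

# check.names = FALSE keeps the header text exactly as written
raw.df <- read.csv(tmp, check.names = FALSE)
names(raw.df)

# str() gives a compact view of column types and the first few values
str(toy.df)
```

Running names() and str() on housing.df right after read.csv() is a quick way to confirm the variable names before using them in plots or formulas.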
In the chapter, Figures 3.1 and 3.2 show some simple graphs, with code for both the base R plotting functions and the ggplot method from library ggplot2. Here we just use ggplot2, which I will encourage us to adopt as a standard for most of the course. This code chunk is adapted from the textbook, first creating a scatterplot similar to the one in the upper right of Figure 3.1. For added information, I’ve colored the points to indicate which tracts border the Charles River.
In Figure 3.2, we find two distributional graphs of MEDV. We start with a histogram:
# fill and binwidth are arguments of geom_histogram(); ggplot() silently ignores them
hist <- ggplot(data = housing.df, aes(x = MEDV)) +
  geom_histogram(fill = "darkblue", binwidth = 5) +
  ggtitle("Median Values of Boston Housing") +
  xlab("Median Home Value (000s)")
hist
Let’s look at boxplots of Median Values for properties that do and do not bound the Charles River. For improved labels, let’s create a new factor for the second variable, and assign descriptive level names:
housing.df$river <- factor(housing.df$CHAS, labels=c("No", "Yes"))
bp <- ggplot(housing.df) +
geom_boxplot(aes(x=river, y=MEDV)) +
xlab("Does Tract Bound the Charles River?") +
ylab("Median Value (000s)") +
ggtitle("Does Bordering the River Affect Housing Value with statistical significance?")
bp
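The relabeling step can be checked on a small scale before applying it to the full data. This is a minimal sketch with made-up 0/1 values; it also spells out levels explicitly, which is an optional safeguard so that "No" is always attached to 0 and "Yes" to 1.

```r
# Made-up indicator values standing in for CHAS
chas <- c(0, 1, 0, 0, 1)

# Map 0 -> "No" and 1 -> "Yes"; stating levels makes the pairing explicit
river <- factor(chas, levels = c(0, 1), labels = c("No", "Yes"))

# Cross-tabulate the original codes against the new labels as a sanity check
table(chas, river)
```

A cross-tabulation like this on housing.df$CHAS and housing.df$river should show counts only on the diagonal.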
# Create the plot object, p, in layers and then display
p <- ggplot(housing.df, aes(x=LSTAT, y=MEDV, color=as.factor(CHAS))) +
geom_point(alpha = 0.7) # alpha controls the transparency of the points
p + ggtitle("Boston Median Value \nby Percent Low Income & Proximity to River")
We may benefit from adding a smoother to the plot:
p + geom_smooth(method=loess) +
ggtitle("Boston Median Value by Percent Low Income & Proximity to River\nDoes Charles proximity make a difference?")
## `geom_smooth()` using formula 'y ~ x'
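Because the color aesthetic above is as.factor(CHAS), the legend title and entries show that expression and the raw 0/1 codes. One possible refinement, sketched here on a small made-up data frame, is to map color to a labeled factor and set the axis and legend titles with labs(). The object names toy.df and p2 are hypothetical.

```r
library(ggplot2)

# Made-up values for illustration only
toy.df <- data.frame(
  LSTAT = c(5, 8, 12, 15, 20, 25),
  MEDV  = c(34, 29, 24, 20, 16, 13),
  CHAS  = c(0, 1, 0, 1, 0, 1)
)
toy.df$river <- factor(toy.df$CHAS, levels = c(0, 1), labels = c("No", "Yes"))

# Color by the labeled factor and give every scale a readable title
p2 <- ggplot(toy.df, aes(x = LSTAT, y = MEDV, color = river)) +
  geom_point(alpha = 0.7) +
  labs(x = "Percent Low Income (LSTAT)",
       y = "Median Value (000s)",
       color = "Bounds Charles River?")
p2
```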
#Model some regressions, starting with Boston Housing data and ending with West Roxbury data.
mr <- lm( MEDV ~ . , data = housing.df) # Boston Housing
#mr <- lm( TOTAL.VALUE ~ . , data = housing.df) # West Roxbury
summary(mr)
##
## Call:
## lm(formula = MEDV ~ ., data = housing.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8156 -1.9975 -0.2335 1.6757 16.0932
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.954458 3.816870 11.254 < 2e-16 ***
## CRIM -0.129678 0.025517 -5.082 5.32e-07 ***
## ZN -0.005113 0.011103 -0.460 0.645396
## INDUS 0.114290 0.048362 2.363 0.018506 *
## CHAS 2.359846 0.673138 3.506 0.000497 ***
## NOX -15.362403 2.983384 -5.149 3.79e-07 ***
## RM 1.058350 0.354782 2.983 0.002995 **
## AGE -0.006162 0.010319 -0.597 0.550689
## DIS -0.733482 0.161312 -4.547 6.86e-06 ***
## RAD 0.205249 0.051933 3.952 8.88e-05 ***
## TAX -0.009369 0.002944 -3.182 0.001554 **
## PTRATIO -0.558002 0.104307 -5.350 1.35e-07 ***
## LSTAT -0.478377 0.039373 -12.150 < 2e-16 ***
## CAT..MEDV 11.813994 0.647596 18.243 < 2e-16 ***
## riverYes NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.709 on 492 degrees of freedom
## Multiple R-squared: 0.8415, Adjusted R-squared: 0.8373
## F-statistic: 200.9 on 13 and 492 DF, p-value: < 2.2e-16
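Two rows in this output deserve a comment. The riverYes row is NA because river was built from CHAS earlier in the notebook, so the two columns carry identical information and only one coefficient can be estimated. Also, assuming CAT..MEDV is the indicator of MEDV above 30 that Table 3.1 describes, it is derived from the outcome itself and inflates the fit. The sketch below reproduces the NA on made-up data and shows one way to leave both columns out of the formula. All values and the names toy.df, fit_all and fit_clean are hypothetical.

```r
# Made-up values for illustration only
toy.df <- data.frame(
  MEDV  = c(22, 35, 18, 31, 27, 15, 40, 24),
  LSTAT = c(10, 4, 17, 6, 8, 22, 3, 12),
  CHAS  = c(0, 1, 0, 0, 1, 0, 1, 0)
)
toy.df$CAT..MEDV <- as.integer(toy.df$MEDV > 30)  # derived from the outcome
toy.df$river <- factor(toy.df$CHAS, levels = c(0, 1), labels = c("No", "Yes"))

# With every column included, river duplicates CHAS and its coefficient is NA
fit_all <- lm(MEDV ~ ., data = toy.df)
coef(fit_all)

# Subtracting terms in the formula drops the redundant and derived columns
fit_clean <- lm(MEDV ~ . - river - CAT..MEDV, data = toy.df)
coef(fit_clean)
```

The same formula pattern, MEDV ~ . - river - CAT..MEDV, could be tried on housing.df; the coefficients and R-squared would then differ from the output shown above.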
There you are: an example of an R Notebook that weaves together text, graphs, and R code using ggplot2 and multiple linear regression.
After writing and testing the code, choose the Knit button to knit it to a Word document. You then have an input file and an output file, your .Rmd and your Word document, which you may edit to make an insightful and communicative report. Zip all files together into a zipfile named “last name, first name, exercise 1.zip” and submit it via upload to LATTE.