Regression lets you predict the values of a response variable from known values of explanatory variables. Which variable you use as the response variable depends on the question you are trying to answer, but in many datasets there will be an obvious choice for variables that would be interesting to predict. You’ll explore a dragon real estate dataset with 14 variables.
Variable Information:
- CRIM per capita crime rate by town.
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town.
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- NOX nitric oxides concentration (parts per 10 million).
- RM average number of rooms per dwelling.
- AGE proportion of owner-occupied units built prior to 1940.
- DIS weighted distances to five Boston employment centres.
- RAD index of accessibility to radial highways.
- TAX full-value property-tax rate per $10,000.
- PTRATIO pupil-teacher ratio by town.
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
- LSTAT % lower status of the population.
- MEDV Median value of owner-occupied homes in $1000’s.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
If a package is not available, you will need to install it first with install.packages().
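One way to handle a missing package is sketched below. It relies only on the base R functions `requireNamespace()` and `install.packages()`; the helper name `install_if_missing` is hypothetical and not part of any package.

```r
# Hypothetical helper: install only the packages that are not yet available
install_if_missing <- function(pkgs) {
  # TRUE for each package that can already be loaded
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!is_installed]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)  # needs an internet connection
  }
  # Return the names that had to be installed, without printing them
  invisible(missing_pkgs)
}
```

Calling `install_if_missing(c("ggplot2", "dplyr"))` once before the `library()` calls should be enough; packages that are already installed are left untouched.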
dragron_real_estate <- read.csv("data/realestate-data.csv")
Type View(dragron_real_estate) in the console to view the dataset, and decide which variable would make a good response variable.
Predicting prices is a common business task, so house price (MEDV) makes a good response variable.
Before you run any statistical model, it’s usually a good idea to visualize your dataset. Here, we’ll look at the relationship between the median value of owner-occupied homes (MEDV) and the weighted distance to five Boston employment centres (DIS), using the dragon real estate dataset.
dragron_real_estate is available and ggplot2 is loaded.
Using dragron_real_estate, draw a scatter plot of MEDV (y-axis) versus DIS (x-axis).
dragron_real_estate %>%
  ggplot(aes(x = DIS, y = MEDV)) +
  geom_point()
Update the plot to make the points 50% transparent by setting alpha to 0.5.
# Make points 50% transparent
dragron_real_estate %>%
  ggplot(aes(x = DIS, y = MEDV)) +
  geom_point(alpha = 0.5)
Update the plot by adding a trend line, calculated using a linear regression. You can omit the confidence ribbon.
# Add a linear trend line without a confidence ribbon
dragron_real_estate %>%
  ggplot(aes(x = DIS, y = MEDV)) +
  geom_point(alpha = 0.5) +
  geom_smooth(
    method = "lm",
    se = FALSE
  )
## `geom_smooth()` using formula = 'y ~ x'
Scholarly scatter plotting! Scatter plots are the standard way to visualize the relationship between two numeric variables, and ggplot makes adding linear trend lines easy.
A linear regression with a single explanatory variable fits a straight line to the data. Straight lines are defined by two properties: their intercept and their slope.
The intercept is the y-value when x equals zero. The slope is the change in y divided by the corresponding change in x, that is, how much y changes when x increases by one.
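As a small worked example of these two definitions, the sketch below recovers the slope and intercept of a line from two points. The points are made up for illustration and have nothing to do with the real estate data.

```r
# Two points on a hypothetical straight line
x <- c(0, 4)
y <- c(3, 11)

# Slope: change in y divided by change in x
slope <- diff(y) / diff(x)

# Intercept: the y-value at x = 0, recovered from the first point
intercept <- y[1] - slope * x[1]

slope      # 2
intercept  # 3
```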
While ggplot can display a linear regression trend line using geom_smooth(), it doesn’t give you access to the intercept and slope as variables, or allow you to work with the model results as variables. That means that sometimes you’ll need to run a linear regression yourself.
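To illustrate what working with model results as variables can look like, here is a minimal sketch on toy data. The data frame `toy` and every name starting with `toy_` are hypothetical; `lm()` and `coef()` are standard functions from the stats package.

```r
# Toy data (hypothetical values) that lie exactly on the line y = 2 + 3x
toy <- data.frame(x = 1:5, y = 2 + 3 * (1:5))

# Fit the model and keep the result in a variable
toy_model <- lm(y ~ x, data = toy)

# coef() returns a named numeric vector of the fitted coefficients
toy_coefs <- coef(toy_model)
toy_intercept <- toy_coefs[["(Intercept)"]]
toy_slope <- toy_coefs[["x"]]
```

Because the toy points sit exactly on a line, the fitted intercept and slope come out as 2 and 3 up to floating-point error. The same pattern should apply to any model object returned by `lm()`.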
Time to run your first model!
Run a linear regression with MEDV as the response variable, DIS as the explanatory variable, and dragron_real_estate as the dataset.
# Run a linear regression of MEDV vs. DIS
lm(MEDV ~ DIS, data=dragron_real_estate)
##
## Call:
## lm(formula = MEDV ~ DIS, data = dragron_real_estate)
##
## Coefficients:
## (Intercept)          DIS
##      18.690        1.055
The model had an (Intercept) coefficient of 18.690. What does this mean?
The model had a DIS coefficient of 1.055. What does this mean?
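The intercept, 18.690, is the predicted MEDV when DIS is zero. Since MEDV is measured in $1000s, the model predicts a median home value of about $18,690 for a tract at zero weighted distance from the five Boston employment centres. The DIS coefficient, 1.055, is the slope: each one-unit increase in weighted distance is associated with an increase of about 1.055 in MEDV, or roughly $1,055. Both coefficients are positive, so the predicted price is positive at DIS = 0 and rises as the distance from the employment centres increases.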
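The interpretation can be checked by applying the fitted line by hand. The sketch below uses the two coefficients printed in the model output above; the DIS values are arbitrary and chosen only for illustration.

```r
# Coefficients copied from the printed model output
intercept <- 18.690
slope_dis <- 1.055

# Hypothetical DIS values to predict for
dis <- c(0, 2, 4)

# Predicted MEDV (in $1000s) = intercept + slope * DIS
predicted_medv <- intercept + slope_dis * dis
predicted_medv
```

At DIS = 0 the prediction equals the intercept, and each extra unit of DIS adds 1.055 to the predicted MEDV.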