Seattle Home Prices. Home prices are often modeled using the square footage of a home, the number of bathrooms, and other characteristics. One interesting question is whether, even after taking these characteristics into account, there is evidence of differences in listing prices between Realtors. Is it realistic to assume that some Realtors tend to undervalue or overvalue their listed properties? This dataset has information on the sales prices of 36 homes listed in the Seattle area. Of these homes, 28 were listed by one Realtor and 8 were listed by a different Realtor. The dataset also includes the listing price in thousands of dollars, the total square feet, the price per square foot, the Realtor indicator, the number of bedrooms, and the number of bathrooms.
Price ($000): The listing price of the home, in thousands of dollars
library(car)
library(leaps)
library(ggplot2)
# Set working directory
setwd("~/OneDrive - The University of Colorado Denver/BANA 6610/Homework 5")
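# Read the home data into R; stringsAsFactors = TRUE keeps Realtor as a factor
# (needed in R >= 4.0 for the summary() and pairs() calls below)
seahomes <- read.csv("realestatedata-1.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)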
head(seahomes)
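# Scatterplot matrix of all variables
pairs(seahomes)
# Price vs. square footage, colored by Realtor (qplot() is deprecated in ggplot2 3.4.0)
ggplot(seahomes, aes(x = Square.Feet, y = Price, colour = Realtor)) + geom_point()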
#Histogram of square footage
hist(seahomes$Square.Feet)
#Histogram of price
hist(seahomes$Price)
summary(seahomes)
Price           Square.Feet     Price.SqFt       Realtor  Bedrooms        Bathrooms
Min.   :165.0   Min.   : 868    Min.   :0.1164   A:28     Min.   :1.000   Min.   :1.000
1st Qu.:281.8   1st Qu.:1535    1st Qu.:0.1601   B: 8     1st Qu.:3.000   1st Qu.:2.000
Median :386.5   Median :1952    Median :0.2070            Median :3.000   Median :2.000
Mean   :407.4   Mean   :1940    Mean   :0.2167            Mean   :3.111   Mean   :2.167
3rd Qu.:575.0   3rd Qu.:2345    3rd Qu.:0.2485            3rd Qu.:4.000   3rd Qu.:3.000
Max.   :625.0   Max.   :3260    Max.   :0.4437            Max.   :4.000   Max.   :3.000
# Linear regression model
seamod <- lm(Price ~ Square.Feet + Price.SqFt + Bedrooms + Bathrooms, data = seahomes)
summary(seamod)
Call:
lm(formula = Price ~ Square.Feet + Price.SqFt + Bedrooms + Bathrooms,
data = seahomes)
Residuals:
Min 1Q Median 3Q Max
-82.741 -14.273 5.019 18.186 47.703
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -347.91023 38.80814 -8.965 4.07e-10 ***
Square.Feet 0.18435 0.01159 15.912 < 2e-16 ***
Price.SqFt 1696.37258 85.25337 19.898 < 2e-16 ***
Bedrooms 25.15235 10.72116 2.346 0.0255 *
Bathrooms -22.21939 10.35955 -2.145 0.0399 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 31.33 on 31 degrees of freedom
Multiple R-squared: 0.9604, Adjusted R-squared: 0.9553
F-statistic: 188 on 4 and 31 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(seamod)
Based on the EDA we performed, do you have any concerns regarding the model we built in the previous part? Why or why not?
The R-squared and adjusted R-squared values look very good. However, I’m concerned about three things: (a) Square.Feet and Price.SqFt may be collinear, and Price.SqFt is itself computed from Price, the response, which would inflate the apparent fit; (b) Price is not normally distributed, and outliers may be skewing the results; (c) the scatter plot of Price vs. Price.SqFt does not show a strong linear relationship.
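Concern (a) can be checked numerically with variance inflation factors (VIFs). The sketch below computes them by hand in base R on simulated data; the columns `sqft`, `bedrooms`, and `bathrooms` and their values are hypothetical stand-ins, not the homework dataset. Each VIF is 1 / (1 - R²), where R² comes from regressing one predictor on all the others.

```r
set.seed(1)
n <- 36

# Hypothetical predictors: bedrooms and bathrooms loosely track square footage
sqft      <- runif(n, 900, 3200)
bedrooms  <- round(sqft / 700 + rnorm(n, sd = 0.5))
bathrooms <- round(sqft / 900 + rnorm(n, sd = 0.5))
X <- data.frame(sqft, bedrooms, bathrooms)

# VIF for each predictor: regress it on the remaining predictors
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
vif_manual
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign of problematic collinearity. For the fitted models in this document, `vif(seamod)` from the already loaded car package should report the same quantity.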
Consider adding the categorical variable (and remove price/sq. foot, as well as any other variables you believe should be removed based on the previous questions). How many coefficients will be added to your model to incorporate Realtor?
Realtor has two levels, so including it adds a single coefficient (one indicator variable) to the model. The indicator RealtorA_IND is 1 if the home was listed by Realtor A and 0 otherwise, which makes Realtor B the base level.
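As a rough illustration of how R would build this indicator from a factor, the sketch below uses a toy vector with the 28/8 split from the dataset description. The variable names `realtor` and `realtor_b_base` are hypothetical. By default R uses the first factor level as the base, so `relevel()` is used here to get an indicator that is 1 for Realtor A.

```r
# Toy factor with the same split as the dataset: 28 homes for A, 8 for B
realtor <- factor(rep(c("A", "B"), times = c(28, 8)))

# Default treatment coding: first level ("A") is the base, indicator is for B
X_default <- model.matrix(~ realtor)
colnames(X_default)

# With "B" as the base level, the single indicator column is 1 for Realtor A
realtor_b_base <- relevel(realtor, ref = "B")
X_a_ind <- model.matrix(~ realtor_b_base)
colnames(X_a_ind)
sum(X_a_ind[, 2])  # number of homes flagged as Realtor A
```

Either way, a two-level factor contributes exactly one column besides the intercept.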
# Read the revised data, in which Realtor is coded as the 0/1 indicator RealtorA_IND
seahomes2 <- read.csv("realestatedata-2.csv", header = TRUE, sep = ",")
head(seahomes2)
Next we create a new model with the revised dataset:
# Linear regression model on all remaining predictors
seamod2 <- lm(Price ~ ., data = seahomes2)
summary(seamod2)
Call:
lm(formula = Price ~ ., data = seahomes2)
Residuals:
Min 1Q Median 3Q Max
-168.303 -54.720 -8.842 59.767 158.621
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 386.76580 65.34965 5.918 1.55e-06 ***
Square.Feet 0.16186 0.03178 5.093 1.65e-05 ***
RealtorA_IND -196.84035 36.98065 -5.323 8.51e-06 ***
Bedrooms -26.09857 28.09336 -0.929 0.360
Bathrooms -27.28321 28.63452 -0.953 0.348
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 84.03 on 31 degrees of freedom
Multiple R-squared: 0.7152, Adjusted R-squared: 0.6784
F-statistic: 19.46 on 4 and 31 DF, p-value: 4.251e-08
par(mfrow=c(2,2))
plot(seamod2)
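To make the RealtorA_IND coefficient concrete, the sketch below plugs the estimates from the summary(seamod2) output above into the fitted equation. The home used for the comparison (2000 square feet, 3 bedrooms, 2 bathrooms) and the helper `predict_price` are hypothetical and chosen only for illustration.

```r
# Coefficient estimates copied from the summary(seamod2) output
b <- c(intercept = 386.76580, sqft = 0.16186, realtorA = -196.84035,
       bedrooms = -26.09857, bathrooms = -27.28321)

# Fitted price (in $000) for a home with the given characteristics
predict_price <- function(sqft, realtor_a, bedrooms, bathrooms) {
  unname(b["intercept"] + b["sqft"] * sqft + b["realtorA"] * realtor_a +
           b["bedrooms"] * bedrooms + b["bathrooms"] * bathrooms)
}

# Hypothetical home, identical except for the listing Realtor
price_a <- predict_price(2000, 1, 3, 2)  # listed by Realtor A
price_b <- predict_price(2000, 0, 3, 2)  # listed by Realtor B
price_a - price_b                        # equals the RealtorA_IND coefficient
```

Holding the other predictors fixed, the difference between the two fitted prices is the RealtorA_IND estimate itself, in thousands of dollars.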
Consider a problem of predicting subsequent sales for movies based on Box Office gross sales and other features of the movie. The data file Movies.xls contains information for over 200 movies released during 1998 and 2001, regarding the following variables:
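# Read the movie data into R; stringsAsFactors = TRUE keeps Rating and Genre as factors
# (needed in R >= 4.0 for the levels() call below)
movies <- read.csv("movies.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
head(movies)
# Linear regression of Sales on Rating
movielm <- lm(Sales ~ Rating, data = movies)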
summary(movielm)
Call:
lm(formula = Sales ~ Rating, data = movies)
Residuals:
Min 1Q Median 3Q Max
-1.3482 -0.8533 -0.2884 0.6342 4.5717
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6186 0.2406 2.571 0.01080 *
RatingPG 0.5298 0.2935 1.805 0.07245 .
RatingR 0.7997 0.2559 3.125 0.00202 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.103 on 221 degrees of freedom
Multiple R-squared: 0.04602, Adjusted R-squared: 0.03739
F-statistic: 5.33 on 2 and 221 DF, p-value: 0.005484
par(mfrow=c(2,2))
plot(movielm)
pairs(movies)
We can check for autocorrelation using a Durbin-Watson test. Here the statistic is 1.99, essentially 2, with a lag-1 autocorrelation near zero and a p-value of 0.924, so there is no evidence of autocorrelation in the residuals:
durbinWatsonTest(movielm)
lag Autocorrelation D-W Statistic p-value
1 -0.000130451 1.992605 0.924
Alternative hypothesis: rho != 0
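The Durbin-Watson statistic can be reproduced by hand from a model's residuals, which helps when reading the output. The sketch below does this in base R on simulated data with independent errors; the variables `x`, `y`, and `fit` are hypothetical and unrelated to the movie data. Values near 2 indicate little lag-1 autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.

```r
set.seed(42)

# Hypothetical regression with independent errors
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200)
fit <- lm(y ~ x)

e <- residuals(fit)

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 residual autocorrelation
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

dw
2 * (1 - r1)  # approximately equal to dw
```

The approximation dw ≈ 2(1 - r1) is why a lag-1 autocorrelation close to zero goes together with a statistic close to 2.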
levels(movies$Genre)
[1] "Action/Adventure" "Animation" "Comedy" "Drama" "Family" "Horror/Thriller" "Mystery" "Romance"
[9] "Sci-Fi/Fantasy"
Genre has 9 levels. Therefore, we would need to establish a base level and create 8 dummy variables, one for each of the non-base levels.
9. (i) How many error degrees of freedom would you have for a model that includes Rating and Genre as explanatory variables?
With 2 dummy variables for Rating and 8 dummy variables for Genre, we have k = 10 explanatory variables in total. The error degrees of freedom are n - k - 1; with n = 224 and k = 10, this gives 224 - 10 - 1 = 213.
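The count can be double-checked by fitting a model of the same shape and asking R for its residual degrees of freedom. The sketch below uses simulated placeholder data with 224 rows, a 3-level factor, and a 9-level factor; the level labels and the `sales` values are hypothetical and are not taken from the movie file.

```r
set.seed(7)
n <- 224

# Hypothetical labels: 3 Rating levels and 9 Genre levels,
# arranged so that every Rating occurs within every Genre
rating <- factor(rep_len(c("G", "PG", "R"), n))
genre  <- factor(rep(paste0("genre", 1:9), each = 25, length.out = n))
sales  <- rnorm(n)  # placeholder response

fit <- lm(sales ~ rating + genre)

length(coef(fit)) - 1  # number of dummy variables, k
df.residual(fit)       # error degrees of freedom, n - k - 1
```

With both factors in the model, R estimates an intercept plus 2 + 8 dummy coefficients, which leaves 213 error degrees of freedom.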