Complete all Exercises, and submit answers to Questions on the Coursera platform.
This initial quiz will concern exploratory data analysis (EDA) of the Ames Housing dataset. EDA is essential when working with any source of data and helps inform modeling.
First, let us load the data:
load("ames_train.Rdata")
library(tidyverse)
library(grid)
library(gridExtra)
Misc.Feature, Fence, Pool.QC
Misc.Feature, Alley, Pool.QC
Pool.QC, Alley, Fence
Fireplace.Qu, Pool.QC, Lot.Frontage
na_count <- colSums(is.na(ames_train))
head(sort(na_count, decreasing = TRUE), 3)
## Pool.QC Misc.Feature Alley
## 997 971 933
int? Change them to factors when conducting your analysis.
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 81 variables:
## $ PID : int 909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
## $ area : int 856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
## $ price : int 126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
## $ MS.SubClass : int 30 120 30 70 60 85 20 20 20 180 ...
## $ MS.Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
## $ Lot.Frontage : int NA 42 60 80 70 64 60 53 74 35 ...
## $ Lot.Area : int 7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
## $ Lot.Shape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Land.Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
## $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Lot.Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
## $ Land.Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
## $ Neighborhood : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
## $ Condition.1 : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Condition.2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Bldg.Type : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
## $ House.Style : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
## $ Overall.Qual : int 6 5 5 4 8 7 4 7 5 6 ...
## $ Overall.Cond : int 6 5 9 8 6 5 4 5 6 5 ...
## $ Year.Built : int 1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
## $ Year.Remod.Add : int 1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
## $ Roof.Style : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
## $ Roof.Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior.1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
## $ Exterior.2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
## $ Mas.Vnr.Type : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
## $ Mas.Vnr.Area : int 0 149 0 0 0 500 0 20 0 76 ...
## $ Exter.Qual : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
## $ Exter.Cond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
## $ Bsmt.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
## $ Bsmt.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
## $ Bsmt.Exposure : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
## $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
## $ BsmtFin.SF.1 : int 238 552 737 0 643 0 0 0 647 467 ...
## $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
## $ BsmtFin.SF.2 : int 0 393 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : int 618 104 100 405 167 0 936 1146 217 80 ...
## $ Total.Bsmt.SF : int 856 1049 837 405 810 0 936 1146 864 547 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Heating.QC : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
## $ Central.Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ X1st.Flr.SF : int 856 1049 1001 717 810 495 936 1246 889 1072 ...
## $ X2nd.Flr.SF : int 0 0 0 322 855 1427 0 0 0 0 ...
## $ Low.Qual.Fin.SF: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Full.Bath : int 1 1 0 0 1 0 0 0 0 1 ...
## $ Bsmt.Half.Bath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : int 1 2 1 1 2 3 1 2 1 1 ...
## $ Half.Bath : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Bedroom.AbvGr : int 2 2 2 2 3 4 2 2 3 2 ...
## $ Kitchen.AbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
## $ TotRms.AbvGrd : int 4 5 5 6 6 7 4 5 6 5 ...
## $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
## $ Fireplaces : int 1 0 0 0 0 1 0 1 0 0 ...
## $ Fireplace.Qu : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
## $ Garage.Type : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
## $ Garage.Yr.Blt : int 1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
## $ Garage.Finish : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
## $ Garage.Cars : int 2 1 1 1 2 2 2 2 2 2 ...
## $ Garage.Area : int 399 266 216 281 528 672 576 428 484 525 ...
## $ Garage.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Garage.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
## $ Paved.Drive : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
## $ Wood.Deck.SF : int 0 0 154 0 0 0 0 100 0 0 ...
## $ Open.Porch.SF : int 0 105 0 0 45 0 32 24 0 44 ...
## $ Enclosed.Porch : int 0 0 42 168 0 177 112 0 0 0 ...
## $ X3Ssn.Porch : int 0 0 86 0 0 0 0 0 0 0 ...
## $ Screen.Porch : int 166 0 0 111 0 0 0 0 0 0 ...
## $ Pool.Area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Feature : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Val : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mo.Sold : int 3 2 11 5 11 7 2 3 4 5 ...
## $ Yr.Sold : int 2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
## $ Sale.Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
## $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
ames_train$Overall.Qual <- factor(ames_train$Overall.Qual, ordered = TRUE)
ames_train$Overall.Cond <- factor(ames_train$Overall.Cond, ordered = TRUE)
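The two conversions above cover the ordinal ratings. As a minimal sketch of the same idea, the toy data frame below (illustrative values, not rows recomputed from `ames_train`) converts an integer dwelling-class code such as `MS.SubClass` to an unordered factor and a 1-10 rating to an ordered factor. Which integer columns count as categorical is a judgment call based on the data dictionary, and the explicit `levels = 1:10` is an assumption about the rating scale.

```r
# Toy data standing in for ames_train (values are illustrative only)
toy <- data.frame(
  MS.SubClass  = c(30L, 120L, 30L, 70L, 60L),
  Overall.Qual = c(6L, 5L, 5L, 4L, 8L)
)

# Nominal code: the numbers are labels, so use an unordered factor
toy$MS.SubClass <- factor(toy$MS.SubClass)

# Ordinal rating: declare all levels so the ordering is explicit
toy$Overall.Qual <- factor(toy$Overall.Qual, levels = 1:10, ordered = TRUE)

str(toy)

# Ordered factors still support comparisons against a level
above_five <- toy$Overall.Qual > "5"
above_five
```

Declaring the full set of levels up front keeps levels that happen to be absent from a sample, which can matter when the same conversion is applied to a test set.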
StoneBr
Timber
Veenker
NridgHt
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(sdev = sd(price)) %>%
  arrange(desc(sdev))
## # A tibble: 27 x 2
## Neighborhood sdev
## <fct> <dbl>
## 1 StoneBr 123459.
## 2 NridgHt 105089.
## 3 Timber 84030.
## 4 Veenker 72545.
## 5 Crawfor 71268.
## 6 GrnHill 70711.
## 7 Somerst 65199.
## 8 Edwards 54852.
## 9 CollgCr 52786.
## 10 SawyerW 48354.
## # ... with 17 more rows
price?
Lot.Area
Bedroom.AbvGr
Overall.Qual
Year.Built
# Lot.Area
p1 <- ggplot(ames_train, aes(x = Lot.Area, y = price)) +
  geom_point() +
  stat_smooth(method = 'lm')
# Bedroom.AbvGr
p2 <- ggplot(ames_train, aes(x = Bedroom.AbvGr, y = price)) +
  geom_jitter() +
  stat_smooth(method = 'lm')
# Overall.Qual is now an ordered factor, so group = 1 is needed to fit a single line across levels
p3 <- ggplot(ames_train, aes(x = Overall.Qual, y = price)) +
  geom_jitter() +
  stat_smooth(aes(group = 1), method = 'lm')
# Year.Built
p4 <- ggplot(ames_train, aes(x = Year.Built, y = price)) +
  geom_point() +
  stat_smooth(method = 'lm')
grid.arrange(p1, p2, p3, p4, ncol = 2)
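Consider the relationship between price and area. Which of the following variable transformations makes the relationship appear to be the most linear?
Do not transform either price or area
Log-transform price but not area
Log-transform area but not price
Log-transform both price and area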
# No log transform
p51 <- ggplot(ames_train, aes(x = area, y = price)) +
  geom_point() +
  stat_smooth(method = 'lm')
# Log transform area only
p52 <- ggplot(ames_train, aes(x = log(area), y = price)) +
  geom_point() +
  stat_smooth(method = 'lm')
# Log transform price only
p53 <- ggplot(ames_train, aes(x = area, y = log(price))) +
  geom_point() +
  stat_smooth(method = 'lm')
# Log transform both
p54 <- ggplot(ames_train, aes(x = log(area), y = log(price))) +
  geom_point() +
  stat_smooth(method = 'lm')
grid.arrange(p51, p52, p53, p54, ncol = 2)
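# x: number of houses with at least one garage space
x <- ames_train %>%
  filter(Garage.Cars >= 1) %>%
  summarize(Sum = n())
# n: number of houses with a recorded Garage.Cars value
n <- ames_train %>%
  filter(!is.na(Garage.Cars)) %>%
  summarize(Sum = n())
# Posterior is Beta(9 + x, 1 + n - x); the constants 9 and 1 are the prior shape parameters
print(paste("Beta(", 9 + x, ",", 1 + n - x, ")"))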
## [1] "Beta( 962 , 47 )"
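The posterior printed above follows from Beta-Binomial conjugacy: a Beta(a, b) prior combined with x successes in n trials gives a Beta(a + x, b + n - x) posterior. The sketch below assumes the Beta(9, 1) prior implied by the constants in the code. The counts x = 953 and n = 999 are back-calculated from the printed output rather than recomputed from the data, and `beta_update` is a hypothetical helper name.

```r
# Conjugate update: Beta(a, b) prior + x successes out of n trials
beta_update <- function(a, b, x, n) {
  c(shape1 = a + x, shape2 = b + n - x)
}

# Counts inferred from the printed posterior, not recomputed from ames_train
post <- beta_update(a = 9, b = 1, x = 953, n = 999)
post

# Posterior mean and a 95% equal-tailed credible interval
post_mean <- unname(post["shape1"] / sum(post))
ci <- qbeta(c(0.025, 0.975), post["shape1"], post["shape2"])
post_mean
ci
```

Keeping the update in a small function makes it easy to see how sensitive the posterior is to the prior, for example by rerunning it with `a = 1, b = 1`.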
# 1. Over 30 percent of houses were built after the year 1999?
q7_1 <- ames_train %>%
  filter(Year.Built > 1999) %>%
  summarize(Sum = n())
q7_11 <- ames_train %>%
  filter(!is.na(Year.Built)) %>%
  summarize(Sum = n())
print(paste("% of houses were built after the year 1999 is", q7_1/q7_11*100))
## [1] "% of houses were built after the year 1999 is 27.2"
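# 2. The median housing price is greater than the mean housing price?
# Names chosen so they do not shadow the base functions mean() and median()
mean_price <- mean(ames_train$price)
median_price <- median(ames_train$price)
c(median_price, mean_price)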
## [1] 159467.0 181190.1
# 3. 21 houses do not have a basement?
q73 <- ames_train %>%
  filter(Total.Bsmt.SF == 0) %>%
  summarize(Sum = n())
q73
## # A tibble: 1 x 1
## Sum
## <int>
## 1 21
# 4. 4 houses are located on gravel streets?
q74 <- ames_train %>%
  group_by(Street) %>%
  summarize(Sum = n())
q74
## # A tibble: 2 x 2
## Street Sum
## <fct> <int>
## 1 Grvl 3
## 2 Pave 997
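# Indicator: 1 if the house has a garage (Garage.Area > 0), 0 otherwise
ames_train$Has.Garage <- ifelse(ames_train$Garage.Area > 0, 1, 0)
ames_train$Has.Garage <- as.factor(ames_train$Has.Garage)
# Welch two-sample t-test comparing mean area between the two groups
t.test(area ~ Has.Garage, data = ames_train)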
##
## Welch Two Sample t-test
##
## data: area by Has.Garage
## t = -5.134, df = 50.702, p-value = 4.535e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -482.9963 -211.4183
## sample estimates:
## mean in group 0 mean in group 1
## 1145.043 1492.251
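# Prior: Gamma(a, b) with mean lambda = a / b and variance sigma_sq = a / b^2
lambda <- 3
sigma_sq <- 1
b <- lambda / sigma_sq
a <- lambda * b
# Data: total bedrooms and number of houses among those with area > 2000
sum_x <- ames_train %>% filter(area > 2000) %>% summarise(sum_x = sum(Bedroom.AbvGr))
n <- ames_train %>% filter(area > 2000) %>% summarise(n = n())
# Posterior: Gamma(a + sum_x, b + n)
a_star <- a + sum_x
b_star <- b + n
# Posterior mean and standard deviation
lambda_star <- a_star / b_star
sigma_star <- sqrt(a_star / b_star^2)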
lambda_star
## sum_x
## 1 3.617021
sigma_star
## sum_x
## 1 0.1601644
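The update above is the Gamma-Poisson conjugate rule: a Gamma(a, b) prior (shape a, rate b) with mean a/b and variance a/b^2, combined with n Poisson counts summing to sum_x, gives a Gamma(a + sum_x, b + n) posterior. The sketch below uses plain numeric inputs. The values sum_x = 501 and n = 138 are back-calculated from the printed posterior mean and standard deviation, not recomputed from the data, and `gamma_update` is a hypothetical helper name.

```r
# Conjugate update for a Poisson rate with a Gamma prior
# specified through its mean and variance
gamma_update <- function(prior_mean, prior_var, sum_x, n) {
  b <- prior_mean / prior_var   # prior rate
  a <- prior_mean * b           # prior shape
  a_star <- a + sum_x           # posterior shape
  b_star <- b + n               # posterior rate
  c(mean = a_star / b_star, sd = sqrt(a_star) / b_star)
}

# Counts inferred from the printed output, not recomputed from ames_train
post <- gamma_update(prior_mean = 3, prior_var = 1, sum_x = 501, n = 138)
round(post, 4)
```

Working with numeric scalars instead of one-row data frames also avoids the stray `sum_x` column label that appears in the printed results above.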
When regressing \(\log\)(price) on \(\log\)(area), there are some outliers. Which of the following do the three most outlying points have in common?
fit <- lm(log(price) ~ log(area), data = ames_train)
par(mfrow = c(2,2))
plot(fit)
# Use the MASS package to extract standardized residuals from a linear model
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
par(mfrow = c(1,1))
ames_train$stdres <- stdres(fit)
ames_train$stdres_abs <- abs(stdres(fit))
ames_train$stdres_abs_q10 <- abs(stdres(fit)) > 3
ames_train[which(ames_train$stdres_abs_q10 == TRUE),
           c('Bedroom.AbvGr', 'Overall.Qual', 'Year.Built', 'Sale.Condition',
             'stdres_abs')] %>%
  arrange(desc(stdres_abs)) %>%
  head(n = 3)
## # A tibble: 3 x 5
## Bedroom.AbvGr Overall.Qual Year.Built Sale.Condition stdres_abs
## <int> <ord> <int> <fct> <dbl>
## 1 2 2 1923 Abnorml 7.37
## 2 3 4 1920 Normal 4.83
## 3 3 4 1910 Abnorml 4.43
# Alternatively, the `broom` package gives the same result
library(broom)
fit_aug <- augment(fit)
ames_train$stdres <- fit_aug$.std.resid
ames_train$stdres_abs <- abs(fit_aug$.std.resid)
# Flag observations whose absolute standardized residual exceeds 3
ames_train$stdres_gt3 <- abs(fit_aug$.std.resid) > 3
ames_train[which(ames_train$stdres_gt3 == TRUE),
           c('Bedroom.AbvGr', 'Overall.Qual', 'Year.Built', 'Sale.Condition',
             'stdres_abs')] %>%
  arrange(desc(stdres_abs)) %>%
  head(n = 3)
## # A tibble: 3 x 5
## Bedroom.AbvGr Overall.Qual Year.Built Sale.Condition stdres_abs
## <int> <ord> <int> <fct> <dbl>
## 1 2 2 1923 Abnorml 7.37
## 2 3 4 1920 Normal 4.83
## 3 3 4 1910 Abnorml 4.43
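Both approaches above compute internally standardized residuals. As a hedged aside, base R's `rstandard()` returns the same quantity for an `lm` fit, which avoids attaching MASS and its masking of `dplyr::select`. The sketch below uses the built-in `mtcars` data as a stand-in for `ames_train` (the model and column names are illustrative only) and checks the result against the usual formula e_i / (s * sqrt(1 - h_ii)).

```r
# Stand-in model on a built-in dataset (not the Ames data)
fit_toy <- lm(log(mpg) ~ log(wt), data = mtcars)

# Standardized residuals from base R
std_res <- rstandard(fit_toy)

# Same quantity computed by hand: residual / (sigma * sqrt(1 - leverage))
s <- summary(fit_toy)$sigma
manual <- residuals(fit_toy) / (s * sqrt(1 - hatvalues(fit_toy)))
all.equal(std_res, manual)

# Row positions of the three most outlying points by absolute standardized residual
outlying <- order(abs(std_res), decreasing = TRUE)[1:3]
mtcars[outlying, c("mpg", "wt")]
```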
price if used as a dependent variable in a linear regression?
price is right-skewed.
price cannot take on negative values.
price can only take on integer values.
ggplot(ames_train, aes(x = price)) +
  geom_histogram(bins = 30)
(Bldg.Type = 1Fam)
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(mean.Bldg.Type = mean(Bldg.Type == "1Fam")) %>%
  filter(mean.Bldg.Type == 1)
## # A tibble: 3 x 2
## Neighborhood mean.Bldg.Type
## <fct> <dbl>
## 1 ClearCr 1
## 2 NoRidge 1
## 3 Timber 1
(area) and the number of bedrooms above ground (Bedroom.AbvGr)?
ggplot(ames_train, aes(x = Bedroom.AbvGr, y = log(area))) +
  geom_jitter()
ames_train %>%
  filter(!is.na(Bsmt.Unf.SF), Bsmt.Unf.SF > 0) %>%
  summarize(mean = mean(Bsmt.Unf.SF))
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 595.