README: Using the multiple linear regression demo from class, complete an analysis of this dataset that tries to predict median_house_value.
#0.[-5pts] Load Libraries. Rename the file to include your name (you lose 5 pts if you don't).
library(ggplot2)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from
  +.gg   ggplot2
hp <- read.csv("https://raw.githubusercontent.com/jacopomazzoni/DIDA/main/week12/cali_housing.csv")
summary(hp)
longitude       latitude       housing_median_age  total_rooms    total_bedrooms  population
Min.   :-124.3  Min.   :32.54  Min.   : 1.00       Min.   :    2  Min.   :   1.0  Min.   :    3
1st Qu.:-121.8  1st Qu.:33.93  1st Qu.:18.00       1st Qu.: 1448  1st Qu.: 296.0  1st Qu.:  787
Median :-118.5  Median :34.26  Median :29.00       Median : 2127  Median : 435.0  Median : 1166
Mean   :-119.6  Mean   :35.63  Mean   :28.64       Mean   : 2636  Mean   : 537.9  Mean   : 1425
3rd Qu.:-118.0  3rd Qu.:37.71  3rd Qu.:37.00       3rd Qu.: 3148  3rd Qu.: 647.0  3rd Qu.: 1725
Max.   :-114.3  Max.   :41.95  Max.   :52.00       Max.   :39320  Max.   :6445.0  Max.   :35682
                                                                  NA's   :207

households      median_income    median_house_value  ocean_proximity
Min.   :   1.0  Min.   : 0.4999  Min.   : 14999      Length:20640
1st Qu.: 280.0  1st Qu.: 2.5634  1st Qu.:119600      Class :character
Median : 409.0  Median : 3.5348  Median :179700      Mode  :character
Mean   : 499.5  Mean   : 3.8707  Mean   :206856
3rd Qu.: 605.0  3rd Qu.: 4.7432  3rd Qu.:264725
Max.   :6082.0  Max.   :15.0001  Max.   :500001
hp <- na.omit(hp)
ggpairs(hp, columns = 3:9, progress = FALSE, lower=list(combo=wrap("facethist", binwidth=0.8)) )
#1.[10pts] Split it into a testing and training set, as before.
set.seed(210191)
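One way the split could look, as a minimal sketch. The data frame `toy` is a simulated stand-in for `hp` (not the real data) so the snippet runs on its own, and the 80/20 ratio and the names `train_idx`, `train`, and `test` are assumptions; use whatever ratio the class demo used.

```r
# Simulated stand-in for `hp` (NOT the real data), so the sketch runs on its own.
set.seed(210191)
n <- 200
toy <- data.frame(median_income      = runif(n, 0.5, 15),
                  housing_median_age = sample(1:52, n, replace = TRUE))
toy$median_house_value <- 40000 * toy$median_income +
  1000 * toy$housing_median_age + rnorm(n, sd = 50000)

# Draw 80% of the row indices at random for training (the ratio is an assumption).
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]   # the remaining rows form the test set

# Every row lands in exactly one of the two sets.
c(train = nrow(train), test = nrow(test))
```

With the real data, replacing `toy` by `hp` should be the only change needed.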
#2.[5pts] Generate the model:
model <-
Error: Incomplete expression: model <-
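The incomplete `model <-` above needs a right-hand side. A minimal sketch of one possibility is a plain `lm()` fit on the training rows. Here `toy`, `train`, and `test` are hypothetical names built from simulated data (not `hp`), and using every remaining column as a predictor (`~ .`) is an assumption.

```r
# Simulated stand-in for `hp` (NOT the real data), split 80/20 as an assumption.
set.seed(210191)
n <- 200
toy <- data.frame(median_income      = runif(n, 0.5, 15),
                  housing_median_age = sample(1:52, n, replace = TRUE))
toy$median_house_value <- 40000 * toy$median_income +
  1000 * toy$housing_median_age + rnorm(n, sd = 50000)
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training set only; `~ .` uses all other columns as predictors.
model <- lm(median_house_value ~ ., data = train)
summary(model)
```

On the real data, `ocean_proximity` is a character column, so `lm()` would treat it as a categorical predictor and create dummy variables for it.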
#3.[5pts] Regularize the model:
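The step does not say which kind of regularization is meant, so follow the class demo if it differs. As one possible reading, the sketch below applies ridge (L2) shrinkage in closed form with base R. The simulated `toy` data, the helper `ridge_coef`, and the penalty `lambda = 50` are all assumptions for illustration.

```r
# Simulated stand-in for `hp` (NOT the real data).
set.seed(210191)
n <- 200
toy <- data.frame(median_income      = runif(n, 0.5, 15),
                  housing_median_age = sample(1:52, n, replace = TRUE))
toy$median_house_value <- 40000 * toy$median_income +
  1000 * toy$housing_median_age + rnorm(n, sd = 50000)

# Standardize the predictors and center the response so no intercept is needed.
X <- scale(as.matrix(toy[, c("median_income", "housing_median_age")]))
y <- toy$median_house_value - mean(toy$median_house_value)

# Hypothetical helper: closed-form ridge solution (X'X + lambda * I)^-1 X'y.
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

beta_ols   <- ridge_coef(X, y, lambda = 0)    # lambda = 0 is ordinary least squares
beta_ridge <- ridge_coef(X, y, lambda = 50)   # lambda > 0 shrinks the coefficients
rbind(ols = beta_ols, ridge = beta_ridge)
```

The coefficients are on the standardized scale, so they are comparable with each other but not with the raw `lm()` output.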
#4.[10pts] Model Evaluation. Calculate the predictions and residuals:
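A minimal sketch of this step, assuming predictions are stored in a `yhat` column and residuals (actual minus predicted) in a `residual` column of the test set. The data are simulated stand-ins, not `hp`, and the names `toy`, `train`, and `test` are hypothetical.

```r
# Simulated stand-in for `hp` (NOT the real data), split 80/20 as an assumption.
set.seed(210191)
n <- 200
toy <- data.frame(median_income      = runif(n, 0.5, 15),
                  housing_median_age = sample(1:52, n, replace = TRUE))
toy$median_house_value <- 40000 * toy$median_income +
  1000 * toy$housing_median_age + rnorm(n, sd = 50000)
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
model <- lm(median_house_value ~ ., data = train)

# Predict on rows the model has not seen, then take actual minus predicted.
test$yhat     <- as.numeric(predict(model, newdata = test))
test$residual <- test$median_house_value - test$yhat
head(test[, c("median_house_value", "yhat", "residual")])
```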
#5.[10pts] Plot Trends in Errors. Plot the yhat and residual columns; use ylim(-450000,450000)
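A sketch of the residual plot with ggplot2, using the `ylim(-450000,450000)` limits the step asks for. The data are simulated stand-ins (not `hp`), and the point transparency and the dashed zero line are stylistic assumptions.

```r
library(ggplot2)

# Simulated stand-in for `hp` (NOT the real data), split 80/20 as an assumption.
set.seed(210191)
n <- 200
toy <- data.frame(median_income      = runif(n, 0.5, 15),
                  housing_median_age = sample(1:52, n, replace = TRUE))
toy$median_house_value <- 40000 * toy$median_income +
  1000 * toy$housing_median_age + rnorm(n, sd = 50000)
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
model <- lm(median_house_value ~ ., data = train)
test$yhat     <- as.numeric(predict(model, newdata = test))
test$residual <- test$median_house_value - test$yhat

# Residuals against predictions; a well-behaved model scatters evenly around zero.
p <- ggplot(test, aes(x = yhat, y = residual)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  ylim(-450000, 450000)
print(p)
```

A funnel shape or a visible slope in this plot would suggest the error depends on the predicted value. Note that `ylim()` drops points outside the limits, with a warning, rather than clipping them.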
#6.[5pts] Typical Error Size. Calculate the standard deviation of the residuals for this model
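A minimal sketch of the calculation on simulated stand-in data (not `hp`). The name `resid_sd` is hypothetical, and computing it on the test set rather than the training set is an assumption.

```r
# Simulated stand-in for `hp` (NOT the real data), split 80/20 as an assumption.
set.seed(210191)
n <- 200
toy <- data.frame(median_income      = runif(n, 0.5, 15),
                  housing_median_age = sample(1:52, n, replace = TRUE))
toy$median_house_value <- 40000 * toy$median_income +
  1000 * toy$housing_median_age + rnorm(n, sd = 50000)
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
model <- lm(median_house_value ~ ., data = train)
test$residual <- test$median_house_value - as.numeric(predict(model, newdata = test))

# Typical error size: standard deviation of the test residuals, in dollars.
resid_sd <- sd(test$residual)
resid_sd
```

The result is in the same units as `median_house_value`, so it can be read directly as a typical prediction error in dollars.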
#7.[5pts] Check for Overfitting
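One common check, sketched here under assumptions, is to compare the residual standard deviation on the training set with the one on the test set. The data are simulated stand-ins (not `hp`), and the names `train_sd` and `test_sd` are hypothetical.

```r
# Simulated stand-in for `hp` (NOT the real data), split 80/20 as an assumption.
set.seed(210191)
n <- 200
toy <- data.frame(median_income      = runif(n, 0.5, 15),
                  housing_median_age = sample(1:52, n, replace = TRUE))
toy$median_house_value <- 40000 * toy$median_income +
  1000 * toy$housing_median_age + rnorm(n, sd = 50000)
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
model <- lm(median_house_value ~ ., data = train)

# Same error measure on data the model has seen and on data it has not.
train_sd <- sd(train$median_house_value - predict(model, newdata = train))
test_sd  <- sd(test$median_house_value  - predict(model, newdata = test))
c(train = train_sd, test = test_sd, ratio = test_sd / train_sd)
```

A test error far above the training error points to overfitting, while similar values suggest the model generalizes. How large a gap counts as "far" is a judgment call.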
#8.[5pts] Are we overfitting? Yes or no, and why?