This is my first document
library(datasets)
?datasets
## starting httpd help server ... done
help(datasets)
#library(help = "datasets")
data(iris)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
round(var(iris[,-5]))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1 0 1 1
## Sepal.Width 0 0 0 0
## Petal.Length 1 0 3 1
## Petal.Width 1 0 1 1
round(cor(iris[,-5]))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1 0 1 1
## Sepal.Width 0 1 0 0
## Petal.Length 1 0 1 1
## Petal.Width 1 0 1 1
The Covariance Matrix
The Code: round(var(Advertising[,-1]), 2) computes the covariance matrix for the remaining variables and rounds the output to 2 decimal places. In R, using the var() function on a data frame calculates the covariance between all columns.
The Concept: Covariance measures the direction of the linear relationship between two variables. A positive number means they tend to increase together; a negative number means one increases while the other decreases.
The Correlation Matrix
The Code: round(cor(Advertising[,-1]), 3) computes the correlation matrix and rounds it to 3 decimal places. The Concept: Correlation is a standardized version of covariance. It tells you both the direction and the strength of the linear relationship
italics bold
load("arabba_rains.RDATA")
str(arabba)
## List of 6
## $ name : chr "VE_0080"
## $ maxima: num [1:36(1d)] 12.4 18 15.6 12.6 12.6 14.4 16.8 12.4 13.2 12.6 ...
## ..- attr(*, "dimnames")=List of 1
## .. ..$ : chr [1:36] "1985" "1986" "1987" "1988" ...
## $ u : num 8.2
## $ pot :'data.frame': 347 obs. of 2 variables:
## ..$ value: num [1:347] 9.4 10.4 12.4 9.4 8.4 11.4 13.8 15.6 10.4 18 ...
## ..$ date : POSIXct[1:347], format: "1985-04-03 15:00:00" "1985-08-02 00:00:00" ...
## $ coord : int [1:2] 716615 5151711
## $ elev : int 2154
rain_data = arabba$pot
rain_data
## value date
## 1 9.4 1985-04-03 15:00:00
## 2 10.4 1985-08-02 00:00:00
## 3 12.4 1985-08-06 12:00:00
## 4 9.4 1985-08-22 19:00:00
## 5 8.4 1986-05-20 03:00:00
## 6 11.4 1986-05-21 19:00:00
## 7 13.8 1986-05-29 14:00:00
## 8 15.6 1986-07-31 12:00:00
## 9 10.4 1986-07-31 13:00:00
## 10 18.0 1986-08-07 20:00:00
## 11 15.4 1986-08-14 01:00:00
## 12 8.6 1986-08-18 22:00:00
## 13 9.8 1986-08-23 14:00:00
## 14 10.2 1987-07-08 17:00:00
## 15 10.6 1987-07-15 19:00:00
## 16 8.8 1987-08-24 21:00:00
## 17 15.6 1987-08-24 22:00:00
## 18 11.4 1987-08-24 23:00:00
## 19 10.6 1987-08-25 02:00:00
## 20 11.4 1987-08-25 03:00:00
## 21 10.2 1987-08-25 04:00:00
## 22 11.8 1987-08-25 08:00:00
## 23 12.2 1987-11-25 18:00:00
## 24 8.8 1988-06-14 15:00:00
## 25 8.6 1988-06-24 18:00:00
## 26 9.2 1988-07-11 20:00:00
## 27 11.6 1988-07-11 21:00:00
## 28 9.4 1988-07-12 16:00:00
## 29 9.2 1988-07-15 03:00:00
## 30 12.6 1988-07-15 04:00:00
## 31 10.2 1988-07-15 05:00:00
## 32 8.4 1988-07-22 13:00:00
## 33 10.6 1988-07-29 20:00:00
## 34 8.2 1988-08-16 13:00:00
## 35 10.0 1988-08-29 22:00:00
## 36 9.4 1988-10-12 14:00:00
## 37 9.6 1988-10-12 21:00:00
## 38 10.2 1989-06-26 13:00:00
## 39 8.8 1989-06-28 12:00:00
## 40 9.8 1989-07-03 14:00:00
## 41 10.4 1989-07-22 19:00:00
## 42 11.8 1989-08-08 18:00:00
## 43 8.6 1989-08-08 19:00:00
## 44 12.6 1989-08-24 21:00:00
## 45 9.2 1989-08-26 22:00:00
## 46 8.6 1989-09-11 14:00:00
## 47 11.6 1990-06-08 14:00:00
## 48 14.4 1990-06-28 17:00:00
## 49 8.4 1990-07-01 08:00:00
## 50 8.8 1990-07-02 10:00:00
## 51 11.0 1990-07-10 14:00:00
## 52 9.2 1990-07-24 18:00:00
## 53 12.8 1990-08-15 00:00:00
## 54 10.4 1990-09-24 09:00:00
## 55 9.8 1991-06-17 16:00:00
## 56 8.8 1991-06-24 01:00:00
## 57 10.6 1991-07-04 16:00:00
## 58 9.2 1991-07-14 11:00:00
## 59 10.6 1991-07-14 12:00:00
## 60 11.4 1991-07-17 13:00:00
## 61 8.4 1991-07-21 20:00:00
## 62 16.8 1991-08-01 04:00:00
## 63 8.2 1991-08-24 20:00:00
## 64 12.4 1992-08-01 18:00:00
## 65 9.0 1992-08-04 15:00:00
## 66 8.4 1992-08-30 00:00:00
## 67 10.8 1993-06-20 16:00:00
## 68 13.2 1993-07-11 05:00:00
## 69 12.2 1993-07-16 20:00:00
## 70 12.0 1993-07-19 12:00:00
## 71 8.4 1993-07-19 20:00:00
## 72 12.0 1993-07-20 20:00:00
## 73 12.4 1993-08-04 12:00:00
## 74 10.4 1993-08-07 23:00:00
## 75 12.6 1994-06-24 16:00:00
## 76 11.2 1994-07-04 17:00:00
## 77 10.8 1994-08-08 13:00:00
## 78 10.0 1994-08-18 01:00:00
## 79 10.6 1994-08-18 02:00:00
## 80 11.8 1994-08-18 03:00:00
## 81 8.2 1994-08-31 19:00:00
## 82 12.4 1994-09-08 18:00:00
## 83 8.8 1994-09-13 19:00:00
## 84 9.2 1994-09-14 11:00:00
## 85 8.4 1994-09-14 13:00:00
## 86 9.0 1995-05-30 10:00:00
## 87 8.6 1995-06-01 17:00:00
## 88 8.8 1995-06-21 18:00:00
## 89 20.6 1995-06-21 20:00:00
## 90 11.8 1995-07-11 16:00:00
## 91 15.6 1995-07-14 16:00:00
## 92 8.8 1995-07-23 16:00:00
## 93 20.8 1995-07-23 17:00:00
## 94 10.6 1995-07-28 20:00:00
## 95 10.2 1996-05-19 17:00:00
## 96 9.2 1996-07-08 00:00:00
## 97 8.2 1996-07-08 06:00:00
## 98 8.2 1996-08-03 03:00:00
## 99 8.6 1996-08-04 20:00:00
## 100 10.0 1997-06-12 20:00:00
## 101 17.6 1997-07-14 16:00:00
## 102 8.4 1997-07-14 17:00:00
## 103 16.4 1997-07-24 13:00:00
## 104 8.6 1997-07-29 00:00:00
## 105 8.2 1997-09-05 20:00:00
## 106 12.6 1998-05-13 22:00:00
## 107 9.0 1998-07-07 16:00:00
## 108 9.2 1998-07-14 00:00:00
## 109 10.8 1998-07-14 01:00:00
## 110 8.6 1998-07-20 18:00:00
## 111 9.0 1998-07-25 18:00:00
## 112 9.0 1998-08-15 15:00:00
## 113 9.6 1998-09-05 13:00:00
## 114 8.2 1998-09-05 14:00:00
## 115 12.0 1999-05-05 00:00:00
## 116 10.4 1999-06-13 19:00:00
## 117 10.0 1999-06-30 00:00:00
## 118 9.0 1999-07-06 18:00:00
## 119 16.2 1999-08-05 18:00:00
## 120 10.0 1999-08-16 17:00:00
## 121 10.8 1999-08-20 04:00:00
## 122 9.2 1999-09-20 18:00:00
## 123 10.8 1999-09-20 19:00:00
## 124 11.4 1999-09-20 20:00:00
## 125 10.6 2000-06-14 15:00:00
## 126 11.0 2000-09-20 23:00:00
## 127 9.6 2000-11-14 16:00:00
## 128 9.0 2001-06-10 18:00:00
## 129 8.4 2001-06-16 21:00:00
## 130 14.0 2001-08-17 16:00:00
## 131 9.2 2001-08-31 19:00:00
## 132 8.8 2002-08-20 16:00:00
## 133 12.2 2002-08-23 19:00:00
## 134 58.0 2003-06-13 18:00:00
## 135 14.8 2003-06-27 23:00:00
## 136 10.0 2003-07-01 21:00:00
## 137 15.2 2003-07-17 12:00:00
## 138 13.4 2003-07-17 18:00:00
## 139 13.6 2003-07-23 02:00:00
## 140 10.0 2003-07-27 23:00:00
## 141 8.4 2003-07-28 17:00:00
## 142 18.4 2003-08-05 20:00:00
## 143 30.2 2003-08-14 15:00:00
## 144 10.2 2003-08-14 16:00:00
## 145 8.8 2003-08-20 23:00:00
## 146 10.2 2004-06-12 19:00:00
## 147 10.0 2004-07-01 17:00:00
## 148 9.0 2004-07-05 16:00:00
## 149 14.2 2004-07-06 19:00:00
## 150 16.0 2004-07-09 02:00:00
## 151 11.4 2004-07-11 08:00:00
## 152 9.2 2004-07-16 15:00:00
## 153 8.4 2004-08-07 16:00:00
## 154 11.2 2004-08-07 17:00:00
## 155 11.0 2004-08-13 22:00:00
## 156 13.6 2004-08-26 05:00:00
## 157 10.8 2004-09-14 20:00:00
## 158 15.2 2005-06-24 12:00:00
## 159 22.0 2005-06-24 13:00:00
## 160 10.6 2005-06-25 16:00:00
## 161 14.8 2005-06-29 14:00:00
## 162 8.4 2005-07-01 06:00:00
## 163 11.2 2005-07-05 02:00:00
## 164 9.4 2005-07-05 03:00:00
## 165 14.6 2005-07-18 15:00:00
## 166 9.4 2005-08-14 17:00:00
## 167 11.2 2005-08-20 12:00:00
## 168 8.6 2005-08-20 13:00:00
## 169 12.2 2005-09-03 17:00:00
## 170 19.6 2005-09-03 18:00:00
## 171 9.2 2005-09-04 17:00:00
## 172 13.0 2005-09-04 18:00:00
## 173 9.0 2006-07-27 22:00:00
## 174 10.8 2006-07-31 20:00:00
## 175 10.8 2006-08-01 09:00:00
## 176 8.2 2006-08-03 13:00:00
## 177 10.0 2006-08-26 23:00:00
## 178 10.4 2006-09-15 08:00:00
## 179 10.0 2006-09-15 09:00:00
## 180 10.2 2006-09-15 10:00:00
## 181 9.0 2007-05-22 16:00:00
## 182 12.6 2007-05-23 16:00:00
## 183 9.2 2007-06-26 18:00:00
## 184 11.8 2007-06-26 19:00:00
## 185 12.2 2007-08-07 15:00:00
## 186 9.2 2007-08-17 13:00:00
## 187 8.6 2007-08-20 07:00:00
## 188 12.2 2008-06-26 22:00:00
## 189 9.0 2008-06-27 17:00:00
## 190 10.8 2008-06-30 00:00:00
## 191 8.6 2008-07-17 19:00:00
## 192 8.4 2008-07-20 19:00:00
## 193 12.0 2008-07-26 13:00:00
## 194 8.2 2008-07-29 14:00:00
## 195 11.4 2008-08-02 01:00:00
## 196 18.4 2008-08-06 18:00:00
## 197 19.6 2008-08-06 19:00:00
## 198 8.6 2008-08-15 18:00:00
## 199 10.4 2008-08-15 19:00:00
## 200 8.8 2008-08-23 13:00:00
## 201 10.4 2009-05-22 15:00:00
## 202 16.8 2009-06-16 19:00:00
## 203 9.6 2009-06-19 22:00:00
## 204 11.6 2009-07-17 23:00:00
## 205 9.8 2009-07-18 00:00:00
## 206 13.6 2009-07-24 22:00:00
## 207 8.4 2009-08-03 18:00:00
## 208 13.8 2009-08-07 23:00:00
## 209 11.6 2009-08-08 13:00:00
## 210 16.6 2009-08-13 19:00:00
## 211 9.0 2010-06-17 13:00:00
## 212 12.0 2010-07-01 17:00:00
## 213 8.2 2010-07-02 19:00:00
## 214 11.8 2010-07-11 13:00:00
## 215 10.8 2010-07-11 14:00:00
## 216 10.4 2010-08-04 20:00:00
## 217 10.4 2010-08-08 19:00:00
## 218 20.0 2010-08-10 14:00:00
## 219 8.8 2010-08-13 02:00:00
## 220 8.2 2011-06-16 15:00:00
## 221 8.2 2011-06-18 18:00:00
## 222 8.6 2011-07-13 16:00:00
## 223 11.2 2011-07-13 21:00:00
## 224 11.0 2011-08-08 20:00:00
## 225 8.8 2011-08-15 11:00:00
## 226 9.4 2011-09-17 16:00:00
## 227 13.2 2012-06-04 10:00:00
## 228 9.0 2012-06-04 11:00:00
## 229 8.2 2012-06-19 14:00:00
## 230 14.4 2012-07-09 16:00:00
## 231 8.2 2012-07-09 20:00:00
## 232 9.8 2012-07-14 17:00:00
## 233 8.6 2012-07-21 00:00:00
## 234 9.4 2012-08-05 00:00:00
## 235 9.2 2012-08-06 19:00:00
## 236 9.2 2012-08-22 01:00:00
## 237 10.2 2012-08-30 14:00:00
## 238 10.8 2012-08-31 06:00:00
## 239 8.2 2012-08-31 07:00:00
## 240 8.4 2012-09-24 18:00:00
## 241 8.2 2013-05-17 11:00:00
## 242 8.4 2013-06-19 14:00:00
## 243 8.2 2013-07-13 01:00:00
## 244 19.0 2013-08-04 21:00:00
## 245 13.4 2013-08-19 17:00:00
## 246 8.2 2013-08-19 18:00:00
## 247 10.2 2013-08-24 23:00:00
## 248 14.8 2014-06-23 19:00:00
## 249 8.2 2014-07-27 14:00:00
## 250 18.2 2014-08-01 16:00:00
## 251 14.0 2014-08-04 19:00:00
## 252 12.8 2014-08-04 23:00:00
## 253 9.4 2014-08-05 00:00:00
## 254 11.0 2014-08-07 18:00:00
## 255 11.6 2014-08-13 12:00:00
## 256 14.8 2014-08-13 13:00:00
## 257 11.6 2014-08-31 18:00:00
## 258 8.4 2014-09-09 18:00:00
## 259 8.6 2014-09-19 17:00:00
## 260 9.0 2015-06-08 13:00:00
## 261 10.6 2015-06-08 14:00:00
## 262 8.4 2015-07-07 17:00:00
## 263 9.6 2015-07-21 17:00:00
## 264 13.8 2015-07-21 18:00:00
## 265 18.8 2015-07-24 13:00:00
## 266 21.2 2015-07-29 18:00:00
## 267 11.6 2015-08-04 18:00:00
## 268 9.4 2015-09-14 04:00:00
## 269 9.0 2015-09-14 05:00:00
## 270 8.4 2016-06-30 21:00:00
## 271 8.8 2016-07-12 16:00:00
## 272 10.0 2016-07-13 00:00:00
## 273 9.6 2016-07-21 18:00:00
## 274 11.8 2016-07-30 21:00:00
## 275 10.0 2016-07-31 12:00:00
## 276 11.6 2016-08-05 08:00:00
## 277 10.2 2016-08-05 09:00:00
## 278 9.4 2016-08-09 17:00:00
## 279 17.0 2016-08-18 21:00:00
## 280 12.8 2016-08-29 16:00:00
## 281 10.6 2016-09-17 02:00:00
## 282 9.2 2016-09-17 03:00:00
## 283 8.6 2016-09-17 04:00:00
## 284 9.6 2016-10-14 18:00:00
## 285 9.4 2016-10-14 19:00:00
## 286 13.8 2017-05-14 16:00:00
## 287 9.2 2017-05-19 17:00:00
## 288 11.0 2017-06-01 21:00:00
## 289 11.6 2017-06-02 13:00:00
## 290 10.6 2017-06-15 19:00:00
## 291 8.8 2017-06-15 20:00:00
## 292 9.2 2017-07-17 23:00:00
## 293 11.2 2017-07-24 13:00:00
## 294 9.2 2017-07-24 14:00:00
## 295 21.0 2017-08-06 12:00:00
## 296 10.8 2017-08-06 13:00:00
## 297 12.0 2017-08-09 20:00:00
## 298 11.4 2017-08-09 21:00:00
## 299 10.0 2018-05-06 15:00:00
## 300 9.2 2018-06-21 15:00:00
## 301 10.6 2018-07-10 17:00:00
## 302 10.8 2018-07-11 19:00:00
## 303 9.8 2018-07-14 20:00:00
## 304 8.4 2018-07-15 00:00:00
## 305 10.8 2018-08-01 19:00:00
## 306 12.4 2018-08-20 20:00:00
## 307 8.6 2018-08-20 21:00:00
## 308 10.2 2018-08-21 22:00:00
## 309 8.4 2018-08-23 15:00:00
## 310 11.6 2018-09-14 19:00:00
## 311 13.0 2018-10-29 15:00:00
## 312 8.6 2018-10-29 16:00:00
## 313 18.8 2018-10-29 18:00:00
## 314 8.8 2019-07-06 18:00:00
## 315 9.2 2019-07-17 18:00:00
## 316 16.4 2019-07-17 19:00:00
## 317 17.4 2019-07-26 14:00:00
## 318 8.8 2019-07-26 15:00:00
## 319 11.4 2019-07-30 19:00:00
## 320 8.4 2019-08-06 16:00:00
## 321 9.0 2019-08-15 20:00:00
## 322 9.0 2019-08-24 15:00:00
## 323 11.0 2019-10-15 20:00:00
## 324 9.0 2019-10-15 21:00:00
## 325 10.6 2020-05-23 19:00:00
## 326 11.0 2020-06-29 11:00:00
## 327 26.0 2020-07-01 12:00:00
## 328 11.4 2020-07-01 13:00:00
## 329 14.4 2020-07-11 11:00:00
## 330 9.2 2020-07-23 21:00:00
## 331 9.6 2020-07-23 22:00:00
## 332 10.6 2020-07-28 19:00:00
## 333 20.6 2020-07-29 20:00:00
## 334 11.2 2020-07-29 22:00:00
## 335 20.6 2020-07-30 21:00:00
## 336 19.6 2020-07-30 22:00:00
## 337 17.4 2020-08-01 19:00:00
## 338 15.2 2020-08-01 20:00:00
## 339 9.2 2020-08-09 23:00:00
## 340 9.0 2020-08-10 17:00:00
## 341 10.6 2020-08-13 20:00:00
## 342 16.2 2020-08-22 16:00:00
## 343 11.2 2020-08-28 22:00:00
## 344 8.2 2020-08-29 15:00:00
## 345 8.6 2020-08-29 18:00:00
## 346 11.6 2020-08-30 06:00:00
## 347 8.8 2020-08-30 07:00:00
it helps researchers to analyze how big the events occurs, how frequent the events are and did they cluster in certain time of the year.
hist(rain_data$value,freq = FALSE, main = "Histogram of arabba rain",xlab = "value" )
lines(density(rain_data$value), lwd = 2)
hist(rain_data$value, freq = FALSE, main = "histogram" , xlab = "value")
lines(density(rain_data$value), lwd = 2)
rain_data[1:5, ]
## value date
## 1 9.4 1985-04-03 15:00:00
## 2 10.4 1985-08-02 00:00:00
## 3 12.4 1985-08-06 12:00:00
## 4 9.4 1985-08-22 19:00:00
## 5 8.4 1986-05-20 03:00:00
##Simple Linear Regression model
advertising <- read.csv("advertising.csv")
str(advertising)
## 'data.frame': 200 obs. of 5 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ TV : num 230.1 44.5 17.2 151.5 180.8 ...
## $ radio : num 37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
## $ newspaper: num 69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
## $ sales : num 22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...
advertising_df <- advertising[,-1]
plot(advertising_df$newspaper,advertising_df$sales,pch = 20,xlab = "NEWPAPER BUDGET", ylab = "sales", col = "green", )
plot(advertising_df$radio,advertising_df$sales,pch = 20,xlab = "radio BUDGET", ylab = "sales", col = "red", )
plot(advertising_df$TV,advertising_df$sales, pch = 20, xlab = "TV budget" , ylab = "sales" , col= "blue" )
n <- length(advertising_df$sales)
print(paste("Sample size:", n))
## [1] "Sample size: 200"
cov(advertising_df$TV,advertising_df$sales)
## [1] 350.3902
cov.mat <- cov(advertising_df)
cov.mat
## TV radio newspaper sales
## TV 7370.94989 69.86249 105.91945 350.39019
## radio 69.86249 220.42774 114.49698 44.63569
## newspaper 105.91945 114.49698 474.30833 25.94139
## sales 350.39019 44.63569 25.94139 27.22185
cov2cor(cov.mat)
## TV radio newspaper sales
## TV 1.00000000 0.05480866 0.05664787 0.7822244
## radio 0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales 0.78222442 0.57622257 0.22829903 1.0000000
#par(mfrow(c(1,2))
library("corrplot")
## corrplot 0.95 loaded
?corrplot
corrplot(cor(advertising_df), method = 'ellipse', main = "\nCorrelation Ellipses")
corrplot(cor(advertising_df), method = 'number', main = "\nCorrelation Cofficients")
pairs(advertising_df, col = "green", pch = 20)
x <- advertising_df$TV
y <- advertising_df$sales
simple_rg <- lm(y~x)
summary(simple_rg)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## x 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
# Quick plotting of the model
plot(x, y, pch = 16, col = "blue", main = "Regression via lm()")
abline(simple_rg, col = "red", lwd = 2)
bete_hat <- coefficients(simple_rg)
y_fitted <- fitted.values(simple_rg)
e_simpleReg <- residuals(simple_rg)
sum(e_simpleReg)
## [1] -1.301736e-14
# Checking OLS properties
cat("Sum of residuals:", sum(e_simpleReg),
"\nSum of residuals * x:", sum(e_simpleReg * x),
"\nSum of residuals * fitted values:", sum(e_simpleReg * y_fitted))
## Sum of residuals: -1.301736e-14
## Sum of residuals * x: -5.354828e-12
## Sum of residuals * fitted values: -5.531131e-13
##Assignment 1 Simple Linear Regression
The Boston dataset The ISLR2 library contains the Boston data set, which records medv (median house value) for 506 census tracts in Boston. In this assignment we analyse the relationship between medv and lstat (percent of households with low socioeconomic status). To find out more about the data set, we can type ?Boston.
library("ISLR2")
data(Boston)
plot(Boston$lstat,Boston$medv, pch = 20, xlab = "lower status population", ylab = "house", col = "steelblue")
summary(Boston[,c("medv", "lstat")])
## medv lstat
## Min. : 5.00 Min. : 1.73
## 1st Qu.:17.02 1st Qu.: 6.95
## Median :21.20 Median :11.36
## Mean :22.53 Mean :12.65
## 3rd Qu.:25.00 3rd Qu.:16.95
## Max. :50.00 Max. :37.97
cov_mat <- cov(Boston)
cov_mat
## crim zn indus chas nox
## crim 73.9865782 -40.2159560 23.9923388 -0.122108643 0.419593894
## zn -40.2159560 543.9368137 -85.4126481 -0.252925293 -1.396148200
## indus 23.9923388 -85.4126481 47.0644425 0.109668806 0.607073693
## chas -0.1221086 -0.2529253 0.1096688 0.064512973 0.002684303
## nox 0.4195939 -1.3961482 0.6070737 0.002684303 0.013427636
## rm -1.3250378 5.1125134 -1.8879566 0.016284745 -0.024603450
## age 85.4053223 -373.9015482 124.5139031 0.618571205 2.385927202
## dis -6.8767215 32.6293041 -10.2280975 -0.053042959 -0.187695836
## rad 46.8477610 -63.3486949 35.5499714 -0.016295543 0.616929453
## tax 844.8215381 -1236.4537354 833.3602902 -1.523367119 13.046285530
## ptratio 5.3993308 -19.7765707 5.6921040 -0.066819160 0.047397324
## lstat 27.9861679 -68.7830369 29.5802703 -0.097816264 0.488946168
## medv -30.7185080 77.3151755 -30.5208228 0.409409463 -0.455412432
## rm age dis rad tax
## crim -1.32503785 85.4053223 -6.87672154 46.84776101 844.821538
## zn 5.11251341 -373.9015482 32.62930405 -63.34869487 -1236.453735
## indus -1.88795657 124.5139031 -10.22809746 35.54997135 833.360290
## chas 0.01628475 0.6185712 -0.05304296 -0.01629554 -1.523367
## nox -0.02460345 2.3859272 -0.18769584 0.61692945 13.046286
## rm 0.49367085 -4.7519292 0.30366342 -1.28381457 -34.583448
## age -4.75192919 792.3583985 -44.32937946 111.77084648 2402.690122
## dis 0.30366342 -44.3293795 4.43401514 -9.06825201 -189.664592
## rad -1.28381457 111.7708465 -9.06825201 75.81636598 1335.756577
## tax -34.58344778 2402.6901225 -189.66459173 1335.75657653 28404.759488
## ptratio -0.54076322 15.9369213 -1.05977455 8.76071616 168.153141
## lstat -3.07974141 121.0777246 -7.47332906 30.38544241 654.714520
## medv 4.49344588 -97.5890166 4.84022864 -30.56122804 -726.255716
## ptratio lstat medv
## crim 5.39933079 27.98616788 -30.7185080
## zn -19.77657066 -68.78303690 77.3151755
## indus 5.69210400 29.58027028 -30.5208228
## chas -0.06681916 -0.09781626 0.4094095
## nox 0.04739732 0.48894617 -0.4554124
## rm -0.54076322 -3.07974141 4.4934459
## age 15.93692134 121.07772456 -97.5890166
## dis -1.05977455 -7.47332906 4.8402286
## rad 8.76071616 30.38544241 -30.5612280
## tax 168.15314053 654.71451963 -726.2557164
## ptratio 4.68698912 5.78272856 -10.1106571
## lstat 5.78272856 50.99475951 -48.4475383
## medv -10.11065714 -48.44753832 84.5867236
cov2cor(cov_mat)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## lstat medv
## crim 0.4556215 -0.3883046
## zn -0.4129946 0.3604453
## indus 0.6037997 -0.4837252
## chas -0.0539293 0.1752602
## nox 0.5908789 -0.4273208
## rm -0.6138083 0.6953599
## age 0.6023385 -0.3769546
## dis -0.4969958 0.2499287
## rad 0.4886763 -0.3816262
## tax 0.5439934 -0.4685359
## ptratio 0.3740443 -0.5077867
## lstat 1.0000000 -0.7376627
## medv -0.7376627 1.0000000
cor(Boston[, c("medv", "lstat")])
## medv lstat
## medv 1.0000000 -0.7376627
## lstat -0.7376627 1.0000000
Boston_regression <- lm(Boston$medv~Boston$lstat)
Boston_regression
##
## Call:
## lm(formula = Boston$medv ~ Boston$lstat)
##
## Coefficients:
## (Intercept) Boston$lstat
## 34.55 -0.95
summary(Boston_regression)
##
## Call:
## lm(formula = Boston$medv ~ Boston$lstat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## Boston$lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
?Boston
beta_bo <- coefficients(Boston_regression)
y_fit <- fitted.values(Boston_regression)
e_boston <- residuals(Boston_regression)
plot(Boston$lstat, Boston$medv,
pch = 20,
xlab = "Lower Status Population (%)",
ylab = "Median Home Value ($1000s)",
col = "steelblue",
main = "Regression of Boston Dataset")
abline(Boston_regression, col = "red", lwd = 2)
loglstat <- log(Boston$lstat)
loglstat
## [1] 1.6054299 2.2126604 1.3937664 1.0784096 1.6733512 1.6505799 2.5201129
## [8] 2.9523027 3.3988613 2.8390785 3.0179829 2.5855058 2.7542975 2.1114246
## [15] 2.3282528 2.1365305 1.8840347 2.6858046 2.4587338 2.4230312 3.0454744
## [22] 2.6268401 2.9295925 2.9897142 2.7911651 2.8039663 2.6953026 2.8495498
## [29] 2.5494452 2.4832386 3.1179499 2.5680216 3.3217934 2.9096296 3.0125894
## [36] 2.2700619 2.4344902 2.1713368 2.3155013 1.4632554 0.6830968 1.5769147
## [43] 1.7595806 2.0068708 2.2565412 2.3233676 2.6497146 2.9338569 3.4278393
## [50] 2.7850112 2.5989791 2.2438961 1.6639261 2.1317968 2.6946272 1.5706971
## [57] 1.7526721 1.3737156 1.9257074 2.2213750 2.5764218 2.6700021 1.9065751
## [64] 2.2512918 2.0856721 1.5411591 2.3263016 2.0918641 2.5718486 2.1736147
## [71] 1.9050882 2.2905125 1.7083779 2.0202222 1.9139771 2.1905356 2.4824035
## [78] 2.3292270 2.5128460 2.2082744 1.6658182 1.9768550 1.9050882 2.0162355
## [85] 2.2638443 1.8764069 2.5541217 2.1329823 1.7047481 1.7404662 2.1758874
## [92] 2.1041342 2.0992442 1.8261609 2.3599102 1.8946169 2.4283363 1.4374626
## [99] 1.2725656 1.8229351 2.2428351 2.0373166 2.3636802 2.5982353 2.5120353
## [106] 2.8015405 2.9263822 2.6454653 2.5071573 2.7440606 2.5649494 2.3184584
## [113] 2.7856283 2.8384935 2.3466020 2.7574751 2.4882344 2.3321439 2.7324176
## [120] 2.6108048 2.6651427 2.6581594 2.8864753 3.2351428 2.8667619 2.6953026
## [127] 3.3054204 2.8443278 2.7337179 2.9090845 2.5336968 2.5063419 2.4087453
## [134] 2.7100482 2.8512844 2.8308576 2.8273136 2.6803364 3.0596456 2.9156062
## [141] 3.1846984 3.5383472 3.2891479 3.2741213 3.3772462 3.3250360 2.8124102
## [148] 3.3854067 3.3435683 3.0657246 2.6461748 2.5862591 2.4948570 2.7593768
## [155] 2.7160184 2.7093826 2.7813007 1.5238800 1.8609745 2.0001277 1.7047481
## [162] 0.5481214 0.6523252 1.1999648 2.4544474 2.2834023 1.3083328 2.4965058
## [169] 2.4069451 2.4265711 2.6693094 2.4874035 2.6871670 2.2016592 2.2659211
## [176] 1.6733512 2.3135250 1.8389611 1.9344158 1.6174061 2.0228712 2.2460147
## [183] 1.5727739 1.7369512 2.6376277 2.5764218 1.4929041 1.8991180 1.5173226
## [190] 1.6845454 1.6292405 1.5454326 1.0543120 1.6154200 1.4770487 1.0885620
## [197] 1.4060970 2.1529243 1.8900954 1.5173226 1.4929041 2.0055259 1.1346227
## [204] 1.3376292 1.0577903 2.3860067 2.3951643 2.8936995 2.6851227 3.1393996
## [211] 2.8489709 3.1772201 2.7744620 2.2385798 3.3860837 2.2481289 2.6034302
## [218] 2.2710944 2.8859174 2.3513753 2.2731563 3.0661907 2.2955605 2.0281482
## [225] 1.4206958 1.5325569 1.1410330 1.8500284 1.3660917 1.3244190 2.4553062
## [232] 1.6582281 0.9042182 1.3737156 2.0856721 2.3869262 2.2554935 1.5539252
## [239] 1.8500284 1.9974177 2.4318574 2.5176965 2.4176979 1.6467337 2.5257286
## [246] 2.9156062 2.2148462 2.3174737 2.2533948 1.8809906 1.7749524 1.2781522
## [253] 1.2612979 1.2641267 1.8825138 2.2246236 1.1346227 1.6331544 2.0528409
## [260] 1.9315214 2.2607209 1.9823798 1.7766458 2.4203681 2.0918641 2.3466020
## [267] 2.6939513 2.0068708 1.1505720 2.6137395 2.5649494 1.8855533 2.0451089
## [274] 1.8840347 1.2612979 1.0919233 1.8000583 1.4255151 1.9726912 1.5789787
## [281] 1.3244190 1.5238800 1.1019401 1.1505720 2.0605135 2.1077860 2.5595502
## [288] 1.9657128 2.0281482 2.2523439 1.2029723 1.2697605 1.5475625 2.1494339
## [295] 2.3418058 1.8357764 2.0001277 2.7625384 1.6034198 1.5560371 1.8033586
## [302] 2.2512918 2.1598688 1.5810384 1.9358598 2.1894164 1.8671761 2.0188950
## [309] 1.5129270 2.2995806 2.5368664 1.7884206 2.4612968 2.0668628 2.2278615
## [316] 2.4423470 2.9085391 2.7688317 2.3379522 2.5439614 1.9740810 1.9271641
## [323] 2.0412203 2.4630018 1.8115621 1.6253113 1.8164521 2.5486636 2.2995806
## [330] 1.9933388 2.2071749 2.5201129 2.0579625 1.7369512 1.9095425 2.0806908
## [337] 2.2823824 2.3570733 2.1412419 2.2762411 2.2289386 1.7029283 2.1575593
## [344] 1.9712994 1.5282279 2.3542283 2.5392370 1.8500284 1.7900914 1.7732560
## [351] 1.7884206 1.7029283 2.0528409 1.5040774 2.0856721 1.7173951 2.8678989
## [358] 2.5855058 2.4406064 2.5392370 2.0528409 2.6525375 2.3214068 2.6837575
## [365] 1.6658182 1.9629077 2.6390573 2.5900171 1.1817272 1.3164082 1.0851893
## [372] 2.2544447 2.1838016 3.5487549 3.6367964 2.5982353 3.1458749 3.0558862
## [379] 3.1650530 3.0809921 2.8454906 3.0483247 3.1612467 3.2011191 3.4219799
## [386] 3.4278393 3.3421548 3.4654234 3.4216534 3.0373539 2.8396631 2.9317269
## [393] 3.2457125 2.7193198 2.7942279 2.8402474 2.9637255 2.9917243 3.4206732
## [400] 3.4001969 3.2872819 3.0116056 3.0111134 2.9841656 3.3098128 3.1346243
## [407] 3.1501686 2.4956817 3.2733640 2.9846713 2.3135250 3.0549441 3.5371841
## [414] 2.9997243 3.6103772 3.3690185 3.2499868 3.2824138 3.0262615 3.1241255
## [421] 2.7093826 2.7536607 2.6461748 3.1480241 2.8425811 3.1941732 2.7530236
## [428] 2.6755270 3.0689827 3.1813816 2.8701691 2.9801109 2.4874035 2.7862450
## [435] 2.7193198 3.1471650 2.8931457 3.2752562 3.5269486 3.1302632 3.0960300
## [442] 2.9714396 2.8088001 2.9365129 3.1692653 3.1772201 2.8786365 2.7997174
## [449] 2.8975680 2.9606231 2.8587664 2.8752581 2.8489709 2.8178011 2.9290581
## [456] 2.8975680 2.9449652 2.8296777 2.7868614 2.6878475 2.7985001 2.6844403
## [463] 2.6383428 2.3311725 2.5817308 2.6483002 2.8419982 3.0596456 2.8975680
## [470] 2.6919208 2.7905514 2.5548990 2.6644466 2.4561642 2.8981194 3.1822118
## [477] 2.9274534 3.2152693 2.8920370 2.5733753 2.3739751 2.0464017 1.9473377
## [484] 2.3437270 2.5907670 2.3589654 2.7067160 2.4379897 2.8936995 3.1768030
## [491] 3.3904734 2.8942531 2.5915164 2.4857396 2.6093342 2.8678989 3.0511670
## [498] 2.6461748 2.5587765 2.7146947 2.6623552 2.2690283 2.2060742 1.7298841
## [505] 1.8687205 2.0643279
plot(loglstat, Boston$medv, pch = 20 , col = "blue" , xlab = "Lower Status Population (%) in log", ylab = "Median Home Value ($1000s)")
logboston <- lm(Boston$medv~loglstat)
logboston
##
## Call:
## lm(formula = Boston$medv ~ loglstat)
##
## Coefficients:
## (Intercept) loglstat
## 52.12 -12.48
summary(logboston)
##
## Call:
## lm(formula = Boston$medv ~ loglstat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4599 -3.5006 -0.6686 2.1688 26.0129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.1248 0.9652 54.00 <2e-16 ***
## loglstat -12.4810 0.3946 -31.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.329 on 504 degrees of freedom
## Multiple R-squared: 0.6649, Adjusted R-squared: 0.6643
## F-statistic: 1000 on 1 and 504 DF, p-value: < 2.2e-16
plot(loglstat, Boston$medv, pch = 20 , col = "blue" , xlab = "Lower Status Population (%) in log", ylab = "Median Home Value ($1000s)")
abline(logboston,col = "green", lwd = 2)
advertising_df
## TV radio newspaper sales
## 1 230.1 37.8 69.2 22.1
## 2 44.5 39.3 45.1 10.4
## 3 17.2 45.9 69.3 9.3
## 4 151.5 41.3 58.5 18.5
## 5 180.8 10.8 58.4 12.9
## 6 8.7 48.9 75.0 7.2
## 7 57.5 32.8 23.5 11.8
## 8 120.2 19.6 11.6 13.2
## 9 8.6 2.1 1.0 4.8
## 10 199.8 2.6 21.2 10.6
## 11 66.1 5.8 24.2 8.6
## 12 214.7 24.0 4.0 17.4
## 13 23.8 35.1 65.9 9.2
## 14 97.5 7.6 7.2 9.7
## 15 204.1 32.9 46.0 19.0
## 16 195.4 47.7 52.9 22.4
## 17 67.8 36.6 114.0 12.5
## 18 281.4 39.6 55.8 24.4
## 19 69.2 20.5 18.3 11.3
## 20 147.3 23.9 19.1 14.6
## 21 218.4 27.7 53.4 18.0
## 22 237.4 5.1 23.5 12.5
## 23 13.2 15.9 49.6 5.6
## 24 228.3 16.9 26.2 15.5
## 25 62.3 12.6 18.3 9.7
## 26 262.9 3.5 19.5 12.0
## 27 142.9 29.3 12.6 15.0
## 28 240.1 16.7 22.9 15.9
## 29 248.8 27.1 22.9 18.9
## 30 70.6 16.0 40.8 10.5
## 31 292.9 28.3 43.2 21.4
## 32 112.9 17.4 38.6 11.9
## 33 97.2 1.5 30.0 9.6
## 34 265.6 20.0 0.3 17.4
## 35 95.7 1.4 7.4 9.5
## 36 290.7 4.1 8.5 12.8
## 37 266.9 43.8 5.0 25.4
## 38 74.7 49.4 45.7 14.7
## 39 43.1 26.7 35.1 10.1
## 40 228.0 37.7 32.0 21.5
## 41 202.5 22.3 31.6 16.6
## 42 177.0 33.4 38.7 17.1
## 43 293.6 27.7 1.8 20.7
## 44 206.9 8.4 26.4 12.9
## 45 25.1 25.7 43.3 8.5
## 46 175.1 22.5 31.5 14.9
## 47 89.7 9.9 35.7 10.6
## 48 239.9 41.5 18.5 23.2
## 49 227.2 15.8 49.9 14.8
## 50 66.9 11.7 36.8 9.7
## 51 199.8 3.1 34.6 11.4
## 52 100.4 9.6 3.6 10.7
## 53 216.4 41.7 39.6 22.6
## 54 182.6 46.2 58.7 21.2
## 55 262.7 28.8 15.9 20.2
## 56 198.9 49.4 60.0 23.7
## 57 7.3 28.1 41.4 5.5
## 58 136.2 19.2 16.6 13.2
## 59 210.8 49.6 37.7 23.8
## 60 210.7 29.5 9.3 18.4
## 61 53.5 2.0 21.4 8.1
## 62 261.3 42.7 54.7 24.2
## 63 239.3 15.5 27.3 15.7
## 64 102.7 29.6 8.4 14.0
## 65 131.1 42.8 28.9 18.0
## 66 69.0 9.3 0.9 9.3
## 67 31.5 24.6 2.2 9.5
## 68 139.3 14.5 10.2 13.4
## 69 237.4 27.5 11.0 18.9
## 70 216.8 43.9 27.2 22.3
## 71 199.1 30.6 38.7 18.3
## 72 109.8 14.3 31.7 12.4
## 73 26.8 33.0 19.3 8.8
## 74 129.4 5.7 31.3 11.0
## 75 213.4 24.6 13.1 17.0
## 76 16.9 43.7 89.4 8.7
## 77 27.5 1.6 20.7 6.9
## 78 120.5 28.5 14.2 14.2
## 79 5.4 29.9 9.4 5.3
## 80 116.0 7.7 23.1 11.0
## 81 76.4 26.7 22.3 11.8
## 82 239.8 4.1 36.9 12.3
## 83 75.3 20.3 32.5 11.3
## 84 68.4 44.5 35.6 13.6
## 85 213.5 43.0 33.8 21.7
## 86 193.2 18.4 65.7 15.2
## 87 76.3 27.5 16.0 12.0
## 88 110.7 40.6 63.2 16.0
## 89 88.3 25.5 73.4 12.9
## 90 109.8 47.8 51.4 16.7
## 91 134.3 4.9 9.3 11.2
## 92 28.6 1.5 33.0 7.3
## 93 217.7 33.5 59.0 19.4
## 94 250.9 36.5 72.3 22.2
## 95 107.4 14.0 10.9 11.5
## 96 163.3 31.6 52.9 16.9
## 97 197.6 3.5 5.9 11.7
## 98 184.9 21.0 22.0 15.5
## 99 289.7 42.3 51.2 25.4
## 100 135.2 41.7 45.9 17.2
## 101 222.4 4.3 49.8 11.7
## 102 296.4 36.3 100.9 23.8
## 103 280.2 10.1 21.4 14.8
## 104 187.9 17.2 17.9 14.7
## 105 238.2 34.3 5.3 20.7
## 106 137.9 46.4 59.0 19.2
## 107 25.0 11.0 29.7 7.2
## 108 90.4 0.3 23.2 8.7
## 109 13.1 0.4 25.6 5.3
## 110 255.4 26.9 5.5 19.8
## 111 225.8 8.2 56.5 13.4
## 112 241.7 38.0 23.2 21.8
## 113 175.7 15.4 2.4 14.1
## 114 209.6 20.6 10.7 15.9
## 115 78.2 46.8 34.5 14.6
## 116 75.1 35.0 52.7 12.6
## 117 139.2 14.3 25.6 12.2
## 118 76.4 0.8 14.8 9.4
## 119 125.7 36.9 79.2 15.9
## 120 19.4 16.0 22.3 6.6
## 121 141.3 26.8 46.2 15.5
## 122 18.8 21.7 50.4 7.0
## 123 224.0 2.4 15.6 11.6
## 124 123.1 34.6 12.4 15.2
## 125 229.5 32.3 74.2 19.7
## 126 87.2 11.8 25.9 10.6
## 127 7.8 38.9 50.6 6.6
## 128 80.2 0.0 9.2 8.8
## 129 220.3 49.0 3.2 24.7
## 130 59.6 12.0 43.1 9.7
## 131 0.7 39.6 8.7 1.6
## 132 265.2 2.9 43.0 12.7
## 133 8.4 27.2 2.1 5.7
## 134 219.8 33.5 45.1 19.6
## 135 36.9 38.6 65.6 10.8
## 136 48.3 47.0 8.5 11.6
## 137 25.6 39.0 9.3 9.5
## 138 273.7 28.9 59.7 20.8
## 139 43.0 25.9 20.5 9.6
## 140 184.9 43.9 1.7 20.7
## 141 73.4 17.0 12.9 10.9
## 142 193.7 35.4 75.6 19.2
## 143 220.5 33.2 37.9 20.1
## 144 104.6 5.7 34.4 10.4
## 145 96.2 14.8 38.9 11.4
## 146 140.3 1.9 9.0 10.3
## 147 240.1 7.3 8.7 13.2
## 148 243.2 49.0 44.3 25.4
## 149 38.0 40.3 11.9 10.9
## 150 44.7 25.8 20.6 10.1
## 151 280.7 13.9 37.0 16.1
## 152 121.0 8.4 48.7 11.6
## 153 197.6 23.3 14.2 16.6
## 154 171.3 39.7 37.7 19.0
## 155 187.8 21.1 9.5 15.6
## 156 4.1 11.6 5.7 3.2
## 157 93.9 43.5 50.5 15.3
## 158 149.8 1.3 24.3 10.1
## 159 11.7 36.9 45.2 7.3
## 160 131.7 18.4 34.6 12.9
## 161 172.5 18.1 30.7 14.4
## 162 85.7 35.8 49.3 13.3
## 163 188.4 18.1 25.6 14.9
## 164 163.5 36.8 7.4 18.0
## 165 117.2 14.7 5.4 11.9
## 166 234.5 3.4 84.8 11.9
## 167 17.9 37.6 21.6 8.0
## 168 206.8 5.2 19.4 12.2
## 169 215.4 23.6 57.6 17.1
## 170 284.3 10.6 6.4 15.0
## 171 50.0 11.6 18.4 8.4
## 172 164.5 20.9 47.4 14.5
## 173 19.6 20.1 17.0 7.6
## 174 168.4 7.1 12.8 11.7
## 175 222.4 3.4 13.1 11.5
## 176 276.9 48.9 41.8 27.0
## 177 248.4 30.2 20.3 20.2
## 178 170.2 7.8 35.2 11.7
## 179 276.7 2.3 23.7 11.8
## 180 165.6 10.0 17.6 12.6
## 181 156.6 2.6 8.3 10.5
## 182 218.5 5.4 27.4 12.2
## 183 56.2 5.7 29.7 8.7
## 184 287.6 43.0 71.8 26.2
## 185 253.8 21.3 30.0 17.6
## 186 205.0 45.1 19.6 22.6
## 187 139.5 2.1 26.6 10.3
## 188 191.1 28.7 18.2 17.3
## 189 286.0 13.9 3.7 15.9
## 190 18.7 12.1 23.4 6.7
## 191 39.5 41.1 5.8 10.8
## 192 75.5 10.8 6.0 9.9
## 193 17.2 4.1 31.6 5.9
## 194 166.8 42.0 3.6 19.6
## 195 149.7 35.6 6.0 17.3
## 196 38.2 3.7 13.8 7.6
## 197 94.2 4.9 8.1 9.7
## 198 177.0 9.3 6.4 12.8
## 199 283.6 42.0 66.2 25.5
## 200 232.1 8.6 8.7 13.4
cov_mat1 <- cov(advertising_df)
cov_mat1
## TV radio newspaper sales
## TV 7370.94989 69.86249 105.91945 350.39019
## radio 69.86249 220.42774 114.49698 44.63569
## newspaper 105.91945 114.49698 474.30833 25.94139
## sales 350.39019 44.63569 25.94139 27.22185
Regarding Covariance vs. Strength (The Scale Problem): “While the covariance matrix indicates a positive directional relationship between the advertising budgets (TV, radio, newspaper) and sales, it cannot measure the true strength of these associations. Because covariance is highly dependent on the raw scale of the variables, the unstandardized values cannot be directly compared to determine which advertising medium is the most effective.”
Regarding Multicollinearity: “Furthermore, by inspecting the predictor-to-predictor relationships, we can observe a positive covariance between the radio and newspaper budgets. This internal association between independent variables introduces the risk of multicollinearity, which can inflate the variance of our estimators and make it difficult to isolate their individual effects on sales.
cov2cor(cov_mat1)
## TV radio newspaper sales
## TV 1.00000000 0.05480866 0.05664787 0.7822244
## radio 0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales 0.78222442 0.57622257 0.22829903 1.0000000
correlation is just the standized version of covariance and it tells us the relationship and strenghth between two variables .
1 strong perfect positive relationship 0 no relationship -1 strong negative relationship if you could see from the above tv and sales have the strong corelation
advertising_df
## TV radio newspaper sales
## 1 230.1 37.8 69.2 22.1
## 2 44.5 39.3 45.1 10.4
## 3 17.2 45.9 69.3 9.3
## 4 151.5 41.3 58.5 18.5
## 5 180.8 10.8 58.4 12.9
## 6 8.7 48.9 75.0 7.2
## 7 57.5 32.8 23.5 11.8
## 8 120.2 19.6 11.6 13.2
## 9 8.6 2.1 1.0 4.8
## 10 199.8 2.6 21.2 10.6
## 11 66.1 5.8 24.2 8.6
## 12 214.7 24.0 4.0 17.4
## 13 23.8 35.1 65.9 9.2
## 14 97.5 7.6 7.2 9.7
## 15 204.1 32.9 46.0 19.0
## 16 195.4 47.7 52.9 22.4
## 17 67.8 36.6 114.0 12.5
## 18 281.4 39.6 55.8 24.4
## 19 69.2 20.5 18.3 11.3
## 20 147.3 23.9 19.1 14.6
## 21 218.4 27.7 53.4 18.0
## 22 237.4 5.1 23.5 12.5
## 23 13.2 15.9 49.6 5.6
## 24 228.3 16.9 26.2 15.5
## 25 62.3 12.6 18.3 9.7
## 26 262.9 3.5 19.5 12.0
## 27 142.9 29.3 12.6 15.0
## 28 240.1 16.7 22.9 15.9
## 29 248.8 27.1 22.9 18.9
## 30 70.6 16.0 40.8 10.5
## 31 292.9 28.3 43.2 21.4
## 32 112.9 17.4 38.6 11.9
## 33 97.2 1.5 30.0 9.6
## 34 265.6 20.0 0.3 17.4
## 35 95.7 1.4 7.4 9.5
## 36 290.7 4.1 8.5 12.8
## 37 266.9 43.8 5.0 25.4
## 38 74.7 49.4 45.7 14.7
## 39 43.1 26.7 35.1 10.1
## 40 228.0 37.7 32.0 21.5
## 41 202.5 22.3 31.6 16.6
## 42 177.0 33.4 38.7 17.1
## 43 293.6 27.7 1.8 20.7
## 44 206.9 8.4 26.4 12.9
## 45 25.1 25.7 43.3 8.5
## 46 175.1 22.5 31.5 14.9
## 47 89.7 9.9 35.7 10.6
## 48 239.9 41.5 18.5 23.2
## 49 227.2 15.8 49.9 14.8
## 50 66.9 11.7 36.8 9.7
## 51 199.8 3.1 34.6 11.4
## 52 100.4 9.6 3.6 10.7
## 53 216.4 41.7 39.6 22.6
## 54 182.6 46.2 58.7 21.2
## 55 262.7 28.8 15.9 20.2
## 56 198.9 49.4 60.0 23.7
## 57 7.3 28.1 41.4 5.5
## 58 136.2 19.2 16.6 13.2
## 59 210.8 49.6 37.7 23.8
## 60 210.7 29.5 9.3 18.4
## 61 53.5 2.0 21.4 8.1
## 62 261.3 42.7 54.7 24.2
## 63 239.3 15.5 27.3 15.7
## 64 102.7 29.6 8.4 14.0
## 65 131.1 42.8 28.9 18.0
## 66 69.0 9.3 0.9 9.3
## 67 31.5 24.6 2.2 9.5
## 68 139.3 14.5 10.2 13.4
## 69 237.4 27.5 11.0 18.9
## 70 216.8 43.9 27.2 22.3
## 71 199.1 30.6 38.7 18.3
## 72 109.8 14.3 31.7 12.4
## 73 26.8 33.0 19.3 8.8
## 74 129.4 5.7 31.3 11.0
## 75 213.4 24.6 13.1 17.0
## 76 16.9 43.7 89.4 8.7
## 77 27.5 1.6 20.7 6.9
## 78 120.5 28.5 14.2 14.2
## 79 5.4 29.9 9.4 5.3
## 80 116.0 7.7 23.1 11.0
## 81 76.4 26.7 22.3 11.8
## 82 239.8 4.1 36.9 12.3
## 83 75.3 20.3 32.5 11.3
## 84 68.4 44.5 35.6 13.6
## 85 213.5 43.0 33.8 21.7
## 86 193.2 18.4 65.7 15.2
## 87 76.3 27.5 16.0 12.0
## 88 110.7 40.6 63.2 16.0
## 89 88.3 25.5 73.4 12.9
## 90 109.8 47.8 51.4 16.7
## 91 134.3 4.9 9.3 11.2
## 92 28.6 1.5 33.0 7.3
## 93 217.7 33.5 59.0 19.4
## 94 250.9 36.5 72.3 22.2
## 95 107.4 14.0 10.9 11.5
## 96 163.3 31.6 52.9 16.9
## 97 197.6 3.5 5.9 11.7
## 98 184.9 21.0 22.0 15.5
## 99 289.7 42.3 51.2 25.4
## 100 135.2 41.7 45.9 17.2
## 101 222.4 4.3 49.8 11.7
## 102 296.4 36.3 100.9 23.8
## 103 280.2 10.1 21.4 14.8
## 104 187.9 17.2 17.9 14.7
## 105 238.2 34.3 5.3 20.7
## 106 137.9 46.4 59.0 19.2
## 107 25.0 11.0 29.7 7.2
## 108 90.4 0.3 23.2 8.7
## 109 13.1 0.4 25.6 5.3
## 110 255.4 26.9 5.5 19.8
## 111 225.8 8.2 56.5 13.4
## 112 241.7 38.0 23.2 21.8
## 113 175.7 15.4 2.4 14.1
## 114 209.6 20.6 10.7 15.9
## 115 78.2 46.8 34.5 14.6
## 116 75.1 35.0 52.7 12.6
## 117 139.2 14.3 25.6 12.2
## 118 76.4 0.8 14.8 9.4
## 119 125.7 36.9 79.2 15.9
## 120 19.4 16.0 22.3 6.6
## 121 141.3 26.8 46.2 15.5
## 122 18.8 21.7 50.4 7.0
## 123 224.0 2.4 15.6 11.6
## 124 123.1 34.6 12.4 15.2
## 125 229.5 32.3 74.2 19.7
## 126 87.2 11.8 25.9 10.6
## 127 7.8 38.9 50.6 6.6
## 128 80.2 0.0 9.2 8.8
## 129 220.3 49.0 3.2 24.7
## 130 59.6 12.0 43.1 9.7
## 131 0.7 39.6 8.7 1.6
## 132 265.2 2.9 43.0 12.7
## 133 8.4 27.2 2.1 5.7
## 134 219.8 33.5 45.1 19.6
## 135 36.9 38.6 65.6 10.8
## 136 48.3 47.0 8.5 11.6
## 137 25.6 39.0 9.3 9.5
## 138 273.7 28.9 59.7 20.8
## 139 43.0 25.9 20.5 9.6
## 140 184.9 43.9 1.7 20.7
## 141 73.4 17.0 12.9 10.9
## 142 193.7 35.4 75.6 19.2
## 143 220.5 33.2 37.9 20.1
## 144 104.6 5.7 34.4 10.4
## 145 96.2 14.8 38.9 11.4
## 146 140.3 1.9 9.0 10.3
## 147 240.1 7.3 8.7 13.2
## 148 243.2 49.0 44.3 25.4
## 149 38.0 40.3 11.9 10.9
## 150 44.7 25.8 20.6 10.1
## 151 280.7 13.9 37.0 16.1
## 152 121.0 8.4 48.7 11.6
## 153 197.6 23.3 14.2 16.6
## 154 171.3 39.7 37.7 19.0
## 155 187.8 21.1 9.5 15.6
## 156 4.1 11.6 5.7 3.2
## 157 93.9 43.5 50.5 15.3
## 158 149.8 1.3 24.3 10.1
## 159 11.7 36.9 45.2 7.3
## 160 131.7 18.4 34.6 12.9
## 161 172.5 18.1 30.7 14.4
## 162 85.7 35.8 49.3 13.3
## 163 188.4 18.1 25.6 14.9
## 164 163.5 36.8 7.4 18.0
## 165 117.2 14.7 5.4 11.9
## 166 234.5 3.4 84.8 11.9
## 167 17.9 37.6 21.6 8.0
## 168 206.8 5.2 19.4 12.2
## 169 215.4 23.6 57.6 17.1
## 170 284.3 10.6 6.4 15.0
## 171 50.0 11.6 18.4 8.4
## 172 164.5 20.9 47.4 14.5
## 173 19.6 20.1 17.0 7.6
## 174 168.4 7.1 12.8 11.7
## 175 222.4 3.4 13.1 11.5
## 176 276.9 48.9 41.8 27.0
## 177 248.4 30.2 20.3 20.2
## 178 170.2 7.8 35.2 11.7
## 179 276.7 2.3 23.7 11.8
## 180 165.6 10.0 17.6 12.6
## 181 156.6 2.6 8.3 10.5
## 182 218.5 5.4 27.4 12.2
## 183 56.2 5.7 29.7 8.7
## 184 287.6 43.0 71.8 26.2
## 185 253.8 21.3 30.0 17.6
## 186 205.0 45.1 19.6 22.6
## 187 139.5 2.1 26.6 10.3
## 188 191.1 28.7 18.2 17.3
## 189 286.0 13.9 3.7 15.9
## 190 18.7 12.1 23.4 6.7
## 191 39.5 41.1 5.8 10.8
## 192 75.5 10.8 6.0 9.9
## 193 17.2 4.1 31.6 5.9
## 194 166.8 42.0 3.6 19.6
## 195 149.7 35.6 6.0 17.3
## 196 38.2 3.7 13.8 7.6
## 197 94.2 4.9 8.1 9.7
## 198 177.0 9.3 6.4 12.8
## 199 283.6 42.0 66.2 25.5
## 200 232.1 8.6 8.7 13.4
n <- nrow(advertising_df)
n
## [1] 200
names <- variable.names(advertising_df)
names
## [1] "TV" "radio" "newspaper" "sales"
multi_model <- lm(sales ~ .-X , data = advertising)
summary(multi_model)
##
## Call:
## lm(formula = sales ~ . - X, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
According to the F-statistic, the p-value is very small, which means we reject the global null hypothesis; at least one of the predictors provides significant information about the model. If you look at the p-values for TV and radio, they are less than the 0.05 threshold, meaning they are statistically significant. Conversely, the p-value for newspaper is very high, indicating that it is not statistically significant and provides no useful information. Finally, the Adjusted \(R^2\) explains that 89.6% of the total variance is accounted for by the model.The median of the residuals is close to zero, which suggests that the model’s errors are centered properly and do not show significant bias.The median of the residuals is close to zero, which suggests that the model’s errors are centered properly and do not show significant bias. As the newspaper is useless we can easily remove it to improve our model without losing the model predictive power.
e <- residuals(multi_model)
RSS <- sum(e^2)
TSS <- var(advertising_df$sales) * n-1
r.square <- 1 - RSS/TSS
r.square
## [1] 0.8977058
multi_new <- update(multi_model, ~ . -newspaper)
summary(multi_new)
##
## Call:
## lm(formula = sales ~ TV + radio, data = advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7977 -0.8752 0.2422 1.1708 2.8328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.92110 0.29449 9.919 <2e-16 ***
## TV 0.04575 0.00139 32.909 <2e-16 ***
## radio 0.18799 0.00804 23.382 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8962
## F-statistic: 859.6 on 2 and 197 DF, p-value: < 2.2e-16
Before removing the variable we will use F test to do the comparison between the two models to be sure that if the variable is actually useless or not .
F TEST FOR MODEL COMPARISON
e1 <- residuals(multi_new)
RSS0 <- sum(e1^2)
a <- (RSS0 - RSS)
b <- RSS/(n-4)
F_distribution <- a/b
F_distribution
## [1] 0.03122805
#pvalue of f distribution
f_pvalue <- 1 - pf(F_distribution, df1 = 1, df2 = n-4)
f_pvalue
## [1] 0.8599151
Because the p-value for the newspaper variable is very high, it is not statistically significant. This means it provides no unique or useful information to the model. Therefore, we can safely remove this predictor without losing any valuable information, which helps us create a simpler and more efficient model.
cat("r square full: ", summary(multi_model)$r.squared)
## r square full: 0.8972106
cat("r square reduce: ", summary(multi_new)$r.squared)
## r square reduce: 0.8971943
cat("AIC FULL :", AIC(multi_model))
## AIC FULL : 782.3622
cat("AIC REDUCE :", AIC(multi_new))
## AIC REDUCE : 780.3941
cat("Adj R-squared (Full): ", summary(multi_model)$adj.r.squared, "\n")
## Adj R-squared (Full): 0.8956373
cat("Adj R-squared (NEW): ", summary(multi_new)$adj.r.squared, "\n")
## Adj R-squared (NEW): 0.8961505
I chose the reduced model because it is more efficient than the full model. The Adjusted \(R^2\) increased, which confirms that the removed variable was not contributing to the model’s accuracy, and the lower AIC indicates that the model is now more parsimonious—providing the best balance of accuracy and simplicity.
Interaction
model_int <- lm(sales ~ TV*radio, data = advertising_df)
summary(model_int)
##
## Call:
## lm(formula = sales ~ TV * radio, data = advertising_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
The interaction term between TV and radio is statistically significant, as indicated by a p-value of less than 0.05. This reveals a positive synergy: the effect of TV advertising on sales is enhanced when the radio budget is also increased. Furthermore, the increase in Adjusted \(R^2\) confirms that this interaction term adds meaningful explanatory power to the model, making it a better fit than the model with only main effects.
“The TV and radio interaction is significant (p < 0.05). This means that these two ads work better together than they do separately. Because the Adjusted \(R^2\) went up, we know that including this interaction makes our model better at predicting sales
PREDICTION
simple.model <- lm(sales ~ TV, data = advertising_df)
n.new <- data.frame(TV = 170)
predict(simple.model , newdata = n.new , interval = "confidence")
## fit lwr upr
## 1 15.11382 14.64338 15.58426
predict(simple.model , newdata = n.new , interval = "prediction")
## fit lwr upr
## 1 15.11382 8.670498 21.55715
“The confidence interval is narrower, as it only reflects the uncertainty of the model’s estimate of the average. Conversely, the prediction interval is wider because it reflects both the uncertainty of the model and the random noise (the error term) associated with a single new observation.”
plot(advertising_df$TV,advertising_df$sales , pch = 20, col = "grey")
abline(simple.model, col = "blue")
points(170,predict(simple.model,newdata = n.new)[1], pch =4,cex = 2, lwd = 3)
Assignment n.2 Multiple linear regression 2026-03-25 The Auto dataset
The ISLR2 library contains the Auto data set, which records 392
observations on the following 8 variables: mpg (miles per gallon),
cylinders, engine displacement (cu. inches), horsepower, vehicle weight
(lbs.), acceleration, that is time to accelerate from 0 to 60 mph
(sec.), model year (modulo 100), and origin of car (1. American, 2.
European, 3. Japanese).
Load the Auto dataset and remove the column name.
Fit the linear regression model of mpg on all the remaining variables in the dataset.
Provide an interpretation of the coefficient of the variable origin.
Initially, remove origin from the model, fit the full model, and provide an interpretation of the sign of the coefficients of the variables that are significantly different from zero.
Use the function update() to fit the reduced model containing only the variables that are significantly different from zero in the full model. Compare both the Multiple R2 and the Adjusted R2 between the full and the reduced model.
Compute, explicitly, the F-test for the comparison of the two models and compare your results with those you can get by applying the anova() function.
Define the variable origin as categorical and fit the full model again and provide an interpretation of the origin parameters.
Apply a one-way Analysis of Variance (ANOVA) to compare the mean efficiency of cars across the three origins (ignoring other variables for this specific test). Include a post-hoc analysis based on the Tukey HSD method. Comment on the results.
Show how the ANOVA test computed with the aov() function can also be computed with the lm() function.
Include an interaction term between origin and weight. Fit the model and comment on the output.
Load the Auto dataset and remove the column name.
library("ISLR2")
data("Auto")
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
Auto_df <- Auto[, 1:8]
Auto_df
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18.0 8 307.0 130 3504 12.0 70 1
## 2 15.0 8 350.0 165 3693 11.5 70 1
## 3 18.0 8 318.0 150 3436 11.0 70 1
## 4 16.0 8 304.0 150 3433 12.0 70 1
## 5 17.0 8 302.0 140 3449 10.5 70 1
## 6 15.0 8 429.0 198 4341 10.0 70 1
## 7 14.0 8 454.0 220 4354 9.0 70 1
## 8 14.0 8 440.0 215 4312 8.5 70 1
## 9 14.0 8 455.0 225 4425 10.0 70 1
## 10 15.0 8 390.0 190 3850 8.5 70 1
## 11 15.0 8 383.0 170 3563 10.0 70 1
## 12 14.0 8 340.0 160 3609 8.0 70 1
## 13 15.0 8 400.0 150 3761 9.5 70 1
## 14 14.0 8 455.0 225 3086 10.0 70 1
## 15 24.0 4 113.0 95 2372 15.0 70 3
## 16 22.0 6 198.0 95 2833 15.5 70 1
## 17 18.0 6 199.0 97 2774 15.5 70 1
## 18 21.0 6 200.0 85 2587 16.0 70 1
## 19 27.0 4 97.0 88 2130 14.5 70 3
## 20 26.0 4 97.0 46 1835 20.5 70 2
## 21 25.0 4 110.0 87 2672 17.5 70 2
## 22 24.0 4 107.0 90 2430 14.5 70 2
## 23 25.0 4 104.0 95 2375 17.5 70 2
## 24 26.0 4 121.0 113 2234 12.5 70 2
## 25 21.0 6 199.0 90 2648 15.0 70 1
## 26 10.0 8 360.0 215 4615 14.0 70 1
## 27 10.0 8 307.0 200 4376 15.0 70 1
## 28 11.0 8 318.0 210 4382 13.5 70 1
## 29 9.0 8 304.0 193 4732 18.5 70 1
## 30 27.0 4 97.0 88 2130 14.5 71 3
## 31 28.0 4 140.0 90 2264 15.5 71 1
## 32 25.0 4 113.0 95 2228 14.0 71 3
## 34 19.0 6 232.0 100 2634 13.0 71 1
## 35 16.0 6 225.0 105 3439 15.5 71 1
## 36 17.0 6 250.0 100 3329 15.5 71 1
## 37 19.0 6 250.0 88 3302 15.5 71 1
## 38 18.0 6 232.0 100 3288 15.5 71 1
## 39 14.0 8 350.0 165 4209 12.0 71 1
## 40 14.0 8 400.0 175 4464 11.5 71 1
## 41 14.0 8 351.0 153 4154 13.5 71 1
## 42 14.0 8 318.0 150 4096 13.0 71 1
## 43 12.0 8 383.0 180 4955 11.5 71 1
## 44 13.0 8 400.0 170 4746 12.0 71 1
## 45 13.0 8 400.0 175 5140 12.0 71 1
## 46 18.0 6 258.0 110 2962 13.5 71 1
## 47 22.0 4 140.0 72 2408 19.0 71 1
## 48 19.0 6 250.0 100 3282 15.0 71 1
## 49 18.0 6 250.0 88 3139 14.5 71 1
## 50 23.0 4 122.0 86 2220 14.0 71 1
## 51 28.0 4 116.0 90 2123 14.0 71 2
## 52 30.0 4 79.0 70 2074 19.5 71 2
## 53 30.0 4 88.0 76 2065 14.5 71 2
## 54 31.0 4 71.0 65 1773 19.0 71 3
## 55 35.0 4 72.0 69 1613 18.0 71 3
## 56 27.0 4 97.0 60 1834 19.0 71 2
## 57 26.0 4 91.0 70 1955 20.5 71 1
## 58 24.0 4 113.0 95 2278 15.5 72 3
## 59 25.0 4 97.5 80 2126 17.0 72 1
## 60 23.0 4 97.0 54 2254 23.5 72 2
## 61 20.0 4 140.0 90 2408 19.5 72 1
## 62 21.0 4 122.0 86 2226 16.5 72 1
## 63 13.0 8 350.0 165 4274 12.0 72 1
## 64 14.0 8 400.0 175 4385 12.0 72 1
## 65 15.0 8 318.0 150 4135 13.5 72 1
## 66 14.0 8 351.0 153 4129 13.0 72 1
## 67 17.0 8 304.0 150 3672 11.5 72 1
## 68 11.0 8 429.0 208 4633 11.0 72 1
## 69 13.0 8 350.0 155 4502 13.5 72 1
## 70 12.0 8 350.0 160 4456 13.5 72 1
## 71 13.0 8 400.0 190 4422 12.5 72 1
## 72 19.0 3 70.0 97 2330 13.5 72 3
## 73 15.0 8 304.0 150 3892 12.5 72 1
## 74 13.0 8 307.0 130 4098 14.0 72 1
## 75 13.0 8 302.0 140 4294 16.0 72 1
## 76 14.0 8 318.0 150 4077 14.0 72 1
## 77 18.0 4 121.0 112 2933 14.5 72 2
## 78 22.0 4 121.0 76 2511 18.0 72 2
## 79 21.0 4 120.0 87 2979 19.5 72 2
## 80 26.0 4 96.0 69 2189 18.0 72 2
## 81 22.0 4 122.0 86 2395 16.0 72 1
## 82 28.0 4 97.0 92 2288 17.0 72 3
## 83 23.0 4 120.0 97 2506 14.5 72 3
## 84 28.0 4 98.0 80 2164 15.0 72 1
## 85 27.0 4 97.0 88 2100 16.5 72 3
## 86 13.0 8 350.0 175 4100 13.0 73 1
## 87 14.0 8 304.0 150 3672 11.5 73 1
## 88 13.0 8 350.0 145 3988 13.0 73 1
## 89 14.0 8 302.0 137 4042 14.5 73 1
## 90 15.0 8 318.0 150 3777 12.5 73 1
## 91 12.0 8 429.0 198 4952 11.5 73 1
## 92 13.0 8 400.0 150 4464 12.0 73 1
## 93 13.0 8 351.0 158 4363 13.0 73 1
## 94 14.0 8 318.0 150 4237 14.5 73 1
## 95 13.0 8 440.0 215 4735 11.0 73 1
## 96 12.0 8 455.0 225 4951 11.0 73 1
## 97 13.0 8 360.0 175 3821 11.0 73 1
## 98 18.0 6 225.0 105 3121 16.5 73 1
## 99 16.0 6 250.0 100 3278 18.0 73 1
## 100 18.0 6 232.0 100 2945 16.0 73 1
## 101 18.0 6 250.0 88 3021 16.5 73 1
## 102 23.0 6 198.0 95 2904 16.0 73 1
## 103 26.0 4 97.0 46 1950 21.0 73 2
## 104 11.0 8 400.0 150 4997 14.0 73 1
## 105 12.0 8 400.0 167 4906 12.5 73 1
## 106 13.0 8 360.0 170 4654 13.0 73 1
## 107 12.0 8 350.0 180 4499 12.5 73 1
## 108 18.0 6 232.0 100 2789 15.0 73 1
## 109 20.0 4 97.0 88 2279 19.0 73 3
## 110 21.0 4 140.0 72 2401 19.5 73 1
## 111 22.0 4 108.0 94 2379 16.5 73 3
## 112 18.0 3 70.0 90 2124 13.5 73 3
## 113 19.0 4 122.0 85 2310 18.5 73 1
## 114 21.0 6 155.0 107 2472 14.0 73 1
## 115 26.0 4 98.0 90 2265 15.5 73 2
## 116 15.0 8 350.0 145 4082 13.0 73 1
## 117 16.0 8 400.0 230 4278 9.5 73 1
## 118 29.0 4 68.0 49 1867 19.5 73 2
## 119 24.0 4 116.0 75 2158 15.5 73 2
## 120 20.0 4 114.0 91 2582 14.0 73 2
## 121 19.0 4 121.0 112 2868 15.5 73 2
## 122 15.0 8 318.0 150 3399 11.0 73 1
## 123 24.0 4 121.0 110 2660 14.0 73 2
## 124 20.0 6 156.0 122 2807 13.5 73 3
## 125 11.0 8 350.0 180 3664 11.0 73 1
## 126 20.0 6 198.0 95 3102 16.5 74 1
## 128 19.0 6 232.0 100 2901 16.0 74 1
## 129 15.0 6 250.0 100 3336 17.0 74 1
## 130 31.0 4 79.0 67 1950 19.0 74 3
## 131 26.0 4 122.0 80 2451 16.5 74 1
## 132 32.0 4 71.0 65 1836 21.0 74 3
## 133 25.0 4 140.0 75 2542 17.0 74 1
## 134 16.0 6 250.0 100 3781 17.0 74 1
## 135 16.0 6 258.0 110 3632 18.0 74 1
## 136 18.0 6 225.0 105 3613 16.5 74 1
## 137 16.0 8 302.0 140 4141 14.0 74 1
## 138 13.0 8 350.0 150 4699 14.5 74 1
## 139 14.0 8 318.0 150 4457 13.5 74 1
## 140 14.0 8 302.0 140 4638 16.0 74 1
## 141 14.0 8 304.0 150 4257 15.5 74 1
## 142 29.0 4 98.0 83 2219 16.5 74 2
## 143 26.0 4 79.0 67 1963 15.5 74 2
## 144 26.0 4 97.0 78 2300 14.5 74 2
## 145 31.0 4 76.0 52 1649 16.5 74 3
## 146 32.0 4 83.0 61 2003 19.0 74 3
## 147 28.0 4 90.0 75 2125 14.5 74 1
## 148 24.0 4 90.0 75 2108 15.5 74 2
## 149 26.0 4 116.0 75 2246 14.0 74 2
## 150 24.0 4 120.0 97 2489 15.0 74 3
## 151 26.0 4 108.0 93 2391 15.5 74 3
## 152 31.0 4 79.0 67 2000 16.0 74 2
## 153 19.0 6 225.0 95 3264 16.0 75 1
## 154 18.0 6 250.0 105 3459 16.0 75 1
## 155 15.0 6 250.0 72 3432 21.0 75 1
## 156 15.0 6 250.0 72 3158 19.5 75 1
## 157 16.0 8 400.0 170 4668 11.5 75 1
## 158 15.0 8 350.0 145 4440 14.0 75 1
## 159 16.0 8 318.0 150 4498 14.5 75 1
## 160 14.0 8 351.0 148 4657 13.5 75 1
## 161 17.0 6 231.0 110 3907 21.0 75 1
## 162 16.0 6 250.0 105 3897 18.5 75 1
## 163 15.0 6 258.0 110 3730 19.0 75 1
## 164 18.0 6 225.0 95 3785 19.0 75 1
## 165 21.0 6 231.0 110 3039 15.0 75 1
## 166 20.0 8 262.0 110 3221 13.5 75 1
## 167 13.0 8 302.0 129 3169 12.0 75 1
## 168 29.0 4 97.0 75 2171 16.0 75 3
## 169 23.0 4 140.0 83 2639 17.0 75 1
## 170 20.0 6 232.0 100 2914 16.0 75 1
## 171 23.0 4 140.0 78 2592 18.5 75 1
## 172 24.0 4 134.0 96 2702 13.5 75 3
## 173 25.0 4 90.0 71 2223 16.5 75 2
## 174 24.0 4 119.0 97 2545 17.0 75 3
## 175 18.0 6 171.0 97 2984 14.5 75 1
## 176 29.0 4 90.0 70 1937 14.0 75 2
## 177 19.0 6 232.0 90 3211 17.0 75 1
## 178 23.0 4 115.0 95 2694 15.0 75 2
## 179 23.0 4 120.0 88 2957 17.0 75 2
## 180 22.0 4 121.0 98 2945 14.5 75 2
## 181 25.0 4 121.0 115 2671 13.5 75 2
## 182 33.0 4 91.0 53 1795 17.5 75 3
## 183 28.0 4 107.0 86 2464 15.5 76 2
## 184 25.0 4 116.0 81 2220 16.9 76 2
## 185 25.0 4 140.0 92 2572 14.9 76 1
## 186 26.0 4 98.0 79 2255 17.7 76 1
## 187 27.0 4 101.0 83 2202 15.3 76 2
## 188 17.5 8 305.0 140 4215 13.0 76 1
## 189 16.0 8 318.0 150 4190 13.0 76 1
## 190 15.5 8 304.0 120 3962 13.9 76 1
## 191 14.5 8 351.0 152 4215 12.8 76 1
## 192 22.0 6 225.0 100 3233 15.4 76 1
## 193 22.0 6 250.0 105 3353 14.5 76 1
## 194 24.0 6 200.0 81 3012 17.6 76 1
## 195 22.5 6 232.0 90 3085 17.6 76 1
## 196 29.0 4 85.0 52 2035 22.2 76 1
## 197 24.5 4 98.0 60 2164 22.1 76 1
## 198 29.0 4 90.0 70 1937 14.2 76 2
## 199 33.0 4 91.0 53 1795 17.4 76 3
## 200 20.0 6 225.0 100 3651 17.7 76 1
## 201 18.0 6 250.0 78 3574 21.0 76 1
## 202 18.5 6 250.0 110 3645 16.2 76 1
## 203 17.5 6 258.0 95 3193 17.8 76 1
## 204 29.5 4 97.0 71 1825 12.2 76 2
## 205 32.0 4 85.0 70 1990 17.0 76 3
## 206 28.0 4 97.0 75 2155 16.4 76 3
## 207 26.5 4 140.0 72 2565 13.6 76 1
## 208 20.0 4 130.0 102 3150 15.7 76 2
## 209 13.0 8 318.0 150 3940 13.2 76 1
## 210 19.0 4 120.0 88 3270 21.9 76 2
## 211 19.0 6 156.0 108 2930 15.5 76 3
## 212 16.5 6 168.0 120 3820 16.7 76 2
## 213 16.5 8 350.0 180 4380 12.1 76 1
## 214 13.0 8 350.0 145 4055 12.0 76 1
## 215 13.0 8 302.0 130 3870 15.0 76 1
## 216 13.0 8 318.0 150 3755 14.0 76 1
## 217 31.5 4 98.0 68 2045 18.5 77 3
## 218 30.0 4 111.0 80 2155 14.8 77 1
## 219 36.0 4 79.0 58 1825 18.6 77 2
## 220 25.5 4 122.0 96 2300 15.5 77 1
## 221 33.5 4 85.0 70 1945 16.8 77 3
## 222 17.5 8 305.0 145 3880 12.5 77 1
## 223 17.0 8 260.0 110 4060 19.0 77 1
## 224 15.5 8 318.0 145 4140 13.7 77 1
## 225 15.0 8 302.0 130 4295 14.9 77 1
## 226 17.5 6 250.0 110 3520 16.4 77 1
## 227 20.5 6 231.0 105 3425 16.9 77 1
## 228 19.0 6 225.0 100 3630 17.7 77 1
## 229 18.5 6 250.0 98 3525 19.0 77 1
## 230 16.0 8 400.0 180 4220 11.1 77 1
## 231 15.5 8 350.0 170 4165 11.4 77 1
## 232 15.5 8 400.0 190 4325 12.2 77 1
## 233 16.0 8 351.0 149 4335 14.5 77 1
## 234 29.0 4 97.0 78 1940 14.5 77 2
## 235 24.5 4 151.0 88 2740 16.0 77 1
## 236 26.0 4 97.0 75 2265 18.2 77 3
## 237 25.5 4 140.0 89 2755 15.8 77 1
## 238 30.5 4 98.0 63 2051 17.0 77 1
## 239 33.5 4 98.0 83 2075 15.9 77 1
## 240 30.0 4 97.0 67 1985 16.4 77 3
## 241 30.5 4 97.0 78 2190 14.1 77 2
## 242 22.0 6 146.0 97 2815 14.5 77 3
## 243 21.5 4 121.0 110 2600 12.8 77 2
## 244 21.5 3 80.0 110 2720 13.5 77 3
## 245 43.1 4 90.0 48 1985 21.5 78 2
## 246 36.1 4 98.0 66 1800 14.4 78 1
## 247 32.8 4 78.0 52 1985 19.4 78 3
## 248 39.4 4 85.0 70 2070 18.6 78 3
## 249 36.1 4 91.0 60 1800 16.4 78 3
## 250 19.9 8 260.0 110 3365 15.5 78 1
## 251 19.4 8 318.0 140 3735 13.2 78 1
## 252 20.2 8 302.0 139 3570 12.8 78 1
## 253 19.2 6 231.0 105 3535 19.2 78 1
## 254 20.5 6 200.0 95 3155 18.2 78 1
## 255 20.2 6 200.0 85 2965 15.8 78 1
## 256 25.1 4 140.0 88 2720 15.4 78 1
## 257 20.5 6 225.0 100 3430 17.2 78 1
## 258 19.4 6 232.0 90 3210 17.2 78 1
## 259 20.6 6 231.0 105 3380 15.8 78 1
## 260 20.8 6 200.0 85 3070 16.7 78 1
## 261 18.6 6 225.0 110 3620 18.7 78 1
## 262 18.1 6 258.0 120 3410 15.1 78 1
## 263 19.2 8 305.0 145 3425 13.2 78 1
## 264 17.7 6 231.0 165 3445 13.4 78 1
## 265 18.1 8 302.0 139 3205 11.2 78 1
## 266 17.5 8 318.0 140 4080 13.7 78 1
## 267 30.0 4 98.0 68 2155 16.5 78 1
## 268 27.5 4 134.0 95 2560 14.2 78 3
## 269 27.2 4 119.0 97 2300 14.7 78 3
## 270 30.9 4 105.0 75 2230 14.5 78 1
## 271 21.1 4 134.0 95 2515 14.8 78 3
## 272 23.2 4 156.0 105 2745 16.7 78 1
## 273 23.8 4 151.0 85 2855 17.6 78 1
## 274 23.9 4 119.0 97 2405 14.9 78 3
## 275 20.3 5 131.0 103 2830 15.9 78 2
## 276 17.0 6 163.0 125 3140 13.6 78 2
## 277 21.6 4 121.0 115 2795 15.7 78 2
## 278 16.2 6 163.0 133 3410 15.8 78 2
## 279 31.5 4 89.0 71 1990 14.9 78 2
## 280 29.5 4 98.0 68 2135 16.6 78 3
## 281 21.5 6 231.0 115 3245 15.4 79 1
## 282 19.8 6 200.0 85 2990 18.2 79 1
## 283 22.3 4 140.0 88 2890 17.3 79 1
## 284 20.2 6 232.0 90 3265 18.2 79 1
## 285 20.6 6 225.0 110 3360 16.6 79 1
## 286 17.0 8 305.0 130 3840 15.4 79 1
## 287 17.6 8 302.0 129 3725 13.4 79 1
## 288 16.5 8 351.0 138 3955 13.2 79 1
## 289 18.2 8 318.0 135 3830 15.2 79 1
## 290 16.9 8 350.0 155 4360 14.9 79 1
## 291 15.5 8 351.0 142 4054 14.3 79 1
## 292 19.2 8 267.0 125 3605 15.0 79 1
## 293 18.5 8 360.0 150 3940 13.0 79 1
## 294 31.9 4 89.0 71 1925 14.0 79 2
## 295 34.1 4 86.0 65 1975 15.2 79 3
## 296 35.7 4 98.0 80 1915 14.4 79 1
## 297 27.4 4 121.0 80 2670 15.0 79 1
## 298 25.4 5 183.0 77 3530 20.1 79 2
## 299 23.0 8 350.0 125 3900 17.4 79 1
## 300 27.2 4 141.0 71 3190 24.8 79 2
## 301 23.9 8 260.0 90 3420 22.2 79 1
## 302 34.2 4 105.0 70 2200 13.2 79 1
## 303 34.5 4 105.0 70 2150 14.9 79 1
## 304 31.8 4 85.0 65 2020 19.2 79 3
## 305 37.3 4 91.0 69 2130 14.7 79 2
## 306 28.4 4 151.0 90 2670 16.0 79 1
## 307 28.8 6 173.0 115 2595 11.3 79 1
## 308 26.8 6 173.0 115 2700 12.9 79 1
## 309 33.5 4 151.0 90 2556 13.2 79 1
## 310 41.5 4 98.0 76 2144 14.7 80 2
## 311 38.1 4 89.0 60 1968 18.8 80 3
## 312 32.1 4 98.0 70 2120 15.5 80 1
## 313 37.2 4 86.0 65 2019 16.4 80 3
## 314 28.0 4 151.0 90 2678 16.5 80 1
## 315 26.4 4 140.0 88 2870 18.1 80 1
## 316 24.3 4 151.0 90 3003 20.1 80 1
## 317 19.1 6 225.0 90 3381 18.7 80 1
## 318 34.3 4 97.0 78 2188 15.8 80 2
## 319 29.8 4 134.0 90 2711 15.5 80 3
## 320 31.3 4 120.0 75 2542 17.5 80 3
## 321 37.0 4 119.0 92 2434 15.0 80 3
## 322 32.2 4 108.0 75 2265 15.2 80 3
## 323 46.6 4 86.0 65 2110 17.9 80 3
## 324 27.9 4 156.0 105 2800 14.4 80 1
## 325 40.8 4 85.0 65 2110 19.2 80 3
## 326 44.3 4 90.0 48 2085 21.7 80 2
## 327 43.4 4 90.0 48 2335 23.7 80 2
## 328 36.4 5 121.0 67 2950 19.9 80 2
## 329 30.0 4 146.0 67 3250 21.8 80 2
## 330 44.6 4 91.0 67 1850 13.8 80 3
## 332 33.8 4 97.0 67 2145 18.0 80 3
## 333 29.8 4 89.0 62 1845 15.3 80 2
## 334 32.7 6 168.0 132 2910 11.4 80 3
## 335 23.7 3 70.0 100 2420 12.5 80 3
## 336 35.0 4 122.0 88 2500 15.1 80 2
## 338 32.4 4 107.0 72 2290 17.0 80 3
## 339 27.2 4 135.0 84 2490 15.7 81 1
## 340 26.6 4 151.0 84 2635 16.4 81 1
## 341 25.8 4 156.0 92 2620 14.4 81 1
## 342 23.5 6 173.0 110 2725 12.6 81 1
## 343 30.0 4 135.0 84 2385 12.9 81 1
## 344 39.1 4 79.0 58 1755 16.9 81 3
## 345 39.0 4 86.0 64 1875 16.4 81 1
## 346 35.1 4 81.0 60 1760 16.1 81 3
## 347 32.3 4 97.0 67 2065 17.8 81 3
## 348 37.0 4 85.0 65 1975 19.4 81 3
## 349 37.7 4 89.0 62 2050 17.3 81 3
## 350 34.1 4 91.0 68 1985 16.0 81 3
## 351 34.7 4 105.0 63 2215 14.9 81 1
## 352 34.4 4 98.0 65 2045 16.2 81 1
## 353 29.9 4 98.0 65 2380 20.7 81 1
## 354 33.0 4 105.0 74 2190 14.2 81 2
## 356 33.7 4 107.0 75 2210 14.4 81 3
## 357 32.4 4 108.0 75 2350 16.8 81 3
## 358 32.9 4 119.0 100 2615 14.8 81 3
## 359 31.6 4 120.0 74 2635 18.3 81 3
## 360 28.1 4 141.0 80 3230 20.4 81 2
## 361 30.7 6 145.0 76 3160 19.6 81 2
## 362 25.4 6 168.0 116 2900 12.6 81 3
## 363 24.2 6 146.0 120 2930 13.8 81 3
## 364 22.4 6 231.0 110 3415 15.8 81 1
## 365 26.6 8 350.0 105 3725 19.0 81 1
## 366 20.2 6 200.0 88 3060 17.1 81 1
## 367 17.6 6 225.0 85 3465 16.6 81 1
## 368 28.0 4 112.0 88 2605 19.6 82 1
## 369 27.0 4 112.0 88 2640 18.6 82 1
## 370 34.0 4 112.0 88 2395 18.0 82 1
## 371 31.0 4 112.0 85 2575 16.2 82 1
## 372 29.0 4 135.0 84 2525 16.0 82 1
## 373 27.0 4 151.0 90 2735 18.0 82 1
## 374 24.0 4 140.0 92 2865 16.4 82 1
## 375 36.0 4 105.0 74 1980 15.3 82 2
## 376 37.0 4 91.0 68 2025 18.2 82 3
## 377 31.0 4 91.0 68 1970 17.6 82 3
## 378 38.0 4 105.0 63 2125 14.7 82 1
## 379 36.0 4 98.0 70 2125 17.3 82 1
## 380 36.0 4 120.0 88 2160 14.5 82 3
## 381 36.0 4 107.0 75 2205 14.5 82 3
## 382 34.0 4 108.0 70 2245 16.9 82 3
## 383 38.0 4 91.0 67 1965 15.0 82 3
## 384 32.0 4 91.0 67 1965 15.7 82 3
## 385 38.0 4 91.0 67 1995 16.2 82 3
## 386 25.0 6 181.0 110 2945 16.4 82 1
## 387 38.0 6 262.0 85 3015 17.0 82 1
## 388 26.0 4 156.0 92 2585 14.5 82 1
## 389 22.0 6 232.0 112 2835 14.7 82 1
## 390 32.0 4 144.0 96 2665 13.9 82 3
## 391 36.0 4 135.0 84 2370 13.0 82 1
## 392 27.0 4 151.0 90 2950 17.3 82 1
## 393 27.0 4 140.0 86 2790 15.6 82 1
## 394 44.0 4 97.0 52 2130 24.6 82 2
## 395 32.0 4 135.0 84 2295 11.6 82 1
## 396 28.0 4 120.0 79 2625 18.6 82 1
## 397 31.0 4 119.0 82 2720 19.4 82 1
auto_multi <- lm(mpg ~ ., data = Auto_df)
summary(auto_multi)
##
## Call:
## lm(formula = mpg ~ ., data = Auto_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
According to the F-statistic, the p-value is well below the threshold of 0.05, which leads us to reject the global null hypothesis; this indicates that at least one of the predictors is significantly related to mpg. Specifically, displacement, weight, year, and origin have small individual p-values, suggesting they are statistically significant contributors to the model. The model explains 82.2% of the variance in mpg (\(R^2 = 0.8215\)). Conversely, variables such as cylinders, horsepower, and acceleration do not show significant individual contributions when included alongside the other predictors.
Auto_df$origin
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 2 2 2 2 2 1 1 1 1 1 3 1 3 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 2 1 3 1 2 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1
## [75] 1 2 2 2 2 1 3 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 3 1 3 3
## [112] 1 1 2 1 1 2 2 2 2 1 2 3 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 2 2 2 3 3 1 2 2 3
## [149] 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 2 3 1 2 1 2 2 2 2 3 2 2 1 1 2
## [186] 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 2 3 3 1 2 1 2 3 2 1 1 1 1 3 1 2 1 3 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 2 1 3 1 1 1 3 2 3 2 3 2 1 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 3 3 1 3 1 1 3 2 2 2 2 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 2
## [297] 1 2 1 1 1 3 2 1 1 1 1 2 3 1 3 1 1 1 1 2 3 3 3 3 3 1 3 2 2 2 2 3 3 2 3 3 2
## [334] 3 1 1 1 1 1 3 1 3 3 3 3 3 1 1 1 2 3 3 3 3 2 2 3 3 1 1 1 1 1 1 1 1 1 1 1 2
## [371] 3 3 1 1 3 3 3 3 3 3 1 1 1 1 3 1 1 1 2 1 1 1
“Since origin is a discrete variable, we cannot interpret it as a numerical variable; therefore, we should convert it into a factor.”
Auto_df$origin <- as.factor(Auto_df$origin)
auto_multi <- lm(mpg ~ ., data = Auto_df)
summary(auto_multi)
##
## Call:
## lm(formula = mpg ~ ., data = Auto_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## origin2 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## origin3 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
“With Origin 1 serving as the reference group, vehicles from Origin 2 and Origin 3 demonstrate, on average, a 2.63 mpg and 2.85 mpg higher fuel efficiency, respectively, holding all other variables constant. Both coefficients are statistically significant (as indicated by their small p-values), confirming that these origins contribute significantly to the prediction of mpg compared to the baseline.”
full_model_auto <- lm(mpg~.-origin, data = Auto_df)
summary(full_model_auto)
##
## Call:
## lm(formula = mpg ~ . - origin, data = Auto_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6927 -2.3864 -0.0801 2.0291 14.3607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.454e+01 4.764e+00 -3.051 0.00244 **
## cylinders -3.299e-01 3.321e-01 -0.993 0.32122
## displacement 7.678e-03 7.358e-03 1.044 0.29733
## horsepower -3.914e-04 1.384e-02 -0.028 0.97745
## weight -6.795e-03 6.700e-04 -10.141 < 2e-16 ***
## acceleration 8.527e-02 1.020e-01 0.836 0.40383
## year 7.534e-01 5.262e-02 14.318 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.435 on 385 degrees of freedom
## Multiple R-squared: 0.8093, Adjusted R-squared: 0.8063
## F-statistic: 272.2 on 6 and 385 DF, p-value: < 2.2e-16
“In this model, weight and year are statistically significant predictors of mpg (p < 0.05). Conversely, predictors such as cylinders, displacement, and horsepower exhibit high p-values, suggesting that they do not provide statistically significant unique information in the presence of the other variables. This is likely due to multicollinearity, where these features share overlapping information regarding vehicle size and engine performance.”
auto_reduce <- update(full_model_auto,~.-cylinders-displacement-horsepower-acceleration )
summary(auto_reduce)
##
## Call:
## lm(formula = mpg ~ weight + year, data = Auto_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8505 -2.3014 -0.1167 2.0367 14.3555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.435e+01 4.007e+00 -3.581 0.000386 ***
## weight -6.632e-03 2.146e-04 -30.911 < 2e-16 ***
## year 7.573e-01 4.947e-02 15.308 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.427 on 389 degrees of freedom
## Multiple R-squared: 0.8082, Adjusted R-squared: 0.8072
## F-statistic: 819.5 on 2 and 389 DF, p-value: < 2.2e-16
cat( " r square auto (full):" , summary(full_model_auto)$r.squared)
## r square auto (full): 0.8092553
cat( " adj r square auto (full):" , summary(full_model_auto)$adj.r.squared)
## adj r square auto (full): 0.8062826
cat( " r square auto (reduce):" , summary(auto_reduce)$r.squared)
## r square auto (reduce): 0.8081803
cat( " adj r square auto (reduce):" , summary(auto_reduce)$adj.r.squared)
## adj r square auto (reduce): 0.8071941
ner <- nrow(Auto_df)
ner
## [1] 392
er <- residuals(full_model_auto)
RSSer <- sum(er^2)
TSSer <- var(Auto_df$mpg) * ((ner)-1)
R_squareAuto <- 1 - (RSSer/TSSer)
R_squareAuto
## [1] 0.8092553
ero <- residuals(auto_reduce)
RSSero <- sum(ero^2)
R_squareAutore <- 1 - (RSSero/TSSer)
R_squareAutore
## [1] 0.8081803
F_test_auto <- ((RSSero-RSSer)/4)/(RSSer/(ner-7))
F_test_auto
## [1] 0.5424377
pvalaf <- 1-pf(F_test_auto, df1 = 4, df2 = ner - 7)
pvalaf
## [1] 0.7046591
“The F-test resulted in an F-statistic of 0.542 with a p-value of 0.704. Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis. This indicates that the four removed predictors—cylinders, displacement, horsepower, and acceleration—do not contribute significantly to the explanation of mpg once weight and year are included in the model. Therefore, the reduced model is preferred for its simplicity and equivalent explanatory power.”
anova_auto <- anova(full_model_auto,auto_reduce)
anova_auto
## Analysis of Variance Table
##
## Model 1: mpg ~ (cylinders + displacement + horsepower + weight + acceleration +
## year + origin) - origin
## Model 2: mpg ~ weight + year
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 385 4543.3
## 2 389 4569.0 -4 -25.605 0.5424 0.7047
“To determine the most parsimonious model, we compared the full model (containing six predictors) against a reduced model (containing only weight and year). The full model yielded an \(R^2\) of 0.809, while the reduced model yielded an \(R^2\) of 0.808, indicating that the four removed predictors—cylinders, displacement, horsepower, and acceleration—contributed negligible unique information. This was statistically confirmed by an F-test, which resulted in an F-statistic of 0.542 and a p-value of 0.705. Since the p-value significantly exceeds the 0.05 threshold, we fail to reject the null hypothesis, concluding that the reduced model provides a fit equivalent to the full model. Consequently, the reduced model is preferred for its simplicity and equivalent predictive performance.”
# Define the object in the same chunk where you use it
anova_origin <- aov(mpg ~ origin, data = Auto_df)
summary(anova_origin)
## Df Sum Sq Mean Sq F value Pr(>F)
## origin 2 7904 3952 96.6 <2e-16 ***
## Residuals 389 15915 41
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_origin)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mpg ~ origin, data = Auto_df)
##
## $origin
## diff lwr upr p adj
## 2-1 7.569472 5.5068042 9.632139 0.0000000
## 3-1 10.417164 8.4701431 12.364184 0.0000000
## 3-2 2.847692 0.3583458 5.337038 0.0202502
“A one-way ANOVA was conducted to compare mpg across the three vehicle origins. The test returned a significant F-statistic (\(F(2, 389) = 96.6, p < 0.001\)), leading us to reject the null hypothesis of equal means. Subsequent post-hoc analysis using the Tukey HSD method revealed that all three groups differ significantly from one another (\(p_{adj} < 0.05\) for all pairs). Specifically, Origin 3 demonstrates the highest average efficiency, followed by Origin 2, with Origin 1 showing the lowest average efficiency.”
anova_origin <- aov(mpg ~ origin, data = Auto_df)
lm_origin <- lm(mpg ~ origin, data = Auto_df)
summary(lm_origin)
##
## Call:
## lm(formula = mpg ~ origin, data = Auto_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.451 -5.034 -1.034 3.649 18.966
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0335 0.4086 49.025 <2e-16 ***
## origin2 7.5695 0.8767 8.634 <2e-16 ***
## origin3 10.4172 0.8276 12.588 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.396 on 389 degrees of freedom
## Multiple R-squared: 0.3318, Adjusted R-squared: 0.3284
## F-statistic: 96.6 on 2 and 389 DF, p-value: < 2.2e-16
summary(anova_origin)
## Df Sum Sq Mean Sq F value Pr(>F)
## origin 2 7904 3952 96.6 <2e-16 ***
## Residuals 389 15915 41
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_int_auto <- lm(mpg ~ origin*weight, data = Auto_df )
summary(model_int_auto)
##
## Call:
## lm(formula = mpg ~ origin * weight, data = Auto_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4928 -2.7715 -0.3895 2.2397 15.5163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.315e+01 1.186e+00 36.378 < 2e-16 ***
## origin2 1.125e+00 2.878e+00 0.391 0.69616
## origin3 1.111e+01 3.574e+00 3.109 0.00202 **
## weight -6.854e-03 3.423e-04 -20.020 < 2e-16 ***
## origin2:weight 3.575e-06 1.111e-03 0.003 0.99743
## origin3:weight -3.865e-03 1.541e-03 -2.508 0.01255 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.253 on 386 degrees of freedom
## Multiple R-squared: 0.7068, Adjusted R-squared: 0.703
## F-statistic: 186.1 on 5 and 386 DF, p-value: < 2.2e-16
“The model including the interaction term origin * weight reveals a significant interaction effect for Origin 3 (p = 0.0125). This indicates that the relationship between vehicle weight and fuel efficiency is not uniform across all origins; specifically, weight has a stronger negative impact on the mpg of vehicles from Origin 3 compared to the baseline. Conversely, the interaction between Origin 2 and weight was found to be statistically non-significant (p = 0.997), suggesting that the effect of weight on mpg for Origin 2 is consistent with that of Origin 1.”
model_ad <- lm( mpg ~ horsepower , data = Auto_df)
summary(model_ad)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
plot(Auto_df$horsepower, Auto_df$mpg, pch = 20, lwd = 3, col = "grey")
abline(model_ad, col ="blue")
par(mfrow = c(2,2))
plot(model_ad)
“The diagnostic plots reveal potential issues with the linear model. The
Residuals vs. Fitted and Scale-Location plots exhibit a non-linear
curvature, indicating a failure to capture the true underlying
relationship between variables, likely requiring a transformation or a
non-linear term. The Q-Q plot shows deviations at the tails, suggesting
that the residuals are not normally distributed. Finally, while there is
no extreme influence evidenced by Cook’s distance, the presence of
specific labeled points in the Residuals vs. Leverage plot suggests the
existence of influential outliers. Together, these diagnostics imply
that the current linear model is insufficient, and further data
transformation or model refinement is recommended.”
solution of underfitting data transformation
model_ad_poly <- lm(mpg ~ poly(horsepower, degree = 2), data = Auto_df)
summary(model_ad_poly)
##
## Call:
## lm(formula = mpg ~ poly(horsepower, degree = 2), data = Auto_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2209 106.13 <2e-16 ***
## poly(horsepower, degree = 2)1 -120.1377 4.3739 -27.47 <2e-16 ***
## poly(horsepower, degree = 2)2 44.0895 4.3739 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
“The quadratic model is statistically significant overall, and both the linear and quadratic terms have p-values well below the 0.05 threshold. This indicates that there is a significant curved relationship between horsepower and the response variable. The model accounts for 68.6% of the variance in the data, and the residuals are well-centered around zero, suggesting the model is a robust fit.”
Does it make sense why we use degree = 2? It’s because the relationship between these two variables wasn’t just a straight line—it had a bend to it, and this model “bent” the regression line to match that curve perfectly.
v
plot(Auto_df$horsepower,Auto_df$mpg , pch = 20, col = "darkgrey")
xp <- seq(min(Auto_df$horsepower), max(Auto_df$horsepower), lenght = 100)
## Warning: In seq.default(min(Auto_df$horsepower), max(Auto_df$horsepower),
## lenght = 100) :
## extra argument 'lenght' will be disregarded
newdata <- data.frame(horsepower = xp)
lines( xp, predict(model_ad_poly, newdata = newdata), col = "darkblue")
par(mfrow = c(1,2))
plot(model_ad_poly)
Interpretation: The “Residuals vs Fitted” plot is now much flatter, and
the U-shape is gone. Adding the polynomial term significantly improved
our model’s fit!
However, if we look closely at the scale-location plot (standardized residuals), the assumption of homoscedasticity is not perfectly met. There is a slight funnel shape across the fitted values. Does this mean transforming the predictor wasn’t enough? Is there still unexplained variance that requires us to include even higher polynomial terms?
model_ad_poly_high <- lm(mpg ~ poly(horsepower, degree = 15), data = Auto_df)
summary(model_ad_poly_high)
##
## Call:
## lm(formula = mpg ~ poly(horsepower, degree = 15), data = Auto_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.659 -2.534 -0.129 2.145 13.873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2156 108.768 < 2e-16 ***
## poly(horsepower, degree = 15)1 -120.1377 4.2679 -28.149 < 2e-16 ***
## poly(horsepower, degree = 15)2 44.0895 4.2679 10.331 < 2e-16 ***
## poly(horsepower, degree = 15)3 -3.9488 4.2679 -0.925 0.35543
## poly(horsepower, degree = 15)4 -5.1878 4.2679 -1.216 0.22492
## poly(horsepower, degree = 15)5 13.2722 4.2679 3.110 0.00201 **
## poly(horsepower, degree = 15)6 -8.5462 4.2679 -2.002 0.04595 *
## poly(horsepower, degree = 15)7 7.9806 4.2679 1.870 0.06227 .
## poly(horsepower, degree = 15)8 2.1727 4.2679 0.509 0.61099
## poly(horsepower, degree = 15)9 -3.9182 4.2679 -0.918 0.35917
## poly(horsepower, degree = 15)10 -2.6146 4.2679 -0.613 0.54050
## poly(horsepower, degree = 15)11 3.5636 4.2679 0.835 0.40426
## poly(horsepower, degree = 15)12 1.1451 4.2679 0.268 0.78861
## poly(horsepower, degree = 15)13 0.6041 4.2679 0.142 0.88752
## poly(horsepower, degree = 15)14 -3.8267 4.2679 -0.897 0.37049
## poly(horsepower, degree = 15)15 13.4922 4.2679 3.161 0.00170 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.268 on 376 degrees of freedom
## Multiple R-squared: 0.7125, Adjusted R-squared: 0.701
## F-statistic: 62.11 on 15 and 376 DF, p-value: < 2.2e-16
“This model suffers from overfitting. By using a 15th-degree polynomial, the model has become overly complex and is capturing random noise rather than the underlying trend. The high number of insignificant p-values and the drop in Adjusted \(R^2\) confirm that a simpler model—likely a degree-2 polynomial—would be much more reliable and better at predicting future observations.”
plot(Auto_df$horsepower,Auto_df$mpg , pch = 20, col = "darkgrey")
xp0 <- seq(min(Auto_df$horsepower), max(Auto_df$horsepower), lenght = 100)
## Warning: In seq.default(min(Auto_df$horsepower), max(Auto_df$horsepower),
## lenght = 100) :
## extra argument 'lenght' will be disregarded
newdata <- data.frame(horsepower = xp0)
lines( xp0, predict(model_ad_poly_high, newdata = newdata), col = "darkblue")
plot(model_ad_poly_high)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Interpretation: Notice the wildly fluctuating red curve—it bends
unnaturally just to chase individual data points, particularly at the
boundaries. Can you imagine doing extrapolation with this model?
In the diagnostic plots, we see strange vertical striping patterns in the residuals and an extreme outlier/leverage point pulling heavily on the model. Clearly, arbitrarily increasing the polynomial degree is not the right way to fix our variance issues!
model_y <- lm(log(mpg) ~ poly(horsepower , degree = 2), data = Auto_df)
summary(model_y)
##
## Call:
## lm(formula = log(mpg) ~ poly(horsepower, degree = 2), data = Auto_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66460 -0.12041 0.00316 0.11349 0.66376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.098313 0.008908 347.820 < 2e-16 ***
## poly(horsepower, degree = 2)1 -5.581826 0.176366 -31.649 < 2e-16 ***
## poly(horsepower, degree = 2)2 1.397641 0.176366 7.925 2.44e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1764 on 389 degrees of freedom
## Multiple R-squared: 0.7324, Adjusted R-squared: 0.731
## F-statistic: 532.2 on 2 and 389 DF, p-value: < 2.2e-16
This model represents an optimal balance between accuracy and simplicity. By applying a log-transformation to the response variable and using a quadratic polynomial for the predictor, the model achieves a high Adjusted \(R^2\) of 0.731. The statistical significance of the quadratic term confirms that the relationship is non-linear, and the well-centered residuals indicate that the model is robust and reliable.
par(mfrow = c(2, 2))
plot(model_y)
The variance is now much better behaved, and the points on the normal Q-Q plot follow the theoretical line more closely.
Important: When you predict using a log-transformed model, the result (as well as the confidence intervals) will be in the log scale. You must use the exp() function to back-transform the predictions to the original scale (MPG).
Also note that even though our model includes a degree 2 polynomial for horsepower, we only need to provide the raw horsepower values to our newdata object. R automatically applies the correct polynomial transformation behind the scenes!
new.n <- data.frame(horsepower = 220)
prediction_level <- predict(model_y, newdata = new.n , interval = "confidence")
prediction_level
## fit lwr upr
## 1 2.526395 2.438396 2.614395
exp(prediction_level)
## fit lwr upr
## 1 12.50833 11.45465 13.65895
“I used the confidence interval because my primary goal was to estimate the expected average mpg for cars with 220 horsepower. This provides a precise estimate of the model’s average trend at that point. If I were predicting the efficiency of a single, specific vehicle, I would use a prediction interval to account for the additional variability in individual observations.”
Think of it like this:
Confidence Interval: “What is the average MPG for this group?” (Good for proving the model works).
Prediction Interval: “What is the MPG for this specific car?” (Good for real-world car buyers).
plot(Auto_df$horsepower, log(Auto_df$mpg), pch = 20 , col = "grey")
xp1 <- seq(min(Auto_df$horsepower), max(Auto_df$horsepower), length = 100)
# Instead of just the raw horsepower, include the poly syntax
lines(xp1, predict(model_y, newdata = data.frame(horsepower = xp1)), col = "darkred", lwd = 3)
“This scatter plot illustrates a negative, quadratic relationship
between vehicle horsepower and the log of miles per gallon. The use of a
log-transformation on the response variable helps to capture the
non-linear trend, and the red regression curve demonstrates that the
quadratic model provides a strong, well-centered fit to the observed
data, effectively capturing the diminishing effect that horsepower has
on fuel efficiency at higher levels.”
GLM
glm_auto <- glm(mpg ~ poly(horsepower, degree = 2), family = Gamma(link = "log"), data = Auto_df )
summary(glm_auto)
##
## Call:
## glm(formula = mpg ~ poly(horsepower, degree = 2), family = Gamma(link = "log"),
## data = Auto_df)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.113649 0.008929 348.696 < 2e-16 ***
## poly(horsepower, degree = 2)1 -5.631775 0.176793 -31.855 < 2e-16 ***
## poly(horsepower, degree = 2)2 1.405670 0.176793 7.951 2.03e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.03125591)
##
## Null deviance: 44.205 on 391 degrees of freedom
## Residual deviance: 12.023 on 389 degrees of freedom
## AIC: 2185.6
##
## Number of Fisher Scoring iterations: 4
The GLM demonstrates a significant improvement over the null model, as evidenced by the substantial reduction in deviance. All significant predictors (p < 0.05) exhibit a directional effect consistent with the theoretical relationship. The deviance residuals are symmetric and centered, and the AIC provides a benchmark for further model selection, confirming that this model is both statistically robust and an efficient fit for the data.This Gamma GLM is highly significant, as evidenced by the extreme reduction in deviance from the null model and the very low p-values for both the linear and quadratic horsepower terms. The significance of the quadratic term confirms a non-linear relationship. With an AIC of 2185.6 and a low dispersion parameter, the model provides a robust and well-fitting statistical representation of the relationship between horsepower and the response variable.
type = “link” returns the linear predictor (in this case, the predicted value on the log scale). type = “response” automatically applies the inverse-link function (the exponential) and returns the predicted expected value in the original scale (MPG). You do not need to manually compute exp()! # Create a sequence of horsepower values for a smooth curve
plot(Auto_df$horsepower, Auto_df$mpg, pch = 20, col = "lightblue" )
ne <- seq(min(Auto_df$horsepower), max(Auto_df$horsepower), length = 100 )
newD <- data.frame(horsepower = ne)
lines(ne, predict(glm_auto, newdata = newD, type = "response"), col = "purple", lwd = 3)
“This Gamma GLM is statistically robust, evidenced by the extreme reduction in deviance compared to the null model. Both the linear and quadratic terms for horsepower are highly significant, confirming a strong non-linear relationship between the predictor and the response. The low dispersion parameter and the resulting AIC indicate a well-fitting model that appropriately handles the positive, continuous nature of the data using a log link function.”
par(mfrow = c(2,2))
plot(glm_auto)
par(mfrow = c(1,1))
“While the inverse link is the canonical link for the Gamma distribution, I chose the log link because it provides more intuitive, interpretable results by modeling multiplicative effects rather than reciprocal ones. Furthermore, the log link offers superior numerical stability, particularly when predicting across a wider range of values, and prevents the risk of non-convergence associated with the inverse link.” “The diagnostic plots collectively indicate a well-fitting Gamma GLM. The Residuals vs. Fitted plot confirms the model structure is correctly specified, while the Normal Q-Q plot validates the choice of the Gamma distribution. The Scale-Location plot shows no evidence of non-constant variance, and the Residuals vs. Leverage plot confirms that the results are not unduly biased by influential outliers. Therefore, the model provides a robust and reliable statistical explanation of the data.”
“The diagnostic plots collectively indicate a well-fitting Gamma GLM. The Residuals vs. Fitted plot confirms the model structure is correctly specified, while the Normal Q-Q plot validates the choice of the Gamma distribution. The Scale-Location plot shows no evidence of non-constant variance, and the Residuals vs. Leverage plot confirms that the results are not unduly biased by influential outliers. Therefore, the model provides a robust and reliable statistical explanation of the data.”
Possion distribution
library("faraway")
data(gala)
gala
## Species Endemics Area Elevation Nearest Scruz Adjacent
## Baltra 58 23 25.09 346 0.6 0.6 1.84
## Bartolome 31 21 1.24 109 0.6 26.3 572.33
## Caldwell 3 3 0.21 114 2.8 58.7 0.78
## Champion 25 9 0.10 46 1.9 47.4 0.18
## Coamano 2 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 18 11 0.34 119 8.0 8.0 1.84
## Daphne.Minor 24 0 0.08 93 6.0 12.0 0.34
## Darwin 10 7 2.33 168 34.1 290.2 2.85
## Eden 8 4 0.03 71 0.4 0.4 17.95
## Enderby 2 2 0.18 112 2.6 50.2 0.10
## Espanola 97 26 58.27 198 1.1 88.3 0.57
## Fernandina 93 35 634.49 1494 4.3 95.3 4669.32
## Gardner1 58 17 0.57 49 1.1 93.1 58.27
## Gardner2 5 4 0.78 227 4.6 62.2 0.21
## Genovesa 40 19 17.35 76 47.4 92.2 129.49
## Isabela 347 89 4669.32 1707 0.7 28.1 634.49
## Marchena 51 23 129.49 343 29.1 85.9 59.56
## Onslow 2 2 0.01 25 3.3 45.9 0.10
## Pinta 104 37 59.56 777 29.1 119.6 129.49
## Pinzon 108 33 17.95 458 10.7 10.7 0.03
## Las.Plazas 12 9 0.23 94 0.5 0.6 25.09
## Rabida 70 30 4.89 367 4.4 24.4 572.33
## SanCristobal 280 65 551.62 716 45.2 66.6 0.57
## SanSalvador 237 81 572.33 906 0.2 19.8 4.89
## SantaCruz 444 95 903.82 864 0.6 0.0 0.52
## SantaFe 62 28 24.08 259 16.5 16.5 0.52
## SantaMaria 285 73 170.92 640 2.6 49.2 0.10
## Seymour 44 16 1.84 147 0.6 9.6 25.09
## Tortuga 16 8 1.24 186 6.8 50.9 17.95
## Wolf 21 12 2.85 253 34.1 254.7 2.33
mod_p <- glm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,family = poisson(link = "log") , data = gala)
summary(mod_p)
##
## Call:
## glm(formula = Species ~ Area + Elevation + Nearest + Scruz +
## Adjacent, family = poisson(link = "log"), data = gala)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.155e+00 5.175e-02 60.963 < 2e-16 ***
## Area -5.799e-04 2.627e-05 -22.074 < 2e-16 ***
## Elevation 3.541e-03 8.741e-05 40.507 < 2e-16 ***
## Nearest 8.826e-03 1.821e-03 4.846 1.26e-06 ***
## Scruz -5.709e-03 6.256e-04 -9.126 < 2e-16 ***
## Adjacent -6.630e-04 2.933e-05 -22.608 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 716.85 on 24 degrees of freedom
## AIC: 889.68
##
## Number of Fisher Scoring iterations: 5
A Note on Coefficient Significance in GLMs: Notice the z
-values and associated p-values in the summary() output above. These are based on Wald tests. In Generalized Linear Models, Wald tests can sometimes be mathematically unreliable (for instance, when coefficients are large, the standard error can become artificially inflated, leading to false negatives). Therefore, to robustly test whether a parameter (or a set of parameters) is truly significant, it is much better to perform a model comparison test, rather than just looking at the summary table.
A Note on Coefficient Significance in GLMs: Notice the z
-values and associated p-values in the summary() output above. These are based on Wald tests. In Generalized Linear Models, Wald tests can sometimes be mathematically unreliable (for instance, when coefficients are large, the standard error can become artificially inflated, leading to false negatives). Therefore, to robustly test whether a parameter (or a set of parameters) is truly significant, it is much better to perform a model comparison test, rather than just looking at the summary table.
model_p_reduce <- glm(Species ~ Area + Elevation + Adjacent,family = poisson(link = "log") , data = gala)
summary(model_p_reduce)
##
## Call:
## glm(formula = Species ~ Area + Elevation + Adjacent, family = poisson(link = "log"),
## data = gala)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.961e+00 4.773e-02 62.05 <2e-16 ***
## Area -5.703e-04 2.518e-05 -22.65 <2e-16 ***
## Elevation 3.589e-03 8.608e-05 41.70 <2e-16 ***
## Adjacent -7.508e-04 2.778e-05 -27.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 818.74 on 26 degrees of freedom
## AIC: 987.57
##
## Number of Fisher Scoring iterations: 5
Anova using chisq
anova(model_p_reduce, mod_p, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Species ~ Area + Elevation + Adjacent
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 26 818.74
## 2 24 716.85 2 101.89 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
“Upon comparing the Full and Reduced models, the Full Model yielded both a lower Residual Deviance and a lower AIC. Given that the Full Model also demonstrated statistical significance in the Likelihood Ratio Test, we conclude that it provides a superior fit to the data. While the Reduced Model is more parsimonious, the Full Model’s improved accuracy and predictive efficiency make it the preferred model for this analysis.”
lamdahat <- fitted(mod_p)
peoson_chi <- residuals(mod_p, type = "pearson")
phi_hat <- sum((peoson_chi)^2)/mod_p$df.residual
cat("Estimated dispersion parameter (phi):", phi_hat, "\n")
## Estimated dispersion parameter (phi): 31.74914
error <- (gala$Species - lamdahat)^2
plot(lamdahat, error, col = "black", main = "Mean vs variance" )
abline(0,1, lty = 1, col ="red")
plot(lamdahat, peoson_chi, main = "mean vs pearson_residual" )
abline(h = 0, lty = 2 )
Assignment n.3 GLM: Logistic Regression 2026-04-01 The orings dataset
The faraway library contains the orings dataset, which records the
number of damaged O-rings and the launch temperature for 23 space
shuttle flights prior to the Challenger disaster.
library(faraway)
data("orings")
head(orings)
## temp damage
## 1 53 5
## 2 57 1
## 3 58 1
## 4 63 1
## 5 66 0
## 6 67 0
orings$damage_bin <- ifelse(orings$damage > 0, 1,0)
orings
## temp damage damage_bin
## 1 53 5 1
## 2 57 1 1
## 3 58 1 1
## 4 63 1 1
## 5 66 0 0
## 6 67 0 0
## 7 67 0 0
## 8 67 0 0
## 9 68 0 0
## 10 69 0 0
## 11 70 1 1
## 12 70 0 0
## 13 70 1 1
## 14 70 0 0
## 15 72 0 0
## 16 73 0 0
## 17 75 0 0
## 18 75 1 1
## 19 76 0 0
## 20 76 0 0
## 21 78 0 0
## 22 79 0 0
## 23 81 0 0
# Plot temperature (y) against the binary damage variable (x)
# We convert damage_bin to a factor so R treats it as distinct categories
boxplot(temp ~ as.factor(damage_bin), data = orings,
main = "Temperature vs. Failure Occurrence",
xlab = "Failure (0=No, 1=Yes)",
ylab = "Temperature")
“Exploratory analysis via boxplot suggests a negative association
between temperature and failure occurrence. Specifically, the
distribution of temperatures associated with failure (\(1\)) is shifted toward lower values. This
implies that the risk of failure increases as temperature decreases, a
relationship I will quantify formally by fitting a logistic regression
model.”
lm_orings <- lm(damage_bin ~ temp, data = orings)
summary(lm_orings)
##
## Call:
## lm(formula = damage_bin ~ temp, data = orings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43762 -0.30679 -0.06381 0.17452 0.89881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.90476 0.84208 3.450 0.00240 **
## temp -0.03738 0.01205 -3.103 0.00538 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3987 on 21 degrees of freedom
## Multiple R-squared: 0.3144, Adjusted R-squared: 0.2818
## F-statistic: 9.63 on 1 and 21 DF, p-value: 0.005383
temp_p <- data.frame(temp = c(31,82))
prediction <- predict(lm_orings, newdata = temp_p)
prediction
## 1 2
## 1.7459524 -0.1604762
Standard linear regression is inappropriate here because it does not constrain predictions to the \([0, 1]\) interval, which is required for modeling the probability of an event. By using logistic regression with the logit link function, I can model the log-odds of failure as a linear function of temperature, effectively estimating the probability of damage occurring at any given temperature level.”
“The linear model is inappropriate because it produces predictions outside the valid range of probability (\([0, 1]\)). My predictions of 1.75 for 31°F and -0.16 for 82°F are mathematically impossible outcomes for a binary response variable. This clearly demonstrates that the linear regression assumption—that the response is continuous and unbounded—is violated by my binary damage_bin data. This failure necessitates the use of a logistic regression model, which ensures that all predictions remain strictly 0 and 1.”
?glm
md_temp <- glm(damage_bin ~ temp, data = orings, family = binomial(link = "logit") )
summary(md_temp)
##
## Call:
## glm(formula = damage_bin ~ temp, family = binomial(link = "logit"),
## data = orings)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.0429 7.3786 2.039 0.0415 *
## temp -0.2322 0.1082 -2.145 0.0320 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.315 on 21 degrees of freedom
## AIC: 24.315
##
## Number of Fisher Scoring iterations: 5
“The logistic regression results confirm a statistically significant negative relationship between temperature and the occurrence of damage (\(p = 0.0320\)). The negative coefficient of -0.23 indicates that an increase in temperature is associated with a decrease in the log-odds of failure. Furthermore, the reduction in deviance from the null (28.27) to the residual model (20.32) indicates that temperature provides a statistically significant improvement in predicting damage occurrence compared to the null model.”
md_re <- glm(damage_bin ~ 1, data = orings, family = binomial(link = "logit") )
summary(md_re)
##
## Call:
## glm(formula = damage_bin ~ 1, family = binomial(link = "logit"),
## data = orings)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8267 0.4532 -1.824 0.0681 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 28.267 on 22 degrees of freedom
## AIC: 30.267
##
## Number of Fisher Scoring iterations: 4
help("anova")
anova(md_re,md_temp,test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: damage_bin ~ 1
## Model 2: damage_bin ~ temp
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 22 28.267
## 2 21 20.315 1 7.952 0.004804 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When your Residual Deviance is smaller than your Degrees of Freedom (Df), it indicates that your model is not overdispersed. In the context of a Binomial GLM, this is actually a sign of a “good” fit, as it suggests the model isn’t struggling with excess unexplained noise. “I performed a likelihood ratio test to compare the null model with the model containing temperature. The addition of the temperature predictor resulted in a statistically significant reduction in residual deviance of 7.95 (\(p = 0.0048\)). This allows me to reject the null hypothesis that the temperature coefficient is zero and conclude that temperature significantly influences the probability of failure.”
plot(orings$temp,orings$damage_bin, pch = 20 , col = "lightgreen")
tp <- seq(min(orings$temp), max(orings$temp), length = 100)
lines(tp, predict(md_temp, newdata = data.frame(temp = tp), type = "response"), col = "darkred", lwd = 3)
“This scatterplot displays the binary outcome of failure against
temperature. The overlaid curve is the fitted logistic regression model,
which demonstrates the inverse relationship between temperature and the
probability of damage. The S-shaped curve indicates that lower
temperatures are associated with a higher probability of failure, with
the model effectively capturing the transition from a high risk of
damage at lower temperatures to a low risk at higher temperatures.”
prediction <- predict(md_temp, newdata = data.frame(temp = c(31,82)), type = "response")
prediction
## 1 2
## 0.99960878 0.01808462
“Using the predict() function with type =”response”, I calculated the probability of damage for specific temperatures. At 31°F, the model predicts a probability of failure near 1.0, highlighting the extreme risk present at the temperature of the Challenger launch. Conversely, at 82°F, the model predicts a near-zero probability of damage. This calculation effectively quantifies the risk, moving beyond a simple visual trend to provide an actionable estimate of the probability of failure under specific environmental conditions.”
First, use predict() with type = “link” and se.fit = TRUE to get the fitted values and standard errors on the logit scale. Calculate the lower and upper bounds on this scale (fit ± 1.96 se.fit). Finally, use the plogis() function to back-transform your results into actual probabilities.
pred_link <- predict(md_temp, newdata = data.frame(temp = c(31,82)), type = "link", se.fit = TRUE)
lower_bound <- pred_link$fit - 1.96 * pred_link$se.fit
upper_bound <- pred_link$fit + 1.96 * pred_link$se.fit
lower_confint <- plogis(lower_bound)
upper_confint <- plogis(upper_bound)
lower_confint
## 1 2
## 0.4815742515 0.0007759897
upper_confint
## 1 2
## 0.9999999 0.3040064
“The difference in the width of the confidence intervals reflects the distribution of the underlying data. The model is highly certain about the high risk of failure at 31°F because that temperature is closer to the ‘damage’ cluster in the data. Conversely, the interval for 82°F is wider because the model has less data in that region, leading to higher uncertainty regarding the exact probability. This interval demonstrates the value of the model: it doesn’t just provide a point estimate, but also a realistic range that communicates where we are more or less certain.”
Classification
library(ISLR2)
data("Default")
head(Default)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
# 1. Map labels to meaningful values
# This is much easier to read than "0*15+1"
my_colors <- c("black", "red")
my_shapes <- c(1, 19)
# 2. Use those variables in your plot
plot(Default$balance, Default$income,
col = my_colors[as.factor(Default$default)],
pch = my_shapes[as.factor(Default$default)])
One interesting consideration is the potential bidirectional
relationship between balance and default. While the model treats balance
as a predictor of default, it is possible that the default itself
contributes to an increasing balance due to accumulated interest and
penalties. However, for predictive purposes, balance remains a highly
reliable signal of high-risk accounts regardless of the causal
direction.The visual analysis shows that the income variable does not
provide a clear decision boundary for default status. We see significant
overlap of both classes across the entire range of incomes, suggesting
that income is not a strong discriminator between defaulters and
non-defaulters. Consequently, the model assigns it a low statistical
significance, as it offers little predictive power compared to the
credit balance
mod_d <- glm(default ~ income + balance , data = Default, family = binomial )
summary(mod_d)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
“In this model, the estimate represents the change in log-odds, indicating that the probability of default increases as both balance and income increase. Both predictors are statistically significant, as evidenced by their low p-values. However, relying solely on the summary table is insufficient. The standard errors and z-values are derived from the Wald test, which can lead to artificially inflated estimates when coefficients are large, potentially resulting in false negatives. Therefore, we must perform model comparison to rigorously evaluate the model’s efficiency and actual predictive power.”
mod_def <- glm(default ~ 1 , data = Default, family = binomial )
summary(mod_def)
##
## Call:
## glm(formula = default ~ 1, family = binomial, data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.36833 0.05574 -60.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 2920.6 on 9999 degrees of freedom
## AIC: 2922.6
##
## Number of Fisher Scoring iterations: 6
anova(mod_def,mod_d, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: default ~ 1
## Model 2: default ~ income + balance
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9999 2920.7
## 2 9997 1579.0 2 1341.7 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Decision boundary
#extract coefficients
Beta_hat <- coefficients(mod_d)
Beta0_hat <- Beta_hat[1]
Beta1_hat <- Beta_hat[2]
Beta2_hat <- Beta_hat[3]
#b0 + b1 * income + b2 * balance = 0
# b1 * income = -(b2*balance) - b0
#income = -(b2*balance)/b1 - b0 / b1
# slope = -b2/b1
#intercept = -b0/b1
slope = -Beta2_hat/Beta1_hat
intercept = -Beta0_hat/Beta1_hat
plot(Default$balance, Default$income,
col = my_colors[as.factor(Default$default)],
pch = my_shapes[as.factor(Default$default)])
abline(intercept,slope,lwd = 3,col = "darkblue")
confusion matrix
conf.matrix <- predict(mod_d, type = "response")
confit.class <- rep("NO", 10000)
confit.class[conf.matrix > 0.5 ] <- "YES"
#class[condition] <- value
conf_matrix <- table(predicted = confit.class, True = Default$default)
conf_matrix
## True
## predicted No Yes
## NO 9629 225
## YES 38 108
overall_error <-(conf_matrix[1,2] + conf_matrix[2,1])/ sum(conf_matrix)
overall_error
## [1] 0.0263
false_negative_error <- conf_matrix[1,2]/ sum(conf_matrix[, 2])
false_negative_error
## [1] 0.6756757
“While the overall error rate of 2.63% indicates high general accuracy, it is deceptive due to the class imbalance in the dataset. A more critical metric is the False Negative Rate, which is approximately 67.6%. This reveals that the model fails to identify the majority of actual defaulters when using a 0.5 probability threshold. Consequently, while the model is highly accurate at identifying safe accounts, it is currently ineffective at fulfilling the primary business objective of catching high-risk defaults.”
conf.matrix <- predict(mod_d, type = "response")
trivial.class <- rep("NO", 10000)
trivial_matrix <- table(predicted = trivial.class, True = Default$default)
trivial_matrix
## True
## predicted No Yes
## NO 9667 333
overall_error_t <- 333/10000
overall_error_t
## [1] 0.0333
“The analysis shows that while the logistic model outperforms the trivial classifier in overall accuracy (2.63% vs 3.33%), the 0.5 probability threshold results in a high False Negative Rate of 67.6%. This indicates that the current model configuration is biased towards precision over recall. To improve risk detection, future iterations should focus on threshold tuning or adjusting for class imbalance to better identify high-risk accounts.”
Threshold 0.2
conf.matrix2 <- predict(mod_d, type = "response")
confit.class2 <- rep("NO", 10000)
confit.class2[conf.matrix2 > 0.2] <- "YES"
#class[condition] <- value
conf_matrix2 <- table(predicted = confit.class2, True = Default$default)
conf_matrix2
## True
## predicted No Yes
## NO 9396 133
## YES 271 200
overall_error2 <-(conf_matrix2[1,2] + conf_matrix2[2,1])/ sum(conf_matrix2)
overall_error2
## [1] 0.0404
false_negative_error2 <- conf_matrix2[1,2]/ sum(conf_matrix2[, 2])
false_negative_error2
## [1] 0.3993994
Lowering the classification threshold to 0.2 demonstrates a deliberate shift in the model’s objective. While the overall error rate increased slightly from 2.6% to 4.0%, the False Negative Rate improved drastically from 67.6% to 39.9%. In a business context, this is a highly rational trade-off: it is generally better to incorrectly label a safe customer as ‘risky’ (a False Positive) than to fail to detect a customer who will actually default (a False Negative) and incur a total loss on that account.By lowering the threshold, our overall error rate increased to 4%, but our error rate among actual defaulters dropped massively from 67.5% to about 40%. We increase the sensitivity. We catch more defaults, but at the cost of investigating more false positives.
Default$default <- as.factor(Default$default)
perf.measure <- function(true.values, pred.values, lab.pos = 1){
conf.matrix <- table(pred.values,true.values)
n <- sum(conf.matrix)
print(conf.matrix)
print(rownames(conf.matrix))
lab.pos <- as.character(lab.pos)
lab <- rownames(conf.matrix)
lab.neg <- lab[lab != lab.pos]
TP <- conf.matrix[lab.pos, lab.pos]
TN <- conf.matrix[lab.neg, lab.neg]
FP <- conf.matrix[lab.pos, lab.neg]
FN <- conf.matrix[lab.neg, lab.pos]
P <- TP + FN
N <- TN + FP
P.ast <- TP + FP
OER <- (FP + FN) / n
PPV <- TP/P.ast
TPR <- TP/P
F <- 2 * (PPV * TPR) / (PPV + TPR)
TNR <- TN/N
FPR <- FP/N
FNR <- FN /P
return(list(Overall.ER = OER, PPV =PPV , TPR = TPR, F = F, TNR = TNR, FPR = FPR , FNR = FNR ))
}
# Standardize your predictions to match the dataset labels
confit.class2 <- factor(confit.class2, levels = c("NO", "YES"), labels = c("No", "Yes"))
# Now call the function using "Yes"
PM <- perf.measure(Default$default, confit.class2, lab.pos = "Yes")
## true.values
## pred.values No Yes
## No 9396 133
## Yes 271 200
## [1] "No" "Yes"
PM
## $Overall.ER
## [1] 0.0404
##
## $PPV
## [1] 0.4246285
##
## $TPR
## [1] 0.6006006
##
## $F
## [1] 0.4975124
##
## $TNR
## [1] 0.9719665
##
## $FPR
## [1] 0.02803352
##
## $FNR
## [1] 0.3993994
At the 0.2 threshold, the model demonstrates a Sensitivity (TPR) of 60.06%, indicating that it successfully captures the majority of defaults. While this results in a higher False Positive Rate (2.8%) compared to the 0.5 threshold, it significantly reduces the False Negative Rate (Miss Rate) to 39.94%. The F1 Score of 0.4975 reflects a balanced attempt to minimize both missing high-risk accounts and incorrectly flagging safe ones. Given the business context, this 0.2 threshold provides a much more effective risk mitigation strategy than the 0.5 threshold.”The F1 Score of 0.4975 provides a conservative estimate of the model’s predictive power. It indicates that the model’s performance is currently limited by the trade-off between its ability to detect defaults (Sensitivity) and its tendency to flag safe accounts incorrectly (Precision). A score of ~0.5 suggests that while the model has moved beyond random guessing, it requires further tuning—such as adjusting the classification threshold or utilizing more complex feature engineering—to achieve higher reliability.
Precision (Positive Predictive Value - PPV):
Definition: “Of all the times the model shouted ‘YES!’, how often was it actually right?”
General Rule: High precision means your model is “trustworthy
False Positive Rate (FPR):
Definition: “Of all the truly safe cases, how many did the model accidentally label as ‘Yes’?”
Sensitivity (TPR / Recall):
Definition: “Of all the actual ‘Yes’ cases in the world, what percentage did the model successfully catch?”
General Rule: This is your “Net’s efficiency.” If you are catching a disease or fraud, you want this to be very high
False Negative Rate (FNR / Miss Rate):
Definition: “Of all the actual ‘Yes’ cases, how many did the model fail to see?”
General Rule: This is your “blind spot.” In critical scenarios (like medicine), this is the most dangerous metric.
Specificity (TNR):
Definition: “Of all the truly safe/negative cases, what percentage did the model correctly call ‘No’?”
General Rule: This tells you how good the model is at recognizing “normal” behavior.
Overall Error Rate (OER):Definition: “What percentage of all my guesses (both Yes and No) were wrong?”General Rule: Simple, but often misleading if your dataset is “unbalanced” (e.g., if 99% of people are “No,” a model that says “No” to everyone is 99% accurate but useless).
F1 Score:Definition: “The mathematical balance between Precision and Sensitivity.”General Rule: Use this when you need a single number to represent a model that is “good at both” catching events and being accurate when it flags them.
ROC curve
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
conf.matrix3 <- predict(mod_d, type = "response")
table(Default$default)
##
## No Yes
## 9667 333
roc.out <- roc(Default$default, conf.matrix3, levels = c("No","Yes") )
## Setting direction: controls < cases
plot(roc.out, print.auc = TRUE, legacy.axes = TRUE,
xlab = "False Positive Rate", ylab = "True Positive Rate",
main = "ROC Curve - Logistic Regression")
?coords
coords(roc.out , "best")
## threshold specificity sensitivity
## 1 0.03312645 0.8644874 0.8978979
coords(roc.out , 0.5)
## threshold specificity sensitivity
## 1 0.5 0.9960691 0.3243243
AUC <- round(auc(roc.out), 3)
AUC
## [1] 0.949
Assignment n.4 Classification methods 2026-04-15 We first provide a function for the computation of several performance measures.
perf.measure <- function(true.values, pred.values, lab.pos = 1){
# compute the confusion matrix and number of units
conf.matrix <- table(pred.values, true.values)
n <- sum(conf.matrix)
# force the label of positives to be a character string
lab.pos <- as.character(lab.pos)
# obtain the label of negatives
lab <- rownames(conf.matrix)
lab.neg <- lab[lab != lab.pos]
# extract relevant quantities from the confusion matrix
TP <- conf.matrix[lab.pos, lab.pos]
TN <- conf.matrix[lab.neg, lab.neg]
FP <- conf.matrix[lab.pos, lab.neg]
FN <- conf.matrix[lab.neg, lab.pos]
P <- TP + FN
N <- FP + TN
P.ast <- TP + FP
# compute the performance measures
OER <- (FP+FN)/n
PPV <- TP/P.ast
TPR <- TP/P
F1 <- 2*PPV*TPR/(PPV+TPR)
TNR <- TN/N
FPR <- FP/N
return(list(overall.ER = OER, PPV=PPV, TPR=TPR, F1=F1, TNR=TNR, FPR=FPR))
}
The Stock Market Data We will begin by examining some numerical and graphical summaries of the Smarket data, which is part of the ISLR2 package. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date). Our goal is to predict Direction (a qualitative response) using the other features.
library(ISLR2)
data("Smarket")
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
dim(Smarket)
## [1] 1250 9
summary(Smarket)
## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
## Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
## Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
## Lag4 Lag5 Volume Today
## Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
## 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
## Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
## Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
## 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
## Direction
## Down:602
## Up :648
##
##
##
##
pairs(Smarket)
The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the Direction variable is qualitative.
round(cor(Smarket[, -9]), 3)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
## Year 1.000 0.030 0.031 0.033 0.036 0.030 0.539 0.030
## Lag1 0.030 1.000 -0.026 -0.011 -0.003 -0.006 0.041 -0.026
## Lag2 0.031 -0.026 1.000 -0.026 -0.011 -0.004 -0.043 -0.010
## Lag3 0.033 -0.011 -0.026 1.000 -0.024 -0.019 -0.042 -0.002
## Lag4 0.036 -0.003 -0.011 -0.024 1.000 -0.027 -0.048 -0.007
## Lag5 0.030 -0.006 -0.004 -0.019 -0.027 1.000 -0.022 -0.035
## Volume 0.539 0.041 -0.043 -0.042 -0.048 -0.022 1.000 0.015
## Today 0.030 -0.026 -0.010 -0.002 -0.007 -0.035 0.015 1.000
As one would expect, the correlations between the lag variables and today’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume. By plotting the data, which is ordered chronologically, we see that Volume is increasing over time.
plot(Smarket$Volume, main="Trading Volume over Time", ylab="Volume", xlab="Days")
Logistic Regression 1. Fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. 2. Apply the fitted logistic regression model to make a prediction as to whether the market will go up or down on a particular day (with a 0.5 threshold). 3. Obtain the confusion matrix, the overall error rate, and plot the ROC curve. 4. Refit the logistic regression using just Lag1 and Lag2, which seemed to have the highest predictive power in the original logistic regression model. How do the results change, and why? 5. Change the classification threshold. Re-classify the predictions of your reduced model using a threshold of 0.52 instead of 0.5. Extract the performance measures and comment on how the True Positive Rate (Sensitivity) and Positive Predictive Value (Precision) have changed. Linear Discriminant Analysis (LDA) 6. Apply linear discriminant analysis using Lag1 and Lag2 as predictors on the full dataset. Quadratic Discriminant Analysis (QDA) 7. Apply quadratic discriminant analysis using Lag1 and Lag2 as predictors. Naive Bayes 8. Apply Naive Bayes analysis using Lag1 and Lag2 as predictors.
?Smarket
mod.s <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume , data = Smarket, family = binomial )
summary(mod.s)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
“The model fails to demonstrate predictive utility. The high p-values across all predictors suggest that the features (Lag returns and Volume) do not possess a statistically significant relationship with the market direction. The minimal reduction in deviance (\(3.6\)) compared to the Null Deviance, combined with a high AIC, suggests that the model’s predictive power is effectively negligible and comparable to random chance. These results suggest that the efficient market hypothesis may hold for this dataset, or that these specific variables are insufficient for forecasting. The model shows a high Residual Deviance relative to the degrees of freedom, which indicates a poor goodness-of-fit. Since overdispersion is a property of Poisson models, I interpret this as evidence of model misspecification, likely because the chosen predictors (Lags/Volume) lack sufficient explanatory power to predict the response.
prob.s <- predict(mod.s, type = "response")
glm.class <- rep("Down", 1250)
glm.class[prob.s > 0.5] <- "Up"
# 3. Check that the lengths match before tabling
length(glm.class) # Should be 1250
## [1] 1250
length(Smarket$Direction) # Should be 1250
## [1] 1250
con.mat <- table(Predicted = glm.class, True = Smarket$Direction)
con.mat
## True
## Predicted Down Up
## Down 145 141
## Up 457 507
overal_error_r <- con.mat[1,2]+con.mat[2,1]/sum(con.mat)
overal_error_r
## [1] 141.3656
library(pROC)
roc.s <- roc(Smarket$Direction, prob.s, levels=c("Down", "Up"))
## Setting direction: controls < cases
plot(roc.s, print.auc = TRUE, legacy.axes = TRUE)
“The ROC curve illustrates the model’s classification performance with an AUC of 0.539. This value indicates that the model has negligible discriminatory power between ‘Up’ and ‘Down’ market days, performing only slightly better than a random baseline. This reinforces the conclusion that the lagged return variables (Lag1 and Lag2) do not possess sufficient signal to reliably predict market direction in this dataset.”
mod.red <- glm(Direction ~ Lag1 + Lag2 , data = Smarket, family = binomial )
summary(mod.red)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Smarket)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.07425 0.05667 1.310 0.190
## Lag1 -0.07151 0.05010 -1.427 0.153
## Lag2 -0.04450 0.05000 -0.890 0.374
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1728.4 on 1247 degrees of freedom
## AIC: 1734.4
##
## Number of Fisher Scoring iterations: 3
“The model was refitted using only Lag1 and Lag2 to achieve greater parsimony. The reduction in AIC confirms that the reduced model is more efficient than the full model. However, the lack of statistical significance (\(p > 0.05\)) for the predictors confirms that these variables remain insufficient for forecasting market direction. This refit demonstrates that while model simplification improves efficiency, it cannot overcome the fundamental absence of a predictive relationship between these specific lagged returns and the market’s daily direction.”
prob.s2 <- predict(mod.s , type = "response")
class.s2 <- rep("Down", 1250)
class.s2[prob.s2 > 0.52] <- "Up"
con.mat2 <- table(predicted = class.s2, True = Smarket$Direction)
con.mat2
## True
## predicted Down Up
## Down 344 327
## Up 258 321
PM1 <- perf.measure(Smarket$Direction, class.s2, lab.pos = "Up" )
PM1
## $overall.ER
## [1] 0.468
##
## $PPV
## [1] 0.5544041
##
## $TPR
## [1] 0.4953704
##
## $F1
## [1] 0.5232274
##
## $TNR
## [1] 0.5714286
##
## $FPR
## [1] 0.4285714
PM2 <- perf.measure(Smarket$Direction, glm.class, lab.pos = "Up" )
PM2
## $overall.ER
## [1] 0.4784
##
## $PPV
## [1] 0.5259336
##
## $TPR
## [1] 0.7824074
##
## $F1
## [1] 0.6290323
##
## $TNR
## [1] 0.2408638
##
## $FPR
## [1] 0.7591362
The performance evaluation of the two models reveals a significant trade-off between detection capability and predictive reliability. Model 2 with threshold 0.52 (PM2) demonstrates a superior ability to identify “Up” market days, achieving a Sensitivity (TPR) of 78.2% compared to Model 1’s (PM1) 49.5%. However, this increased detection comes at a cost; PM2’s Precision (PPV) drops slightly to 52.6%, and its Specificity (TNR) falls sharply to 24.1%, indicating that the model is much more prone to misidentifying “Down” days as “Up” days. Despite these frequent false alarms, PM2 achieves a higher F1 Score of 0.629, suggesting that it provides a more balanced approach to capturing market trends than PM1, which remains more cautious but misses a large portion of potential “Up” market movements.
Linear Discriminant Analysis (LDA) 6. Apply linear discriminant analysis using Lag1 and Lag2 as predictors on the full dataset.
?lda
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
##
## Boston
## The following object is masked from 'package:ISLR2':
##
## Boston
mod.ld <- lda(Direction ~ Lag1 + Lag2 , data = Smarket)
mod.ld
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket)
##
## Prior probabilities of groups:
## Down Up
## 0.4816 0.5184
##
## Group means:
## Lag1 Lag2
## Down 0.05068605 0.03229734
## Up -0.03969136 -0.02244444
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.7567605
## Lag2 -0.4707872
pred_lda <- predict(mod.ld,Smarket)
pred.class <- pred_lda$class
con.mat.ld <- table(Predicted = pred.class, True = Smarket$Direction)
con.mat.ld
## True
## Predicted Down Up
## Down 114 102
## Up 488 546
PM_L <- perf.measure(Smarket$Direction, pred.class, lab.pos = "Up")
PM_L
## $overall.ER
## [1] 0.472
##
## $PPV
## [1] 0.5280464
##
## $TPR
## [1] 0.8425926
##
## $F1
## [1] 0.6492271
##
## $TNR
## [1] 0.1893688
##
## $FPR
## [1] 0.8106312
library(pROC)
roc_ld <- roc(Smarket$Direction , pred_lda$posterior[,2], levels = c("Down", "Up"))
## Setting direction: controls < cases
plot(roc_ld , print.auc = TRUE, legacy.axes = TRUE)
“The LDA model reveals that the market direction is negatively
associated with the magnitude of previous day returns (Lag1 = -0.757;
Lag2 = -0.471). The group means indicate that ‘Up’ days are preceded by
lower average returns than ‘Down’ days, suggesting that the model is
capturing a subtle mean-reverting trend in the Smarket data.”
Quadratic Discriminant Analysis (QDA) 7. Apply quadratic discriminant analysis using Lag1 and Lag2 as predictors.
mod.qd <- qda(Direction ~ Lag1 + Lag2 , data = Smarket)
mod.qd
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket)
##
## Prior probabilities of groups:
## Down Up
## 0.4816 0.5184
##
## Group means:
## Lag1 Lag2
## Down 0.05068605 0.03229734
## Up -0.03969136 -0.02244444
pred_qda <- predict(mod.qd, Smarket)
pred.class.qda <- pred_qda$class
table(Predicted = pred.class.qda, Smarket$Direction)
##
## Predicted Down Up
## Down 109 94
## Up 493 554
PM_Q <- perf.measure(Smarket$Direction,pred.class.qda, lab.pos = "Up")
PM_Q
## $overall.ER
## [1] 0.4696
##
## $PPV
## [1] 0.5291309
##
## $TPR
## [1] 0.8549383
##
## $F1
## [1] 0.6536873
##
## $TNR
## [1] 0.1810631
##
## $FPR
## [1] 0.8189369
roc.qd <- roc(Smarket$Direction, pred_qda$posterior[,2], levels = c("Down", "Up"))
## Setting direction: controls < cases
plot(roc.qd, print.auc = TRUE,legacy.axes = TRUE)
“QDA was employed to relax the assumption of equal variance between the ‘Up’ and ‘Down’ groups. By utilizing a quadratic decision boundary, the model can capture non-linear relationships that LDA, with its rigid linear boundary, might overlook. This is particularly useful if the market exhibits different volatility characteristics during bullish versus bearish regimes.”
Naive Bayes 8. Apply Naive Bayes analysis using Lag1 and Lag2 as predictors.
library(e1071)
mod.NB <- naiveBayes(Direction ~ Lag1 + Lag2 , data = Smarket)
mod.NB
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Down Up
## 0.4816 0.5184
##
## Conditional probabilities:
## Lag1
## Y [,1] [,2]
## Down 0.05068605 1.141070
## Up -0.03969136 1.130989
##
## Lag2
## Y [,1] [,2]
## Down 0.03229734 1.157174
## Up -0.02244444 1.116768
class_NB <- predict(mod.NB, Smarket)
table(Predicted = class_NB, True = Smarket$Direction)
## True
## Predicted Down Up
## Down 106 93
## Up 496 555
PM_NB <- perf.measure(Smarket$Direction, class_NB, lab.pos = "Up")
PM_NB
## $overall.ER
## [1] 0.4712
##
## $PPV
## [1] 0.5280685
##
## $TPR
## [1] 0.8564815
##
## $F1
## [1] 0.6533255
##
## $TNR
## [1] 0.1760797
##
## $FPR
## [1] 0.8239203
N.posterior <- predict(mod.NB, Smarket, type = "raw")
library(pROC)
roc.NB <- roc(Smarket$Direction, N.posterior[, 2], levels = c("Down", "Up"))
## Setting direction: controls < cases
plot(roc.NB, print.auc = TRUE , legacy.axes = TRUE)
“The Naive Bayes model confirms the statistical patterns identified in the LDA and QDA analyses, specifically the mean-reverting relationship where ‘Down’ days follow positive lagged returns. By assuming independence between Lag1 and Lag2, Naive Bayes provides a robust, simplified benchmark for classification. The consistency in the conditional means across all three models (LDA, QDA, and Naive Bayes) reinforces the validity of the identified market pattern, even though the overall predictive power remains constrained by the inherent volatility of the Smarket data.”
“The negative group means for ‘Up’ days indicate that market gains are statistically preceded by negative lagged returns. This suggests the presence of a mean-reversion process where the market corrects itself after downward pressure. The predictive models (LDA, QDA, and Naive Bayes) consistently leverage this ‘bounce-back’ behavior to classify market direction, confirming that past returns are inversely related to subsequent market movements in this dataset.”
MODEL SELECTION
library(ISLR2)
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
## [7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
## [13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
## [19] "Salary" "NewLeague"
data(Hitters)
head(Hitters)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson 293 66 1 30 29 14 1 293 66 1
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson 30 29 14 A E 446 33 20
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## -Alfredo Griffin 501 336 194 A W 282 421 25
## Salary NewLeague
## -Andy Allanson NA A
## -Alan Ashby 475.0 N
## -Alvin Davis 480.0 A
## -Andre Dawson 500.0 N
## -Andres Galarraga 91.5 N
## -Alfredo Griffin 750.0 A
sum(is.na(Hitters))
## [1] 59
Hitters <- na.omit(Hitters)
dim(Hitters)
## [1] 263 20
sum(is.na(Hitters))
## [1] 0
WThe regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The syntax is the same as for lm(). The summary() command outputs the best set of variables for each model size.
library(leaps)
## Warning: package 'leaps' was built under R version 4.6.1
hit.reg <- regsubsets(Salary ~ . , Hitters)
summary(hit.reg)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
An asterisk indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Hits and CRBI. By default, regsubsets() only reports results up to the best eight-variable model. But the nvmax option can be used in order to return as many variables as are desired. Here we fit up to a 19-variable model.
library(leaps)
hit.reg <- regsubsets(Salary ~ . , Hitters, nvmax = 19)
summary(hit.reg)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters, nvmax = 19)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
reg_summary <- summary(hit.reg)
reg_summary$cp
## [1] 104.281319 50.723090 38.693127 27.856220 21.613011 14.023870
## [7] 13.128474 7.400719 6.158685 5.009317 5.874113 7.330766
## [13] 8.888112 10.481576 12.346193 14.187546 16.087831 18.011425
## [19] 20.000000
reg_summary$bic
## [1] -90.84637 -128.92622 -135.62693 -141.80892 -144.07143 -147.91690
## [7] -145.25594 -147.61525 -145.44316 -143.21651 -138.86077 -133.87283
## [13] -128.77759 -123.64420 -118.21832 -112.81768 -107.35339 -101.86391
## [19] -96.30412
reg_summary$adjr2
## [1] 0.3188503 0.4208024 0.4450753 0.4672734 0.4808971 0.4972001 0.5007849
## [8] 0.5137083 0.5180572 0.5222606 0.5225706 0.5217245 0.5206736 0.5195431
## [15] 0.5178661 0.5162219 0.5144464 0.5126097 0.5106270
reg_summary$rss
## [1] 36179679 30646560 29249297 27970852 27149899 26194904 25906548 25136930
## [9] 24814051 24500402 24387345 24333232 24289148 24248660 24235177 24219377
## [17] 24209447 24201837 24200700
plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
adj_max <- which.max(reg_summary$adjr2)
points(adj_max, reg_summary$adjr2[adj_max], col = "red", cex = 2, pch = 20)
plot(hit.reg, scale = "bic")
coef(hit.reg,6)
## (Intercept) AtBat Hits Walks CRBI DivisionW
## 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
## PutOuts
## 0.2643076
predict.regsubsets <- function(object, newdata, id, ...) {
form <- as.formula(object$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(object, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}
set.seed(1)
n <- nrow(Hitters)
train <- sample(c(TRUE,FALSE), n , replace = TRUE)
test <- (!train)
reg.forward <- regsubsets(Salary ~ . , data = Hitters[train,], method = "forward", nvmax = 19)
val.errors <- rep(NA, 19)
for ( i in 1:19) {
pred <- predict(reg.forward , Hitters[test, ] , id = i)
val.errors[i] <- mean((Hitters$Salary[test] - pred)^2)
}
val.opt.size <- which.min(val.errors)
plot(val.errors, type = "b", pch = 16)
points(val.opt.size, val.errors[val.opt.size], pch = 20, col = "red")
k = 10
folds <- sample(rep((1:k), length = n))
cv.errors <- matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
for (j in 1:k) {
# Train forward selection on all folds except j
best.fit <- regsubsets(Salary ~ ., data = Hitters[folds != j, ], nvmax = 19, method = "forward")
# Predict on fold j and calculate MSE
for (i in 1:19) {
pred <- predict(best.fit, Hitters[folds == j, ], id = i)
cv.errors[j, i] <- mean((Hitters$Salary[folds == j] - pred)^2)
}
}
mean.cv.errors <- apply(cv.errors, 2, mean)
opt.cv.size <- which.min(mean.cv.errors)
plot(1:19, mean.cv.errors, type = "b", pch = 16, xlab = "Number of Variables", ylab = "10-Fold CV MSE")
points(opt.cv.size, mean.cv.errors[opt.cv.size], col = "red", cex = 2, pch = 20)
reg.fwd.final <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19,
method = "forward")
coef(reg.fwd.final, 10)
## (Intercept) AtBat Hits Walks CAtBat CRuns
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
## CRBI CWalks DivisionW PutOuts Assists
## 0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680
Assignment n.5 Model selection 2026-04-22 We first provide our custom function for the computation of classification performance measures.
perf.measure <- function(true.values, pred.values, lab.pos = 1){
conf.matrix <- table(pred.values, true.values)
n <- sum(conf.matrix)
lab.pos <- as.character(lab.pos)
lab <- rownames(conf.matrix)
lab.neg <- lab[lab != lab.pos]
TP <- conf.matrix[lab.pos, lab.pos]
TN <- conf.matrix[lab.neg, lab.neg]
FP <- conf.matrix[lab.pos, lab.neg]
FN <- conf.matrix[lab.neg, lab.pos]
P <- TP + FN
N <- FP + TN
P.ast <- TP + FP
OER <- (FP+FN)/n
PPV <- TP/P.ast
TPR <- TP/P
F1 <- 2*PPV*TPR/(PPV+TPR)
TNR <- TN/N
FPR <- FP/N
return(list(overall.ER = OER, PPV=PPV, TPR=TPR, F1=F1, TNR=TNR, FPR=FPR))
}
The Default Dataset For this assignment, we will use the Default dataset from the ISLR2 package. The goal is to predict which customers will default on their credit card debt based on their student status, credit card balance, and income.
library(ISLR2)
data("Default")
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
def_glm <- glm(default ~ . , data = Default, family = binomial)
summary(def_glm)
##
## Call:
## glm(formula = default ~ ., family = binomial, data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
n <- nrow(Default)
help("step")
glm_full_def <- step(def_glm, direction = "backward", k = log(n))
## Start: AIC=1608.39
## default ~ student + balance + income
##
## Df Deviance AIC
## - income 1 1571.7 1599.3
## - student 1 1579.0 1606.6
## <none> 1571.5 1608.4
## - balance 1 2907.5 2935.1
##
## Step: AIC=1599.31
## default ~ student + balance
##
## Df Deviance AIC
## <none> 1571.7 1599.3
## - student 1 1596.5 1614.9
## - balance 1 2908.7 2927.1
summary(glm_full_def)
##
## Call:
## glm(formula = default ~ student + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.075e+01 3.692e-01 -29.116 < 2e-16 ***
## studentYes -7.149e-01 1.475e-01 -4.846 1.26e-06 ***
## balance 5.738e-03 2.318e-04 24.750 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.7 on 9997 degrees of freedom
## AIC: 1577.7
##
## Number of Fisher Scoring iterations: 8
Carry out a diagnostic check of the selected model by computing its performance measures using a standard 0.5 probability threshold. Considering the problem setting (predicting credit defaults), which is the best error measure to use in this case?
Try to improve the model by including a quadratic component for the balance predictor. Fit this model, calculate its BIC, and compare it to the model selected in Question 1.
Try to improve the model by including all two-way interaction effects between all original predictors. Use Backward Stepwise Selection (via BIC) again to see if any interaction terms should be retained.
Apply 10-fold Cross-Validation to estimate the test performance of our chosen model (student + balance). Hint: Use the cv.glm() function from the boot package. Because cv.glm() expects a numeric response, first create a 0/1 variable. Furthermore, since we established that the Overall Error Rate is misleading for this data, supply to the argument cost a custom cost function that uses your perf.measure() function to calculate the F1 Error (defined as 1 - F1).
Now, consider a subsample of 1000 customers. set.seed(123) sub_idx <- sample(1:nrow(Default), 1000) Default_sub <- Default[sub_idx, ] Based on this subsample run the following analysis. Suppose the bank determines that issuing a credit card to someone who defaults (a False Negative) is much more expensive (costs 20k on average) than denying a card to someone who would be safe (a False Positive, costing 1k on average). How should this affect your classification threshold? Construct your own customized cost function and perform leave-one-out cross-validation to adjust the threshold among a set of 5 pre-defined values. Hint: The cost function should be defined in the form cost.financial <- function(r, pi=0, tau=0.5). Since cv.glm() only accepts a cost function with two arguments (r and pi), you can wrap your customized function inside the CV loop to pass the current threshold. To keep computation time reasonable, take a random sample of 1,000 observations from the dataset before running your loop.
This dataset is part of the R-package ISLR2 and contains certain
statistics and salaries of major league baseball players for the years
1986–87. We wish to predict a baseball player’s Salary on
the basis of various statistics associated with performance. It consists
of 20 variables and 322 observations, only the Salary
variable has missing observations.
library(ISLR2)
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
## [7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
## [13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
## [19] "Salary" "NewLeague"
dim(Hitters)
## [1] 263 20
sum(is.na(Hitters$Salary))
## [1] 0
First of all, we note that the Salary variable is
missing for some of the players. The is.na() function can
be used to identify the missing observations. It returns a vector of the
same length as the input vector, with a TRUE for any
elements that are missing, and a FALSE for non-missing
elements. The sum() function can then be used to count all
of the missing elements.
Hence we see that Salary is missing for 59 players. The
na.omit() function removes all of the rows that have
missing values in any variable.
Hitters <- na.omit(Hitters)
dim(Hitters)
## [1] 263 20
sum(is.na(Hitters))
## [1] 0
After omitting them, we have 263 complete observations to use for modeling.
WThe regsubsets() function (part of the
leaps library) performs best subset selection by
identifying the best model that contains a given number of predictors,
where best is quantified using RSS. The syntax is the same as
for lm(). The summary() command outputs the
best set of variables for each model size.
library(leaps)
regfit.full <- regsubsets(Salary ~ ., Hitters)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
An asterisk indicates that a given variable is included in the
corresponding model. For instance, this output indicates that the best
two-variable model contains only Hits and
CRBI. By default, regsubsets() only reports
results up to the best eight-variable model. But the nvmax
option can be used in order to return as many variables as are desired.
Here we fit up to a 19-variable model.
regfit.full <- regsubsets(Salary ~ ., data = Hitters,
nvmax = 19)
reg.summary <- summary(regfit.full)
The summary() function also returns \(R^2\), RSS, adjusted \(R^2\), \(C_p\), and BIC. We can plot these,
highlighting the optimal model size for each metric, to try to select
the best overall model.
par(mfrow = c(2, 2))
# Plot Adjusted R-squared
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
adj_max <- which.max(reg.summary$adjr2)
points(adj_max, reg.summary$adjr2[adj_max], col = "red", cex = 2, pch = 20)
# Plot Cp
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
cp_min <- which.min(reg.summary$cp)
points(cp_min, reg.summary$cp[cp_min], col = "red", cex = 2, pch = 20)
# Plot BIC
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
bic_min <- which.min(reg.summary$bic)
points(bic_min, reg.summary$bic[bic_min], col = "red", cex = 2, pch = 20)
The regsubsets() function has a built-in
plot() command which can be used to display the selected
variables for the best model with a given number of predictors, ranked
according to the BIC, \(C_p\), or
adjusted \(R^2\).
plot(regfit.full, scale = "bic")
The top row of each plot contains a black square for each variable
selected according to the optimal model associated with that statistic.
For instance, we see that several models share a BIC close to \(-150\). However, the model with the lowest
BIC is the six-variable model that contains only AtBat,
Hits, Walks, CRBI,
DivisionW, and PutOuts. We can use the
coef() function to see the coefficient estimates associated
with this model.
coef(regfit.full, 6)
## (Intercept) AtBat Hits Walks CRBI DivisionW
## 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
## PutOuts
## 0.2643076
Analytical indexes are approximations. A more direct way to estimate the true test error is to use resampling techniques like the validation set approach and cross-validation.
Best subset selection can be computationally prohibitive on very large datasets. To demonstrate an alternative, we will use Forward Stepwise Selection for our resampling techniques.
First, we define a custom predict() method for
regsubsets objects, as one is not provided by default in
R.
predict.regsubsets <- function(object, newdata, id, ...) {
# 1. Extract the formula used in the original regsubsets() call
form <- as.formula(object$call[[2]])
# 2. Build a design matrix ('X' matrix) from the new data using that formula.
# This safely handles dummy variable creation and adds the intercept column.
mat <- model.matrix(form, newdata)
# 3. Extract the exact coefficients for the best model of size 'id'
coefi <- coef(object, id = id)
# 4. Identify the names of the specific predictors kept in this model size
xvars <- names(coefi)
# 5. Extract only the relevant columns from the design matrix and multiply
# them by the coefficients using matrix multiplication (X * beta)
mat[, xvars] %*% coefi
}
We begin by splitting the data into a training set and a validation set.
set.seed(1)
n <- nrow(Hitters)
train <- sample(c(TRUE, FALSE), n, replace = TRUE)
test <- (!train)
We perform forward stepwise selection only on the training data. This prevents information leakage and ensures our validation errors are honest estimates of test performance.
regfit.fwd <- regsubsets(Salary ~ ., data = Hitters[train, ], nvmax = 19,
method = "forward")
(Note: Backward stepwise selection can be performed just as easily by
passing method = "backward" to the function).
Next, we loop through all 19 model sizes, making predictions on the held-out test data and calculating the Mean Squared Error (MSE).
val.errors <- rep(NA, 19)
for (i in 1:19) {
pred <- predict(regfit.fwd, Hitters[test, ], id = i)
val.errors[i] <- mean((Hitters$Salary[test] - pred)^2)
}
Finally, we find the model size that minimizes the validation error and plot the results.
opt.val.size <- which.min(val.errors)
par(mfrow = c(1, 1))
plot(1:19, val.errors, type = "b", pch = 16, xlab = "Number of Variables", ylab = "Validation MSE")
points(opt.val.size, val.errors[opt.val.size], col = "red", cex = 2, pch = 20)
The validation set approach identifies the 7-variable model as the one with the lowest test MSE. However, this result is highly dependent on the initial random split.
To reduce the variance introduced by a single train/test split, we use 10-fold cross-validation. We first create our folds and an empty matrix to store the errors.
k <- 10
set.seed(1)
folds <- sample(rep(1:k, length = n))
cv.errors <- matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
We loop over each of the 10 folds. In each iteration, one fold acts as the test set while we train the forward stepwise model on the remaining 9 folds. We then compute the MSE for all 19 model sizes.
for (j in 1:k) {
# Train forward selection on all folds except j
best.fit <- regsubsets(Salary ~ ., data = Hitters[folds != j, ], nvmax = 19, method = "forward")
# Predict on fold j and calculate MSE
for (i in 1:19) {
pred <- predict(best.fit, Hitters[folds == j, ], id = i)
cv.errors[j, i] <- mean((Hitters$Salary[folds == j] - pred)^2)
}
}
We average the errors across all 10 folds for each model size to find the overall optimal model.
mean.cv.errors <- apply(cv.errors, 2, mean)
opt.cv.size <- which.min(mean.cv.errors)
plot(1:19, mean.cv.errors, type = "b", pch = 16, xlab = "Number of Variables", ylab = "10-Fold CV MSE")
points(opt.cv.size, mean.cv.errors[opt.cv.size], col = "red", cex = 2, pch = 20)
Cross-validation provides a more robust estimate of test error and selects a 10-variable model as optimal.
Now that CV has identified the optimal size, we may refit the forward selection on the full dataset to extract the final coefficients.
reg.fwd.final <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19,
method = "forward")
coef(reg.fwd.final, 10)
## (Intercept) AtBat Hits Walks CAtBat CRuns
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
## CRBI CWalks DivisionW PutOuts Assists
## 0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680
The custom prediction loop above gives us total control over the
subsetting process. However, if we simply want to evaluate a specific
model using Leave-One-Out Cross-Validation, we can use the
cv.glm() function from the boot package that
implements functions for cross-validation for GLMs. For this reason we
need to fit linear regression models with the glm function.
Let’s test the specific 6-variable model we previously identified
using BIC. We fit a standard generalized linear model (which defaults to
linear regression) and pass it to cv.glm().
library(boot)
##
## Attaching package: 'boot'
## The following objects are masked from 'package:faraway':
##
## logit, melanoma
# Fit a GLM using the 6 variables selected by BIC
glm.fit.6 <- glm(Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = Hitters)
# Perform LOOCV. Since K is not specified, it defaults to K = n.
n <- nrow(Hitters)
cv.err.loo.6 <- cv.glm(Hitters, glm.fit.6, K = n)
cv.err.loo.6$K
## [1] 263
cv.err.loo.6$delta
## [1] 107943.0 107926.7
The returned object contains a few useful components. $K
confirms the number of folds used (263, which is \(n\), indicating LOOCV).
The $delta component is a vector of length two:
The first component is the raw cross-validation estimate of the prediction error.
The second component is the adjusted cross-validation estimate, designed to compensate for the slight bias introduced by not using all the observations during the training folds. (For LOOCV, these two numbers are very very close).
Because Hitters is a relatively small dataset, this
\(n\)-fold loop runs almost
instantaneously. Now, let’s compare this to the 10-variable model
selected by our 10-fold cross-validation earlier.
# Fit a GLM using the 10 variables typically selected by the stepwise CV
glm.fit.10 <- glm(Salary ~ AtBat + Hits + Walks + CAtBat + CRuns + CRBI +
CWalks + League + Division + PutOuts, data = Hitters)
# Perform LOOCV on the 10-variable model
cv.err.loo.10 <- cv.glm(Hitters, glm.fit.10, K = n)
cv.err.loo.10$delta
## [1] 107306.8 107279.6
Finally, we can compare the raw LOOCV MSE of both models directly:
cat("6-Variable Model LOOCV Error: ", cv.err.loo.6$delta[2], "\n")
## 6-Variable Model LOOCV Error: 107926.7
cat("10-Variable Model LOOCV Error:", cv.err.loo.10$delta[2], "\n")
## 10-Variable Model LOOCV Error: 107279.6
The 10-variable model yields a slightly lower LOOCV error than the 6-variable model, confirming the results we saw in our previous \(k\)-fold cross-validation loop. While the 6-variable model is more parsimonious, the 10-variable model provides marginally better out-of-sample predictive accuracy.
For standard linear regression, there is a mathematical shortcut that allows us to calculate the exact LOOCV error without having to physically refit the model \(n\) times. We can fit the model just once on the full dataset and adjust each residual by dividing it by \(1 - h_i\), where \(h_i\) represents the leverage value (the diagonal element of the hat matrix) for that specific observation.
The formula for the exact LOOCV MSE is: \[\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2\]
We can compute this manually in R using the residuals()
and hatvalues() functions. Let’s apply this to our
6-variable model and compare it to the output from
cv.glm():
# Fit the model using standard lm()
lm.fit.6 <- lm(Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = Hitters)
# Compute the exact LOOCV error using the shortcut formula
loocv.exact <- mean((residuals(lm.fit.6) / (1 - hatvalues(lm.fit.6)))^2)
cat("Analytic Shortcut LOOCV Error: ", loocv.exact, "\n")
## Analytic Shortcut LOOCV Error: 107943
cat("cv.glm() LOOCV Error: ", cv.err.loo.6$delta[1], "\n")
## cv.glm() LOOCV Error: 107943
The analytic shortcut yields the exact same MSE as the \(n\)-fold loop performed by
cv.glm(). This algebraic trick is incredibly efficient,
making the computational cost of LOOCV essentially identical to a single
model fit when working with least squares regression.
The definitions of the variables of the dataset are as follows:
-AtBat: Number of shots made with a baseball bat during
the 1986–1987 season,
-Hits: Number of hits made in the 1986–1987 season,
-HmRun: Most valuable hits in the 1986–1987 season,
-Runs: The points he earned for his team in the
1986–1987 season,
-RBI: Number of players a batsman had jogged when he hit
in the season,
-Walks: Number of mistakes made by the opposing
player,
-Years: Player’s playing time in major league (in
year),
-CAtBat: Number of shots made with a baseball bat in
career,
-CHits: Number of hits made in the career,
-CHmRun: Most valuable hits in the career,
-CRuns: The points he earned for his team in his
career,
-CRBI: Number of players a batsman had jogged when he
hit in the career,
-CWalks: Number of mistakes made by the opposing player
in career,
-League: A factor with A and N levels showing the league
in which the player played until the end of the season,
-Division: A factor with levels E and W indicating the
position played by the player at the end of 1986,
-PutOuts: Helping your teammate in-game,
-Assists: Number of assists made by the player in the
1986–1987 season,
-Errors: Player’s errors in the 1986–1987 season,
-Salary: The salary of the player in the 1986–1987
season (in thousand),
-NewLeague: A factor with A and N levels showing the
player’s league at the start of the 1987 season.
In this lab, we will apply shrinkage methods (Ridge and Lasso regression) to the Ames Housing dataset.
# Load the pre-cleaned data
ames <- read.csv("AmesHousing_Cleaned.csv")
This dataset contains information on residential home sales in Ames, Iowa. It is a robust dataset consisting of 2,930 observations and over 80 features.
# explore the data
dim(ames)
## [1] 2930 81
str(ames)
## 'data.frame': 2930 obs. of 81 variables:
## $ MS_SubClass : chr "One_Story_1946_and_Newer_All_Styles" "One_Story_1946_and_Newer_All_Styles" "One_Story_1946_and_Newer_All_Styles" "One_Story_1946_and_Newer_All_Styles" ...
## $ MS_Zoning : chr "Residential_Low_Density" "Residential_High_Density" "Residential_Low_Density" "Residential_Low_Density" ...
## $ Lot_Frontage : int 141 80 81 93 74 78 41 43 39 60 ...
## $ Lot_Area : int 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr "No_Alley_Access" "No_Alley_Access" "No_Alley_Access" "No_Alley_Access" ...
## $ Lot_Shape : chr "Slightly_Irregular" "Regular" "Slightly_Irregular" "Regular" ...
## $ Land_Contour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ Lot_Config : chr "Corner" "Inside" "Corner" "Corner" ...
## $ Land_Slope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "North_Ames" "North_Ames" "North_Ames" "North_Ames" ...
## $ Condition_1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition_2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ Bldg_Type : chr "OneFam" "OneFam" "OneFam" "OneFam" ...
## $ House_Style : chr "One_Story" "One_Story" "One_Story" "One_Story" ...
## $ Overall_Qual : chr "Above_Average" "Average" "Above_Average" "Good" ...
## $ Overall_Cond : chr "Average" "Above_Average" "Above_Average" "Average" ...
## $ Year_Built : int 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
## $ Year_Remod_Add : int 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
## $ Roof_Style : chr "Hip" "Gable" "Hip" "Hip" ...
## $ Roof_Matl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior_1st : chr "BrkFace" "VinylSd" "Wd Sdng" "BrkFace" ...
## $ Exterior_2nd : chr "Plywood" "VinylSd" "Wd Sdng" "BrkFace" ...
## $ Mas_Vnr_Type : chr "Stone" "None" "BrkFace" "None" ...
## $ Mas_Vnr_Area : int 112 0 108 0 0 20 0 0 0 0 ...
## $ Exter_Qual : chr "Typical" "Typical" "Typical" "Good" ...
## $ Exter_Cond : chr "Typical" "Typical" "Typical" "Typical" ...
## $ Foundation : chr "CBlock" "CBlock" "CBlock" "CBlock" ...
## $ Bsmt_Qual : chr "Typical" "Typical" "Typical" "Typical" ...
## $ Bsmt_Cond : chr "Good" "Typical" "Typical" "Typical" ...
## $ Bsmt_Exposure : chr "Gd" "No" "No" "No" ...
## $ BsmtFin_Type_1 : chr "BLQ" "Rec" "ALQ" "ALQ" ...
## $ BsmtFin_SF_1 : int 2 6 1 1 3 3 3 1 3 7 ...
## $ BsmtFin_Type_2 : chr "Unf" "LwQ" "Unf" "Unf" ...
## $ BsmtFin_SF_2 : int 0 144 0 0 0 0 0 0 0 0 ...
## $ Bsmt_Unf_SF : int 441 270 406 1045 137 324 722 1017 415 994 ...
## $ Total_Bsmt_SF : int 1080 882 1329 2110 928 926 1338 1280 1595 994 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ Heating_QC : chr "Fair" "Typical" "Typical" "Excellent" ...
## $ Central_Air : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ First_Flr_SF : int 1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
## $ Second_Flr_SF : int 0 0 0 0 701 678 0 0 0 776 ...
## $ Low_Qual_Fin_SF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Gr_Liv_Area : int 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
## $ Bsmt_Full_Bath : int 1 0 0 1 0 0 1 0 1 0 ...
## $ Bsmt_Half_Bath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Full_Bath : int 1 1 1 2 2 2 2 2 2 2 ...
## $ Half_Bath : int 0 0 1 1 1 1 0 0 0 1 ...
## $ Bedroom_AbvGr : int 3 2 3 3 3 3 2 2 2 3 ...
## $ Kitchen_AbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen_Qual : chr "Typical" "Typical" "Good" "Excellent" ...
## $ TotRms_AbvGrd : int 7 5 6 8 6 7 6 5 5 7 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 2 0 0 2 1 1 0 0 1 1 ...
## $ Fireplace_Qu : chr "Good" "No_Fireplace" "No_Fireplace" "Typical" ...
## $ Garage_Type : chr "Attchd" "Attchd" "Attchd" "Attchd" ...
## $ Garage_Finish : chr "Fin" "Unf" "Unf" "Fin" ...
## $ Garage_Cars : int 2 1 1 2 2 2 2 2 2 2 ...
## $ Garage_Area : int 528 730 312 522 482 470 582 506 608 442 ...
## $ Garage_Qual : chr "Typical" "Typical" "Typical" "Typical" ...
## $ Garage_Cond : chr "Typical" "Typical" "Typical" "Typical" ...
## $ Paved_Drive : chr "Partial_Pavement" "Paved" "Paved" "Paved" ...
## $ Wood_Deck_SF : int 210 140 393 0 212 360 0 0 237 140 ...
## $ Open_Porch_SF : int 62 0 36 0 34 36 0 82 152 60 ...
## $ Enclosed_Porch : int 0 0 0 0 0 0 170 0 0 0 ...
## $ Three_season_porch: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Screen_Porch : int 0 120 0 0 0 0 0 144 0 0 ...
## $ Pool_Area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool_QC : chr "No_Pool" "No_Pool" "No_Pool" "No_Pool" ...
## $ Fence : chr "No_Fence" "Minimum_Privacy" "No_Fence" "No_Fence" ...
## $ Misc_Feature : chr "None" "None" "Gar2" "None" ...
## $ Misc_Val : int 0 0 12500 0 0 0 0 0 0 0 ...
## $ Mo_Sold : int 5 6 6 4 3 6 4 1 3 6 ...
## $ Year_Sold : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Sale_Type : chr "WD " "WD " "WD " "WD " ...
## $ Sale_Condition : chr "Normal" "Normal" "Normal" "Normal" ...
## $ Sale_Price : int 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
## $ Longitude : num -93.6 -93.6 -93.6 -93.6 -93.6 ...
## $ Latitude : num 42.1 42.1 42.1 42.1 42.1 ...
Our target variable is Sale_Price. The predictor
variables cover almost every aspect of a home, including: *
Continuous variables: e.g., Lot_Area
(square footage), Gr_Liv_Area (above ground living area). *
Discrete/Ordinal variables: e.g.,
Full_Bath, Overall_Qual (rating from 1-10). *
Categorical/Nominal variables: e.g.,
Neighborhood (28 different locations),
Bldg_Type, House_Style.
When dealing with this many variables—especially once categorical variables are expanded into dozens of dummy variables—standard linear regression often suffers from high variance and multicollinearity. Shrinkage methods constrain or “regularize” the coefficient estimates, significantly reducing variance and improving out-of-sample prediction.
We begin by fitting a saturated linear regression model using all available predictors. This will serve as our baseline.
# Fit a saturated linear regression model
lm.mod <- lm(Sale_Price ~ ., data = ames)
summary(lm.mod)
##
## Call:
## lm(formula = Sale_Price ~ ., data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -303624 -9147 83 8942 145991
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error
## (Intercept) -9.864e+06 9.006e+06
## MS_SubClassOne_and_Half_Story_Finished_All_Ages 2.371e+04 1.626e+04
## MS_SubClassOne_and_Half_Story_PUD_All_Ages -2.454e+04 2.891e+04
## MS_SubClassOne_and_Half_Story_Unfinished_All_Ages 2.833e+04 1.933e+04
## MS_SubClassOne_Story_1945_and_Older 2.040e+04 1.565e+04
## MS_SubClassOne_Story_1946_and_Newer_All_Styles 1.611e+04 1.519e+04
## MS_SubClassOne_Story_PUD_1946_and_Newer -1.637e+03 1.739e+04
## MS_SubClassOne_Story_with_Finished_Attic_All_Ages 2.805e+04 1.829e+04
## MS_SubClassPUD_Multilevel_Split_Level_Foyer -5.647e+03 1.864e+04
## MS_SubClassSplit_Foyer 1.231e+04 1.589e+04
## MS_SubClassSplit_or_Multilevel 6.335e+03 1.715e+04
## MS_SubClassTwo_and_Half_Story_All_Ages 1.445e+04 1.799e+04
## MS_SubClassTwo_Family_conversion_All_Styles_and_Ages 1.093e+04 4.891e+03
## MS_SubClassTwo_Story_1945_and_Older 2.499e+04 1.616e+04
## MS_SubClassTwo_Story_1946_and_Newer 1.594e+04 1.565e+04
## MS_SubClassTwo_Story_PUD_1946_and_Newer -3.577e+03 1.797e+04
## MS_ZoningC_all 2.161e+04 3.684e+04
## MS_ZoningFloating_Village_Residential 3.044e+04 3.614e+04
## MS_ZoningI_all 2.812e+04 4.358e+04
## MS_ZoningResidential_High_Density 3.473e+04 3.658e+04
## MS_ZoningResidential_Low_Density 3.191e+04 3.592e+04
## MS_ZoningResidential_Medium_Density 2.850e+04 3.615e+04
## Lot_Frontage 1.198e+01 1.468e+01
## Lot_Area 5.652e-01 7.803e-02
## StreetPave 1.790e+04 7.378e+03
## AlleyNo_Alley_Access -1.054e+03 2.496e+03
## AlleyPaved 2.296e+01 3.841e+03
## Lot_ShapeModerately_Irregular 3.298e+01 6.228e+03
## Lot_ShapeRegular -3.611e+03 5.862e+03
## Lot_ShapeSlightly_Irregular -4.904e+03 5.836e+03
## Land_ContourHLS 7.625e+03 3.205e+03
## Land_ContourLow -4.570e+03 4.174e+03
## Land_ContourLvl 4.528e+03 2.351e+03
## UtilitiesNoSeWa -3.020e+04 2.382e+04
## UtilitiesNoSewr -1.176e+04 1.972e+04
## Lot_ConfigCulDSac 4.119e+03 2.092e+03
## Lot_ConfigFR2 -6.163e+03 2.688e+03
## Lot_ConfigFR3 -4.250e+03 6.097e+03
## Lot_ConfigInside -4.659e+02 1.123e+03
## Land_SlopeMod 4.971e+03 2.561e+03
## Land_SlopeSev -2.517e+04 7.808e+03
## NeighborhoodBlueste 2.099e+04 1.126e+04
## NeighborhoodBriardale 8.075e+03 7.667e+03
## NeighborhoodBrookside 9.502e+02 7.054e+03
## NeighborhoodClear_Creek 4.308e+03 8.475e+03
## NeighborhoodCollege_Creek 3.894e+03 8.998e+03
## NeighborhoodCrawford 1.888e+04 8.275e+03
## NeighborhoodEdwards -5.226e+03 8.173e+03
## NeighborhoodGilbert -1.067e+04 5.137e+03
## NeighborhoodGreen_Hills 1.234e+05 1.787e+04
## NeighborhoodGreens 9.377e+03 9.995e+03
## NeighborhoodIowa_DOT_and_Rail_Road -2.213e+03 8.103e+03
## NeighborhoodLandmark 6.722e+03 2.347e+04
## NeighborhoodMeadow_Village 1.145e+04 1.153e+04
## NeighborhoodMitchell 4.017e+03 1.060e+04
## NeighborhoodNorth_Ames -9.349e+03 5.946e+03
## NeighborhoodNorthpark_Villa 1.586e+04 1.024e+04
## NeighborhoodNorthridge 2.327e+04 6.010e+03
## NeighborhoodNorthridge_Heights 1.268e+04 5.218e+03
## NeighborhoodNorthwest_Ames -1.162e+04 5.621e+03
## NeighborhoodOld_Town -9.832e+03 7.194e+03
## NeighborhoodSawyer 8.041e+02 7.449e+03
## NeighborhoodSawyer_West -1.086e+03 7.883e+03
## NeighborhoodSomerset 9.752e+03 6.039e+03
## NeighborhoodSouth_and_West_of_Iowa_State_University -1.162e+03 8.584e+03
## NeighborhoodStone_Brook 3.128e+04 5.773e+03
## NeighborhoodTimberland 1.145e+04 1.044e+04
## NeighborhoodVeenker -5.008e+03 7.477e+03
## Condition_1Feedr 3.307e+03 3.084e+03
## Condition_1Norm 1.230e+04 2.567e+03
## Condition_1PosA 1.512e+04 5.860e+03
## Condition_1PosN 1.553e+04 4.529e+03
## Condition_1RRAe -1.866e+03 5.077e+03
## Condition_1RRAn 7.700e+03 4.266e+03
## Condition_1RRNe 1.890e+03 9.394e+03
## Condition_1RRNn 1.290e+00 7.831e+03
## Condition_2Feedr -6.461e+03 1.192e+04
## Condition_2Norm 2.930e+03 1.035e+04
## Condition_2PosA 2.946e+04 1.617e+04
## Condition_2PosN -1.259e+05 1.550e+04
## Condition_2RRAe -4.266e+04 3.054e+04
## Condition_2RRAn 2.470e+03 2.387e+04
## Condition_2RRNn 1.446e+04 1.850e+04
## Bldg_TypeOneFam 2.019e+03 1.459e+04
## Bldg_TypeTwnhs -5.225e+03 1.707e+04
## Bldg_TypeTwnhsE 5.260e+02 1.683e+04
## Bldg_TypeTwoFmCon NA NA
## House_StyleOne_and_Half_Unf 5.423e+03 1.104e+04
## House_StyleOne_Story 1.022e+04 5.321e+03
## House_StyleSFoyer 1.207e+04 6.976e+03
## House_StyleSLvl 1.577e+04 9.316e+03
## House_StyleTwo_and_Half_Fin -7.546e+03 1.165e+04
## House_StyleTwo_and_Half_Unf 6.292e+03 8.300e+03
## House_StyleTwo_Story 5.972e+02 4.883e+03
## Overall_QualAverage -2.826e+03 1.341e+03
## Overall_QualBelow_Average -5.202e+03 2.137e+03
## Overall_QualExcellent 5.180e+04 3.783e+03
## Overall_QualFair -3.304e+03 4.269e+03
## Overall_QualGood 5.364e+03 1.442e+03
## Overall_QualPoor -4.212e+03 7.168e+03
## Overall_QualVery_Excellent 8.152e+04 6.211e+03
## Overall_QualVery_Good 1.848e+04 2.116e+03
## Overall_QualVery_Poor 9.535e+03 1.863e+04
## Overall_CondAverage -5.527e+03 1.295e+03
## Overall_CondBelow_Average -1.387e+04 2.556e+03
## Overall_CondExcellent 1.907e+04 4.210e+03
## Overall_CondFair -2.605e+04 3.753e+03
## Overall_CondGood 7.790e+03 1.561e+03
## Overall_CondPoor -2.585e+04 9.225e+03
## Overall_CondVery_Good 1.101e+04 2.317e+03
## Overall_CondVery_Poor -1.726e+04 1.139e+04
## Year_Built 4.018e+02 5.267e+01
## Year_Remod_Add 6.545e+01 3.471e+01
## Roof_StyleGable 9.138e+03 9.895e+03
## Roof_StyleGambrel 4.774e+03 1.111e+04
## Roof_StyleHip 9.381e+03 9.967e+03
## Roof_StyleMansard -2.559e+03 1.227e+04
## Roof_StyleShed 2.778e+04 1.707e+04
## Roof_MatlCompShg 6.898e+05 3.162e+04
## Roof_MatlMembran 7.508e+05 4.061e+04
## Roof_MatlMetal 7.325e+05 4.042e+04
## Roof_MatlRoll 6.959e+05 3.847e+04
## Roof_MatlTar&Grv 6.900e+05 3.273e+04
## Roof_MatlWdShake 6.875e+05 3.277e+04
## Roof_MatlWdShngl 7.346e+05 3.260e+04
## Exterior_1stAsphShn -1.194e+04 2.332e+04
## Exterior_1stBrkComm 7.759e+03 1.330e+04
## Exterior_1stBrkFace 1.383e+04 7.867e+03
## Exterior_1stCBlock -5.702e+03 2.748e+04
## Exterior_1stCemntBd 1.205e+03 1.241e+04
## Exterior_1stHdBoard -6.228e+03 7.684e+03
## Exterior_1stImStucc -2.456e+04 2.334e+04
## Exterior_1stMetalSd 1.551e+03 8.632e+03
## Exterior_1stPlywood -4.318e+03 7.530e+03
## Exterior_1stPreCast 6.239e+04 2.293e+04
## Exterior_1stStone -2.294e+03 1.861e+04
## Exterior_1stStucco 1.305e+03 8.527e+03
## Exterior_1stVinylSd -9.551e+03 8.381e+03
## Exterior_1stWd Sdng -2.899e+03 7.431e+03
## Exterior_1stWdShing -6.136e+03 8.014e+03
## Exterior_2ndAsphShn 1.659e+04 1.761e+04
## Exterior_2ndBrk Cmn -3.326e+02 1.213e+04
## Exterior_2ndBrkFace -4.499e+02 8.629e+03
## Exterior_2ndCBlock 8.888e+03 2.247e+04
## Exterior_2ndCmentBd 2.371e+02 1.258e+04
## Exterior_2ndHdBoard 5.302e+03 7.908e+03
## Exterior_2ndImStucc 8.676e+03 9.559e+03
## Exterior_2ndMetalSd 1.086e+03 8.815e+03
## Exterior_2ndOther -1.118e+04 2.302e+04
## Exterior_2ndPlywood 3.272e+03 7.620e+03
## Exterior_2ndPreCast NA NA
## Exterior_2ndStone -5.752e+03 1.295e+04
## Exterior_2ndStucco 3.308e+03 8.698e+03
## Exterior_2ndVinylSd 1.154e+04 8.555e+03
## Exterior_2ndWd Sdng 4.171e+03 7.676e+03
## Exterior_2ndWd Shng 6.923e+03 8.038e+03
## Mas_Vnr_TypeBrkFace 3.107e+03 4.521e+03
## Mas_Vnr_TypeCBlock -7.058e+04 2.688e+04
## Mas_Vnr_TypeNone 6.745e+03 4.530e+03
## Mas_Vnr_TypeStone 8.371e+03 4.746e+03
## Mas_Vnr_Area 1.958e+01 3.756e+00
## Exter_QualFair -6.031e+03 6.081e+03
## Exter_QualGood -1.046e+04 3.345e+03
## Exter_QualTypical -1.162e+04 3.670e+03
## Exter_CondFair 1.003e+03 7.552e+03
## Exter_CondGood 3.167e+03 6.803e+03
## Exter_CondPoor -8.777e+03 1.857e+04
## Exter_CondTypical 5.240e+03 6.806e+03
## FoundationCBlock -8.659e+01 1.969e+03
## FoundationPConc 1.619e+03 2.133e+03
## FoundationSlab -3.174e+03 5.930e+03
## FoundationStone 1.089e+04 7.102e+03
## FoundationWood -1.776e+04 1.006e+04
## Bsmt_QualFair -9.787e+03 3.737e+03
## Bsmt_QualGood -1.276e+04 2.103e+03
## Bsmt_QualNo_Basement 1.783e+04 3.860e+04
## Bsmt_QualPoor -7.999e+03 1.798e+04
## Bsmt_QualTypical -1.105e+04 2.626e+03
## Bsmt_CondFair 1.349e+03 1.302e+04
## Bsmt_CondGood 8.209e+02 1.292e+04
## Bsmt_CondNo_Basement NA NA
## Bsmt_CondPoor -1.390e+03 1.769e+04
## Bsmt_CondTypical 7.412e+02 1.278e+04
## Bsmt_ExposureGd 1.013e+04 1.877e+03
## Bsmt_ExposureMn -5.889e+03 1.885e+03
## Bsmt_ExposureNo -4.784e+03 1.419e+03
## Bsmt_ExposureNo_Basement -1.113e+04 1.223e+04
## BsmtFin_Type_1BLQ -3.908e+03 6.650e+03
## BsmtFin_Type_1GLQ -3.384e+03 1.295e+04
## BsmtFin_Type_1LwQ -1.460e+04 1.940e+04
## BsmtFin_Type_1No_Basement NA NA
## BsmtFin_Type_1Rec -2.036e+04 3.222e+04
## BsmtFin_Type_1Unf -1.775e+04 3.865e+04
## BsmtFin_SF_1 3.550e+03 6.436e+03
## BsmtFin_Type_2BLQ -7.072e+03 4.152e+03
## BsmtFin_Type_2GLQ 6.535e+03 5.055e+03
## BsmtFin_Type_2LwQ -9.866e+03 4.052e+03
## BsmtFin_Type_2No_Basement -2.649e+04 2.166e+04
## BsmtFin_Type_2Rec -8.215e+03 3.863e+03
## BsmtFin_Type_2Unf -3.995e+03 3.967e+03
## BsmtFin_SF_2 -5.693e+00 4.647e+00
## Bsmt_Unf_SF -2.028e+01 1.898e+00
## Total_Bsmt_SF 3.889e+01 3.012e+00
## HeatingGasA 1.080e+04 2.233e+04
## HeatingGasW 1.008e+04 2.279e+04
## HeatingGrav 8.183e+03 2.392e+04
## HeatingOthW -1.345e+04 2.740e+04
## HeatingWall 2.083e+04 2.496e+04
## Heating_QCFair -4.836e+03 2.849e+03
## Heating_QCGood -1.762e+03 1.296e+03
## Heating_QCPoor -1.589e+04 1.553e+04
## Heating_QCTypical -3.240e+03 1.276e+03
## Central_AirY -2.879e+03 2.269e+03
## ElectricalFuseF -1.420e+03 3.762e+03
## ElectricalFuseP -1.841e+03 8.563e+03
## ElectricalMix 3.208e+04 3.011e+04
## ElectricalSBrkr -9.048e+02 1.846e+03
## ElectricalUnknown 5.572e+03 2.138e+04
## First_Flr_SF 4.946e+01 3.171e+00
## Second_Flr_SF 6.087e+01 3.503e+00
## Low_Qual_Fin_SF 3.005e+01 1.052e+01
## Gr_Liv_Area NA NA
## Bsmt_Full_Bath 1.545e+03 1.203e+03
## Bsmt_Half_Bath -1.711e+02 1.834e+03
## Full_Bath 3.923e+03 1.352e+03
## Half_Bath 2.495e+03 1.285e+03
## Bedroom_AbvGr -2.820e+03 8.567e+02
## Kitchen_AbvGr -8.627e+03 3.978e+03
## Kitchen_QualFair -1.638e+04 3.943e+03
## Kitchen_QualGood -1.328e+04 2.364e+03
## Kitchen_QualPoor 2.238e+04 2.276e+04
## Kitchen_QualTypical -1.469e+04 2.603e+03
## TotRms_AbvGrd 6.877e+02 5.745e+02
## FunctionalMaj2 1.156e+02 9.771e+03
## FunctionalMin1 1.614e+04 6.238e+03
## FunctionalMin2 1.250e+04 6.297e+03
## FunctionalMod 7.954e+03 6.892e+03
## FunctionalSal -1.076e+04 2.334e+04
## FunctionalSev -1.526e+04 1.764e+04
## FunctionalTyp 2.333e+04 5.643e+03
## Fireplaces 7.896e+03 1.591e+03
## Fireplace_QuFair -3.676e+03 4.405e+03
## Fireplace_QuGood -3.025e+02 3.520e+03
## Fireplace_QuNo_Fireplace 5.283e+03 4.010e+03
## Fireplace_QuPoor 2.153e+02 4.866e+03
## Fireplace_QuTypical -1.224e+03 3.622e+03
## Garage_TypeBasment 1.306e+03 4.243e+03
## Garage_TypeBuiltIn -7.939e+02 2.063e+03
## Garage_TypeCarPort -4.466e+03 6.080e+03
## Garage_TypeDetchd -9.058e+01 1.472e+03
## Garage_TypeMore_Than_Two_Types -1.187e+04 4.945e+03
## Garage_TypeNo_Garage 1.094e+04 1.764e+04
## Garage_FinishNo_Garage -2.119e+04 2.219e+04
## Garage_FinishRFn -1.531e+03 1.243e+03
## Garage_FinishUnf 4.216e+02 1.488e+03
## Garage_Cars 3.766e+03 1.431e+03
## Garage_Area 1.415e+01 4.793e+00
## Garage_QualFair -5.521e+04 1.875e+04
## Garage_QualGood -4.200e+04 1.857e+04
## Garage_QualNo_Garage NA NA
## Garage_QualPoor -6.145e+04 2.312e+04
## Garage_QualTypical -5.382e+04 1.858e+04
## Garage_CondFair 3.590e+04 1.798e+04
## Garage_CondGood 3.170e+04 1.846e+04
## Garage_CondNo_Garage NA NA
## Garage_CondPoor 3.532e+04 1.934e+04
## Garage_CondTypical 3.819e+04 1.771e+04
## Paved_DrivePartial_Pavement -2.010e+03 3.315e+03
## Paved_DrivePaved -4.863e+02 2.102e+03
## Wood_Deck_SF 7.655e+00 3.667e+00
## Open_Porch_SF 8.550e-01 7.007e+00
## Enclosed_Porch 1.056e+01 7.388e+00
## Three_season_porch 1.967e+01 1.654e+01
## Screen_Porch 4.506e+01 7.679e+00
## Pool_Area -1.795e+01 5.457e+01
## Pool_QCFair -4.416e+04 2.953e+04
## Pool_QCGood -3.743e+04 2.596e+04
## Pool_QCNo_Pool -8.124e+04 2.229e+04
## Pool_QCTypical -6.947e+04 1.847e+04
## FenceGood_Wood 2.156e+03 2.984e+03
## FenceMinimum_Privacy 2.773e+03 2.423e+03
## FenceMinimum_Wood_Wire -3.897e+02 6.603e+03
## FenceNo_Fence 8.514e+02 2.184e+03
## Misc_FeatureGar2 5.565e+05 2.893e+04
## Misc_FeatureNone 5.565e+05 3.661e+04
## Misc_FeatureOthr 5.690e+05 3.459e+04
## Misc_FeatureShed 5.563e+05 3.570e+04
## Misc_FeatureTenC 4.948e+05 4.628e+04
## Misc_Val 9.341e-01 1.652e+00
## Mo_Sold -1.398e+02 1.529e+02
## Year_Sold -5.532e+02 3.213e+02
## Sale_TypeCon 2.867e+04 1.007e+04
## Sale_TypeConLD 6.356e+03 5.293e+03
## Sale_TypeConLI -4.035e+03 7.682e+03
## Sale_TypeConLw 5.336e+03 8.335e+03
## Sale_TypeCWD 7.068e+03 6.734e+03
## Sale_TypeNew 1.228e+04 9.638e+03
## Sale_TypeOth 4.671e+03 8.469e+03
## Sale_TypeVWD -6.185e+03 2.172e+04
## Sale_TypeWD 2.034e+03 2.609e+03
## Sale_ConditionAdjLand 2.100e+04 7.209e+03
## Sale_ConditionAlloca 1.305e+04 5.395e+03
## Sale_ConditionFamily 2.933e+03 3.637e+03
## Sale_ConditionNormal 8.738e+03 1.858e+03
## Sale_ConditionPartial 8.433e+03 9.281e+03
## Longitude 4.054e+04 8.903e+04
## Latitude 3.006e+05 1.308e+05
## t value Pr(>|t|)
## (Intercept) -1.095 0.273511
## MS_SubClassOne_and_Half_Story_Finished_All_Ages 1.458 0.144976
## MS_SubClassOne_and_Half_Story_PUD_All_Ages -0.849 0.396093
## MS_SubClassOne_and_Half_Story_Unfinished_All_Ages 1.466 0.142794
## MS_SubClassOne_Story_1945_and_Older 1.303 0.192556
## MS_SubClassOne_Story_1946_and_Newer_All_Styles 1.060 0.289134
## MS_SubClassOne_Story_PUD_1946_and_Newer -0.094 0.925013
## MS_SubClassOne_Story_with_Finished_Attic_All_Ages 1.534 0.125214
## MS_SubClassPUD_Multilevel_Split_Level_Foyer -0.303 0.761984
## MS_SubClassSplit_Foyer 0.774 0.438771
## MS_SubClassSplit_or_Multilevel 0.369 0.711869
## MS_SubClassTwo_and_Half_Story_All_Ages 0.803 0.421869
## MS_SubClassTwo_Family_conversion_All_Styles_and_Ages 2.234 0.025548 *
## MS_SubClassTwo_Story_1945_and_Older 1.546 0.122215
## MS_SubClassTwo_Story_1946_and_Newer 1.018 0.308720
## MS_SubClassTwo_Story_PUD_1946_and_Newer -0.199 0.842277
## MS_ZoningC_all 0.587 0.557498
## MS_ZoningFloating_Village_Residential 0.842 0.399789
## MS_ZoningI_all 0.645 0.518823
## MS_ZoningResidential_High_Density 0.949 0.342517
## MS_ZoningResidential_Low_Density 0.888 0.374528
## MS_ZoningResidential_Medium_Density 0.788 0.430506
## Lot_Frontage 0.816 0.414490
## Lot_Area 7.243 5.75e-13 ***
## StreetPave 2.426 0.015342 *
## AlleyNo_Alley_Access -0.422 0.672879
## AlleyPaved 0.006 0.995231
## Lot_ShapeModerately_Irregular 0.005 0.995776
## Lot_ShapeRegular -0.616 0.537975
## Lot_ShapeSlightly_Irregular -0.840 0.400815
## Land_ContourHLS 2.379 0.017428 *
## Land_ContourLow -1.095 0.273603
## Land_ContourLvl 1.926 0.054196 .
## UtilitiesNoSeWa -1.268 0.204981
## UtilitiesNoSewr -0.597 0.550866
## Lot_ConfigCulDSac 1.969 0.049037 *
## Lot_ConfigFR2 -2.293 0.021926 *
## Lot_ConfigFR3 -0.697 0.485773
## Lot_ConfigInside -0.415 0.678397
## Land_SlopeMod 1.941 0.052343 .
## Land_SlopeSev -3.224 0.001280 **
## NeighborhoodBlueste 1.864 0.062415 .
## NeighborhoodBriardale 1.053 0.292359
## NeighborhoodBrookside 0.135 0.892858
## NeighborhoodClear_Creek 0.508 0.611299
## NeighborhoodCollege_Creek 0.433 0.665275
## NeighborhoodCrawford 2.281 0.022608 *
## NeighborhoodEdwards -0.639 0.522579
## NeighborhoodGilbert -2.077 0.037918 *
## NeighborhoodGreen_Hills 6.909 6.11e-12 ***
## NeighborhoodGreens 0.938 0.348238
## NeighborhoodIowa_DOT_and_Rail_Road -0.273 0.784762
## NeighborhoodLandmark 0.286 0.774590
## NeighborhoodMeadow_Village 0.993 0.320874
## NeighborhoodMitchell 0.379 0.704715
## NeighborhoodNorth_Ames -1.572 0.115973
## NeighborhoodNorthpark_Villa 1.549 0.121591
## NeighborhoodNorthridge 3.871 0.000111 ***
## NeighborhoodNorthridge_Heights 2.430 0.015153 *
## NeighborhoodNorthwest_Ames -2.067 0.038822 *
## NeighborhoodOld_Town -1.367 0.171827
## NeighborhoodSawyer 0.108 0.914040
## NeighborhoodSawyer_West -0.138 0.890391
## NeighborhoodSomerset 1.615 0.106482
## NeighborhoodSouth_and_West_of_Iowa_State_University -0.135 0.892302
## NeighborhoodStone_Brook 5.419 6.55e-08 ***
## NeighborhoodTimberland 1.097 0.272947
## NeighborhoodVeenker -0.670 0.503048
## Condition_1Feedr 1.072 0.283705
## Condition_1Norm 4.791 1.75e-06 ***
## Condition_1PosA 2.579 0.009955 **
## Condition_1PosN 3.430 0.000614 ***
## Condition_1RRAe -0.367 0.713306
## Condition_1RRAn 1.805 0.071203 .
## Condition_1RRNe 0.201 0.840547
## Condition_1RRNn 0.000 0.999869
## Condition_2Feedr -0.542 0.587921
## Condition_2Norm 0.283 0.777186
## Condition_2PosA 1.821 0.068668 .
## Condition_2PosN -8.121 7.04e-16 ***
## Condition_2RRAe -1.397 0.162532
## Condition_2RRAn 0.103 0.917584
## Condition_2RRNn 0.782 0.434542
## Bldg_TypeOneFam 0.138 0.889960
## Bldg_TypeTwnhs -0.306 0.759594
## Bldg_TypeTwnhsE 0.031 0.975071
## Bldg_TypeTwoFmCon NA NA
## House_StyleOne_and_Half_Unf 0.491 0.623177
## House_StyleOne_Story 1.920 0.055018 .
## House_StyleSFoyer 1.731 0.083650 .
## House_StyleSLvl 1.693 0.090641 .
## House_StyleTwo_and_Half_Fin -0.648 0.517291
## House_StyleTwo_and_Half_Unf 0.758 0.448505
## House_StyleTwo_Story 0.122 0.902673
## Overall_QualAverage -2.107 0.035171 *
## Overall_QualBelow_Average -2.435 0.014968 *
## Overall_QualExcellent 13.693 < 2e-16 ***
## Overall_QualFair -0.774 0.438966
## Overall_QualGood 3.720 0.000204 ***
## Overall_QualPoor -0.588 0.556856
## Overall_QualVery_Excellent 13.126 < 2e-16 ***
## Overall_QualVery_Good 8.737 < 2e-16 ***
## Overall_QualVery_Poor 0.512 0.608745
## Overall_CondAverage -4.270 2.03e-05 ***
## Overall_CondBelow_Average -5.426 6.28e-08 ***
## Overall_CondExcellent 4.531 6.13e-06 ***
## Overall_CondFair -6.941 4.90e-12 ***
## Overall_CondGood 4.991 6.41e-07 ***
## Overall_CondPoor -2.802 0.005117 **
## Overall_CondVery_Good 4.752 2.12e-06 ***
## Overall_CondVery_Poor -1.516 0.129637
## Year_Built 7.628 3.30e-14 ***
## Year_Remod_Add 1.886 0.059456 .
## Roof_StyleGable 0.923 0.355859
## Roof_StyleGambrel 0.430 0.667298
## Roof_StyleHip 0.941 0.346657
## Roof_StyleMansard -0.208 0.834869
## Roof_StyleShed 1.627 0.103774
## Roof_MatlCompShg 21.816 < 2e-16 ***
## Roof_MatlMembran 18.489 < 2e-16 ***
## Roof_MatlMetal 18.121 < 2e-16 ***
## Roof_MatlRoll 18.092 < 2e-16 ***
## Roof_MatlTar&Grv 21.083 < 2e-16 ***
## Roof_MatlWdShake 20.977 < 2e-16 ***
## Roof_MatlWdShngl 22.533 < 2e-16 ***
## Exterior_1stAsphShn -0.512 0.608628
## Exterior_1stBrkComm 0.583 0.559609
## Exterior_1stBrkFace 1.757 0.078963 .
## Exterior_1stCBlock -0.208 0.835620
## Exterior_1stCemntBd 0.097 0.922645
## Exterior_1stHdBoard -0.811 0.417672
## Exterior_1stImStucc -1.053 0.292646
## Exterior_1stMetalSd 0.180 0.857391
## Exterior_1stPlywood -0.573 0.566459
## Exterior_1stPreCast 2.721 0.006546 **
## Exterior_1stStone -0.123 0.901903
## Exterior_1stStucco 0.153 0.878379
## Exterior_1stVinylSd -1.140 0.254581
## Exterior_1stWd Sdng -0.390 0.696484
## Exterior_1stWdShing -0.766 0.443939
## Exterior_2ndAsphShn 0.942 0.346090
## Exterior_2ndBrk Cmn -0.027 0.978130
## Exterior_2ndBrkFace -0.052 0.958419
## Exterior_2ndCBlock 0.396 0.692463
## Exterior_2ndCmentBd 0.019 0.984962
## Exterior_2ndHdBoard 0.670 0.502611
## Exterior_2ndImStucc 0.908 0.364189
## Exterior_2ndMetalSd 0.123 0.901947
## Exterior_2ndOther -0.486 0.627289
## Exterior_2ndPlywood 0.429 0.667642
## Exterior_2ndPreCast NA NA
## Exterior_2ndStone -0.444 0.657012
## Exterior_2ndStucco 0.380 0.703769
## Exterior_2ndVinylSd 1.349 0.177584
## Exterior_2ndWd Sdng 0.543 0.586957
## Exterior_2ndWd Shng 0.861 0.389179
## Mas_Vnr_TypeBrkFace 0.687 0.491974
## Mas_Vnr_TypeCBlock -2.626 0.008687 **
## Mas_Vnr_TypeNone 1.489 0.136620
## Mas_Vnr_TypeStone 1.764 0.077872 .
## Mas_Vnr_Area 5.212 2.01e-07 ***
## Exter_QualFair -0.992 0.321364
## Exter_QualGood -3.128 0.001781 **
## Exter_QualTypical -3.166 0.001563 **
## Exter_CondFair 0.133 0.894313
## Exter_CondGood 0.466 0.641584
## Exter_CondPoor -0.473 0.636592
## Exter_CondTypical 0.770 0.441422
## FoundationCBlock -0.044 0.964923
## FoundationPConc 0.759 0.447809
## FoundationSlab -0.535 0.592540
## FoundationStone 1.533 0.125300
## FoundationWood -1.765 0.077632 .
## Bsmt_QualFair -2.619 0.008876 **
## Bsmt_QualGood -6.067 1.49e-09 ***
## Bsmt_QualNo_Basement 0.462 0.644110
## Bsmt_QualPoor -0.445 0.656426
## Bsmt_QualTypical -4.208 2.66e-05 ***
## Bsmt_CondFair 0.104 0.917456
## Bsmt_CondGood 0.064 0.949340
## Bsmt_CondNo_Basement NA NA
## Bsmt_CondPoor -0.079 0.937387
## Bsmt_CondTypical 0.058 0.953748
## Bsmt_ExposureGd 5.400 7.27e-08 ***
## Bsmt_ExposureMn -3.124 0.001804 **
## Bsmt_ExposureNo -3.371 0.000760 ***
## Bsmt_ExposureNo_Basement -0.910 0.362864
## BsmtFin_Type_1BLQ -0.588 0.556851
## BsmtFin_Type_1GLQ -0.261 0.793901
## BsmtFin_Type_1LwQ -0.753 0.451749
## BsmtFin_Type_1No_Basement NA NA
## BsmtFin_Type_1Rec -0.632 0.527562
## BsmtFin_Type_1Unf -0.459 0.646019
## BsmtFin_SF_1 0.552 0.581307
## BsmtFin_Type_2BLQ -1.703 0.088640 .
## BsmtFin_Type_2GLQ 1.293 0.196168
## BsmtFin_Type_2LwQ -2.435 0.014959 *
## BsmtFin_Type_2No_Basement -1.223 0.221477
## BsmtFin_Type_2Rec -2.126 0.033570 *
## BsmtFin_Type_2Unf -1.007 0.314011
## BsmtFin_SF_2 -1.225 0.220638
## Bsmt_Unf_SF -10.683 < 2e-16 ***
## Total_Bsmt_SF 12.912 < 2e-16 ***
## HeatingGasA 0.484 0.628535
## HeatingGasW 0.442 0.658440
## HeatingGrav 0.342 0.732318
## HeatingOthW -0.491 0.623617
## HeatingWall 0.835 0.404054
## Heating_QCFair -1.698 0.089694 .
## Heating_QCGood -1.360 0.174078
## Heating_QCPoor -1.023 0.306487
## Heating_QCTypical -2.540 0.011153 *
## Central_AirY -1.269 0.204691
## ElectricalFuseF -0.377 0.705917
## ElectricalFuseP -0.215 0.829748
## ElectricalMix 1.065 0.286828
## ElectricalSBrkr -0.490 0.624012
## ElectricalUnknown 0.261 0.794401
## First_Flr_SF 15.597 < 2e-16 ***
## Second_Flr_SF 17.377 < 2e-16 ***
## Low_Qual_Fin_SF 2.857 0.004309 **
## Gr_Liv_Area NA NA
## Bsmt_Full_Bath 1.283 0.199468
## Bsmt_Half_Bath -0.093 0.925692
## Full_Bath 2.902 0.003737 **
## Half_Bath 1.941 0.052388 .
## Bedroom_AbvGr -3.292 0.001010 **
## Kitchen_AbvGr -2.169 0.030203 *
## Kitchen_QualFair -4.155 3.36e-05 ***
## Kitchen_QualGood -5.616 2.16e-08 ***
## Kitchen_QualPoor 0.983 0.325452
## Kitchen_QualTypical -5.646 1.82e-08 ***
## TotRms_AbvGrd 1.197 0.231386
## FunctionalMaj2 0.012 0.990561
## FunctionalMin1 2.588 0.009703 **
## FunctionalMin2 1.985 0.047226 *
## FunctionalMod 1.154 0.248566
## FunctionalSal -0.461 0.644950
## FunctionalSev -0.865 0.387064
## FunctionalTyp 4.134 3.67e-05 ***
## Fireplaces 4.963 7.38e-07 ***
## Fireplace_QuFair -0.835 0.404047
## Fireplace_QuGood -0.086 0.931535
## Fireplace_QuNo_Fireplace 1.317 0.187833
## Fireplace_QuPoor 0.044 0.964702
## Fireplace_QuTypical -0.338 0.735440
## Garage_TypeBasment 0.308 0.758221
## Garage_TypeBuiltIn -0.385 0.700385
## Garage_TypeCarPort -0.735 0.462653
## Garage_TypeDetchd -0.062 0.950923
## Garage_TypeMore_Than_Two_Types -2.400 0.016468 *
## Garage_TypeNo_Garage 0.620 0.535284
## Garage_FinishNo_Garage -0.955 0.339685
## Garage_FinishRFn -1.232 0.218061
## Garage_FinishUnf 0.283 0.776873
## Garage_Cars 2.631 0.008554 **
## Garage_Area 2.951 0.003190 **
## Garage_QualFair -2.944 0.003265 **
## Garage_QualGood -2.262 0.023769 *
## Garage_QualNo_Garage NA NA
## Garage_QualPoor -2.658 0.007919 **
## Garage_QualTypical -2.896 0.003811 **
## Garage_CondFair 1.996 0.046007 *
## Garage_CondGood 1.717 0.086110 .
## Garage_CondNo_Garage NA NA
## Garage_CondPoor 1.826 0.067971 .
## Garage_CondTypical 2.157 0.031109 *
## Paved_DrivePartial_Pavement -0.606 0.544267
## Paved_DrivePaved -0.231 0.817059
## Wood_Deck_SF 2.087 0.036959 *
## Open_Porch_SF 0.122 0.902892
## Enclosed_Porch 1.430 0.152938
## Three_season_porch 1.189 0.234387
## Screen_Porch 5.867 4.98e-09 ***
## Pool_Area -0.329 0.742255
## Pool_QCFair -1.495 0.134919
## Pool_QCGood -1.442 0.149390
## Pool_QCNo_Pool -3.644 0.000273 ***
## Pool_QCTypical -3.762 0.000172 ***
## FenceGood_Wood 0.723 0.469964
## FenceMinimum_Privacy 1.144 0.252553
## FenceMinimum_Wood_Wire -0.059 0.952938
## FenceNo_Fence 0.390 0.696682
## Misc_FeatureGar2 19.240 < 2e-16 ***
## Misc_FeatureNone 15.202 < 2e-16 ***
## Misc_FeatureOthr 16.452 < 2e-16 ***
## Misc_FeatureShed 15.585 < 2e-16 ***
## Misc_FeatureTenC 10.692 < 2e-16 ***
## Misc_Val 0.566 0.571768
## Mo_Sold -0.914 0.360556
## Year_Sold -1.722 0.085255 .
## Sale_TypeCon 2.846 0.004458 **
## Sale_TypeConLD 1.201 0.229960
## Sale_TypeConLI -0.525 0.599439
## Sale_TypeConLw 0.640 0.522059
## Sale_TypeCWD 1.050 0.294015
## Sale_TypeNew 1.274 0.202605
## Sale_TypeOth 0.552 0.581301
## Sale_TypeVWD -0.285 0.775817
## Sale_TypeWD 0.780 0.435701
## Sale_ConditionAdjLand 2.914 0.003602 **
## Sale_ConditionAlloca 2.419 0.015614 *
## Sale_ConditionFamily 0.806 0.420044
## Sale_ConditionNormal 4.704 2.68e-06 ***
## Sale_ConditionPartial 0.909 0.363618
## Longitude 0.455 0.648861
## Latitude 2.299 0.021611 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20770 on 2630 degrees of freedom
## Multiple R-squared: 0.9393, Adjusted R-squared: 0.9324
## F-statistic: 136.1 on 299 and 2630 DF, p-value: < 2.2e-16
We have a good absolute fit in terms of R-squared (which is expected
given the large number of parameters to fit the data). However, the
presence of many non-significant variables strongly suggests the use of
a penalizing method. Additionally, there are NA values in
the estimated coefficients because standard OLS naturally drops
perfectly collinear dummy variables.
We will compute the \(L_1\) and \(L_2\) norms for the least squares estimates so we can later compare how much Ridge and Lasso shrink these coefficients.
Note that we omit the intercept (beta.hat.ls[-1])
because shrinkage penalties are generally only applied to the feature
coefficients, not the baseline intercept. We also use
na.rm = TRUE to handle the NA values when
computing the sums.
# Compute L1 and L2 norms for least square estimates (omit the intercept)
beta.hat.ls <- coefficients(lm.mod)[-1]
# L1 norm
ell_1.ls <- sum(abs(beta.hat.ls), na.rm = TRUE)
ell_1.ls
## [1] 11430475
# L2 norm
ell_2.ls <- sqrt(sum(beta.hat.ls^2, na.rm = TRUE))
ell_2.ls
## [1] 2292824
glmnetTo apply Ridge Regression, we will use the glmnet
package. Unlike standard lm(), glmnet() does
not use formula interfaces (e.g., y ~ x). It strictly
requires a design matrix X and a response vector
y.
We use the model.matrix() function to automatically
convert all categorical variables into dummy variables. We then drop the
first column of X to remove the intercept, as
glmnet handles the intercept internally.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.6.1
## Loading required package: Matrix
## Loaded glmnet 5.0
# Design matrix without the first column (intercept)
X <- model.matrix(Sale_Price ~ ., data = ames)
X <- X[, -1]
# Vector of responses
y <- ames$Sale_Price
Shrinkage methods require a grid of \(\lambda\) (lambda) values, which controls the strength of the penalty. We generate a sequence of 100 values ranging from \(10^9\) down to \(10^{-2}\). We use a larger upper bound here because housing prices scale into the hundreds of thousands.
grid <- 10^seq(9, -2, length = 100)
plot(1:100, grid, type = "b", ylab = "Lambda Values", pch = 20)
We are now ready to apply the glmnet() function. The
behavior of the function is governed by several key arguments: 1.
alpha: Controls the penalty type.
alpha = 0 applies Ridge regression, while
alpha = 1 (the default) applies Lasso. 2.
standardize: Defaults to
TRUE. It is crucial to standardize variables before
applying shrinkage so that the penalty is applied fairly across
variables of different scales (e.g., square feet vs. number of
bedrooms). Note that, by default, the output coefficients are scaled
back to their original units. This allows us, for instance, to compare
the coefficient norms under OLS with those obtained by applying
shrinkage methods. 3. lambda: The grid of
values to test. If omitted, glmnet() will automatically
choose its own sequence.
ridge.mod <- glmnet(X, y, alpha = 0, lambda = grid)
The plot() method for a glmnet object
produces a coefficient profile plot, showing how the coefficients shrink
as the penalty changes. The xvar argument dictates the
X-axis: * "norm": Plots against the \(L_1\)-norm of the coefficients. *
"lambda": Plots against the log of the lambda sequence. *
"dev": Plots against the percent of deviance explained.
par(mfrow = c(1, 3))
plot(ridge.mod, xvar = "lambda", label = TRUE)
plot(ridge.mod, xvar = "norm", label = TRUE)
plot(ridge.mod, xvar = "dev", label = TRUE)
The coef() function returns a matrix where each row is a
predictor and each column corresponds to a specific \(\lambda\) in our grid. We can extract
parameters estimated at different \(\lambda\) values and observe how the \(L_2\) norm shrinks relative to our OLS
baseline.
dim(coef(ridge.mod))
## [1] 307 100
# Coefficients and shrinkage for a strong penalty (lambda[30])
ridge.mod$lambda[30]
## [1] 599484.3
sqrt(sum(coef(ridge.mod)[-1, 30]^2))
## [1] 86499.82
sqrt(sum(coef(ridge.mod)[-1, 30]^2)) / ell_2.ls
## [1] 0.03772631
cat("--------- \n")
## ---------
# Coefficients and shrinkage for a weaker penalty (lambda[70])
ridge.mod$lambda[70]
## [1] 21.54435
sqrt(sum(coef(ridge.mod)[-1, 70]^2))
## [1] 2064281
sqrt(sum(coef(ridge.mod)[-1, 70]^2)) / ell_2.ls
## [1] 0.9003222
The predict() function is highly versatile for
glmnet objects. We can supply an arbitrary \(\lambda\) value using the s
argument. Even if the exact \(\lambda\)
wasn’t in our original grid, glmnet will smoothly
interpolate the coefficients. *
type = "coefficients": Returns the estimated \(\beta\) values. *
type = "response": Returns the predicted \(\hat{y}\) values given new data
(newx).
# Extract coefficients for an arbitrary lambda (s = 100,000)
predict(ridge.mod, s = 100000, type = "coefficients")[1:10, ]
## (Intercept)
## -6099889.6362
## MS_SubClassOne_and_Half_Story_Finished_All_Ages
## 418.2984
## MS_SubClassOne_and_Half_Story_PUD_All_Ages
## -20494.8448
## MS_SubClassOne_and_Half_Story_Unfinished_All_Ages
## 1132.5200
## MS_SubClassOne_Story_1945_and_Older
## -3510.7074
## MS_SubClassOne_Story_1946_and_Newer_All_Styles
## 1186.1489
## MS_SubClassOne_Story_PUD_1946_and_Newer
## -2817.2026
## MS_SubClassOne_Story_with_Finished_Attic_All_Ages
## 2259.6478
## MS_SubClassPUD_Multilevel_Split_Level_Foyer
## -5762.1503
## MS_SubClassSplit_Foyer
## -866.4811
# Generate price predictions for the first 5 houses in the dataset
y.hat <- predict(ridge.mod, s = 100000, newx = X[1:5, ], type = "response")
y.hat
## s=1e+05
## 1 226218.1
## 2 115253.9
## 3 138587.6
## 4 241669.2
## 5 192860.0
Instead of arbitrarily picking \(\lambda\), we use cross-validation to find
the value that minimizes test error. The cv.glmnet()
function automates this. The nfolds argument defaults to
10-fold cross-validation. If lambda is not provided, the
function intelligently builds its own sequence.
set.seed(1)
cv.out <- cv.glmnet(X, y, alpha = 0, nfolds = 10, lambda = grid)
The output contains several useful objects: *
cv.out$lambda: The sequence of lambdas used. *
cv.out$cvm: The mean cross-validated error for each lambda.
* cv.out$cvsd: The standard error of the mean
cross-validated error.
length(cv.out$lambda)
## [1] 100
cv.out$lambda[1:10]
## [1] 1000000000 774263683 599484250 464158883 359381366 278255940
## [7] 215443469 166810054 129154967 100000000
cv.out$cvm[1:10]
## [1] 6371830234 6368237538 6363595405 6357609417 6349894111 6339955846
## [7] 6327164001 6310715582 6289592404 6262510512
cv.out$cvsd[1:10]
## [1] 362266384 362163002 362029441 361857126 361634881 361348353 360979141
## [8] 360503704 359892009 359105881
We can plot the cross-validation curve. The plot displays the Mean
Squared Error (MSE) curve (red dotted line) along with standard
deviation bounds (error bars). The two vertical dotted lines represent
lambda.min (minimizes MSE) and lambda.1se (the
most regularized model within one standard error of the minimum).
par(mfrow = c(1, 1))
plot(cv.out)
Finally, we isolate the optimal \(\lambda\) and fit our selected model, checking the final amount of shrinkage applied.
# Identify the best lambda value
i.bestlam <- which.min(cv.out$cvm)
bestlam <- cv.out$lambda[i.bestlam]
# Note: this is identical to calling cv.out$lambda.min
bestlam
## [1] 21544.35
# Fit the selected model with all the available data
beta.R <- predict(ridge.mod, s = bestlam, type = "coefficients")
# Check the final amount of shrinkage relative to OLS
sqrt(sum(beta.R[-1, 1]^2)) / ell_2.ls
## [1] 0.1325611
We have shrunk the \(L_2\) norm to 13% of its original value.
Next, we explore Lasso regression by setting alpha = 1.
In addition to shrinking coefficients, Lasso can force coefficients
exactly to zero, performing automated variable selection. We also
specify thresh = 1e-10, which tightly controls the
convergence threshold for the coordinate descent algorithm used by
glmnet.
lasso.mod <- glmnet(X, y, alpha = 1, lambda = grid, thresh = 1e-10)
## Warning in .resolve_control(control = control, nvars = nvars, deprecated =
## list(thresh = if (!missing(thresh)) thresh, : Passing 'thresh' to glmnet() is
## deprecated. Use control = list(thresh = ...) instead.
par(mfrow = c(1, 3))
plot(lasso.mod, xvar = "lambda", label = TRUE)
plot(lasso.mod, xvar = "norm", label = TRUE)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
plot(lasso.mod, xvar = "dev", label = TRUE)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
Notice in the plots that many lines hit exactly zero at varying values of \(\lambda\). We can verify this numerically by counting the non-zero coefficients at a specific index.
i <- 50 # Try values like 10, 30, 50, 70, 90 to see the sparsity change
beta.L <- round(coef(lasso.mod)[, i], 4)
beta.L <- beta.L[-1] # remove the intercept
# Number of non-zero coefficients
sum(beta.L != 0)
## [1] 32
# Calculate L1 norm shrinkage relative to OLS
sum(abs(beta.L)) / ell_1.ls
## [1] 0.02819341
With lambda set equal to ‘r lasso.mod$lambda[i]’, the fraction of of selected variable is of only ’sum(beta.L != 0)/ncol(X).
We apply 10-fold cross-validation to select the optimal \(\lambda\) for Lasso. Notice we do not
explicitly supply the lambda grid here; it is often better
to let glmnet build a custom sequence optimized for Lasso’s
sparsity because it exploits warm starts. Indeed, as you may notice,
glmnet is incredibly efficient and runs very quickly, even
when evaluating across many lambda values and 10 folds.
set.seed(10)
cv.out.lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
par(mfrow = c(1, 1))
plot(cv.out.lasso)
bestlam.lasso <- cv.out.lasso$lambda.min
bestlam.lasso
## [1] 858.0304
Using the optimal \(\lambda\), we extract the final model parameters and evaluate how many variables the Lasso completely removed from our massive predictor matrix.
# Refit the lasso model with automatic lambda sequence on all data
lasso.mod.auto <- glmnet(X, y, alpha = 1)
beta.L.best <- predict(lasso.mod.auto, s = bestlam.lasso, type = "coefficients")
# Number of non-zero coefficients (how many features survived?)
sum(beta.L.best[-1, 1] != 0)
## [1] 105
# Check the final amount of L1 shrinkage
sum(abs(beta.L.best[-1, 1])) / ell_1.ls
## [1] 0.09624803
In the Lasso framework, it is often preferable to rely on a more parsimonious model that achieves a similar error rate (specifically, within one standard error of the minimum MSE). This greatly simplifies interpretation by isolating only the main drivers of the target variable.
Therefore, we select the lambda.1se value and print the
resulting non-zero coefficients.
# Extract the lambda within 1 standard error of the minimum
bestlam.1se <- cv.out.lasso$lambda.1se
# Extract coefficients using the 1-SE lambda
beta.L.1se <- predict(cv.out.lasso, s = bestlam.1se, type = "coefficients")
# Filter and print only the non-zero coefficients for easier interpretation
non_zero_coeffs <- beta.L.1se[beta.L.1se[, 1] != 0, ]
non_zero_coeffs
## (Intercept) Lot_Area
## -1.062355e+06 7.668697e-02
## Land_ContourHLS NeighborhoodCrawford
## 1.091858e+03 9.640941e+03
## NeighborhoodNorthridge NeighborhoodNorthridge_Heights
## 1.566855e+04 1.845214e+04
## NeighborhoodStone_Brook Bldg_TypeOneFam
## 2.040019e+04 1.053676e+04
## Overall_QualExcellent Overall_QualVery_Excellent
## 7.046092e+04 8.096670e+04
## Overall_QualVery_Good Overall_CondFair
## 1.976773e+04 -5.978598e+03
## Year_Built Year_Remod_Add
## 2.922360e+02 2.690820e+02
## Roof_StyleHip Roof_MatlWdShngl
## 4.096688e+02 8.542416e+01
## Mas_Vnr_Area Exter_QualTypical
## 1.019847e+01 -6.633668e+03
## Bsmt_ExposureGd Bsmt_ExposureNo
## 1.356446e+04 -5.510536e+02
## BsmtFin_Type_1GLQ BsmtFin_SF_1
## 6.679853e+03 -2.278525e+01
## Total_Bsmt_SF Gr_Liv_Area
## 1.795301e+01 4.852856e+01
## Bsmt_Full_Bath Kitchen_AbvGr
## 4.431592e+03 -3.894106e+03
## Kitchen_QualTypical FunctionalTyp
## -5.684014e+03 3.800630e+03
## Fireplaces Garage_Cars
## 8.563160e+03 8.573784e+03
## Garage_Area Screen_Porch
## 1.032046e+01 1.350973e+01
## Pool_QCGood Misc_Val
## -1.050901e+04 -4.583767e-01
## Sale_TypeNew
## 5.473943e+03
length(non_zero_coeffs)
## [1] 35
With only 35 variables, we are able to achieve similar prediction performance while simultaneously creating a much simpler and more interpretable pricing algorithm.
Because the variables are on different scales (e.g., square footage
vs. categorical dummy variables), and the glmnet output
coefficients are scaled back to their original scale, we cannot simply
look at the raw glmnet output to determine feature
importance. Instead, we must calculate the standardized
coefficients, which represent the change in house price for a
one-standard-deviation increase in the predictor.
# Identify the names of the selected variables (excluding the intercept)
selected_vars <- names(non_zero_coeffs[-1])
# Calculate standard deviations ONLY for the selected predictors
sd_X_selected <- apply(X[, selected_vars], 2, sd)
# Calculate standardized coefficients
coef_std <- non_zero_coeffs[-1] * sd_X_selected
# Sort by absolute magnitude to find the most important drivers
sort(abs(coef_std), decreasing = TRUE)[1:5]
## Gr_Liv_Area Overall_QualExcellent
## 24531.620 13219.108
## Year_Built Overall_QualVery_Excellent
## 8838.782 8285.484
## Total_Bsmt_SF
## 7916.705
Looking at the standardized coefficients, the most relevant feature
driving the house price is the above-ground living area
(Gr_Liv_Area). For example, a one-standard-deviation
increase in Gr_Liv_Area (corresponding to an increase of
506 sq. feet) is associated with a roughly $24,532 increase in the
predicted price. Other critical drivers relate strictly to the overall
quality and age of the house: having an “Excellent” or “Very Excellent”
overall quality rating (Overall_QualExcellent,
Overall_QualVery_Excellent) substantially boosts the
property’s value, as do newer construction years
(Year_Built) and larger basement spaces
(Total_Bsmt_SF).
Elastic Net acts as a compromise between Ridge and Lasso. By setting
alpha to a value between 0 and 1, we apply a mix of both
\(L_1\) and \(L_2\) penalties.
This is particularly useful for datasets like Ames housing. When
multiple features are highly correlated (e.g., Garage_Area
and Garage_Cars), Lasso tends to randomly select one and
drop the others. Ridge keeps them all but shrinks them together. Elastic
Net will often select the correlated group and keep them together while
still dropping completely unhelpful variables.
Here, we set alpha = 0.5 for an even 50/50 mix of the
penalties.
set.seed(42)
# Fit cross-validated Elastic Net
cv.out.enet <- cv.glmnet(X, y, alpha = 0.5, nfolds = 10)
plot(cv.out.enet)
# Extract optimal lambda
bestlam.enet <- cv.out.enet$lambda.min
# Check sparsity of the Elastic Net
beta.EN <- predict(cv.out.enet, s = bestlam.enet, type = "coefficients")
Notice how Elastic Netdrops variables, but generally fewer than pure Lasso.
sum(beta.EN[-1, 1] != 0)
## [1] 169
Ideally, we would need a fresh, unseen test set to properly compare the final performance of the three methods. However, we can look at the minimum and 1-standard-error cross-validation errors (MSE) of all three models to get a rough idea of their relative predictive accuracy.
# Extract the minimum and 1-SE cross-validation errors for Ridge
mse.ridge.min <- cv.out$cvm[cv.out$lambda == cv.out$lambda.min]
mse.ridge.1se <- cv.out$cvm[cv.out$lambda == cv.out$lambda.1se]
# Extract the minimum and 1-SE cross-validation errors for Lasso
mse.lasso.min <- cv.out.lasso$cvm[cv.out.lasso$lambda == cv.out.lasso$lambda.min]
mse.lasso.1se <- cv.out.lasso$cvm[cv.out.lasso$lambda == cv.out.lasso$lambda.1se]
# Extract the minimum and 1-SE cross-validation errors for Elastic Net
mse.enet.min <- cv.out.enet$cvm[cv.out.enet$lambda == cv.out.enet$lambda.min]
mse.enet.1se <- cv.out.enet$cvm[cv.out.enet$lambda == cv.out.enet$lambda.1se]
# Create a clean data frame to display the comparison
comparison_table <- data.frame(
Model = c("Ridge", "Lasso", "Elastic Net (alpha=0.5)"),
MSE_min = c(mse.ridge.min, mse.lasso.min, mse.enet.min),
MSE_1se = c(mse.ridge.1se, mse.lasso.1se, mse.enet.1se)
)
# Print the comparison
comparison_table
## Model MSE_min MSE_1se
## 1 Ridge 845213971 1020466596
## 2 Lasso 870343220 997930426
## 3 Elastic Net (alpha=0.5) 841270631 966763405
Looking at the cross-validation errors, Elastic Net
emerges as the strongest overall performer for this dataset, achieving
both the lowest minimum error (MSE_min) and the lowest
regularized error (MSE_1se).
While Ridge slightly outperforms Lasso in raw predictive accuracy at the minimum error threshold, it does not perform variable selection, leaving us with a highly complex and difficult-to-interpret model containing hundreds of predictors. Lasso provides a much more interpretable, parsimonious model but sacrifices a small amount of predictive accuracy compared to Ridge.
Therefore, Elastic Net may represent the optimal compromise for the Ames dataset: it leverages the grouping effect of Ridge to handle the highly correlated housing features (yielding the best overall predictive accuracy), while still applying enough Lasso penalty to drop unhelpful variables.
UScrime datasetCriminologists are interested in the effect of punishment regimes on crime rates. This has been studied using aggregate data on 47 states of the USA for 1960. The data set contains the following columns:
M percentage of males aged 14–24 in total state
population
So indicator variable for a southern state
Ed mean years of schooling of the population aged 25
years or over
Po1 per capita expenditure on police protection in
1960
Po2 per capita expenditure on police protection in
1959
LF labour force participation rate of civilian urban
males in the age-group 14-24
M.F number of males per 100 females
Pop state population in 1960 in hundred
thousands
NW percentage of nonwhites in the
population
U1 unemployment rate of urban males 14–24
U2 unemployment rate of urban males 35–39
Wealth wealth: median value of transferable assets
or family income
Ineq income inequality: percentage of families
earning below half the median income
Prob probability of imprisonment: ratio of number of
commitments to number of offenses
Time average time in months served by offenders in
state prisons before their first release
Crime crime rate: number of offenses per 100,000
population in 1960
UScrime <- read.table("UScrime.txt", head=TRUE)
str(UScrime)
## 'data.frame': 47 obs. of 16 variables:
## $ M : num 15.1 14.3 14.2 13.6 14.1 12.1 12.7 13.1 15.7 14 ...
## $ So : int 1 0 1 0 0 0 1 1 1 0 ...
## $ Ed : num 9.1 11.3 8.9 12.1 12.1 11 11.1 10.9 9 11.8 ...
## $ Po1 : num 5.8 10.3 4.5 14.9 10.9 11.8 8.2 11.5 6.5 7.1 ...
## $ Po2 : num 5.6 9.5 4.4 14.1 10.1 11.5 7.9 10.9 6.2 6.8 ...
## $ LF : num 0.51 0.583 0.533 0.577 0.591 0.547 0.519 0.542 0.553 0.632 ...
## $ M.F : num 95 101.2 96.9 99.4 98.5 ...
## $ Pop : int 33 13 18 157 18 25 4 50 39 7 ...
## $ NW : num 30.1 10.2 21.9 8 3 4.4 13.9 17.9 28.6 1.5 ...
## $ U1 : num 0.108 0.096 0.094 0.102 0.091 0.084 0.097 0.079 0.081 0.1 ...
## $ U2 : num 4.1 3.6 3.3 3.9 2 2.9 3.8 3.5 2.8 2.4 ...
## $ Wealth: int 3940 5570 3180 6730 5780 6890 6200 4720 4210 5260 ...
## $ Ineq : num 26.1 19.4 25 16.7 17.4 12.6 16.8 20.6 23.9 17.4 ...
## $ Prob : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
## $ Time : num 26.2 25.3 24.3 29.9 21.3 ...
## $ Crime : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
Crime. Compute and
print or plot the correlation matrix for all variables. Identify any
obvious structural problems among the predictors. Finally, report the
marginal correlation between Po2 and the response, and
between Ineq and the response, and comment on the
result.hist(UScrime$Crime, main="Histogram of Crime Rate", xlab="Crime Rate", breaks=10, col="lightblue")
cor_matrix <- cor(UScrime)
corrplot(cor_matrix, method="color", type="upper", tl.cex=0.7)
cor_Po2 <- cor(UScrime$Po2, UScrime$Crime)
cor_Ineq <- cor(UScrime$Ineq, UScrime$Crime)
cat("Cor(Po2, y):", cor_Po2, "\n")
## Cor(Po2, y): 0.6667141
cat("Cor(Ineq, y):", cor_Ineq, "\n")
## Cor(Ineq, y): -0.1790237
The histogram shows that the crime rate is strictly positive, unimodal, and mildly right-skewed. Because of this skewness, the Gaussian assumption of the error terms in a standard OLS model should be carefully checked via diagnostic plots after fitting the model.
The correlation matrix reveals blocks of highly correlated
predictors, indicating possible structural problems (e.g.,
Po1 and Po2, or U1 and
U2). As expected, Po2 (police expenditure the
year before) is strongly positively correlated with the response.
However, rather than implying that more police causes more crime, this
is likely a spurious correlation driven by confounding variables (e.g.,
with historically higher crime rates naturally require larger police
budgets every year). Conversely, Ineq (Income inequality)
exhibits a negative marginal correlation with the crime rate, suggesting
that a larger percentage of poor families implies less crime. However,
even in this case, this causal interpretation is counterintuitive and is
likely the result of confounding bias.
In absolute terms, Po2 is more highly correlated with
the response than Ineq. However, we must interpret these
pairwise associations with caution, as marginal correlations ignore the
presence of all other variables in the Data-Generating Process.
Po2 and Ineq. How do you explain this
phenomenon statistically? Why do the standard errors of the coefficients
behave this way in this specific dataset? Why does this render our
hypothesis tests (t-tests) unreliable, even if the model’s overall \(R^2\) remains very high?full.mod <- lm(Crime ~ ., data = UScrime)
summary(full.mod)
##
## Call:
## lm(formula = Crime ~ ., data = UScrime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.74 -98.09 -6.69 112.99 512.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.984e+03 1.628e+03 -3.675 0.000893 ***
## M 8.783e+01 4.171e+01 2.106 0.043443 *
## So -3.803e+00 1.488e+02 -0.026 0.979765
## Ed 1.883e+02 6.209e+01 3.033 0.004861 **
## Po1 1.928e+02 1.061e+02 1.817 0.078892 .
## Po2 -1.094e+02 1.175e+02 -0.931 0.358830
## LF -6.638e+02 1.470e+03 -0.452 0.654654
## M.F 1.741e+01 2.035e+01 0.855 0.398995
## Pop -7.330e-01 1.290e+00 -0.568 0.573845
## NW 4.204e+00 6.481e+00 0.649 0.521279
## U1 -5.827e+03 4.210e+03 -1.384 0.176238
## U2 1.678e+02 8.234e+01 2.038 0.050161 .
## Wealth 9.617e-02 1.037e-01 0.928 0.360754
## Ineq 7.067e+01 2.272e+01 3.111 0.003983 **
## Prob -4.855e+03 2.272e+03 -2.137 0.040627 *
## Time -3.479e+00 7.165e+00 -0.486 0.630708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
## F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07
Despite Po2 having a very high marginal correlation with
Crime in Question 1, its coefficient is entirely
insignificant in the multiple regression model (p-value > 0.05).
Similarly, the structural significance of Ineq changes when
evaluated alongside all other features. This occurs because marginal
correlation measures simple, isolated association, whereas a multiple
regression coefficient measures the partial effect (the effect
of a variable holding all other covariates constant). Because
Po1 and Po2 contain nearly identical
information, Po2 adds no unique explanatory power to the
DGP once Po1 is already accounted for in the model.
The standard errors for these coefficients become large due to severe multicollinearity (variance inflation). When predictors are highly correlated, the design matrix \(X^T X\) becomes nearly singular (non-invertible). This mathematically destroys the structural identifiability of the individual parameters, massively inflating the variance of the OLS estimator. The model’s overall predictive power (\(R^2\)) remains high because the overall regression hyperplane is well-estimated, but the specific slopes along those collinear axes cannot be precisely isolated, rendering their individual t-tests meaningless.
vif() function from the
car package to calculate the variance inflation factors.
You can initialize an R list, such as
models_list <- list(), to store the sequence of fitted
models (e.g.,
models_list[[15]] <- lm(Crime ~ ., data = UScrime)). If
you are running short on time, you may stop the backward selection
process early as soon as your chosen goodness-of-fit criterion starts to
worsen.library(car)
## Warning: package 'car' was built under R version 4.6.1
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.6.1
## Registered S3 method overwritten by 'car':
## method from
## na.action.merMod lme4
##
## Attaching package: 'car'
## The following object is masked from 'package:boot':
##
## logit
## The following objects are masked from 'package:faraway':
##
## logit, vif
# Ensure the full model is fit first
full.mod <- lm(Crime ~ ., data = UScrime)
# Initialize a list to store the models
# We will use the index to represent the number of predictors in the model
models_list <- list()
models_list[[15]] <- full.mod
aic = rep(NA, 15)
aic[15] = AIC(full.mod)
current_mod <- full.mod
# Loop backwards starting from 15 predictors down to 3.
# In each iteration, we drop one variable, leaving a model with i-1 predictors.
# The final iteration (i=3) will drop a variable and leave exactly 2 predictors.
for (i in 15:3) {
# Calculate VIFs for the current model
vifs <- vif(current_mod)
# Identify the variable with the maximum VIF
var_to_remove <- names(which.max(vifs))
# Update the model by removing that specific variable
current_mod <- update(current_mod, paste(". ~ . -", var_to_remove))
# Store the new model at the index corresponding to its number of predictors
models_list[[i - 1]] <- current_mod
# Calculate AIC for the current model
aic[i-1] = AIC(current_mod)
}
# Extract the p-1 models (models with 2 to 15 predictors) to compare
valid_models <- models_list[2:15]
# Find the index of the model with the lowest AIC
best_num_predictors <- which.min(aic)
best_lm_model = models_list[[best_num_predictors]]
cat("Best model has", best_num_predictors, "predictors with an AIC of", min(aic, na.rm=T), "\n")
## Best model has 13 predictors with an AIC of 648.4461
Comparing the models using the training MSE or standard \(R^2\) is structurally flawed, as these metrics will always favor the more complex full model even if it is overfitting the 47 observations.
Instead, we must use a penalized likelihood criterion like AIC, Adjusted \(R^2\), Mallows’ \(C_p\), or BIC, which explicitly accounts for the bias-variance trade-off. By adding a penalty for the number of parameters, these criteria estimate the model’s out-of-sample generalization.
Looking at the results, the model with 13 predictors has the lowest AIC, indicating that selectively dropping the two highly collinear variables removed more noise and variance than it did actual structural signal, ultimately yielding a model that better represents the underlying Data-Generating Process.
summary(best_lm_model)
##
## Call:
## lm(formula = Crime ~ M + So + Ed + Po1 + LF + M.F + Pop + NW +
## U1 + U2 + Ineq + Prob + Time, data = UScrime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -469.4 -93.1 12.6 117.3 506.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6041.0176 1515.7345 -3.986 0.000351 ***
## M 84.0350 40.8957 2.055 0.047879 *
## So 35.2894 143.7092 0.246 0.807543
## Ed 185.9198 59.8202 3.108 0.003861 **
## Po1 105.0940 21.7659 4.828 3.06e-05 ***
## LF -127.9865 1392.3561 -0.092 0.927317
## M.F 20.1254 20.1066 1.001 0.324141
## Pop -0.6822 1.2761 -0.535 0.596494
## NW 1.3912 6.0482 0.230 0.819502
## U1 -5748.4126 4146.8729 -1.386 0.174980
## U2 180.7362 80.8400 2.236 0.032251 *
## Ineq 60.7323 17.9172 3.390 0.001829 **
## Prob -4517.0792 2160.3360 -2.091 0.044315 *
## Time -0.5337 6.6346 -0.080 0.936366
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 207.9 on 33 degrees of freedom
## Multiple R-squared: 0.7927, Adjusted R-squared: 0.711
## F-statistic: 9.707 on 13 and 33 DF, p-value: 7.32e-08
Looking at the summary of the reduced model, the p-values for the
remaining predictors (like Po1 and U2) appear
now significant.
However, from a statistical learning perspective, these hypothesis tests are overconfident. This occurs due to the “double-dipping” (or selective inference) trap. We used the data to manually search for the best model structure by repeatedly checking VIFs and evaluating the remaining variables. We are now using that exact same data to test the significance of those variables, pretending the model structure was fixed before we ever saw the data. Because the search strategy already filtered out the “weak” variables, classical p-values no longer follow the assumed theoretical distributions.
par(mfrow=c(2,2))
plot(best_lm_model)
par(mfrow=c(1,1))
Looking at the four standard diagnostic plots, we can evaluate the structural assumptions of our OLS model:
Because the Gaussian error assumptions (normality and homoscedasticity) seem satisfied, standard OLS inference remains mathematically valid.
library(glmnet)
# Define matrices for glmnet
X <- model.matrix(Crime ~ ., data = UScrime)[,-1]
Y <- UScrime$Crime
# Ridge regression (alpha = 0) tuned via LOOCV (nfolds = n)
cv.ridge <- cv.glmnet(X, Y, alpha = 0, nfolds = nrow(X))
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
plot(cv.ridge)
best_lambda <- cv.ridge$lambda.min
# Fit final Ridge on all data
ridge.final <- glmnet(X, Y, alpha = 0)
plot(ridge.final)
abline(v = -log(best_lambda))
Rather than manually dropping collinear variables (which discards
potentially useful information), Ridge Regression is the ideal
alternative estimator here. Ridge mathematically stabilizes the
inversion of the highly correlated \(X^T
X\) matrix by adding an \(L2\)
penalty (\(\lambda I\)) to the
diagonal. This shrinks the coefficients of highly correlated variables
(like
Po1 and Po2) proportionally toward zero
without forcing them out of the model, allowing us to borrow strength
across all predictors safely.
We used Leave-One-Out Cross-Validation (setting the number of folds equal to the number of observations, \(n=47\)) to find the optimal penalty term \(\lambda\) that minimizes the out-of-sample prediction error.
cv.glm from the package
boot to compute the leave-one-out cv error for the linear
modellibrary(boot)
# 1. Training MSE of Reduced Model
train_mse_reduced <- mean(best_lm_model$residuals^2)
# 2. LOOCV MSE of Reduced Model using cv.glm()
glm.reduced <- glm(formula(best_lm_model), data = UScrime)
loocv_reduced <- cv.glm(UScrime, glm.reduced)$delta[1]
# 3. Training MSE of Ridge model
# (Using the X matrix and Y vector defined in Question 6)
ridge_train_preds <- predict(ridge.final, s = best_lambda, newx = X)
train_mse_ridge <- mean((Y - ridge_train_preds)^2)
# 4. LOOCV MSE of Ridge (from cv.glmnet output)
loocv_ridge <- cv.ridge$cvm[cv.ridge$lambda == best_lambda]
cat("Linear Model Train MSE:", train_mse_reduced, "\n")
## Linear Model Train MSE: 30349.95
cat("Linear Model LOOCV MSE:", loocv_reduced, "\n")
## Linear Model LOOCV MSE: 70638.74
cat("Ridge Train MSE:", train_mse_ridge, "\n")
## Ridge Train MSE: 35570.69
cat("Ridge LOOCV MSE:", loocv_ridge, "\n")
## Ridge LOOCV MSE: 67888.93
The training MSE of the reduced linear model is artificially low because it is evaluated on the exact same 47 states used to estimate the coefficients. Furthermore, it is lower than the training MSE of the Ridge estimator because standard OLS explicitly minimizes the sum of squared residuals without constraints, making it highly flexible and prone to memorizing the training data. By contrast, Ridge regression penalizes the coefficients, intentionally sacrificing some training fit (resulting in a higher training MSE) to restrict the model’s flexibility.
However, when we evaluate both models via LOOCV, the linear model’s MSE increases significantly, revealing its true out-of-sample generalization error. The LOOCV MSE of the Ridge model is ultimately lower than the LOOCV MSE of the reduced linear model.
This perfectly demonstrates the bias-variance trade-off in action. The classical subset selection approach remains too unstable for a dataset with only 47 observations. By deliberately introducing a small amount of bias via shrinkage, Ridge regression achieves a massive reduction in the model’s variance, ultimately generalizing to unseen data much more effectively.