This is my first document

library(datasets)
?datasets

## starting httpd help server ... done

help(datasets)
#library(help = "datasets")
data(iris)
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

round(var(iris[,-5]))

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length            1           0            1           1
## Sepal.Width             0           0            0           0
## Petal.Length            1           0            3           1
## Petal.Width             1           0            1           1

round(cor(iris[,-5]))

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length            1           0            1           1
## Sepal.Width             0           1            0           0
## Petal.Length            1           0            1           1
## Petal.Width             1           0            1           1

The Covariance Matrix

The Code: round(var(Advertising[,-1]), 2) computes the covariance matrix for the remaining variables and rounds the output to 2 decimal places. In R, using the var() function on a data frame calculates the covariance between all columns.

The Concept: Covariance measures the direction of the linear relationship between two variables. A positive number means they tend to increase together; a negative number means one increases while the other decreases.

The Correlation Matrix

The Code: round(cor(Advertising[,-1]), 3) computes the correlation matrix and rounds it to 3 decimal places. The Concept: Correlation is a standardized version of covariance. It tells you both the direction and the strength of the linear relationship

italics bold

load("arabba_rains.RDATA")

str(arabba)

## List of 6
##  $ name  : chr "VE_0080"
##  $ maxima: num [1:36(1d)] 12.4 18 15.6 12.6 12.6 14.4 16.8 12.4 13.2 12.6 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ : chr [1:36] "1985" "1986" "1987" "1988" ...
##  $ u     : num 8.2
##  $ pot   :'data.frame':  347 obs. of  2 variables:
##   ..$ value: num [1:347] 9.4 10.4 12.4 9.4 8.4 11.4 13.8 15.6 10.4 18 ...
##   ..$ date : POSIXct[1:347], format: "1985-04-03 15:00:00" "1985-08-02 00:00:00" ...
##  $ coord : int [1:2] 716615 5151711
##  $ elev  : int 2154

rain_data = arabba$pot

rain_data

##     value                date
## 1     9.4 1985-04-03 15:00:00
## 2    10.4 1985-08-02 00:00:00
## 3    12.4 1985-08-06 12:00:00
## 4     9.4 1985-08-22 19:00:00
## 5     8.4 1986-05-20 03:00:00
## 6    11.4 1986-05-21 19:00:00
## 7    13.8 1986-05-29 14:00:00
## 8    15.6 1986-07-31 12:00:00
## 9    10.4 1986-07-31 13:00:00
## 10   18.0 1986-08-07 20:00:00
## 11   15.4 1986-08-14 01:00:00
## 12    8.6 1986-08-18 22:00:00
## 13    9.8 1986-08-23 14:00:00
## 14   10.2 1987-07-08 17:00:00
## 15   10.6 1987-07-15 19:00:00
## 16    8.8 1987-08-24 21:00:00
## 17   15.6 1987-08-24 22:00:00
## 18   11.4 1987-08-24 23:00:00
## 19   10.6 1987-08-25 02:00:00
## 20   11.4 1987-08-25 03:00:00
## 21   10.2 1987-08-25 04:00:00
## 22   11.8 1987-08-25 08:00:00
## 23   12.2 1987-11-25 18:00:00
## 24    8.8 1988-06-14 15:00:00
## 25    8.6 1988-06-24 18:00:00
## 26    9.2 1988-07-11 20:00:00
## 27   11.6 1988-07-11 21:00:00
## 28    9.4 1988-07-12 16:00:00
## 29    9.2 1988-07-15 03:00:00
## 30   12.6 1988-07-15 04:00:00
## 31   10.2 1988-07-15 05:00:00
## 32    8.4 1988-07-22 13:00:00
## 33   10.6 1988-07-29 20:00:00
## 34    8.2 1988-08-16 13:00:00
## 35   10.0 1988-08-29 22:00:00
## 36    9.4 1988-10-12 14:00:00
## 37    9.6 1988-10-12 21:00:00
## 38   10.2 1989-06-26 13:00:00
## 39    8.8 1989-06-28 12:00:00
## 40    9.8 1989-07-03 14:00:00
## 41   10.4 1989-07-22 19:00:00
## 42   11.8 1989-08-08 18:00:00
## 43    8.6 1989-08-08 19:00:00
## 44   12.6 1989-08-24 21:00:00
## 45    9.2 1989-08-26 22:00:00
## 46    8.6 1989-09-11 14:00:00
## 47   11.6 1990-06-08 14:00:00
## 48   14.4 1990-06-28 17:00:00
## 49    8.4 1990-07-01 08:00:00
## 50    8.8 1990-07-02 10:00:00
## 51   11.0 1990-07-10 14:00:00
## 52    9.2 1990-07-24 18:00:00
## 53   12.8 1990-08-15 00:00:00
## 54   10.4 1990-09-24 09:00:00
## 55    9.8 1991-06-17 16:00:00
## 56    8.8 1991-06-24 01:00:00
## 57   10.6 1991-07-04 16:00:00
## 58    9.2 1991-07-14 11:00:00
## 59   10.6 1991-07-14 12:00:00
## 60   11.4 1991-07-17 13:00:00
## 61    8.4 1991-07-21 20:00:00
## 62   16.8 1991-08-01 04:00:00
## 63    8.2 1991-08-24 20:00:00
## 64   12.4 1992-08-01 18:00:00
## 65    9.0 1992-08-04 15:00:00
## 66    8.4 1992-08-30 00:00:00
## 67   10.8 1993-06-20 16:00:00
## 68   13.2 1993-07-11 05:00:00
## 69   12.2 1993-07-16 20:00:00
## 70   12.0 1993-07-19 12:00:00
## 71    8.4 1993-07-19 20:00:00
## 72   12.0 1993-07-20 20:00:00
## 73   12.4 1993-08-04 12:00:00
## 74   10.4 1993-08-07 23:00:00
## 75   12.6 1994-06-24 16:00:00
## 76   11.2 1994-07-04 17:00:00
## 77   10.8 1994-08-08 13:00:00
## 78   10.0 1994-08-18 01:00:00
## 79   10.6 1994-08-18 02:00:00
## 80   11.8 1994-08-18 03:00:00
## 81    8.2 1994-08-31 19:00:00
## 82   12.4 1994-09-08 18:00:00
## 83    8.8 1994-09-13 19:00:00
## 84    9.2 1994-09-14 11:00:00
## 85    8.4 1994-09-14 13:00:00
## 86    9.0 1995-05-30 10:00:00
## 87    8.6 1995-06-01 17:00:00
## 88    8.8 1995-06-21 18:00:00
## 89   20.6 1995-06-21 20:00:00
## 90   11.8 1995-07-11 16:00:00
## 91   15.6 1995-07-14 16:00:00
## 92    8.8 1995-07-23 16:00:00
## 93   20.8 1995-07-23 17:00:00
## 94   10.6 1995-07-28 20:00:00
## 95   10.2 1996-05-19 17:00:00
## 96    9.2 1996-07-08 00:00:00
## 97    8.2 1996-07-08 06:00:00
## 98    8.2 1996-08-03 03:00:00
## 99    8.6 1996-08-04 20:00:00
## 100  10.0 1997-06-12 20:00:00
## 101  17.6 1997-07-14 16:00:00
## 102   8.4 1997-07-14 17:00:00
## 103  16.4 1997-07-24 13:00:00
## 104   8.6 1997-07-29 00:00:00
## 105   8.2 1997-09-05 20:00:00
## 106  12.6 1998-05-13 22:00:00
## 107   9.0 1998-07-07 16:00:00
## 108   9.2 1998-07-14 00:00:00
## 109  10.8 1998-07-14 01:00:00
## 110   8.6 1998-07-20 18:00:00
## 111   9.0 1998-07-25 18:00:00
## 112   9.0 1998-08-15 15:00:00
## 113   9.6 1998-09-05 13:00:00
## 114   8.2 1998-09-05 14:00:00
## 115  12.0 1999-05-05 00:00:00
## 116  10.4 1999-06-13 19:00:00
## 117  10.0 1999-06-30 00:00:00
## 118   9.0 1999-07-06 18:00:00
## 119  16.2 1999-08-05 18:00:00
## 120  10.0 1999-08-16 17:00:00
## 121  10.8 1999-08-20 04:00:00
## 122   9.2 1999-09-20 18:00:00
## 123  10.8 1999-09-20 19:00:00
## 124  11.4 1999-09-20 20:00:00
## 125  10.6 2000-06-14 15:00:00
## 126  11.0 2000-09-20 23:00:00
## 127   9.6 2000-11-14 16:00:00
## 128   9.0 2001-06-10 18:00:00
## 129   8.4 2001-06-16 21:00:00
## 130  14.0 2001-08-17 16:00:00
## 131   9.2 2001-08-31 19:00:00
## 132   8.8 2002-08-20 16:00:00
## 133  12.2 2002-08-23 19:00:00
## 134  58.0 2003-06-13 18:00:00
## 135  14.8 2003-06-27 23:00:00
## 136  10.0 2003-07-01 21:00:00
## 137  15.2 2003-07-17 12:00:00
## 138  13.4 2003-07-17 18:00:00
## 139  13.6 2003-07-23 02:00:00
## 140  10.0 2003-07-27 23:00:00
## 141   8.4 2003-07-28 17:00:00
## 142  18.4 2003-08-05 20:00:00
## 143  30.2 2003-08-14 15:00:00
## 144  10.2 2003-08-14 16:00:00
## 145   8.8 2003-08-20 23:00:00
## 146  10.2 2004-06-12 19:00:00
## 147  10.0 2004-07-01 17:00:00
## 148   9.0 2004-07-05 16:00:00
## 149  14.2 2004-07-06 19:00:00
## 150  16.0 2004-07-09 02:00:00
## 151  11.4 2004-07-11 08:00:00
## 152   9.2 2004-07-16 15:00:00
## 153   8.4 2004-08-07 16:00:00
## 154  11.2 2004-08-07 17:00:00
## 155  11.0 2004-08-13 22:00:00
## 156  13.6 2004-08-26 05:00:00
## 157  10.8 2004-09-14 20:00:00
## 158  15.2 2005-06-24 12:00:00
## 159  22.0 2005-06-24 13:00:00
## 160  10.6 2005-06-25 16:00:00
## 161  14.8 2005-06-29 14:00:00
## 162   8.4 2005-07-01 06:00:00
## 163  11.2 2005-07-05 02:00:00
## 164   9.4 2005-07-05 03:00:00
## 165  14.6 2005-07-18 15:00:00
## 166   9.4 2005-08-14 17:00:00
## 167  11.2 2005-08-20 12:00:00
## 168   8.6 2005-08-20 13:00:00
## 169  12.2 2005-09-03 17:00:00
## 170  19.6 2005-09-03 18:00:00
## 171   9.2 2005-09-04 17:00:00
## 172  13.0 2005-09-04 18:00:00
## 173   9.0 2006-07-27 22:00:00
## 174  10.8 2006-07-31 20:00:00
## 175  10.8 2006-08-01 09:00:00
## 176   8.2 2006-08-03 13:00:00
## 177  10.0 2006-08-26 23:00:00
## 178  10.4 2006-09-15 08:00:00
## 179  10.0 2006-09-15 09:00:00
## 180  10.2 2006-09-15 10:00:00
## 181   9.0 2007-05-22 16:00:00
## 182  12.6 2007-05-23 16:00:00
## 183   9.2 2007-06-26 18:00:00
## 184  11.8 2007-06-26 19:00:00
## 185  12.2 2007-08-07 15:00:00
## 186   9.2 2007-08-17 13:00:00
## 187   8.6 2007-08-20 07:00:00
## 188  12.2 2008-06-26 22:00:00
## 189   9.0 2008-06-27 17:00:00
## 190  10.8 2008-06-30 00:00:00
## 191   8.6 2008-07-17 19:00:00
## 192   8.4 2008-07-20 19:00:00
## 193  12.0 2008-07-26 13:00:00
## 194   8.2 2008-07-29 14:00:00
## 195  11.4 2008-08-02 01:00:00
## 196  18.4 2008-08-06 18:00:00
## 197  19.6 2008-08-06 19:00:00
## 198   8.6 2008-08-15 18:00:00
## 199  10.4 2008-08-15 19:00:00
## 200   8.8 2008-08-23 13:00:00
## 201  10.4 2009-05-22 15:00:00
## 202  16.8 2009-06-16 19:00:00
## 203   9.6 2009-06-19 22:00:00
## 204  11.6 2009-07-17 23:00:00
## 205   9.8 2009-07-18 00:00:00
## 206  13.6 2009-07-24 22:00:00
## 207   8.4 2009-08-03 18:00:00
## 208  13.8 2009-08-07 23:00:00
## 209  11.6 2009-08-08 13:00:00
## 210  16.6 2009-08-13 19:00:00
## 211   9.0 2010-06-17 13:00:00
## 212  12.0 2010-07-01 17:00:00
## 213   8.2 2010-07-02 19:00:00
## 214  11.8 2010-07-11 13:00:00
## 215  10.8 2010-07-11 14:00:00
## 216  10.4 2010-08-04 20:00:00
## 217  10.4 2010-08-08 19:00:00
## 218  20.0 2010-08-10 14:00:00
## 219   8.8 2010-08-13 02:00:00
## 220   8.2 2011-06-16 15:00:00
## 221   8.2 2011-06-18 18:00:00
## 222   8.6 2011-07-13 16:00:00
## 223  11.2 2011-07-13 21:00:00
## 224  11.0 2011-08-08 20:00:00
## 225   8.8 2011-08-15 11:00:00
## 226   9.4 2011-09-17 16:00:00
## 227  13.2 2012-06-04 10:00:00
## 228   9.0 2012-06-04 11:00:00
## 229   8.2 2012-06-19 14:00:00
## 230  14.4 2012-07-09 16:00:00
## 231   8.2 2012-07-09 20:00:00
## 232   9.8 2012-07-14 17:00:00
## 233   8.6 2012-07-21 00:00:00
## 234   9.4 2012-08-05 00:00:00
## 235   9.2 2012-08-06 19:00:00
## 236   9.2 2012-08-22 01:00:00
## 237  10.2 2012-08-30 14:00:00
## 238  10.8 2012-08-31 06:00:00
## 239   8.2 2012-08-31 07:00:00
## 240   8.4 2012-09-24 18:00:00
## 241   8.2 2013-05-17 11:00:00
## 242   8.4 2013-06-19 14:00:00
## 243   8.2 2013-07-13 01:00:00
## 244  19.0 2013-08-04 21:00:00
## 245  13.4 2013-08-19 17:00:00
## 246   8.2 2013-08-19 18:00:00
## 247  10.2 2013-08-24 23:00:00
## 248  14.8 2014-06-23 19:00:00
## 249   8.2 2014-07-27 14:00:00
## 250  18.2 2014-08-01 16:00:00
## 251  14.0 2014-08-04 19:00:00
## 252  12.8 2014-08-04 23:00:00
## 253   9.4 2014-08-05 00:00:00
## 254  11.0 2014-08-07 18:00:00
## 255  11.6 2014-08-13 12:00:00
## 256  14.8 2014-08-13 13:00:00
## 257  11.6 2014-08-31 18:00:00
## 258   8.4 2014-09-09 18:00:00
## 259   8.6 2014-09-19 17:00:00
## 260   9.0 2015-06-08 13:00:00
## 261  10.6 2015-06-08 14:00:00
## 262   8.4 2015-07-07 17:00:00
## 263   9.6 2015-07-21 17:00:00
## 264  13.8 2015-07-21 18:00:00
## 265  18.8 2015-07-24 13:00:00
## 266  21.2 2015-07-29 18:00:00
## 267  11.6 2015-08-04 18:00:00
## 268   9.4 2015-09-14 04:00:00
## 269   9.0 2015-09-14 05:00:00
## 270   8.4 2016-06-30 21:00:00
## 271   8.8 2016-07-12 16:00:00
## 272  10.0 2016-07-13 00:00:00
## 273   9.6 2016-07-21 18:00:00
## 274  11.8 2016-07-30 21:00:00
## 275  10.0 2016-07-31 12:00:00
## 276  11.6 2016-08-05 08:00:00
## 277  10.2 2016-08-05 09:00:00
## 278   9.4 2016-08-09 17:00:00
## 279  17.0 2016-08-18 21:00:00
## 280  12.8 2016-08-29 16:00:00
## 281  10.6 2016-09-17 02:00:00
## 282   9.2 2016-09-17 03:00:00
## 283   8.6 2016-09-17 04:00:00
## 284   9.6 2016-10-14 18:00:00
## 285   9.4 2016-10-14 19:00:00
## 286  13.8 2017-05-14 16:00:00
## 287   9.2 2017-05-19 17:00:00
## 288  11.0 2017-06-01 21:00:00
## 289  11.6 2017-06-02 13:00:00
## 290  10.6 2017-06-15 19:00:00
## 291   8.8 2017-06-15 20:00:00
## 292   9.2 2017-07-17 23:00:00
## 293  11.2 2017-07-24 13:00:00
## 294   9.2 2017-07-24 14:00:00
## 295  21.0 2017-08-06 12:00:00
## 296  10.8 2017-08-06 13:00:00
## 297  12.0 2017-08-09 20:00:00
## 298  11.4 2017-08-09 21:00:00
## 299  10.0 2018-05-06 15:00:00
## 300   9.2 2018-06-21 15:00:00
## 301  10.6 2018-07-10 17:00:00
## 302  10.8 2018-07-11 19:00:00
## 303   9.8 2018-07-14 20:00:00
## 304   8.4 2018-07-15 00:00:00
## 305  10.8 2018-08-01 19:00:00
## 306  12.4 2018-08-20 20:00:00
## 307   8.6 2018-08-20 21:00:00
## 308  10.2 2018-08-21 22:00:00
## 309   8.4 2018-08-23 15:00:00
## 310  11.6 2018-09-14 19:00:00
## 311  13.0 2018-10-29 15:00:00
## 312   8.6 2018-10-29 16:00:00
## 313  18.8 2018-10-29 18:00:00
## 314   8.8 2019-07-06 18:00:00
## 315   9.2 2019-07-17 18:00:00
## 316  16.4 2019-07-17 19:00:00
## 317  17.4 2019-07-26 14:00:00
## 318   8.8 2019-07-26 15:00:00
## 319  11.4 2019-07-30 19:00:00
## 320   8.4 2019-08-06 16:00:00
## 321   9.0 2019-08-15 20:00:00
## 322   9.0 2019-08-24 15:00:00
## 323  11.0 2019-10-15 20:00:00
## 324   9.0 2019-10-15 21:00:00
## 325  10.6 2020-05-23 19:00:00
## 326  11.0 2020-06-29 11:00:00
## 327  26.0 2020-07-01 12:00:00
## 328  11.4 2020-07-01 13:00:00
## 329  14.4 2020-07-11 11:00:00
## 330   9.2 2020-07-23 21:00:00
## 331   9.6 2020-07-23 22:00:00
## 332  10.6 2020-07-28 19:00:00
## 333  20.6 2020-07-29 20:00:00
## 334  11.2 2020-07-29 22:00:00
## 335  20.6 2020-07-30 21:00:00
## 336  19.6 2020-07-30 22:00:00
## 337  17.4 2020-08-01 19:00:00
## 338  15.2 2020-08-01 20:00:00
## 339   9.2 2020-08-09 23:00:00
## 340   9.0 2020-08-10 17:00:00
## 341  10.6 2020-08-13 20:00:00
## 342  16.2 2020-08-22 16:00:00
## 343  11.2 2020-08-28 22:00:00
## 344   8.2 2020-08-29 15:00:00
## 345   8.6 2020-08-29 18:00:00
## 346  11.6 2020-08-30 06:00:00
## 347   8.8 2020-08-30 07:00:00

it helps researchers to analyze how big the events occurs, how frequent the events are and did they cluster in certain time of the year.

hist(rain_data$value,freq = FALSE, main = "Histogram of arabba rain",xlab = "value" )
lines(density(rain_data$value), lwd = 2)

hist(rain_data$value,  freq = FALSE, main = "histogram" , xlab = "value")
lines(density(rain_data$value), lwd = 2)

rain_data[1:5, ]

##   value                date
## 1   9.4 1985-04-03 15:00:00
## 2  10.4 1985-08-02 00:00:00
## 3  12.4 1985-08-06 12:00:00
## 4   9.4 1985-08-22 19:00:00
## 5   8.4 1986-05-20 03:00:00

##Simple Linear Regression model

advertising <- read.csv("advertising.csv")
str(advertising)

## 'data.frame':    200 obs. of  5 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ TV       : num  230.1 44.5 17.2 151.5 180.8 ...
##  $ radio    : num  37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
##  $ newspaper: num  69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
##  $ sales    : num  22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...

advertising_df <- advertising[,-1]

plot(advertising_df$newspaper,advertising_df$sales,pch = 20,xlab = "NEWPAPER BUDGET", ylab = "sales", col = "green", )

plot(advertising_df$radio,advertising_df$sales,pch = 20,xlab = "radio BUDGET", ylab = "sales", col = "red", )

plot(advertising_df$TV,advertising_df$sales, pch = 20, xlab = "TV budget" , ylab = "sales" , col= "blue"  )

n <- length(advertising_df$sales)
print(paste("Sample size:", n))

## [1] "Sample size: 200"

cov(advertising_df$TV,advertising_df$sales)

## [1] 350.3902

cov.mat <- cov(advertising_df)
cov.mat

##                   TV     radio newspaper     sales
## TV        7370.94989  69.86249 105.91945 350.39019
## radio       69.86249 220.42774 114.49698  44.63569
## newspaper  105.91945 114.49698 474.30833  25.94139
## sales      350.39019  44.63569  25.94139  27.22185

cov2cor(cov.mat)

##                   TV      radio  newspaper     sales
## TV        1.00000000 0.05480866 0.05664787 0.7822244
## radio     0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales     0.78222442 0.57622257 0.22829903 1.0000000

#par(mfrow(c(1,2))
library("corrplot")

## corrplot 0.95 loaded

?corrplot

corrplot(cor(advertising_df), method = 'ellipse', main = "\nCorrelation Ellipses")

corrplot(cor(advertising_df), method = 'number', main = "\nCorrelation Cofficients")

pairs(advertising_df, col = "green", pch = 20)

x <- advertising_df$TV
y <- advertising_df$sales
simple_rg <- lm(y~x) 
summary(simple_rg)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## x           0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

# Quick plotting of the model
plot(x, y, pch = 16, col = "blue", main = "Regression via lm()")
abline(simple_rg, col = "red", lwd = 2)

bete_hat <- coefficients(simple_rg)
y_fitted <- fitted.values(simple_rg)
e_simpleReg <- residuals(simple_rg)

sum(e_simpleReg)

## [1] -1.301736e-14

# Checking OLS properties
cat("Sum of residuals:", sum(e_simpleReg), 
    "\nSum of residuals * x:", sum(e_simpleReg * x), 
    "\nSum of residuals * fitted values:", sum(e_simpleReg * y_fitted))

## Sum of residuals: -1.301736e-14 
## Sum of residuals * x: -5.354828e-12 
## Sum of residuals * fitted values: -5.531131e-13

##Assignment 1 Simple Linear Regression

The Boston dataset The ISLR2 library contains the Boston data set, which records medv (median house value) for 506 census tracts in Boston. In this assignment we analyse the relationship between medv and lstat (percent of households with low socioeconomic status). To find out more about the data set, we can type ?Boston.

Load the Boston data.
Plot the variable medv (x-axis) against lstat (y-axis)
Compute the five-numbers-summary for the variables medv and lstat
Compute the correlation between medv and lstat and provide an interpretation of its value
Fit a simple linear regression model, with medv as the response and lstat as the predictor and use the function summary() to visualize the result.
Extract the values of the regression coefficients and provide an interpretation of the slope, that is the coefficient of lstat
Plot medv and lstat along with the least squares regression line
Compute the variable loglstat as the logarithm of lstat, plot loglstat against medv, fit the regression model with the transformed variable, output the result and add the least square line to the plot

library("ISLR2")
data(Boston)

plot(Boston$lstat,Boston$medv, pch = 20, xlab = "lower status population", ylab = "house", col = "steelblue")

summary(Boston[,c("medv", "lstat")])

##       medv           lstat      
##  Min.   : 5.00   Min.   : 1.73  
##  1st Qu.:17.02   1st Qu.: 6.95  
##  Median :21.20   Median :11.36  
##  Mean   :22.53   Mean   :12.65  
##  3rd Qu.:25.00   3rd Qu.:16.95  
##  Max.   :50.00   Max.   :37.97

cov_mat <-  cov(Boston)
cov_mat

##                crim            zn       indus         chas          nox
## crim     73.9865782   -40.2159560  23.9923388 -0.122108643  0.419593894
## zn      -40.2159560   543.9368137 -85.4126481 -0.252925293 -1.396148200
## indus    23.9923388   -85.4126481  47.0644425  0.109668806  0.607073693
## chas     -0.1221086    -0.2529253   0.1096688  0.064512973  0.002684303
## nox       0.4195939    -1.3961482   0.6070737  0.002684303  0.013427636
## rm       -1.3250378     5.1125134  -1.8879566  0.016284745 -0.024603450
## age      85.4053223  -373.9015482 124.5139031  0.618571205  2.385927202
## dis      -6.8767215    32.6293041 -10.2280975 -0.053042959 -0.187695836
## rad      46.8477610   -63.3486949  35.5499714 -0.016295543  0.616929453
## tax     844.8215381 -1236.4537354 833.3602902 -1.523367119 13.046285530
## ptratio   5.3993308   -19.7765707   5.6921040 -0.066819160  0.047397324
## lstat    27.9861679   -68.7830369  29.5802703 -0.097816264  0.488946168
## medv    -30.7185080    77.3151755 -30.5208228  0.409409463 -0.455412432
##                   rm          age           dis           rad          tax
## crim     -1.32503785   85.4053223   -6.87672154   46.84776101   844.821538
## zn        5.11251341 -373.9015482   32.62930405  -63.34869487 -1236.453735
## indus    -1.88795657  124.5139031  -10.22809746   35.54997135   833.360290
## chas      0.01628475    0.6185712   -0.05304296   -0.01629554    -1.523367
## nox      -0.02460345    2.3859272   -0.18769584    0.61692945    13.046286
## rm        0.49367085   -4.7519292    0.30366342   -1.28381457   -34.583448
## age      -4.75192919  792.3583985  -44.32937946  111.77084648  2402.690122
## dis       0.30366342  -44.3293795    4.43401514   -9.06825201  -189.664592
## rad      -1.28381457  111.7708465   -9.06825201   75.81636598  1335.756577
## tax     -34.58344778 2402.6901225 -189.66459173 1335.75657653 28404.759488
## ptratio  -0.54076322   15.9369213   -1.05977455    8.76071616   168.153141
## lstat    -3.07974141  121.0777246   -7.47332906   30.38544241   654.714520
## medv      4.49344588  -97.5890166    4.84022864  -30.56122804  -726.255716
##              ptratio        lstat         medv
## crim      5.39933079  27.98616788  -30.7185080
## zn      -19.77657066 -68.78303690   77.3151755
## indus     5.69210400  29.58027028  -30.5208228
## chas     -0.06681916  -0.09781626    0.4094095
## nox       0.04739732   0.48894617   -0.4554124
## rm       -0.54076322  -3.07974141    4.4934459
## age      15.93692134 121.07772456  -97.5890166
## dis      -1.05977455  -7.47332906    4.8402286
## rad       8.76071616  30.38544241  -30.5612280
## tax     168.15314053 654.71451963 -726.2557164
## ptratio   4.68698912   5.78272856  -10.1106571
## lstat     5.78272856  50.99475951  -48.4475383
## medv    -10.11065714 -48.44753832   84.5867236

cov2cor(cov_mat)

##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##              lstat       medv
## crim     0.4556215 -0.3883046
## zn      -0.4129946  0.3604453
## indus    0.6037997 -0.4837252
## chas    -0.0539293  0.1752602
## nox      0.5908789 -0.4273208
## rm      -0.6138083  0.6953599
## age      0.6023385 -0.3769546
## dis     -0.4969958  0.2499287
## rad      0.4886763 -0.3816262
## tax      0.5439934 -0.4685359
## ptratio  0.3740443 -0.5077867
## lstat    1.0000000 -0.7376627
## medv    -0.7376627  1.0000000

cor(Boston[, c("medv", "lstat")])

##             medv      lstat
## medv   1.0000000 -0.7376627
## lstat -0.7376627  1.0000000

Boston_regression <- lm(Boston$medv~Boston$lstat)
Boston_regression

## 
## Call:
## lm(formula = Boston$medv ~ Boston$lstat)
## 
## Coefficients:
##  (Intercept)  Boston$lstat  
##        34.55         -0.95

summary(Boston_regression)

## 
## Call:
## lm(formula = Boston$medv ~ Boston$lstat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.55384    0.56263   61.41   <2e-16 ***
## Boston$lstat -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

?Boston

beta_bo <- coefficients(Boston_regression)
y_fit <- fitted.values(Boston_regression)
e_boston <- residuals(Boston_regression)

plot(Boston$lstat, Boston$medv, 
     pch = 20, 
     xlab = "Lower Status Population (%)", 
     ylab = "Median Home Value ($1000s)", 
     col = "steelblue", 
     main = "Regression of Boston Dataset")

abline(Boston_regression, col = "red", lwd = 2)

loglstat <- log(Boston$lstat)
loglstat

##   [1] 1.6054299 2.2126604 1.3937664 1.0784096 1.6733512 1.6505799 2.5201129
##   [8] 2.9523027 3.3988613 2.8390785 3.0179829 2.5855058 2.7542975 2.1114246
##  [15] 2.3282528 2.1365305 1.8840347 2.6858046 2.4587338 2.4230312 3.0454744
##  [22] 2.6268401 2.9295925 2.9897142 2.7911651 2.8039663 2.6953026 2.8495498
##  [29] 2.5494452 2.4832386 3.1179499 2.5680216 3.3217934 2.9096296 3.0125894
##  [36] 2.2700619 2.4344902 2.1713368 2.3155013 1.4632554 0.6830968 1.5769147
##  [43] 1.7595806 2.0068708 2.2565412 2.3233676 2.6497146 2.9338569 3.4278393
##  [50] 2.7850112 2.5989791 2.2438961 1.6639261 2.1317968 2.6946272 1.5706971
##  [57] 1.7526721 1.3737156 1.9257074 2.2213750 2.5764218 2.6700021 1.9065751
##  [64] 2.2512918 2.0856721 1.5411591 2.3263016 2.0918641 2.5718486 2.1736147
##  [71] 1.9050882 2.2905125 1.7083779 2.0202222 1.9139771 2.1905356 2.4824035
##  [78] 2.3292270 2.5128460 2.2082744 1.6658182 1.9768550 1.9050882 2.0162355
##  [85] 2.2638443 1.8764069 2.5541217 2.1329823 1.7047481 1.7404662 2.1758874
##  [92] 2.1041342 2.0992442 1.8261609 2.3599102 1.8946169 2.4283363 1.4374626
##  [99] 1.2725656 1.8229351 2.2428351 2.0373166 2.3636802 2.5982353 2.5120353
## [106] 2.8015405 2.9263822 2.6454653 2.5071573 2.7440606 2.5649494 2.3184584
## [113] 2.7856283 2.8384935 2.3466020 2.7574751 2.4882344 2.3321439 2.7324176
## [120] 2.6108048 2.6651427 2.6581594 2.8864753 3.2351428 2.8667619 2.6953026
## [127] 3.3054204 2.8443278 2.7337179 2.9090845 2.5336968 2.5063419 2.4087453
## [134] 2.7100482 2.8512844 2.8308576 2.8273136 2.6803364 3.0596456 2.9156062
## [141] 3.1846984 3.5383472 3.2891479 3.2741213 3.3772462 3.3250360 2.8124102
## [148] 3.3854067 3.3435683 3.0657246 2.6461748 2.5862591 2.4948570 2.7593768
## [155] 2.7160184 2.7093826 2.7813007 1.5238800 1.8609745 2.0001277 1.7047481
## [162] 0.5481214 0.6523252 1.1999648 2.4544474 2.2834023 1.3083328 2.4965058
## [169] 2.4069451 2.4265711 2.6693094 2.4874035 2.6871670 2.2016592 2.2659211
## [176] 1.6733512 2.3135250 1.8389611 1.9344158 1.6174061 2.0228712 2.2460147
## [183] 1.5727739 1.7369512 2.6376277 2.5764218 1.4929041 1.8991180 1.5173226
## [190] 1.6845454 1.6292405 1.5454326 1.0543120 1.6154200 1.4770487 1.0885620
## [197] 1.4060970 2.1529243 1.8900954 1.5173226 1.4929041 2.0055259 1.1346227
## [204] 1.3376292 1.0577903 2.3860067 2.3951643 2.8936995 2.6851227 3.1393996
## [211] 2.8489709 3.1772201 2.7744620 2.2385798 3.3860837 2.2481289 2.6034302
## [218] 2.2710944 2.8859174 2.3513753 2.2731563 3.0661907 2.2955605 2.0281482
## [225] 1.4206958 1.5325569 1.1410330 1.8500284 1.3660917 1.3244190 2.4553062
## [232] 1.6582281 0.9042182 1.3737156 2.0856721 2.3869262 2.2554935 1.5539252
## [239] 1.8500284 1.9974177 2.4318574 2.5176965 2.4176979 1.6467337 2.5257286
## [246] 2.9156062 2.2148462 2.3174737 2.2533948 1.8809906 1.7749524 1.2781522
## [253] 1.2612979 1.2641267 1.8825138 2.2246236 1.1346227 1.6331544 2.0528409
## [260] 1.9315214 2.2607209 1.9823798 1.7766458 2.4203681 2.0918641 2.3466020
## [267] 2.6939513 2.0068708 1.1505720 2.6137395 2.5649494 1.8855533 2.0451089
## [274] 1.8840347 1.2612979 1.0919233 1.8000583 1.4255151 1.9726912 1.5789787
## [281] 1.3244190 1.5238800 1.1019401 1.1505720 2.0605135 2.1077860 2.5595502
## [288] 1.9657128 2.0281482 2.2523439 1.2029723 1.2697605 1.5475625 2.1494339
## [295] 2.3418058 1.8357764 2.0001277 2.7625384 1.6034198 1.5560371 1.8033586
## [302] 2.2512918 2.1598688 1.5810384 1.9358598 2.1894164 1.8671761 2.0188950
## [309] 1.5129270 2.2995806 2.5368664 1.7884206 2.4612968 2.0668628 2.2278615
## [316] 2.4423470 2.9085391 2.7688317 2.3379522 2.5439614 1.9740810 1.9271641
## [323] 2.0412203 2.4630018 1.8115621 1.6253113 1.8164521 2.5486636 2.2995806
## [330] 1.9933388 2.2071749 2.5201129 2.0579625 1.7369512 1.9095425 2.0806908
## [337] 2.2823824 2.3570733 2.1412419 2.2762411 2.2289386 1.7029283 2.1575593
## [344] 1.9712994 1.5282279 2.3542283 2.5392370 1.8500284 1.7900914 1.7732560
## [351] 1.7884206 1.7029283 2.0528409 1.5040774 2.0856721 1.7173951 2.8678989
## [358] 2.5855058 2.4406064 2.5392370 2.0528409 2.6525375 2.3214068 2.6837575
## [365] 1.6658182 1.9629077 2.6390573 2.5900171 1.1817272 1.3164082 1.0851893
## [372] 2.2544447 2.1838016 3.5487549 3.6367964 2.5982353 3.1458749 3.0558862
## [379] 3.1650530 3.0809921 2.8454906 3.0483247 3.1612467 3.2011191 3.4219799
## [386] 3.4278393 3.3421548 3.4654234 3.4216534 3.0373539 2.8396631 2.9317269
## [393] 3.2457125 2.7193198 2.7942279 2.8402474 2.9637255 2.9917243 3.4206732
## [400] 3.4001969 3.2872819 3.0116056 3.0111134 2.9841656 3.3098128 3.1346243
## [407] 3.1501686 2.4956817 3.2733640 2.9846713 2.3135250 3.0549441 3.5371841
## [414] 2.9997243 3.6103772 3.3690185 3.2499868 3.2824138 3.0262615 3.1241255
## [421] 2.7093826 2.7536607 2.6461748 3.1480241 2.8425811 3.1941732 2.7530236
## [428] 2.6755270 3.0689827 3.1813816 2.8701691 2.9801109 2.4874035 2.7862450
## [435] 2.7193198 3.1471650 2.8931457 3.2752562 3.5269486 3.1302632 3.0960300
## [442] 2.9714396 2.8088001 2.9365129 3.1692653 3.1772201 2.8786365 2.7997174
## [449] 2.8975680 2.9606231 2.8587664 2.8752581 2.8489709 2.8178011 2.9290581
## [456] 2.8975680 2.9449652 2.8296777 2.7868614 2.6878475 2.7985001 2.6844403
## [463] 2.6383428 2.3311725 2.5817308 2.6483002 2.8419982 3.0596456 2.8975680
## [470] 2.6919208 2.7905514 2.5548990 2.6644466 2.4561642 2.8981194 3.1822118
## [477] 2.9274534 3.2152693 2.8920370 2.5733753 2.3739751 2.0464017 1.9473377
## [484] 2.3437270 2.5907670 2.3589654 2.7067160 2.4379897 2.8936995 3.1768030
## [491] 3.3904734 2.8942531 2.5915164 2.4857396 2.6093342 2.8678989 3.0511670
## [498] 2.6461748 2.5587765 2.7146947 2.6623552 2.2690283 2.2060742 1.7298841
## [505] 1.8687205 2.0643279

plot(loglstat, Boston$medv, pch = 20 , col = "blue" , xlab = "Lower Status Population (%) in log", ylab = "Median Home Value ($1000s)")

logboston <- lm(Boston$medv~loglstat)
logboston

## 
## Call:
## lm(formula = Boston$medv ~ loglstat)
## 
## Coefficients:
## (Intercept)     loglstat  
##       52.12       -12.48

summary(logboston)

## 
## Call:
## lm(formula = Boston$medv ~ loglstat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4599  -3.5006  -0.6686   2.1688  26.0129 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  52.1248     0.9652   54.00   <2e-16 ***
## loglstat    -12.4810     0.3946  -31.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.329 on 504 degrees of freedom
## Multiple R-squared:  0.6649, Adjusted R-squared:  0.6643 
## F-statistic:  1000 on 1 and 504 DF,  p-value: < 2.2e-16

plot(loglstat, Boston$medv, pch = 20 , col = "blue" , xlab = "Lower Status Population (%) in log", ylab = "Median Home Value ($1000s)")
abline(logboston,col = "green", lwd = 2)

advertising_df

##        TV radio newspaper sales
## 1   230.1  37.8      69.2  22.1
## 2    44.5  39.3      45.1  10.4
## 3    17.2  45.9      69.3   9.3
## 4   151.5  41.3      58.5  18.5
## 5   180.8  10.8      58.4  12.9
## 6     8.7  48.9      75.0   7.2
## 7    57.5  32.8      23.5  11.8
## 8   120.2  19.6      11.6  13.2
## 9     8.6   2.1       1.0   4.8
## 10  199.8   2.6      21.2  10.6
## 11   66.1   5.8      24.2   8.6
## 12  214.7  24.0       4.0  17.4
## 13   23.8  35.1      65.9   9.2
## 14   97.5   7.6       7.2   9.7
## 15  204.1  32.9      46.0  19.0
## 16  195.4  47.7      52.9  22.4
## 17   67.8  36.6     114.0  12.5
## 18  281.4  39.6      55.8  24.4
## 19   69.2  20.5      18.3  11.3
## 20  147.3  23.9      19.1  14.6
## 21  218.4  27.7      53.4  18.0
## 22  237.4   5.1      23.5  12.5
## 23   13.2  15.9      49.6   5.6
## 24  228.3  16.9      26.2  15.5
## 25   62.3  12.6      18.3   9.7
## 26  262.9   3.5      19.5  12.0
## 27  142.9  29.3      12.6  15.0
## 28  240.1  16.7      22.9  15.9
## 29  248.8  27.1      22.9  18.9
## 30   70.6  16.0      40.8  10.5
## 31  292.9  28.3      43.2  21.4
## 32  112.9  17.4      38.6  11.9
## 33   97.2   1.5      30.0   9.6
## 34  265.6  20.0       0.3  17.4
## 35   95.7   1.4       7.4   9.5
## 36  290.7   4.1       8.5  12.8
## 37  266.9  43.8       5.0  25.4
## 38   74.7  49.4      45.7  14.7
## 39   43.1  26.7      35.1  10.1
## 40  228.0  37.7      32.0  21.5
## 41  202.5  22.3      31.6  16.6
## 42  177.0  33.4      38.7  17.1
## 43  293.6  27.7       1.8  20.7
## 44  206.9   8.4      26.4  12.9
## 45   25.1  25.7      43.3   8.5
## 46  175.1  22.5      31.5  14.9
## 47   89.7   9.9      35.7  10.6
## 48  239.9  41.5      18.5  23.2
## 49  227.2  15.8      49.9  14.8
## 50   66.9  11.7      36.8   9.7
## 51  199.8   3.1      34.6  11.4
## 52  100.4   9.6       3.6  10.7
## 53  216.4  41.7      39.6  22.6
## 54  182.6  46.2      58.7  21.2
## 55  262.7  28.8      15.9  20.2
## 56  198.9  49.4      60.0  23.7
## 57    7.3  28.1      41.4   5.5
## 58  136.2  19.2      16.6  13.2
## 59  210.8  49.6      37.7  23.8
## 60  210.7  29.5       9.3  18.4
## 61   53.5   2.0      21.4   8.1
## 62  261.3  42.7      54.7  24.2
## 63  239.3  15.5      27.3  15.7
## 64  102.7  29.6       8.4  14.0
## 65  131.1  42.8      28.9  18.0
## 66   69.0   9.3       0.9   9.3
## 67   31.5  24.6       2.2   9.5
## 68  139.3  14.5      10.2  13.4
## 69  237.4  27.5      11.0  18.9
## 70  216.8  43.9      27.2  22.3
## 71  199.1  30.6      38.7  18.3
## 72  109.8  14.3      31.7  12.4
## 73   26.8  33.0      19.3   8.8
## 74  129.4   5.7      31.3  11.0
## 75  213.4  24.6      13.1  17.0
## 76   16.9  43.7      89.4   8.7
## 77   27.5   1.6      20.7   6.9
## 78  120.5  28.5      14.2  14.2
## 79    5.4  29.9       9.4   5.3
## 80  116.0   7.7      23.1  11.0
## 81   76.4  26.7      22.3  11.8
## 82  239.8   4.1      36.9  12.3
## 83   75.3  20.3      32.5  11.3
## 84   68.4  44.5      35.6  13.6
## 85  213.5  43.0      33.8  21.7
## 86  193.2  18.4      65.7  15.2
## 87   76.3  27.5      16.0  12.0
## 88  110.7  40.6      63.2  16.0
## 89   88.3  25.5      73.4  12.9
## 90  109.8  47.8      51.4  16.7
## 91  134.3   4.9       9.3  11.2
## 92   28.6   1.5      33.0   7.3
## 93  217.7  33.5      59.0  19.4
## 94  250.9  36.5      72.3  22.2
## 95  107.4  14.0      10.9  11.5
## 96  163.3  31.6      52.9  16.9
## 97  197.6   3.5       5.9  11.7
## 98  184.9  21.0      22.0  15.5
## 99  289.7  42.3      51.2  25.4
## 100 135.2  41.7      45.9  17.2
## 101 222.4   4.3      49.8  11.7
## 102 296.4  36.3     100.9  23.8
## 103 280.2  10.1      21.4  14.8
## 104 187.9  17.2      17.9  14.7
## 105 238.2  34.3       5.3  20.7
## 106 137.9  46.4      59.0  19.2
## 107  25.0  11.0      29.7   7.2
## 108  90.4   0.3      23.2   8.7
## 109  13.1   0.4      25.6   5.3
## 110 255.4  26.9       5.5  19.8
## 111 225.8   8.2      56.5  13.4
## 112 241.7  38.0      23.2  21.8
## 113 175.7  15.4       2.4  14.1
## 114 209.6  20.6      10.7  15.9
## 115  78.2  46.8      34.5  14.6
## 116  75.1  35.0      52.7  12.6
## 117 139.2  14.3      25.6  12.2
## 118  76.4   0.8      14.8   9.4
## 119 125.7  36.9      79.2  15.9
## 120  19.4  16.0      22.3   6.6
## 121 141.3  26.8      46.2  15.5
## 122  18.8  21.7      50.4   7.0
## 123 224.0   2.4      15.6  11.6
## 124 123.1  34.6      12.4  15.2
## 125 229.5  32.3      74.2  19.7
## 126  87.2  11.8      25.9  10.6
## 127   7.8  38.9      50.6   6.6
## 128  80.2   0.0       9.2   8.8
## 129 220.3  49.0       3.2  24.7
## 130  59.6  12.0      43.1   9.7
## 131   0.7  39.6       8.7   1.6
## 132 265.2   2.9      43.0  12.7
## 133   8.4  27.2       2.1   5.7
## 134 219.8  33.5      45.1  19.6
## 135  36.9  38.6      65.6  10.8
## 136  48.3  47.0       8.5  11.6
## 137  25.6  39.0       9.3   9.5
## 138 273.7  28.9      59.7  20.8
## 139  43.0  25.9      20.5   9.6
## 140 184.9  43.9       1.7  20.7
## 141  73.4  17.0      12.9  10.9
## 142 193.7  35.4      75.6  19.2
## 143 220.5  33.2      37.9  20.1
## 144 104.6   5.7      34.4  10.4
## 145  96.2  14.8      38.9  11.4
## 146 140.3   1.9       9.0  10.3
## 147 240.1   7.3       8.7  13.2
## 148 243.2  49.0      44.3  25.4
## 149  38.0  40.3      11.9  10.9
## 150  44.7  25.8      20.6  10.1
## 151 280.7  13.9      37.0  16.1
## 152 121.0   8.4      48.7  11.6
## 153 197.6  23.3      14.2  16.6
## 154 171.3  39.7      37.7  19.0
## 155 187.8  21.1       9.5  15.6
## 156   4.1  11.6       5.7   3.2
## 157  93.9  43.5      50.5  15.3
## 158 149.8   1.3      24.3  10.1
## 159  11.7  36.9      45.2   7.3
## 160 131.7  18.4      34.6  12.9
## 161 172.5  18.1      30.7  14.4
## 162  85.7  35.8      49.3  13.3
## 163 188.4  18.1      25.6  14.9
## 164 163.5  36.8       7.4  18.0
## 165 117.2  14.7       5.4  11.9
## 166 234.5   3.4      84.8  11.9
## 167  17.9  37.6      21.6   8.0
## 168 206.8   5.2      19.4  12.2
## 169 215.4  23.6      57.6  17.1
## 170 284.3  10.6       6.4  15.0
## 171  50.0  11.6      18.4   8.4
## 172 164.5  20.9      47.4  14.5
## 173  19.6  20.1      17.0   7.6
## 174 168.4   7.1      12.8  11.7
## 175 222.4   3.4      13.1  11.5
## 176 276.9  48.9      41.8  27.0
## 177 248.4  30.2      20.3  20.2
## 178 170.2   7.8      35.2  11.7
## 179 276.7   2.3      23.7  11.8
## 180 165.6  10.0      17.6  12.6
## 181 156.6   2.6       8.3  10.5
## 182 218.5   5.4      27.4  12.2
## 183  56.2   5.7      29.7   8.7
## 184 287.6  43.0      71.8  26.2
## 185 253.8  21.3      30.0  17.6
## 186 205.0  45.1      19.6  22.6
## 187 139.5   2.1      26.6  10.3
## 188 191.1  28.7      18.2  17.3
## 189 286.0  13.9       3.7  15.9
## 190  18.7  12.1      23.4   6.7
## 191  39.5  41.1       5.8  10.8
## 192  75.5  10.8       6.0   9.9
## 193  17.2   4.1      31.6   5.9
## 194 166.8  42.0       3.6  19.6
## 195 149.7  35.6       6.0  17.3
## 196  38.2   3.7      13.8   7.6
## 197  94.2   4.9       8.1   9.7
## 198 177.0   9.3       6.4  12.8
## 199 283.6  42.0      66.2  25.5
## 200 232.1   8.6       8.7  13.4

cov_mat1 <- cov(advertising_df)
cov_mat1

##                   TV     radio newspaper     sales
## TV        7370.94989  69.86249 105.91945 350.39019
## radio       69.86249 220.42774 114.49698  44.63569
## newspaper  105.91945 114.49698 474.30833  25.94139
## sales      350.39019  44.63569  25.94139  27.22185

Regarding Covariance vs. Strength (The Scale Problem): “While the covariance matrix indicates a positive directional relationship between the advertising budgets (TV, radio, newspaper) and sales, it cannot measure the true strength of these associations. Because covariance is highly dependent on the raw scale of the variables, the unstandardized values cannot be directly compared to determine which advertising medium is the most effective.”
Regarding Multicollinearity: “Furthermore, by inspecting the predictor-to-predictor relationships, we can observe a positive covariance between the radio and newspaper budgets. This internal association between independent variables introduces the risk of multicollinearity, which can inflate the variance of our estimators and make it difficult to isolate their individual effects on sales.

cov2cor(cov_mat1)

##                   TV      radio  newspaper     sales
## TV        1.00000000 0.05480866 0.05664787 0.7822244
## radio     0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales     0.78222442 0.57622257 0.22829903 1.0000000

correlation is just the standized version of covariance and it tells us the relationship and strenghth between two variables .

1 strong perfect positive relationship 0 no relationship -1 strong negative relationship if you could see from the above tv and sales have the strong corelation

advertising_df

##        TV radio newspaper sales
## 1   230.1  37.8      69.2  22.1
## 2    44.5  39.3      45.1  10.4
## 3    17.2  45.9      69.3   9.3
## 4   151.5  41.3      58.5  18.5
## 5   180.8  10.8      58.4  12.9
## 6     8.7  48.9      75.0   7.2
## 7    57.5  32.8      23.5  11.8
## 8   120.2  19.6      11.6  13.2
## 9     8.6   2.1       1.0   4.8
## 10  199.8   2.6      21.2  10.6
## 11   66.1   5.8      24.2   8.6
## 12  214.7  24.0       4.0  17.4
## 13   23.8  35.1      65.9   9.2
## 14   97.5   7.6       7.2   9.7
## 15  204.1  32.9      46.0  19.0
## 16  195.4  47.7      52.9  22.4
## 17   67.8  36.6     114.0  12.5
## 18  281.4  39.6      55.8  24.4
## 19   69.2  20.5      18.3  11.3
## 20  147.3  23.9      19.1  14.6
## 21  218.4  27.7      53.4  18.0
## 22  237.4   5.1      23.5  12.5
## 23   13.2  15.9      49.6   5.6
## 24  228.3  16.9      26.2  15.5
## 25   62.3  12.6      18.3   9.7
## 26  262.9   3.5      19.5  12.0
## 27  142.9  29.3      12.6  15.0
## 28  240.1  16.7      22.9  15.9
## 29  248.8  27.1      22.9  18.9
## 30   70.6  16.0      40.8  10.5
## 31  292.9  28.3      43.2  21.4
## 32  112.9  17.4      38.6  11.9
## 33   97.2   1.5      30.0   9.6
## 34  265.6  20.0       0.3  17.4
## 35   95.7   1.4       7.4   9.5
## 36  290.7   4.1       8.5  12.8
## 37  266.9  43.8       5.0  25.4
## 38   74.7  49.4      45.7  14.7
## 39   43.1  26.7      35.1  10.1
## 40  228.0  37.7      32.0  21.5
## 41  202.5  22.3      31.6  16.6
## 42  177.0  33.4      38.7  17.1
## 43  293.6  27.7       1.8  20.7
## 44  206.9   8.4      26.4  12.9
## 45   25.1  25.7      43.3   8.5
## 46  175.1  22.5      31.5  14.9
## 47   89.7   9.9      35.7  10.6
## 48  239.9  41.5      18.5  23.2
## 49  227.2  15.8      49.9  14.8
## 50   66.9  11.7      36.8   9.7
## 51  199.8   3.1      34.6  11.4
## 52  100.4   9.6       3.6  10.7
## 53  216.4  41.7      39.6  22.6
## 54  182.6  46.2      58.7  21.2
## 55  262.7  28.8      15.9  20.2
## 56  198.9  49.4      60.0  23.7
## 57    7.3  28.1      41.4   5.5
## 58  136.2  19.2      16.6  13.2
## 59  210.8  49.6      37.7  23.8
## 60  210.7  29.5       9.3  18.4
## 61   53.5   2.0      21.4   8.1
## 62  261.3  42.7      54.7  24.2
## 63  239.3  15.5      27.3  15.7
## 64  102.7  29.6       8.4  14.0
## 65  131.1  42.8      28.9  18.0
## 66   69.0   9.3       0.9   9.3
## 67   31.5  24.6       2.2   9.5
## 68  139.3  14.5      10.2  13.4
## 69  237.4  27.5      11.0  18.9
## 70  216.8  43.9      27.2  22.3
## 71  199.1  30.6      38.7  18.3
## 72  109.8  14.3      31.7  12.4
## 73   26.8  33.0      19.3   8.8
## 74  129.4   5.7      31.3  11.0
## 75  213.4  24.6      13.1  17.0
## 76   16.9  43.7      89.4   8.7
## 77   27.5   1.6      20.7   6.9
## 78  120.5  28.5      14.2  14.2
## 79    5.4  29.9       9.4   5.3
## 80  116.0   7.7      23.1  11.0
## 81   76.4  26.7      22.3  11.8
## 82  239.8   4.1      36.9  12.3
## 83   75.3  20.3      32.5  11.3
## 84   68.4  44.5      35.6  13.6
## 85  213.5  43.0      33.8  21.7
## 86  193.2  18.4      65.7  15.2
## 87   76.3  27.5      16.0  12.0
## 88  110.7  40.6      63.2  16.0
## 89   88.3  25.5      73.4  12.9
## 90  109.8  47.8      51.4  16.7
## 91  134.3   4.9       9.3  11.2
## 92   28.6   1.5      33.0   7.3
## 93  217.7  33.5      59.0  19.4
## 94  250.9  36.5      72.3  22.2
## 95  107.4  14.0      10.9  11.5
## 96  163.3  31.6      52.9  16.9
## 97  197.6   3.5       5.9  11.7
## 98  184.9  21.0      22.0  15.5
## 99  289.7  42.3      51.2  25.4
## 100 135.2  41.7      45.9  17.2
## 101 222.4   4.3      49.8  11.7
## 102 296.4  36.3     100.9  23.8
## 103 280.2  10.1      21.4  14.8
## 104 187.9  17.2      17.9  14.7
## 105 238.2  34.3       5.3  20.7
## 106 137.9  46.4      59.0  19.2
## 107  25.0  11.0      29.7   7.2
## 108  90.4   0.3      23.2   8.7
## 109  13.1   0.4      25.6   5.3
## 110 255.4  26.9       5.5  19.8
## 111 225.8   8.2      56.5  13.4
## 112 241.7  38.0      23.2  21.8
## 113 175.7  15.4       2.4  14.1
## 114 209.6  20.6      10.7  15.9
## 115  78.2  46.8      34.5  14.6
## 116  75.1  35.0      52.7  12.6
## 117 139.2  14.3      25.6  12.2
## 118  76.4   0.8      14.8   9.4
## 119 125.7  36.9      79.2  15.9
## 120  19.4  16.0      22.3   6.6
## 121 141.3  26.8      46.2  15.5
## 122  18.8  21.7      50.4   7.0
## 123 224.0   2.4      15.6  11.6
## 124 123.1  34.6      12.4  15.2
## 125 229.5  32.3      74.2  19.7
## 126  87.2  11.8      25.9  10.6
## 127   7.8  38.9      50.6   6.6
## 128  80.2   0.0       9.2   8.8
## 129 220.3  49.0       3.2  24.7
## 130  59.6  12.0      43.1   9.7
## 131   0.7  39.6       8.7   1.6
## 132 265.2   2.9      43.0  12.7
## 133   8.4  27.2       2.1   5.7
## 134 219.8  33.5      45.1  19.6
## 135  36.9  38.6      65.6  10.8
## 136  48.3  47.0       8.5  11.6
## 137  25.6  39.0       9.3   9.5
## 138 273.7  28.9      59.7  20.8
## 139  43.0  25.9      20.5   9.6
## 140 184.9  43.9       1.7  20.7
## 141  73.4  17.0      12.9  10.9
## 142 193.7  35.4      75.6  19.2
## 143 220.5  33.2      37.9  20.1
## 144 104.6   5.7      34.4  10.4
## 145  96.2  14.8      38.9  11.4
## 146 140.3   1.9       9.0  10.3
## 147 240.1   7.3       8.7  13.2
## 148 243.2  49.0      44.3  25.4
## 149  38.0  40.3      11.9  10.9
## 150  44.7  25.8      20.6  10.1
## 151 280.7  13.9      37.0  16.1
## 152 121.0   8.4      48.7  11.6
## 153 197.6  23.3      14.2  16.6
## 154 171.3  39.7      37.7  19.0
## 155 187.8  21.1       9.5  15.6
## 156   4.1  11.6       5.7   3.2
## 157  93.9  43.5      50.5  15.3
## 158 149.8   1.3      24.3  10.1
## 159  11.7  36.9      45.2   7.3
## 160 131.7  18.4      34.6  12.9
## 161 172.5  18.1      30.7  14.4
## 162  85.7  35.8      49.3  13.3
## 163 188.4  18.1      25.6  14.9
## 164 163.5  36.8       7.4  18.0
## 165 117.2  14.7       5.4  11.9
## 166 234.5   3.4      84.8  11.9
## 167  17.9  37.6      21.6   8.0
## 168 206.8   5.2      19.4  12.2
## 169 215.4  23.6      57.6  17.1
## 170 284.3  10.6       6.4  15.0
## 171  50.0  11.6      18.4   8.4
## 172 164.5  20.9      47.4  14.5
## 173  19.6  20.1      17.0   7.6
## 174 168.4   7.1      12.8  11.7
## 175 222.4   3.4      13.1  11.5
## 176 276.9  48.9      41.8  27.0
## 177 248.4  30.2      20.3  20.2
## 178 170.2   7.8      35.2  11.7
## 179 276.7   2.3      23.7  11.8
## 180 165.6  10.0      17.6  12.6
## 181 156.6   2.6       8.3  10.5
## 182 218.5   5.4      27.4  12.2
## 183  56.2   5.7      29.7   8.7
## 184 287.6  43.0      71.8  26.2
## 185 253.8  21.3      30.0  17.6
## 186 205.0  45.1      19.6  22.6
## 187 139.5   2.1      26.6  10.3
## 188 191.1  28.7      18.2  17.3
## 189 286.0  13.9       3.7  15.9
## 190  18.7  12.1      23.4   6.7
## 191  39.5  41.1       5.8  10.8
## 192  75.5  10.8       6.0   9.9
## 193  17.2   4.1      31.6   5.9
## 194 166.8  42.0       3.6  19.6
## 195 149.7  35.6       6.0  17.3
## 196  38.2   3.7      13.8   7.6
## 197  94.2   4.9       8.1   9.7
## 198 177.0   9.3       6.4  12.8
## 199 283.6  42.0      66.2  25.5
## 200 232.1   8.6       8.7  13.4

n <- nrow(advertising_df)
n

## [1] 200

names <- variable.names(advertising_df)
names

## [1] "TV"        "radio"     "newspaper" "sales"

multi_model <- lm(sales ~ .-X , data = advertising)
summary(multi_model)

## 
## Call:
## lm(formula = sales ~ . - X, data = advertising)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

According to the F-statistic, the p-value is very small, which means we reject the global null hypothesis; at least one of the predictors provides significant information about the model. If you look at the p-values for TV and radio, they are less than the 0.05 threshold, meaning they are statistically significant. Conversely, the p-value for newspaper is very high, indicating that it is not statistically significant and provides no useful information. Finally, the Adjusted $R^2$ explains that 89.6% of the total variance is accounted for by the model.The median of the residuals is close to zero, which suggests that the model’s errors are centered properly and do not show significant bias.The median of the residuals is close to zero, which suggests that the model’s errors are centered properly and do not show significant bias. As the newspaper is useless we can easily remove it to improve our model without losing the model predictive power.

e <- residuals(multi_model)
RSS <- sum(e^2)
TSS <- var(advertising_df$sales) * n-1

r.square <- 1 - RSS/TSS
r.square

## [1] 0.8977058

multi_new <- update(multi_model, ~ . -newspaper)
summary(multi_new)

## 
## Call:
## lm(formula = sales ~ TV + radio, data = advertising)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7977 -0.8752  0.2422  1.1708  2.8328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.92110    0.29449   9.919   <2e-16 ***
## TV           0.04575    0.00139  32.909   <2e-16 ***
## radio        0.18799    0.00804  23.382   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962 
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16

Before removing the variable we will use F test to do the comparison between the two models to be sure that if the variable is actually useless or not .

F TEST FOR MODEL COMPARISON

e1 <- residuals(multi_new)

RSS0 <- sum(e1^2) 



a <- (RSS0 - RSS)
b <- RSS/(n-4)

F_distribution <- a/b

F_distribution

## [1] 0.03122805

#pvalue of f distribution

f_pvalue <- 1 - pf(F_distribution, df1 = 1, df2 = n-4)
f_pvalue

## [1] 0.8599151

Because the p-value for the newspaper variable is very high, it is not statistically significant. This means it provides no unique or useful information to the model. Therefore, we can safely remove this predictor without losing any valuable information, which helps us create a simpler and more efficient model.

cat("r square full: ", summary(multi_model)$r.squared)

## r square full:  0.8972106

 cat("r square reduce: ", summary(multi_new)$r.squared)

## r square reduce:  0.8971943

cat("AIC FULL :", AIC(multi_model))

## AIC FULL : 782.3622

cat("AIC REDUCE :", AIC(multi_new))

## AIC REDUCE : 780.3941

cat("Adj R-squared (Full):   ", summary(multi_model)$adj.r.squared, "\n")

## Adj R-squared (Full):    0.8956373

cat("Adj R-squared (NEW):   ", summary(multi_new)$adj.r.squared, "\n")

## Adj R-squared (NEW):    0.8961505

I chose the reduced model because it is more efficient than the full model. The Adjusted $R^2$ increased, which confirms that the removed variable was not contributing to the model’s accuracy, and the lower AIC indicates that the model is now more parsimonious—providing the best balance of accuracy and simplicity.

Interaction

model_int <- lm(sales ~ TV*radio, data = advertising_df)
summary(model_int)

## 
## Call:
## lm(formula = sales ~ TV * radio, data = advertising_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3366 -0.4028  0.1831  0.5948  1.5246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
## TV:radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16

The interaction term between TV and radio is statistically significant, as indicated by a p-value of less than 0.05. This reveals a positive synergy: the effect of TV advertising on sales is enhanced when the radio budget is also increased. Furthermore, the increase in Adjusted $R^2$ confirms that this interaction term adds meaningful explanatory power to the model, making it a better fit than the model with only main effects.

“The TV and radio interaction is significant (p < 0.05). This means that these two ads work better together than they do separately. Because the Adjusted $R^2$ went up, we know that including this interaction makes our model better at predicting sales

PREDICTION

simple.model <- lm(sales ~ TV, data = advertising_df)
n.new <- data.frame(TV = 170)


predict(simple.model , newdata = n.new , interval = "confidence")

##        fit      lwr      upr
## 1 15.11382 14.64338 15.58426

predict(simple.model , newdata = n.new , interval = "prediction")

##        fit      lwr      upr
## 1 15.11382 8.670498 21.55715

“The confidence interval is narrower, as it only reflects the uncertainty of the model’s estimate of the average. Conversely, the prediction interval is wider because it reflects both the uncertainty of the model and the random noise (the error term) associated with a single new observation.”

plot(advertising_df$TV,advertising_df$sales , pch = 20, col = "grey")
abline(simple.model, col = "blue")
points(170,predict(simple.model,newdata = n.new)[1], pch =4,cex = 2, lwd = 3)

Assignment n.2 Multiple linear regression 2026-03-25 The Auto dataset The ISLR2 library contains the Auto data set, which records 392 observations on the following 8 variables: mpg (miles per gallon), cylinders, engine displacement (cu. inches), horsepower, vehicle weight (lbs.), acceleration, that is time to accelerate from 0 to 60 mph (sec.), model year (modulo 100), and origin of car (1. American, 2. European, 3. Japanese).

Load the Auto dataset and remove the column name.
Fit the linear regression model of mpg on all the remaining variables in the dataset.
Provide an interpretation of the coefficient of the variable origin.
Initially, remove origin from the model, fit the full model, and provide an interpretation of the sign of the coefficients of the variables that are significantly different from zero.
Use the function update() to fit the reduced model containing only the variables that are significantly different from zero in the full model. Compare both the Multiple R2 and the Adjusted R2 between the full and the reduced model.
Compute, explicitly, the F-test for the comparison of the two models and compare your results with those you can get by applying the anova() function.
Define the variable origin as categorical and fit the full model again and provide an interpretation of the origin parameters.
Apply a one-way Analysis of Variance (ANOVA) to compare the mean efficiency of cars across the three origins (ignoring other variables for this specific test). Include a post-hoc analysis based on the Tukey HSD method. Comment on the results.
Show how the ANOVA test computed with the aov() function can also be computed with the lm() function.
Include an interaction term between origin and weight. Fit the model and comment on the output.
Load the Auto dataset and remove the column name.

library("ISLR2")
data("Auto")
head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

Auto_df <- Auto[, 1:8]
Auto_df

##      mpg cylinders displacement horsepower weight acceleration year origin
## 1   18.0         8        307.0        130   3504         12.0   70      1
## 2   15.0         8        350.0        165   3693         11.5   70      1
## 3   18.0         8        318.0        150   3436         11.0   70      1
## 4   16.0         8        304.0        150   3433         12.0   70      1
## 5   17.0         8        302.0        140   3449         10.5   70      1
## 6   15.0         8        429.0        198   4341         10.0   70      1
## 7   14.0         8        454.0        220   4354          9.0   70      1
## 8   14.0         8        440.0        215   4312          8.5   70      1
## 9   14.0         8        455.0        225   4425         10.0   70      1
## 10  15.0         8        390.0        190   3850          8.5   70      1
## 11  15.0         8        383.0        170   3563         10.0   70      1
## 12  14.0         8        340.0        160   3609          8.0   70      1
## 13  15.0         8        400.0        150   3761          9.5   70      1
## 14  14.0         8        455.0        225   3086         10.0   70      1
## 15  24.0         4        113.0         95   2372         15.0   70      3
## 16  22.0         6        198.0         95   2833         15.5   70      1
## 17  18.0         6        199.0         97   2774         15.5   70      1
## 18  21.0         6        200.0         85   2587         16.0   70      1
## 19  27.0         4         97.0         88   2130         14.5   70      3
## 20  26.0         4         97.0         46   1835         20.5   70      2
## 21  25.0         4        110.0         87   2672         17.5   70      2
## 22  24.0         4        107.0         90   2430         14.5   70      2
## 23  25.0         4        104.0         95   2375         17.5   70      2
## 24  26.0         4        121.0        113   2234         12.5   70      2
## 25  21.0         6        199.0         90   2648         15.0   70      1
## 26  10.0         8        360.0        215   4615         14.0   70      1
## 27  10.0         8        307.0        200   4376         15.0   70      1
## 28  11.0         8        318.0        210   4382         13.5   70      1
## 29   9.0         8        304.0        193   4732         18.5   70      1
## 30  27.0         4         97.0         88   2130         14.5   71      3
## 31  28.0         4        140.0         90   2264         15.5   71      1
## 32  25.0         4        113.0         95   2228         14.0   71      3
## 34  19.0         6        232.0        100   2634         13.0   71      1
## 35  16.0         6        225.0        105   3439         15.5   71      1
## 36  17.0         6        250.0        100   3329         15.5   71      1
## 37  19.0         6        250.0         88   3302         15.5   71      1
## 38  18.0         6        232.0        100   3288         15.5   71      1
## 39  14.0         8        350.0        165   4209         12.0   71      1
## 40  14.0         8        400.0        175   4464         11.5   71      1
## 41  14.0         8        351.0        153   4154         13.5   71      1
## 42  14.0         8        318.0        150   4096         13.0   71      1
## 43  12.0         8        383.0        180   4955         11.5   71      1
## 44  13.0         8        400.0        170   4746         12.0   71      1
## 45  13.0         8        400.0        175   5140         12.0   71      1
## 46  18.0         6        258.0        110   2962         13.5   71      1
## 47  22.0         4        140.0         72   2408         19.0   71      1
## 48  19.0         6        250.0        100   3282         15.0   71      1
## 49  18.0         6        250.0         88   3139         14.5   71      1
## 50  23.0         4        122.0         86   2220         14.0   71      1
## 51  28.0         4        116.0         90   2123         14.0   71      2
## 52  30.0         4         79.0         70   2074         19.5   71      2
## 53  30.0         4         88.0         76   2065         14.5   71      2
## 54  31.0         4         71.0         65   1773         19.0   71      3
## 55  35.0         4         72.0         69   1613         18.0   71      3
## 56  27.0         4         97.0         60   1834         19.0   71      2
## 57  26.0         4         91.0         70   1955         20.5   71      1
## 58  24.0         4        113.0         95   2278         15.5   72      3
## 59  25.0         4         97.5         80   2126         17.0   72      1
## 60  23.0         4         97.0         54   2254         23.5   72      2
## 61  20.0         4        140.0         90   2408         19.5   72      1
## 62  21.0         4        122.0         86   2226         16.5   72      1
## 63  13.0         8        350.0        165   4274         12.0   72      1
## 64  14.0         8        400.0        175   4385         12.0   72      1
## 65  15.0         8        318.0        150   4135         13.5   72      1
## 66  14.0         8        351.0        153   4129         13.0   72      1
## 67  17.0         8        304.0        150   3672         11.5   72      1
## 68  11.0         8        429.0        208   4633         11.0   72      1
## 69  13.0         8        350.0        155   4502         13.5   72      1
## 70  12.0         8        350.0        160   4456         13.5   72      1
## 71  13.0         8        400.0        190   4422         12.5   72      1
## 72  19.0         3         70.0         97   2330         13.5   72      3
## 73  15.0         8        304.0        150   3892         12.5   72      1
## 74  13.0         8        307.0        130   4098         14.0   72      1
## 75  13.0         8        302.0        140   4294         16.0   72      1
## 76  14.0         8        318.0        150   4077         14.0   72      1
## 77  18.0         4        121.0        112   2933         14.5   72      2
## 78  22.0         4        121.0         76   2511         18.0   72      2
## 79  21.0         4        120.0         87   2979         19.5   72      2
## 80  26.0         4         96.0         69   2189         18.0   72      2
## 81  22.0         4        122.0         86   2395         16.0   72      1
## 82  28.0         4         97.0         92   2288         17.0   72      3
## 83  23.0         4        120.0         97   2506         14.5   72      3
## 84  28.0         4         98.0         80   2164         15.0   72      1
## 85  27.0         4         97.0         88   2100         16.5   72      3
## 86  13.0         8        350.0        175   4100         13.0   73      1
## 87  14.0         8        304.0        150   3672         11.5   73      1
## 88  13.0         8        350.0        145   3988         13.0   73      1
## 89  14.0         8        302.0        137   4042         14.5   73      1
## 90  15.0         8        318.0        150   3777         12.5   73      1
## 91  12.0         8        429.0        198   4952         11.5   73      1
## 92  13.0         8        400.0        150   4464         12.0   73      1
## 93  13.0         8        351.0        158   4363         13.0   73      1
## 94  14.0         8        318.0        150   4237         14.5   73      1
## 95  13.0         8        440.0        215   4735         11.0   73      1
## 96  12.0         8        455.0        225   4951         11.0   73      1
## 97  13.0         8        360.0        175   3821         11.0   73      1
## 98  18.0         6        225.0        105   3121         16.5   73      1
## 99  16.0         6        250.0        100   3278         18.0   73      1
## 100 18.0         6        232.0        100   2945         16.0   73      1
## 101 18.0         6        250.0         88   3021         16.5   73      1
## 102 23.0         6        198.0         95   2904         16.0   73      1
## 103 26.0         4         97.0         46   1950         21.0   73      2
## 104 11.0         8        400.0        150   4997         14.0   73      1
## 105 12.0         8        400.0        167   4906         12.5   73      1
## 106 13.0         8        360.0        170   4654         13.0   73      1
## 107 12.0         8        350.0        180   4499         12.5   73      1
## 108 18.0         6        232.0        100   2789         15.0   73      1
## 109 20.0         4         97.0         88   2279         19.0   73      3
## 110 21.0         4        140.0         72   2401         19.5   73      1
## 111 22.0         4        108.0         94   2379         16.5   73      3
## 112 18.0         3         70.0         90   2124         13.5   73      3
## 113 19.0         4        122.0         85   2310         18.5   73      1
## 114 21.0         6        155.0        107   2472         14.0   73      1
## 115 26.0         4         98.0         90   2265         15.5   73      2
## 116 15.0         8        350.0        145   4082         13.0   73      1
## 117 16.0         8        400.0        230   4278          9.5   73      1
## 118 29.0         4         68.0         49   1867         19.5   73      2
## 119 24.0         4        116.0         75   2158         15.5   73      2
## 120 20.0         4        114.0         91   2582         14.0   73      2
## 121 19.0         4        121.0        112   2868         15.5   73      2
## 122 15.0         8        318.0        150   3399         11.0   73      1
## 123 24.0         4        121.0        110   2660         14.0   73      2
## 124 20.0         6        156.0        122   2807         13.5   73      3
## 125 11.0         8        350.0        180   3664         11.0   73      1
## 126 20.0         6        198.0         95   3102         16.5   74      1
## 128 19.0         6        232.0        100   2901         16.0   74      1
## 129 15.0         6        250.0        100   3336         17.0   74      1
## 130 31.0         4         79.0         67   1950         19.0   74      3
## 131 26.0         4        122.0         80   2451         16.5   74      1
## 132 32.0         4         71.0         65   1836         21.0   74      3
## 133 25.0         4        140.0         75   2542         17.0   74      1
## 134 16.0         6        250.0        100   3781         17.0   74      1
## 135 16.0         6        258.0        110   3632         18.0   74      1
## 136 18.0         6        225.0        105   3613         16.5   74      1
## 137 16.0         8        302.0        140   4141         14.0   74      1
## 138 13.0         8        350.0        150   4699         14.5   74      1
## 139 14.0         8        318.0        150   4457         13.5   74      1
## 140 14.0         8        302.0        140   4638         16.0   74      1
## 141 14.0         8        304.0        150   4257         15.5   74      1
## 142 29.0         4         98.0         83   2219         16.5   74      2
## 143 26.0         4         79.0         67   1963         15.5   74      2
## 144 26.0         4         97.0         78   2300         14.5   74      2
## 145 31.0         4         76.0         52   1649         16.5   74      3
## 146 32.0         4         83.0         61   2003         19.0   74      3
## 147 28.0         4         90.0         75   2125         14.5   74      1
## 148 24.0         4         90.0         75   2108         15.5   74      2
## 149 26.0         4        116.0         75   2246         14.0   74      2
## 150 24.0         4        120.0         97   2489         15.0   74      3
## 151 26.0         4        108.0         93   2391         15.5   74      3
## 152 31.0         4         79.0         67   2000         16.0   74      2
## 153 19.0         6        225.0         95   3264         16.0   75      1
## 154 18.0         6        250.0        105   3459         16.0   75      1
## 155 15.0         6        250.0         72   3432         21.0   75      1
## 156 15.0         6        250.0         72   3158         19.5   75      1
## 157 16.0         8        400.0        170   4668         11.5   75      1
## 158 15.0         8        350.0        145   4440         14.0   75      1
## 159 16.0         8        318.0        150   4498         14.5   75      1
## 160 14.0         8        351.0        148   4657         13.5   75      1
## 161 17.0         6        231.0        110   3907         21.0   75      1
## 162 16.0         6        250.0        105   3897         18.5   75      1
## 163 15.0         6        258.0        110   3730         19.0   75      1
## 164 18.0         6        225.0         95   3785         19.0   75      1
## 165 21.0         6        231.0        110   3039         15.0   75      1
## 166 20.0         8        262.0        110   3221         13.5   75      1
## 167 13.0         8        302.0        129   3169         12.0   75      1
## 168 29.0         4         97.0         75   2171         16.0   75      3
## 169 23.0         4        140.0         83   2639         17.0   75      1
## 170 20.0         6        232.0        100   2914         16.0   75      1
## 171 23.0         4        140.0         78   2592         18.5   75      1
## 172 24.0         4        134.0         96   2702         13.5   75      3
## 173 25.0         4         90.0         71   2223         16.5   75      2
## 174 24.0         4        119.0         97   2545         17.0   75      3
## 175 18.0         6        171.0         97   2984         14.5   75      1
## 176 29.0         4         90.0         70   1937         14.0   75      2
## 177 19.0         6        232.0         90   3211         17.0   75      1
## 178 23.0         4        115.0         95   2694         15.0   75      2
## 179 23.0         4        120.0         88   2957         17.0   75      2
## 180 22.0         4        121.0         98   2945         14.5   75      2
## 181 25.0         4        121.0        115   2671         13.5   75      2
## 182 33.0         4         91.0         53   1795         17.5   75      3
## 183 28.0         4        107.0         86   2464         15.5   76      2
## 184 25.0         4        116.0         81   2220         16.9   76      2
## 185 25.0         4        140.0         92   2572         14.9   76      1
## 186 26.0         4         98.0         79   2255         17.7   76      1
## 187 27.0         4        101.0         83   2202         15.3   76      2
## 188 17.5         8        305.0        140   4215         13.0   76      1
## 189 16.0         8        318.0        150   4190         13.0   76      1
## 190 15.5         8        304.0        120   3962         13.9   76      1
## 191 14.5         8        351.0        152   4215         12.8   76      1
## 192 22.0         6        225.0        100   3233         15.4   76      1
## 193 22.0         6        250.0        105   3353         14.5   76      1
## 194 24.0         6        200.0         81   3012         17.6   76      1
## 195 22.5         6        232.0         90   3085         17.6   76      1
## 196 29.0         4         85.0         52   2035         22.2   76      1
## 197 24.5         4         98.0         60   2164         22.1   76      1
## 198 29.0         4         90.0         70   1937         14.2   76      2
## 199 33.0         4         91.0         53   1795         17.4   76      3
## 200 20.0         6        225.0        100   3651         17.7   76      1
## 201 18.0         6        250.0         78   3574         21.0   76      1
## 202 18.5         6        250.0        110   3645         16.2   76      1
## 203 17.5         6        258.0         95   3193         17.8   76      1
## 204 29.5         4         97.0         71   1825         12.2   76      2
## 205 32.0         4         85.0         70   1990         17.0   76      3
## 206 28.0         4         97.0         75   2155         16.4   76      3
## 207 26.5         4        140.0         72   2565         13.6   76      1
## 208 20.0         4        130.0        102   3150         15.7   76      2
## 209 13.0         8        318.0        150   3940         13.2   76      1
## 210 19.0         4        120.0         88   3270         21.9   76      2
## 211 19.0         6        156.0        108   2930         15.5   76      3
## 212 16.5         6        168.0        120   3820         16.7   76      2
## 213 16.5         8        350.0        180   4380         12.1   76      1
## 214 13.0         8        350.0        145   4055         12.0   76      1
## 215 13.0         8        302.0        130   3870         15.0   76      1
## 216 13.0         8        318.0        150   3755         14.0   76      1
## 217 31.5         4         98.0         68   2045         18.5   77      3
## 218 30.0         4        111.0         80   2155         14.8   77      1
## 219 36.0         4         79.0         58   1825         18.6   77      2
## 220 25.5         4        122.0         96   2300         15.5   77      1
## 221 33.5         4         85.0         70   1945         16.8   77      3
## 222 17.5         8        305.0        145   3880         12.5   77      1
## 223 17.0         8        260.0        110   4060         19.0   77      1
## 224 15.5         8        318.0        145   4140         13.7   77      1
## 225 15.0         8        302.0        130   4295         14.9   77      1
## 226 17.5         6        250.0        110   3520         16.4   77      1
## 227 20.5         6        231.0        105   3425         16.9   77      1
## 228 19.0         6        225.0        100   3630         17.7   77      1
## 229 18.5         6        250.0         98   3525         19.0   77      1
## 230 16.0         8        400.0        180   4220         11.1   77      1
## 231 15.5         8        350.0        170   4165         11.4   77      1
## 232 15.5         8        400.0        190   4325         12.2   77      1
## 233 16.0         8        351.0        149   4335         14.5   77      1
## 234 29.0         4         97.0         78   1940         14.5   77      2
## 235 24.5         4        151.0         88   2740         16.0   77      1
## 236 26.0         4         97.0         75   2265         18.2   77      3
## 237 25.5         4        140.0         89   2755         15.8   77      1
## 238 30.5         4         98.0         63   2051         17.0   77      1
## 239 33.5         4         98.0         83   2075         15.9   77      1
## 240 30.0         4         97.0         67   1985         16.4   77      3
## 241 30.5         4         97.0         78   2190         14.1   77      2
## 242 22.0         6        146.0         97   2815         14.5   77      3
## 243 21.5         4        121.0        110   2600         12.8   77      2
## 244 21.5         3         80.0        110   2720         13.5   77      3
## 245 43.1         4         90.0         48   1985         21.5   78      2
## 246 36.1         4         98.0         66   1800         14.4   78      1
## 247 32.8         4         78.0         52   1985         19.4   78      3
## 248 39.4         4         85.0         70   2070         18.6   78      3
## 249 36.1         4         91.0         60   1800         16.4   78      3
## 250 19.9         8        260.0        110   3365         15.5   78      1
## 251 19.4         8        318.0        140   3735         13.2   78      1
## 252 20.2         8        302.0        139   3570         12.8   78      1
## 253 19.2         6        231.0        105   3535         19.2   78      1
## 254 20.5         6        200.0         95   3155         18.2   78      1
## 255 20.2         6        200.0         85   2965         15.8   78      1
## 256 25.1         4        140.0         88   2720         15.4   78      1
## 257 20.5         6        225.0        100   3430         17.2   78      1
## 258 19.4         6        232.0         90   3210         17.2   78      1
## 259 20.6         6        231.0        105   3380         15.8   78      1
## 260 20.8         6        200.0         85   3070         16.7   78      1
## 261 18.6         6        225.0        110   3620         18.7   78      1
## 262 18.1         6        258.0        120   3410         15.1   78      1
## 263 19.2         8        305.0        145   3425         13.2   78      1
## 264 17.7         6        231.0        165   3445         13.4   78      1
## 265 18.1         8        302.0        139   3205         11.2   78      1
## 266 17.5         8        318.0        140   4080         13.7   78      1
## 267 30.0         4         98.0         68   2155         16.5   78      1
## 268 27.5         4        134.0         95   2560         14.2   78      3
## 269 27.2         4        119.0         97   2300         14.7   78      3
## 270 30.9         4        105.0         75   2230         14.5   78      1
## 271 21.1         4        134.0         95   2515         14.8   78      3
## 272 23.2         4        156.0        105   2745         16.7   78      1
## 273 23.8         4        151.0         85   2855         17.6   78      1
## 274 23.9         4        119.0         97   2405         14.9   78      3
## 275 20.3         5        131.0        103   2830         15.9   78      2
## 276 17.0         6        163.0        125   3140         13.6   78      2
## 277 21.6         4        121.0        115   2795         15.7   78      2
## 278 16.2         6        163.0        133   3410         15.8   78      2
## 279 31.5         4         89.0         71   1990         14.9   78      2
## 280 29.5         4         98.0         68   2135         16.6   78      3
## 281 21.5         6        231.0        115   3245         15.4   79      1
## 282 19.8         6        200.0         85   2990         18.2   79      1
## 283 22.3         4        140.0         88   2890         17.3   79      1
## 284 20.2         6        232.0         90   3265         18.2   79      1
## 285 20.6         6        225.0        110   3360         16.6   79      1
## 286 17.0         8        305.0        130   3840         15.4   79      1
## 287 17.6         8        302.0        129   3725         13.4   79      1
## 288 16.5         8        351.0        138   3955         13.2   79      1
## 289 18.2         8        318.0        135   3830         15.2   79      1
## 290 16.9         8        350.0        155   4360         14.9   79      1
## 291 15.5         8        351.0        142   4054         14.3   79      1
## 292 19.2         8        267.0        125   3605         15.0   79      1
## 293 18.5         8        360.0        150   3940         13.0   79      1
## 294 31.9         4         89.0         71   1925         14.0   79      2
## 295 34.1         4         86.0         65   1975         15.2   79      3
## 296 35.7         4         98.0         80   1915         14.4   79      1
## 297 27.4         4        121.0         80   2670         15.0   79      1
## 298 25.4         5        183.0         77   3530         20.1   79      2
## 299 23.0         8        350.0        125   3900         17.4   79      1
## 300 27.2         4        141.0         71   3190         24.8   79      2
## 301 23.9         8        260.0         90   3420         22.2   79      1
## 302 34.2         4        105.0         70   2200         13.2   79      1
## 303 34.5         4        105.0         70   2150         14.9   79      1
## 304 31.8         4         85.0         65   2020         19.2   79      3
## 305 37.3         4         91.0         69   2130         14.7   79      2
## 306 28.4         4        151.0         90   2670         16.0   79      1
## 307 28.8         6        173.0        115   2595         11.3   79      1
## 308 26.8         6        173.0        115   2700         12.9   79      1
## 309 33.5         4        151.0         90   2556         13.2   79      1
## 310 41.5         4         98.0         76   2144         14.7   80      2
## 311 38.1         4         89.0         60   1968         18.8   80      3
## 312 32.1         4         98.0         70   2120         15.5   80      1
## 313 37.2         4         86.0         65   2019         16.4   80      3
## 314 28.0         4        151.0         90   2678         16.5   80      1
## 315 26.4         4        140.0         88   2870         18.1   80      1
## 316 24.3         4        151.0         90   3003         20.1   80      1
## 317 19.1         6        225.0         90   3381         18.7   80      1
## 318 34.3         4         97.0         78   2188         15.8   80      2
## 319 29.8         4        134.0         90   2711         15.5   80      3
## 320 31.3         4        120.0         75   2542         17.5   80      3
## 321 37.0         4        119.0         92   2434         15.0   80      3
## 322 32.2         4        108.0         75   2265         15.2   80      3
## 323 46.6         4         86.0         65   2110         17.9   80      3
## 324 27.9         4        156.0        105   2800         14.4   80      1
## 325 40.8         4         85.0         65   2110         19.2   80      3
## 326 44.3         4         90.0         48   2085         21.7   80      2
## 327 43.4         4         90.0         48   2335         23.7   80      2
## 328 36.4         5        121.0         67   2950         19.9   80      2
## 329 30.0         4        146.0         67   3250         21.8   80      2
## 330 44.6         4         91.0         67   1850         13.8   80      3
## 332 33.8         4         97.0         67   2145         18.0   80      3
## 333 29.8         4         89.0         62   1845         15.3   80      2
## 334 32.7         6        168.0        132   2910         11.4   80      3
## 335 23.7         3         70.0        100   2420         12.5   80      3
## 336 35.0         4        122.0         88   2500         15.1   80      2
## 338 32.4         4        107.0         72   2290         17.0   80      3
## 339 27.2         4        135.0         84   2490         15.7   81      1
## 340 26.6         4        151.0         84   2635         16.4   81      1
## 341 25.8         4        156.0         92   2620         14.4   81      1
## 342 23.5         6        173.0        110   2725         12.6   81      1
## 343 30.0         4        135.0         84   2385         12.9   81      1
## 344 39.1         4         79.0         58   1755         16.9   81      3
## 345 39.0         4         86.0         64   1875         16.4   81      1
## 346 35.1         4         81.0         60   1760         16.1   81      3
## 347 32.3         4         97.0         67   2065         17.8   81      3
## 348 37.0         4         85.0         65   1975         19.4   81      3
## 349 37.7         4         89.0         62   2050         17.3   81      3
## 350 34.1         4         91.0         68   1985         16.0   81      3
## 351 34.7         4        105.0         63   2215         14.9   81      1
## 352 34.4         4         98.0         65   2045         16.2   81      1
## 353 29.9         4         98.0         65   2380         20.7   81      1
## 354 33.0         4        105.0         74   2190         14.2   81      2
## 356 33.7         4        107.0         75   2210         14.4   81      3
## 357 32.4         4        108.0         75   2350         16.8   81      3
## 358 32.9         4        119.0        100   2615         14.8   81      3
## 359 31.6         4        120.0         74   2635         18.3   81      3
## 360 28.1         4        141.0         80   3230         20.4   81      2
## 361 30.7         6        145.0         76   3160         19.6   81      2
## 362 25.4         6        168.0        116   2900         12.6   81      3
## 363 24.2         6        146.0        120   2930         13.8   81      3
## 364 22.4         6        231.0        110   3415         15.8   81      1
## 365 26.6         8        350.0        105   3725         19.0   81      1
## 366 20.2         6        200.0         88   3060         17.1   81      1
## 367 17.6         6        225.0         85   3465         16.6   81      1
## 368 28.0         4        112.0         88   2605         19.6   82      1
## 369 27.0         4        112.0         88   2640         18.6   82      1
## 370 34.0         4        112.0         88   2395         18.0   82      1
## 371 31.0         4        112.0         85   2575         16.2   82      1
## 372 29.0         4        135.0         84   2525         16.0   82      1
## 373 27.0         4        151.0         90   2735         18.0   82      1
## 374 24.0         4        140.0         92   2865         16.4   82      1
## 375 36.0         4        105.0         74   1980         15.3   82      2
## 376 37.0         4         91.0         68   2025         18.2   82      3
## 377 31.0         4         91.0         68   1970         17.6   82      3
## 378 38.0         4        105.0         63   2125         14.7   82      1
## 379 36.0         4         98.0         70   2125         17.3   82      1
## 380 36.0         4        120.0         88   2160         14.5   82      3
## 381 36.0         4        107.0         75   2205         14.5   82      3
## 382 34.0         4        108.0         70   2245         16.9   82      3
## 383 38.0         4         91.0         67   1965         15.0   82      3
## 384 32.0         4         91.0         67   1965         15.7   82      3
## 385 38.0         4         91.0         67   1995         16.2   82      3
## 386 25.0         6        181.0        110   2945         16.4   82      1
## 387 38.0         6        262.0         85   3015         17.0   82      1
## 388 26.0         4        156.0         92   2585         14.5   82      1
## 389 22.0         6        232.0        112   2835         14.7   82      1
## 390 32.0         4        144.0         96   2665         13.9   82      3
## 391 36.0         4        135.0         84   2370         13.0   82      1
## 392 27.0         4        151.0         90   2950         17.3   82      1
## 393 27.0         4        140.0         86   2790         15.6   82      1
## 394 44.0         4         97.0         52   2130         24.6   82      2
## 395 32.0         4        135.0         84   2295         11.6   82      1
## 396 28.0         4        120.0         79   2625         18.6   82      1
## 397 31.0         4        119.0         82   2720         19.4   82      1

Fit the linear regression model of mpg on all the remaining variables in the dataset.

auto_multi <- lm(mpg ~ ., data = Auto_df)

summary(auto_multi)

## 
## Call:
## lm(formula = mpg ~ ., data = Auto_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

According to the F-statistic, the p-value is well below the threshold of 0.05, which leads us to reject the global null hypothesis; this indicates that at least one of the predictors is significantly related to mpg. Specifically, displacement, weight, year, and origin have small individual p-values, suggesting they are statistically significant contributors to the model. The model explains 82.2% of the variance in mpg ($R^2 = 0.8215$). Conversely, variables such as cylinders, horsepower, and acceleration do not show significant individual contributions when included alongside the other predictors.

Provide an interpretation of the coefficient of the variable origin.

Auto_df$origin

##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 2 2 2 2 2 1 1 1 1 1 3 1 3 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 2 1 3 1 2 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1
##  [75] 1 2 2 2 2 1 3 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 3 1 3 3
## [112] 1 1 2 1 1 2 2 2 2 1 2 3 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 2 2 2 3 3 1 2 2 3
## [149] 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 2 3 1 2 1 2 2 2 2 3 2 2 1 1 2
## [186] 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 2 3 3 1 2 1 2 3 2 1 1 1 1 3 1 2 1 3 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 2 1 3 1 1 1 3 2 3 2 3 2 1 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 3 3 1 3 1 1 3 2 2 2 2 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 2
## [297] 1 2 1 1 1 3 2 1 1 1 1 2 3 1 3 1 1 1 1 2 3 3 3 3 3 1 3 2 2 2 2 3 3 2 3 3 2
## [334] 3 1 1 1 1 1 3 1 3 3 3 3 3 1 1 1 2 3 3 3 3 2 2 3 3 1 1 1 1 1 1 1 1 1 1 1 2
## [371] 3 3 1 1 3 3 3 3 3 3 1 1 1 1 3 1 1 1 2 1 1 1

“Since origin is a discrete variable, we cannot interpret it as a numerical variable; therefore, we should convert it into a factor.”

Auto_df$origin <- as.factor(Auto_df$origin)

auto_multi <- lm(mpg ~ ., data = Auto_df)

summary(auto_multi)

## 
## Call:
## lm(formula = mpg ~ ., data = Auto_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0095 -2.0785 -0.0982  1.9856 13.3608 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.795e+01  4.677e+00  -3.839 0.000145 ***
## cylinders    -4.897e-01  3.212e-01  -1.524 0.128215    
## displacement  2.398e-02  7.653e-03   3.133 0.001863 ** 
## horsepower   -1.818e-02  1.371e-02  -1.326 0.185488    
## weight       -6.710e-03  6.551e-04 -10.243  < 2e-16 ***
## acceleration  7.910e-02  9.822e-02   0.805 0.421101    
## year          7.770e-01  5.178e-02  15.005  < 2e-16 ***
## origin2       2.630e+00  5.664e-01   4.643 4.72e-06 ***
## origin3       2.853e+00  5.527e-01   5.162 3.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared:  0.8242, Adjusted R-squared:  0.8205 
## F-statistic: 224.5 on 8 and 383 DF,  p-value: < 2.2e-16

“With Origin 1 serving as the reference group, vehicles from Origin 2 and Origin 3 demonstrate, on average, a 2.63 mpg and 2.85 mpg higher fuel efficiency, respectively, holding all other variables constant. Both coefficients are statistically significant (as indicated by their small p-values), confirming that these origins contribute significantly to the prediction of mpg compared to the baseline.”

Initially, remove origin from the model, fit the full model, and provide an interpretation of the sign of the coefficients of the variables that are significantly different from zero.

full_model_auto <- lm(mpg~.-origin, data = Auto_df) 
summary(full_model_auto)

## 
## Call:
## lm(formula = mpg ~ . - origin, data = Auto_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6927 -2.3864 -0.0801  2.0291 14.3607 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.454e+01  4.764e+00  -3.051  0.00244 ** 
## cylinders    -3.299e-01  3.321e-01  -0.993  0.32122    
## displacement  7.678e-03  7.358e-03   1.044  0.29733    
## horsepower   -3.914e-04  1.384e-02  -0.028  0.97745    
## weight       -6.795e-03  6.700e-04 -10.141  < 2e-16 ***
## acceleration  8.527e-02  1.020e-01   0.836  0.40383    
## year          7.534e-01  5.262e-02  14.318  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.435 on 385 degrees of freedom
## Multiple R-squared:  0.8093, Adjusted R-squared:  0.8063 
## F-statistic: 272.2 on 6 and 385 DF,  p-value: < 2.2e-16

“In this model, weight and year are statistically significant predictors of mpg (p < 0.05). Conversely, predictors such as cylinders, displacement, and horsepower exhibit high p-values, suggesting that they do not provide statistically significant unique information in the presence of the other variables. This is likely due to multicollinearity, where these features share overlapping information regarding vehicle size and engine performance.”

Use the function update() to fit the reduced model containing only the variables that are significantly different from zero in the full model. Compare both the Multiple R2 and the Adjusted R2 between the full and the reduced model.

auto_reduce <- update(full_model_auto,~.-cylinders-displacement-horsepower-acceleration )
summary(auto_reduce)

## 
## Call:
## lm(formula = mpg ~ weight + year, data = Auto_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8505 -2.3014 -0.1167  2.0367 14.3555 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.435e+01  4.007e+00  -3.581 0.000386 ***
## weight      -6.632e-03  2.146e-04 -30.911  < 2e-16 ***
## year         7.573e-01  4.947e-02  15.308  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.427 on 389 degrees of freedom
## Multiple R-squared:  0.8082, Adjusted R-squared:  0.8072 
## F-statistic: 819.5 on 2 and 389 DF,  p-value: < 2.2e-16

cat( " r square auto (full):" , summary(full_model_auto)$r.squared)

##  r square auto (full): 0.8092553

cat( " adj r square auto (full):" , summary(full_model_auto)$adj.r.squared)

##  adj r square auto (full): 0.8062826

cat( " r square auto (reduce):" , summary(auto_reduce)$r.squared)

##  r square auto (reduce): 0.8081803

cat( " adj r square auto (reduce):" , summary(auto_reduce)$adj.r.squared)

##  adj r square auto (reduce): 0.8071941

Compute, explicitly, the F-test for the comparison of the two models and compare your results with those you can get by applying the anova() function.

ner <- nrow(Auto_df)
ner

## [1] 392

er <- residuals(full_model_auto)
RSSer <- sum(er^2)
TSSer <- var(Auto_df$mpg) * ((ner)-1)



R_squareAuto <- 1 - (RSSer/TSSer)
R_squareAuto

## [1] 0.8092553

ero <- residuals(auto_reduce)
RSSero <- sum(ero^2)


R_squareAutore <- 1 - (RSSero/TSSer)
R_squareAutore

## [1] 0.8081803

F_test_auto <- ((RSSero-RSSer)/4)/(RSSer/(ner-7))
F_test_auto

## [1] 0.5424377

pvalaf <- 1-pf(F_test_auto, df1 = 4, df2 = ner - 7)
pvalaf

## [1] 0.7046591

“The F-test resulted in an F-statistic of 0.542 with a p-value of 0.704. Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis. This indicates that the four removed predictors—cylinders, displacement, horsepower, and acceleration—do not contribute significantly to the explanation of mpg once weight and year are included in the model. Therefore, the reduced model is preferred for its simplicity and equivalent explanatory power.”

anova_auto <- anova(full_model_auto,auto_reduce)
anova_auto

## Analysis of Variance Table
## 
## Model 1: mpg ~ (cylinders + displacement + horsepower + weight + acceleration + 
##     year + origin) - origin
## Model 2: mpg ~ weight + year
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    385 4543.3                           
## 2    389 4569.0 -4   -25.605 0.5424 0.7047

“To determine the most parsimonious model, we compared the full model (containing six predictors) against a reduced model (containing only weight and year). The full model yielded an $R^2$ of 0.809, while the reduced model yielded an $R^2$ of 0.808, indicating that the four removed predictors—cylinders, displacement, horsepower, and acceleration—contributed negligible unique information. This was statistically confirmed by an F-test, which resulted in an F-statistic of 0.542 and a p-value of 0.705. Since the p-value significantly exceeds the 0.05 threshold, we fail to reject the null hypothesis, concluding that the reduced model provides a fit equivalent to the full model. Consequently, the reduced model is preferred for its simplicity and equivalent predictive performance.”

Apply a one-way Analysis of Variance (ANOVA) to compare the mean efficiency of cars across the three origins (ignoring other variables for this specific test). Include a post-hoc analysis based on the Tukey HSD method. Comment on the results.
Show how the ANOVA test computed with the aov() function can also be computed with the lm() function.

Task 8: ANOVA and Tukey HSD

# Define the object in the same chunk where you use it
anova_origin <- aov(mpg ~ origin, data = Auto_df)
summary(anova_origin)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## origin        2   7904    3952    96.6 <2e-16 ***
## Residuals   389  15915      41                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(anova_origin)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mpg ~ origin, data = Auto_df)
## 
## $origin
##          diff       lwr       upr     p adj
## 2-1  7.569472 5.5068042  9.632139 0.0000000
## 3-1 10.417164 8.4701431 12.364184 0.0000000
## 3-2  2.847692 0.3583458  5.337038 0.0202502

“A one-way ANOVA was conducted to compare mpg across the three vehicle origins. The test returned a significant F-statistic ($F(2, 389) = 96.6, p < 0.001$), leading us to reject the null hypothesis of equal means. Subsequent post-hoc analysis using the Tukey HSD method revealed that all three groups differ significantly from one another ($p_{adj} < 0.05$ for all pairs). Specifically, Origin 3 demonstrates the highest average efficiency, followed by Origin 2, with Origin 1 showing the lowest average efficiency.”

anova_origin <- aov(mpg ~ origin, data = Auto_df)
lm_origin <- lm(mpg ~ origin, data = Auto_df)
summary(lm_origin)

## 
## Call:
## lm(formula = mpg ~ origin, data = Auto_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.451  -5.034  -1.034   3.649  18.966 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.0335     0.4086  49.025   <2e-16 ***
## origin2       7.5695     0.8767   8.634   <2e-16 ***
## origin3      10.4172     0.8276  12.588   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.396 on 389 degrees of freedom
## Multiple R-squared:  0.3318, Adjusted R-squared:  0.3284 
## F-statistic:  96.6 on 2 and 389 DF,  p-value: < 2.2e-16

summary(anova_origin)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## origin        2   7904    3952    96.6 <2e-16 ***
## Residuals   389  15915      41                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Include an interaction term between origin and weight. Fit the model and comment on the output.

model_int_auto <- lm(mpg ~ origin*weight, data = Auto_df )
summary(model_int_auto)

## 
## Call:
## lm(formula = mpg ~ origin * weight, data = Auto_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4928  -2.7715  -0.3895   2.2397  15.5163 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.315e+01  1.186e+00  36.378  < 2e-16 ***
## origin2         1.125e+00  2.878e+00   0.391  0.69616    
## origin3         1.111e+01  3.574e+00   3.109  0.00202 ** 
## weight         -6.854e-03  3.423e-04 -20.020  < 2e-16 ***
## origin2:weight  3.575e-06  1.111e-03   0.003  0.99743    
## origin3:weight -3.865e-03  1.541e-03  -2.508  0.01255 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.253 on 386 degrees of freedom
## Multiple R-squared:  0.7068, Adjusted R-squared:  0.703 
## F-statistic: 186.1 on 5 and 386 DF,  p-value: < 2.2e-16

“The model including the interaction term origin * weight reveals a significant interaction effect for Origin 3 (p = 0.0125). This indicates that the relationship between vehicle weight and fuel efficiency is not uniform across all origins; specifically, weight has a stronger negative impact on the mpg of vehicles from Origin 3 compared to the baseline. Conversely, the interaction between Origin 2 and weight was found to be statistically non-significant (p = 0.997), suggesting that the effect of weight on mpg for Origin 2 is consistent with that of Origin 1.”

model_ad <- lm( mpg ~ horsepower , data = Auto_df)  
summary(model_ad)

## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

plot(Auto_df$horsepower, Auto_df$mpg, pch = 20, lwd = 3, col = "grey")
abline(model_ad, col ="blue")

par(mfrow = c(2,2))
plot(model_ad)

“The diagnostic plots reveal potential issues with the linear model. The Residuals vs. Fitted and Scale-Location plots exhibit a non-linear curvature, indicating a failure to capture the true underlying relationship between variables, likely requiring a transformation or a non-linear term. The Q-Q plot shows deviations at the tails, suggesting that the residuals are not normally distributed. Finally, while there is no extreme influence evidenced by Cook’s distance, the presence of specific labeled points in the Residuals vs. Leverage plot suggests the existence of influential outliers. Together, these diagnostics imply that the current linear model is insufficient, and further data transformation or model refinement is recommended.”

solution of underfitting data transformation

model_ad_poly <- lm(mpg ~ poly(horsepower, degree = 2), data = Auto_df)
summary(model_ad_poly)

## 
## Call:
## lm(formula = mpg ~ poly(horsepower, degree = 2), data = Auto_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     23.4459     0.2209  106.13   <2e-16 ***
## poly(horsepower, degree = 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
## poly(horsepower, degree = 2)2   44.0895     4.3739   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

“The quadratic model is statistically significant overall, and both the linear and quadratic terms have p-values well below the 0.05 threshold. This indicates that there is a significant curved relationship between horsepower and the response variable. The model accounts for 68.6% of the variance in the data, and the residuals are well-centered around zero, suggesting the model is a robust fit.”

Does it make sense why we use degree = 2? It’s because the relationship between these two variables wasn’t just a straight line—it had a bend to it, and this model “bent” the regression line to match that curve perfectly.

plot(Auto_df$horsepower,Auto_df$mpg , pch = 20, col = "darkgrey")

xp <- seq(min(Auto_df$horsepower), max(Auto_df$horsepower), lenght = 100)

## Warning: In seq.default(min(Auto_df$horsepower), max(Auto_df$horsepower), 
##     lenght = 100) :
##  extra argument 'lenght' will be disregarded

newdata <- data.frame(horsepower = xp)



lines( xp, predict(model_ad_poly, newdata = newdata), col = "darkblue")

par(mfrow = c(1,2))
plot(model_ad_poly)

Interpretation: The “Residuals vs Fitted” plot is now much flatter, and the U-shape is gone. Adding the polynomial term significantly improved our model’s fit!

However, if we look closely at the scale-location plot (standardized residuals), the assumption of homoscedasticity is not perfectly met. There is a slight funnel shape across the fitted values. Does this mean transforming the predictor wasn’t enough? Is there still unexplained variance that requires us to include even higher polynomial terms?

model_ad_poly_high <- lm(mpg ~ poly(horsepower, degree = 15), data = Auto_df)
summary(model_ad_poly_high)

## 
## Call:
## lm(formula = mpg ~ poly(horsepower, degree = 15), data = Auto_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.659  -2.534  -0.129   2.145  13.873 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       23.4459     0.2156 108.768  < 2e-16 ***
## poly(horsepower, degree = 15)1  -120.1377     4.2679 -28.149  < 2e-16 ***
## poly(horsepower, degree = 15)2    44.0895     4.2679  10.331  < 2e-16 ***
## poly(horsepower, degree = 15)3    -3.9488     4.2679  -0.925  0.35543    
## poly(horsepower, degree = 15)4    -5.1878     4.2679  -1.216  0.22492    
## poly(horsepower, degree = 15)5    13.2722     4.2679   3.110  0.00201 ** 
## poly(horsepower, degree = 15)6    -8.5462     4.2679  -2.002  0.04595 *  
## poly(horsepower, degree = 15)7     7.9806     4.2679   1.870  0.06227 .  
## poly(horsepower, degree = 15)8     2.1727     4.2679   0.509  0.61099    
## poly(horsepower, degree = 15)9    -3.9182     4.2679  -0.918  0.35917    
## poly(horsepower, degree = 15)10   -2.6146     4.2679  -0.613  0.54050    
## poly(horsepower, degree = 15)11    3.5636     4.2679   0.835  0.40426    
## poly(horsepower, degree = 15)12    1.1451     4.2679   0.268  0.78861    
## poly(horsepower, degree = 15)13    0.6041     4.2679   0.142  0.88752    
## poly(horsepower, degree = 15)14   -3.8267     4.2679  -0.897  0.37049    
## poly(horsepower, degree = 15)15   13.4922     4.2679   3.161  0.00170 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.268 on 376 degrees of freedom
## Multiple R-squared:  0.7125, Adjusted R-squared:  0.701 
## F-statistic: 62.11 on 15 and 376 DF,  p-value: < 2.2e-16

“This model suffers from overfitting. By using a 15th-degree polynomial, the model has become overly complex and is capturing random noise rather than the underlying trend. The high number of insignificant p-values and the drop in Adjusted $R^2$ confirm that a simpler model—likely a degree-2 polynomial—would be much more reliable and better at predicting future observations.”

plot(Auto_df$horsepower,Auto_df$mpg , pch = 20, col = "darkgrey")

xp0 <- seq(min(Auto_df$horsepower), max(Auto_df$horsepower), lenght = 100)

## Warning: In seq.default(min(Auto_df$horsepower), max(Auto_df$horsepower), 
##     lenght = 100) :
##  extra argument 'lenght' will be disregarded

newdata <- data.frame(horsepower = xp0)



lines( xp0, predict(model_ad_poly_high, newdata = newdata), col = "darkblue")

plot(model_ad_poly_high)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Interpretation: Notice the wildly fluctuating red curve—it bends unnaturally just to chase individual data points, particularly at the boundaries. Can you imagine doing extrapolation with this model?

In the diagnostic plots, we see strange vertical striping patterns in the residuals and an extreme outlier/leverage point pulling heavily on the model. Clearly, arbitrarily increasing the polynomial degree is not the right way to fix our variance issues!

model_y <- lm(log(mpg) ~ poly(horsepower , degree = 2), data = Auto_df)
summary(model_y)

## 
## Call:
## lm(formula = log(mpg) ~ poly(horsepower, degree = 2), data = Auto_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66460 -0.12041  0.00316  0.11349  0.66376 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    3.098313   0.008908 347.820  < 2e-16 ***
## poly(horsepower, degree = 2)1 -5.581826   0.176366 -31.649  < 2e-16 ***
## poly(horsepower, degree = 2)2  1.397641   0.176366   7.925 2.44e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1764 on 389 degrees of freedom
## Multiple R-squared:  0.7324, Adjusted R-squared:  0.731 
## F-statistic: 532.2 on 2 and 389 DF,  p-value: < 2.2e-16

This model represents an optimal balance between accuracy and simplicity. By applying a log-transformation to the response variable and using a quadratic polynomial for the predictor, the model achieves a high Adjusted $R^2$ of 0.731. The statistical significance of the quadratic term confirms that the relationship is non-linear, and the well-centered residuals indicate that the model is robust and reliable.

par(mfrow = c(2, 2))
plot(model_y)

The variance is now much better behaved, and the points on the normal Q-Q plot follow the theoretical line more closely.

Important: When you predict using a log-transformed model, the result (as well as the confidence intervals) will be in the log scale. You must use the exp() function to back-transform the predictions to the original scale (MPG).

Also note that even though our model includes a degree 2 polynomial for horsepower, we only need to provide the raw horsepower values to our newdata object. R automatically applies the correct polynomial transformation behind the scenes!

new.n <- data.frame(horsepower = 220)
prediction_level <- predict(model_y, newdata = new.n , interval = "confidence")
prediction_level

##        fit      lwr      upr
## 1 2.526395 2.438396 2.614395

exp(prediction_level)

##        fit      lwr      upr
## 1 12.50833 11.45465 13.65895

“I used the confidence interval because my primary goal was to estimate the expected average mpg for cars with 220 horsepower. This provides a precise estimate of the model’s average trend at that point. If I were predicting the efficiency of a single, specific vehicle, I would use a prediction interval to account for the additional variability in individual observations.”

Think of it like this:

Confidence Interval: “What is the average MPG for this group?” (Good for proving the model works).

Prediction Interval: “What is the MPG for this specific car?” (Good for real-world car buyers).

plot(Auto_df$horsepower, log(Auto_df$mpg), pch = 20 , col = "grey")
xp1 <- seq(min(Auto_df$horsepower), max(Auto_df$horsepower), length = 100)

# Instead of just the raw horsepower, include the poly syntax
lines(xp1, predict(model_y, newdata = data.frame(horsepower = xp1)), col = "darkred", lwd = 3)

“This scatter plot illustrates a negative, quadratic relationship between vehicle horsepower and the log of miles per gallon. The use of a log-transformation on the response variable helps to capture the non-linear trend, and the red regression curve demonstrates that the quadratic model provides a strong, well-centered fit to the observed data, effectively capturing the diminishing effect that horsepower has on fuel efficiency at higher levels.”

GLM

glm_auto <- glm(mpg ~ poly(horsepower, degree = 2), family = Gamma(link = "log"), data = Auto_df )
summary(glm_auto)

## 
## Call:
## glm(formula = mpg ~ poly(horsepower, degree = 2), family = Gamma(link = "log"), 
##     data = Auto_df)
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    3.113649   0.008929 348.696  < 2e-16 ***
## poly(horsepower, degree = 2)1 -5.631775   0.176793 -31.855  < 2e-16 ***
## poly(horsepower, degree = 2)2  1.405670   0.176793   7.951 2.03e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.03125591)
## 
##     Null deviance: 44.205  on 391  degrees of freedom
## Residual deviance: 12.023  on 389  degrees of freedom
## AIC: 2185.6
## 
## Number of Fisher Scoring iterations: 4

The GLM demonstrates a significant improvement over the null model, as evidenced by the substantial reduction in deviance. All significant predictors (p < 0.05) exhibit a directional effect consistent with the theoretical relationship. The deviance residuals are symmetric and centered, and the AIC provides a benchmark for further model selection, confirming that this model is both statistically robust and an efficient fit for the data.This Gamma GLM is highly significant, as evidenced by the extreme reduction in deviance from the null model and the very low p-values for both the linear and quadratic horsepower terms. The significance of the quadratic term confirms a non-linear relationship. With an AIC of 2185.6 and a low dispersion parameter, the model provides a robust and well-fitting statistical representation of the relationship between horsepower and the response variable.

Predictions in a GLM require specifying the type argument. This is where GLMs show a major practical advantage over manually transforming Y

type = “link” returns the linear predictor (in this case, the predicted value on the log scale). type = “response” automatically applies the inverse-link function (the exponential) and returns the predicted expected value in the original scale (MPG). You do not need to manually compute exp()! # Create a sequence of horsepower values for a smooth curve

plot(Auto_df$horsepower, Auto_df$mpg, pch = 20, col = "lightblue" )

ne <- seq(min(Auto_df$horsepower), max(Auto_df$horsepower), length = 100 )
newD <- data.frame(horsepower = ne)
lines(ne, predict(glm_auto, newdata = newD, type = "response"), col = "purple", lwd = 3)

“This Gamma GLM is statistically robust, evidenced by the extreme reduction in deviance compared to the null model. Both the linear and quadratic terms for horsepower are highly significant, confirming a strong non-linear relationship between the predictor and the response. The low dispersion parameter and the resulting AIC indicate a well-fitting model that appropriately handles the positive, continuous nature of the data using a log link function.”

par(mfrow = c(2,2))
plot(glm_auto)

par(mfrow = c(1,1))

“While the inverse link is the canonical link for the Gamma distribution, I chose the log link because it provides more intuitive, interpretable results by modeling multiplicative effects rather than reciprocal ones. Furthermore, the log link offers superior numerical stability, particularly when predicting across a wider range of values, and prevents the risk of non-convergence associated with the inverse link.” “The diagnostic plots collectively indicate a well-fitting Gamma GLM. The Residuals vs. Fitted plot confirms the model structure is correctly specified, while the Normal Q-Q plot validates the choice of the Gamma distribution. The Scale-Location plot shows no evidence of non-constant variance, and the Residuals vs. Leverage plot confirms that the results are not unduly biased by influential outliers. Therefore, the model provides a robust and reliable statistical explanation of the data.”

“The diagnostic plots collectively indicate a well-fitting Gamma GLM. The Residuals vs. Fitted plot confirms the model structure is correctly specified, while the Normal Q-Q plot validates the choice of the Gamma distribution. The Scale-Location plot shows no evidence of non-constant variance, and the Residuals vs. Leverage plot confirms that the results are not unduly biased by influential outliers. Therefore, the model provides a robust and reliable statistical explanation of the data.”

Possion distribution

library("faraway")

data(gala)
gala

##              Species Endemics    Area Elevation Nearest Scruz Adjacent
## Baltra            58       23   25.09       346     0.6   0.6     1.84
## Bartolome         31       21    1.24       109     0.6  26.3   572.33
## Caldwell           3        3    0.21       114     2.8  58.7     0.78
## Champion          25        9    0.10        46     1.9  47.4     0.18
## Coamano            2        1    0.05        77     1.9   1.9   903.82
## Daphne.Major      18       11    0.34       119     8.0   8.0     1.84
## Daphne.Minor      24        0    0.08        93     6.0  12.0     0.34
## Darwin            10        7    2.33       168    34.1 290.2     2.85
## Eden               8        4    0.03        71     0.4   0.4    17.95
## Enderby            2        2    0.18       112     2.6  50.2     0.10
## Espanola          97       26   58.27       198     1.1  88.3     0.57
## Fernandina        93       35  634.49      1494     4.3  95.3  4669.32
## Gardner1          58       17    0.57        49     1.1  93.1    58.27
## Gardner2           5        4    0.78       227     4.6  62.2     0.21
## Genovesa          40       19   17.35        76    47.4  92.2   129.49
## Isabela          347       89 4669.32      1707     0.7  28.1   634.49
## Marchena          51       23  129.49       343    29.1  85.9    59.56
## Onslow             2        2    0.01        25     3.3  45.9     0.10
## Pinta            104       37   59.56       777    29.1 119.6   129.49
## Pinzon           108       33   17.95       458    10.7  10.7     0.03
## Las.Plazas        12        9    0.23        94     0.5   0.6    25.09
## Rabida            70       30    4.89       367     4.4  24.4   572.33
## SanCristobal     280       65  551.62       716    45.2  66.6     0.57
## SanSalvador      237       81  572.33       906     0.2  19.8     4.89
## SantaCruz        444       95  903.82       864     0.6   0.0     0.52
## SantaFe           62       28   24.08       259    16.5  16.5     0.52
## SantaMaria       285       73  170.92       640     2.6  49.2     0.10
## Seymour           44       16    1.84       147     0.6   9.6    25.09
## Tortuga           16        8    1.24       186     6.8  50.9    17.95
## Wolf              21       12    2.85       253    34.1 254.7     2.33

mod_p <- glm(Species ~  Area + Elevation + Nearest + Scruz + Adjacent,family = poisson(link = "log") , data = gala)
summary(mod_p)

## 
## Call:
## glm(formula = Species ~ Area + Elevation + Nearest + Scruz + 
##     Adjacent, family = poisson(link = "log"), data = gala)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.155e+00  5.175e-02  60.963  < 2e-16 ***
## Area        -5.799e-04  2.627e-05 -22.074  < 2e-16 ***
## Elevation    3.541e-03  8.741e-05  40.507  < 2e-16 ***
## Nearest      8.826e-03  1.821e-03   4.846 1.26e-06 ***
## Scruz       -5.709e-03  6.256e-04  -9.126  < 2e-16 ***
## Adjacent    -6.630e-04  2.933e-05 -22.608  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3510.73  on 29  degrees of freedom
## Residual deviance:  716.85  on 24  degrees of freedom
## AIC: 889.68
## 
## Number of Fisher Scoring iterations: 5

A Note on Coefficient Significance in GLMs: Notice the z

-values and associated p-values in the summary() output above. These are based on Wald tests. In Generalized Linear Models, Wald tests can sometimes be mathematically unreliable (for instance, when coefficients are large, the standard error can become artificially inflated, leading to false negatives). Therefore, to robustly test whether a parameter (or a set of parameters) is truly significant, it is much better to perform a model comparison test, rather than just looking at the summary table.

A Note on Coefficient Significance in GLMs: Notice the z

model_p_reduce <- glm(Species ~  Area + Elevation + Adjacent,family = poisson(link = "log") , data = gala)
summary(model_p_reduce)

## 
## Call:
## glm(formula = Species ~ Area + Elevation + Adjacent, family = poisson(link = "log"), 
##     data = gala)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.961e+00  4.773e-02   62.05   <2e-16 ***
## Area        -5.703e-04  2.518e-05  -22.65   <2e-16 ***
## Elevation    3.589e-03  8.608e-05   41.70   <2e-16 ***
## Adjacent    -7.508e-04  2.778e-05  -27.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3510.73  on 29  degrees of freedom
## Residual deviance:  818.74  on 26  degrees of freedom
## AIC: 987.57
## 
## Number of Fisher Scoring iterations: 5

Anova using chisq

anova(model_p_reduce, mod_p, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Species ~ Area + Elevation + Adjacent
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        26     818.74                          
## 2        24     716.85  2   101.89 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

“Upon comparing the Full and Reduced models, the Full Model yielded both a lower Residual Deviance and a lower AIC. Given that the Full Model also demonstrated statistical significance in the Likelihood Ratio Test, we conclude that it provides a superior fit to the data. While the Reduced Model is more parsimonious, the Full Model’s improved accuracy and predictive efficiency make it the preferred model for this analysis.”

lamdahat <- fitted(mod_p)
peoson_chi <- residuals(mod_p, type = "pearson")

phi_hat <- sum((peoson_chi)^2)/mod_p$df.residual

cat("Estimated dispersion parameter (phi):", phi_hat, "\n")

## Estimated dispersion parameter (phi): 31.74914

error <- (gala$Species - lamdahat)^2
plot(lamdahat, error, col = "black", main = "Mean vs variance" )
abline(0,1, lty = 1, col ="red")

plot(lamdahat, peoson_chi, main =  "mean vs pearson_residual" )
abline(h = 0, lty = 2 )

Assignment n.3 GLM: Logistic Regression 2026-04-01 The orings dataset The faraway library contains the orings dataset, which records the number of damaged O-rings and the launch temperature for 23 space shuttle flights prior to the Challenger disaster.

Load the data and prepare the response variable. Based on the variable damage, create a new binary variable named damage_bin indicating if at least one failure occurred (1 = at least one damage, 0 = no damage).

library(faraway)
data("orings")

head(orings)

##   temp damage
## 1   53      5
## 2   57      1
## 3   58      1
## 4   63      1
## 5   66      0
## 6   67      0

orings$damage_bin <-  ifelse(orings$damage > 0, 1,0)
orings

##    temp damage damage_bin
## 1    53      5          1
## 2    57      1          1
## 3    58      1          1
## 4    63      1          1
## 5    66      0          0
## 6    67      0          0
## 7    67      0          0
## 8    67      0          0
## 9    68      0          0
## 10   69      0          0
## 11   70      1          1
## 12   70      0          0
## 13   70      1          1
## 14   70      0          0
## 15   72      0          0
## 16   73      0          0
## 17   75      0          0
## 18   75      1          1
## 19   76      0          0
## 20   76      0          0
## 21   78      0          0
## 22   79      0          0
## 23   81      0          0

Use a boxplot to display the relationship between temperature (temp) and the newly created binary response.

# Plot temperature (y) against the binary damage variable (x)
# We convert damage_bin to a factor so R treats it as distinct categories
boxplot(temp ~ as.factor(damage_bin), data = orings,
        main = "Temperature vs. Failure Occurrence",
        xlab = "Failure (0=No, 1=Yes)",
        ylab = "Temperature")

“Exploratory analysis via boxplot suggests a negative association between temperature and failure occurrence. Specifically, the distribution of temperatures associated with failure ($1$) is shifted toward lower values. This implies that the risk of failure increases as temperature decreases, a relationship I will quantify formally by fitting a logistic regression model.”

Demonstrate why a standard linear regression is inappropriate here by fitting a linear model and predicting the outcome for temperatures of 31°F and 82°F.

lm_orings <- lm(damage_bin ~ temp, data = orings)
summary(lm_orings)

## 
## Call:
## lm(formula = damage_bin ~ temp, data = orings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43762 -0.30679 -0.06381  0.17452  0.89881 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.90476    0.84208   3.450  0.00240 **
## temp        -0.03738    0.01205  -3.103  0.00538 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3987 on 21 degrees of freedom
## Multiple R-squared:  0.3144, Adjusted R-squared:  0.2818 
## F-statistic:  9.63 on 1 and 21 DF,  p-value: 0.005383

temp_p <- data.frame(temp = c(31,82))
prediction <- predict(lm_orings, newdata = temp_p)
prediction

##          1          2 
##  1.7459524 -0.1604762

Standard linear regression is inappropriate here because it does not constrain predictions to the $[0, 1]$ interval, which is required for modeling the probability of an event. By using logistic regression with the logit link function, I can model the log-odds of failure as a linear function of temperature, effectively estimating the probability of damage occurring at any given temperature level.”

“The linear model is inappropriate because it produces predictions outside the valid range of probability ($[0, 1]$). My predictions of 1.75 for 31°F and -0.16 for 82°F are mathematically impossible outcomes for a binary response variable. This clearly demonstrates that the linear regression assumption—that the response is continuous and unbounded—is violated by my binary damage_bin data. This failure necessitates the use of a logistic regression model, which ensures that all predictions remain strictly 0 and 1.”

Fit the correct logistic regression model using the glm() function, utilizing the help(glm) documentation to properly set the family argument, and view the summary.

?glm
md_temp <- glm(damage_bin ~ temp, data = orings, family = binomial(link = "logit")  )

summary(md_temp)

## 
## Call:
## glm(formula = damage_bin ~ temp, family = binomial(link = "logit"), 
##     data = orings)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  15.0429     7.3786   2.039   0.0415 *
## temp         -0.2322     0.1082  -2.145   0.0320 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.315  on 21  degrees of freedom
## AIC: 24.315
## 
## Number of Fisher Scoring iterations: 5

“The logistic regression results confirm a statistically significant negative relationship between temperature and the occurrence of damage ($p = 0.0320$). The negative coefficient of -0.23 indicates that an increase in temperature is associated with a decrease in the log-odds of failure. Furthermore, the reduction in deviance from the null (28.27) to the residual model (20.32) indicates that temperature provides a statistically significant improvement in predicting damage occurrence compared to the null model.”

Use the anova() function to carry out a deviance difference test for the hypothesis that the coefficient of temp is equal to zero. Compare your full model to a null (intercept-only) model.

md_re <-  glm(damage_bin ~ 1, data = orings, family = binomial(link = "logit")  )

summary(md_re)

## 
## Call:
## glm(formula = damage_bin ~ 1, family = binomial(link = "logit"), 
##     data = orings)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.8267     0.4532  -1.824   0.0681 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 28.267  on 22  degrees of freedom
## AIC: 30.267
## 
## Number of Fisher Scoring iterations: 4

help("anova")


anova(md_re,md_temp,test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: damage_bin ~ 1
## Model 2: damage_bin ~ temp
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1        22     28.267                        
## 2        21     20.315  1    7.952 0.004804 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When your Residual Deviance is smaller than your Degrees of Freedom (Df), it indicates that your model is not overdispersed. In the context of a Binomial GLM, this is actually a sign of a “good” fit, as it suggests the model isn’t struggling with excess unexplained noise. “I performed a likelihood ratio test to compare the null model with the model containing temperature. The addition of the temperature predictor resulted in a statistically significant reduction in residual deviance of 7.95 ($p = 0.0048$). This allows me to reject the null hypothesis that the temperature coefficient is zero and conclude that temperature significantly influences the probability of failure.”

Produce a scatterplot of the binary data and overlay the fitted logistic regression curve.

plot(orings$temp,orings$damage_bin, pch = 20 , col = "lightgreen")

tp <- seq(min(orings$temp), max(orings$temp), length = 100)

lines(tp, predict(md_temp, newdata = data.frame(temp = tp), type = "response"), col = "darkred", lwd = 3)

“This scatterplot displays the binary outcome of failure against temperature. The overlaid curve is the fitted logistic regression model, which demonstrates the inverse relationship between temperature and the probability of damage. The S-shaped curve indicates that lower temperatures are associated with a higher probability of failure, with the model effectively capturing the transition from a high risk of damage at lower temperatures to a low risk at higher temperatures.”

Calculate the predicted probability of damage for a launch temperature of 31°F (the temperature on the day of the Challenger crash) and 82°F. Use the predict() function with type = “response”.

prediction <- predict(md_temp, newdata = data.frame(temp = c(31,82)), type = "response")
prediction

##          1          2 
## 0.99960878 0.01808462

“Using the predict() function with type =”response”, I calculated the probability of damage for specific temperatures. At 31°F, the model predicts a probability of failure near 1.0, highlighting the extreme risk present at the temperature of the Challenger launch. Conversely, at 82°F, the model predicts a near-zero probability of damage. This calculation effectively quantifies the risk, moving beyond a simple visual trend to provide an actionable estimate of the probability of failure under specific environmental conditions.”

Now, calculate the 95% confidence intervals for these predictions. Hint: The predict.glm() function does not support the interval argument. To do this manually:

First, use predict() with type = “link” and se.fit = TRUE to get the fitted values and standard errors on the logit scale. Calculate the lower and upper bounds on this scale (fit ± 1.96 se.fit). Finally, use the plogis() function to back-transform your results into actual probabilities.

pred_link <- predict(md_temp, newdata = data.frame(temp = c(31,82)), type = "link", se.fit = TRUE)
lower_bound <- pred_link$fit - 1.96 * pred_link$se.fit
upper_bound <- pred_link$fit + 1.96 * pred_link$se.fit
lower_confint <- plogis(lower_bound)
upper_confint <- plogis(upper_bound)
lower_confint

##            1            2 
## 0.4815742515 0.0007759897

upper_confint

##         1         2 
## 0.9999999 0.3040064

“The difference in the width of the confidence intervals reflects the distribution of the underlying data. The model is highly certain about the high risk of failure at 31°F because that temperature is closer to the ‘damage’ cluster in the data. Conversely, the interval for 82°F is wider because the model has less data in that region, leading to higher uncertainty regarding the exact probability. This interval demonstrates the value of the model: it doesn’t just provide a point estimate, but also a realistic range that communicates where we are more or less certain.”

Classification

library(ISLR2)
data("Default")
head(Default)

##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559

# 1. Map labels to meaningful values
# This is much easier to read than "0*15+1"
my_colors <- c("black", "red")
my_shapes <- c(1, 19)

# 2. Use those variables in your plot
plot(Default$balance, Default$income, 
     col = my_colors[as.factor(Default$default)], 
     pch = my_shapes[as.factor(Default$default)])

One interesting consideration is the potential bidirectional relationship between balance and default. While the model treats balance as a predictor of default, it is possible that the default itself contributes to an increasing balance due to accumulated interest and penalties. However, for predictive purposes, balance remains a highly reliable signal of high-risk accounts regardless of the causal direction.The visual analysis shows that the income variable does not provide a clear decision boundary for default status. We see significant overlap of both classes across the entire range of incomes, suggesting that income is not a strong discriminator between defaulters and non-defaulters. Consequently, the model assigns it a low statistical significance, as it offers little predictive power compared to the credit balance

mod_d <- glm(default ~ income + balance , data = Default, family = binomial  )
summary(mod_d)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

“In this model, the estimate represents the change in log-odds, indicating that the probability of default increases as both balance and income increase. Both predictors are statistically significant, as evidenced by their low p-values. However, relying solely on the summary table is insufficient. The standard errors and z-values are derived from the Wald test, which can lead to artificially inflated estimates when coefficients are large, potentially resulting in false negatives. Therefore, we must perform model comparison to rigorously evaluate the model’s efficiency and actual predictive power.”

mod_def <- glm(default ~ 1 , data = Default, family = binomial  )
summary(mod_def)

## 
## Call:
## glm(formula = default ~ 1, family = binomial, data = Default)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.36833    0.05574  -60.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 2920.6  on 9999  degrees of freedom
## AIC: 2922.6
## 
## Number of Fisher Scoring iterations: 6

anova(mod_def,mod_d, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: default ~ 1
## Model 2: default ~ income + balance
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      9999     2920.7                          
## 2      9997     1579.0  2   1341.7 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Decision boundary

#extract coefficients 



Beta_hat <- coefficients(mod_d)

Beta0_hat <- Beta_hat[1]
Beta1_hat <- Beta_hat[2]
Beta2_hat <- Beta_hat[3]

#b0 + b1 * income + b2 * balance = 0

# b1 * income = -(b2*balance) - b0

#income =  -(b2*balance)/b1 - b0 / b1


# slope = -b2/b1
#intercept = -b0/b1

slope = -Beta2_hat/Beta1_hat

intercept = -Beta0_hat/Beta1_hat
plot(Default$balance, Default$income, 
     col = my_colors[as.factor(Default$default)], 
     pch = my_shapes[as.factor(Default$default)])
abline(intercept,slope,lwd = 3,col = "darkblue")

confusion matrix

conf.matrix <- predict(mod_d, type = "response")
confit.class <- rep("NO", 10000)
confit.class[conf.matrix > 0.5 ] <- "YES"

#class[condition] <- value
conf_matrix <- table(predicted = confit.class, True = Default$default)
conf_matrix

##          True
## predicted   No  Yes
##       NO  9629  225
##       YES   38  108

overall_error <-(conf_matrix[1,2] + conf_matrix[2,1])/ sum(conf_matrix)
overall_error

## [1] 0.0263

false_negative_error <- conf_matrix[1,2]/ sum(conf_matrix[, 2])
false_negative_error

## [1] 0.6756757

“While the overall error rate of 2.63% indicates high general accuracy, it is deceptive due to the class imbalance in the dataset. A more critical metric is the False Negative Rate, which is approximately 67.6%. This reveals that the model fails to identify the majority of actual defaulters when using a 0.5 probability threshold. Consequently, while the model is highly accurate at identifying safe accounts, it is currently ineffective at fulfilling the primary business objective of catching high-risk defaults.”

conf.matrix <- predict(mod_d, type = "response")
trivial.class <- rep("NO", 10000)
trivial_matrix <- table(predicted = trivial.class, True = Default$default)
trivial_matrix

##          True
## predicted   No  Yes
##        NO 9667  333

overall_error_t <- 333/10000
overall_error_t

## [1] 0.0333

“The analysis shows that while the logistic model outperforms the trivial classifier in overall accuracy (2.63% vs 3.33%), the 0.5 probability threshold results in a high False Negative Rate of 67.6%. This indicates that the current model configuration is biased towards precision over recall. To improve risk detection, future iterations should focus on threshold tuning or adjusting for class imbalance to better identify high-risk accounts.”

Threshold 0.2

conf.matrix2 <- predict(mod_d, type = "response")
confit.class2 <- rep("NO", 10000)
confit.class2[conf.matrix2 > 0.2] <- "YES"

#class[condition] <- value
conf_matrix2 <- table(predicted = confit.class2, True = Default$default)
conf_matrix2

##          True
## predicted   No  Yes
##       NO  9396  133
##       YES  271  200

overall_error2 <-(conf_matrix2[1,2] + conf_matrix2[2,1])/ sum(conf_matrix2)
overall_error2

## [1] 0.0404

false_negative_error2 <- conf_matrix2[1,2]/ sum(conf_matrix2[, 2])
false_negative_error2

## [1] 0.3993994

Lowering the classification threshold to 0.2 demonstrates a deliberate shift in the model’s objective. While the overall error rate increased slightly from 2.6% to 4.0%, the False Negative Rate improved drastically from 67.6% to 39.9%. In a business context, this is a highly rational trade-off: it is generally better to incorrectly label a safe customer as ‘risky’ (a False Positive) than to fail to detect a customer who will actually default (a False Negative) and incur a total loss on that account.By lowering the threshold, our overall error rate increased to 4%, but our error rate among actual defaulters dropped massively from 67.5% to about 40%. We increase the sensitivity. We catch more defaults, but at the cost of investigating more false positives.

Default$default <- as.factor(Default$default)

perf.measure <- function(true.values, pred.values, lab.pos = 1){
  
conf.matrix <- table(pred.values,true.values) 
n <- sum(conf.matrix)

print(conf.matrix)
print(rownames(conf.matrix))

lab.pos <- as.character(lab.pos)
lab <- rownames(conf.matrix)
lab.neg <- lab[lab != lab.pos]


TP <- conf.matrix[lab.pos, lab.pos]
TN <- conf.matrix[lab.neg, lab.neg]
FP <- conf.matrix[lab.pos, lab.neg]
FN <- conf.matrix[lab.neg, lab.pos]


P <- TP + FN
N <- TN + FP
P.ast <- TP + FP

OER <- (FP + FN) / n
PPV <- TP/P.ast
TPR <- TP/P
F <- 2 * (PPV * TPR) / (PPV + TPR)
TNR <- TN/N
FPR <- FP/N
FNR <- FN /P


return(list(Overall.ER = OER, PPV =PPV , TPR = TPR, F = F, TNR = TNR, FPR = FPR , FNR = FNR ))
  
}

# Standardize your predictions to match the dataset labels
confit.class2 <- factor(confit.class2, levels = c("NO", "YES"), labels = c("No", "Yes"))

# Now call the function using "Yes"
PM <- perf.measure(Default$default, confit.class2, lab.pos = "Yes")

##            true.values
## pred.values   No  Yes
##         No  9396  133
##         Yes  271  200
## [1] "No"  "Yes"

PM

## $Overall.ER
## [1] 0.0404
## 
## $PPV
## [1] 0.4246285
## 
## $TPR
## [1] 0.6006006
## 
## $F
## [1] 0.4975124
## 
## $TNR
## [1] 0.9719665
## 
## $FPR
## [1] 0.02803352
## 
## $FNR
## [1] 0.3993994

At the 0.2 threshold, the model demonstrates a Sensitivity (TPR) of 60.06%, indicating that it successfully captures the majority of defaults. While this results in a higher False Positive Rate (2.8%) compared to the 0.5 threshold, it significantly reduces the False Negative Rate (Miss Rate) to 39.94%. The F1 Score of 0.4975 reflects a balanced attempt to minimize both missing high-risk accounts and incorrectly flagging safe ones. Given the business context, this 0.2 threshold provides a much more effective risk mitigation strategy than the 0.5 threshold.”The F1 Score of 0.4975 provides a conservative estimate of the model’s predictive power. It indicates that the model’s performance is currently limited by the trade-off between its ability to detect defaults (Sensitivity) and its tendency to flag safe accounts incorrectly (Precision). A score of ~0.5 suggests that while the model has moved beyond random guessing, it requires further tuning—such as adjusting the classification threshold or utilizing more complex feature engineering—to achieve higher reliability.

Precision (Positive Predictive Value - PPV):

Definition: “Of all the times the model shouted ‘YES!’, how often was it actually right?”

General Rule: High precision means your model is “trustworthy

False Positive Rate (FPR):

Definition: “Of all the truly safe cases, how many did the model accidentally label as ‘Yes’?”

Sensitivity (TPR / Recall):

Definition: “Of all the actual ‘Yes’ cases in the world, what percentage did the model successfully catch?”

General Rule: This is your “Net’s efficiency.” If you are catching a disease or fraud, you want this to be very high

False Negative Rate (FNR / Miss Rate):

Definition: “Of all the actual ‘Yes’ cases, how many did the model fail to see?”

General Rule: This is your “blind spot.” In critical scenarios (like medicine), this is the most dangerous metric.

Specificity (TNR):

Definition: “Of all the truly safe/negative cases, what percentage did the model correctly call ‘No’?”

General Rule: This tells you how good the model is at recognizing “normal” behavior.

Overall Error Rate (OER):Definition: “What percentage of all my guesses (both Yes and No) were wrong?”General Rule: Simple, but often misleading if your dataset is “unbalanced” (e.g., if 99% of people are “No,” a model that says “No” to everyone is 99% accurate but useless).

F1 Score:Definition: “The mathematical balance between Precision and Sensitivity.”General Rule: Use this when you need a single number to represent a model that is “good at both” catching events and being accurate when it flags them.

ROC curve

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

conf.matrix3 <- predict(mod_d, type = "response")
table(Default$default)

## 
##   No  Yes 
## 9667  333

roc.out <- roc(Default$default, conf.matrix3, levels = c("No","Yes") )

## Setting direction: controls < cases

plot(roc.out, print.auc = TRUE, legacy.axes = TRUE, 
     xlab = "False Positive Rate", ylab = "True Positive Rate",
     main = "ROC Curve - Logistic Regression")

?coords
coords(roc.out , "best")

##    threshold specificity sensitivity
## 1 0.03312645   0.8644874   0.8978979

coords(roc.out , 0.5)

##   threshold specificity sensitivity
## 1       0.5   0.9960691   0.3243243

AUC <- round(auc(roc.out), 3)
AUC

## [1] 0.949

Assignment n.4 Classification methods 2026-04-15 We first provide a function for the computation of several performance measures.

Function for the computation of performance measures

Arguments:

true.values = binary vector of true values

pred.values = binary vector of predicted values

lab.pos = label of the positive class

perf.measure <- function(true.values, pred.values,  lab.pos = 1){
  
  # compute the confusion matrix and number of units
  conf.matrix <- table(pred.values, true.values)
  n <- sum(conf.matrix)
  
  # force the label of positives to be a character string
  lab.pos <- as.character(lab.pos)
  
  # obtain the label of negatives
  lab <- rownames(conf.matrix)
  lab.neg <- lab[lab != lab.pos]
  
  # extract relevant quantities from the confusion matrix
  TP <- conf.matrix[lab.pos, lab.pos]
  TN <- conf.matrix[lab.neg, lab.neg]
  FP <- conf.matrix[lab.pos, lab.neg]
  FN <- conf.matrix[lab.neg, lab.pos]
  
  P     <- TP + FN
  N     <- FP + TN
  P.ast <- TP + FP
  
  # compute the performance measures
  OER <- (FP+FN)/n
  PPV <- TP/P.ast
  TPR <- TP/P
  F1  <- 2*PPV*TPR/(PPV+TPR)
  TNR <- TN/N
  FPR <- FP/N
  
  return(list(overall.ER = OER, PPV=PPV, TPR=TPR, F1=F1, TNR=TNR, FPR=FPR))
}

The Stock Market Data We will begin by examining some numerical and graphical summaries of the Smarket data, which is part of the ISLR2 package. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date). Our goal is to predict Direction (a qualitative response) using the other features.

library(ISLR2)
data("Smarket")
names(Smarket)

## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"

dim(Smarket)

## [1] 1250    9

summary(Smarket)

##       Year           Lag1                Lag2                Lag3          
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
##  Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
##       Lag4                Lag5              Volume           Today          
##  Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
##  1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
##  Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
##  Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
##  3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
##  Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
##  Direction 
##  Down:602  
##  Up  :648  
##            
##            
##            
##

pairs(Smarket)

The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the Direction variable is qualitative.

round(cor(Smarket[, -9]), 3)

##         Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today
## Year   1.000  0.030  0.031  0.033  0.036  0.030  0.539  0.030
## Lag1   0.030  1.000 -0.026 -0.011 -0.003 -0.006  0.041 -0.026
## Lag2   0.031 -0.026  1.000 -0.026 -0.011 -0.004 -0.043 -0.010
## Lag3   0.033 -0.011 -0.026  1.000 -0.024 -0.019 -0.042 -0.002
## Lag4   0.036 -0.003 -0.011 -0.024  1.000 -0.027 -0.048 -0.007
## Lag5   0.030 -0.006 -0.004 -0.019 -0.027  1.000 -0.022 -0.035
## Volume 0.539  0.041 -0.043 -0.042 -0.048 -0.022  1.000  0.015
## Today  0.030 -0.026 -0.010 -0.002 -0.007 -0.035  0.015  1.000

As one would expect, the correlations between the lag variables and today’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume. By plotting the data, which is ordered chronologically, we see that Volume is increasing over time.

plot(Smarket$Volume, main="Trading Volume over Time", ylab="Volume", xlab="Days")

Logistic Regression 1. Fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. 2. Apply the fitted logistic regression model to make a prediction as to whether the market will go up or down on a particular day (with a 0.5 threshold). 3. Obtain the confusion matrix, the overall error rate, and plot the ROC curve. 4. Refit the logistic regression using just Lag1 and Lag2, which seemed to have the highest predictive power in the original logistic regression model. How do the results change, and why? 5. Change the classification threshold. Re-classify the predictions of your reduced model using a threshold of 0.52 instead of 0.5. Extract the performance measures and comment on how the True Positive Rate (Sensitivity) and Positive Predictive Value (Precision) have changed. Linear Discriminant Analysis (LDA) 6. Apply linear discriminant analysis using Lag1 and Lag2 as predictors on the full dataset. Quadratic Discriminant Analysis (QDA) 7. Apply quadratic discriminant analysis using Lag1 and Lag2 as predictors. Naive Bayes 8. Apply Naive Bayes analysis using Lag1 and Lag2 as predictors.

Fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume

?Smarket

mod.s <- glm(Direction ~ Lag1 + Lag2 + Lag3   + Lag4 + Lag5 + Volume , data = Smarket, family = binomial )
summary(mod.s)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

“The model fails to demonstrate predictive utility. The high p-values across all predictors suggest that the features (Lag returns and Volume) do not possess a statistically significant relationship with the market direction. The minimal reduction in deviance ($3.6$) compared to the Null Deviance, combined with a high AIC, suggests that the model’s predictive power is effectively negligible and comparable to random chance. These results suggest that the efficient market hypothesis may hold for this dataset, or that these specific variables are insufficient for forecasting. The model shows a high Residual Deviance relative to the degrees of freedom, which indicates a poor goodness-of-fit. Since overdispersion is a property of Poisson models, I interpret this as evidence of model misspecification, likely because the chosen predictors (Lags/Volume) lack sufficient explanatory power to predict the response.

Apply the fitted logistic regression model to make a prediction as to whether the market will go up or down on a particular day (with a 0.5 threshold).

prob.s <- predict(mod.s, type = "response")

glm.class <- rep("Down", 1250)

glm.class[prob.s > 0.5] <- "Up"

Obtain the confusion matrix, the overall error rate, and plot the ROC curve.

# 3. Check that the lengths match before tabling
length(glm.class) # Should be 1250

## [1] 1250

length(Smarket$Direction) # Should be 1250

## [1] 1250

con.mat <- table(Predicted = glm.class, True = Smarket$Direction)
con.mat

##          True
## Predicted Down  Up
##      Down  145 141
##      Up    457 507

overal_error_r <- con.mat[1,2]+con.mat[2,1]/sum(con.mat)
overal_error_r

## [1] 141.3656

library(pROC)
roc.s <- roc(Smarket$Direction, prob.s, levels=c("Down", "Up"))

## Setting direction: controls < cases

plot(roc.s, print.auc = TRUE, legacy.axes = TRUE)

“The ROC curve illustrates the model’s classification performance with an AUC of 0.539. This value indicates that the model has negligible discriminatory power between ‘Up’ and ‘Down’ market days, performing only slightly better than a random baseline. This reinforces the conclusion that the lagged return variables (Lag1 and Lag2) do not possess sufficient signal to reliably predict market direction in this dataset.”

Refit the logistic regression using just Lag1 and Lag2, which seemed to have the highest predictive power in the original logistic regression model. How do the results change, and why?

mod.red <- glm(Direction ~ Lag1 + Lag2 , data = Smarket, family = binomial )
summary(mod.red)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Smarket)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.07425    0.05667   1.310    0.190
## Lag1        -0.07151    0.05010  -1.427    0.153
## Lag2        -0.04450    0.05000  -0.890    0.374
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1728.4  on 1247  degrees of freedom
## AIC: 1734.4
## 
## Number of Fisher Scoring iterations: 3

“The model was refitted using only Lag1 and Lag2 to achieve greater parsimony. The reduction in AIC confirms that the reduced model is more efficient than the full model. However, the lack of statistical significance ($p > 0.05$) for the predictors confirms that these variables remain insufficient for forecasting market direction. This refit demonstrates that while model simplification improves efficiency, it cannot overcome the fundamental absence of a predictive relationship between these specific lagged returns and the market’s daily direction.”

Change the classification threshold. Re-classify the predictions of your reduced model using a threshold of 0.52 instead of 0.5. Extract the performance measures and comment on how the True Positive Rate (Sensitivity) and Positive Predictive Value (Precision) have changed.

prob.s2 <- predict(mod.s , type = "response")

class.s2 <- rep("Down", 1250)

class.s2[prob.s2 > 0.52] <- "Up"


con.mat2 <- table(predicted = class.s2, True = Smarket$Direction)

con.mat2

##          True
## predicted Down  Up
##      Down  344 327
##      Up    258 321

PM1 <- perf.measure(Smarket$Direction, class.s2, lab.pos = "Up" )
PM1

## $overall.ER
## [1] 0.468
## 
## $PPV
## [1] 0.5544041
## 
## $TPR
## [1] 0.4953704
## 
## $F1
## [1] 0.5232274
## 
## $TNR
## [1] 0.5714286
## 
## $FPR
## [1] 0.4285714

PM2 <- perf.measure(Smarket$Direction, glm.class, lab.pos = "Up" )
PM2

## $overall.ER
## [1] 0.4784
## 
## $PPV
## [1] 0.5259336
## 
## $TPR
## [1] 0.7824074
## 
## $F1
## [1] 0.6290323
## 
## $TNR
## [1] 0.2408638
## 
## $FPR
## [1] 0.7591362

The performance evaluation of the two models reveals a significant trade-off between detection capability and predictive reliability. Model 2 with threshold 0.52 (PM2) demonstrates a superior ability to identify “Up” market days, achieving a Sensitivity (TPR) of 78.2% compared to Model 1’s (PM1) 49.5%. However, this increased detection comes at a cost; PM2’s Precision (PPV) drops slightly to 52.6%, and its Specificity (TNR) falls sharply to 24.1%, indicating that the model is much more prone to misidentifying “Down” days as “Up” days. Despite these frequent false alarms, PM2 achieves a higher F1 Score of 0.629, suggesting that it provides a more balanced approach to capturing market trends than PM1, which remains more cautious but misses a large portion of potential “Up” market movements.

Linear Discriminant Analysis (LDA) 6. Apply linear discriminant analysis using Lag1 and Lag2 as predictors on the full dataset.

?lda
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked _by_ '.GlobalEnv':
## 
##     Boston

## The following object is masked from 'package:ISLR2':
## 
##     Boston

mod.ld <- lda(Direction ~ Lag1 + Lag2 , data = Smarket)
mod.ld

## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket)
## 
## Prior probabilities of groups:
##   Down     Up 
## 0.4816 0.5184 
## 
## Group means:
##             Lag1        Lag2
## Down  0.05068605  0.03229734
## Up   -0.03969136 -0.02244444
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.7567605
## Lag2 -0.4707872

pred_lda <- predict(mod.ld,Smarket)
pred.class <- pred_lda$class

con.mat.ld <- table(Predicted = pred.class, True = Smarket$Direction)
con.mat.ld

##          True
## Predicted Down  Up
##      Down  114 102
##      Up    488 546

PM_L <- perf.measure(Smarket$Direction, pred.class, lab.pos = "Up")
PM_L

## $overall.ER
## [1] 0.472
## 
## $PPV
## [1] 0.5280464
## 
## $TPR
## [1] 0.8425926
## 
## $F1
## [1] 0.6492271
## 
## $TNR
## [1] 0.1893688
## 
## $FPR
## [1] 0.8106312

library(pROC)
roc_ld <- roc(Smarket$Direction , pred_lda$posterior[,2], levels = c("Down", "Up"))

## Setting direction: controls < cases

plot(roc_ld , print.auc = TRUE, legacy.axes = TRUE)

“The LDA model reveals that the market direction is negatively associated with the magnitude of previous day returns (Lag1 = -0.757; Lag2 = -0.471). The group means indicate that ‘Up’ days are preceded by lower average returns than ‘Down’ days, suggesting that the model is capturing a subtle mean-reverting trend in the Smarket data.”

Quadratic Discriminant Analysis (QDA) 7. Apply quadratic discriminant analysis using Lag1 and Lag2 as predictors.

mod.qd <- qda(Direction ~ Lag1 + Lag2 , data = Smarket)
mod.qd

## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket)
## 
## Prior probabilities of groups:
##   Down     Up 
## 0.4816 0.5184 
## 
## Group means:
##             Lag1        Lag2
## Down  0.05068605  0.03229734
## Up   -0.03969136 -0.02244444

pred_qda <- predict(mod.qd, Smarket)
pred.class.qda <- pred_qda$class

table(Predicted = pred.class.qda, Smarket$Direction)

##          
## Predicted Down  Up
##      Down  109  94
##      Up    493 554

PM_Q <- perf.measure(Smarket$Direction,pred.class.qda, lab.pos = "Up")
PM_Q

## $overall.ER
## [1] 0.4696
## 
## $PPV
## [1] 0.5291309
## 
## $TPR
## [1] 0.8549383
## 
## $F1
## [1] 0.6536873
## 
## $TNR
## [1] 0.1810631
## 
## $FPR
## [1] 0.8189369

roc.qd <- roc(Smarket$Direction, pred_qda$posterior[,2], levels = c("Down", "Up"))

## Setting direction: controls < cases

plot(roc.qd, print.auc = TRUE,legacy.axes = TRUE)

“QDA was employed to relax the assumption of equal variance between the ‘Up’ and ‘Down’ groups. By utilizing a quadratic decision boundary, the model can capture non-linear relationships that LDA, with its rigid linear boundary, might overlook. This is particularly useful if the market exhibits different volatility characteristics during bullish versus bearish regimes.”

Naive Bayes 8. Apply Naive Bayes analysis using Lag1 and Lag2 as predictors.

library(e1071)
mod.NB <- naiveBayes(Direction ~ Lag1 + Lag2 , data = Smarket)
mod.NB

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   Down     Up 
## 0.4816 0.5184 
## 
## Conditional probabilities:
##       Lag1
## Y             [,1]     [,2]
##   Down  0.05068605 1.141070
##   Up   -0.03969136 1.130989
## 
##       Lag2
## Y             [,1]     [,2]
##   Down  0.03229734 1.157174
##   Up   -0.02244444 1.116768

class_NB <- predict(mod.NB, Smarket)

table(Predicted = class_NB, True = Smarket$Direction)

##          True
## Predicted Down  Up
##      Down  106  93
##      Up    496 555

PM_NB <- perf.measure(Smarket$Direction, class_NB, lab.pos = "Up")
PM_NB

## $overall.ER
## [1] 0.4712
## 
## $PPV
## [1] 0.5280685
## 
## $TPR
## [1] 0.8564815
## 
## $F1
## [1] 0.6533255
## 
## $TNR
## [1] 0.1760797
## 
## $FPR
## [1] 0.8239203

N.posterior <- predict(mod.NB, Smarket, type = "raw")
library(pROC)
roc.NB <- roc(Smarket$Direction, N.posterior[, 2], levels = c("Down", "Up"))

## Setting direction: controls < cases

plot(roc.NB, print.auc = TRUE , legacy.axes = TRUE)

“The Naive Bayes model confirms the statistical patterns identified in the LDA and QDA analyses, specifically the mean-reverting relationship where ‘Down’ days follow positive lagged returns. By assuming independence between Lag1 and Lag2, Naive Bayes provides a robust, simplified benchmark for classification. The consistency in the conditional means across all three models (LDA, QDA, and Naive Bayes) reinforces the validity of the identified market pattern, even though the overall predictive power remains constrained by the inherent volatility of the Smarket data.”

“The negative group means for ‘Up’ days indicate that market gains are statistically preceded by negative lagged returns. This suggests the presence of a mean-reversion process where the market corrects itself after downward pressure. The predictive models (LDA, QDA, and Naive Bayes) consistently leverage this ‘bounce-back’ behavior to classify market direction, confirming that past returns are inversely related to subsequent market movements in this dataset.”

MODEL SELECTION

library(ISLR2)
names(Hitters)

##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
##  [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
## [13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
## [19] "Salary"    "NewLeague"

data(Hitters)
head(Hitters)

##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson      293   66     1   30  29    14     1    293    66      1
## -Alan Ashby         315   81     7   24  38    39    14   3449   835     69
## -Alvin Davis        479  130    18   66  72    76     3   1624   457     63
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
## -Andres Galarraga   321   87    10   39  42    30     2    396   101     12
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
##                   CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson       30   29     14      A        E     446      33     20
## -Alan Ashby         321  414    375      N        W     632      43     10
## -Alvin Davis        224  266    263      A        W     880      82     14
## -Andre Dawson       828  838    354      N        E     200      11      3
## -Andres Galarraga    48   46     33      N        E     805      40      4
## -Alfredo Griffin    501  336    194      A        W     282     421     25
##                   Salary NewLeague
## -Andy Allanson        NA         A
## -Alan Ashby        475.0         N
## -Alvin Davis       480.0         A
## -Andre Dawson      500.0         N
## -Andres Galarraga   91.5         N
## -Alfredo Griffin   750.0         A

sum(is.na(Hitters))

## [1] 59

Hitters <- na.omit(Hitters)
dim(Hitters)

## [1] 263  20

sum(is.na(Hitters))

## [1] 0

WThe regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The syntax is the same as for lm(). The summary() command outputs the best set of variables for each model size.

library(leaps)

## Warning: package 'leaps' was built under R version 4.6.1

hit.reg <- regsubsets(Salary ~ . , Hitters)
summary(hit.reg)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
## 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " " 
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " " 
##          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 ) "*"    " "     "*"       "*"     " "     " "    " "

An asterisk indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Hits and CRBI. By default, regsubsets() only reports results up to the best eight-variable model. But the nvmax option can be used in order to return as many variables as are desired. Here we fit up to a 19-variable model.

library(leaps)
hit.reg <- regsubsets(Salary ~ . , Hitters, nvmax = 19)
summary(hit.reg)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters, nvmax = 19)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
## 7  ( 1 )  " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " " 
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " " 
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"   "*" 
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"   "*" 
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"   "*" 
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"   "*" 
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"   "*" 
##           CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"    "*"     "*"       "*"     "*"     "*"    "*"

reg_summary <- summary(hit.reg)

reg_summary$cp

##  [1] 104.281319  50.723090  38.693127  27.856220  21.613011  14.023870
##  [7]  13.128474   7.400719   6.158685   5.009317   5.874113   7.330766
## [13]   8.888112  10.481576  12.346193  14.187546  16.087831  18.011425
## [19]  20.000000

reg_summary$bic

##  [1]  -90.84637 -128.92622 -135.62693 -141.80892 -144.07143 -147.91690
##  [7] -145.25594 -147.61525 -145.44316 -143.21651 -138.86077 -133.87283
## [13] -128.77759 -123.64420 -118.21832 -112.81768 -107.35339 -101.86391
## [19]  -96.30412

reg_summary$adjr2

##  [1] 0.3188503 0.4208024 0.4450753 0.4672734 0.4808971 0.4972001 0.5007849
##  [8] 0.5137083 0.5180572 0.5222606 0.5225706 0.5217245 0.5206736 0.5195431
## [15] 0.5178661 0.5162219 0.5144464 0.5126097 0.5106270

reg_summary$rss

##  [1] 36179679 30646560 29249297 27970852 27149899 26194904 25906548 25136930
##  [9] 24814051 24500402 24387345 24333232 24289148 24248660 24235177 24219377
## [17] 24209447 24201837 24200700

plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
adj_max <- which.max(reg_summary$adjr2)
points(adj_max, reg_summary$adjr2[adj_max], col = "red", cex = 2, pch = 20)

plot(hit.reg, scale = "bic")

coef(hit.reg,6)

##  (Intercept)        AtBat         Hits        Walks         CRBI    DivisionW 
##   91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 -122.9515338 
##      PutOuts 
##    0.2643076

predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}

set.seed(1)
n <- nrow(Hitters)
train <- sample(c(TRUE,FALSE), n , replace = TRUE)
test <- (!train)

reg.forward <- regsubsets(Salary ~ . , data = Hitters[train,], method = "forward", nvmax = 19)

val.errors <- rep(NA, 19)




for ( i in 1:19) {
  pred <- predict(reg.forward , Hitters[test, ] , id = i)
  val.errors[i] <- mean((Hitters$Salary[test] - pred)^2)
}



val.opt.size <- which.min(val.errors)
plot(val.errors, type = "b", pch = 16)
points(val.opt.size, val.errors[val.opt.size], pch = 20, col = "red")

k = 10
folds <- sample(rep((1:k), length = n))
cv.errors <- matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
for (j in 1:k) {
  # Train forward selection on all folds except j
  best.fit <- regsubsets(Salary ~ ., data = Hitters[folds != j, ], nvmax = 19, method = "forward")
  
  # Predict on fold j and calculate MSE
  for (i in 1:19) {
    pred <- predict(best.fit, Hitters[folds == j, ], id = i)
    cv.errors[j, i] <- mean((Hitters$Salary[folds == j] - pred)^2)
  }
}


mean.cv.errors <- apply(cv.errors, 2, mean)
opt.cv.size <- which.min(mean.cv.errors)

plot(1:19, mean.cv.errors, type = "b", pch = 16, xlab = "Number of Variables", ylab = "10-Fold CV MSE")
points(opt.cv.size, mean.cv.errors[opt.cv.size], col = "red", cex = 2, pch = 20)

reg.fwd.final <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19,
                            method = "forward")
coef(reg.fwd.final, 10)

##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
##  162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798    1.4082490 
##         CRBI       CWalks    DivisionW      PutOuts      Assists 
##    0.7743122   -0.8308264 -112.3800575    0.2973726    0.2831680

Assignment n.5 Model selection 2026-04-22 We first provide our custom function for the computation of classification performance measures.

perf.measure <- function(true.values, pred.values,  lab.pos = 1){
  conf.matrix <- table(pred.values, true.values)
  n <- sum(conf.matrix)
  lab.pos <- as.character(lab.pos)
  lab <- rownames(conf.matrix)
  lab.neg <- lab[lab != lab.pos]
  
  TP <- conf.matrix[lab.pos, lab.pos]
  TN <- conf.matrix[lab.neg, lab.neg]
  FP <- conf.matrix[lab.pos, lab.neg]
  FN <- conf.matrix[lab.neg, lab.pos]
  
  P     <- TP + FN
  N     <- FP + TN
  P.ast <- TP + FP
  
  OER <- (FP+FN)/n
  PPV <- TP/P.ast
  TPR <- TP/P
  F1  <- 2*PPV*TPR/(PPV+TPR)
  TNR <- TN/N
  FPR <- FP/N
  
  return(list(overall.ER = OER, PPV=PPV, TPR=TPR, F1=F1, TNR=TNR, FPR=FPR))
}

The Default Dataset For this assignment, we will use the Default dataset from the ISLR2 package. The goal is to predict which customers will default on their credit card debt based on their student status, credit card balance, and income.

library(ISLR2)
data("Default")
summary(Default)

##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

Fit a full Logistic Regression model. Then apply Backward Stepwise Selection based on the BIC to identify the best subset of predictors. Use the summary() function to summarize the properties of the selected model. Hint: Since the leaps package (regsubsets) only supports linear regression, use the step() function with direction = “backward” and explore the help function and the argument k to apply the BIC penalty instead of AIC.

def_glm <- glm(default ~ . , data = Default, family = binomial)
summary(def_glm)

## 
## Call:
## glm(formula = default ~ ., family = binomial, data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8

n <- nrow(Default)
help("step")
glm_full_def <- step(def_glm, direction = "backward",  k = log(n))

## Start:  AIC=1608.39
## default ~ student + balance + income
## 
##           Df Deviance    AIC
## - income   1   1571.7 1599.3
## - student  1   1579.0 1606.6
## <none>         1571.5 1608.4
## - balance  1   2907.5 2935.1
## 
## Step:  AIC=1599.31
## default ~ student + balance
## 
##           Df Deviance    AIC
## <none>         1571.7 1599.3
## - student  1   1596.5 1614.9
## - balance  1   2908.7 2927.1

summary(glm_full_def)

## 
## Call:
## glm(formula = default ~ student + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.075e+01  3.692e-01 -29.116  < 2e-16 ***
## studentYes  -7.149e-01  1.475e-01  -4.846 1.26e-06 ***
## balance      5.738e-03  2.318e-04  24.750  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.7  on 9997  degrees of freedom
## AIC: 1577.7
## 
## Number of Fisher Scoring iterations: 8

Carry out a diagnostic check of the selected model by computing its performance measures using a standard 0.5 probability threshold. Considering the problem setting (predicting credit defaults), which is the best error measure to use in this case?
Try to improve the model by including a quadratic component for the balance predictor. Fit this model, calculate its BIC, and compare it to the model selected in Question 1.
Try to improve the model by including all two-way interaction effects between all original predictors. Use Backward Stepwise Selection (via BIC) again to see if any interaction terms should be retained.
Apply 10-fold Cross-Validation to estimate the test performance of our chosen model (student + balance). Hint: Use the cv.glm() function from the boot package. Because cv.glm() expects a numeric response, first create a 0/1 variable. Furthermore, since we established that the Overall Error Rate is misleading for this data, supply to the argument cost a custom cost function that uses your perf.measure() function to calculate the F1 Error (defined as 1 - F1).
Now, consider a subsample of 1000 customers. set.seed(123) sub_idx <- sample(1:nrow(Default), 1000) Default_sub <- Default[sub_idx, ] Based on this subsample run the following analysis. Suppose the bank determines that issuing a credit card to someone who defaults (a False Negative) is much more expensive (costs 20k on average) than denying a card to someone who would be safe (a False Positive, costing 1k on average). How should this affect your classification threshold? Construct your own customized cost function and perform leave-one-out cross-validation to adjust the threshold among a set of 5 pre-defined values. Hint: The cost function should be defined in the form cost.financial <- function(r, pi=0, tau=0.5). Since cv.glm() only accepts a cost function with two arguments (r and pi), you can wrap your customized function inside the CV loop to pass the current threshold. To keep computation time reasonable, take a random sample of 1,000 observations from the dataset before running your loop.

1. **The Hitters Data**

This dataset is part of the R-package ISLR2 and contains certain statistics and salaries of major league baseball players for the years 1986–87. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance. It consists of 20 variables and 322 observations, only the Salary variable has missing observations.

library(ISLR2)
names(Hitters)

##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
##  [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
## [13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
## [19] "Salary"    "NewLeague"

dim(Hitters)

## [1] 263  20

sum(is.na(Hitters$Salary))

## [1] 0

First of all, we note that the Salary variable is missing for some of the players. The is.na() function can be used to identify the missing observations. It returns a vector of the same length as the input vector, with a TRUE for any elements that are missing, and a FALSE for non-missing elements. The sum() function can then be used to count all of the missing elements.

Hence we see that Salary is missing for 59 players. The na.omit() function removes all of the rows that have missing values in any variable.

Hitters <- na.omit(Hitters)
dim(Hitters)

## [1] 263  20

sum(is.na(Hitters))

## [1] 0

After omitting them, we have 263 complete observations to use for modeling.

2. Subset Selection via Analytical Indexes

WThe regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The syntax is the same as for lm(). The summary() command outputs the best set of variables for each model size.

library(leaps)
regfit.full <- regsubsets(Salary ~ ., Hitters)
summary(regfit.full)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "   "*" 
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "   "*" 
## 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "   " " 
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"   " " 
##          CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 ) " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 ) "*"    " "     "*"       "*"     " "     " "    " "

An asterisk indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Hits and CRBI. By default, regsubsets() only reports results up to the best eight-variable model. But the nvmax option can be used in order to return as many variables as are desired. Here we fit up to a 19-variable model.

regfit.full <- regsubsets(Salary ~ ., data = Hitters,
    nvmax = 19)
reg.summary <- summary(regfit.full)

The summary() function also returns $R^2$, RSS, adjusted $R^2$, $C_p$, and BIC. We can plot these, highlighting the optimal model size for each metric, to try to select the best overall model.

par(mfrow = c(2, 2))

# Plot Adjusted R-squared
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
adj_max <- which.max(reg.summary$adjr2)
points(adj_max, reg.summary$adjr2[adj_max], col = "red", cex = 2, pch = 20)

# Plot Cp
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
cp_min <- which.min(reg.summary$cp)
points(cp_min, reg.summary$cp[cp_min], col = "red", cex = 2, pch = 20)

# Plot BIC
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
bic_min <- which.min(reg.summary$bic)
points(bic_min, reg.summary$bic[bic_min], col = "red", cex = 2, pch = 20)

Adjusted $R^2$ selects an 11-variable model.
$C_p$ selects a 10-variable model.
BIC selects a 6-variable model. BIC places a heavier penalty on the number of variables, leading to a simpler, more parsimonious model.

The regsubsets() function has a built-in plot() command which can be used to display the selected variables for the best model with a given number of predictors, ranked according to the BIC, $C_p$, or adjusted $R^2$.

plot(regfit.full, scale = "bic")

The top row of each plot contains a black square for each variable selected according to the optimal model associated with that statistic. For instance, we see that several models share a BIC close to $-150$. However, the model with the lowest BIC is the six-variable model that contains only AtBat, Hits, Walks, CRBI, DivisionW, and PutOuts. We can use the coef() function to see the coefficient estimates associated with this model.

coef(regfit.full, 6)

##  (Intercept)        AtBat         Hits        Walks         CRBI    DivisionW 
##   91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 -122.9515338 
##      PutOuts 
##    0.2643076

3. Forward Stepwise Selection via Held-Out Approximation

Analytical indexes are approximations. A more direct way to estimate the true test error is to use resampling techniques like the validation set approach and cross-validation.

Best subset selection can be computationally prohibitive on very large datasets. To demonstrate an alternative, we will use Forward Stepwise Selection for our resampling techniques.

First, we define a custom predict() method for regsubsets objects, as one is not provided by default in R.

predict.regsubsets <- function(object, newdata, id, ...) {
  # 1. Extract the formula used in the original regsubsets() call
  form <- as.formula(object$call[[2]])
  
  # 2. Build a design matrix ('X' matrix) from the new data using that formula.
  # This safely handles dummy variable creation and adds the intercept column.
  mat <- model.matrix(form, newdata)
  
  # 3. Extract the exact coefficients for the best model of size 'id'
  coefi <- coef(object, id = id)
  
  # 4. Identify the names of the specific predictors kept in this model size
  xvars <- names(coefi)
  
  # 5. Extract only the relevant columns from the design matrix and multiply 
  # them by the coefficients using matrix multiplication (X * beta)
  mat[, xvars] %*% coefi
}

3.1 The Validation Set Approach

We begin by splitting the data into a training set and a validation set.

set.seed(1)
n <- nrow(Hitters)
train <- sample(c(TRUE, FALSE), n, replace = TRUE)
test <- (!train)

We perform forward stepwise selection only on the training data. This prevents information leakage and ensures our validation errors are honest estimates of test performance.

regfit.fwd <- regsubsets(Salary ~ ., data = Hitters[train, ], nvmax = 19,
                         method = "forward")

(Note: Backward stepwise selection can be performed just as easily by passing method = "backward" to the function).

Next, we loop through all 19 model sizes, making predictions on the held-out test data and calculating the Mean Squared Error (MSE).

val.errors <- rep(NA, 19)
for (i in 1:19) {
  pred <- predict(regfit.fwd, Hitters[test, ], id = i)
  val.errors[i] <- mean((Hitters$Salary[test] - pred)^2)
}

Finally, we find the model size that minimizes the validation error and plot the results.

opt.val.size <- which.min(val.errors)

par(mfrow = c(1, 1))
plot(1:19, val.errors, type = "b", pch = 16, xlab = "Number of Variables", ylab = "Validation MSE")
points(opt.val.size, val.errors[opt.val.size], col = "red", cex = 2, pch = 20)

The validation set approach identifies the 7-variable model as the one with the lowest test MSE. However, this result is highly dependent on the initial random split.

3.2 k-Fold Cross-Validation

To reduce the variance introduced by a single train/test split, we use 10-fold cross-validation. We first create our folds and an empty matrix to store the errors.

k <- 10
set.seed(1)
folds <- sample(rep(1:k, length = n))
cv.errors <- matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))

We loop over each of the 10 folds. In each iteration, one fold acts as the test set while we train the forward stepwise model on the remaining 9 folds. We then compute the MSE for all 19 model sizes.

for (j in 1:k) {
  # Train forward selection on all folds except j
  best.fit <- regsubsets(Salary ~ ., data = Hitters[folds != j, ], nvmax = 19, method = "forward")
  
  # Predict on fold j and calculate MSE
  for (i in 1:19) {
    pred <- predict(best.fit, Hitters[folds == j, ], id = i)
    cv.errors[j, i] <- mean((Hitters$Salary[folds == j] - pred)^2)
  }
}

We average the errors across all 10 folds for each model size to find the overall optimal model.

mean.cv.errors <- apply(cv.errors, 2, mean)
opt.cv.size <- which.min(mean.cv.errors)

plot(1:19, mean.cv.errors, type = "b", pch = 16, xlab = "Number of Variables", ylab = "10-Fold CV MSE")
points(opt.cv.size, mean.cv.errors[opt.cv.size], col = "red", cex = 2, pch = 20)

Cross-validation provides a more robust estimate of test error and selects a 10-variable model as optimal.

Now that CV has identified the optimal size, we may refit the forward selection on the full dataset to extract the final coefficients.

reg.fwd.final <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19,
                            method = "forward")
coef(reg.fwd.final, 10)

##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
##  162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798    1.4082490 
##         CRBI       CWalks    DivisionW      PutOuts      Assists 
##    0.7743122   -0.8308264 -112.3800575    0.2973726    0.2831680

4. Fast Leave-One-Out Cross-Validation (LOOCV)

4.1 Cross Validation function

The custom prediction loop above gives us total control over the subsetting process. However, if we simply want to evaluate a specific model using Leave-One-Out Cross-Validation, we can use the cv.glm() function from the boot package that implements functions for cross-validation for GLMs. For this reason we need to fit linear regression models with the glm function.

Let’s test the specific 6-variable model we previously identified using BIC. We fit a standard generalized linear model (which defaults to linear regression) and pass it to cv.glm().

library(boot)

## 
## Attaching package: 'boot'

## The following objects are masked from 'package:faraway':
## 
##     logit, melanoma

# Fit a GLM using the 6 variables selected by BIC
glm.fit.6 <- glm(Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = Hitters)

# Perform LOOCV. Since K is not specified, it defaults to K = n.
n <- nrow(Hitters)
cv.err.loo.6 <- cv.glm(Hitters, glm.fit.6, K = n)

cv.err.loo.6$K

## [1] 263

cv.err.loo.6$delta

## [1] 107943.0 107926.7

The returned object contains a few useful components. $K confirms the number of folds used (263, which is $n$, indicating LOOCV).

The $delta component is a vector of length two:

The first component is the raw cross-validation estimate of the prediction error.
The second component is the adjusted cross-validation estimate, designed to compensate for the slight bias introduced by not using all the observations during the training folds. (For LOOCV, these two numbers are very very close).

Because Hitters is a relatively small dataset, this $n$-fold loop runs almost instantaneously. Now, let’s compare this to the 10-variable model selected by our 10-fold cross-validation earlier.

# Fit a GLM using the 10 variables typically selected by the stepwise CV
glm.fit.10 <- glm(Salary ~ AtBat + Hits + Walks + CAtBat + CRuns + CRBI + 
                           CWalks + League + Division + PutOuts, data = Hitters)

# Perform LOOCV on the 10-variable model
cv.err.loo.10 <- cv.glm(Hitters, glm.fit.10, K = n)

cv.err.loo.10$delta

## [1] 107306.8 107279.6

Finally, we can compare the raw LOOCV MSE of both models directly:

cat("6-Variable Model LOOCV Error: ", cv.err.loo.6$delta[2], "\n")

## 6-Variable Model LOOCV Error:  107926.7

cat("10-Variable Model LOOCV Error:", cv.err.loo.10$delta[2], "\n")

## 10-Variable Model LOOCV Error: 107279.6

The 10-variable model yields a slightly lower LOOCV error than the 6-variable model, confirming the results we saw in our previous $k$-fold cross-validation loop. While the 6-variable model is more parsimonious, the 10-variable model provides marginally better out-of-sample predictive accuracy.

4.2 The LOOCV Analytic Shortcut

For standard linear regression, there is a mathematical shortcut that allows us to calculate the exact LOOCV error without having to physically refit the model $n$ times. We can fit the model just once on the full dataset and adjust each residual by dividing it by $1 - h_i$, where $h_i$ represents the leverage value (the diagonal element of the hat matrix) for that specific observation.

The formula for the exact LOOCV MSE is: \[\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2\]

We can compute this manually in R using the residuals() and hatvalues() functions. Let’s apply this to our 6-variable model and compare it to the output from cv.glm():

# Fit the model using standard lm()
lm.fit.6 <- lm(Salary ~ AtBat + Hits + Walks + CRBI + Division + PutOuts, data = Hitters)

# Compute the exact LOOCV error using the shortcut formula
loocv.exact <- mean((residuals(lm.fit.6) / (1 - hatvalues(lm.fit.6)))^2)

cat("Analytic Shortcut LOOCV Error: ", loocv.exact, "\n")

## Analytic Shortcut LOOCV Error:  107943

cat("cv.glm() LOOCV Error:          ", cv.err.loo.6$delta[1], "\n")

## cv.glm() LOOCV Error:           107943

The analytic shortcut yields the exact same MSE as the $n$-fold loop performed by cv.glm(). This algebraic trick is incredibly efficient, making the computational cost of LOOCV essentially identical to a single model fit when working with least squares regression.

Appendix: variables in Hitter dataset

The definitions of the variables of the dataset are as follows:

-AtBat: Number of shots made with a baseball bat during the 1986–1987 season,

-Hits: Number of hits made in the 1986–1987 season,

-HmRun: Most valuable hits in the 1986–1987 season,

-Runs: The points he earned for his team in the 1986–1987 season,

-RBI: Number of players a batsman had jogged when he hit in the season,

-Walks: Number of mistakes made by the opposing player,

-Years: Player’s playing time in major league (in year),

-CAtBat: Number of shots made with a baseball bat in career,

-CHits: Number of hits made in the career,

-CHmRun: Most valuable hits in the career,

-CRuns: The points he earned for his team in his career,

-CRBI: Number of players a batsman had jogged when he hit in the career,

-CWalks: Number of mistakes made by the opposing player in career,

-League: A factor with A and N levels showing the league in which the player played until the end of the season,

-Division: A factor with levels E and W indicating the position played by the player at the end of 1986,

-PutOuts: Helping your teammate in-game,

-Assists: Number of assists made by the player in the 1986–1987 season,

-Errors: Player’s errors in the 1986–1987 season,

-Salary: The salary of the player in the 1986–1987 season (in thousand),

-NewLeague: A factor with A and N levels showing the player’s league at the start of the 1987 season.

The Ames Housing Dataset

In this lab, we will apply shrinkage methods (Ridge and Lasso regression) to the Ames Housing dataset.

# Load the pre-cleaned data
ames <- read.csv("AmesHousing_Cleaned.csv")

This dataset contains information on residential home sales in Ames, Iowa. It is a robust dataset consisting of 2,930 observations and over 80 features.

# explore the data
dim(ames)

## [1] 2930   81

str(ames)

## 'data.frame':    2930 obs. of  81 variables:
##  $ MS_SubClass       : chr  "One_Story_1946_and_Newer_All_Styles" "One_Story_1946_and_Newer_All_Styles" "One_Story_1946_and_Newer_All_Styles" "One_Story_1946_and_Newer_All_Styles" ...
##  $ MS_Zoning         : chr  "Residential_Low_Density" "Residential_High_Density" "Residential_Low_Density" "Residential_Low_Density" ...
##  $ Lot_Frontage      : int  141 80 81 93 74 78 41 43 39 60 ...
##  $ Lot_Area          : int  31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
##  $ Street            : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley             : chr  "No_Alley_Access" "No_Alley_Access" "No_Alley_Access" "No_Alley_Access" ...
##  $ Lot_Shape         : chr  "Slightly_Irregular" "Regular" "Slightly_Irregular" "Regular" ...
##  $ Land_Contour      : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities         : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ Lot_Config        : chr  "Corner" "Inside" "Corner" "Corner" ...
##  $ Land_Slope        : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood      : chr  "North_Ames" "North_Ames" "North_Ames" "North_Ames" ...
##  $ Condition_1       : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition_2       : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ Bldg_Type         : chr  "OneFam" "OneFam" "OneFam" "OneFam" ...
##  $ House_Style       : chr  "One_Story" "One_Story" "One_Story" "One_Story" ...
##  $ Overall_Qual      : chr  "Above_Average" "Average" "Above_Average" "Good" ...
##  $ Overall_Cond      : chr  "Average" "Above_Average" "Above_Average" "Average" ...
##  $ Year_Built        : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
##  $ Year_Remod_Add    : int  1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
##  $ Roof_Style        : chr  "Hip" "Gable" "Hip" "Hip" ...
##  $ Roof_Matl         : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior_1st      : chr  "BrkFace" "VinylSd" "Wd Sdng" "BrkFace" ...
##  $ Exterior_2nd      : chr  "Plywood" "VinylSd" "Wd Sdng" "BrkFace" ...
##  $ Mas_Vnr_Type      : chr  "Stone" "None" "BrkFace" "None" ...
##  $ Mas_Vnr_Area      : int  112 0 108 0 0 20 0 0 0 0 ...
##  $ Exter_Qual        : chr  "Typical" "Typical" "Typical" "Good" ...
##  $ Exter_Cond        : chr  "Typical" "Typical" "Typical" "Typical" ...
##  $ Foundation        : chr  "CBlock" "CBlock" "CBlock" "CBlock" ...
##  $ Bsmt_Qual         : chr  "Typical" "Typical" "Typical" "Typical" ...
##  $ Bsmt_Cond         : chr  "Good" "Typical" "Typical" "Typical" ...
##  $ Bsmt_Exposure     : chr  "Gd" "No" "No" "No" ...
##  $ BsmtFin_Type_1    : chr  "BLQ" "Rec" "ALQ" "ALQ" ...
##  $ BsmtFin_SF_1      : int  2 6 1 1 3 3 3 1 3 7 ...
##  $ BsmtFin_Type_2    : chr  "Unf" "LwQ" "Unf" "Unf" ...
##  $ BsmtFin_SF_2      : int  0 144 0 0 0 0 0 0 0 0 ...
##  $ Bsmt_Unf_SF       : int  441 270 406 1045 137 324 722 1017 415 994 ...
##  $ Total_Bsmt_SF     : int  1080 882 1329 2110 928 926 1338 1280 1595 994 ...
##  $ Heating           : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ Heating_QC        : chr  "Fair" "Typical" "Typical" "Excellent" ...
##  $ Central_Air       : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical        : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ First_Flr_SF      : int  1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
##  $ Second_Flr_SF     : int  0 0 0 0 701 678 0 0 0 776 ...
##  $ Low_Qual_Fin_SF   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Gr_Liv_Area       : int  1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
##  $ Bsmt_Full_Bath    : int  1 0 0 1 0 0 1 0 1 0 ...
##  $ Bsmt_Half_Bath    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full_Bath         : int  1 1 1 2 2 2 2 2 2 2 ...
##  $ Half_Bath         : int  0 0 1 1 1 1 0 0 0 1 ...
##  $ Bedroom_AbvGr     : int  3 2 3 3 3 3 2 2 2 3 ...
##  $ Kitchen_AbvGr     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen_Qual      : chr  "Typical" "Typical" "Good" "Excellent" ...
##  $ TotRms_AbvGrd     : int  7 5 6 8 6 7 6 5 5 7 ...
##  $ Functional        : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces        : int  2 0 0 2 1 1 0 0 1 1 ...
##  $ Fireplace_Qu      : chr  "Good" "No_Fireplace" "No_Fireplace" "Typical" ...
##  $ Garage_Type       : chr  "Attchd" "Attchd" "Attchd" "Attchd" ...
##  $ Garage_Finish     : chr  "Fin" "Unf" "Unf" "Fin" ...
##  $ Garage_Cars       : int  2 1 1 2 2 2 2 2 2 2 ...
##  $ Garage_Area       : int  528 730 312 522 482 470 582 506 608 442 ...
##  $ Garage_Qual       : chr  "Typical" "Typical" "Typical" "Typical" ...
##  $ Garage_Cond       : chr  "Typical" "Typical" "Typical" "Typical" ...
##  $ Paved_Drive       : chr  "Partial_Pavement" "Paved" "Paved" "Paved" ...
##  $ Wood_Deck_SF      : int  210 140 393 0 212 360 0 0 237 140 ...
##  $ Open_Porch_SF     : int  62 0 36 0 34 36 0 82 152 60 ...
##  $ Enclosed_Porch    : int  0 0 0 0 0 0 170 0 0 0 ...
##  $ Three_season_porch: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Screen_Porch      : int  0 120 0 0 0 0 0 144 0 0 ...
##  $ Pool_Area         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool_QC           : chr  "No_Pool" "No_Pool" "No_Pool" "No_Pool" ...
##  $ Fence             : chr  "No_Fence" "Minimum_Privacy" "No_Fence" "No_Fence" ...
##  $ Misc_Feature      : chr  "None" "None" "Gar2" "None" ...
##  $ Misc_Val          : int  0 0 12500 0 0 0 0 0 0 0 ...
##  $ Mo_Sold           : int  5 6 6 4 3 6 4 1 3 6 ...
##  $ Year_Sold         : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Sale_Type         : chr  "WD " "WD " "WD " "WD " ...
##  $ Sale_Condition    : chr  "Normal" "Normal" "Normal" "Normal" ...
##  $ Sale_Price        : int  215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
##  $ Longitude         : num  -93.6 -93.6 -93.6 -93.6 -93.6 ...
##  $ Latitude          : num  42.1 42.1 42.1 42.1 42.1 ...

Our target variable is Sale_Price. The predictor variables cover almost every aspect of a home, including: * Continuous variables: e.g., Lot_Area (square footage), Gr_Liv_Area (above ground living area). * Discrete/Ordinal variables: e.g., Full_Bath, Overall_Qual (rating from 1-10). * Categorical/Nominal variables: e.g., Neighborhood (28 different locations), Bldg_Type, House_Style.

When dealing with this many variables—especially once categorical variables are expanded into dozens of dummy variables—standard linear regression often suffers from high variance and multicollinearity. Shrinkage methods constrain or “regularize” the coefficient estimates, significantly reducing variance and improving out-of-sample prediction.

1. Baseline: Least Squares (OLS)

We begin by fitting a saturated linear regression model using all available predictors. This will serve as our baseline.

# Fit a saturated linear regression model
lm.mod <- lm(Sale_Price ~ ., data = ames)
summary(lm.mod)

## 
## Call:
## lm(formula = Sale_Price ~ ., data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -303624   -9147      83    8942  145991 
## 
## Coefficients: (7 not defined because of singularities)
##                                                        Estimate Std. Error
## (Intercept)                                          -9.864e+06  9.006e+06
## MS_SubClassOne_and_Half_Story_Finished_All_Ages       2.371e+04  1.626e+04
## MS_SubClassOne_and_Half_Story_PUD_All_Ages           -2.454e+04  2.891e+04
## MS_SubClassOne_and_Half_Story_Unfinished_All_Ages     2.833e+04  1.933e+04
## MS_SubClassOne_Story_1945_and_Older                   2.040e+04  1.565e+04
## MS_SubClassOne_Story_1946_and_Newer_All_Styles        1.611e+04  1.519e+04
## MS_SubClassOne_Story_PUD_1946_and_Newer              -1.637e+03  1.739e+04
## MS_SubClassOne_Story_with_Finished_Attic_All_Ages     2.805e+04  1.829e+04
## MS_SubClassPUD_Multilevel_Split_Level_Foyer          -5.647e+03  1.864e+04
## MS_SubClassSplit_Foyer                                1.231e+04  1.589e+04
## MS_SubClassSplit_or_Multilevel                        6.335e+03  1.715e+04
## MS_SubClassTwo_and_Half_Story_All_Ages                1.445e+04  1.799e+04
## MS_SubClassTwo_Family_conversion_All_Styles_and_Ages  1.093e+04  4.891e+03
## MS_SubClassTwo_Story_1945_and_Older                   2.499e+04  1.616e+04
## MS_SubClassTwo_Story_1946_and_Newer                   1.594e+04  1.565e+04
## MS_SubClassTwo_Story_PUD_1946_and_Newer              -3.577e+03  1.797e+04
## MS_ZoningC_all                                        2.161e+04  3.684e+04
## MS_ZoningFloating_Village_Residential                 3.044e+04  3.614e+04
## MS_ZoningI_all                                        2.812e+04  4.358e+04
## MS_ZoningResidential_High_Density                     3.473e+04  3.658e+04
## MS_ZoningResidential_Low_Density                      3.191e+04  3.592e+04
## MS_ZoningResidential_Medium_Density                   2.850e+04  3.615e+04
## Lot_Frontage                                          1.198e+01  1.468e+01
## Lot_Area                                              5.652e-01  7.803e-02
## StreetPave                                            1.790e+04  7.378e+03
## AlleyNo_Alley_Access                                 -1.054e+03  2.496e+03
## AlleyPaved                                            2.296e+01  3.841e+03
## Lot_ShapeModerately_Irregular                         3.298e+01  6.228e+03
## Lot_ShapeRegular                                     -3.611e+03  5.862e+03
## Lot_ShapeSlightly_Irregular                          -4.904e+03  5.836e+03
## Land_ContourHLS                                       7.625e+03  3.205e+03
## Land_ContourLow                                      -4.570e+03  4.174e+03
## Land_ContourLvl                                       4.528e+03  2.351e+03
## UtilitiesNoSeWa                                      -3.020e+04  2.382e+04
## UtilitiesNoSewr                                      -1.176e+04  1.972e+04
## Lot_ConfigCulDSac                                     4.119e+03  2.092e+03
## Lot_ConfigFR2                                        -6.163e+03  2.688e+03
## Lot_ConfigFR3                                        -4.250e+03  6.097e+03
## Lot_ConfigInside                                     -4.659e+02  1.123e+03
## Land_SlopeMod                                         4.971e+03  2.561e+03
## Land_SlopeSev                                        -2.517e+04  7.808e+03
## NeighborhoodBlueste                                   2.099e+04  1.126e+04
## NeighborhoodBriardale                                 8.075e+03  7.667e+03
## NeighborhoodBrookside                                 9.502e+02  7.054e+03
## NeighborhoodClear_Creek                               4.308e+03  8.475e+03
## NeighborhoodCollege_Creek                             3.894e+03  8.998e+03
## NeighborhoodCrawford                                  1.888e+04  8.275e+03
## NeighborhoodEdwards                                  -5.226e+03  8.173e+03
## NeighborhoodGilbert                                  -1.067e+04  5.137e+03
## NeighborhoodGreen_Hills                               1.234e+05  1.787e+04
## NeighborhoodGreens                                    9.377e+03  9.995e+03
## NeighborhoodIowa_DOT_and_Rail_Road                   -2.213e+03  8.103e+03
## NeighborhoodLandmark                                  6.722e+03  2.347e+04
## NeighborhoodMeadow_Village                            1.145e+04  1.153e+04
## NeighborhoodMitchell                                  4.017e+03  1.060e+04
## NeighborhoodNorth_Ames                               -9.349e+03  5.946e+03
## NeighborhoodNorthpark_Villa                           1.586e+04  1.024e+04
## NeighborhoodNorthridge                                2.327e+04  6.010e+03
## NeighborhoodNorthridge_Heights                        1.268e+04  5.218e+03
## NeighborhoodNorthwest_Ames                           -1.162e+04  5.621e+03
## NeighborhoodOld_Town                                 -9.832e+03  7.194e+03
## NeighborhoodSawyer                                    8.041e+02  7.449e+03
## NeighborhoodSawyer_West                              -1.086e+03  7.883e+03
## NeighborhoodSomerset                                  9.752e+03  6.039e+03
## NeighborhoodSouth_and_West_of_Iowa_State_University  -1.162e+03  8.584e+03
## NeighborhoodStone_Brook                               3.128e+04  5.773e+03
## NeighborhoodTimberland                                1.145e+04  1.044e+04
## NeighborhoodVeenker                                  -5.008e+03  7.477e+03
## Condition_1Feedr                                      3.307e+03  3.084e+03
## Condition_1Norm                                       1.230e+04  2.567e+03
## Condition_1PosA                                       1.512e+04  5.860e+03
## Condition_1PosN                                       1.553e+04  4.529e+03
## Condition_1RRAe                                      -1.866e+03  5.077e+03
## Condition_1RRAn                                       7.700e+03  4.266e+03
## Condition_1RRNe                                       1.890e+03  9.394e+03
## Condition_1RRNn                                       1.290e+00  7.831e+03
## Condition_2Feedr                                     -6.461e+03  1.192e+04
## Condition_2Norm                                       2.930e+03  1.035e+04
## Condition_2PosA                                       2.946e+04  1.617e+04
## Condition_2PosN                                      -1.259e+05  1.550e+04
## Condition_2RRAe                                      -4.266e+04  3.054e+04
## Condition_2RRAn                                       2.470e+03  2.387e+04
## Condition_2RRNn                                       1.446e+04  1.850e+04
## Bldg_TypeOneFam                                       2.019e+03  1.459e+04
## Bldg_TypeTwnhs                                       -5.225e+03  1.707e+04
## Bldg_TypeTwnhsE                                       5.260e+02  1.683e+04
## Bldg_TypeTwoFmCon                                            NA         NA
## House_StyleOne_and_Half_Unf                           5.423e+03  1.104e+04
## House_StyleOne_Story                                  1.022e+04  5.321e+03
## House_StyleSFoyer                                     1.207e+04  6.976e+03
## House_StyleSLvl                                       1.577e+04  9.316e+03
## House_StyleTwo_and_Half_Fin                          -7.546e+03  1.165e+04
## House_StyleTwo_and_Half_Unf                           6.292e+03  8.300e+03
## House_StyleTwo_Story                                  5.972e+02  4.883e+03
## Overall_QualAverage                                  -2.826e+03  1.341e+03
## Overall_QualBelow_Average                            -5.202e+03  2.137e+03
## Overall_QualExcellent                                 5.180e+04  3.783e+03
## Overall_QualFair                                     -3.304e+03  4.269e+03
## Overall_QualGood                                      5.364e+03  1.442e+03
## Overall_QualPoor                                     -4.212e+03  7.168e+03
## Overall_QualVery_Excellent                            8.152e+04  6.211e+03
## Overall_QualVery_Good                                 1.848e+04  2.116e+03
## Overall_QualVery_Poor                                 9.535e+03  1.863e+04
## Overall_CondAverage                                  -5.527e+03  1.295e+03
## Overall_CondBelow_Average                            -1.387e+04  2.556e+03
## Overall_CondExcellent                                 1.907e+04  4.210e+03
## Overall_CondFair                                     -2.605e+04  3.753e+03
## Overall_CondGood                                      7.790e+03  1.561e+03
## Overall_CondPoor                                     -2.585e+04  9.225e+03
## Overall_CondVery_Good                                 1.101e+04  2.317e+03
## Overall_CondVery_Poor                                -1.726e+04  1.139e+04
## Year_Built                                            4.018e+02  5.267e+01
## Year_Remod_Add                                        6.545e+01  3.471e+01
## Roof_StyleGable                                       9.138e+03  9.895e+03
## Roof_StyleGambrel                                     4.774e+03  1.111e+04
## Roof_StyleHip                                         9.381e+03  9.967e+03
## Roof_StyleMansard                                    -2.559e+03  1.227e+04
## Roof_StyleShed                                        2.778e+04  1.707e+04
## Roof_MatlCompShg                                      6.898e+05  3.162e+04
## Roof_MatlMembran                                      7.508e+05  4.061e+04
## Roof_MatlMetal                                        7.325e+05  4.042e+04
## Roof_MatlRoll                                         6.959e+05  3.847e+04
## Roof_MatlTar&Grv                                      6.900e+05  3.273e+04
## Roof_MatlWdShake                                      6.875e+05  3.277e+04
## Roof_MatlWdShngl                                      7.346e+05  3.260e+04
## Exterior_1stAsphShn                                  -1.194e+04  2.332e+04
## Exterior_1stBrkComm                                   7.759e+03  1.330e+04
## Exterior_1stBrkFace                                   1.383e+04  7.867e+03
## Exterior_1stCBlock                                   -5.702e+03  2.748e+04
## Exterior_1stCemntBd                                   1.205e+03  1.241e+04
## Exterior_1stHdBoard                                  -6.228e+03  7.684e+03
## Exterior_1stImStucc                                  -2.456e+04  2.334e+04
## Exterior_1stMetalSd                                   1.551e+03  8.632e+03
## Exterior_1stPlywood                                  -4.318e+03  7.530e+03
## Exterior_1stPreCast                                   6.239e+04  2.293e+04
## Exterior_1stStone                                    -2.294e+03  1.861e+04
## Exterior_1stStucco                                    1.305e+03  8.527e+03
## Exterior_1stVinylSd                                  -9.551e+03  8.381e+03
## Exterior_1stWd Sdng                                  -2.899e+03  7.431e+03
## Exterior_1stWdShing                                  -6.136e+03  8.014e+03
## Exterior_2ndAsphShn                                   1.659e+04  1.761e+04
## Exterior_2ndBrk Cmn                                  -3.326e+02  1.213e+04
## Exterior_2ndBrkFace                                  -4.499e+02  8.629e+03
## Exterior_2ndCBlock                                    8.888e+03  2.247e+04
## Exterior_2ndCmentBd                                   2.371e+02  1.258e+04
## Exterior_2ndHdBoard                                   5.302e+03  7.908e+03
## Exterior_2ndImStucc                                   8.676e+03  9.559e+03
## Exterior_2ndMetalSd                                   1.086e+03  8.815e+03
## Exterior_2ndOther                                    -1.118e+04  2.302e+04
## Exterior_2ndPlywood                                   3.272e+03  7.620e+03
## Exterior_2ndPreCast                                          NA         NA
## Exterior_2ndStone                                    -5.752e+03  1.295e+04
## Exterior_2ndStucco                                    3.308e+03  8.698e+03
## Exterior_2ndVinylSd                                   1.154e+04  8.555e+03
## Exterior_2ndWd Sdng                                   4.171e+03  7.676e+03
## Exterior_2ndWd Shng                                   6.923e+03  8.038e+03
## Mas_Vnr_TypeBrkFace                                   3.107e+03  4.521e+03
## Mas_Vnr_TypeCBlock                                   -7.058e+04  2.688e+04
## Mas_Vnr_TypeNone                                      6.745e+03  4.530e+03
## Mas_Vnr_TypeStone                                     8.371e+03  4.746e+03
## Mas_Vnr_Area                                          1.958e+01  3.756e+00
## Exter_QualFair                                       -6.031e+03  6.081e+03
## Exter_QualGood                                       -1.046e+04  3.345e+03
## Exter_QualTypical                                    -1.162e+04  3.670e+03
## Exter_CondFair                                        1.003e+03  7.552e+03
## Exter_CondGood                                        3.167e+03  6.803e+03
## Exter_CondPoor                                       -8.777e+03  1.857e+04
## Exter_CondTypical                                     5.240e+03  6.806e+03
## FoundationCBlock                                     -8.659e+01  1.969e+03
## FoundationPConc                                       1.619e+03  2.133e+03
## FoundationSlab                                       -3.174e+03  5.930e+03
## FoundationStone                                       1.089e+04  7.102e+03
## FoundationWood                                       -1.776e+04  1.006e+04
## Bsmt_QualFair                                        -9.787e+03  3.737e+03
## Bsmt_QualGood                                        -1.276e+04  2.103e+03
## Bsmt_QualNo_Basement                                  1.783e+04  3.860e+04
## Bsmt_QualPoor                                        -7.999e+03  1.798e+04
## Bsmt_QualTypical                                     -1.105e+04  2.626e+03
## Bsmt_CondFair                                         1.349e+03  1.302e+04
## Bsmt_CondGood                                         8.209e+02  1.292e+04
## Bsmt_CondNo_Basement                                         NA         NA
## Bsmt_CondPoor                                        -1.390e+03  1.769e+04
## Bsmt_CondTypical                                      7.412e+02  1.278e+04
## Bsmt_ExposureGd                                       1.013e+04  1.877e+03
## Bsmt_ExposureMn                                      -5.889e+03  1.885e+03
## Bsmt_ExposureNo                                      -4.784e+03  1.419e+03
## Bsmt_ExposureNo_Basement                             -1.113e+04  1.223e+04
## BsmtFin_Type_1BLQ                                    -3.908e+03  6.650e+03
## BsmtFin_Type_1GLQ                                    -3.384e+03  1.295e+04
## BsmtFin_Type_1LwQ                                    -1.460e+04  1.940e+04
## BsmtFin_Type_1No_Basement                                    NA         NA
## BsmtFin_Type_1Rec                                    -2.036e+04  3.222e+04
## BsmtFin_Type_1Unf                                    -1.775e+04  3.865e+04
## BsmtFin_SF_1                                          3.550e+03  6.436e+03
## BsmtFin_Type_2BLQ                                    -7.072e+03  4.152e+03
## BsmtFin_Type_2GLQ                                     6.535e+03  5.055e+03
## BsmtFin_Type_2LwQ                                    -9.866e+03  4.052e+03
## BsmtFin_Type_2No_Basement                            -2.649e+04  2.166e+04
## BsmtFin_Type_2Rec                                    -8.215e+03  3.863e+03
## BsmtFin_Type_2Unf                                    -3.995e+03  3.967e+03
## BsmtFin_SF_2                                         -5.693e+00  4.647e+00
## Bsmt_Unf_SF                                          -2.028e+01  1.898e+00
## Total_Bsmt_SF                                         3.889e+01  3.012e+00
## HeatingGasA                                           1.080e+04  2.233e+04
## HeatingGasW                                           1.008e+04  2.279e+04
## HeatingGrav                                           8.183e+03  2.392e+04
## HeatingOthW                                          -1.345e+04  2.740e+04
## HeatingWall                                           2.083e+04  2.496e+04
## Heating_QCFair                                       -4.836e+03  2.849e+03
## Heating_QCGood                                       -1.762e+03  1.296e+03
## Heating_QCPoor                                       -1.589e+04  1.553e+04
## Heating_QCTypical                                    -3.240e+03  1.276e+03
## Central_AirY                                         -2.879e+03  2.269e+03
## ElectricalFuseF                                      -1.420e+03  3.762e+03
## ElectricalFuseP                                      -1.841e+03  8.563e+03
## ElectricalMix                                         3.208e+04  3.011e+04
## ElectricalSBrkr                                      -9.048e+02  1.846e+03
## ElectricalUnknown                                     5.572e+03  2.138e+04
## First_Flr_SF                                          4.946e+01  3.171e+00
## Second_Flr_SF                                         6.087e+01  3.503e+00
## Low_Qual_Fin_SF                                       3.005e+01  1.052e+01
## Gr_Liv_Area                                                  NA         NA
## Bsmt_Full_Bath                                        1.545e+03  1.203e+03
## Bsmt_Half_Bath                                       -1.711e+02  1.834e+03
## Full_Bath                                             3.923e+03  1.352e+03
## Half_Bath                                             2.495e+03  1.285e+03
## Bedroom_AbvGr                                        -2.820e+03  8.567e+02
## Kitchen_AbvGr                                        -8.627e+03  3.978e+03
## Kitchen_QualFair                                     -1.638e+04  3.943e+03
## Kitchen_QualGood                                     -1.328e+04  2.364e+03
## Kitchen_QualPoor                                      2.238e+04  2.276e+04
## Kitchen_QualTypical                                  -1.469e+04  2.603e+03
## TotRms_AbvGrd                                         6.877e+02  5.745e+02
## FunctionalMaj2                                        1.156e+02  9.771e+03
## FunctionalMin1                                        1.614e+04  6.238e+03
## FunctionalMin2                                        1.250e+04  6.297e+03
## FunctionalMod                                         7.954e+03  6.892e+03
## FunctionalSal                                        -1.076e+04  2.334e+04
## FunctionalSev                                        -1.526e+04  1.764e+04
## FunctionalTyp                                         2.333e+04  5.643e+03
## Fireplaces                                            7.896e+03  1.591e+03
## Fireplace_QuFair                                     -3.676e+03  4.405e+03
## Fireplace_QuGood                                     -3.025e+02  3.520e+03
## Fireplace_QuNo_Fireplace                              5.283e+03  4.010e+03
## Fireplace_QuPoor                                      2.153e+02  4.866e+03
## Fireplace_QuTypical                                  -1.224e+03  3.622e+03
## Garage_TypeBasment                                    1.306e+03  4.243e+03
## Garage_TypeBuiltIn                                   -7.939e+02  2.063e+03
## Garage_TypeCarPort                                   -4.466e+03  6.080e+03
## Garage_TypeDetchd                                    -9.058e+01  1.472e+03
## Garage_TypeMore_Than_Two_Types                       -1.187e+04  4.945e+03
## Garage_TypeNo_Garage                                  1.094e+04  1.764e+04
## Garage_FinishNo_Garage                               -2.119e+04  2.219e+04
## Garage_FinishRFn                                     -1.531e+03  1.243e+03
## Garage_FinishUnf                                      4.216e+02  1.488e+03
## Garage_Cars                                           3.766e+03  1.431e+03
## Garage_Area                                           1.415e+01  4.793e+00
## Garage_QualFair                                      -5.521e+04  1.875e+04
## Garage_QualGood                                      -4.200e+04  1.857e+04
## Garage_QualNo_Garage                                         NA         NA
## Garage_QualPoor                                      -6.145e+04  2.312e+04
## Garage_QualTypical                                   -5.382e+04  1.858e+04
## Garage_CondFair                                       3.590e+04  1.798e+04
## Garage_CondGood                                       3.170e+04  1.846e+04
## Garage_CondNo_Garage                                         NA         NA
## Garage_CondPoor                                       3.532e+04  1.934e+04
## Garage_CondTypical                                    3.819e+04  1.771e+04
## Paved_DrivePartial_Pavement                          -2.010e+03  3.315e+03
## Paved_DrivePaved                                     -4.863e+02  2.102e+03
## Wood_Deck_SF                                          7.655e+00  3.667e+00
## Open_Porch_SF                                         8.550e-01  7.007e+00
## Enclosed_Porch                                        1.056e+01  7.388e+00
## Three_season_porch                                    1.967e+01  1.654e+01
## Screen_Porch                                          4.506e+01  7.679e+00
## Pool_Area                                            -1.795e+01  5.457e+01
## Pool_QCFair                                          -4.416e+04  2.953e+04
## Pool_QCGood                                          -3.743e+04  2.596e+04
## Pool_QCNo_Pool                                       -8.124e+04  2.229e+04
## Pool_QCTypical                                       -6.947e+04  1.847e+04
## FenceGood_Wood                                        2.156e+03  2.984e+03
## FenceMinimum_Privacy                                  2.773e+03  2.423e+03
## FenceMinimum_Wood_Wire                               -3.897e+02  6.603e+03
## FenceNo_Fence                                         8.514e+02  2.184e+03
## Misc_FeatureGar2                                      5.565e+05  2.893e+04
## Misc_FeatureNone                                      5.565e+05  3.661e+04
## Misc_FeatureOthr                                      5.690e+05  3.459e+04
## Misc_FeatureShed                                      5.563e+05  3.570e+04
## Misc_FeatureTenC                                      4.948e+05  4.628e+04
## Misc_Val                                              9.341e-01  1.652e+00
## Mo_Sold                                              -1.398e+02  1.529e+02
## Year_Sold                                            -5.532e+02  3.213e+02
## Sale_TypeCon                                          2.867e+04  1.007e+04
## Sale_TypeConLD                                        6.356e+03  5.293e+03
## Sale_TypeConLI                                       -4.035e+03  7.682e+03
## Sale_TypeConLw                                        5.336e+03  8.335e+03
## Sale_TypeCWD                                          7.068e+03  6.734e+03
## Sale_TypeNew                                          1.228e+04  9.638e+03
## Sale_TypeOth                                          4.671e+03  8.469e+03
## Sale_TypeVWD                                         -6.185e+03  2.172e+04
## Sale_TypeWD                                           2.034e+03  2.609e+03
## Sale_ConditionAdjLand                                 2.100e+04  7.209e+03
## Sale_ConditionAlloca                                  1.305e+04  5.395e+03
## Sale_ConditionFamily                                  2.933e+03  3.637e+03
## Sale_ConditionNormal                                  8.738e+03  1.858e+03
## Sale_ConditionPartial                                 8.433e+03  9.281e+03
## Longitude                                             4.054e+04  8.903e+04
## Latitude                                              3.006e+05  1.308e+05
##                                                      t value Pr(>|t|)    
## (Intercept)                                           -1.095 0.273511    
## MS_SubClassOne_and_Half_Story_Finished_All_Ages        1.458 0.144976    
## MS_SubClassOne_and_Half_Story_PUD_All_Ages            -0.849 0.396093    
## MS_SubClassOne_and_Half_Story_Unfinished_All_Ages      1.466 0.142794    
## MS_SubClassOne_Story_1945_and_Older                    1.303 0.192556    
## MS_SubClassOne_Story_1946_and_Newer_All_Styles         1.060 0.289134    
## MS_SubClassOne_Story_PUD_1946_and_Newer               -0.094 0.925013    
## MS_SubClassOne_Story_with_Finished_Attic_All_Ages      1.534 0.125214    
## MS_SubClassPUD_Multilevel_Split_Level_Foyer           -0.303 0.761984    
## MS_SubClassSplit_Foyer                                 0.774 0.438771    
## MS_SubClassSplit_or_Multilevel                         0.369 0.711869    
## MS_SubClassTwo_and_Half_Story_All_Ages                 0.803 0.421869    
## MS_SubClassTwo_Family_conversion_All_Styles_and_Ages   2.234 0.025548 *  
## MS_SubClassTwo_Story_1945_and_Older                    1.546 0.122215    
## MS_SubClassTwo_Story_1946_and_Newer                    1.018 0.308720    
## MS_SubClassTwo_Story_PUD_1946_and_Newer               -0.199 0.842277    
## MS_ZoningC_all                                         0.587 0.557498    
## MS_ZoningFloating_Village_Residential                  0.842 0.399789    
## MS_ZoningI_all                                         0.645 0.518823    
## MS_ZoningResidential_High_Density                      0.949 0.342517    
## MS_ZoningResidential_Low_Density                       0.888 0.374528    
## MS_ZoningResidential_Medium_Density                    0.788 0.430506    
## Lot_Frontage                                           0.816 0.414490    
## Lot_Area                                               7.243 5.75e-13 ***
## StreetPave                                             2.426 0.015342 *  
## AlleyNo_Alley_Access                                  -0.422 0.672879    
## AlleyPaved                                             0.006 0.995231    
## Lot_ShapeModerately_Irregular                          0.005 0.995776    
## Lot_ShapeRegular                                      -0.616 0.537975    
## Lot_ShapeSlightly_Irregular                           -0.840 0.400815    
## Land_ContourHLS                                        2.379 0.017428 *  
## Land_ContourLow                                       -1.095 0.273603    
## Land_ContourLvl                                        1.926 0.054196 .  
## UtilitiesNoSeWa                                       -1.268 0.204981    
## UtilitiesNoSewr                                       -0.597 0.550866    
## Lot_ConfigCulDSac                                      1.969 0.049037 *  
## Lot_ConfigFR2                                         -2.293 0.021926 *  
## Lot_ConfigFR3                                         -0.697 0.485773    
## Lot_ConfigInside                                      -0.415 0.678397    
## Land_SlopeMod                                          1.941 0.052343 .  
## Land_SlopeSev                                         -3.224 0.001280 ** 
## NeighborhoodBlueste                                    1.864 0.062415 .  
## NeighborhoodBriardale                                  1.053 0.292359    
## NeighborhoodBrookside                                  0.135 0.892858    
## NeighborhoodClear_Creek                                0.508 0.611299    
## NeighborhoodCollege_Creek                              0.433 0.665275    
## NeighborhoodCrawford                                   2.281 0.022608 *  
## NeighborhoodEdwards                                   -0.639 0.522579    
## NeighborhoodGilbert                                   -2.077 0.037918 *  
## NeighborhoodGreen_Hills                                6.909 6.11e-12 ***
## NeighborhoodGreens                                     0.938 0.348238    
## NeighborhoodIowa_DOT_and_Rail_Road                    -0.273 0.784762    
## NeighborhoodLandmark                                   0.286 0.774590    
## NeighborhoodMeadow_Village                             0.993 0.320874    
## NeighborhoodMitchell                                   0.379 0.704715    
## NeighborhoodNorth_Ames                                -1.572 0.115973    
## NeighborhoodNorthpark_Villa                            1.549 0.121591    
## NeighborhoodNorthridge                                 3.871 0.000111 ***
## NeighborhoodNorthridge_Heights                         2.430 0.015153 *  
## NeighborhoodNorthwest_Ames                            -2.067 0.038822 *  
## NeighborhoodOld_Town                                  -1.367 0.171827    
## NeighborhoodSawyer                                     0.108 0.914040    
## NeighborhoodSawyer_West                               -0.138 0.890391    
## NeighborhoodSomerset                                   1.615 0.106482    
## NeighborhoodSouth_and_West_of_Iowa_State_University   -0.135 0.892302    
## NeighborhoodStone_Brook                                5.419 6.55e-08 ***
## NeighborhoodTimberland                                 1.097 0.272947    
## NeighborhoodVeenker                                   -0.670 0.503048    
## Condition_1Feedr                                       1.072 0.283705    
## Condition_1Norm                                        4.791 1.75e-06 ***
## Condition_1PosA                                        2.579 0.009955 ** 
## Condition_1PosN                                        3.430 0.000614 ***
## Condition_1RRAe                                       -0.367 0.713306    
## Condition_1RRAn                                        1.805 0.071203 .  
## Condition_1RRNe                                        0.201 0.840547    
## Condition_1RRNn                                        0.000 0.999869    
## Condition_2Feedr                                      -0.542 0.587921    
## Condition_2Norm                                        0.283 0.777186    
## Condition_2PosA                                        1.821 0.068668 .  
## Condition_2PosN                                       -8.121 7.04e-16 ***
## Condition_2RRAe                                       -1.397 0.162532    
## Condition_2RRAn                                        0.103 0.917584    
## Condition_2RRNn                                        0.782 0.434542    
## Bldg_TypeOneFam                                        0.138 0.889960    
## Bldg_TypeTwnhs                                        -0.306 0.759594    
## Bldg_TypeTwnhsE                                        0.031 0.975071    
## Bldg_TypeTwoFmCon                                         NA       NA    
## House_StyleOne_and_Half_Unf                            0.491 0.623177    
## House_StyleOne_Story                                   1.920 0.055018 .  
## House_StyleSFoyer                                      1.731 0.083650 .  
## House_StyleSLvl                                        1.693 0.090641 .  
## House_StyleTwo_and_Half_Fin                           -0.648 0.517291    
## House_StyleTwo_and_Half_Unf                            0.758 0.448505    
## House_StyleTwo_Story                                   0.122 0.902673    
## Overall_QualAverage                                   -2.107 0.035171 *  
## Overall_QualBelow_Average                             -2.435 0.014968 *  
## Overall_QualExcellent                                 13.693  < 2e-16 ***
## Overall_QualFair                                      -0.774 0.438966    
## Overall_QualGood                                       3.720 0.000204 ***
## Overall_QualPoor                                      -0.588 0.556856    
## Overall_QualVery_Excellent                            13.126  < 2e-16 ***
## Overall_QualVery_Good                                  8.737  < 2e-16 ***
## Overall_QualVery_Poor                                  0.512 0.608745    
## Overall_CondAverage                                   -4.270 2.03e-05 ***
## Overall_CondBelow_Average                             -5.426 6.28e-08 ***
## Overall_CondExcellent                                  4.531 6.13e-06 ***
## Overall_CondFair                                      -6.941 4.90e-12 ***
## Overall_CondGood                                       4.991 6.41e-07 ***
## Overall_CondPoor                                      -2.802 0.005117 ** 
## Overall_CondVery_Good                                  4.752 2.12e-06 ***
## Overall_CondVery_Poor                                 -1.516 0.129637    
## Year_Built                                             7.628 3.30e-14 ***
## Year_Remod_Add                                         1.886 0.059456 .  
## Roof_StyleGable                                        0.923 0.355859    
## Roof_StyleGambrel                                      0.430 0.667298    
## Roof_StyleHip                                          0.941 0.346657    
## Roof_StyleMansard                                     -0.208 0.834869    
## Roof_StyleShed                                         1.627 0.103774    
## Roof_MatlCompShg                                      21.816  < 2e-16 ***
## Roof_MatlMembran                                      18.489  < 2e-16 ***
## Roof_MatlMetal                                        18.121  < 2e-16 ***
## Roof_MatlRoll                                         18.092  < 2e-16 ***
## Roof_MatlTar&Grv                                      21.083  < 2e-16 ***
## Roof_MatlWdShake                                      20.977  < 2e-16 ***
## Roof_MatlWdShngl                                      22.533  < 2e-16 ***
## Exterior_1stAsphShn                                   -0.512 0.608628    
## Exterior_1stBrkComm                                    0.583 0.559609    
## Exterior_1stBrkFace                                    1.757 0.078963 .  
## Exterior_1stCBlock                                    -0.208 0.835620    
## Exterior_1stCemntBd                                    0.097 0.922645    
## Exterior_1stHdBoard                                   -0.811 0.417672    
## Exterior_1stImStucc                                   -1.053 0.292646    
## Exterior_1stMetalSd                                    0.180 0.857391    
## Exterior_1stPlywood                                   -0.573 0.566459    
## Exterior_1stPreCast                                    2.721 0.006546 ** 
## Exterior_1stStone                                     -0.123 0.901903    
## Exterior_1stStucco                                     0.153 0.878379    
## Exterior_1stVinylSd                                   -1.140 0.254581    
## Exterior_1stWd Sdng                                   -0.390 0.696484    
## Exterior_1stWdShing                                   -0.766 0.443939    
## Exterior_2ndAsphShn                                    0.942 0.346090    
## Exterior_2ndBrk Cmn                                   -0.027 0.978130    
## Exterior_2ndBrkFace                                   -0.052 0.958419    
## Exterior_2ndCBlock                                     0.396 0.692463    
## Exterior_2ndCmentBd                                    0.019 0.984962    
## Exterior_2ndHdBoard                                    0.670 0.502611    
## Exterior_2ndImStucc                                    0.908 0.364189    
## Exterior_2ndMetalSd                                    0.123 0.901947    
## Exterior_2ndOther                                     -0.486 0.627289    
## Exterior_2ndPlywood                                    0.429 0.667642    
## Exterior_2ndPreCast                                       NA       NA    
## Exterior_2ndStone                                     -0.444 0.657012    
## Exterior_2ndStucco                                     0.380 0.703769    
## Exterior_2ndVinylSd                                    1.349 0.177584    
## Exterior_2ndWd Sdng                                    0.543 0.586957    
## Exterior_2ndWd Shng                                    0.861 0.389179    
## Mas_Vnr_TypeBrkFace                                    0.687 0.491974    
## Mas_Vnr_TypeCBlock                                    -2.626 0.008687 ** 
## Mas_Vnr_TypeNone                                       1.489 0.136620    
## Mas_Vnr_TypeStone                                      1.764 0.077872 .  
## Mas_Vnr_Area                                           5.212 2.01e-07 ***
## Exter_QualFair                                        -0.992 0.321364    
## Exter_QualGood                                        -3.128 0.001781 ** 
## Exter_QualTypical                                     -3.166 0.001563 ** 
## Exter_CondFair                                         0.133 0.894313    
## Exter_CondGood                                         0.466 0.641584    
## Exter_CondPoor                                        -0.473 0.636592    
## Exter_CondTypical                                      0.770 0.441422    
## FoundationCBlock                                      -0.044 0.964923    
## FoundationPConc                                        0.759 0.447809    
## FoundationSlab                                        -0.535 0.592540    
## FoundationStone                                        1.533 0.125300    
## FoundationWood                                        -1.765 0.077632 .  
## Bsmt_QualFair                                         -2.619 0.008876 ** 
## Bsmt_QualGood                                         -6.067 1.49e-09 ***
## Bsmt_QualNo_Basement                                   0.462 0.644110    
## Bsmt_QualPoor                                         -0.445 0.656426    
## Bsmt_QualTypical                                      -4.208 2.66e-05 ***
## Bsmt_CondFair                                          0.104 0.917456    
## Bsmt_CondGood                                          0.064 0.949340    
## Bsmt_CondNo_Basement                                      NA       NA    
## Bsmt_CondPoor                                         -0.079 0.937387    
## Bsmt_CondTypical                                       0.058 0.953748    
## Bsmt_ExposureGd                                        5.400 7.27e-08 ***
## Bsmt_ExposureMn                                       -3.124 0.001804 ** 
## Bsmt_ExposureNo                                       -3.371 0.000760 ***
## Bsmt_ExposureNo_Basement                              -0.910 0.362864    
## BsmtFin_Type_1BLQ                                     -0.588 0.556851    
## BsmtFin_Type_1GLQ                                     -0.261 0.793901    
## BsmtFin_Type_1LwQ                                     -0.753 0.451749    
## BsmtFin_Type_1No_Basement                                 NA       NA    
## BsmtFin_Type_1Rec                                     -0.632 0.527562    
## BsmtFin_Type_1Unf                                     -0.459 0.646019    
## BsmtFin_SF_1                                           0.552 0.581307    
## BsmtFin_Type_2BLQ                                     -1.703 0.088640 .  
## BsmtFin_Type_2GLQ                                      1.293 0.196168    
## BsmtFin_Type_2LwQ                                     -2.435 0.014959 *  
## BsmtFin_Type_2No_Basement                             -1.223 0.221477    
## BsmtFin_Type_2Rec                                     -2.126 0.033570 *  
## BsmtFin_Type_2Unf                                     -1.007 0.314011    
## BsmtFin_SF_2                                          -1.225 0.220638    
## Bsmt_Unf_SF                                          -10.683  < 2e-16 ***
## Total_Bsmt_SF                                         12.912  < 2e-16 ***
## HeatingGasA                                            0.484 0.628535    
## HeatingGasW                                            0.442 0.658440    
## HeatingGrav                                            0.342 0.732318    
## HeatingOthW                                           -0.491 0.623617    
## HeatingWall                                            0.835 0.404054    
## Heating_QCFair                                        -1.698 0.089694 .  
## Heating_QCGood                                        -1.360 0.174078    
## Heating_QCPoor                                        -1.023 0.306487    
## Heating_QCTypical                                     -2.540 0.011153 *  
## Central_AirY                                          -1.269 0.204691    
## ElectricalFuseF                                       -0.377 0.705917    
## ElectricalFuseP                                       -0.215 0.829748    
## ElectricalMix                                          1.065 0.286828    
## ElectricalSBrkr                                       -0.490 0.624012    
## ElectricalUnknown                                      0.261 0.794401    
## First_Flr_SF                                          15.597  < 2e-16 ***
## Second_Flr_SF                                         17.377  < 2e-16 ***
## Low_Qual_Fin_SF                                        2.857 0.004309 ** 
## Gr_Liv_Area                                               NA       NA    
## Bsmt_Full_Bath                                         1.283 0.199468    
## Bsmt_Half_Bath                                        -0.093 0.925692    
## Full_Bath                                              2.902 0.003737 ** 
## Half_Bath                                              1.941 0.052388 .  
## Bedroom_AbvGr                                         -3.292 0.001010 ** 
## Kitchen_AbvGr                                         -2.169 0.030203 *  
## Kitchen_QualFair                                      -4.155 3.36e-05 ***
## Kitchen_QualGood                                      -5.616 2.16e-08 ***
## Kitchen_QualPoor                                       0.983 0.325452    
## Kitchen_QualTypical                                   -5.646 1.82e-08 ***
## TotRms_AbvGrd                                          1.197 0.231386    
## FunctionalMaj2                                         0.012 0.990561    
## FunctionalMin1                                         2.588 0.009703 ** 
## FunctionalMin2                                         1.985 0.047226 *  
## FunctionalMod                                          1.154 0.248566    
## FunctionalSal                                         -0.461 0.644950    
## FunctionalSev                                         -0.865 0.387064    
## FunctionalTyp                                          4.134 3.67e-05 ***
## Fireplaces                                             4.963 7.38e-07 ***
## Fireplace_QuFair                                      -0.835 0.404047    
## Fireplace_QuGood                                      -0.086 0.931535    
## Fireplace_QuNo_Fireplace                               1.317 0.187833    
## Fireplace_QuPoor                                       0.044 0.964702    
## Fireplace_QuTypical                                   -0.338 0.735440    
## Garage_TypeBasment                                     0.308 0.758221    
## Garage_TypeBuiltIn                                    -0.385 0.700385    
## Garage_TypeCarPort                                    -0.735 0.462653    
## Garage_TypeDetchd                                     -0.062 0.950923    
## Garage_TypeMore_Than_Two_Types                        -2.400 0.016468 *  
## Garage_TypeNo_Garage                                   0.620 0.535284    
## Garage_FinishNo_Garage                                -0.955 0.339685    
## Garage_FinishRFn                                      -1.232 0.218061    
## Garage_FinishUnf                                       0.283 0.776873    
## Garage_Cars                                            2.631 0.008554 ** 
## Garage_Area                                            2.951 0.003190 ** 
## Garage_QualFair                                       -2.944 0.003265 ** 
## Garage_QualGood                                       -2.262 0.023769 *  
## Garage_QualNo_Garage                                      NA       NA    
## Garage_QualPoor                                       -2.658 0.007919 ** 
## Garage_QualTypical                                    -2.896 0.003811 ** 
## Garage_CondFair                                        1.996 0.046007 *  
## Garage_CondGood                                        1.717 0.086110 .  
## Garage_CondNo_Garage                                      NA       NA    
## Garage_CondPoor                                        1.826 0.067971 .  
## Garage_CondTypical                                     2.157 0.031109 *  
## Paved_DrivePartial_Pavement                           -0.606 0.544267    
## Paved_DrivePaved                                      -0.231 0.817059    
## Wood_Deck_SF                                           2.087 0.036959 *  
## Open_Porch_SF                                          0.122 0.902892    
## Enclosed_Porch                                         1.430 0.152938    
## Three_season_porch                                     1.189 0.234387    
## Screen_Porch                                           5.867 4.98e-09 ***
## Pool_Area                                             -0.329 0.742255    
## Pool_QCFair                                           -1.495 0.134919    
## Pool_QCGood                                           -1.442 0.149390    
## Pool_QCNo_Pool                                        -3.644 0.000273 ***
## Pool_QCTypical                                        -3.762 0.000172 ***
## FenceGood_Wood                                         0.723 0.469964    
## FenceMinimum_Privacy                                   1.144 0.252553    
## FenceMinimum_Wood_Wire                                -0.059 0.952938    
## FenceNo_Fence                                          0.390 0.696682    
## Misc_FeatureGar2                                      19.240  < 2e-16 ***
## Misc_FeatureNone                                      15.202  < 2e-16 ***
## Misc_FeatureOthr                                      16.452  < 2e-16 ***
## Misc_FeatureShed                                      15.585  < 2e-16 ***
## Misc_FeatureTenC                                      10.692  < 2e-16 ***
## Misc_Val                                               0.566 0.571768    
## Mo_Sold                                               -0.914 0.360556    
## Year_Sold                                             -1.722 0.085255 .  
## Sale_TypeCon                                           2.846 0.004458 ** 
## Sale_TypeConLD                                         1.201 0.229960    
## Sale_TypeConLI                                        -0.525 0.599439    
## Sale_TypeConLw                                         0.640 0.522059    
## Sale_TypeCWD                                           1.050 0.294015    
## Sale_TypeNew                                           1.274 0.202605    
## Sale_TypeOth                                           0.552 0.581301    
## Sale_TypeVWD                                          -0.285 0.775817    
## Sale_TypeWD                                            0.780 0.435701    
## Sale_ConditionAdjLand                                  2.914 0.003602 ** 
## Sale_ConditionAlloca                                   2.419 0.015614 *  
## Sale_ConditionFamily                                   0.806 0.420044    
## Sale_ConditionNormal                                   4.704 2.68e-06 ***
## Sale_ConditionPartial                                  0.909 0.363618    
## Longitude                                              0.455 0.648861    
## Latitude                                               2.299 0.021611 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20770 on 2630 degrees of freedom
## Multiple R-squared:  0.9393, Adjusted R-squared:  0.9324 
## F-statistic: 136.1 on 299 and 2630 DF,  p-value: < 2.2e-16

We have a good absolute fit in terms of R-squared (which is expected given the large number of parameters to fit the data). However, the presence of many non-significant variables strongly suggests the use of a penalizing method. Additionally, there are NA values in the estimated coefficients because standard OLS naturally drops perfectly collinear dummy variables.

We will compute the $L_1$ and $L_2$ norms for the least squares estimates so we can later compare how much Ridge and Lasso shrink these coefficients.

Note that we omit the intercept (beta.hat.ls[-1]) because shrinkage penalties are generally only applied to the feature coefficients, not the baseline intercept. We also use na.rm = TRUE to handle the NA values when computing the sums.

# Compute L1 and L2 norms for least square estimates (omit the intercept)
beta.hat.ls <- coefficients(lm.mod)[-1]

# L1 norm
ell_1.ls <- sum(abs(beta.hat.ls), na.rm = TRUE)
ell_1.ls

## [1] 11430475

# L2 norm 
ell_2.ls <- sqrt(sum(beta.hat.ls^2, na.rm = TRUE))
ell_2.ls

## [1] 2292824

2. Ridge Regression

2.1 Introducing `glmnet`

To apply Ridge Regression, we will use the glmnet package. Unlike standard lm(), glmnet() does not use formula interfaces (e.g., y ~ x). It strictly requires a design matrix X and a response vector y.

We use the model.matrix() function to automatically convert all categorical variables into dummy variables. We then drop the first column of X to remove the intercept, as glmnet handles the intercept internally.

library(glmnet)

## Warning: package 'glmnet' was built under R version 4.6.1

## Loading required package: Matrix

## Loaded glmnet 5.0

# Design matrix without the first column (intercept)
X <- model.matrix(Sale_Price ~ ., data = ames)
X <- X[, -1]

# Vector of responses
y <- ames$Sale_Price

Shrinkage methods require a grid of $\lambda$ (lambda) values, which controls the strength of the penalty. We generate a sequence of 100 values ranging from $10^9$ down to $10^{-2}$. We use a larger upper bound here because housing prices scale into the hundreds of thousands.

grid <- 10^seq(9, -2, length = 100)
plot(1:100, grid, type = "b", ylab = "Lambda Values", pch = 20)

2.2 Fitting the Ridge Model

We are now ready to apply the glmnet() function. The behavior of the function is governed by several key arguments: 1. alpha: Controls the penalty type. alpha = 0 applies Ridge regression, while alpha = 1 (the default) applies Lasso. 2. standardize: Defaults to TRUE. It is crucial to standardize variables before applying shrinkage so that the penalty is applied fairly across variables of different scales (e.g., square feet vs. number of bedrooms). Note that, by default, the output coefficients are scaled back to their original units. This allows us, for instance, to compare the coefficient norms under OLS with those obtained by applying shrinkage methods. 3. lambda: The grid of values to test. If omitted, glmnet() will automatically choose its own sequence.

ridge.mod <- glmnet(X, y, alpha = 0, lambda = grid)

The plot() method for a glmnet object produces a coefficient profile plot, showing how the coefficients shrink as the penalty changes. The xvar argument dictates the X-axis: * "norm": Plots against the $L_1$-norm of the coefficients. * "lambda": Plots against the log of the lambda sequence. * "dev": Plots against the percent of deviance explained.

par(mfrow = c(1, 3))
plot(ridge.mod, xvar = "lambda", label = TRUE)
plot(ridge.mod, xvar = "norm", label = TRUE)
plot(ridge.mod, xvar = "dev", label = TRUE)

2.3 Inspecting Coefficients

The coef() function returns a matrix where each row is a predictor and each column corresponds to a specific $\lambda$ in our grid. We can extract parameters estimated at different $\lambda$ values and observe how the $L_2$ norm shrinks relative to our OLS baseline.

dim(coef(ridge.mod))

## [1] 307 100

# Coefficients and shrinkage for a strong penalty (lambda[30])
ridge.mod$lambda[30]

## [1] 599484.3

sqrt(sum(coef(ridge.mod)[-1, 30]^2))

## [1] 86499.82

sqrt(sum(coef(ridge.mod)[-1, 30]^2)) / ell_2.ls

## [1] 0.03772631

cat("--------- \n")

## ---------

# Coefficients and shrinkage for a weaker penalty (lambda[70])
ridge.mod$lambda[70]

## [1] 21.54435

sqrt(sum(coef(ridge.mod)[-1, 70]^2))

## [1] 2064281

sqrt(sum(coef(ridge.mod)[-1, 70]^2)) / ell_2.ls

## [1] 0.9003222

2.4 Making Predictions

The predict() function is highly versatile for glmnet objects. We can supply an arbitrary $\lambda$ value using the s argument. Even if the exact $\lambda$ wasn’t in our original grid, glmnet will smoothly interpolate the coefficients. * type = "coefficients": Returns the estimated $\beta$ values. * type = "response": Returns the predicted $\hat{y}$ values given new data (newx).

# Extract coefficients for an arbitrary lambda (s = 100,000)
predict(ridge.mod, s = 100000, type = "coefficients")[1:10, ]

##                                       (Intercept) 
##                                     -6099889.6362 
##   MS_SubClassOne_and_Half_Story_Finished_All_Ages 
##                                          418.2984 
##        MS_SubClassOne_and_Half_Story_PUD_All_Ages 
##                                       -20494.8448 
## MS_SubClassOne_and_Half_Story_Unfinished_All_Ages 
##                                         1132.5200 
##               MS_SubClassOne_Story_1945_and_Older 
##                                        -3510.7074 
##    MS_SubClassOne_Story_1946_and_Newer_All_Styles 
##                                         1186.1489 
##           MS_SubClassOne_Story_PUD_1946_and_Newer 
##                                        -2817.2026 
## MS_SubClassOne_Story_with_Finished_Attic_All_Ages 
##                                         2259.6478 
##       MS_SubClassPUD_Multilevel_Split_Level_Foyer 
##                                        -5762.1503 
##                            MS_SubClassSplit_Foyer 
##                                         -866.4811

# Generate price predictions for the first 5 houses in the dataset
y.hat <- predict(ridge.mod, s = 100000, newx = X[1:5, ], type = "response")
y.hat

##    s=1e+05
## 1 226218.1
## 2 115253.9
## 3 138587.6
## 4 241669.2
## 5 192860.0

2.5 Cross-Validation for Ridge

Instead of arbitrarily picking $\lambda$, we use cross-validation to find the value that minimizes test error. The cv.glmnet() function automates this. The nfolds argument defaults to 10-fold cross-validation. If lambda is not provided, the function intelligently builds its own sequence.

set.seed(1)
cv.out <- cv.glmnet(X, y, alpha = 0, nfolds = 10, lambda = grid)

The output contains several useful objects: * cv.out$lambda: The sequence of lambdas used. * cv.out$cvm: The mean cross-validated error for each lambda. * cv.out$cvsd: The standard error of the mean cross-validated error.

length(cv.out$lambda)

## [1] 100

cv.out$lambda[1:10]

##  [1] 1000000000  774263683  599484250  464158883  359381366  278255940
##  [7]  215443469  166810054  129154967  100000000

cv.out$cvm[1:10]

##  [1] 6371830234 6368237538 6363595405 6357609417 6349894111 6339955846
##  [7] 6327164001 6310715582 6289592404 6262510512

cv.out$cvsd[1:10]

##  [1] 362266384 362163002 362029441 361857126 361634881 361348353 360979141
##  [8] 360503704 359892009 359105881

We can plot the cross-validation curve. The plot displays the Mean Squared Error (MSE) curve (red dotted line) along with standard deviation bounds (error bars). The two vertical dotted lines represent lambda.min (minimizes MSE) and lambda.1se (the most regularized model within one standard error of the minimum).

par(mfrow = c(1, 1))
plot(cv.out)

Finally, we isolate the optimal $\lambda$ and fit our selected model, checking the final amount of shrinkage applied.

# Identify the best lambda value
i.bestlam <- which.min(cv.out$cvm)
bestlam <- cv.out$lambda[i.bestlam]

# Note: this is identical to calling cv.out$lambda.min
bestlam

## [1] 21544.35

# Fit the selected model with all the available data
beta.R <- predict(ridge.mod, s = bestlam, type = "coefficients")

# Check the final amount of shrinkage relative to OLS
sqrt(sum(beta.R[-1, 1]^2)) / ell_2.ls

## [1] 0.1325611

We have shrunk the $L_2$ norm to 13% of its original value.

3. Lasso Regression

3.1 Lasso Estimation

Next, we explore Lasso regression by setting alpha = 1. In addition to shrinking coefficients, Lasso can force coefficients exactly to zero, performing automated variable selection. We also specify thresh = 1e-10, which tightly controls the convergence threshold for the coordinate descent algorithm used by glmnet.

lasso.mod <- glmnet(X, y, alpha = 1, lambda = grid, thresh = 1e-10)

## Warning in .resolve_control(control = control, nvars = nvars, deprecated =
## list(thresh = if (!missing(thresh)) thresh, : Passing 'thresh' to glmnet() is
## deprecated. Use control = list(thresh = ...) instead.

par(mfrow = c(1, 3))
plot(lasso.mod, xvar = "lambda", label = TRUE)
plot(lasso.mod, xvar = "norm", label = TRUE)

## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

plot(lasso.mod, xvar = "dev", label = TRUE)

## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

Notice in the plots that many lines hit exactly zero at varying values of $\lambda$. We can verify this numerically by counting the non-zero coefficients at a specific index.

i <- 50 # Try values like 10, 30, 50, 70, 90 to see the sparsity change

beta.L <- round(coef(lasso.mod)[, i], 4)
beta.L <- beta.L[-1] # remove the intercept

# Number of non-zero coefficients
sum(beta.L != 0)

## [1] 32

# Calculate L1 norm shrinkage relative to OLS
sum(abs(beta.L)) / ell_1.ls

## [1] 0.02819341

With lambda set equal to ‘r lasso.mod$lambda[i]’, the fraction of of selected variable is of only ’sum(beta.L != 0)/ncol(X).

3.2 Cross-Validation for Lasso

We apply 10-fold cross-validation to select the optimal $\lambda$ for Lasso. Notice we do not explicitly supply the lambda grid here; it is often better to let glmnet build a custom sequence optimized for Lasso’s sparsity because it exploits warm starts. Indeed, as you may notice, glmnet is incredibly efficient and runs very quickly, even when evaluating across many lambda values and 10 folds.

set.seed(10)
cv.out.lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)

par(mfrow = c(1, 1))
plot(cv.out.lasso)

bestlam.lasso <- cv.out.lasso$lambda.min
bestlam.lasso

## [1] 858.0304

Using the optimal $\lambda$, we extract the final model parameters and evaluate how many variables the Lasso completely removed from our massive predictor matrix.

# Refit the lasso model with automatic lambda sequence on all data
lasso.mod.auto <- glmnet(X, y, alpha = 1)
beta.L.best <- predict(lasso.mod.auto, s = bestlam.lasso, type = "coefficients")

# Number of non-zero coefficients (how many features survived?)
sum(beta.L.best[-1, 1] != 0)

## [1] 105

# Check the final amount of L1 shrinkage 
sum(abs(beta.L.best[-1, 1])) / ell_1.ls

## [1] 0.09624803

3.3 Variable selection

In the Lasso framework, it is often preferable to rely on a more parsimonious model that achieves a similar error rate (specifically, within one standard error of the minimum MSE). This greatly simplifies interpretation by isolating only the main drivers of the target variable.

Therefore, we select the lambda.1se value and print the resulting non-zero coefficients.

# Extract the lambda within 1 standard error of the minimum
bestlam.1se <- cv.out.lasso$lambda.1se

# Extract coefficients using the 1-SE lambda
beta.L.1se <- predict(cv.out.lasso, s = bestlam.1se, type = "coefficients")

# Filter and print only the non-zero coefficients for easier interpretation
non_zero_coeffs <- beta.L.1se[beta.L.1se[, 1] != 0, ]
non_zero_coeffs

##                    (Intercept)                       Lot_Area 
##                  -1.062355e+06                   7.668697e-02 
##                Land_ContourHLS           NeighborhoodCrawford 
##                   1.091858e+03                   9.640941e+03 
##         NeighborhoodNorthridge NeighborhoodNorthridge_Heights 
##                   1.566855e+04                   1.845214e+04 
##        NeighborhoodStone_Brook                Bldg_TypeOneFam 
##                   2.040019e+04                   1.053676e+04 
##          Overall_QualExcellent     Overall_QualVery_Excellent 
##                   7.046092e+04                   8.096670e+04 
##          Overall_QualVery_Good               Overall_CondFair 
##                   1.976773e+04                  -5.978598e+03 
##                     Year_Built                 Year_Remod_Add 
##                   2.922360e+02                   2.690820e+02 
##                  Roof_StyleHip               Roof_MatlWdShngl 
##                   4.096688e+02                   8.542416e+01 
##                   Mas_Vnr_Area              Exter_QualTypical 
##                   1.019847e+01                  -6.633668e+03 
##                Bsmt_ExposureGd                Bsmt_ExposureNo 
##                   1.356446e+04                  -5.510536e+02 
##              BsmtFin_Type_1GLQ                   BsmtFin_SF_1 
##                   6.679853e+03                  -2.278525e+01 
##                  Total_Bsmt_SF                    Gr_Liv_Area 
##                   1.795301e+01                   4.852856e+01 
##                 Bsmt_Full_Bath                  Kitchen_AbvGr 
##                   4.431592e+03                  -3.894106e+03 
##            Kitchen_QualTypical                  FunctionalTyp 
##                  -5.684014e+03                   3.800630e+03 
##                     Fireplaces                    Garage_Cars 
##                   8.563160e+03                   8.573784e+03 
##                    Garage_Area                   Screen_Porch 
##                   1.032046e+01                   1.350973e+01 
##                    Pool_QCGood                       Misc_Val 
##                  -1.050901e+04                  -4.583767e-01 
##                   Sale_TypeNew 
##                   5.473943e+03

length(non_zero_coeffs)

## [1] 35

With only 35 variables, we are able to achieve similar prediction performance while simultaneously creating a much simpler and more interpretable pricing algorithm.

Because the variables are on different scales (e.g., square footage vs. categorical dummy variables), and the glmnet output coefficients are scaled back to their original scale, we cannot simply look at the raw glmnet output to determine feature importance. Instead, we must calculate the standardized coefficients, which represent the change in house price for a one-standard-deviation increase in the predictor.

# Identify the names of the selected variables (excluding the intercept)
selected_vars <- names(non_zero_coeffs[-1])

# Calculate standard deviations ONLY for the selected predictors
sd_X_selected <- apply(X[, selected_vars], 2, sd)

# Calculate standardized coefficients
coef_std <- non_zero_coeffs[-1] * sd_X_selected

# Sort by absolute magnitude to find the most important drivers
sort(abs(coef_std), decreasing = TRUE)[1:5]

##                Gr_Liv_Area      Overall_QualExcellent 
##                  24531.620                  13219.108 
##                 Year_Built Overall_QualVery_Excellent 
##                   8838.782                   8285.484 
##              Total_Bsmt_SF 
##                   7916.705

Looking at the standardized coefficients, the most relevant feature driving the house price is the above-ground living area (Gr_Liv_Area). For example, a one-standard-deviation increase in Gr_Liv_Area (corresponding to an increase of 506 sq. feet) is associated with a roughly $24,532 increase in the predicted price. Other critical drivers relate strictly to the overall quality and age of the house: having an “Excellent” or “Very Excellent” overall quality rating (Overall_QualExcellent, Overall_QualVery_Excellent) substantially boosts the property’s value, as do newer construction years (Year_Built) and larger basement spaces (Total_Bsmt_SF).

4. Elastic Net Example

Elastic Net acts as a compromise between Ridge and Lasso. By setting alpha to a value between 0 and 1, we apply a mix of both $L_1$ and $L_2$ penalties.

This is particularly useful for datasets like Ames housing. When multiple features are highly correlated (e.g., Garage_Area and Garage_Cars), Lasso tends to randomly select one and drop the others. Ridge keeps them all but shrinks them together. Elastic Net will often select the correlated group and keep them together while still dropping completely unhelpful variables.

Here, we set alpha = 0.5 for an even 50/50 mix of the penalties.

set.seed(42)

# Fit cross-validated Elastic Net
cv.out.enet <- cv.glmnet(X, y, alpha = 0.5, nfolds = 10)

plot(cv.out.enet)

# Extract optimal lambda
bestlam.enet <- cv.out.enet$lambda.min

# Check sparsity of the Elastic Net
beta.EN <- predict(cv.out.enet, s = bestlam.enet, type = "coefficients")

Notice how Elastic Netdrops variables, but generally fewer than pure Lasso.

sum(beta.EN[-1, 1] != 0)

## [1] 169

5. Model Comparison

Ideally, we would need a fresh, unseen test set to properly compare the final performance of the three methods. However, we can look at the minimum and 1-standard-error cross-validation errors (MSE) of all three models to get a rough idea of their relative predictive accuracy.

# Extract the minimum and 1-SE cross-validation errors for Ridge
mse.ridge.min <- cv.out$cvm[cv.out$lambda == cv.out$lambda.min]
mse.ridge.1se <- cv.out$cvm[cv.out$lambda == cv.out$lambda.1se]

# Extract the minimum and 1-SE cross-validation errors for Lasso
mse.lasso.min <- cv.out.lasso$cvm[cv.out.lasso$lambda == cv.out.lasso$lambda.min]
mse.lasso.1se <- cv.out.lasso$cvm[cv.out.lasso$lambda == cv.out.lasso$lambda.1se]

# Extract the minimum and 1-SE cross-validation errors for Elastic Net
mse.enet.min <- cv.out.enet$cvm[cv.out.enet$lambda == cv.out.enet$lambda.min]
mse.enet.1se <- cv.out.enet$cvm[cv.out.enet$lambda == cv.out.enet$lambda.1se]

# Create a clean data frame to display the comparison
comparison_table <- data.frame(
  Model = c("Ridge", "Lasso", "Elastic Net (alpha=0.5)"),
  MSE_min = c(mse.ridge.min, mse.lasso.min, mse.enet.min),
  MSE_1se = c(mse.ridge.1se, mse.lasso.1se, mse.enet.1se)
)

# Print the comparison
comparison_table

##                     Model   MSE_min    MSE_1se
## 1                   Ridge 845213971 1020466596
## 2                   Lasso 870343220  997930426
## 3 Elastic Net (alpha=0.5) 841270631  966763405

Looking at the cross-validation errors, Elastic Net emerges as the strongest overall performer for this dataset, achieving both the lowest minimum error (MSE_min) and the lowest regularized error (MSE_1se).

While Ridge slightly outperforms Lasso in raw predictive accuracy at the minimum error threshold, it does not perform variable selection, leaving us with a highly complex and difficult-to-interpret model containing hundreds of predictors. Lasso provides a much more interpretable, parsimonious model but sacrifices a small amount of predictive accuracy compared to Ridge.

Therefore, Elastic Net may represent the optimal compromise for the Ames dataset: it leverages the grouping effect of Ridge to handle the highly correlated housing features (yielding the best overall predictive accuracy), while still applying enough Lasso penalty to drop unhelpful variables.

The `UScrime` dataset

Criminologists are interested in the effect of punishment regimes on crime rates. This has been studied using aggregate data on 47 states of the USA for 1960. The data set contains the following columns:

M percentage of males aged 14–24 in total state population
So indicator variable for a southern state
Ed mean years of schooling of the population aged 25 years or over
Po1 per capita expenditure on police protection in 1960
Po2 per capita expenditure on police protection in 1959
LF labour force participation rate of civilian urban males in the age-group 14-24
M.F number of males per 100 females
Pop state population in 1960 in hundred thousands
NW percentage of nonwhites in the population
U1 unemployment rate of urban males 14–24
U2 unemployment rate of urban males 35–39
Wealth wealth: median value of transferable assets or family income
Ineq income inequality: percentage of families earning below half the median income
Prob probability of imprisonment: ratio of number of commitments to number of offenses
Time average time in months served by offenders in state prisons before their first release
Crime crime rate: number of offenses per 100,000 population in 1960

UScrime <- read.table("UScrime.txt", head=TRUE)
str(UScrime)

## 'data.frame':    47 obs. of  16 variables:
##  $ M     : num  15.1 14.3 14.2 13.6 14.1 12.1 12.7 13.1 15.7 14 ...
##  $ So    : int  1 0 1 0 0 0 1 1 1 0 ...
##  $ Ed    : num  9.1 11.3 8.9 12.1 12.1 11 11.1 10.9 9 11.8 ...
##  $ Po1   : num  5.8 10.3 4.5 14.9 10.9 11.8 8.2 11.5 6.5 7.1 ...
##  $ Po2   : num  5.6 9.5 4.4 14.1 10.1 11.5 7.9 10.9 6.2 6.8 ...
##  $ LF    : num  0.51 0.583 0.533 0.577 0.591 0.547 0.519 0.542 0.553 0.632 ...
##  $ M.F   : num  95 101.2 96.9 99.4 98.5 ...
##  $ Pop   : int  33 13 18 157 18 25 4 50 39 7 ...
##  $ NW    : num  30.1 10.2 21.9 8 3 4.4 13.9 17.9 28.6 1.5 ...
##  $ U1    : num  0.108 0.096 0.094 0.102 0.091 0.084 0.097 0.079 0.081 0.1 ...
##  $ U2    : num  4.1 3.6 3.3 3.9 2 2.9 3.8 3.5 2.8 2.4 ...
##  $ Wealth: int  3940 5570 3180 6730 5780 6890 6200 4720 4210 5260 ...
##  $ Ineq  : num  26.1 19.4 25 16.7 17.4 12.6 16.8 20.6 23.9 17.4 ...
##  $ Prob  : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
##  $ Time  : num  26.2 25.3 24.3 29.9 21.3 ...
##  $ Crime : int  791 1635 578 1969 1234 682 963 1555 856 705 ...

1. Provide a visual representation and comment on the distribution of the response variable `Crime`. Compute and print or plot the correlation matrix for all variables. Identify any obvious structural problems among the predictors. Finally, report the marginal correlation between `Po2` and the response, and between `Ineq` and the response, and comment on the result.

hist(UScrime$Crime, main="Histogram of Crime Rate", xlab="Crime Rate", breaks=10, col="lightblue")

cor_matrix <- cor(UScrime)
corrplot(cor_matrix, method="color", type="upper", tl.cex=0.7)

cor_Po2 <- cor(UScrime$Po2, UScrime$Crime)
cor_Ineq <- cor(UScrime$Ineq, UScrime$Crime)

cat("Cor(Po2, y):", cor_Po2, "\n")

## Cor(Po2, y): 0.6667141

cat("Cor(Ineq, y):", cor_Ineq, "\n")

## Cor(Ineq, y): -0.1790237

The histogram shows that the crime rate is strictly positive, unimodal, and mildly right-skewed. Because of this skewness, the Gaussian assumption of the error terms in a standard OLS model should be carefully checked via diagnostic plots after fitting the model.

The correlation matrix reveals blocks of highly correlated predictors, indicating possible structural problems (e.g., Po1 and Po2, or U1 and U2). As expected, Po2 (police expenditure the year before) is strongly positively correlated with the response. However, rather than implying that more police causes more crime, this is likely a spurious correlation driven by confounding variables (e.g., with historically higher crime rates naturally require larger police budgets every year). Conversely, Ineq (Income inequality) exhibits a negative marginal correlation with the crime rate, suggesting that a larger percentage of poor families implies less crime. However, even in this case, this causal interpretation is counterintuitive and is likely the result of confounding bias.

In absolute terms, Po2 is more highly correlated with the response than Ineq. However, we must interpret these pairwise associations with caution, as marginal correlations ignore the presence of all other variables in the Data-Generating Process.

2. Fit a full multiple linear regression model predicting the crime rate using all available features. Looking at the summary, which variables are statistically significant at the 0.05 level? Reconcile this result with your findings from Q1 regarding `Po2` and `Ineq`. How do you explain this phenomenon statistically? Why do the standard errors of the coefficients behave this way in this specific dataset? Why does this render our hypothesis tests (t-tests) unreliable, even if the model’s overall $R^2$ remains very high?

full.mod <- lm(Crime ~ ., data = UScrime)
summary(full.mod)

## 
## Call:
## lm(formula = Crime ~ ., data = UScrime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.74  -98.09   -6.69  112.99  512.67 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.984e+03  1.628e+03  -3.675 0.000893 ***
## M            8.783e+01  4.171e+01   2.106 0.043443 *  
## So          -3.803e+00  1.488e+02  -0.026 0.979765    
## Ed           1.883e+02  6.209e+01   3.033 0.004861 ** 
## Po1          1.928e+02  1.061e+02   1.817 0.078892 .  
## Po2         -1.094e+02  1.175e+02  -0.931 0.358830    
## LF          -6.638e+02  1.470e+03  -0.452 0.654654    
## M.F          1.741e+01  2.035e+01   0.855 0.398995    
## Pop         -7.330e-01  1.290e+00  -0.568 0.573845    
## NW           4.204e+00  6.481e+00   0.649 0.521279    
## U1          -5.827e+03  4.210e+03  -1.384 0.176238    
## U2           1.678e+02  8.234e+01   2.038 0.050161 .  
## Wealth       9.617e-02  1.037e-01   0.928 0.360754    
## Ineq         7.067e+01  2.272e+01   3.111 0.003983 ** 
## Prob        -4.855e+03  2.272e+03  -2.137 0.040627 *  
## Time        -3.479e+00  7.165e+00  -0.486 0.630708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
## F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07

Despite Po2 having a very high marginal correlation with Crime in Question 1, its coefficient is entirely insignificant in the multiple regression model (p-value > 0.05). Similarly, the structural significance of Ineq changes when evaluated alongside all other features. This occurs because marginal correlation measures simple, isolated association, whereas a multiple regression coefficient measures the partial effect (the effect of a variable holding all other covariates constant). Because Po1 and Po2 contain nearly identical information, Po2 adds no unique explanatory power to the DGP once Po1 is already accounted for in the model.

The standard errors for these coefficients become large due to severe multicollinearity (variance inflation). When predictors are highly correlated, the design matrix $X^T X$ becomes nearly singular (non-invertible). This mathematically destroys the structural identifiability of the individual parameters, massively inflating the variance of the OLS estimator. The model’s overall predictive power ($R^2$) remains high because the overall regression hyperplane is well-estimated, but the specific slopes along those collinear axes cannot be precisely isolated, rendering their individual t-tests meaningless.

3. Perform a stepwise backward selection based on the Variance Inflation Factor (VIF). At each step, iteratively remove the variable with the highest VIF until fitting a model with only the intercept and two predictors. Compare all the $p-1$ models you constructed using a suitable goodness-of-fit criterion that accounts for the bias-variance trade-off. Motivate your choice of criterion.

Hint: Use the `vif()` function from the `car` package to calculate the variance inflation factors. You can initialize an R list, such as `models_list <- list()`, to store the sequence of fitted models (e.g., `models_list[[15]] <- lm(Crime ~ ., data = UScrime)`). If you are running short on time, you may stop the backward selection process early as soon as your chosen goodness-of-fit criterion starts to worsen.

library(car)

## Warning: package 'car' was built under R version 4.6.1

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.6.1

## Registered S3 method overwritten by 'car':
##   method           from
##   na.action.merMod lme4

## 
## Attaching package: 'car'

## The following object is masked from 'package:boot':
## 
##     logit

## The following objects are masked from 'package:faraway':
## 
##     logit, vif

# Ensure the full model is fit first
full.mod <- lm(Crime ~ ., data = UScrime)

# Initialize a list to store the models
# We will use the index to represent the number of predictors in the model
models_list <- list()
models_list[[15]] <- full.mod 

aic = rep(NA, 15)
aic[15] = AIC(full.mod)

current_mod <- full.mod

# Loop backwards starting from 15 predictors down to 3.
# In each iteration, we drop one variable, leaving a model with i-1 predictors.
# The final iteration (i=3) will drop a variable and leave exactly 2 predictors.
for (i in 15:3) {
  # Calculate VIFs for the current model
  vifs <- vif(current_mod)
  
  # Identify the variable with the maximum VIF
  var_to_remove <- names(which.max(vifs))
  
  # Update the model by removing that specific variable
  current_mod <- update(current_mod, paste(". ~ . -", var_to_remove))
  
  # Store the new model at the index corresponding to its number of predictors
  models_list[[i - 1]] <- current_mod
  
  # Calculate AIC for the current model
  aic[i-1] = AIC(current_mod)
}

# Extract the p-1 models (models with 2 to 15 predictors) to compare
valid_models <- models_list[2:15]

# Find the index of the model with the lowest AIC
best_num_predictors <- which.min(aic)
best_lm_model = models_list[[best_num_predictors]]

cat("Best model has", best_num_predictors, "predictors with an AIC of", min(aic, na.rm=T), "\n")

## Best model has 13 predictors with an AIC of 648.4461

Comparing the models using the training MSE or standard $R^2$ is structurally flawed, as these metrics will always favor the more complex full model even if it is overfitting the 47 observations.

Instead, we must use a penalized likelihood criterion like AIC, Adjusted $R^2$, Mallows’ $C_p$, or BIC, which explicitly accounts for the bias-variance trade-off. By adding a penalty for the number of parameters, these criteria estimate the model’s out-of-sample generalization.

Looking at the results, the model with 13 predictors has the lowest AIC, indicating that selectively dropping the two highly collinear variables removed more noise and variance than it did actual structural signal, ultimately yielding a model that better represents the underlying Data-Generating Process.

4. Look at the summary output of your selected model and comment on the p-values. From a statistical learning perspective, are these hypothesis tests reliable or overconfident? Motivate your answer based on how the model was constructed.

summary(best_lm_model)

## 
## Call:
## lm(formula = Crime ~ M + So + Ed + Po1 + LF + M.F + Pop + NW + 
##     U1 + U2 + Ineq + Prob + Time, data = UScrime)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -469.4  -93.1   12.6  117.3  506.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6041.0176  1515.7345  -3.986 0.000351 ***
## M              84.0350    40.8957   2.055 0.047879 *  
## So             35.2894   143.7092   0.246 0.807543    
## Ed            185.9198    59.8202   3.108 0.003861 ** 
## Po1           105.0940    21.7659   4.828 3.06e-05 ***
## LF           -127.9865  1392.3561  -0.092 0.927317    
## M.F            20.1254    20.1066   1.001 0.324141    
## Pop            -0.6822     1.2761  -0.535 0.596494    
## NW              1.3912     6.0482   0.230 0.819502    
## U1          -5748.4126  4146.8729  -1.386 0.174980    
## U2            180.7362    80.8400   2.236 0.032251 *  
## Ineq           60.7323    17.9172   3.390 0.001829 ** 
## Prob        -4517.0792  2160.3360  -2.091 0.044315 *  
## Time           -0.5337     6.6346  -0.080 0.936366    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 207.9 on 33 degrees of freedom
## Multiple R-squared:  0.7927, Adjusted R-squared:  0.711 
## F-statistic: 9.707 on 13 and 33 DF,  p-value: 7.32e-08

Looking at the summary of the reduced model, the p-values for the remaining predictors (like Po1 and U2) appear now significant.

However, from a statistical learning perspective, these hypothesis tests are overconfident. This occurs due to the “double-dipping” (or selective inference) trap. We used the data to manually search for the best model structure by repeatedly checking VIFs and evaluating the remaining variables. We are now using that exact same data to test the significance of those variables, pretending the model structure was fixed before we ever saw the data. Because the search strategy already filtered out the “weak” variables, classical p-values no longer follow the assumed theoretical distributions.

5. Generate the standard diagnostic plots for your final reduced model. Comment on the outputs, paying attention to the assumptions of the linear model. Based on these plots and the nature of the data, motivate if a different structural specification would be beneficial (you are not required to fit the new model).

par(mfrow=c(2,2))
plot(best_lm_model)

par(mfrow=c(1,1))

Looking at the four standard diagnostic plots, we can evaluate the structural assumptions of our OLS model:

Residuals vs Fitted: The plot seems to show a slight non-linear pattern in the residuals, suggesting that the assumption of a strictly linear relationship may not perfectly capture the underlying Data-Generating Process (DGP).
Normal Q-Q: According to the behavior of the residuals in the Q-Q plot, the normality assumption of the error terms seems to be reasonably satisfied.
Scale-Location: This plot is also satisfying; the spread of the residuals remains relatively constant across the fitted values, meaning the homoscedasticity (constant variance) assumption seems to be satisfied.
Residuals vs Leverage: The plot does not show the presence of extreme outliers. Furthermore, applying the rule of thumb that identifies high-leverage points as those with a leverage statistic greater than $3(p+1)/n$, we can conclude there are no concerning high-leverage points disproportionately anchoring the regression hyperplane.

Because the Gaussian error assumptions (normality and homoscedasticity) seem satisfied, standard OLS inference remains mathematically valid.

6. Given the near-collinear variables identified previously, which kind of alternative estimator could work better than OLS with manual VIF-dropping? Why? Estimate a model using this approach, relying on Leave-One-Out Cross-Validation (LOOCV) to tune the method’s hyperparameters. After choosing the optimal hyperparameter, fit the final model on the entire dataset.

Hint: You may need to define the predictors as a numeric design matrix without intercept using X <- model.matrix(Crime ~ ., data = UScrime)[,-1]

library(glmnet)
# Define matrices for glmnet
X <- model.matrix(Crime ~ ., data = UScrime)[,-1]
Y <- UScrime$Crime

# Ridge regression (alpha = 0) tuned via LOOCV (nfolds = n)
cv.ridge <- cv.glmnet(X, Y, alpha = 0, nfolds = nrow(X))

## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold

plot(cv.ridge)

best_lambda <- cv.ridge$lambda.min


# Fit final Ridge on all data
ridge.final <- glmnet(X, Y, alpha = 0)
plot(ridge.final)
abline(v = -log(best_lambda))

Rather than manually dropping collinear variables (which discards potentially useful information), Ridge Regression is the ideal alternative estimator here. Ridge mathematically stabilizes the inversion of the highly correlated $X^T X$ matrix by adding an $L2$ penalty ($\lambda I$) to the diagonal. This shrinks the coefficients of highly correlated variables (like Po1 and Po2) proportionally toward zero without forcing them out of the model, allowing us to borrow strength across all predictors safely.

We used Leave-One-Out Cross-Validation (setting the number of folds equal to the number of observations, $n=47$) to find the optimal penalty term $\lambda$ that minimizes the out-of-sample prediction error.

7. Calculate the Mean Squared Error (MSE) of the best selected linear model from ex 3 on the full training set and via LOOCV. Compare these results with the full MSE and the LOOCV MSE obtained in Question 6 and comment the results.

Hint: Use the function `cv.glm` from the package `boot` to compute the leave-one-out cv error for the linear model

library(boot)

# 1. Training MSE of Reduced Model
train_mse_reduced <- mean(best_lm_model$residuals^2)

# 2. LOOCV MSE of Reduced Model using cv.glm()
glm.reduced <- glm(formula(best_lm_model), data = UScrime)
loocv_reduced <- cv.glm(UScrime, glm.reduced)$delta[1]

# 3. Training MSE of Ridge model
# (Using the X matrix and Y vector defined in Question 6)
ridge_train_preds <- predict(ridge.final, s = best_lambda, newx = X)
train_mse_ridge <- mean((Y - ridge_train_preds)^2)

# 4. LOOCV MSE of Ridge (from cv.glmnet output)
loocv_ridge <- cv.ridge$cvm[cv.ridge$lambda == best_lambda]

cat("Linear Model Train MSE:", train_mse_reduced, "\n")

## Linear Model Train MSE: 30349.95

cat("Linear Model LOOCV MSE:", loocv_reduced, "\n")

## Linear Model LOOCV MSE: 70638.74

cat("Ridge Train MSE:", train_mse_ridge, "\n")

## Ridge Train MSE: 35570.69

cat("Ridge LOOCV MSE:", loocv_ridge, "\n")

## Ridge LOOCV MSE: 67888.93

The training MSE of the reduced linear model is artificially low because it is evaluated on the exact same 47 states used to estimate the coefficients. Furthermore, it is lower than the training MSE of the Ridge estimator because standard OLS explicitly minimizes the sum of squared residuals without constraints, making it highly flexible and prone to memorizing the training data. By contrast, Ridge regression penalizes the coefficients, intentionally sacrificing some training fit (resulting in a higher training MSE) to restrict the model’s flexibility.

However, when we evaluate both models via LOOCV, the linear model’s MSE increases significantly, revealing its true out-of-sample generalization error. The LOOCV MSE of the Ridge model is ultimately lower than the LOOCV MSE of the reduced linear model.

This perfectly demonstrates the bias-variance trade-off in action. The classical subset selection approach remains too unstable for a dataset with only 47 observations. By deliberately introducing a small amount of bias via shrinkage, Ridge regression achieves a massive reduction in the model’s variance, ultimately generalizing to unseen data much more effectively.

Statistical learning Mod B Exam preparation

Anoosha Qaisar

2026-06-17

Task 8: ANOVA and Tukey HSD

Function for the computation of performance measures

Arguments:

true.values = binary vector of true values

pred.values = binary vector of predicted values

lab.pos = label of the positive class

1. **The Hitters Data**

2. Subset Selection via Analytical Indexes

3. Forward Stepwise Selection via Held-Out Approximation

3.1 The Validation Set Approach

3.2 k-Fold Cross-Validation

4. Fast Leave-One-Out Cross-Validation (LOOCV)

4.1 Cross Validation function

4.2 The LOOCV Analytic Shortcut

Appendix: variables in Hitter dataset

The Ames Housing Dataset

1. Baseline: Least Squares (OLS)

2. Ridge Regression

2.1 Introducing `glmnet`

2.2 Fitting the Ridge Model

2.3 Inspecting Coefficients

2.4 Making Predictions

2.5 Cross-Validation for Ridge

3. Lasso Regression

3.1 Lasso Estimation

3.2 Cross-Validation for Lasso

3.3 Variable selection

4. Elastic Net Example

5. Model Comparison

The `UScrime` dataset

4. Look at the summary output of your selected model and comment on the p-values. From a statistical learning perspective, are these hypothesis tests reliable or overconfident? Motivate your answer based on how the model was constructed.

Hint: You may need to define the predictors as a numeric design matrix without intercept using X <- model.matrix(Crime ~ ., data = UScrime)[,-1]

7. Calculate the Mean Squared Error (MSE) of the best selected linear model from ex 3 on the full training set and via LOOCV. Compare these results with the full MSE and the LOOCV MSE obtained in Question 6 and comment the results.

Hint: Use the function `cv.glm` from the package `boot` to compute the leave-one-out cv error for the linear model

Statistical learning Mod B Exam preparation

Anoosha Qaisar

2026-06-17

Task 8: ANOVA and Tukey HSD

Function for the computation of performance measures

Arguments:

true.values = binary vector of true values

pred.values = binary vector of predicted values

lab.pos = label of the positive class

1. The Hitters Data

2. Subset Selection via Analytical Indexes

3. Forward Stepwise Selection via Held-Out Approximation

3.1 The Validation Set Approach

3.2 k-Fold Cross-Validation

4. Fast Leave-One-Out Cross-Validation (LOOCV)

4.1 Cross Validation function

4.2 The LOOCV Analytic Shortcut

Appendix: variables in Hitter dataset

The Ames Housing Dataset

1. Baseline: Least Squares (OLS)

2. Ridge Regression

2.1 Introducing glmnet

2.2 Fitting the Ridge Model

2.3 Inspecting Coefficients

2.4 Making Predictions

2.5 Cross-Validation for Ridge

3. Lasso Regression

3.1 Lasso Estimation

3.2 Cross-Validation for Lasso

3.3 Variable selection

4. Elastic Net Example

5. Model Comparison

The UScrime dataset

4. Look at the summary output of your selected model and comment on the p-values. From a statistical learning perspective, are these hypothesis tests reliable or overconfident? Motivate your answer based on how the model was constructed.

Hint: You may need to define the predictors as a numeric design matrix without intercept using X <- model.matrix(Crime ~ ., data = UScrime)[,-1]

7. Calculate the Mean Squared Error (MSE) of the best selected linear model from ex 3 on the full training set and via LOOCV. Compare these results with the full MSE and the LOOCV MSE obtained in Question 6 and comment the results.

Hint: Use the function cv.glm from the package boot to compute the leave-one-out cv error for the linear model

1. **The Hitters Data**

2.1 Introducing `glmnet`

The `UScrime` dataset

Hint: Use the function `cv.glm` from the package `boot` to compute the leave-one-out cv error for the linear model