###Question 1
We discussed this example in class using the following dummy variable Origin[Internal]=1, if Origin=``Internal”; =0, otherwise, and considered the following interaction model:
The definition on the alpha coefficients Origin[Internal]=1, if Origin=``Internal”; =-1, otherwise,
The relationship between the alphas and the betas
We begin by defining independent variables in our case: When we have origin [Internal] = 1: rating = 𝛽0 + 𝛽1 + 𝛽2salary + 𝛽3salary + 𝜀.
and When Origin[Internal] will imply 0: rating = 𝛽0 + 𝛽2𝑆salary + 𝜀.
Introducing the independent value of second model When we have Origin[Internal] = 1: rating = 𝛼0 + 𝛼1+ 𝛼2salary + 𝛼3salary + 𝜀.
provided that Origin[Internal] =-1: Rating = 𝛼0 - 𝛼1 + 𝛼2𝑆Salary - 𝛼3salary + 𝜀.
We set the Rating equal for both the case in which the origin is internal or external:
Origin[External]: 𝛽0 + 𝛽2salary + 𝜀 = 𝛼0 - 𝛼1 + 𝛼2salary - 𝛼3 salary + 𝜀
Origin[External]: 𝛽0 + 𝛽2salary= 𝛼0 - 𝛼1 + (𝛼2- 𝛼3)salary
We find 𝛽0= 𝛼0 - 𝛼1 , 𝛽2= 𝛼2- 𝛼3
Alternatively 𝛽3= 2𝛼3 𝛽1= 2𝛼1
###Question 2
library(haven)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
data <- read.table("C:/Users/User/Desktop/ProdTime.dat", header=TRUE,sep=",")
head(data)
dim(data)
## [1] 61 3
data$Manager <- as.factor(data$Manager)
aggregate(Time.for.Run~Manager,data,sum)
aggregate(Run.Size~Manager,data,sum)
Manager a is the best Manager in terms of bothe Time for Run and Run Size.
###Question 3
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.2.1
auto_data <- ISLR::Auto
dim(auto_data)
## [1] 392 9
head(auto_data)
#Q3.1
pairs(auto_data)
summary(auto_data)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
#Q3.2
model1 <- lm(mpg ~ year, data = auto_data)
summary(model1)
##
## Call:
## lm(formula = mpg ~ year, data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0212 -5.4411 -0.4412 4.9739 18.2088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -70.01167 6.64516 -10.54 <2e-16 ***
## year 1.23004 0.08736 14.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.363 on 390 degrees of freedom
## Multiple R-squared: 0.337, Adjusted R-squared: 0.3353
## F-statistic: 198.3 on 1 and 390 DF, p-value: < 2.2e-16
Year is very significant in predicting mpg since it p_value is less than 0.05. It has a positive effect on mpg since the beta coefficient is positive.
model2 <- lm(mpg ~ year + horsepower, data = auto_data)
summary(model2)
##
## Call:
## lm(formula = mpg ~ year + horsepower, data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0768 -3.0783 -0.4308 2.5884 15.3153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.739166 5.349027 -2.382 0.0177 *
## year 0.657268 0.066262 9.919 <2e-16 ***
## horsepower -0.131654 0.006341 -20.761 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.388 on 389 degrees of freedom
## Multiple R-squared: 0.6855, Adjusted R-squared: 0.6839
## F-statistic: 423.9 on 2 and 389 DF, p-value: < 2.2e-16
plot(model2)
Year is still very significant in predicting mpg and with a positive relationship with mpg.
The Residuals are concentrated at 0 and widely stretched as the fitted value increase.
#95% CI’s The CI is now narrower in ii compared to i because by adding horsepower it defines the variance more making the confidence interval narrow when the dependent variable mpg can be explained more.
model3 <- lm(mpg ~ year*horsepower, data = auto_data)
summary(model3)
##
## Call:
## lm(formula = mpg ~ year * horsepower, data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.3492 -2.4509 -0.4557 2.4056 14.4437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.266e+02 1.212e+01 -10.449 <2e-16 ***
## year 2.192e+00 1.613e-01 13.585 <2e-16 ***
## horsepower 1.046e+00 1.154e-01 9.063 <2e-16 ***
## year:horsepower -1.596e-02 1.562e-03 -10.217 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.901 on 388 degrees of freedom
## Multiple R-squared: 0.7522, Adjusted R-squared: 0.7503
## F-statistic: 392.5 on 3 and 388 DF, p-value: < 2.2e-16
The year effect is very significant on the model having year and horsepower as interaction also as an interaction factor with horsepower it is as well very significant. The interaction effect is also very significant since its p_value is less than 0.05.
#Q3.3
model4 <- lm(mpg ~ horsepower + cylinders, data=auto_data)
summary(model4)
##
## Call:
## lm(formula = mpg ~ horsepower + cylinders, data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4378 -3.2422 -0.3721 2.3532 16.9289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.94842 0.77880 55.147 < 2e-16 ***
## horsepower -0.08612 0.01119 -7.693 1.19e-13 ***
## cylinders -1.91982 0.25261 -7.600 2.24e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.584 on 389 degrees of freedom
## Multiple R-squared: 0.6569, Adjusted R-squared: 0.6551
## F-statistic: 372.4 on 2 and 389 DF, p-value: < 2.2e-16
At a significance level of 0.01 cylinders is a strongly significant variable in predicting mpg.
levels(as.factor(auto_data$cylinders))
## [1] "3" "4" "5" "6" "8"
model5 <- lm(mpg ~ horsepower + as.factor(cylinders), data=auto_data)
summary(model5)
##
## Call:
## lm(formula = mpg ~ horsepower + as.factor(cylinders), data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5917 -2.7067 -0.6102 1.9001 16.3258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.77614 2.41283 12.755 < 2e-16 ***
## horsepower -0.10303 0.01133 -9.095 < 2e-16 ***
## as.factor(cylinders)4 6.57344 2.16921 3.030 0.00261 **
## as.factor(cylinders)5 5.07367 3.26661 1.553 0.12120
## as.factor(cylinders)6 -0.34406 2.18580 -0.157 0.87501
## as.factor(cylinders)8 0.49738 2.27639 0.218 0.82716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.27 on 386 degrees of freedom
## Multiple R-squared: 0.7046, Adjusted R-squared: 0.7008
## F-statistic: 184.1 on 5 and 386 DF, p-value: < 2.2e-16
Cylinders category 4 is very significant because its p_value is less than 0.01 it is very significant in predicting mpg.
anova(model4, model5)
Model with cylinders as factor is performing better the model with it as numeric continuous variable. This conclusion is from the significant p_value in model 2 and the reduced Residual Sum of Squares.
###Question 4
crime <- read.csv("C:/Users/User/Desktop/CrimeData_sub.csv", stringsAsFactors = F, na.strings = c("?"))
crime <- na.omit(crime)
head(crime)
summary(crime)
## state fold population household.size
## Length:368 Min. : 1.000 Min. : 9.211 Min. :1.60
## Class :character 1st Qu.: 3.000 1st Qu.:10.028 1st Qu.:2.45
## Mode :character Median : 6.000 Median :10.536 Median :2.75
## Mean : 5.576 Mean :10.606 Mean :2.80
## 3rd Qu.: 8.000 3rd Qu.:11.107 3rd Qu.:3.06
## Max. :10.000 Max. :15.064 Max. :5.28
## race.pctblack race.pctwhite race.pctasian race.pcthisp
## Min. : 0.110 Min. :10.58 Min. : 0.210 Min. : 0.680
## 1st Qu.: 1.278 1st Qu.:65.58 1st Qu.: 1.740 1st Qu.: 6.843
## Median : 3.345 Median :77.83 Median : 4.470 Median :13.940
## Mean : 7.536 Mean :74.65 Mean : 7.451 Mean :20.712
## 3rd Qu.: 9.340 3rd Qu.:88.51 3rd Qu.: 9.633 3rd Qu.:26.183
## Max. :54.830 Max. :98.91 Max. :57.460 Max. :93.110
## age.pct12to21 age.pct12to29 age.pct16to24 age.pct65up
## Min. : 4.58 Min. : 9.38 Min. : 4.64 Min. : 3.150
## 1st Qu.:11.34 1st Qu.:23.99 1st Qu.:10.84 1st Qu.: 7.503
## Median :13.51 Median :27.43 Median :12.60 Median :10.165
## Mean :13.38 Mean :27.06 Mean :13.10 Mean :12.093
## 3rd Qu.:15.10 3rd Qu.:29.98 3rd Qu.:14.56 3rd Qu.:14.053
## Max. :25.63 Max. :48.68 Max. :34.93 Max. :48.350
## num.urban pct.urban med.income pct.wage.inc
## Min. : 0.000 Min. : 0.0 Min. : 14579 Min. :39.25
## 1st Qu.: 9.713 1st Qu.:100.0 1st Qu.: 27043 1st Qu.:73.47
## Median :10.507 Median :100.0 Median : 33868 Median :79.16
## Mean : 9.237 Mean : 86.1 Mean : 36926 Mean :77.64
## 3rd Qu.:11.107 3rd Qu.:100.0 3rd Qu.: 42989 3rd Qu.:84.00
## Max. :15.064 Max. :100.0 Max. :123625 Max. :94.82
## pct.farmself.inc pct.inv.inc pct.socsec.inc pct.pubasst.inc
## Min. :0.0000 Min. : 9.02 Min. : 7.45 Min. : 1.100
## 1st Qu.:0.4800 1st Qu.:32.26 1st Qu.:18.47 1st Qu.: 4.037
## Median :0.6600 Median :41.02 Median :22.54 Median : 6.415
## Mean :0.7777 Mean :41.39 Mean :25.21 Mean : 8.034
## 3rd Qu.:0.9900 3rd Qu.:50.09 3rd Qu.:28.89 3rd Qu.:11.283
## Max. :3.0500 Max. :79.43 Max. :68.37 Max. :23.930
## pct.retire med.family.inc percap.inc white.percap
## Min. : 6.34 Min. : 18522 Min. : 5935 Min. : 6065
## 1st Qu.:12.38 1st Qu.: 31156 1st Qu.:12070 1st Qu.:13395
## Median :14.79 Median : 38592 Median :15000 Median :16458
## Mean :15.93 Mean : 42124 Mean :17059 Mean :18537
## 3rd Qu.:18.42 3rd Qu.: 49902 3rd Qu.:19404 3rd Qu.:21180
## Max. :40.08 Max. :131315 Max. :63302 Max. :68850
## black.percap indian.percap asian.percap other.percap
## Min. : 0 Min. : 0.000 Min. : 0.000 Min. : 0
## 1st Qu.: 7933 1st Qu.: 9.151 1st Qu.: 9.249 1st Qu.: 7244
## Median :11254 Median : 9.434 Median : 9.516 Median : 8912
## Mean :12535 Mean : 9.236 Mean : 9.460 Mean : 9976
## 3rd Qu.:15266 3rd Qu.: 9.733 3rd Qu.: 9.740 3rd Qu.:11854
## Max. :60000 Max. :11.929 Max. :11.553 Max. :35293
## hisp.percap num.underpov pct.pop.underpov pct.less9thgrade
## Min. : 3991 Min. : 5.257 Min. : 1.230 Min. : 0.620
## 1st Qu.: 8112 1st Qu.: 7.429 1st Qu.: 5.790 1st Qu.: 4.468
## Median : 9971 Median : 8.201 Median : 9.605 Median : 7.405
## Mean :11405 Mean : 8.203 Mean :11.201 Mean : 9.923
## 3rd Qu.:13190 3rd Qu.: 8.885 3rd Qu.:15.345 3rd Qu.:11.485
## Max. :51320 Max. :13.375 Max. :36.510 Max. :49.890
## pct.not.hsgrad pct.bs.ormore pct.unemployed pct.employed
## Min. : 2.95 Min. : 1.63 Min. : 1.610 Min. :26.30
## 1st Qu.:13.36 1st Qu.:13.41 1st Qu.: 4.128 1st Qu.:55.87
## Median :20.84 Median :19.34 Median : 5.520 Median :62.12
## Mean :22.95 Mean :23.03 Mean : 6.197 Mean :61.44
## 3rd Qu.:29.04 3rd Qu.:30.82 3rd Qu.: 7.577 3rd Qu.:67.67
## Max. :73.66 Max. :71.20 Max. :23.830 Max. :83.81
## pct.employed.manuf pct.employed.profserv pct.occup.manuf pct.occup.mgmtprof
## Min. : 2.910 Min. : 8.69 Min. : 1.37 Min. : 6.48
## 1st Qu.: 9.238 1st Qu.:18.43 1st Qu.: 7.78 1st Qu.:21.17
## Median :12.975 Median :21.52 Median :11.20 Median :26.14
## Mean :15.135 Mean :22.45 Mean :12.24 Mean :28.42
## 3rd Qu.:19.332 3rd Qu.:24.60 3rd Qu.:15.07 3rd Qu.:34.55
## Max. :41.690 Max. :58.64 Max. :44.27 Max. :60.55
## male.pct.divorce male.pct.nvrmarried female.pct.divorce total.pct.divorce
## Min. : 2.350 Min. :12.06 Min. : 5.05 Min. : 4.20
## 1st Qu.: 8.422 1st Qu.:26.57 1st Qu.:12.40 1st Qu.:10.54
## Median : 9.905 Median :31.00 Median :14.05 Median :11.99
## Mean :10.079 Mean :31.55 Mean :14.05 Mean :12.11
## 3rd Qu.:11.693 3rd Qu.:35.44 3rd Qu.:15.68 3rd Qu.:13.65
## Max. :17.010 Max. :61.96 Max. :20.15 Max. :18.04
## ave.people.per.fam pct.fam2parents pct.kids2parents pct.youngkids2parents
## Min. :2.290 Min. :44.28 Min. :35.52 Min. : 46.97
## 1st Qu.:2.928 1st Qu.:67.71 1st Qu.:62.82 1st Qu.: 75.87
## Median :3.160 Median :73.42 Median :69.07 Median : 82.78
## Mean :3.201 Mean :72.62 Mean :68.67 Mean : 81.39
## 3rd Qu.:3.400 3rd Qu.:78.34 3rd Qu.:75.46 3rd Qu.: 89.12
## Max. :4.640 Max. :91.57 Max. :90.18 Max. :100.00
## pct.teens2parents pct.workmom.youngkids pct.workmom num.kids.nvrmarried
## Min. :36.91 Min. :38.27 Min. :45.04 Min. : 0.000
## 1st Qu.:70.06 1st Qu.:53.97 1st Qu.:62.41 1st Qu.: 5.624
## Median :75.11 Median :59.13 Median :67.40 Median : 6.661
## Mean :74.10 Mean :58.55 Mean :66.40 Mean : 6.609
## 3rd Qu.:79.98 3rd Qu.:63.42 3rd Qu.:70.91 3rd Qu.: 7.464
## Max. :97.32 Max. :77.21 Max. :83.96 Max. :12.265
## pct.kids.nvrmarried num.immig pct.immig.recent pct.immig.recent5
## Min. : 0.000 Min. : 4.605 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.510 1st Qu.: 7.862 1st Qu.:11.46 1st Qu.:18.20
## Median : 2.780 Median : 8.622 Median :14.70 Median :24.09
## Mean : 3.527 Mean : 8.650 Mean :14.77 Mean :23.32
## 3rd Qu.: 4.902 3rd Qu.: 9.444 3rd Qu.:18.07 3rd Qu.:28.79
## Max. :14.150 Max. :14.106 Max. :31.54 Max. :45.61
## pct.immig.recent8 pct.immig.recent10 pct.pop.immig pct.pop.immig5
## Min. : 0.00 Min. : 4.50 Min. : 0.000 Min. : 0.000
## 1st Qu.:25.64 1st Qu.:33.85 1st Qu.: 1.090 1st Qu.: 1.657
## Median :33.05 Median :42.92 Median : 2.150 Median : 3.365
## Mean :32.04 Mean :41.36 Mean : 2.836 Mean : 4.480
## 3rd Qu.:39.35 3rd Qu.:50.76 3rd Qu.: 4.020 3rd Qu.: 6.343
## Max. :55.34 Max. :68.21 Max. :13.710 Max. :19.930
## pct.pop.immig8 pct.pop.immig10 pct.english.only pct.no.english.well
## Min. : 0.000 Min. : 0.200 Min. :10.44 Min. : 0.140
## 1st Qu.: 2.328 1st Qu.: 2.950 1st Qu.:67.20 1st Qu.: 1.805
## Median : 4.800 Median : 6.205 Median :78.77 Median : 3.775
## Mean : 6.162 Mean : 7.990 Mean :74.05 Mean : 6.243
## 3rd Qu.: 8.848 3rd Qu.:11.227 3rd Qu.:86.95 3rd Qu.: 7.795
## Max. :25.340 Max. :32.630 Max. :96.89 Max. :38.330
## pct.fam.hh.large pct.occup.hh.large ave.people.per.hh
## Min. : 0.960 Min. : 0.480 Min. :1.580
## 1st Qu.: 3.902 1st Qu.: 2.647 1st Qu.:2.360
## Median : 6.065 Median : 4.315 Median :2.680
## Mean : 7.999 Mean : 5.966 Mean :2.729
## 3rd Qu.:10.107 3rd Qu.: 7.370 3rd Qu.:3.002
## Max. :34.870 Max. :30.870 Max. :4.520
## ave.people.per.ownoccup.hh ave.people.per.rented.hh pct.people.ownoccup.hh
## Min. :1.610 Min. :1.580 Min. :13.93
## 1st Qu.:2.470 1st Qu.:2.280 1st Qu.:51.03
## Median :2.750 Median :2.595 Median :60.05
## Mean :2.763 Mean :2.670 Mean :60.61
## 3rd Qu.:2.980 3rd Qu.:3.000 3rd Qu.:69.48
## Max. :4.480 Max. :4.730 Max. :96.14
## pct.people.dense.hh pct.hh.less3br med.num.br num.vacant.house
## Min. : 0.330 Min. : 3.06 Min. :1.000 Min. : 4.407
## 1st Qu.: 3.715 1st Qu.:43.45 1st Qu.:2.000 1st Qu.: 6.082
## Median : 6.645 Median :53.84 Median :2.000 Median : 6.753
## Mean : 9.939 Mean :53.44 Mean :2.399 Mean : 6.874
## 3rd Qu.:11.932 3rd Qu.:64.40 3rd Qu.:3.000 3rd Qu.: 7.614
## Max. :59.490 Max. :95.34 Max. :4.000 Max. :11.321
## pct.house.occup pct.house.ownoccup pct.house.vacant pct.house.vacant.6moplus
## Min. :58.06 Min. :16.86 Min. : 0.0000 Min. : 3.12
## 1st Qu.:90.64 1st Qu.:50.31 1st Qu.: 0.5975 1st Qu.:16.05
## Median :94.84 Median :60.02 Median : 1.3050 Median :21.05
## Mean :92.31 Mean :59.74 Mean : 1.9371 Mean :22.43
## 3rd Qu.:96.24 3rd Qu.:68.23 3rd Qu.: 2.4350 3rd Qu.:27.96
## Max. :98.09 Max. :96.22 Max. :13.1000 Max. :57.40
## med.yr.house.built pct.house.nophone pct.house.no.plumb
## Min. :1939 Min. : 0.0000 Min. :0.0000
## 1st Qu.:1962 1st Qu.: 0.9725 1st Qu.:0.1700
## Median :1970 Median : 2.4600 Median :0.3100
## Mean :1968 Mean : 3.3408 Mean :0.4152
## 3rd Qu.:1974 3rd Qu.: 4.5500 3rd Qu.:0.5100
## Max. :1987 Max. :17.8700 Max. :2.5600
## value.ownoccup.house.lowquart value.ownoccup.med value.ownoccup.highquart
## Min. : 26100 Min. : 39800 Min. : 56200
## 1st Qu.: 71775 1st Qu.: 91475 1st Qu.:121425
## Median :128600 Median :163850 Median :204450
## Mean :148651 Mean :187367 Mean :234209
## 3rd Qu.:193650 3rd Qu.:251850 3rd Qu.:321700
## Max. :500001 Max. :500001 Max. :500001
## rent.lowquart rent.med rent.highquart med.rent
## Min. : 99.0 Min. : 180.0 Min. : 286.0 Min. : 234.0
## 1st Qu.: 346.8 1st Qu.: 430.5 1st Qu.: 517.0 1st Qu.: 506.8
## Median : 434.5 Median : 552.0 Median : 671.0 Median : 612.0
## Mean : 453.0 Mean : 562.5 Mean : 680.6 Mean : 625.7
## 3rd Qu.: 549.5 3rd Qu.: 670.0 3rd Qu.: 839.0 3rd Qu.: 727.2
## Max. :1001.0 Max. :1001.0 Max. :1001.0 Max. :1001.0
## med.rent.aspct.hhinc med.owncost.aspct.hhinc.wmort
## Min. :14.90 Min. :16.00
## 1st Qu.:27.20 1st Qu.:22.80
## Median :28.50 Median :24.10
## Mean :28.59 Mean :24.26
## 3rd Qu.:30.00 3rd Qu.:25.70
## Max. :35.10 Max. :32.70
## med.owncost.as.pct.hhinc.womort num.in.shelters num.homeless
## Min. :10.1 Min. :0.000 Min. :0.0000
## 1st Qu.:11.4 1st Qu.:0.000 1st Qu.:0.0000
## Median :11.8 Median :1.040 Median :0.3466
## Mean :11.9 Mean :2.004 Mean :1.4205
## 3rd Qu.:12.4 3rd Qu.:3.989 3rd Qu.:2.7081
## Max. :14.6 Max. :8.433 Max. :8.0424
## pct.foreignborn pct.born.samestate pct.samehouse1985 pct.samecity1985
## Min. : 0.970 Min. : 6.75 Min. :11.83 Min. :27.95
## 1st Qu.: 9.062 1st Qu.:36.33 1st Qu.:39.40 1st Qu.:68.15
## Median :14.665 Median :44.98 Median :44.70 Median :74.38
## Mean :17.599 Mean :43.03 Mean :44.36 Mean :73.49
## 3rd Qu.:23.317 3rd Qu.:52.31 3rd Qu.:49.27 3rd Qu.:79.96
## Max. :59.720 Max. :67.62 Max. :68.37 Max. :91.97
## pct.samestate1985 land.area pop.density pct.use.publictransit
## Min. :53.27 Min. :0.7419 Min. : 16.3 Min. :0.0000
## 1st Qu.:82.53 1st Qu.:1.9981 1st Qu.: 2248.2 1st Qu.:0.4700
## Median :87.35 Median :2.4891 Median : 3308.7 Median :0.9419
## Mean :85.68 Mean :2.6057 Mean : 4261.6 Mean :1.0485
## 3rd Qu.:90.39 3rd Qu.:3.1068 3rd Qu.: 5435.2 3rd Qu.:1.5173
## Max. :94.76 Max. :8.1805 Max. :23054.6 Max. :3.5802
## pct.police.drugunits violentcrimes.perpop
## Min. : 0.00 Min. : 25.7
## 1st Qu.: 0.00 1st Qu.: 406.8
## Median : 0.00 Median : 688.6
## Mean : 1.40 Mean : 895.7
## 3rd Qu.: 0.00 3rd Qu.:1219.1
## Max. :18.02 Max. :3829.2
dim(crime)
## [1] 368 103
head(crime$violentcrimes.perpop)
## [1] 2605.96 374.07 644.75 560.71 894.51 346.97
#names(crime)
#Q4.1
R2 & RMSE on test set
library(Metrics)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'forecast'
## The following object is masked from 'package:Metrics':
##
## accuracy
train <- crime[1:294,]
test <- crime[295:368,]
ols <- lm(violentcrimes.perpop ~ ., data=train)
test$predicted <- predict(ols,test[,-103])
rmse(test$violentcrimes.perpop, test$predicted)
## [1] 514.4767
rss <- sum((test$predicted - test$violentcrimes.perpop) ^ 2)
tss <- sum((test$violentcrimes.perpop - mean(test$violentcrimes.perpop)) ^ 2)
rsq <- 1 - rss/tss
rsq
## [1] 0.3723997
R2 & RMSE on train set
train$predicted <- predict(ols,train[,-103])
rmse(train$violentcrimes.perpop, train$predicted)
## [1] 260.5575
rss <- sum((train$predicted - train$violentcrimes.perpop) ^ 2)
tss <- sum((train$violentcrimes.perpop - mean(train$violentcrimes.perpop)) ^ 2)
rsq <- 1 - rss/tss
rsq
## [1] 0.8477243
R2 and RMSE is better in the train set compared for the test set this is because the test set was not used in training the ols model. It is a fresh dataset subjected to the already trained model on train set.
#Q4.2
df = crime
set.seed(123)
x = model.matrix(violentcrimes.perpop~., df)[,-1] # drop medv from the first column
y = as.matrix(df$violentcrimes.perpop)
index = sample(nrow(df),nrow(df)*0.80) #Split
df.X.std = scale(df[,c(-1,-103)]) #Standardize
X.train= as.matrix(df.X.std)[index,]
X.test= as.matrix(df.X.std)[-index,]
Y.train= df[index, "violentcrimes.perpop"]
Y.test<- df[-index, "violentcrimes.perpop"]
Y.train = as.matrix(Y.train)
Y.test = as.matrix(Y.test)
lasso_cv = cv.glmnet(x=X.train, y=Y.train, alpha = 1, nfolds = 5)
plot(lasso_cv)
lasso_cv$lambda.min
## [1] 18.66273
coef(lasso_cv, s=lasso_cv$lambda.min)
## 102 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 901.3983068
## fold .
## population .
## household.size .
## race.pctblack 163.9135143
## race.pctwhite .
## race.pctasian .
## race.pcthisp .
## age.pct12to21 .
## age.pct12to29 .
## age.pct16to24 .
## age.pct65up .
## num.urban .
## pct.urban .
## med.income .
## pct.wage.inc .
## pct.farmself.inc -2.5898177
## pct.inv.inc -19.8082950
## pct.socsec.inc .
## pct.pubasst.inc 8.6607916
## pct.retire .
## med.family.inc .
## percap.inc .
## white.percap .
## black.percap .
## indian.percap -3.8635397
## asian.percap .
## other.percap .
## hisp.percap .
## num.underpov .
## pct.pop.underpov .
## pct.less9thgrade .
## pct.not.hsgrad .
## pct.bs.ormore .
## pct.unemployed .
## pct.employed .
## pct.employed.manuf .
## pct.employed.profserv .
## pct.occup.manuf .
## pct.occup.mgmtprof .
## male.pct.divorce 63.7458670
## male.pct.nvrmarried .
## female.pct.divorce .
## total.pct.divorce .
## ave.people.per.fam .
## pct.fam2parents .
## pct.kids2parents -90.6751817
## pct.youngkids2parents -22.1630824
## pct.teens2parents .
## pct.workmom.youngkids .
## pct.workmom -30.5954777
## num.kids.nvrmarried .
## pct.kids.nvrmarried 172.2869146
## num.immig .
## pct.immig.recent .
## pct.immig.recent5 -2.3122375
## pct.immig.recent8 .
## pct.immig.recent10 .
## pct.pop.immig .
## pct.pop.immig5 .
## pct.pop.immig8 .
## pct.pop.immig10 .
## pct.english.only -74.3025386
## pct.no.english.well .
## pct.fam.hh.large .
## pct.occup.hh.large .
## ave.people.per.hh .
## ave.people.per.ownoccup.hh .
## ave.people.per.rented.hh .
## pct.people.ownoccup.hh .
## pct.people.dense.hh .
## pct.hh.less3br .
## med.num.br .
## num.vacant.house 30.9022950
## pct.house.occup -20.8185101
## pct.house.ownoccup .
## pct.house.vacant 12.3694990
## pct.house.vacant.6moplus 21.7570921
## med.yr.house.built -40.5500195
## pct.house.nophone 12.9380079
## pct.house.no.plumb .
## value.ownoccup.house.lowquart .
## value.ownoccup.med .
## value.ownoccup.highquart .
## rent.lowquart .
## rent.med .
## rent.highquart .
## med.rent .
## med.rent.aspct.hhinc 0.9966696
## med.owncost.aspct.hhinc.wmort .
## med.owncost.as.pct.hhinc.womort .
## num.in.shelters .
## num.homeless 50.2948217
## pct.foreignborn .
## pct.born.samestate .
## pct.samehouse1985 8.9388874
## pct.samecity1985 .
## pct.samestate1985 .
## land.area .
## pop.density .
## pct.use.publictransit .
## pct.police.drugunits .
lasso_cv$lambda.1se
## [1] 119.9654
coef(lasso_cv, s=lasso_cv$lambda.1se)
## 102 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 902.596062
## fold .
## population .
## household.size .
## race.pctblack 68.472269
## race.pctwhite .
## race.pctasian .
## race.pcthisp .
## age.pct12to21 .
## age.pct12to29 .
## age.pct16to24 .
## age.pct65up .
## num.urban .
## pct.urban .
## med.income .
## pct.wage.inc .
## pct.farmself.inc .
## pct.inv.inc .
## pct.socsec.inc .
## pct.pubasst.inc .
## pct.retire .
## med.family.inc .
## percap.inc .
## white.percap .
## black.percap .
## indian.percap .
## asian.percap .
## other.percap .
## hisp.percap .
## num.underpov .
## pct.pop.underpov .
## pct.less9thgrade .
## pct.not.hsgrad .
## pct.bs.ormore .
## pct.unemployed .
## pct.employed .
## pct.employed.manuf .
## pct.employed.profserv .
## pct.occup.manuf .
## pct.occup.mgmtprof .
## male.pct.divorce .
## male.pct.nvrmarried .
## female.pct.divorce .
## total.pct.divorce .
## ave.people.per.fam .
## pct.fam2parents .
## pct.kids2parents -160.864450
## pct.youngkids2parents .
## pct.teens2parents .
## pct.workmom.youngkids .
## pct.workmom .
## num.kids.nvrmarried .
## pct.kids.nvrmarried 227.180944
## num.immig .
## pct.immig.recent .
## pct.immig.recent5 .
## pct.immig.recent8 .
## pct.immig.recent10 .
## pct.pop.immig .
## pct.pop.immig5 .
## pct.pop.immig8 .
## pct.pop.immig10 .
## pct.english.only .
## pct.no.english.well .
## pct.fam.hh.large .
## pct.occup.hh.large .
## ave.people.per.hh .
## ave.people.per.ownoccup.hh .
## ave.people.per.rented.hh .
## pct.people.ownoccup.hh .
## pct.people.dense.hh .
## pct.hh.less3br .
## med.num.br .
## num.vacant.house .
## pct.house.occup .
## pct.house.ownoccup .
## pct.house.vacant .
## pct.house.vacant.6moplus .
## med.yr.house.built .
## pct.house.nophone .
## pct.house.no.plumb .
## value.ownoccup.house.lowquart .
## value.ownoccup.med .
## value.ownoccup.highquart .
## rent.lowquart .
## rent.med .
## rent.highquart .
## med.rent .
## med.rent.aspct.hhinc .
## med.owncost.aspct.hhinc.wmort .
## med.owncost.as.pct.hhinc.womort .
## num.in.shelters .
## num.homeless 8.501961
## pct.foreignborn .
## pct.born.samestate .
## pct.samehouse1985 .
## pct.samecity1985 .
## pct.samestate1985 .
## land.area .
## pop.density .
## pct.use.publictransit .
## pct.police.drugunits .
cv_lasso_predict_train = predict(lasso_cv, newx = X.train, s=lasso_cv$lambda.min) #Train Prediction
cv_lasso_predict_test= predict(lasso_cv, newx = X.test, s=lasso_cv$lambda.min) #Test Prediction
cv_lasso_mse = mean((Y.train-cv_lasso_predict_train)^2) #Mean squared error
cv_lasso_mpse = mean((Y.test-cv_lasso_predict_test)^2) #Mean squared prediction error
cv_sst = sum((Y.train - mean(Y.train))^2)
cv_sse = sum((Y.train-cv_lasso_predict_train)^2) #Sum squared error
cv_rsq = 1 - cv_sse / cv_sst #R_squared
cv_sst1 = sum((Y.test - mean(Y.test))^2)
cv_sse1 = sum((Y.test-cv_lasso_predict_test)^2) #Sum squared error
cv_rsq1 = 1 - cv_sse1 / cv_sst1
cv_adj_rsq = 1 - (dim(X.train)[1]-1)*(1-cv_rsq)/(dim(X.train)[1]-10-1)
cv_summary_stats_min = c("CV min model", cv_lasso_mse, cv_rsq,cv_rsq1, cv_adj_rsq,
cv_lasso_mpse)
comparison_table = c("model type", "MSE", "R-Squared_train","R-Squared_test", "Adj R-Squared", "Test MSPE")
data.frame(cbind(comparison_table, cv_summary_stats_min))
#Refitting OLS
dataf = subset(crime, select=c(household.size, race.pctblack, pct.pubasst.inc, indian.percap, hisp.percap, pct.less9thgrade, pct.not.hsgrad, pct.english.only, pct.youngkids2parents, med.rent, med.owncost.as.pct.hhinc.womort, num.homeless, land.area, violentcrimes.perpop))
R2 & RMSE on test set
train <- dataf[1:294,]
test <- dataf[295:368,]
ols <- lm(violentcrimes.perpop ~ ., data=train)
test$predicted <- predict(ols,test[,-103])
rmse(test$violentcrimes.perpop, test$predicted)
## [1] 374.5579
rss <- sum((test$predicted - test$violentcrimes.perpop) ^ 2)
tss <- sum((test$violentcrimes.perpop - mean(test$violentcrimes.perpop)) ^ 2)
rsq <- 1 - rss/tss
rsq
## [1] 0.6673485
R2 & RMSE on train set
train$predicted <- predict(ols,train[,-103])
rmse(train$violentcrimes.perpop, train$predicted)
## [1] 359.8285
rss <- sum((train$predicted - train$violentcrimes.perpop) ^ 2)
tss <- sum((train$violentcrimes.perpop - mean(train$violentcrimes.perpop)) ^ 2)
rsq <- 1 - rss/tss
rsq
## [1] 0.7095878
Comparing the latest fitted model on only significant variable we can say that the lasso is performing better in terms of R2 on both train and test set.
My final model is lasso_cv and the selection method is R2 diagnostic approach to check on the one performing better.
#Ridge
#perform k-fold cross-validation to find optimal lambda value
cv_model <- cv.glmnet(x=X.train, y=Y.train, alpha = 1, nfolds = 5)
#find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda
## [1] 24.67104
#produce plot of test MSE by lambda value
plot(cv_model)
#find coefficients of best model
best_model <- glmnet(x, y, alpha = 0, lambda = best_lambda)
coef(best_model)
## 103 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 1.686561e+04
## stateFL 1.932883e+02
## fold -1.970589e+00
## population -5.666766e+01
## household.size -1.808574e+02
## race.pctblack 2.019507e+01
## race.pctwhite 7.502493e-01
## race.pctasian 1.711680e+00
## race.pcthisp 1.103466e+00
## age.pct12to21 2.838102e+01
## age.pct12to29 -1.337490e+01
## age.pct16to24 -5.070627e+00
## age.pct65up 2.617624e+00
## num.urban -1.104918e+00
## pct.urban 2.645805e-01
## med.income 5.995235e-04
## pct.wage.inc -6.249200e-01
## pct.farmself.inc -2.065658e+01
## pct.inv.inc -8.259336e+00
## pct.socsec.inc 5.055720e-01
## pct.pubasst.inc 1.606025e+01
## pct.retire -4.353621e+00
## med.family.inc -3.431551e-04
## percap.inc -8.980663e-04
## white.percap 4.229752e-03
## black.percap 7.488192e-04
## indian.percap -1.700544e+01
## asian.percap 3.175131e+01
## other.percap 1.843177e-03
## hisp.percap -2.496081e-03
## num.underpov -1.066530e+01
## pct.pop.underpov -3.019594e-01
## pct.less9thgrade -6.669992e+00
## pct.not.hsgrad 5.213730e+00
## pct.bs.ormore -2.082677e+00
## pct.unemployed -1.676191e+01
## pct.employed 9.291583e+00
## pct.employed.manuf 1.789508e+00
## pct.employed.profserv -1.683663e+00
## pct.occup.manuf 7.823162e+00
## pct.occup.mgmtprof 4.200800e+00
## male.pct.divorce 1.940005e+01
## male.pct.nvrmarried 9.625540e+00
## female.pct.divorce -1.165572e+01
## total.pct.divorce 3.863266e+00
## ave.people.per.fam 1.421264e+02
## pct.fam2parents 2.784116e+00
## pct.kids2parents -9.787103e+00
## pct.youngkids2parents -5.933778e+00
## pct.teens2parents -1.263470e+00
## pct.workmom.youngkids -3.719548e+00
## pct.workmom -5.949641e+00
## num.kids.nvrmarried -1.557932e+01
## pct.kids.nvrmarried 4.673231e+01
## num.immig 3.168416e+01
## pct.immig.recent -6.735256e+00
## pct.immig.recent5 3.735824e+00
## pct.immig.recent8 3.409682e+00
## pct.immig.recent10 -3.211240e+00
## pct.pop.immig -1.278752e+01
## pct.pop.immig5 -9.521568e+00
## pct.pop.immig8 -1.012270e+01
## pct.pop.immig10 -6.575508e+00
## pct.english.only -8.329693e+00
## pct.no.english.well 9.989198e+00
## pct.fam.hh.large -3.255665e+00
## pct.occup.hh.large -5.753114e+00
## ave.people.per.hh -3.503799e+01
## ave.people.per.ownoccup.hh -8.235360e+01
## ave.people.per.rented.hh -2.944460e+01
## pct.people.ownoccup.hh -1.499289e+00
## pct.people.dense.hh -8.495610e+00
## pct.hh.less3br -6.006948e-01
## med.num.br 7.292999e+01
## num.vacant.house 3.173217e+01
## pct.house.occup -7.336851e+00
## pct.house.ownoccup 8.720716e-01
## pct.house.vacant 9.385207e+00
## pct.house.vacant.6moplus 6.059624e-01
## med.yr.house.built -7.058868e+00
## pct.house.nophone 1.298178e+01
## pct.house.no.plumb 6.612169e+00
## value.ownoccup.house.lowquart 2.576123e-04
## value.ownoccup.med 1.333217e-04
## value.ownoccup.highquart -5.063316e-04
## rent.lowquart -4.818555e-01
## rent.med 1.700063e-01
## rent.highquart -1.656207e-02
## med.rent 4.341449e-01
## med.rent.aspct.hhinc 1.477212e+01
## med.owncost.aspct.hhinc.wmort 1.248259e+01
## med.owncost.as.pct.hhinc.womort -7.156414e+01
## num.in.shelters 2.206628e+00
## num.homeless 3.546885e+01
## pct.foreignborn 5.509345e+00
## pct.born.samestate -1.234857e+00
## pct.samehouse1985 8.040709e-01
## pct.samecity1985 -7.049156e-01
## pct.samestate1985 2.135202e+00
## land.area 5.250434e+01
## pop.density 8.998056e-03
## pct.use.publictransit -6.172411e+00
## pct.police.drugunits -3.034929e+00
#produce Ridge trace plot
plot(best_model, xvar = "lambda")
#use fitted best model to make predictions
y_predicted <- predict(best_model, s = best_lambda, newx = x)
#find SST and SSE
sst <- sum((y - mean(y))^2)
sse <- sum((y_predicted - y)^2)
#find R-Squared
rsq <- 1 - sse/sst
rsq
## [1] 0.784115
This being the R2 from train dataset and being higher thatn the ones achieved previous make Ridge regression the best model achieved.