Business Statistics

###Question 1

We discussed this example in class using the following dummy variable Origin[Internal]=1, if Origin=``Internal”; =0, otherwise, and considered the following interaction model:

The definition on the alpha coefficients Origin[Internal]=1, if Origin=``Internal”; =-1, otherwise,

The relationship between the alphas and the betas

We begin by defining independent variables in our case: When we have origin [Internal] = 1: rating = 𝛽0 + 𝛽1 + 𝛽2salary + 𝛽3salary + 𝜀.

and When Origin[Internal] will imply 0: rating = 𝛽0 + 𝛽2𝑆salary + 𝜀.

Introducing the independent value of second model When we have Origin[Internal] = 1: rating = 𝛼0 + 𝛼1+ 𝛼2salary + 𝛼3salary + 𝜀.

provided that Origin[Internal] =-1: Rating = 𝛼0 - 𝛼1 + 𝛼2𝑆Salary - 𝛼3salary + 𝜀.

We set the Rating equal for both the case in which the origin is internal or external:

Origin[External]: 𝛽0 + 𝛽2salary + 𝜀 = 𝛼0 - 𝛼1 + 𝛼2salary - 𝛼3 salary + 𝜀

Origin[External]: 𝛽0 + 𝛽2salary= 𝛼0 - 𝛼1 + (𝛼2- 𝛼3)salary

We find 𝛽0= 𝛼0 - 𝛼1 , 𝛽2= 𝛼2- 𝛼3

Alternatively 𝛽3= 2𝛼3 𝛽1= 2𝛼1

###Question 2

library(haven)
library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-4

data <- read.table("C:/Users/User/Desktop/ProdTime.dat", header=TRUE,sep=",")
head(data)

dim(data)

## [1] 61  3

data$Manager <- as.factor(data$Manager)
aggregate(Time.for.Run~Manager,data,sum)

aggregate(Run.Size~Manager,data,sum)

Manager a is the best Manager in terms of bothe Time for Run and Run Size.

###Question 3

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.2.1

auto_data   <- ISLR::Auto
dim(auto_data)

## [1] 392   9

head(auto_data)

#Q3.1

pairs(auto_data)

summary(auto_data)

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

#Q3.2

model1 <- lm(mpg ~ year, data = auto_data)
summary(model1)

## 
## Call:
## lm(formula = mpg ~ year, data = auto_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0212  -5.4411  -0.4412   4.9739  18.2088 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -70.01167    6.64516  -10.54   <2e-16 ***
## year          1.23004    0.08736   14.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.363 on 390 degrees of freedom
## Multiple R-squared:  0.337,  Adjusted R-squared:  0.3353 
## F-statistic: 198.3 on 1 and 390 DF,  p-value: < 2.2e-16

Year is very significant in predicting mpg since it p_value is less than 0.05. It has a positive effect on mpg since the beta coefficient is positive.

model2 <- lm(mpg ~ year + horsepower, data = auto_data)
summary(model2)

## 
## Call:
## lm(formula = mpg ~ year + horsepower, data = auto_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0768  -3.0783  -0.4308   2.5884  15.3153 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.739166   5.349027  -2.382   0.0177 *  
## year          0.657268   0.066262   9.919   <2e-16 ***
## horsepower   -0.131654   0.006341 -20.761   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.388 on 389 degrees of freedom
## Multiple R-squared:  0.6855, Adjusted R-squared:  0.6839 
## F-statistic: 423.9 on 2 and 389 DF,  p-value: < 2.2e-16

plot(model2)

Year is still very significant in predicting mpg and with a positive relationship with mpg.

The Residuals are concentrated at 0 and widely stretched as the fitted value increase.

#95% CI’s The CI is now narrower in ii compared to i because by adding horsepower it defines the variance more making the confidence interval narrow when the dependent variable mpg can be explained more.

model3 <- lm(mpg ~ year*horsepower, data = auto_data)
summary(model3)

## 
## Call:
## lm(formula = mpg ~ year * horsepower, data = auto_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3492  -2.4509  -0.4557   2.4056  14.4437 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.266e+02  1.212e+01 -10.449   <2e-16 ***
## year             2.192e+00  1.613e-01  13.585   <2e-16 ***
## horsepower       1.046e+00  1.154e-01   9.063   <2e-16 ***
## year:horsepower -1.596e-02  1.562e-03 -10.217   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.901 on 388 degrees of freedom
## Multiple R-squared:  0.7522, Adjusted R-squared:  0.7503 
## F-statistic: 392.5 on 3 and 388 DF,  p-value: < 2.2e-16

The year effect is very significant on the model having year and horsepower as interaction also as an interaction factor with horsepower it is as well very significant. The interaction effect is also very significant since its p_value is less than 0.05.

#Q3.3

model4 <- lm(mpg    ~   horsepower  +   cylinders,  data=auto_data)
summary(model4)

## 
## Call:
## lm(formula = mpg ~ horsepower + cylinders, data = auto_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4378  -3.2422  -0.3721   2.3532  16.9289 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.94842    0.77880  55.147  < 2e-16 ***
## horsepower  -0.08612    0.01119  -7.693 1.19e-13 ***
## cylinders   -1.91982    0.25261  -7.600 2.24e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.584 on 389 degrees of freedom
## Multiple R-squared:  0.6569, Adjusted R-squared:  0.6551 
## F-statistic: 372.4 on 2 and 389 DF,  p-value: < 2.2e-16

At a significance level of 0.01 cylinders is a strongly significant variable in predicting mpg.

levels(as.factor(auto_data$cylinders))

## [1] "3" "4" "5" "6" "8"

model5 <- lm(mpg    ~   horsepower  +   as.factor(cylinders),   data=auto_data)
summary(model5)

## 
## Call:
## lm(formula = mpg ~ horsepower + as.factor(cylinders), data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5917 -2.7067 -0.6102  1.9001 16.3258 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           30.77614    2.41283  12.755  < 2e-16 ***
## horsepower            -0.10303    0.01133  -9.095  < 2e-16 ***
## as.factor(cylinders)4  6.57344    2.16921   3.030  0.00261 ** 
## as.factor(cylinders)5  5.07367    3.26661   1.553  0.12120    
## as.factor(cylinders)6 -0.34406    2.18580  -0.157  0.87501    
## as.factor(cylinders)8  0.49738    2.27639   0.218  0.82716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.27 on 386 degrees of freedom
## Multiple R-squared:  0.7046, Adjusted R-squared:  0.7008 
## F-statistic: 184.1 on 5 and 386 DF,  p-value: < 2.2e-16

Cylinders category 4 is very significant because its p_value is less than 0.01 it is very significant in predicting mpg.

anova(model4,   model5)

Model with cylinders as factor is performing better the model with it as numeric continuous variable. This conclusion is from the significant p_value in model 2 and the reduced Residual Sum of Squares.

###Question 4

crime <- read.csv("C:/Users/User/Desktop/CrimeData_sub.csv", stringsAsFactors = F, na.strings = c("?"))
crime <- na.omit(crime)

head(crime)

summary(crime)

##     state                fold          population     household.size
##  Length:368         Min.   : 1.000   Min.   : 9.211   Min.   :1.60  
##  Class :character   1st Qu.: 3.000   1st Qu.:10.028   1st Qu.:2.45  
##  Mode  :character   Median : 6.000   Median :10.536   Median :2.75  
##                     Mean   : 5.576   Mean   :10.606   Mean   :2.80  
##                     3rd Qu.: 8.000   3rd Qu.:11.107   3rd Qu.:3.06  
##                     Max.   :10.000   Max.   :15.064   Max.   :5.28  
##  race.pctblack    race.pctwhite   race.pctasian     race.pcthisp   
##  Min.   : 0.110   Min.   :10.58   Min.   : 0.210   Min.   : 0.680  
##  1st Qu.: 1.278   1st Qu.:65.58   1st Qu.: 1.740   1st Qu.: 6.843  
##  Median : 3.345   Median :77.83   Median : 4.470   Median :13.940  
##  Mean   : 7.536   Mean   :74.65   Mean   : 7.451   Mean   :20.712  
##  3rd Qu.: 9.340   3rd Qu.:88.51   3rd Qu.: 9.633   3rd Qu.:26.183  
##  Max.   :54.830   Max.   :98.91   Max.   :57.460   Max.   :93.110  
##  age.pct12to21   age.pct12to29   age.pct16to24    age.pct65up    
##  Min.   : 4.58   Min.   : 9.38   Min.   : 4.64   Min.   : 3.150  
##  1st Qu.:11.34   1st Qu.:23.99   1st Qu.:10.84   1st Qu.: 7.503  
##  Median :13.51   Median :27.43   Median :12.60   Median :10.165  
##  Mean   :13.38   Mean   :27.06   Mean   :13.10   Mean   :12.093  
##  3rd Qu.:15.10   3rd Qu.:29.98   3rd Qu.:14.56   3rd Qu.:14.053  
##  Max.   :25.63   Max.   :48.68   Max.   :34.93   Max.   :48.350  
##    num.urban        pct.urban       med.income      pct.wage.inc  
##  Min.   : 0.000   Min.   :  0.0   Min.   : 14579   Min.   :39.25  
##  1st Qu.: 9.713   1st Qu.:100.0   1st Qu.: 27043   1st Qu.:73.47  
##  Median :10.507   Median :100.0   Median : 33868   Median :79.16  
##  Mean   : 9.237   Mean   : 86.1   Mean   : 36926   Mean   :77.64  
##  3rd Qu.:11.107   3rd Qu.:100.0   3rd Qu.: 42989   3rd Qu.:84.00  
##  Max.   :15.064   Max.   :100.0   Max.   :123625   Max.   :94.82  
##  pct.farmself.inc  pct.inv.inc    pct.socsec.inc  pct.pubasst.inc 
##  Min.   :0.0000   Min.   : 9.02   Min.   : 7.45   Min.   : 1.100  
##  1st Qu.:0.4800   1st Qu.:32.26   1st Qu.:18.47   1st Qu.: 4.037  
##  Median :0.6600   Median :41.02   Median :22.54   Median : 6.415  
##  Mean   :0.7777   Mean   :41.39   Mean   :25.21   Mean   : 8.034  
##  3rd Qu.:0.9900   3rd Qu.:50.09   3rd Qu.:28.89   3rd Qu.:11.283  
##  Max.   :3.0500   Max.   :79.43   Max.   :68.37   Max.   :23.930  
##    pct.retire    med.family.inc     percap.inc     white.percap  
##  Min.   : 6.34   Min.   : 18522   Min.   : 5935   Min.   : 6065  
##  1st Qu.:12.38   1st Qu.: 31156   1st Qu.:12070   1st Qu.:13395  
##  Median :14.79   Median : 38592   Median :15000   Median :16458  
##  Mean   :15.93   Mean   : 42124   Mean   :17059   Mean   :18537  
##  3rd Qu.:18.42   3rd Qu.: 49902   3rd Qu.:19404   3rd Qu.:21180  
##  Max.   :40.08   Max.   :131315   Max.   :63302   Max.   :68850  
##   black.percap   indian.percap     asian.percap     other.percap  
##  Min.   :    0   Min.   : 0.000   Min.   : 0.000   Min.   :    0  
##  1st Qu.: 7933   1st Qu.: 9.151   1st Qu.: 9.249   1st Qu.: 7244  
##  Median :11254   Median : 9.434   Median : 9.516   Median : 8912  
##  Mean   :12535   Mean   : 9.236   Mean   : 9.460   Mean   : 9976  
##  3rd Qu.:15266   3rd Qu.: 9.733   3rd Qu.: 9.740   3rd Qu.:11854  
##  Max.   :60000   Max.   :11.929   Max.   :11.553   Max.   :35293  
##   hisp.percap     num.underpov    pct.pop.underpov pct.less9thgrade
##  Min.   : 3991   Min.   : 5.257   Min.   : 1.230   Min.   : 0.620  
##  1st Qu.: 8112   1st Qu.: 7.429   1st Qu.: 5.790   1st Qu.: 4.468  
##  Median : 9971   Median : 8.201   Median : 9.605   Median : 7.405  
##  Mean   :11405   Mean   : 8.203   Mean   :11.201   Mean   : 9.923  
##  3rd Qu.:13190   3rd Qu.: 8.885   3rd Qu.:15.345   3rd Qu.:11.485  
##  Max.   :51320   Max.   :13.375   Max.   :36.510   Max.   :49.890  
##  pct.not.hsgrad  pct.bs.ormore   pct.unemployed    pct.employed  
##  Min.   : 2.95   Min.   : 1.63   Min.   : 1.610   Min.   :26.30  
##  1st Qu.:13.36   1st Qu.:13.41   1st Qu.: 4.128   1st Qu.:55.87  
##  Median :20.84   Median :19.34   Median : 5.520   Median :62.12  
##  Mean   :22.95   Mean   :23.03   Mean   : 6.197   Mean   :61.44  
##  3rd Qu.:29.04   3rd Qu.:30.82   3rd Qu.: 7.577   3rd Qu.:67.67  
##  Max.   :73.66   Max.   :71.20   Max.   :23.830   Max.   :83.81  
##  pct.employed.manuf pct.employed.profserv pct.occup.manuf pct.occup.mgmtprof
##  Min.   : 2.910     Min.   : 8.69         Min.   : 1.37   Min.   : 6.48     
##  1st Qu.: 9.238     1st Qu.:18.43         1st Qu.: 7.78   1st Qu.:21.17     
##  Median :12.975     Median :21.52         Median :11.20   Median :26.14     
##  Mean   :15.135     Mean   :22.45         Mean   :12.24   Mean   :28.42     
##  3rd Qu.:19.332     3rd Qu.:24.60         3rd Qu.:15.07   3rd Qu.:34.55     
##  Max.   :41.690     Max.   :58.64         Max.   :44.27   Max.   :60.55     
##  male.pct.divorce male.pct.nvrmarried female.pct.divorce total.pct.divorce
##  Min.   : 2.350   Min.   :12.06       Min.   : 5.05      Min.   : 4.20    
##  1st Qu.: 8.422   1st Qu.:26.57       1st Qu.:12.40      1st Qu.:10.54    
##  Median : 9.905   Median :31.00       Median :14.05      Median :11.99    
##  Mean   :10.079   Mean   :31.55       Mean   :14.05      Mean   :12.11    
##  3rd Qu.:11.693   3rd Qu.:35.44       3rd Qu.:15.68      3rd Qu.:13.65    
##  Max.   :17.010   Max.   :61.96       Max.   :20.15      Max.   :18.04    
##  ave.people.per.fam pct.fam2parents pct.kids2parents pct.youngkids2parents
##  Min.   :2.290      Min.   :44.28   Min.   :35.52    Min.   : 46.97       
##  1st Qu.:2.928      1st Qu.:67.71   1st Qu.:62.82    1st Qu.: 75.87       
##  Median :3.160      Median :73.42   Median :69.07    Median : 82.78       
##  Mean   :3.201      Mean   :72.62   Mean   :68.67    Mean   : 81.39       
##  3rd Qu.:3.400      3rd Qu.:78.34   3rd Qu.:75.46    3rd Qu.: 89.12       
##  Max.   :4.640      Max.   :91.57   Max.   :90.18    Max.   :100.00       
##  pct.teens2parents pct.workmom.youngkids  pct.workmom    num.kids.nvrmarried
##  Min.   :36.91     Min.   :38.27         Min.   :45.04   Min.   : 0.000     
##  1st Qu.:70.06     1st Qu.:53.97         1st Qu.:62.41   1st Qu.: 5.624     
##  Median :75.11     Median :59.13         Median :67.40   Median : 6.661     
##  Mean   :74.10     Mean   :58.55         Mean   :66.40   Mean   : 6.609     
##  3rd Qu.:79.98     3rd Qu.:63.42         3rd Qu.:70.91   3rd Qu.: 7.464     
##  Max.   :97.32     Max.   :77.21         Max.   :83.96   Max.   :12.265     
##  pct.kids.nvrmarried   num.immig      pct.immig.recent pct.immig.recent5
##  Min.   : 0.000      Min.   : 4.605   Min.   : 0.00    Min.   : 0.00    
##  1st Qu.: 1.510      1st Qu.: 7.862   1st Qu.:11.46    1st Qu.:18.20    
##  Median : 2.780      Median : 8.622   Median :14.70    Median :24.09    
##  Mean   : 3.527      Mean   : 8.650   Mean   :14.77    Mean   :23.32    
##  3rd Qu.: 4.902      3rd Qu.: 9.444   3rd Qu.:18.07    3rd Qu.:28.79    
##  Max.   :14.150      Max.   :14.106   Max.   :31.54    Max.   :45.61    
##  pct.immig.recent8 pct.immig.recent10 pct.pop.immig    pct.pop.immig5  
##  Min.   : 0.00     Min.   : 4.50      Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:25.64     1st Qu.:33.85      1st Qu.: 1.090   1st Qu.: 1.657  
##  Median :33.05     Median :42.92      Median : 2.150   Median : 3.365  
##  Mean   :32.04     Mean   :41.36      Mean   : 2.836   Mean   : 4.480  
##  3rd Qu.:39.35     3rd Qu.:50.76      3rd Qu.: 4.020   3rd Qu.: 6.343  
##  Max.   :55.34     Max.   :68.21      Max.   :13.710   Max.   :19.930  
##  pct.pop.immig8   pct.pop.immig10  pct.english.only pct.no.english.well
##  Min.   : 0.000   Min.   : 0.200   Min.   :10.44    Min.   : 0.140     
##  1st Qu.: 2.328   1st Qu.: 2.950   1st Qu.:67.20    1st Qu.: 1.805     
##  Median : 4.800   Median : 6.205   Median :78.77    Median : 3.775     
##  Mean   : 6.162   Mean   : 7.990   Mean   :74.05    Mean   : 6.243     
##  3rd Qu.: 8.848   3rd Qu.:11.227   3rd Qu.:86.95    3rd Qu.: 7.795     
##  Max.   :25.340   Max.   :32.630   Max.   :96.89    Max.   :38.330     
##  pct.fam.hh.large pct.occup.hh.large ave.people.per.hh
##  Min.   : 0.960   Min.   : 0.480     Min.   :1.580    
##  1st Qu.: 3.902   1st Qu.: 2.647     1st Qu.:2.360    
##  Median : 6.065   Median : 4.315     Median :2.680    
##  Mean   : 7.999   Mean   : 5.966     Mean   :2.729    
##  3rd Qu.:10.107   3rd Qu.: 7.370     3rd Qu.:3.002    
##  Max.   :34.870   Max.   :30.870     Max.   :4.520    
##  ave.people.per.ownoccup.hh ave.people.per.rented.hh pct.people.ownoccup.hh
##  Min.   :1.610              Min.   :1.580            Min.   :13.93         
##  1st Qu.:2.470              1st Qu.:2.280            1st Qu.:51.03         
##  Median :2.750              Median :2.595            Median :60.05         
##  Mean   :2.763              Mean   :2.670            Mean   :60.61         
##  3rd Qu.:2.980              3rd Qu.:3.000            3rd Qu.:69.48         
##  Max.   :4.480              Max.   :4.730            Max.   :96.14         
##  pct.people.dense.hh pct.hh.less3br    med.num.br    num.vacant.house
##  Min.   : 0.330      Min.   : 3.06   Min.   :1.000   Min.   : 4.407  
##  1st Qu.: 3.715      1st Qu.:43.45   1st Qu.:2.000   1st Qu.: 6.082  
##  Median : 6.645      Median :53.84   Median :2.000   Median : 6.753  
##  Mean   : 9.939      Mean   :53.44   Mean   :2.399   Mean   : 6.874  
##  3rd Qu.:11.932      3rd Qu.:64.40   3rd Qu.:3.000   3rd Qu.: 7.614  
##  Max.   :59.490      Max.   :95.34   Max.   :4.000   Max.   :11.321  
##  pct.house.occup pct.house.ownoccup pct.house.vacant  pct.house.vacant.6moplus
##  Min.   :58.06   Min.   :16.86      Min.   : 0.0000   Min.   : 3.12           
##  1st Qu.:90.64   1st Qu.:50.31      1st Qu.: 0.5975   1st Qu.:16.05           
##  Median :94.84   Median :60.02      Median : 1.3050   Median :21.05           
##  Mean   :92.31   Mean   :59.74      Mean   : 1.9371   Mean   :22.43           
##  3rd Qu.:96.24   3rd Qu.:68.23      3rd Qu.: 2.4350   3rd Qu.:27.96           
##  Max.   :98.09   Max.   :96.22      Max.   :13.1000   Max.   :57.40           
##  med.yr.house.built pct.house.nophone pct.house.no.plumb
##  Min.   :1939       Min.   : 0.0000   Min.   :0.0000    
##  1st Qu.:1962       1st Qu.: 0.9725   1st Qu.:0.1700    
##  Median :1970       Median : 2.4600   Median :0.3100    
##  Mean   :1968       Mean   : 3.3408   Mean   :0.4152    
##  3rd Qu.:1974       3rd Qu.: 4.5500   3rd Qu.:0.5100    
##  Max.   :1987       Max.   :17.8700   Max.   :2.5600    
##  value.ownoccup.house.lowquart value.ownoccup.med value.ownoccup.highquart
##  Min.   : 26100                Min.   : 39800     Min.   : 56200          
##  1st Qu.: 71775                1st Qu.: 91475     1st Qu.:121425          
##  Median :128600                Median :163850     Median :204450          
##  Mean   :148651                Mean   :187367     Mean   :234209          
##  3rd Qu.:193650                3rd Qu.:251850     3rd Qu.:321700          
##  Max.   :500001                Max.   :500001     Max.   :500001          
##  rent.lowquart       rent.med      rent.highquart      med.rent     
##  Min.   :  99.0   Min.   : 180.0   Min.   : 286.0   Min.   : 234.0  
##  1st Qu.: 346.8   1st Qu.: 430.5   1st Qu.: 517.0   1st Qu.: 506.8  
##  Median : 434.5   Median : 552.0   Median : 671.0   Median : 612.0  
##  Mean   : 453.0   Mean   : 562.5   Mean   : 680.6   Mean   : 625.7  
##  3rd Qu.: 549.5   3rd Qu.: 670.0   3rd Qu.: 839.0   3rd Qu.: 727.2  
##  Max.   :1001.0   Max.   :1001.0   Max.   :1001.0   Max.   :1001.0  
##  med.rent.aspct.hhinc med.owncost.aspct.hhinc.wmort
##  Min.   :14.90        Min.   :16.00                
##  1st Qu.:27.20        1st Qu.:22.80                
##  Median :28.50        Median :24.10                
##  Mean   :28.59        Mean   :24.26                
##  3rd Qu.:30.00        3rd Qu.:25.70                
##  Max.   :35.10        Max.   :32.70                
##  med.owncost.as.pct.hhinc.womort num.in.shelters  num.homeless   
##  Min.   :10.1                    Min.   :0.000   Min.   :0.0000  
##  1st Qu.:11.4                    1st Qu.:0.000   1st Qu.:0.0000  
##  Median :11.8                    Median :1.040   Median :0.3466  
##  Mean   :11.9                    Mean   :2.004   Mean   :1.4205  
##  3rd Qu.:12.4                    3rd Qu.:3.989   3rd Qu.:2.7081  
##  Max.   :14.6                    Max.   :8.433   Max.   :8.0424  
##  pct.foreignborn  pct.born.samestate pct.samehouse1985 pct.samecity1985
##  Min.   : 0.970   Min.   : 6.75      Min.   :11.83     Min.   :27.95   
##  1st Qu.: 9.062   1st Qu.:36.33      1st Qu.:39.40     1st Qu.:68.15   
##  Median :14.665   Median :44.98      Median :44.70     Median :74.38   
##  Mean   :17.599   Mean   :43.03      Mean   :44.36     Mean   :73.49   
##  3rd Qu.:23.317   3rd Qu.:52.31      3rd Qu.:49.27     3rd Qu.:79.96   
##  Max.   :59.720   Max.   :67.62      Max.   :68.37     Max.   :91.97   
##  pct.samestate1985   land.area       pop.density      pct.use.publictransit
##  Min.   :53.27     Min.   :0.7419   Min.   :   16.3   Min.   :0.0000       
##  1st Qu.:82.53     1st Qu.:1.9981   1st Qu.: 2248.2   1st Qu.:0.4700       
##  Median :87.35     Median :2.4891   Median : 3308.7   Median :0.9419       
##  Mean   :85.68     Mean   :2.6057   Mean   : 4261.6   Mean   :1.0485       
##  3rd Qu.:90.39     3rd Qu.:3.1068   3rd Qu.: 5435.2   3rd Qu.:1.5173       
##  Max.   :94.76     Max.   :8.1805   Max.   :23054.6   Max.   :3.5802       
##  pct.police.drugunits violentcrimes.perpop
##  Min.   : 0.00        Min.   :  25.7      
##  1st Qu.: 0.00        1st Qu.: 406.8      
##  Median : 0.00        Median : 688.6      
##  Mean   : 1.40        Mean   : 895.7      
##  3rd Qu.: 0.00        3rd Qu.:1219.1      
##  Max.   :18.02        Max.   :3829.2

dim(crime)

## [1] 368 103

head(crime$violentcrimes.perpop)

## [1] 2605.96  374.07  644.75  560.71  894.51  346.97

#names(crime)

#Q4.1

R2 & RMSE on test set

library(Metrics)
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## 
## Attaching package: 'forecast'

## The following object is masked from 'package:Metrics':
## 
##     accuracy

train <- crime[1:294,]
test <- crime[295:368,]
ols <- lm(violentcrimes.perpop ~ ., data=train)
test$predicted <- predict(ols,test[,-103])
rmse(test$violentcrimes.perpop, test$predicted)

## [1] 514.4767

rss <- sum((test$predicted - test$violentcrimes.perpop) ^ 2)
tss <- sum((test$violentcrimes.perpop - mean(test$violentcrimes.perpop)) ^ 2)

rsq <- 1 - rss/tss
rsq

## [1] 0.3723997

R2 & RMSE on train set

train$predicted <- predict(ols,train[,-103])
rmse(train$violentcrimes.perpop, train$predicted)

## [1] 260.5575

rss <- sum((train$predicted - train$violentcrimes.perpop) ^ 2)
tss <- sum((train$violentcrimes.perpop - mean(train$violentcrimes.perpop)) ^ 2)

rsq <- 1 - rss/tss
rsq

## [1] 0.8477243

R2 and RMSE is better in the train set compared for the test set this is because the test set was not used in training the ols model. It is a fresh dataset subjected to the already trained model on train set.

#Q4.2

df = crime
set.seed(123)
x = model.matrix(violentcrimes.perpop~., df)[,-1]                # drop medv from the first column
y = as.matrix(df$violentcrimes.perpop) 


index = sample(nrow(df),nrow(df)*0.80)           #Split
df.X.std = scale(df[,c(-1,-103)])       #Standardize

X.train= as.matrix(df.X.std)[index,]
X.test=  as.matrix(df.X.std)[-index,]

Y.train= df[index, "violentcrimes.perpop"]
Y.test<- df[-index, "violentcrimes.perpop"]


Y.train = as.matrix(Y.train)
Y.test = as.matrix(Y.test)

lasso_cv = cv.glmnet(x=X.train, y=Y.train, alpha = 1, nfolds = 5)
plot(lasso_cv)

lasso_cv$lambda.min

## [1] 18.66273

coef(lasso_cv, s=lasso_cv$lambda.min)

## 102 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                     901.3983068
## fold                              .        
## population                        .        
## household.size                    .        
## race.pctblack                   163.9135143
## race.pctwhite                     .        
## race.pctasian                     .        
## race.pcthisp                      .        
## age.pct12to21                     .        
## age.pct12to29                     .        
## age.pct16to24                     .        
## age.pct65up                       .        
## num.urban                         .        
## pct.urban                         .        
## med.income                        .        
## pct.wage.inc                      .        
## pct.farmself.inc                 -2.5898177
## pct.inv.inc                     -19.8082950
## pct.socsec.inc                    .        
## pct.pubasst.inc                   8.6607916
## pct.retire                        .        
## med.family.inc                    .        
## percap.inc                        .        
## white.percap                      .        
## black.percap                      .        
## indian.percap                    -3.8635397
## asian.percap                      .        
## other.percap                      .        
## hisp.percap                       .        
## num.underpov                      .        
## pct.pop.underpov                  .        
## pct.less9thgrade                  .        
## pct.not.hsgrad                    .        
## pct.bs.ormore                     .        
## pct.unemployed                    .        
## pct.employed                      .        
## pct.employed.manuf                .        
## pct.employed.profserv             .        
## pct.occup.manuf                   .        
## pct.occup.mgmtprof                .        
## male.pct.divorce                 63.7458670
## male.pct.nvrmarried               .        
## female.pct.divorce                .        
## total.pct.divorce                 .        
## ave.people.per.fam                .        
## pct.fam2parents                   .        
## pct.kids2parents                -90.6751817
## pct.youngkids2parents           -22.1630824
## pct.teens2parents                 .        
## pct.workmom.youngkids             .        
## pct.workmom                     -30.5954777
## num.kids.nvrmarried               .        
## pct.kids.nvrmarried             172.2869146
## num.immig                         .        
## pct.immig.recent                  .        
## pct.immig.recent5                -2.3122375
## pct.immig.recent8                 .        
## pct.immig.recent10                .        
## pct.pop.immig                     .        
## pct.pop.immig5                    .        
## pct.pop.immig8                    .        
## pct.pop.immig10                   .        
## pct.english.only                -74.3025386
## pct.no.english.well               .        
## pct.fam.hh.large                  .        
## pct.occup.hh.large                .        
## ave.people.per.hh                 .        
## ave.people.per.ownoccup.hh        .        
## ave.people.per.rented.hh          .        
## pct.people.ownoccup.hh            .        
## pct.people.dense.hh               .        
## pct.hh.less3br                    .        
## med.num.br                        .        
## num.vacant.house                 30.9022950
## pct.house.occup                 -20.8185101
## pct.house.ownoccup                .        
## pct.house.vacant                 12.3694990
## pct.house.vacant.6moplus         21.7570921
## med.yr.house.built              -40.5500195
## pct.house.nophone                12.9380079
## pct.house.no.plumb                .        
## value.ownoccup.house.lowquart     .        
## value.ownoccup.med                .        
## value.ownoccup.highquart          .        
## rent.lowquart                     .        
## rent.med                          .        
## rent.highquart                    .        
## med.rent                          .        
## med.rent.aspct.hhinc              0.9966696
## med.owncost.aspct.hhinc.wmort     .        
## med.owncost.as.pct.hhinc.womort   .        
## num.in.shelters                   .        
## num.homeless                     50.2948217
## pct.foreignborn                   .        
## pct.born.samestate                .        
## pct.samehouse1985                 8.9388874
## pct.samecity1985                  .        
## pct.samestate1985                 .        
## land.area                         .        
## pop.density                       .        
## pct.use.publictransit             .        
## pct.police.drugunits              .

lasso_cv$lambda.1se

## [1] 119.9654

coef(lasso_cv, s=lasso_cv$lambda.1se)

## 102 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                      902.596062
## fold                               .       
## population                         .       
## household.size                     .       
## race.pctblack                     68.472269
## race.pctwhite                      .       
## race.pctasian                      .       
## race.pcthisp                       .       
## age.pct12to21                      .       
## age.pct12to29                      .       
## age.pct16to24                      .       
## age.pct65up                        .       
## num.urban                          .       
## pct.urban                          .       
## med.income                         .       
## pct.wage.inc                       .       
## pct.farmself.inc                   .       
## pct.inv.inc                        .       
## pct.socsec.inc                     .       
## pct.pubasst.inc                    .       
## pct.retire                         .       
## med.family.inc                     .       
## percap.inc                         .       
## white.percap                       .       
## black.percap                       .       
## indian.percap                      .       
## asian.percap                       .       
## other.percap                       .       
## hisp.percap                        .       
## num.underpov                       .       
## pct.pop.underpov                   .       
## pct.less9thgrade                   .       
## pct.not.hsgrad                     .       
## pct.bs.ormore                      .       
## pct.unemployed                     .       
## pct.employed                       .       
## pct.employed.manuf                 .       
## pct.employed.profserv              .       
## pct.occup.manuf                    .       
## pct.occup.mgmtprof                 .       
## male.pct.divorce                   .       
## male.pct.nvrmarried                .       
## female.pct.divorce                 .       
## total.pct.divorce                  .       
## ave.people.per.fam                 .       
## pct.fam2parents                    .       
## pct.kids2parents                -160.864450
## pct.youngkids2parents              .       
## pct.teens2parents                  .       
## pct.workmom.youngkids              .       
## pct.workmom                        .       
## num.kids.nvrmarried                .       
## pct.kids.nvrmarried              227.180944
## num.immig                          .       
## pct.immig.recent                   .       
## pct.immig.recent5                  .       
## pct.immig.recent8                  .       
## pct.immig.recent10                 .       
## pct.pop.immig                      .       
## pct.pop.immig5                     .       
## pct.pop.immig8                     .       
## pct.pop.immig10                    .       
## pct.english.only                   .       
## pct.no.english.well                .       
## pct.fam.hh.large                   .       
## pct.occup.hh.large                 .       
## ave.people.per.hh                  .       
## ave.people.per.ownoccup.hh         .       
## ave.people.per.rented.hh           .       
## pct.people.ownoccup.hh             .       
## pct.people.dense.hh                .       
## pct.hh.less3br                     .       
## med.num.br                         .       
## num.vacant.house                   .       
## pct.house.occup                    .       
## pct.house.ownoccup                 .       
## pct.house.vacant                   .       
## pct.house.vacant.6moplus           .       
## med.yr.house.built                 .       
## pct.house.nophone                  .       
## pct.house.no.plumb                 .       
## value.ownoccup.house.lowquart      .       
## value.ownoccup.med                 .       
## value.ownoccup.highquart           .       
## rent.lowquart                      .       
## rent.med                           .       
## rent.highquart                     .       
## med.rent                           .       
## med.rent.aspct.hhinc               .       
## med.owncost.aspct.hhinc.wmort      .       
## med.owncost.as.pct.hhinc.womort    .       
## num.in.shelters                    .       
## num.homeless                       8.501961
## pct.foreignborn                    .       
## pct.born.samestate                 .       
## pct.samehouse1985                  .       
## pct.samecity1985                   .       
## pct.samestate1985                  .       
## land.area                          .       
## pop.density                        .       
## pct.use.publictransit              .       
## pct.police.drugunits               .

cv_lasso_predict_train = predict(lasso_cv, newx = X.train, s=lasso_cv$lambda.min) #Train Prediction
cv_lasso_predict_test= predict(lasso_cv, newx = X.test, s=lasso_cv$lambda.min)    #Test Prediction

cv_lasso_mse = mean((Y.train-cv_lasso_predict_train)^2)                #Mean squared error
cv_lasso_mpse = mean((Y.test-cv_lasso_predict_test)^2)                #Mean squared prediction error

cv_sst = sum((Y.train - mean(Y.train))^2)
cv_sse = sum((Y.train-cv_lasso_predict_train)^2)                      #Sum squared error

cv_rsq = 1 - cv_sse / cv_sst                                          #R_squared

cv_sst1 = sum((Y.test - mean(Y.test))^2)
cv_sse1 = sum((Y.test-cv_lasso_predict_test)^2)                      #Sum squared error

cv_rsq1 = 1 - cv_sse1 / cv_sst1   


cv_adj_rsq = 1 - (dim(X.train)[1]-1)*(1-cv_rsq)/(dim(X.train)[1]-10-1)

cv_summary_stats_min = c("CV min model", cv_lasso_mse, cv_rsq,cv_rsq1, cv_adj_rsq,
                         cv_lasso_mpse)

comparison_table = c("model type", "MSE", "R-Squared_train","R-Squared_test", "Adj R-Squared", "Test MSPE")
data.frame(cbind(comparison_table, cv_summary_stats_min))

#Refitting OLS

dataf = subset(crime, select=c(household.size, race.pctblack, pct.pubasst.inc, indian.percap, hisp.percap, pct.less9thgrade, pct.not.hsgrad, pct.english.only, pct.youngkids2parents, med.rent, med.owncost.as.pct.hhinc.womort, num.homeless, land.area, violentcrimes.perpop))

R2 & RMSE on test set

train <- dataf[1:294,]
test <- dataf[295:368,]
ols <- lm(violentcrimes.perpop ~ ., data=train)
test$predicted <- predict(ols,test[,-103])
rmse(test$violentcrimes.perpop, test$predicted)

## [1] 374.5579

rss <- sum((test$predicted - test$violentcrimes.perpop) ^ 2)
tss <- sum((test$violentcrimes.perpop - mean(test$violentcrimes.perpop)) ^ 2)

rsq <- 1 - rss/tss
rsq

## [1] 0.6673485

R2 & RMSE on train set

train$predicted <- predict(ols,train[,-103])
rmse(train$violentcrimes.perpop, train$predicted)

## [1] 359.8285

rss <- sum((train$predicted - train$violentcrimes.perpop) ^ 2)
tss <- sum((train$violentcrimes.perpop - mean(train$violentcrimes.perpop)) ^ 2)

rsq <- 1 - rss/tss
rsq

## [1] 0.7095878

Comparing the latest fitted model on only significant variable we can say that the lasso is performing better in terms of R2 on both train and test set.

My final model is lasso_cv and the selection method is R2 diagnostic approach to check on the one performing better.

#Ridge

#perform k-fold cross-validation to find optimal lambda value
cv_model <- cv.glmnet(x=X.train, y=Y.train, alpha = 1, nfolds = 5)

#find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda

## [1] 24.67104

#produce plot of test MSE by lambda value
plot(cv_model)

#find coefficients of best model
best_model <- glmnet(x, y, alpha = 0, lambda = best_lambda)
coef(best_model)

## 103 x 1 sparse Matrix of class "dgCMatrix"
##                                            s0
## (Intercept)                      1.686561e+04
## stateFL                          1.932883e+02
## fold                            -1.970589e+00
## population                      -5.666766e+01
## household.size                  -1.808574e+02
## race.pctblack                    2.019507e+01
## race.pctwhite                    7.502493e-01
## race.pctasian                    1.711680e+00
## race.pcthisp                     1.103466e+00
## age.pct12to21                    2.838102e+01
## age.pct12to29                   -1.337490e+01
## age.pct16to24                   -5.070627e+00
## age.pct65up                      2.617624e+00
## num.urban                       -1.104918e+00
## pct.urban                        2.645805e-01
## med.income                       5.995235e-04
## pct.wage.inc                    -6.249200e-01
## pct.farmself.inc                -2.065658e+01
## pct.inv.inc                     -8.259336e+00
## pct.socsec.inc                   5.055720e-01
## pct.pubasst.inc                  1.606025e+01
## pct.retire                      -4.353621e+00
## med.family.inc                  -3.431551e-04
## percap.inc                      -8.980663e-04
## white.percap                     4.229752e-03
## black.percap                     7.488192e-04
## indian.percap                   -1.700544e+01
## asian.percap                     3.175131e+01
## other.percap                     1.843177e-03
## hisp.percap                     -2.496081e-03
## num.underpov                    -1.066530e+01
## pct.pop.underpov                -3.019594e-01
## pct.less9thgrade                -6.669992e+00
## pct.not.hsgrad                   5.213730e+00
## pct.bs.ormore                   -2.082677e+00
## pct.unemployed                  -1.676191e+01
## pct.employed                     9.291583e+00
## pct.employed.manuf               1.789508e+00
## pct.employed.profserv           -1.683663e+00
## pct.occup.manuf                  7.823162e+00
## pct.occup.mgmtprof               4.200800e+00
## male.pct.divorce                 1.940005e+01
## male.pct.nvrmarried              9.625540e+00
## female.pct.divorce              -1.165572e+01
## total.pct.divorce                3.863266e+00
## ave.people.per.fam               1.421264e+02
## pct.fam2parents                  2.784116e+00
## pct.kids2parents                -9.787103e+00
## pct.youngkids2parents           -5.933778e+00
## pct.teens2parents               -1.263470e+00
## pct.workmom.youngkids           -3.719548e+00
## pct.workmom                     -5.949641e+00
## num.kids.nvrmarried             -1.557932e+01
## pct.kids.nvrmarried              4.673231e+01
## num.immig                        3.168416e+01
## pct.immig.recent                -6.735256e+00
## pct.immig.recent5                3.735824e+00
## pct.immig.recent8                3.409682e+00
## pct.immig.recent10              -3.211240e+00
## pct.pop.immig                   -1.278752e+01
## pct.pop.immig5                  -9.521568e+00
## pct.pop.immig8                  -1.012270e+01
## pct.pop.immig10                 -6.575508e+00
## pct.english.only                -8.329693e+00
## pct.no.english.well              9.989198e+00
## pct.fam.hh.large                -3.255665e+00
## pct.occup.hh.large              -5.753114e+00
## ave.people.per.hh               -3.503799e+01
## ave.people.per.ownoccup.hh      -8.235360e+01
## ave.people.per.rented.hh        -2.944460e+01
## pct.people.ownoccup.hh          -1.499289e+00
## pct.people.dense.hh             -8.495610e+00
## pct.hh.less3br                  -6.006948e-01
## med.num.br                       7.292999e+01
## num.vacant.house                 3.173217e+01
## pct.house.occup                 -7.336851e+00
## pct.house.ownoccup               8.720716e-01
## pct.house.vacant                 9.385207e+00
## pct.house.vacant.6moplus         6.059624e-01
## med.yr.house.built              -7.058868e+00
## pct.house.nophone                1.298178e+01
## pct.house.no.plumb               6.612169e+00
## value.ownoccup.house.lowquart    2.576123e-04
## value.ownoccup.med               1.333217e-04
## value.ownoccup.highquart        -5.063316e-04
## rent.lowquart                   -4.818555e-01
## rent.med                         1.700063e-01
## rent.highquart                  -1.656207e-02
## med.rent                         4.341449e-01
## med.rent.aspct.hhinc             1.477212e+01
## med.owncost.aspct.hhinc.wmort    1.248259e+01
## med.owncost.as.pct.hhinc.womort -7.156414e+01
## num.in.shelters                  2.206628e+00
## num.homeless                     3.546885e+01
## pct.foreignborn                  5.509345e+00
## pct.born.samestate              -1.234857e+00
## pct.samehouse1985                8.040709e-01
## pct.samecity1985                -7.049156e-01
## pct.samestate1985                2.135202e+00
## land.area                        5.250434e+01
## pop.density                      8.998056e-03
## pct.use.publictransit           -6.172411e+00
## pct.police.drugunits            -3.034929e+00

#produce Ridge trace plot
plot(best_model, xvar = "lambda")

#use fitted best model to make predictions
y_predicted <- predict(best_model, s = best_lambda, newx = x)

#find SST and SSE
sst <- sum((y - mean(y))^2)
sse <- sum((y_predicted - y)^2)

#find R-Squared
rsq <- 1 - sse/sst
rsq

## [1] 0.784115

This being the R2 from train dataset and being higher thatn the ones achieved previous make Ridge regression the best model achieved.

Business Statistics

Domnick Konde

2022-10-29