Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
50 Startups data set
This dataset contains data on 50 business startups from New York, California, and Florida (roughly 17 per state). The variables are R&D spending, administration spending, marketing spending, state, and profit.
columns:
R&D Spend
Administration
Marketing Spend
State
Profit (target variable)
data source: https://www.kaggle.com/farhanmd29/50-startups/data
library(caret)
library(dplyr)
library(tidyr)
# Keep an untransformed copy (startups_base) for a model on the original scale
startups_base <- read.csv("./50_Startups.csv")
startups <- read.csv("./50_Startups.csv")
head(startups)
## R.D.Spend Administration Marketing.Spend State Profit
## 1 165349.2 136897.80 471784.1 New York 192261.8
## 2 162597.7 151377.59 443898.5 California 191792.1
## 3 153441.5 101145.55 407934.5 Florida 191050.4
## 4 144372.4 118671.85 383199.6 New York 182902.0
## 5 142107.3 91391.77 366168.4 Florida 166187.9
## 6 131876.9 99814.71 362861.4 New York 156991.1
Checking missing values:
# Percentage of missing values in each column
sapply(startups, function(y) sum(is.na(y))) / nrow(startups) * 100
## R.D.Spend Administration Marketing.Spend State
## 0 0 0 0
## Profit
## 0
There are no missing values in the data set.
# Histograms of each numeric predictor and the target
par(mfrow = c(2,2))
hist(startups$R.D.Spend)
hist(startups$Administration)
hist(startups$Marketing.Spend)
hist(startups$Profit)
The variables do not look normally distributed.
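As an optional, more formal check (not part of the original write-up), a Shapiro-Wilk test could be run on each numeric column; small p-values would confirm the visual impression of non-normality.
# Shapiro-Wilk normality test for each numeric column (illustrative check)
numeric_cols <- c("R.D.Spend", "Administration", "Marketing.Spend", "Profit")
sapply(numeric_cols, function(col) shapiro.test(startups[[col]])$p.value)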
cor(startups %>% select(-State))
## R.D.Spend Administration Marketing.Spend Profit
## R.D.Spend 1.0000000 0.24195525 0.72424813 0.9729005
## Administration 0.2419552 1.00000000 -0.03215388 0.2007166
## Marketing.Spend 0.7242481 -0.03215388 1.00000000 0.7477657
## Profit 0.9729005 0.20071657 0.74776572 1.0000000
There is a strong correlation between the predictor variables R.D.Spend and Marketing.Spend (0.72).
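As a further check on this collinearity (not done in the original analysis, and assuming the car package is installed), variance inflation factors could be computed from a plain lm fit:
library(car)
# Variance inflation factors for the quantitative predictors; values well
# below 5 would suggest the collinearity is not severe
vif(lm(Profit ~ R.D.Spend + Administration + Marketing.Spend, data = startups))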
We have a categorical variable, State; I am going to transform it into dummy variables and drop one of the states to avoid the dummy variable trap.
dummy_state<-dummyVars(~ State, data = startups)
startups_trans <- data.frame(predict(dummy_state, newdata = startups), startups$R.D.Spend,startups$Administration, startups$Marketing.Spend, startups$Profit )
# dropping one of the states
startups_trans<-startups_trans %>% select(-State.Florida)
head(startups_trans)
## State.California State.New.York startups.R.D.Spend
## 1 0 1 165349.2
## 2 1 0 162597.7
## 3 0 0 153441.5
## 4 0 1 144372.4
## 5 0 0 142107.3
## 6 0 1 131876.9
## startups.Administration startups.Marketing.Spend startups.Profit
## 1 136897.80 471784.1 192261.8
## 2 151377.59 443898.5 191792.1
## 3 101145.55 407934.5 191050.4
## 4 118671.85 383199.6 182902.0
## 5 91391.77 366168.4 166187.9
## 6 99814.71 362861.4 156991.1
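As an aside, base R's model.matrix() would produce an equivalent encoding from the original data frame (a quick illustration, not used in the models below):
# Design matrix with treatment contrasts; one State level is dropped automatically
head(model.matrix(Profit ~ ., data = startups))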
Pre-processing the data: centering, scaling, and applying a Box-Cox transformation.
pp <- preProcess(startups_trans, method = c( "BoxCox", "center","scale"))
startups_trans<- predict(pp, startups_trans)
pp$method
## $BoxCox
## [1] "startups.Administration" "startups.Profit"
##
## $center
## [1] "State.California" "State.New.York"
## [3] "startups.R.D.Spend" "startups.Administration"
## [5] "startups.Marketing.Spend" "startups.Profit"
##
## $scale
## [1] "State.California" "State.New.York"
## [3] "startups.R.D.Spend" "startups.Administration"
## [5] "startups.Marketing.Spend" "startups.Profit"
##
## $ignore
## character(0)
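The estimated Box-Cox lambda values can be inspected as well (a sketch, assuming the preProcess object stores the fitted transformations in its bc element, as current caret versions do):
# Lambda estimates for the two Box-Cox-transformed columns
pp$bc$startups.Administration$lambda
pp$bc$startups.Profit$lambda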
Model 1 is built on the transformed data.
set.seed(123)
model_1<- train(startups.Profit ~., data = startups_trans, method = "lm", trControl = trainControl("cv", number = 10))
summary(model_1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8331 -0.1197 0.0016 0.1653 0.4280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.927e-16 3.310e-02 0.000 1.000
## State.California -2.389e-03 4.000e-02 -0.060 0.953
## State.New.York -2.692e-03 3.960e-02 -0.068 0.946
## startups.R.D.Spend 9.189e-01 5.288e-02 17.376 <2e-16 ***
## startups.Administration -2.039e-02 3.638e-02 -0.561 0.578
## startups.Marketing.Spend 8.057e-02 5.232e-02 1.540 0.131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2341 on 44 degrees of freedom
## Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
## F-statistic: 170.1 on 5 and 44 DF, p-value: < 2.2e-16
model_1
## Linear Regression
##
## 50 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 46, 46, 46, 46, 45, 45, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2169233 0.969957 0.1774682
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
varImp(model_1)
## lm variable importance
##
## Overall
## startups.R.D.Spend 100.00000
## startups.Marketing.Spend 8.54832
## startups.Administration 2.89219
## State.New.York 0.04771
## State.California 0.00000
R.D.Spend is by far the most important variable; Administration and Marketing.Spend contribute little to predicting startup profit, and State is not an important variable. These results are intuitive.
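For linear models, caret's varImp() ranking is based on the absolute t-statistics of the coefficients, so the same ordering can be recovered directly from the model summary (a sketch):
# Absolute t-statistics of the coefficients, which drive the varImp ranking
abs(summary(model_1$finalModel)$coefficients[-1, "t value"])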
Model 2 is built on the original (untransformed) data.
set.seed(123)
model_2<- train(Profit ~., data = startups_base, method = "lm", trControl = trainControl("cv", number = 10))
summary(model_2)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33504 -4736 90 6672 17338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.013e+04 6.885e+03 7.281 4.44e-09 ***
## R.D.Spend 8.060e-01 4.641e-02 17.369 < 2e-16 ***
## Administration -2.700e-02 5.223e-02 -0.517 0.608
## Marketing.Spend 2.698e-02 1.714e-02 1.574 0.123
## StateFlorida 1.988e+02 3.371e+03 0.059 0.953
## `StateNew York` -4.189e+01 3.256e+03 -0.013 0.990
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9439 on 44 degrees of freedom
## Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
## F-statistic: 169.9 on 5 and 44 DF, p-value: < 2.2e-16
model_2
## Linear Regression
##
## 50 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 46, 46, 45, 46, 46, 44, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 8771.876 0.9662062 6980.377
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Model 3 is built on the transformed data using only the quantitative predictors (the State dummies are dropped): R.D.Spend, Administration, and Marketing.Spend.
set.seed(123)
model_3<- train(startups.Profit ~ startups.R.D.Spend+startups.Administration+startups.Marketing.Spend, data = startups_trans, method = "lm", trControl = trainControl("cv", number = 10))
summary(model_3)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83393 -0.11888 0.00299 0.16366 0.42675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.945e-16 3.238e-02 0.000 1.000
## startups.R.D.Spend 9.186e-01 5.147e-02 17.846 <2e-16 ***
## startups.Administration -2.029e-02 3.555e-02 -0.571 0.571
## startups.Marketing.Spend 8.130e-02 5.024e-02 1.618 0.112
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2289 on 46 degrees of freedom
## Multiple R-squared: 0.9508, Adjusted R-squared: 0.9476
## F-statistic: 296.3 on 3 and 46 DF, p-value: < 2.2e-16
model_3
## Linear Regression
##
## 50 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 46, 46, 46, 46, 45, 45, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2078013 0.9719574 0.168609
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Model 1 has a cross-validated RMSE of 0.2169 and an adjusted R-squared of 0.9452.
Model 2 has a cross-validated RMSE of 8771.876; since it was fit on the original dollar scale, its RMSE is not directly comparable to that of Models 1 and 3 (its adjusted R-squared, 0.9452, is essentially identical to Model 1's).
Model 3 has the lowest 10-fold cross-validation RMSE, 0.2078, and the highest adjusted R-squared, 0.9476.
Model 3 is therefore preferred: it drops the uninformative State dummies and keeps only the quantitative predictors R.D.Spend, Administration, and Marketing.Spend.
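Because Model 2 is fit on the original dollar scale, only a scale-free metric such as R-squared is directly comparable across all three models; caret's resamples() can tabulate the cross-validated results side by side (a sketch):
# Collect the cross-validated results for a side-by-side comparison;
# only the Rsquared column is comparable across the different response scales
resamps <- resamples(list(model_1 = model_1, model_2 = model_2, model_3 = model_3))
summary(resamps)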
Model 3 coefficient interpretation (all variables were centered and scaled, and Profit was also Box-Cox transformed, so the coefficients are in standard-deviation units of the transformed variables):
Profit ≈ 0.919 * R.D.Spend - 0.020 * Administration + 0.081 * Marketing.Spend (the intercept is effectively zero)
A one-standard-deviation increase in R&D spending is associated with roughly a 0.92 standard-deviation increase in (transformed) profit, holding the other predictors fixed.
A one-standard-deviation increase in administration spending is associated with roughly a 0.02 standard-deviation decrease in profit; this effect is not statistically significant.
A one-standard-deviation increase in marketing spending is associated with roughly a 0.08 standard-deviation increase in profit; this effect is also not statistically significant.
On the original dollar scale, Model 2's coefficient for R.D.Spend (about 0.81) suggests that each additional dollar of R&D spending is associated with roughly $0.81 of additional profit.
Model 3 residual analysis:
# Extract residuals from the final model and inspect them visually
residuals <- resid(model_3)
plot(residuals)
qqnorm(residuals)
qqline(residuals)
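For completeness, the standard lm diagnostic plots could also be drawn from the underlying fit stored by caret (a sketch, using the finalModel element of the train object):
# Residuals-vs-fitted, normal Q-Q, scale-location, and residuals-vs-leverage plots
par(mfrow = c(2, 2))
plot(model_3$finalModel)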
The residuals look randomly scattered around 0 and approximately normally distributed, which suggests that no systematic structure remains in the residuals for the model to capture. On this evidence, the linear model appears appropriate for these data.
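For reference, the quadratic, dichotomous, and dichotomous-by-quantitative interaction terms called for in the prompt could be added to a model on the original data along the following lines (an illustrative sketch; the specific terms chosen here are assumptions, not part of the models fitted above):
# Illustrative model with a quadratic term for R&D spending, a dichotomous
# New York indicator, and an interaction between the indicator and R&D spending
startups_ext <- startups %>%
  mutate(NewYork = ifelse(State == "New York", 1, 0))
model_ext <- lm(Profit ~ R.D.Spend + I(R.D.Spend^2) + NewYork +
                  NewYork:R.D.Spend + Administration + Marketing.Spend,
                data = startups_ext)
summary(model_ext)
# The I(R.D.Spend^2) coefficient captures curvature in the R&D effect, the
# NewYork coefficient shifts the intercept for New York startups, and the
# NewYork:R.D.Spend coefficient lets the R&D slope differ for New York.
The same residual checks used above would apply to this extended model.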