Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
50 Startups data set
This dataset contains data on 50 business startups from New York, California, and Florida (roughly 17 per state). The variables are R&D spending, administration spending, marketing spending, state, and profit.
columns:
R&D Spend
Administration
Marketing Spend
State
Profit (target variable)
data source: https://www.kaggle.com/farhanmd29/50-startups/data
library(caret)
library(dplyr)
library(tidyr)
# Keep an untransformed copy (startups_base) for a model on the original scale
startups_base <- read.csv("./50_Startups.csv")
startups <- read.csv("./50_Startups.csv")
head(startups)
## R.D.Spend Administration Marketing.Spend State Profit
## 1 165349.2 136897.80 471784.1 New York 192261.8
## 2 162597.7 151377.59 443898.5 California 191792.1
## 3 153441.5 101145.55 407934.5 Florida 191050.4
## 4 144372.4 118671.85 383199.6 New York 182902.0
## 5 142107.3 91391.77 366168.4 Florida 166187.9
## 6 131876.9 99814.71 362861.4 New York 156991.1
Checking missing values:
# Percentage of missing values in each column
sapply(startups, function(y) sum(is.na(y))) / nrow(startups) * 100
## R.D.Spend Administration Marketing.Spend State
## 0 0 0 0
## Profit
## 0
There are no missing values in the data set.
# Histograms of each numeric predictor and the target
par(mfrow = c(2,2))
hist(startups$R.D.Spend)
hist(startups$Administration)
hist(startups$Marketing.Spend)
hist(startups$Profit)
The variables do not look normally distributed.
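As an optional, more formal check (not part of the original write-up), a Shapiro-Wilk test could be run on each numeric column; small p-values would confirm the visual impression of non-normality.
# Shapiro-Wilk normality test for each numeric column (illustrative check)
numeric_cols <- c("R.D.Spend", "Administration", "Marketing.Spend", "Profit")
sapply(numeric_cols, function(col) shapiro.test(startups[[col]])$p.value)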
cor(startups %>% select(-State))
## R.D.Spend Administration Marketing.Spend Profit
## R.D.Spend 1.0000000 0.24195525 0.72424813 0.9729005
## Administration 0.2419552 1.00000000 -0.03215388 0.2007166
## Marketing.Spend 0.7242481 -0.03215388 1.00000000 0.7477657
## Profit 0.9729005 0.20071657 0.74776572 1.0000000
There is a strong correlation between the predictor variables R.D.Spend and Marketing.Spend (0.72).
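As a further check on this collinearity (not done in the original analysis, and assuming the car package is installed), variance inflation factors could be computed from a plain lm fit:
library(car)
# Variance inflation factors for the quantitative predictors; values well
# below 5 would suggest the collinearity is not severe
vif(lm(Profit ~ R.D.Spend + Administration + Marketing.Spend, data = startups))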
We have a categorical variable, State; I am going to transform it into dummy variables and drop one of the states to avoid the dummy variable trap.
dummy_state<-dummyVars(~ State, data = startups)
startups_trans <- data.frame(predict(dummy_state, newdata = startups), startups$R.D.Spend,startups$Administration, startups$Marketing.Spend, startups$Profit )
# dropping one of the states
startups_trans<-startups_trans %>% select(-State.Florida)
head(startups_trans)
## State.California State.New.York startups.R.D.Spend
## 1 0 1 165349.2
## 2 1 0 162597.7
## 3 0 0 153441.5
## 4 0 1 144372.4
## 5 0 0 142107.3
## 6 0 1 131876.9
## startups.Administration startups.Marketing.Spend startups.Profit
## 1 136897.80 471784.1 192261.8
## 2 151377.59 443898.5 191792.1
## 3 101145.55 407934.5 191050.4
## 4 118671.85 383199.6 182902.0
## 5 91391.77 366168.4 166187.9
## 6 99814.71 362861.4 156991.1
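As an aside, base R's model.matrix() would produce an equivalent encoding from the original data frame (a quick illustration, not used in the models below):
# Design matrix with treatment contrasts; one State level is dropped automatically
head(model.matrix(Profit ~ ., data = startups))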
Pre-processing the data: centering, scaling, and applying a Box-Cox transformation.
pp <- preProcess(startups_trans, method = c( "BoxCox", "center","scale"))
startups_trans<- predict(pp, startups_trans)
pp$method
## $BoxCox
## [1] "startups.Administration" "startups.Profit"
##
## $center
## [1] "State.California" "State.New.York"
## [3] "startups.R.D.Spend" "startups.Administration"
## [5] "startups.Marketing.Spend" "startups.Profit"
##
## $scale
## [1] "State.California" "State.New.York"
## [3] "startups.R.D.Spend" "startups.Administration"
## [5] "startups.Marketing.Spend" "startups.Profit"
##
## $ignore
## character(0)
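The estimated Box-Cox lambda values can be inspected as well (a sketch, assuming the preProcess object stores the fitted transformations in its bc element, as current caret versions do):
# Lambda estimates for the two Box-Cox-transformed columns
pp$bc$startups.Administration$lambda
pp$bc$startups.Profit$lambda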
Model 1 is built on the transformed data.
set.seed(123)
model_1<- train(startups.Profit ~., data = startups_trans, method = "lm", trControl = trainControl("cv", number = 10))
summary(model_1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8331 -0.1197 0.0016 0.1653 0.4280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.927e-16 3.310e-02 0.000 1.000
## State.California -2.389e-03 4.000e-02 -0.060 0.953
## State.New.York -2.692e-03 3.960e-02 -0.068 0.946
## startups.R.D.Spend 9.189e-01 5.288e-02 17.376 <2e-16 ***
## startups.Administration -2.039e-02 3.638e-02 -0.561 0.578
## startups.Marketing.Spend 8.057e-02 5.232e-02 1.540 0.131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2341 on 44 degrees of freedom
## Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
## F-statistic: 170.1 on 5 and 44 DF, p-value: < 2.2e-16
model_1
## Linear Regression
##
## 50 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 46, 46, 46, 46, 45, 45, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2169233 0.969957 0.1774682
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
varImp(model_1)
## lm variable importance
##
## Overall
## startups.R.D.Spend 100.00000
## startups.Marketing.Spend 8.54832
## startups.Administration 2.89219
## State.New.York 0.04771
## State.California 0.00000
R.D.Spend is by far the most important variable; Administration and Marketing.Spend contribute little to predicting startup profit, and State is not an important variable. These results are intuitive.
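For linear models, caret's varImp() ranking is based on the absolute t-statistics of the coefficients, so the same ordering can be recovered directly from the model summary (a sketch):
# Absolute t-statistics of the coefficients, which drive the varImp ranking
abs(summary(model_1$finalModel)$coefficients[-1, "t value"])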
Model 2 is built on the original (untransformed) data.
set.seed(123)
model_2<- train(Profit ~., data = startups_base, method = "lm", trControl = trainControl("cv", number = 10))
summary(model_2)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33504 -4736 90 6672 17338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.013e+04 6.885e+03 7.281 4.44e-09 ***
## R.D.Spend 8.060e-01 4.641e-02 17.369 < 2e-16 ***
## Administration -2.700e-02 5.223e-02 -0.517 0.608
## Marketing.Spend 2.698e-02 1.714e-02 1.574 0.123
## StateFlorida 1.988e+02 3.371e+03 0.059 0.953
## `StateNew York` -4.189e+01 3.256e+03 -0.013 0.990
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9439 on 44 degrees of freedom
## Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
## F-statistic: 169.9 on 5 and 44 DF, p-value: < 2.2e-16
model_2
## Linear Regression
##
## 50 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 46, 46, 45, 46, 46, 44, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 8771.876 0.9662062 6980.377
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Model 3 is built on the transformed data using only the quantitative predictors (the State dummies are dropped): R.D.Spend, Administration, and Marketing.Spend.
set.seed(123)
model_3<- train(startups.Profit ~ startups.R.D.Spend+startups.Administration+startups.Marketing.Spend, data = startups_trans, method = "lm", trControl = trainControl("cv", number = 10))
summary(model_3)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83393 -0.11888 0.00299 0.16366 0.42675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.945e-16 3.238e-02 0.000 1.000
## startups.R.D.Spend 9.186e-01 5.147e-02 17.846 <2e-16 ***
## startups.Administration -2.029e-02 3.555e-02 -0.571 0.571
## startups.Marketing.Spend 8.130e-02 5.024e-02 1.618 0.112
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2289 on 46 degrees of freedom
## Multiple R-squared: 0.9508, Adjusted R-squared: 0.9476
## F-statistic: 296.3 on 3 and 46 DF, p-value: < 2.2e-16
model_3
## Linear Regression
##
## 50 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 46, 46, 46, 46, 45, 45, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2078013 0.9719574 0.168609
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Model 1 has a cross-validated RMSE of 0.2169 and an adjusted R-squared of 0.9452.
Model 2 has a cross-validated RMSE of 8771.876; since it was fit on the original dollar scale, its RMSE is not directly comparable to that of Models 1 and 3 (its adjusted R-squared, 0.9452, is essentially identical to Model 1's).
Model 3 has the lowest 10-fold cross-validation RMSE, 0.2078, and the highest adjusted R-squared, 0.9476.
Model 3 is therefore preferred: it drops the uninformative State dummies and keeps only the quantitative predictors R.D.Spend, Administration, and Marketing.Spend.
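Because Model 2 is fit on the original dollar scale, only a scale-free metric such as R-squared is directly comparable across all three models; caret's resamples() can tabulate the cross-validated results side by side (a sketch):
# Collect the cross-validated results for a side-by-side comparison;
# only the Rsquared column is comparable across the different response scales
resamps <- resamples(list(model_1 = model_1, model_2 = model_2, model_3 = model_3))
summary(resamps)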
Model 3 coefficient interpretation (all variables were centered and scaled, and Profit was also Box-Cox transformed, so the coefficients are in standard-deviation units of the transformed variables):
Profit ≈ 0.919 * R.D.Spend - 0.020 * Administration + 0.081 * Marketing.Spend (the intercept is effectively zero)
A one-standard-deviation increase in R&D spending is associated with roughly a 0.92 standard-deviation increase in (transformed) profit, holding the other predictors fixed.
A one-standard-deviation increase in administration spending is associated with roughly a 0.02 standard-deviation decrease in profit; this effect is not statistically significant.
A one-standard-deviation increase in marketing spending is associated with roughly a 0.08 standard-deviation increase in profit; this effect is also not statistically significant.
On the original dollar scale, Model 2's coefficient for R.D.Spend (about 0.81) suggests that each additional dollar of R&D spending is associated with roughly $0.81 of additional profit.
Model 3 residual analysis:
# Extract residuals from the final model and inspect them visually
residuals <- resid(model_3)
plot(residuals)
qqnorm(residuals)
qqline(residuals)
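For completeness, the standard lm diagnostic plots could also be drawn from the underlying fit stored by caret (a sketch, using the finalModel element of the train object):
# Residuals-vs-fitted, normal Q-Q, scale-location, and residuals-vs-leverage plots
par(mfrow = c(2, 2))
plot(model_3$finalModel)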
The residuals look randomly scattered around 0 and approximately normally distributed, which suggests that no systematic structure remains in the residuals for the model to capture. On this evidence, the linear model appears appropriate for these data.
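For reference, the quadratic, dichotomous, and dichotomous-by-quantitative interaction terms called for in the prompt could be added to a model on the original data along the following lines (an illustrative sketch; the specific terms chosen here are assumptions, not part of the models fitted above):
# Illustrative model with a quadratic term for R&D spending, a dichotomous
# New York indicator, and an interaction between the indicator and R&D spending
startups_ext <- startups %>%
  mutate(NewYork = ifelse(State == "New York", 1, 0))
model_ext <- lm(Profit ~ R.D.Spend + I(R.D.Spend^2) + NewYork +
                  NewYork:R.D.Spend + Administration + Marketing.Spend,
                data = startups_ext)
summary(model_ext)
# The I(R.D.Spend^2) coefficient captures curvature in the R&D effect, the
# NewYork coefficient shifts the intercept for New York startups, and the
# NewYork:R.D.Spend coefficient lets the R&D slope differ for New York.
The same residual checks used above would apply to this extended model.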