Multiple Linear Regression on the 50 Startups dataset

dataset=read.csv('50_Startups.csv')
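Before encoding, it is worth confirming that the file was read correctly; a minimal check (assuming the standard 50_Startups.csv columns R.D.Spend, Administration, Marketing.Spend, State and Profit):

str(dataset)    # expect 50 observations of 5 variables
head(dataset)   # preview the first rows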

Encoding categorical data

dataset$State = factor(dataset$State,
                       levels = c('New York', 'California', 'Florida'),
                       labels = c(1, 2, 3))
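A quick check that the encoding worked as intended (illustrative, not part of the original output):

table(dataset$State)    # counts per encoded level: 1 = New York, 2 = California, 3 = Florida
class(dataset$State)    # should now be "factor"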

Splitting the dataset into the Training set and the Test set

library(caTools)
## Warning: package 'caTools' was built under R version 3.4.2
set.seed(123)
split=sample.split(dataset$Profit, SplitRatio=0.8)
training_set=subset(dataset, split==TRUE)
test_set=subset(dataset, split==FALSE)
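With SplitRatio = 0.8 on 50 observations, the split should give roughly 40 training rows and 10 test rows; a quick sanity check:

nrow(training_set)  # expected: 40
nrow(test_set)      # expected: 10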

Fitting Multiple Linear Regression

Note: this first model is fitted on the full dataset rather than the training set; the reduced models in the backward-elimination section below use the training set.

regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State, data = dataset)
summary(regressor)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + 
##     State, data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33504  -4736     90   6672  17338 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.008e+04  6.953e+03   7.204 5.76e-09 ***
## R.D.Spend        8.060e-01  4.641e-02  17.369  < 2e-16 ***
## Administration  -2.700e-02  5.223e-02  -0.517    0.608    
## Marketing.Spend  2.698e-02  1.714e-02   1.574    0.123    
## State2           4.189e+01  3.256e+03   0.013    0.990    
## State3           2.407e+02  3.339e+03   0.072    0.943    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9439 on 44 degrees of freedom
## Multiple R-squared:  0.9508, Adjusted R-squared:  0.9452 
## F-statistic: 169.9 on 5 and 44 DF,  p-value: < 2.2e-16
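If the model should instead be fitted only on the training set (as is done for the reduced models below), a minimal sketch with the same formula would be (regressor_train is an illustrative name, not from the original):

# Sketch: same formula, fitted on the training set instead of the full dataset
regressor_train = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
                     data = training_set)
summary(regressor_train)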

Predicting the Test set results

y_pred= predict(regressor, newdata=test_set)
y_pred
##         4         5         8        11        16        20        21 
## 173584.98 172277.13 160155.64 135664.64 146143.64 115594.19 116570.73 
##        24        31        32 
## 110123.80  99629.01  97617.30
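To see how close these predictions are to the observed profits, they can be compared side by side and summarised with the root-mean-square error (an illustrative check, not part of the original output):

data.frame(actual = test_set$Profit, predicted = y_pred)   # side-by-side comparison
sqrt(mean((test_set$Profit - y_pred)^2))                   # root-mean-square prediction error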

Building the optimal model using Backward Elimination

First, remove the predictor with the highest p-value: State (its dummy variables have p-values of 0.990 and 0.943). Note that these reduced models are fitted on the training set.

regressor1 = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, data = training_set)
summary(regressor1)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, 
##     data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33117  -4858    -36   6020  17957 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.970e+04  7.120e+03   6.980 3.48e-08 ***
## R.D.Spend        7.983e-01  5.356e-02  14.905  < 2e-16 ***
## Administration  -2.895e-02  5.603e-02  -0.517    0.609    
## Marketing.Spend  3.283e-02  1.987e-02   1.652    0.107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9629 on 36 degrees of freedom
## Multiple R-squared:  0.9499, Adjusted R-squared:  0.9457 
## F-statistic: 227.6 on 3 and 36 DF,  p-value: < 2.2e-16

Next, remove Administration, which now has the highest p-value (0.609).

regressor2 = lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = training_set)
summary(regressor2)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33294  -4763   -354   6351  17693 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.638e+04  3.019e+03  15.364   <2e-16 ***
## R.D.Spend       7.879e-01  4.916e-02  16.026   <2e-16 ***
## Marketing.Spend 3.538e-02  1.905e-02   1.857   0.0713 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9533 on 37 degrees of freedom
## Multiple R-squared:  0.9495, Adjusted R-squared:  0.9468 
## F-statistic: 348.1 on 2 and 37 DF,  p-value: < 2.2e-16
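The manual elimination above can also be automated. The sketch below repeatedly drops the term whose coefficient has the highest p-value until every remaining predictor is below a chosen significance level (the helper name backward_eliminate and the sl = 0.05 threshold are illustrative choices, not from the original):

# Sketch: automated backward elimination by p-value
backward_eliminate = function(model, sl = 0.05) {
  repeat {
    coefs = summary(model)$coefficients
    pvals = coefs[, 4][rownames(coefs) != "(Intercept)"]    # p-values of all predictors
    if (length(pvals) == 0 || max(pvals) <= sl) break        # stop when all predictors are significant
    worst_coef = names(which.max(pvals))                     # coefficient with the highest p-value
    terms_now  = attr(terms(model), "term.labels")
    worst_term = terms_now[startsWith(worst_coef, terms_now)][1]  # map a dummy (State2) back to its term (State)
    model = update(model, as.formula(paste(". ~ . -", worst_term)))
  }
  model
}

# Example usage on the full model, refitted on the training set
full_model = lm(Profit ~ R.D.Spend + Administration + Marketing.Spend + State, data = training_set)
summary(backward_eliminate(full_model, sl = 0.05))

Note that with a strict sl = 0.05 this loop would also drop Marketing.Spend (p = 0.0713), which is exactly the judgment call discussed in the conclusion below.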

Conclusion

We keep R.D.Spend and Marketing.Spend as predictors, although the p-value of Marketing.Spend (0.0713) is slightly above the 0.05 significance level.
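As a final check, the retained model (regressor2) can be used to predict the test set in the same way as before (an illustrative addition, not part of the original output):

y_pred2 = predict(regressor2, newdata = test_set)   # predictions from the reduced model
data.frame(actual = test_set$Profit, full_model = y_pred, reduced_model = y_pred2)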