Dataset importing

library(RCurl)
## Loading required package: bitops
url = 'https://raw.githubusercontent.com/salma71/Machine-learning-tutorial/master/Part%202%20-%20Regression/Section%205%20-%20Multiple%20Linear%20Regression/50_Startups.csv'
dataset = read.csv(url, header = TRUE)
head(dataset)

Exploring the dataset

str(dataset)
## 'data.frame':    50 obs. of  5 variables:
##  $ R.D.Spend      : num  165349 162598 153442 144372 142107 ...
##  $ Administration : num  136898 151378 101146 118672 91392 ...
##  $ Marketing.Spend: num  471784 443899 407935 383200 366168 ...
##  $ State          : Factor w/ 3 levels "California","Florida",..: 3 1 2 3 2 3 1 2 3 1 ...
##  $ Profit         : num  192262 191792 191050 182902 166188 ...

The dataset has 50 instances (observations) with 5 attributes(variables). State attribute is dichotomous term with 3 levels “New York”, “California”, and “Florida”. Here we want to know which attribute has the greatest impact on profit, is it R.D spend, or Marketing spend…etc.

Data processing

The first step I would make is to convert the categorical variable into dummy variable for easy interpretation.

# encode the categorical variables
dataset$State = factor(dataset$State,
                       levels = c('New York', 'California', 'Florida'), 
                       labels = c(1, 2, 3))

head(dataset)

Getting Summary statistics

summary(dataset)
##    R.D.Spend      Administration   Marketing.Spend  State 
##  Min.   :     0   Min.   : 51283   Min.   :     0   1:17  
##  1st Qu.: 39936   1st Qu.:103731   1st Qu.:129300   2:17  
##  Median : 73051   Median :122700   Median :212716   3:16  
##  Mean   : 73722   Mean   :121345   Mean   :211025         
##  3rd Qu.:101603   3rd Qu.:144842   3rd Qu.:299469         
##  Max.   :165349   Max.   :182646   Max.   :471784         
##      Profit      
##  Min.   : 14681  
##  1st Qu.: 90139  
##  Median :107978  
##  Mean   :112013  
##  3rd Qu.:139766  
##  Max.   :192262

Visualizing

It is a good approach to perform EDA before going into building a model. I would make a univariate for each variable. The best is the boxplot and histogram to investigate if the distribution is a Guassian distribution. It would help picking the right algorithm to build the model.

library(ggplot2)
ggplot(dataset, aes(x = State, y = Profit)) +
        geom_boxplot(outlier.colour="red", outlier.shape=8,
                outlier.size=4)

It seems that New York has some outlier points.

hist(dataset$R.D.Spend)

hist(dataset$Administration)

hist(dataset$Marketing.Spend)

Multivariate plot

scatter plot is the best candidate for that purpose

library(GGally)
ggpairs(dataset)

Investigating the correlation between variables

ggcorr(dataset)
## Warning in ggcorr(dataset): data in column(s) 'State' are not numeric and
## were ignored

It is obvious that both R.D Spend and Marketing Spend both have a great effect on the profit.

Building model

This dataset is an ideal candidate for polynomial linear regression.

regressor = lm(formula = Profit ~., data = dataset)
summary(regressor)
## 
## Call:
## lm(formula = Profit ~ ., data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33504  -4736     90   6672  17338 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.008e+04  6.953e+03   7.204 5.76e-09 ***
## R.D.Spend        8.060e-01  4.641e-02  17.369  < 2e-16 ***
## Administration  -2.700e-02  5.223e-02  -0.517    0.608    
## Marketing.Spend  2.698e-02  1.714e-02   1.574    0.123    
## State2           4.189e+01  3.256e+03   0.013    0.990    
## State3           2.407e+02  3.339e+03   0.072    0.943    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9439 on 44 degrees of freedom
## Multiple R-squared:  0.9508, Adjusted R-squared:  0.9452 
## F-statistic: 169.9 on 5 and 44 DF,  p-value: < 2.2e-16

As we can see that R.D Spend has the greatest effect on the Profit. Eventhough Marketing spend shows a strong correlation to profit, it doesn’t acctually affect the profit - P-value > 0.05 significance level, which confirms that correlation doesn’t imply causation.

Intecepts interprets how the variables correlated to the dependnt variable “Profit”. The positive sign means positive correlation and negative sign means negative correlation.

The standard error is an estimate of the standard deviation of the coefficient, the amount it varies across cases.

The Student’s t distribution (t-test) describes how the mean of a sample with a certain number of observations is expected to behave.

The P value tells you how confident you can be that each individual variable has some correlation with the dependent variable.

The R-squared is the fraction of the variation in the dependent variable that is accounted for the independent variables.

The adjusted-R squared same as R-squared but it penalize the model for taking unnecessary independent variables, more robust that R-squared.

Now, we can make a backword elimination to eliminate independent varaibles with p-value > 0.05 significance level. I wrote a function to automate the elimination.

backwardElimination <- function(x, sl) {
    numVars = length(x)
    for (i in c(1:numVars)){
      regressor = lm(formula = Profit ~ ., data = x)
      maxVar = max(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"])
      if (maxVar > sl){
        j = which(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"] == maxVar)
        x = x[, -j]
      }
      numVars = numVars - 1
    }
    return(summary(regressor))
  }
  
SL = 0.05
backwardElimination(dataset, SL)
## 
## Call:
## lm(formula = Profit ~ ., data = x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34351  -4626   -375   6249  17188 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.903e+04  2.538e+03   19.32   <2e-16 ***
## R.D.Spend   8.543e-01  2.931e-02   29.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9416 on 48 degrees of freedom
## Multiple R-squared:  0.9465, Adjusted R-squared:  0.9454 
## F-statistic: 849.8 on 1 and 48 DF,  p-value: < 2.2e-16

After elimination, we got only R.D Spend that has an influence on the profit. The model can be read as to increase the profit by one unit, we need to increase the R.D Spend by 8.563e-01 units, and by this variable, it explains 94.5% of the model variablility.

Conclusion

To sum up, we can include that the model can be presented by a linear regression model.