Introduction

Overview:

This data processing report is a part of task series to complete my course education in Full Stack Academy - Algoritma Data Science Education, Jakarta, Indonesia. This fourth task is about practising competencies for linear regression algorithm. The data itself can be downloaded from Kaggle.com.

Data source:
https://www.kaggle.com/root64shivansh/profit-in-startup-of-a-company

A Glimpse of The Data

The Data is telling about some variables that will create impact to Profit achievement for Start up companies in 3 regions:

1.California
2.New York
3.Florida

These variables are:
1. R.D. Spend which means Research and Development Spend/ Cost
2. Marketing Spend which means Marketing Cost
3. Administration which is Administration Cost
4. State, States of the Regions

Read, Head and Glimpse

We will start with reading data and take a look to a short glimpse of the data

## Observations: 50
## Variables: 5
## $ R.D.Spend       <dbl> 165349.20, 162597.70, 153441.51, 144372.41, 142107....
## $ Administration  <dbl> 136897.80, 151377.59, 101145.55, 118671.85, 91391.7...
## $ Marketing.Spend <dbl> 471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 3...
## $ State           <fct> New York, California, Florida, New York, Florida, N...
## $ Profit          <dbl> 192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 1...

Check missing value

Second step, we check missing values with ColSums and anyNA command

##       R.D.Spend  Administration Marketing.Spend           State          Profit 
##               0               0               0               0               0
## [1] FALSE

Exploratories

Correlations

Check correlations with some of commands like ggcorr and ggpairs

From those graph we could see preliminary correlation data:

  1. Research & Development has 100% correlation with Profit.
  2. Marketing has 70% correlation.
  3. Administration variable does not have any correlation with any variables.
  4. On the other side Research & Development has 70% correlation with Marketing spendings.

Plot all variables

Then we plot the data to get some pictures to describe data variance

xy plot Marketing Spend with States

xy plot RD Spend with States

xy plot Administration Spend with States

From all those plot we could have some preliminary insights that R.D Spend and Marketing Spend are variables that has most correlations with Profit.

Models

It is time to develop some models for the regressions

Based on all variables

## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend + Administration + 
##     factor(State), data = start)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33504  -4736     90   6672  17338 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.013e+04  6.885e+03   7.281 4.44e-09 ***
## R.D.Spend              8.060e-01  4.641e-02  17.369  < 2e-16 ***
## Marketing.Spend        2.698e-02  1.714e-02   1.574    0.123    
## Administration        -2.700e-02  5.223e-02  -0.517    0.608    
## factor(State)Florida   1.988e+02  3.371e+03   0.059    0.953    
## factor(State)New York -4.189e+01  3.256e+03  -0.013    0.990    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9439 on 44 degrees of freedom
## Multiple R-squared:  0.9508, Adjusted R-squared:  0.9452 
## F-statistic: 169.9 on 5 and 44 DF,  p-value: < 2.2e-16

Based on Spend Variables

## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend + Administration, 
##     data = start)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33534  -4795     63   6606  17275 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.012e+04  6.572e+03   7.626 1.06e-09 ***
## R.D.Spend        8.057e-01  4.515e-02  17.846  < 2e-16 ***
## Marketing.Spend  2.723e-02  1.645e-02   1.655    0.105    
## Administration  -2.682e-02  5.103e-02  -0.526    0.602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9232 on 46 degrees of freedom
## Multiple R-squared:  0.9507, Adjusted R-squared:  0.9475 
## F-statistic:   296 on 3 and 46 DF,  p-value: < 2.2e-16

Based on Marketing Spend

## 
## Call:
## lm(formula = Profit ~ Marketing.Spend, data = start)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83739 -18802   4925  15879  64642 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.000e+04  7.685e+03   7.808 4.29e-10 ***
## Marketing.Spend 2.465e-01  3.159e-02   7.803 4.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27040 on 48 degrees of freedom
## Multiple R-squared:  0.5592, Adjusted R-squared:   0.55 
## F-statistic: 60.88 on 1 and 48 DF,  p-value: 4.381e-10

Significancy with Stepwise

Significancy with Stepwise Backward

## Start:  AIC=920.87
## Profit ~ R.D.Spend + Administration + Marketing.Spend + State
## 
##                   Df  Sum of Sq        RSS     AIC
## - State            2 5.1666e+05 3.9209e+09  916.88
## - Administration   1 2.3816e+07 3.9442e+09  919.17
## <none>                          3.9203e+09  920.87
## - Marketing.Spend  1 2.2071e+08 4.1410e+09  921.61
## - R.D.Spend        1 2.6878e+10 3.0799e+10 1021.94
## 
## Step:  AIC=916.88
## Profit ~ R.D.Spend + Administration + Marketing.Spend
## 
##                   Df  Sum of Sq        RSS     AIC
## - Administration   1 2.3539e+07 3.9444e+09  915.18
## <none>                          3.9209e+09  916.88
## - Marketing.Spend  1 2.3349e+08 4.1543e+09  917.77
## - R.D.Spend        1 2.7147e+10 3.1068e+10 1018.37
## 
## Step:  AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
## 
##                   Df  Sum of Sq        RSS     AIC
## <none>                          3.9444e+09  915.18
## - Marketing.Spend  1 3.1165e+08 4.2560e+09  916.98
## - R.D.Spend        1 3.1149e+10 3.5094e+10 1022.46
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = start)
## 
## Coefficients:
##     (Intercept)        R.D.Spend  Marketing.Spend  
##       4.698e+04        7.966e-01        2.991e-02

Significancy with Step wise Forward

## Start:  AIC=1061.42
## Profit ~ 1
## 
##                   Df  Sum of Sq        RSS     AIC
## + R.D.Spend        1 7.5349e+10 4.2560e+09  916.98
## + Marketing.Spend  1 4.4511e+10 3.5094e+10 1022.46
## + Administration   1 3.2071e+09 7.6398e+10 1061.36
## <none>                          7.9605e+10 1061.42
## + State            2 1.9006e+09 7.7704e+10 1064.21
## 
## Step:  AIC=916.98
## Profit ~ R.D.Spend
## 
##                   Df Sum of Sq        RSS    AIC
## + Marketing.Spend  1 311651716 3944394850 915.18
## <none>                         4256046566 916.98
## + Administration   1 101704903 4154341663 917.77
## + State            2  14451679 4241594888 920.81
## 
## Step:  AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
## 
##                  Df Sum of Sq        RSS    AIC
## <none>                        3944394850 915.18
## + Administration  1  23538549 3920856301 916.88
## + State           2    239050 3944155801 919.17
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = start)
## 
## Coefficients:
##     (Intercept)        R.D.Spend  Marketing.Spend  
##       4.698e+04        7.966e-01        2.991e-02

Significancy with Stepwise Both

## Start:  AIC=1061.42
## Profit ~ 1
## 
##                   Df  Sum of Sq        RSS     AIC
## + R.D.Spend        1 7.5349e+10 4.2560e+09  916.98
## + Marketing.Spend  1 4.4511e+10 3.5094e+10 1022.46
## + Administration   1 3.2071e+09 7.6398e+10 1061.36
## <none>                          7.9605e+10 1061.42
## + State            2 1.9006e+09 7.7704e+10 1064.21
## 
## Step:  AIC=916.98
## Profit ~ R.D.Spend
## 
##                   Df  Sum of Sq        RSS     AIC
## + Marketing.Spend  1 3.1165e+08 3.9444e+09  915.18
## <none>                          4.2560e+09  916.98
## + Administration   1 1.0170e+08 4.1543e+09  917.77
## + State            2 1.4452e+07 4.2416e+09  920.81
## - R.D.Spend        1 7.5349e+10 7.9605e+10 1061.42
## 
## Step:  AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
## 
##                   Df  Sum of Sq        RSS     AIC
## <none>                          3.9444e+09  915.18
## + Administration   1 2.3539e+07 3.9209e+09  916.88
## - Marketing.Spend  1 3.1165e+08 4.2560e+09  916.98
## + State            2 2.3905e+05 3.9442e+09  919.17
## - R.D.Spend        1 3.1149e+10 3.5094e+10 1022.46
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = start)
## 
## Coefficients:
##     (Intercept)        R.D.Spend  Marketing.Spend  
##       4.698e+04        7.966e-01        2.991e-02

Based on Reg Subsets

Best Models based on Significancy

After review the models with Stepwise and Reg Subset we could create some best models:

## 
## Call:
## lm(formula = Profit ~ Marketing.Spend + R.D.Spend, data = start)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33645  -4632   -414   6484  17097 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.698e+04  2.690e+03  17.464   <2e-16 ***
## Marketing.Spend 2.991e-02  1.552e-02   1.927     0.06 .  
## R.D.Spend       7.966e-01  4.135e-02  19.266   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared:  0.9505, Adjusted R-squared:  0.9483 
## F-statistic: 450.8 on 2 and 47 DF,  p-value: < 2.2e-16

Based on signif. codes

Residuals Plot

This residual plot do not show any non-random pattern, and at this point suffice to say we can put the model specifications into production use knowing that it cannot be improved upon any further

Assumptions

homoscedasticity & heterosscedasticity

## 
##  studentized Breusch-Pagan test
## 
## data:  lm.fit
## BP = 2.8431, df = 2, p-value = 0.2413

Breusch-Pagan hypothesis test:

H0: Variation of residual is constant (Homoscedasticity)

H1: Variation of residual is not constant (Heteroscedasticity)

The test has a p-value above the significance level of 0.05, therefore we fail to reject the null hypothesis. We can conclude that the residuals has a constant variance and the assumption is passed.

Assumption of Normality

## 
##  Shapiro-Wilk normality test
## 
## data:  lm.fit$residuals
## W = 0.93717, p-value = 0.01042

Shapiro-Wilk hypothesis test:

H0: Residuals are normally distributed

H1: Residuals are not normally distributed

The test has a p-value below the significance level of 0.05, therefore we reject the null hypothesis. We can conclude that the residuals are not normally distributed.

Assumption of Muliticollinearity

## Marketing.Spend       R.D.Spend 
##        2.103206        2.103206

Exactly how large a VIF has to be before it causes issues is a subject of debate. What is known is that the more your VIF increases, the less reliable your regression results are going to be. In general, a VIF above 10 indicates high correlation and is cause for concern. Some authors suggest a more conservative level of 2.5 or above.

Conclusions

Based on those regression steps above we can have preliminary conclusion that best fit model that we could use is lm.fit with Marketing Spend & R.D Spend is the most predictor. The coefficient of each variable is:

R.D Spend : 7.966e-01 Marketing Spend : 2.991e-02 Intercept : 4.698e+04 With the consideration of the assumptions.