Overview:
This data processing report is a part of task series to complete my course education in Full Stack Academy - Algoritma Data Science Education, Jakarta, Indonesia. This fourth task is about practising competencies for linear regression algorithm. The data itself can be downloaded from Kaggle.com.
Data source:
https://www.kaggle.com/root64shivansh/profit-in-startup-of-a-company
We will use these set of libraries for processing data
The Data is telling about some variables that will create impact to Profit achievement for Start up companies in 3 regions:
1.California
2.New York
3.Florida
These variables are:
1. R.D. Spend which means Research and Development Spend/ Cost
2. Marketing Spend which means Marketing Cost
3. Administration which is Administration Cost
4. State, States of the Regions
Read, Head and Glimpse
We will start with reading data and take a look to a short glimpse of the data
## Observations: 50
## Variables: 5
## $ R.D.Spend <dbl> 165349.20, 162597.70, 153441.51, 144372.41, 142107....
## $ Administration <dbl> 136897.80, 151377.59, 101145.55, 118671.85, 91391.7...
## $ Marketing.Spend <dbl> 471784.1, 443898.5, 407934.5, 383199.6, 366168.4, 3...
## $ State <fct> New York, California, Florida, New York, Florida, N...
## $ Profit <dbl> 192261.8, 191792.1, 191050.4, 182902.0, 166187.9, 1...
Check missing value
Second step, we check missing values with ColSums and anyNA command
## R.D.Spend Administration Marketing.Spend State Profit
## 0 0 0 0 0
## [1] FALSE
Correlations
Check correlations with some of commands like ggcorr and ggpairs
From those graph we could see preliminary correlation data:
Then we plot the data to get some pictures to describe data variance
xy plot Marketing Spend with States
xyplot(Profit ~ Marketing.Spend | State , data = start,
panel = function(x, y)
{ panel.xyplot(x, y)
panel.abline(lm( x ~ y))
})xy plot RD Spend with States
xyplot(Profit ~ R.D.Spend | State , data = start,
panel = function(x, y)
{ panel.xyplot(x, y)
panel.abline(lm( x ~ y))
})xy plot Administration Spend with States
xyplot(Profit ~ Administration | State , data = start,
panel = function(x, y)
{ panel.xyplot(x, y)
panel.abline(lm( x ~ y))
})From all those plot we could have some preliminary insights that R.D Spend and Marketing Spend are variables that has most correlations with Profit.
It is time to develop some models for the regressions
Based on all variables
starts_up <- lm(formula = Profit ~ R.D.Spend + Marketing.Spend + Administration +factor(State), start)
summary(starts_up)##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend + Administration +
## factor(State), data = start)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33504 -4736 90 6672 17338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.013e+04 6.885e+03 7.281 4.44e-09 ***
## R.D.Spend 8.060e-01 4.641e-02 17.369 < 2e-16 ***
## Marketing.Spend 2.698e-02 1.714e-02 1.574 0.123
## Administration -2.700e-02 5.223e-02 -0.517 0.608
## factor(State)Florida 1.988e+02 3.371e+03 0.059 0.953
## factor(State)New York -4.189e+01 3.256e+03 -0.013 0.990
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9439 on 44 degrees of freedom
## Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
## F-statistic: 169.9 on 5 and 44 DF, p-value: < 2.2e-16
Based on Spend Variables
starts_up_Spend <- lm(formula = Profit ~ R.D.Spend + Marketing.Spend + Administration ,data = start)
summary(starts_up_Spend)##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend + Administration,
## data = start)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33534 -4795 63 6606 17275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.012e+04 6.572e+03 7.626 1.06e-09 ***
## R.D.Spend 8.057e-01 4.515e-02 17.846 < 2e-16 ***
## Marketing.Spend 2.723e-02 1.645e-02 1.655 0.105
## Administration -2.682e-02 5.103e-02 -0.526 0.602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9232 on 46 degrees of freedom
## Multiple R-squared: 0.9507, Adjusted R-squared: 0.9475
## F-statistic: 296 on 3 and 46 DF, p-value: < 2.2e-16
Based on Marketing Spend
##
## Call:
## lm(formula = Profit ~ Marketing.Spend, data = start)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83739 -18802 4925 15879 64642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.000e+04 7.685e+03 7.808 4.29e-10 ***
## Marketing.Spend 2.465e-01 3.159e-02 7.803 4.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27040 on 48 degrees of freedom
## Multiple R-squared: 0.5592, Adjusted R-squared: 0.55
## F-statistic: 60.88 on 1 and 48 DF, p-value: 4.381e-10
Significancy with Stepwise Backward
## Start: AIC=920.87
## Profit ~ R.D.Spend + Administration + Marketing.Spend + State
##
## Df Sum of Sq RSS AIC
## - State 2 5.1666e+05 3.9209e+09 916.88
## - Administration 1 2.3816e+07 3.9442e+09 919.17
## <none> 3.9203e+09 920.87
## - Marketing.Spend 1 2.2071e+08 4.1410e+09 921.61
## - R.D.Spend 1 2.6878e+10 3.0799e+10 1021.94
##
## Step: AIC=916.88
## Profit ~ R.D.Spend + Administration + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## - Administration 1 2.3539e+07 3.9444e+09 915.18
## <none> 3.9209e+09 916.88
## - Marketing.Spend 1 2.3349e+08 4.1543e+09 917.77
## - R.D.Spend 1 2.7147e+10 3.1068e+10 1018.37
##
## Step: AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## <none> 3.9444e+09 915.18
## - Marketing.Spend 1 3.1165e+08 4.2560e+09 916.98
## - R.D.Spend 1 3.1149e+10 3.5094e+10 1022.46
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = start)
##
## Coefficients:
## (Intercept) R.D.Spend Marketing.Spend
## 4.698e+04 7.966e-01 2.991e-02
Significancy with Step wise Forward
lm.none <- lm(Profit ~ 1, start)
step(lm.none, scope=list(lower=lm.none, upper=lm.all), direction="forward")## Start: AIC=1061.42
## Profit ~ 1
##
## Df Sum of Sq RSS AIC
## + R.D.Spend 1 7.5349e+10 4.2560e+09 916.98
## + Marketing.Spend 1 4.4511e+10 3.5094e+10 1022.46
## + Administration 1 3.2071e+09 7.6398e+10 1061.36
## <none> 7.9605e+10 1061.42
## + State 2 1.9006e+09 7.7704e+10 1064.21
##
## Step: AIC=916.98
## Profit ~ R.D.Spend
##
## Df Sum of Sq RSS AIC
## + Marketing.Spend 1 311651716 3944394850 915.18
## <none> 4256046566 916.98
## + Administration 1 101704903 4154341663 917.77
## + State 2 14451679 4241594888 920.81
##
## Step: AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## <none> 3944394850 915.18
## + Administration 1 23538549 3920856301 916.88
## + State 2 239050 3944155801 919.17
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = start)
##
## Coefficients:
## (Intercept) R.D.Spend Marketing.Spend
## 4.698e+04 7.966e-01 2.991e-02
Significancy with Stepwise Both
## Start: AIC=1061.42
## Profit ~ 1
##
## Df Sum of Sq RSS AIC
## + R.D.Spend 1 7.5349e+10 4.2560e+09 916.98
## + Marketing.Spend 1 4.4511e+10 3.5094e+10 1022.46
## + Administration 1 3.2071e+09 7.6398e+10 1061.36
## <none> 7.9605e+10 1061.42
## + State 2 1.9006e+09 7.7704e+10 1064.21
##
## Step: AIC=916.98
## Profit ~ R.D.Spend
##
## Df Sum of Sq RSS AIC
## + Marketing.Spend 1 3.1165e+08 3.9444e+09 915.18
## <none> 4.2560e+09 916.98
## + Administration 1 1.0170e+08 4.1543e+09 917.77
## + State 2 1.4452e+07 4.2416e+09 920.81
## - R.D.Spend 1 7.5349e+10 7.9605e+10 1061.42
##
## Step: AIC=915.18
## Profit ~ R.D.Spend + Marketing.Spend
##
## Df Sum of Sq RSS AIC
## <none> 3.9444e+09 915.18
## + Administration 1 2.3539e+07 3.9209e+09 916.88
## - Marketing.Spend 1 3.1165e+08 4.2560e+09 916.98
## + State 2 2.3905e+05 3.9442e+09 919.17
## - R.D.Spend 1 3.1149e+10 3.5094e+10 1022.46
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = start)
##
## Coefficients:
## (Intercept) R.D.Spend Marketing.Spend
## 4.698e+04 7.966e-01 2.991e-02
Based on Reg Subsets
regs <- regsubsets(Profit ~ R.D.Spend + Marketing.Spend +Administration +
State,data = start, nbest=10)
plot(regs, scale="adjr", main="All possible regression: ranked by Adjusted R-squared")Best Models based on Significancy
After review the models with Stepwise and Reg Subset we could create some best models:
##
## Call:
## lm(formula = Profit ~ Marketing.Spend + R.D.Spend, data = start)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33645 -4632 -414 6484 17097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.698e+04 2.690e+03 17.464 <2e-16 ***
## Marketing.Spend 2.991e-02 1.552e-02 1.927 0.06 .
## R.D.Spend 7.966e-01 4.135e-02 19.266 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
## F-statistic: 450.8 on 2 and 47 DF, p-value: < 2.2e-16
Based on signif. codes
This residual plot do not show any non-random pattern, and at this point suffice to say we can put the model specifications into production use knowing that it cannot be improved upon any further
homoscedasticity & heterosscedasticity
##
## studentized Breusch-Pagan test
##
## data: lm.fit
## BP = 2.8431, df = 2, p-value = 0.2413
Breusch-Pagan hypothesis test:
H0: Variation of residual is constant (Homoscedasticity)
H1: Variation of residual is not constant (Heteroscedasticity)
The test has a p-value above the significance level of 0.05, therefore we fail to reject the null hypothesis. We can conclude that the residuals has a constant variance and the assumption is passed.
Assumption of Normality
##
## Shapiro-Wilk normality test
##
## data: lm.fit$residuals
## W = 0.93717, p-value = 0.01042
Shapiro-Wilk hypothesis test:
H0: Residuals are normally distributed
H1: Residuals are not normally distributed
The test has a p-value below the significance level of 0.05, therefore we reject the null hypothesis. We can conclude that the residuals are not normally distributed.
Assumption of Muliticollinearity
## Marketing.Spend R.D.Spend
## 2.103206 2.103206
Exactly how large a VIF has to be before it causes issues is a subject of debate. What is known is that the more your VIF increases, the less reliable your regression results are going to be. In general, a VIF above 10 indicates high correlation and is cause for concern. Some authors suggest a more conservative level of 2.5 or above.
Based on those regression steps above we can have preliminary conclusion that best fit model that we could use is lm.fit with Marketing Spend & R.D Spend is the most predictor. The coefficient of each variable is:
R.D Spend : 7.966e-01 Marketing Spend : 2.991e-02 Intercept : 4.698e+04 With the consideration of the assumptions.