Intro

this LBB is used to learn about Linear Regression by predicting profits of 50 companies using the “50 Startups Data” dataset provided in kaggle. By learning Linear Regression through this dataset we can predict which companies to invest for maximizing profit

Data Preparation

First lets load the necessary library we will be using

library(dplyr)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)

lets load in the data we will be using into an object. We will name the object “profit”

profit <- read.csv("50_Startups.csv")
profit

Data Cleansing

Lets check the structure of our data

str(profit)
#> 'data.frame':    50 obs. of  5 variables:
#>  $ R.D.Spend      : num  165349 162598 153442 144372 142107 ...
#>  $ Administration : num  136898 151378 101146 118672 91392 ...
#>  $ Marketing.Spend: num  471784 443899 407935 383200 366168 ...
#>  $ State          : chr  "New York" "California" "Florida" "New York" ...
#>  $ Profit         : num  192262 191792 191050 182902 166188 ...

It seems that the type of data on State is wrong, lets change that. While also changing the type of data, lets change the name so we can understand better

profit <- profit %>% mutate(State = as.factor(State))
str(profit)
#> 'data.frame':    50 obs. of  5 variables:
#>  $ R.D.Spend      : num  165349 162598 153442 144372 142107 ...
#>  $ Administration : num  136898 151378 101146 118672 91392 ...
#>  $ Marketing.Spend: num  471784 443899 407935 383200 366168 ...
#>  $ State          : Factor w/ 3 levels "California","Florida",..: 3 1 2 3 2 3 1 2 3 1 ...
#>  $ Profit         : num  192262 191792 191050 182902 166188 ...

Exploratory Data Analysis

Since this is linear regression that means that the data provided have correlation to one another, lets analyze that correlation

ggcorr(profit,
       hjust = 0.8,
       layout.exp = 2,
       label = TRUE)

Insight
-R&D Spending is very correlated to Profit
-Marketing Spending is quite correlated to Profit
-R&D Spending and Marketing Spending is quite correlated to one another

Modeling

Making a Linear Regression model to predit “profit” based on all variable provided

profit
model_profit_all <- lm(Profit ~ ., profit)
model_profit <- lm(Profit ~ R.D.Spend, profit)
summary(model_profit_all)
#> 
#> Call:
#> lm(formula = Profit ~ ., data = profit)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -33504  -4736     90   6672  17338 
#> 
#> Coefficients:
#>                    Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)     50125.34383  6884.81973   7.281        0.00000000444 ***
#> R.D.Spend           0.80602     0.04641  17.369 < 0.0000000000000002 ***
#> Administration     -0.02700     0.05223  -0.517                0.608    
#> Marketing.Spend     0.02698     0.01714   1.574                0.123    
#> StateFlorida      198.78879  3371.00712   0.059                0.953    
#> StateNew York     -41.88702  3256.03913  -0.013                0.990    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9439 on 44 degrees of freedom
#> Multiple R-squared:  0.9508, Adjusted R-squared:  0.9452 
#> F-statistic: 169.9 on 5 and 44 DF,  p-value: < 0.00000000000000022
summary(model_profit)
#> 
#> Call:
#> lm(formula = Profit ~ R.D.Spend, data = profit)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -34351  -4626   -375   6249  17188 
#> 
#> Coefficients:
#>                Estimate  Std. Error t value            Pr(>|t|)    
#> (Intercept) 49032.89914  2537.89695   19.32 <0.0000000000000002 ***
#> R.D.Spend       0.85429     0.02931   29.15 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9416 on 48 degrees of freedom
#> Multiple R-squared:  0.9465, Adjusted R-squared:  0.9454 
#> F-statistic: 849.8 on 1 and 48 DF,  p-value: < 0.00000000000000022

Insight
-For everytime R.D Spending increase by 1, the profit increases by 0.80602 times
-From the level of significance, R.d Spend is highly significant
-The model have an R-squared value of 0.9452 which indicates that the model is very good because the closer the value to 1 the better it is.

Evaluation

The performance of our model (how well our model predict the target variable) can be calculated using root mean squared error: \[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \] RMSE is often considered better than MAE or mean absolute error because it squares the differences between the actual values and the predicted values, meaning that predictions with higher errors are penalized more significantly. This metric is frequently used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the caret package to evaluate model performance. Below is the performance of the first model (with complete variables).

pred_prof_all <- predict(model_profit_all,
                     newdata = profit)
pred_prof_RD <- predict(model_profit, 
                        newdata = profit)

#RMSE
RMSE(y_pred = model_profit_all$fitted.values, y_true= profit$Profit)
#> [1] 8854.761
RMSE(y_pred = model_profit$fitted.values, y_true= profit$Profit)
#> [1] 9226.101

Lets also compare the value of MAPE

MAPE(y_pred = model_profit_all$fitted.values, y_true= profit$Profit)
#> [1] 0.1060236
MAPE(y_pred = model_profit$fitted.values, y_true= profit$Profit)
#> [1] 0.1107014

Insight
The best model is to use model_profit_all which uses all variables to predit the outcome of profit rather than model_profit which only uses R.D.Spend to predict the profit

Assumptions

Linear regression is a parametric model, meaning it relies on specific assumptions to create a model equation. If these assumptions are not met, the model can produce misleading results or biased estimators. In this section, we will examine the second model, where some variables have been removed, to see if it adheres to the linearity assumption.

1. Linearity


The linear regression model assumes a straight-line relationship between the predictors and the response variable. If the actual relationship deviates significantly from linearity, almost all conclusions drawn from the model fit may be unreliable, and the model’s predictive accuracy can be significantly compromised.

Residual plots are a useful graphical tool for detecting non-linearity. These plots show the relationship between residuals (errors) and predicted (fitted) values. If a pattern emerges in the residual plot, it indicates that the model can be further improved or that it does not meet the linearity assumption.

plot(model_profit_all, which = 1)
abline(h = 5000, col = "green")
abline(h = -5000, col = "green")

There could be a pattern showing here where the residuals becomes more positive as the fitted value increases before it decreases again. The pattern could indicate that our model may not be linear enough

2. Normality Test


The second assumption in linear regression is that the residuals follow normal distribution. We can easily check this by using the Saphiro-Wilk normality test.

shapiro.test(model_profit_all$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_profit_all$residuals
#> W = 0.9373, p-value = 0.01055

the null hypothesis is that the residuals follow normal distribution. With p-value < 0.05, we can conclude that our hypothesis is rejected, and our residuals are not following the normal distribution.

3. Heterocedasticity

The errors produced by the model are expected to spread randomly or with constant variation. If visualized, the errors should not form a pattern. This condition is also known as homoscedasticity.

plot(x = model_profit_all$fitted.values, y = model_profit_all$residuals)
abline(h = 0, col = "red")

Based on the plot, we can assume that the errors produced by the model is categorized as “homoscedasticity” due to the random cloud pattern it shown in the graph. Second way to detect heterocesdasticity is using the Breusch-Pagan test, with null hypothesis is there is no heterocesdasticity.

bptest(model_profit_all)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_profit_all
#> BP = 3.1508, df = 5, p-value = 0.6767

With p-value > 0.05, we can conclude that “homoscedasticity” is present in our model.

4. Multicollinearity

Multicollinearity refers to a correlation between independent variables (predictors) in a regression model. To check for multicollinearity, we can measure the Variance Inflation Factor (VIF). As a rule of thumb, a VIF value exceeding 5 or 10 indicates a problematic amount of collinearity.

vif(model_profit_all)
#>                     GVIF Df GVIF^(1/(2*Df))
#> R.D.Spend       2.495511  1        1.579719
#> Administration  1.177766  1        1.085249
#> Marketing.Spend 2.416797  1        1.554605
#> State           1.062673  2        1.015313

Tuning model

Based on the value of MAPE, we see that the error is roughly 10%, we will try to improve this model using backward elimination.

model_backward <- step(object = model_profit_all,
                       direction = "backward")
#> Start:  AIC=920.87
#> Profit ~ R.D.Spend + Administration + Marketing.Spend + State
#> 
#>                   Df   Sum of Sq         RSS     AIC
#> - State            2      516657  3920856301  916.88
#> - Administration   1    23816156  3944155801  919.17
#> <none>                            3920339644  920.87
#> - Marketing.Spend  1   220708706  4141048350  921.61
#> - R.D.Spend        1 26878168212 30798507857 1021.94
#> 
#> Step:  AIC=916.88
#> Profit ~ R.D.Spend + Administration + Marketing.Spend
#> 
#>                   Df   Sum of Sq         RSS     AIC
#> - Administration   1    23538549  3944394850  915.18
#> <none>                            3920856301  916.88
#> - Marketing.Spend  1   233485362  4154341663  917.77
#> - R.D.Spend        1 27147076244 31067932545 1018.37
#> 
#> Step:  AIC=915.18
#> Profit ~ R.D.Spend + Marketing.Spend
#> 
#>                   Df   Sum of Sq         RSS     AIC
#> <none>                            3944394850  915.18
#> - Marketing.Spend  1   311651716  4256046566  916.98
#> - R.D.Spend        1 31149105710 35093500560 1022.46

Lets see the summary of the new model

summary(model_backward)
#> 
#> Call:
#> lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = profit)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -33645  -4632   -414   6484  17097 
#> 
#> Coefficients:
#>                    Estimate  Std. Error t value            Pr(>|t|)    
#> (Intercept)     46975.86422  2689.93292  17.464 <0.0000000000000002 ***
#> R.D.Spend           0.79658     0.04135  19.266 <0.0000000000000002 ***
#> Marketing.Spend     0.02991     0.01552   1.927                0.06 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9161 on 47 degrees of freedom
#> Multiple R-squared:  0.9505, Adjusted R-squared:  0.9483 
#> F-statistic: 450.8 on 2 and 47 DF,  p-value: < 0.00000000000000022

Evaluation

lets evaluate the model that we have tuned

pred_prof_backward <- predict(model_backward,
                              newdata = profit)
#RMSE 
RMSE(y_pred = pred_prof_backward, y_true = profit$Profit)
#> [1] 8881.886
#MAPE
MAPE(y_pred = pred_prof_backward, y_true = profit$Profit)
#> [1] 0.1060871

R-value comparison

summary(model_backward)$adj.r.squared
#> [1] 0.9483418
summary(model_profit_all)$adj.r.squared
#> [1] 0.9451562
summary(model_profit)$adj.r.squared
#> [1] 0.9454215

Insight
-although the tuned model have a higher R-value compared to other models, it seems that there haven’t been much increase in the value of MAPE

Assumptions

1. Linearity

plot(model_backward, which = 1)
abline(h = 5000, col = "green")
abline(h = -5000, col = "green")

There is little to no discernible pattern in our residual plot, we can conclude that our model is linear.

2. Normality Test

shapiro.test(model_backward$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_backward$residuals
#> W = 0.93717, p-value = 0.01042

With p-value > 0.05, we can conclude that our residuals are normally distributed.

3. Heterocedasticity

bptest(model_backward)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_backward
#> BP = 2.8431, df = 2, p-value = 0.2413
plot(x = model_backward$fitted.values, y = model_backward$residuals)
abline(h = 0, col = "red")

With p-value > 0.05, we can conclude that heterocesdasticity is not present.

4. Multicollinearity

vif(model_backward)
#>       R.D.Spend Marketing.Spend 
#>        2.103206        2.103206

Conclusion

Variables that are useful to predict the profit of a company seems to be R&D Spending and also the Marketing Spending. Our final model have satisfied the classical assumptions of a linear regression. The R-value of our model is quite high with value of 0.9483418 meaning that the variables we chosen are highly connected to profit and can give us an accurate description of the company profits. The accuracy of our model is also measured using RMSE (Root Mean Squared Error) and also MAPE (mean absolute percentage error) and is predicted to be quite accurate with MAPE value of 10.6% error

We have already learn how to build a linear regression model and what need to be concerned when building the model.