this LBB is used to learn about Linear Regression by predicting profits of 50 companies using the “50 Startups Data” dataset provided in kaggle. By learning Linear Regression through this dataset we can predict which companies to invest for maximizing profit
First lets load the necessary library we will be using
lets load in the data we will be using into an object. We will name the object “profit”
Lets check the structure of our data
#> 'data.frame': 50 obs. of 5 variables:
#> $ R.D.Spend : num 165349 162598 153442 144372 142107 ...
#> $ Administration : num 136898 151378 101146 118672 91392 ...
#> $ Marketing.Spend: num 471784 443899 407935 383200 366168 ...
#> $ State : chr "New York" "California" "Florida" "New York" ...
#> $ Profit : num 192262 191792 191050 182902 166188 ...
It seems that the type of data on State is wrong, lets change that. While also changing the type of data, lets change the name so we can understand better
#> 'data.frame': 50 obs. of 5 variables:
#> $ R.D.Spend : num 165349 162598 153442 144372 142107 ...
#> $ Administration : num 136898 151378 101146 118672 91392 ...
#> $ Marketing.Spend: num 471784 443899 407935 383200 366168 ...
#> $ State : Factor w/ 3 levels "California","Florida",..: 3 1 2 3 2 3 1 2 3 1 ...
#> $ Profit : num 192262 191792 191050 182902 166188 ...
Since this is linear regression that means that the data provided have correlation to one another, lets analyze that correlation
Insight
-R&D Spending is very correlated to
Profit
-Marketing Spending is quite correlated to Profit
-R&D Spending and Marketing Spending is quite correlated to one
another
Making a Linear Regression model to predit “profit” based on all variable provided
model_profit_all <- lm(Profit ~ ., profit)
model_profit <- lm(Profit ~ R.D.Spend, profit)
summary(model_profit_all)#>
#> Call:
#> lm(formula = Profit ~ ., data = profit)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -33504 -4736 90 6672 17338
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 50125.34383 6884.81973 7.281 0.00000000444 ***
#> R.D.Spend 0.80602 0.04641 17.369 < 0.0000000000000002 ***
#> Administration -0.02700 0.05223 -0.517 0.608
#> Marketing.Spend 0.02698 0.01714 1.574 0.123
#> StateFlorida 198.78879 3371.00712 0.059 0.953
#> StateNew York -41.88702 3256.03913 -0.013 0.990
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9439 on 44 degrees of freedom
#> Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
#> F-statistic: 169.9 on 5 and 44 DF, p-value: < 0.00000000000000022
#>
#> Call:
#> lm(formula = Profit ~ R.D.Spend, data = profit)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -34351 -4626 -375 6249 17188
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 49032.89914 2537.89695 19.32 <0.0000000000000002 ***
#> R.D.Spend 0.85429 0.02931 29.15 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9416 on 48 degrees of freedom
#> Multiple R-squared: 0.9465, Adjusted R-squared: 0.9454
#> F-statistic: 849.8 on 1 and 48 DF, p-value: < 0.00000000000000022
Insight
-For everytime R.D Spending increase by
1, the profit increases by 0.80602 times
-From the level of
significance, R.d Spend is highly significant
-The model have an
R-squared value of 0.9452 which indicates that the model is very good
because the closer the value to 1 the better it is.
The performance of our model (how well our model predict the target variable) can be calculated using root mean squared error: \[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \] RMSE is often considered better than MAE or mean absolute error because it squares the differences between the actual values and the predicted values, meaning that predictions with higher errors are penalized more significantly. This metric is frequently used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the caret package to evaluate model performance. Below is the performance of the first model (with complete variables).
pred_prof_all <- predict(model_profit_all,
newdata = profit)
pred_prof_RD <- predict(model_profit,
newdata = profit)
#RMSE
RMSE(y_pred = model_profit_all$fitted.values, y_true= profit$Profit)#> [1] 8854.761
#> [1] 9226.101
Lets also compare the value of MAPE
#> [1] 0.1060236
#> [1] 0.1107014
Insight
The best model is to use
model_profit_all which uses all variables to predit the outcome of
profit rather than model_profit which only uses R.D.Spend to predict the
profit
Linear regression is a parametric model, meaning it relies on
specific assumptions to create a model equation. If these assumptions
are not met, the model can produce misleading results or biased
estimators. In this section, we will examine the second model, where
some variables have been removed, to see if it adheres to the linearity
assumption.
The linear regression model assumes a straight-line relationship
between the predictors and the response variable. If the actual
relationship deviates significantly from linearity, almost all
conclusions drawn from the model fit may be unreliable, and the model’s
predictive accuracy can be significantly compromised.
Residual plots are a useful graphical tool for detecting non-linearity. These plots show the relationship between residuals (errors) and predicted (fitted) values. If a pattern emerges in the residual plot, it indicates that the model can be further improved or that it does not meet the linearity assumption.
There could be a pattern showing here where the residuals becomes more
positive as the fitted value increases before it decreases again. The
pattern could indicate that our model may not be linear enough
The second assumption in linear regression is that the residuals
follow normal distribution. We can easily check this by using the
Saphiro-Wilk normality test.
#>
#> Shapiro-Wilk normality test
#>
#> data: model_profit_all$residuals
#> W = 0.9373, p-value = 0.01055
the null hypothesis is that the residuals follow normal distribution. With p-value < 0.05, we can conclude that our hypothesis is rejected, and our residuals are not following the normal distribution.
The errors produced by the model are expected to spread randomly or with constant variation. If visualized, the errors should not form a pattern. This condition is also known as homoscedasticity.
Based on the plot, we can assume that the errors produced by the model
is categorized as “homoscedasticity” due to the random cloud pattern it
shown in the graph. Second way to detect heterocesdasticity is using the
Breusch-Pagan test, with null hypothesis is there is no
heterocesdasticity.
#>
#> studentized Breusch-Pagan test
#>
#> data: model_profit_all
#> BP = 3.1508, df = 5, p-value = 0.6767
With p-value > 0.05, we can conclude that “homoscedasticity” is present in our model.
Multicollinearity refers to a correlation between independent variables (predictors) in a regression model. To check for multicollinearity, we can measure the Variance Inflation Factor (VIF). As a rule of thumb, a VIF value exceeding 5 or 10 indicates a problematic amount of collinearity.
#> GVIF Df GVIF^(1/(2*Df))
#> R.D.Spend 2.495511 1 1.579719
#> Administration 1.177766 1 1.085249
#> Marketing.Spend 2.416797 1 1.554605
#> State 1.062673 2 1.015313
Based on the value of MAPE, we see that the error is roughly 10%, we will try to improve this model using backward elimination.
#> Start: AIC=920.87
#> Profit ~ R.D.Spend + Administration + Marketing.Spend + State
#>
#> Df Sum of Sq RSS AIC
#> - State 2 516657 3920856301 916.88
#> - Administration 1 23816156 3944155801 919.17
#> <none> 3920339644 920.87
#> - Marketing.Spend 1 220708706 4141048350 921.61
#> - R.D.Spend 1 26878168212 30798507857 1021.94
#>
#> Step: AIC=916.88
#> Profit ~ R.D.Spend + Administration + Marketing.Spend
#>
#> Df Sum of Sq RSS AIC
#> - Administration 1 23538549 3944394850 915.18
#> <none> 3920856301 916.88
#> - Marketing.Spend 1 233485362 4154341663 917.77
#> - R.D.Spend 1 27147076244 31067932545 1018.37
#>
#> Step: AIC=915.18
#> Profit ~ R.D.Spend + Marketing.Spend
#>
#> Df Sum of Sq RSS AIC
#> <none> 3944394850 915.18
#> - Marketing.Spend 1 311651716 4256046566 916.98
#> - R.D.Spend 1 31149105710 35093500560 1022.46
Lets see the summary of the new model
#>
#> Call:
#> lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = profit)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -33645 -4632 -414 6484 17097
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 46975.86422 2689.93292 17.464 <0.0000000000000002 ***
#> R.D.Spend 0.79658 0.04135 19.266 <0.0000000000000002 ***
#> Marketing.Spend 0.02991 0.01552 1.927 0.06 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9161 on 47 degrees of freedom
#> Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
#> F-statistic: 450.8 on 2 and 47 DF, p-value: < 0.00000000000000022
lets evaluate the model that we have tuned
#> [1] 8881.886
#> [1] 0.1060871
R-value comparison
#> [1] 0.9483418
#> [1] 0.9451562
#> [1] 0.9454215
Insight
-although the tuned model have a higher
R-value compared to other models, it seems that there haven’t been much
increase in the value of MAPE
There is little to no discernible pattern in our residual plot, we can
conclude that our model is linear.
#>
#> Shapiro-Wilk normality test
#>
#> data: model_backward$residuals
#> W = 0.93717, p-value = 0.01042
With p-value > 0.05, we can conclude that our residuals are
normally distributed.
#>
#> studentized Breusch-Pagan test
#>
#> data: model_backward
#> BP = 2.8431, df = 2, p-value = 0.2413
With p-value > 0.05, we can conclude that heterocesdasticity is not
present.
Variables that are useful to predict the profit of a company seems to be R&D Spending and also the Marketing Spending. Our final model have satisfied the classical assumptions of a linear regression. The R-value of our model is quite high with value of 0.9483418 meaning that the variables we chosen are highly connected to profit and can give us an accurate description of the company profits. The accuracy of our model is also measured using RMSE (Root Mean Squared Error) and also MAPE (mean absolute percentage error) and is predicted to be quite accurate with MAPE value of 10.6% error
We have already learn how to build a linear regression model and what need to be concerned when building the model.