Dataset:
Cars and salespersons
We would like to analyze the relationship between the total number of cars sold by each salesperson and the number of weeks each salesperson has worked to sell them.
library(tidyverse)
library(ggplot2)
library(dplyr)
SalesCars <- read_csv('https://raw.githubusercontent.com/GabrielSantos33/Non-linear-regression/main/CarsSoldWeek.csv', show_col_types = FALSE)
SalesCars

| J_period | Cars_sold |
|---|---|
| 168 | 272 |
| 428 | 300 |
| 296 | 311 |
| 392 | 365 |
| 80 | 167 |
| 56 | 149 |
| 352 | 366 |
| 444 | 310 |
| 168 | 192 |
| 200 | 229 |
| 4 | 88 |
| 52 | 118 |
| 20 | 62 |
| 228 | 319 |
| 72 | 193 |
ggplot(SalesCars, aes(x = J_period, y = Cars_sold)) +
  geom_point() +
  labs(title = "Scatter Plot",
       x = "Job period",
       y = "Total cars sold")
Does the relationship look linear, non-linear, or a bit of both?
Which model best represents this data?
Does a linear regression fit the data well?
ggplot(SalesCars, aes(x = J_period, y = Cars_sold)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Scatter Plot Example",
       x = "Job period",
       y = "Total cars sold")
lm_model <- lm(Cars_sold ~ J_period, data = SalesCars)
coeficient <- coef(lm_model)
r_squared <- summary(lm_model)$r.squared
predicted_y <- predict(lm_model)
SalesCars$r_squared <- r_squared
SalesCars$predicted_y <- predicted_y
cat("R-squared value: ", round(r_squared, 2), "\n")
## R-squared value: 0.8
cat("Predicted Y-values: ", paste(round(predicted_y, 2), collapse = ", "))
## Predicted Y-values: 212.32, 363.71, 286.85, 342.75, 161.08, 147.1, 319.46, 373.03, 212.32, 230.95, 116.83, 144.77, 126.14, 247.26, 156.42
intercept <- coef(lm_model)[1]
slope <- coef(lm_model)[2]
# Display the linear equation
cat("Linear equation: y =", round(intercept, 2), "+", round(slope, 2), "* x")
## Linear equation: y = 114.5 + 0.58 * x
Each additional week on the job is expected to add about 0.58 cars to a salesperson's total, a bit more than half a car per week.
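To make the slope interpretation concrete, here is a minimal sketch that refits the model on the values copied from the table in this document and predicts total sales for two job periods. The periods of 100 and 101 weeks are hypothetical and chosen only for illustration.

```r
# Data copied from the table in this document (J_period in weeks).
SalesCars <- data.frame(
  J_period  = c(168, 428, 296, 392, 80, 56, 352, 444, 168, 200, 4, 52, 20, 228, 72),
  Cars_sold = c(272, 300, 311, 365, 167, 149, 366, 310, 192, 229, 88, 118, 62, 319, 193)
)

lm_model <- lm(Cars_sold ~ J_period, data = SalesCars)

# Hypothetical job periods (in weeks), one week apart.
new_periods <- data.frame(J_period = c(100, 101))
pred <- predict(lm_model, newdata = new_periods)
print(round(pred, 2))

# The gap between two consecutive weeks is exactly the fitted slope.
week_effect <- unname(diff(pred))
print(round(week_effect, 2))
```

The difference between the two predictions equals the `J_period` coefficient, which is what "cars per additional week" means in this model.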
NOTE: In rare cases it makes sense to force the intercept to zero, but we will not do that here.
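For reference, a regression through the origin can be requested in an R formula with `0 +` (or `- 1`). The sketch below is only an illustration of the syntax on the values copied from the table; the name `lm_origin` is introduced here and is not part of the original analysis.

```r
# Data copied from the table in this document (J_period in weeks).
SalesCars <- data.frame(
  J_period  = c(168, 428, 296, 392, 80, 56, 352, 444, 168, 200, 4, 52, 20, 228, 72),
  Cars_sold = c(272, 300, 311, 365, 167, 149, 366, 310, 192, 229, 88, 118, 62, 319, 193)
)

# Model used in this document: intercept estimated from the data.
lm_model <- lm(Cars_sold ~ J_period, data = SalesCars)

# Hypothetical alternative: intercept forced to zero.
lm_origin <- lm(Cars_sold ~ 0 + J_period, data = SalesCars)

# Compare residual sums of squares; the model with an intercept can never fit worse.
rss_with_intercept <- sum(residuals(lm_model)^2)
rss_origin <- sum(residuals(lm_origin)^2)
print(c(with_intercept = rss_with_intercept, through_origin = rss_origin))

# Caution: summary() computes R-squared differently for a no-intercept model
# (around zero rather than around the mean), so the two R-squared values
# are not directly comparable.
```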
print(summary(lm_model))
##
## Call:
## lm(formula = Cars_sold ~ J_period, data = SalesCars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.142 -27.800 1.896 30.364 71.743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 114.49632 19.78813 5.786 6.32e-05 ***
## J_period 0.58228 0.08026 7.255 6.41e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.94 on 13 degrees of freedom
## Multiple R-squared: 0.8019, Adjusted R-squared: 0.7867
## F-statistic: 52.63 on 1 and 13 DF, p-value: 6.412e-06
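Linear equation: y = 114.5 + 0.58 * x
Multiple R = 0.8955: a very high correlation between job period and cars sold.
R-squared = 0.8019: the model explains about 80% of the variance in cars sold.
Adjusted R-squared = 0.7867.
Residual standard error = 45.94 (on 13 degrees of freedom): the typical distance, in cars, between the observations and the regression line.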
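The quantities above can be recomputed by hand, which shows how they relate to each other. This is a sketch on the values copied from the table; the variable names (`multiple_r`, `adj_r_squared`, `rse`) are introduced here for illustration.

```r
# Data copied from the table in this document (J_period in weeks).
SalesCars <- data.frame(
  J_period  = c(168, 428, 296, 392, 80, 56, 352, 444, 168, 200, 4, 52, 20, 228, 72),
  Cars_sold = c(272, 300, 311, 365, 167, 149, 366, 310, 192, 229, 88, 118, 62, 319, 193)
)

lm_model <- lm(Cars_sold ~ J_period, data = SalesCars)
s <- summary(lm_model)

# With a single predictor, multiple R is the correlation between x and y.
multiple_r <- cor(SalesCars$J_period, SalesCars$Cars_sold)
r_squared <- s$r.squared

# Adjusted R-squared penalizes for the number of predictors (p = 1 here).
n <- nrow(SalesCars)
p <- 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Residual standard error: square root of the residual sum of squares
# divided by the residual degrees of freedom.
rse <- sqrt(sum(residuals(lm_model)^2) / df.residual(lm_model))

print(round(c(multiple_r = multiple_r, r_squared = r_squared,
              adj_r_squared = adj_r_squared, rse = rse), 4))
```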
anova_results <- anova(lm_model)
print("ANOVA Results:")
## [1] "ANOVA Results:"
print(anova_results)
## Analysis of Variance Table
##
## Response: Cars_sold
## Df Sum Sq Mean Sq F value Pr(>F)
## J_period 1 111097 111097 52.633 6.412e-06 ***
## Residuals 13 27440 2111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
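Regression sum of squares = 111097 (with 1 degree of freedom, so the regression mean square is also 111097).
Residual sum of squares = 27440.
Total sum of squares = 138537.
Residual mean square (mean squared error) = 2111, which is low relative to the total sum of squares.
F statistic = 52.633, which is very high. The significance value is very small (p-value = 6.412e-06), so the model is statistically significant.
In the coefficient table, the p-value for the intercept is 6.32e-05 and the p-value for J_period is 6.41e-06.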
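The figures in the ANOVA table are tied together by a few standard identities. The worked arithmetic below uses only the numbers printed in the output above (13 is the residual degrees of freedom); the subscript names are notation chosen here.

```latex
$$
\begin{aligned}
SS_{\text{total}} &= SS_{\text{regression}} + SS_{\text{residual}} = 111097 + 27440 = 138537 \\
R^2 &= \frac{SS_{\text{regression}}}{SS_{\text{total}}} = \frac{111097}{138537} \approx 0.8019 \\
MS_{\text{residual}} &= \frac{SS_{\text{residual}}}{13} = \frac{27440}{13} \approx 2111 \\
F &= \frac{MS_{\text{regression}}}{MS_{\text{residual}}} = \frac{111097}{27440 / 13} \approx 52.63 \\
\text{Residual standard error} &= \sqrt{MS_{\text{residual}}} = \sqrt{27440 / 13} \approx 45.94
\end{aligned}
$$
```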
Let's look at the residuals.
residuals <- residuals(lm_model)
# Plot the residuals against the fitted (predicted) values
plot(fitted(lm_model), residuals, main = "Residuals Plot", xlab = "Predicted Cars_sold", ylab = "Residuals")
Simply transforming the explanatory variables does not make a model non-linear.
ggplot(SalesCars, aes(x = J_period, y = Cars_sold)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ log(x), se = FALSE) +
  labs(title = "Non-linear Scatter Plot Example - transformation",
       x = "Job period",
       y = "Total cars sold")
Let's try an exponential model. It is not a good model for this data.
ggplot(SalesCars, aes(x = J_period, y = Cars_sold)) +
  geom_point() +
  geom_smooth(method = "nls", formula = y ~ a * exp(b * x), se = FALSE, color = "red",
              method.args = list(start = list(a = 5, b = 0))) + # Red fitted exponential curve
  labs(title = "Exponential model",
       x = "X", y = "Y") + # Set title and axis labels
  theme_minimal()
Non-linear model using modelsq, a quadratic regression model. This is not a good one either.
library(ggplot2)
y <- SalesCars$Cars_sold
x <- SalesCars$J_period
# Fit non-linear regression model
modelsq <- nls(y ~ a * x^2 + b * x + c, start = list(a = 1, b = 1, c = 1))
p <- ggplot(SalesCars, aes(x = J_period, y = Cars_sold)) +
  geom_point() +
  labs(x = "X", y = "Y")
# Add fitted line to the plot
p <- p +
  stat_function(fun = function(x) predict(modelsq, newdata = data.frame(x = x)),
                color = "red")
print(p)
print(summary(modelsq))
##
## Formula: y ~ a * x^2 + b * x + c
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a -0.0018521 0.0005004 -3.702 0.00303 **
## b 1.4094525 0.2306406 6.111 5.25e-05 ***
## c 63.8509693 19.6280432 3.253 0.00692 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.68 on 12 degrees of freedom
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 3.252e-09
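The quadratic formula is a curve in x but is still linear in its coefficients a, b and c, which is consistent with nls converging in a single iteration. As a sketch of that point, the same curve can be fitted with `lm`; the name `lm_quad` is introduced here for illustration and the data are the values copied from the table.

```r
# Data copied from the table in this document (J_period in weeks).
SalesCars <- data.frame(
  J_period  = c(168, 428, 296, 392, 80, 56, 352, 444, 168, 200, 4, 52, 20, 228, 72),
  Cars_sold = c(272, 300, 311, 365, 167, 149, 366, 310, 192, 229, 88, 118, 62, 319, 193)
)
y <- SalesCars$Cars_sold
x <- SalesCars$J_period

# Quadratic model fitted with nls, as in this document.
modelsq <- nls(y ~ a * x^2 + b * x + c, start = list(a = 1, b = 1, c = 1))

# Same quadratic fitted by ordinary least squares; I() protects the square.
lm_quad <- lm(Cars_sold ~ J_period + I(J_period^2), data = SalesCars)

# Reorder the nls coefficients to (intercept, linear, quadratic) for comparison.
nls_coef <- unname(coef(modelsq)[c("c", "b", "a")])
lm_coef <- unname(coef(lm_quad))
print(rbind(nls = nls_coef, lm = lm_coef))
```

Both fits are least-squares solutions of the same problem, so the coefficients should agree up to numerical tolerance.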
residuals <- residuals(modelsq)
# Plot the residuals against the fitted (predicted) values
plot(fitted(modelsq), residuals, main = "Residuals Plot", xlab = "Predicted Cars_sold", ylab = "Residuals")
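One possible way to put the candidate models side by side is to tabulate their residual standard error and AIC. This is a sketch rather than part of the original analysis: the `fits` list and `comparison` table are names introduced here, the exponential model is left out, and with only 15 observations small differences should be read cautiously.

```r
# Data copied from the table in this document (J_period in weeks).
SalesCars <- data.frame(
  J_period  = c(168, 428, 296, 392, 80, 56, 352, 444, 168, 200, 4, 52, 20, 228, 72),
  Cars_sold = c(272, 300, 311, 365, 167, 149, 366, 310, 192, 229, 88, 118, 62, 319, 193)
)
x <- SalesCars$J_period
y <- SalesCars$Cars_sold

# Candidate models: straight line, log-transformed predictor, quadratic (nls).
fits <- list(
  linear    = lm(y ~ x),
  log_x     = lm(y ~ log(x)),
  quadratic = nls(y ~ a * x^2 + b * x + c, start = list(a = 1, b = 1, c = 1))
)

# Lower residual standard error and lower AIC both indicate a closer fit.
comparison <- data.frame(
  model       = names(fits),
  residual_se = unname(sapply(fits, function(m) summary(m)$sigma)),
  AIC         = unname(sapply(fits, AIC))
)
print(comparison)
```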