Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered the explanatory (independent) variable (X), and the other is considered the dependent variable (Y). Simple linear regression relates one independent variable (X) to one dependent variable (Y), and it applies when the relationship between the variables can be described with a straight line.
Linear regression is a basic and commonly used type of predictive analysis. It is a way to model a relationship between two sets of variables. The result is a linear regression equation that can be used to make predictions about data.
Linear regression equation: \(Y = a + bX\)
Where Y = dependent variable, X = independent variable, a = y-intercept, and b = slope of the line.
Formulas for a and b:
\[a = \frac {(\sum Y)(\sum X^2) - (\sum X)(\sum XY)} {n(\sum X^2)-(\sum X)^2}\] \[b = \frac {n(\sum XY)-(\sum X) (\sum Y)} {n(\sum X^2)-(\sum X)^2}\]
Here, n = number of observations.
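As a quick check of these formulas, the sketch below applies them to the first six youtube/sales pairs from the marketing data printed later in this article and compares the result with R's built-in `lm()`. The variable names (`x`, `y`, `denom`, `fit`) are illustrative choices, not part of the original analysis.

```r
# First six observations of the marketing data (youtube budget, sales).
x <- c(276.12, 53.40, 20.64, 181.80, 216.96, 10.44)
y <- c(26.52, 12.48, 11.16, 22.20, 15.48, 8.64)
n <- length(x)  # number of observations

# Shared denominator of both formulas.
denom <- n * sum(x^2) - sum(x)^2

a <- (sum(y) * sum(x^2) - sum(x) * sum(x * y)) / denom  # y-intercept
b <- (n * sum(x * y) - sum(x) * sum(y)) / denom         # slope

# lm() should return the same estimates up to floating-point error.
fit <- lm(y ~ x)
print(c(a = a, b = b))
print(coef(fit))
```

Six points are far too few for a meaningful model; the point is only that the closed-form sums and `lm()` agree.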
Let's do a linear regression analysis using R. For this analysis, we use the marketing dataset from the datarium package. Load the data and look at its features and observations.
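The code behind the output that follows is not shown, so here is a minimal sketch that should give the same view. It assumes the datarium package is already installed.

```r
# Assumption: datarium is installed, e.g. via install.packages("datarium").
data("marketing", package = "datarium")

head(marketing)  # first six observations
dim(marketing)   # number of rows (observations) and columns (features)
```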
## youtube facebook newspaper sales
## 1 276.12 45.36 83.04 26.52
## 2 53.40 47.16 54.12 12.48
## 3 20.64 55.08 83.16 11.16
## 4 181.80 49.56 70.20 22.20
## 5 216.96 12.96 70.08 15.48
## 6 10.44 58.68 90.00 8.64
## [1] 200 4
The dataset consists of 200 observations and 4 features. It records the impact of three advertising media (youtube, facebook and newspaper) on sales. The first three columns are the advertising budgets in thousands of dollars, and the fourth column is sales.
We will perform 4 operations:
Load necessary libraries.
| youtube | facebook | newspaper | sales |
|---|---|---|---|
| Min. : 0.84 | Min. : 0.00 | Min. : 0.36 | Min. : 1.92 |
| 1st Qu.: 89.25 | 1st Qu.:11.97 | 1st Qu.: 15.30 | 1st Qu.:12.45 |
| Median :179.70 | Median :27.48 | Median : 30.90 | Median :15.48 |
| Mean :176.45 | Mean :27.92 | Mean : 36.66 | Mean :16.83 |
| 3rd Qu.:262.59 | 3rd Qu.:43.83 | 3rd Qu.: 54.12 | 3rd Qu.:20.88 |
| Max. :355.68 | Max. :59.52 | Max. :136.80 | Max. :32.40 |
Draw a correlation plot to see the relationships between the variables.
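The plotting code is not included in the article, and the package it used is unknown. As a stand-in, the sketch below uses only base R: `cor()` for the correlation matrix and `pairs()` for pairwise scatter plots. It assumes datarium is installed.

```r
data("marketing", package = "datarium")

# Pearson correlation between every pair of columns.
corr <- cor(marketing)
print(round(corr, 2))

# Pairwise scatter plots as a visual companion to the matrix.
pairs(marketing, col = "darkblue",
      main = "Marketing data: pairwise scatter plots")
```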
The plot above shows that youtube and facebook are both positively correlated with sales. Here we will analyze the linear relationship between youtube and sales.
We will split our data into two sets: training and test. The training set will hold 70% of the data and the test set the remaining 30%. Set a seed so that the split is reproducible.
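library(dplyr)    # select() and %>%
library(caTools)  # sample.split()

set.seed(101)
# Keep only the predictor (youtube) and the response (sales).
marketing_new <- marketing %>% dplyr::select(youtube, sales)
# Logical vector: TRUE marks rows assigned to the training set.
in_train <- sample.split(marketing_new$youtube, SplitRatio = 0.7)
train <- subset(marketing_new, in_train == TRUE)
test <- subset(marketing_new, in_train == FALSE)
print(dim(train))
## [1] 140 2
print(dim(test))
## [1] 60 2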
We will build a linear regression model to make the prediction.
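The model-fitting call is not shown, but the summary output reports `lm(formula = sales ~ youtube, data = train)` and later code refers to `lm_model`. The self-contained sketch below repeats the split so it can run on its own. The name `in_train` is an illustrative choice, and datarium and caTools are assumed to be installed.

```r
library(caTools)  # sample.split()

data("marketing", package = "datarium")
marketing_new <- marketing[, c("youtube", "sales")]

# Same 70/30 split as in the article.
set.seed(101)
in_train <- sample.split(marketing_new$youtube, SplitRatio = 0.7)
train <- subset(marketing_new, in_train)

# Simple linear regression of sales on the youtube budget.
lm_model <- lm(sales ~ youtube, data = train)
summary(lm_model)
```

The exact coefficients depend on the random split, so they may differ slightly across R versions even with the same seed.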
##
## Call:
## lm(formula = sales ~ youtube, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0194 -2.0196 -0.1421 2.2338 8.1146
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.110811 0.567391 14.29 <2e-16 ***
## youtube 0.050744 0.002877 17.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.635 on 138 degrees of freedom
## Multiple R-squared: 0.6927, Adjusted R-squared: 0.6905
## F-statistic: 311.1 on 1 and 138 DF, p-value: < 2.2e-16
The model shows low p-values for both coefficients, and \(R^2\) is 0.6927, which means about 69% of the variability in sales is explained by the model. The standard errors are also low. Let's do a residual analysis to check the linearity assumption.
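To see how the two R-squared values in the summary relate, the sketch below recomputes the adjusted value from the multiple R-squared using the standard adjustment formula. The numbers are copied from the summary output; the variable names are illustrative.

```r
r2 <- 0.6927  # Multiple R-squared from the summary output
n  <- 140     # observations in the training set
p  <- 1       # number of predictors (youtube)

# Adjusted R-squared penalises the fit for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.6905, matching the reported Adjusted R-squared
```

With a single predictor and 140 observations the penalty is tiny, which is why the two values are almost identical.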
The histogram of the residuals is roughly symmetric and centered near zero. The residuals-vs-fitted plot shows the residuals scattered randomly, with no visible pattern. The Q-Q plot shows most points lying on the line, with minimal deviation. Together, these suggest that the assumptions of the linear model are reasonably met.
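The three diagnostic plots described here could be produced along the following lines. `plot_residual_diagnostics()` is a hypothetical helper written for this sketch, and it is demonstrated on R's built-in `cars` data so that it runs on its own; in this analysis you would pass `lm_model` instead.

```r
# Hypothetical helper: histogram, residuals-vs-fitted and Q-Q plot for an lm fit.
plot_residual_diagnostics <- function(model) {
  res <- residuals(model)
  old_par <- par(mfrow = c(1, 3))  # three panels side by side
  on.exit(par(old_par))            # restore the layout afterwards

  hist(res, main = "Histogram of residuals", xlab = "Residual")

  plot(fitted(model), res, main = "Residuals vs fitted",
       xlab = "Fitted value", ylab = "Residual")
  abline(h = 0, lty = 2)           # reference line at zero

  qqnorm(res)
  qqline(res)

  invisible(res)
}

# Stand-in model on a built-in dataset, for demonstration only.
demo_model <- lm(dist ~ speed, data = cars)
res <- plot_residual_diagnostics(demo_model)
```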
So, we can write the equation: \[\text{sales} = 8.11 + 0.051 \times \text{youtube}\]
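As a small worked example, the fitted equation can be evaluated by hand for a few budgets. The coefficients below are copied from the summary output; `predict_sales()` is an illustrative helper, not part of the original analysis.

```r
intercept <- 8.110811  # (Intercept) estimate from the summary
slope     <- 0.050744  # youtube estimate from the summary

# Predicted sales for a given youtube budget (in thousands of dollars).
predict_sales <- function(youtube) intercept + slope * youtube

predict_sales(c(0, 100, 300))
```

Each additional thousand dollars of youtube budget raises predicted sales by about 0.051 units, and a budget of zero gives the intercept.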
The plot below shows the scatter plot of sales against youtube, along with the fitted regression line.
plot(x = marketing$youtube, y = marketing$sales, col = "darkblue", main = "Regression Plot",
     xlab = "Youtube", ylab = "Sales")
abline(lm_model, col = "darkred")
We will apply the model to our test data to predict sales. Since the test set also contains the actual sales values, we can compare the two and assess the accuracy of the model.
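The prediction step itself is not shown, although later code uses a vector named `pred`. A self-contained sketch follows, assuming datarium and caTools are installed. The error metrics (RMSE and MAE) are an addition for illustration; the article does not say which accuracy measure it used.

```r
library(caTools)  # sample.split()

data("marketing", package = "datarium")
marketing_new <- marketing[, c("youtube", "sales")]

# Same 70/30 split and model as in the article.
set.seed(101)
in_train <- sample.split(marketing_new$youtube, SplitRatio = 0.7)
train <- subset(marketing_new, in_train)
test  <- subset(marketing_new, !in_train)
lm_model <- lm(sales ~ youtube, data = train)

# Predicted sales for the held-out observations.
pred <- predict(lm_model, newdata = test)

# Two common error measures, in the same units as sales.
rmse <- sqrt(mean((test$sales - pred)^2))  # root mean squared error
mae  <- mean(abs(test$sales - pred))       # mean absolute error
print(c(RMSE = rmse, MAE = mae))
```

Lower values mean predictions closer to the actual sales. RMSE weights large misses more heavily than MAE.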
| Row | Predicted sales |
|---|---|
| 11 | 12.135788 |
| 14 | 14.047804 |
| 17 | 12.239305 |
| 22 | 22.566629 |
| 25 | 11.904397 |
| 26 | 24.119381 |
| 36 | 25.812185 |
| 38 | 12.659461 |
| 40 | 21.994242 |
| 43 | 25.988773 |
| 48 | 22.718860 |
| 49 | 21.945528 |
| 54 | 19.229734 |
| 56 | 20.222278 |
| 61 | 11.368546 |
| 62 | 24.021953 |
| 64 | 14.364444 |
| 65 | 16.093784 |
| 69 | 22.566629 |
| 70 | 21.312249 |
| 72 | 14.796779 |
| 74 | 15.990267 |
| 75 | 21.105215 |
| 80 | 15.174311 |
| 82 | 22.712770 |
| 83 | 12.695996 |
| 84 | 12.275840 |
| 86 | 19.875192 |
| 92 | 9.852329 |
| 96 | 18.054514 |
| 104 | 19.552463 |
| 109 | 8.908499 |
| 110 | 23.662689 |
| 114 | 20.873825 |
| 115 | 12.872584 |
| 116 | 12.683818 |
| 117 | 16.587011 |
| 126 | 13.420614 |
| 138 | 24.777017 |
| 141 | 12.580301 |
| 143 | 21.537550 |
| 145 | 13.968644 |
| 154 | 18.541652 |
| 158 | 17.232468 |
| 160 | 16.130319 |
| 161 | 18.614722 |
| 163 | 19.582909 |
| 166 | 22.390041 |
| 168 | 20.703326 |
| 172 | 18.127584 |
| 174 | 18.365064 |
| 175 | 21.653245 |
| 176 | 24.971872 |
| 177 | 23.236444 |
| 178 | 18.474670 |
| 179 | 24.959694 |
| 181 | 17.646536 |
| 182 | 21.415765 |
| 195 | 17.226379 |
| 198 | 18.888738 |
We will plot a line graph to compare the actual and predicted values.
library(ggplot2)

# One row per test observation: index, predicted sales and actual sales.
df <- data.frame(x = seq_len(nrow(test)), pred = pred, actual = test$sales)
g <- ggplot(df, aes(x = x)) +
  geom_line(aes(y = pred, colour = "Predicted")) +
  geom_point(aes(y = pred, colour = "Predicted")) +
  geom_line(aes(y = actual, colour = "Actual")) +
  geom_point(aes(y = actual, colour = "Actual")) +
  scale_colour_manual("", values = c(Predicted = "darkred", Actual = "darkgreen"))
g
So, from the above explanation, we learn: