Linear Regression Analysis

Subhalaxmi Rout

12/20/2020


What is Linear Regression?

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is treated as the explanatory (independent) variable X, and the other as the dependent variable Y. In simple linear regression, the relationship between the two variables can be described with a straight line.

What is linear regression used for?

Linear regression is a basic and commonly used type of predictive analysis. It is a way to model a relationship between two sets of variables. The result is a linear regression equation that can be used to make predictions about data.

How do you perform a linear regression analysis?

Linear regression equation: \(Y = a + bX\)

Where:

  • Y = dependent variable
  • X = independent variable
  • a = y-intercept
  • b = slope of the line

Formulas for a and b:

\[a = \frac {(\sum Y)(\sum X^2) - (\sum X)(\sum XY)} {n(\sum X^2)-(\sum X)^2}\] \[b = \frac {n(\sum XY)-(\sum X) (\sum Y)} {n(\sum X^2)-(\sum X)^2}\]

Here, n = number of observations.
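
The two formulas can be checked numerically. The sketch below uses a small made-up dataset (the vectors `x` and `y` are hypothetical and not part of the marketing data), computes a and b from the sums, and compares them with the coefficients returned by `lm()`.

```r
# Hypothetical toy data, chosen only to illustrate the formulas
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Shared denominator: n * sum(X^2) - (sum(X))^2
denom <- n * sum(x^2) - sum(x)^2

# Intercept a and slope b from the closed-form formulas
a <- (sum(y) * sum(x^2) - sum(x) * sum(x * y)) / denom
b <- (n * sum(x * y) - sum(x) * sum(y)) / denom

# Fit the same line with lm() for comparison
fit <- lm(y ~ x)
print(c(a = a, b = b))
print(coef(fit))
```

For this toy data the formulas give a = 0.05 and b = 1.99, which should agree with the `lm()` coefficients up to floating-point precision.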

Linear regression analysis:

Let's do a linear regression analysis using R. For this analysis, we use the marketing dataset from the datarium package. Load the data and look at its features and observations.

data("marketing", package = "datarium")
head(marketing)
##   youtube facebook newspaper sales
## 1  276.12    45.36     83.04 26.52
## 2   53.40    47.16     54.12 12.48
## 3   20.64    55.08     83.16 11.16
## 4  181.80    49.56     70.20 22.20
## 5  216.96    12.96     70.08 15.48
## 6   10.44    58.68     90.00  8.64
dim(marketing)
## [1] 200   4

The dataset consists of 200 observations and 4 features. It records the impact of three advertising media (youtube, facebook and newspaper) on sales. The first three columns are the advertising budgets in thousands of dollars, and the fourth column is sales.

We will perform 4 operations:

  • Exploratory data analysis
  • Data Preparation
  • Model building
  • Model accuracy analysis

Load necessary libraries.

library(ggplot2)
library(kableExtra)
library(dplyr)
library(caTools)
kable(summary(marketing))
    youtube          facebook        newspaper           sales
Min.   :  0.84   Min.   : 0.00   Min.   :  0.36   Min.   : 1.92
1st Qu.: 89.25   1st Qu.:11.97   1st Qu.: 15.30   1st Qu.:12.45
Median :179.70   Median :27.48   Median : 30.90   Median :15.48
Mean   :176.45   Mean   :27.92   Mean   : 36.66   Mean   :16.83
3rd Qu.:262.59   3rd Qu.:43.83   3rd Qu.: 54.12   3rd Qu.:20.88
Max.   :355.68   Max.   :59.52   Max.   :136.80   Max.   :32.40

Exploratory data analysis

Draw a scatterplot matrix to see the relationships between the variables.

plot(marketing, col="blue")

The plot above shows that youtube and facebook budgets both have a positive relationship with sales. Here we will analyze the linear relationship between youtube and sales.

Data Preparation

We will split our data into two sets: training and test. The training set will have 70% of the data and the test set the remaining 30%. We set a seed so that the split is reproducible.

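set.seed(101)

# Keep only the predictor (youtube) and the response (sales)
marketing_new <- marketing %>% dplyr::select(c(youtube, sales))

# Logical vector: TRUE for rows assigned to the training set (70%)
is_train <- sample.split(marketing_new$youtube, SplitRatio = 0.7)

train <- subset(marketing_new, is_train == TRUE)
test <- subset(marketing_new, is_train == FALSE)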

print(dim(train))
## [1] 140   2
print(dim(test))
## [1] 60  2
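
For readers without caTools, a 70/30 split can also be sketched in base R. This is an alternative illustration, not the method used in this analysis: it uses the built-in `cars` dataset as a stand-in, and the names `train_idx`, `train_demo` and `test_demo` are hypothetical. Because it draws rows differently from `sample.split()`, it would not reproduce the exact rows selected above.

```r
set.seed(101)

# Built-in cars dataset (50 rows) used only as a stand-in
n_rows <- nrow(cars)

# Randomly pick 70% of the row indices for training
train_idx <- sample(seq_len(n_rows), size = round(0.7 * n_rows))

train_demo <- cars[train_idx, ]   # rows selected for training
test_demo <- cars[-train_idx, ]   # all remaining rows

print(dim(train_demo))
print(dim(test_demo))
```

With 50 rows this yields 35 training rows and 15 test rows, and no row appears in both sets.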

Model building

We will build a linear regression model to make the prediction.

lm_model <- lm(sales ~ youtube, data = train) 
summary(lm_model)
## 
## Call:
## lm(formula = sales ~ youtube, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0194 -2.0196 -0.1421  2.2338  8.1146 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.110811   0.567391   14.29   <2e-16 ***
## youtube     0.050744   0.002877   17.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.635 on 138 degrees of freedom
## Multiple R-squared:  0.6927, Adjusted R-squared:  0.6905 
## F-statistic: 311.1 on 1 and 138 DF,  p-value: < 2.2e-16

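The model shows low p-values for both coefficients, and the coefficient standard errors are small relative to the estimates. \(R^2\) is 0.6927, meaning about 69% of the variability in sales is explained by the model. Let's do a residual analysis to check the linearity assumption.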
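
The \(R^2\) and residual standard error reported by `summary()` can be reproduced from the residuals. The sketch below is illustrative only: it fits a model on the built-in `cars` dataset as a stand-in (not the marketing data), and the names `fit_demo`, `ss_res` and `ss_tot` are hypothetical.

```r
# Stand-in model on a built-in dataset
fit_demo <- lm(dist ~ speed, data = cars)

# Residual sum of squares and total sum of squares
ss_res <- sum(resid(fit_demo)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: share of the variability in the response explained by the model
r2_manual <- 1 - ss_res / ss_tot

# Residual standard error: sqrt(SS_res / residual degrees of freedom)
rse_manual <- sqrt(ss_res / df.residual(fit_demo))

print(c(manual = r2_manual, summary = summary(fit_demo)$r.squared))
print(c(manual = rse_manual, summary = summary(fit_demo)$sigma))
```

Applying the same steps to `lm_model` should give the values printed in its summary.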

plot(lm_model)

hist(resid(lm_model), col = "steelblue", main = "Histogram of Residuals", xlab = "Residual")

The histogram of residuals is roughly symmetric and centred near zero. The residuals vs fitted plot shows the residuals scattered randomly, with no clear pattern. The Q-Q plot shows most points lying on the line with minimal deviation. Together, these suggest the linearity assumption is met.

So, we can write the fitted equation: \[\text{sales} = 8.11 + 0.05 \times \text{youtube}\]
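
As a small worked example, a prediction can be computed by hand from the coefficients in the model summary. The youtube budget of 100 used here is a hypothetical input chosen for illustration.

```r
# Coefficients copied from the summary(lm_model) output
intercept <- 8.110811
slope <- 0.050744

# Hypothetical youtube advertising budget (thousands of dollars)
youtube_budget <- 100

# sales = intercept + slope * youtube
predicted_sales <- intercept + slope * youtube_budget
print(predicted_sales)  # about 13.19
```

With the fitted model available, `predict(lm_model, data.frame(youtube = 100))` should give the same value up to the rounding of the printed coefficients.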

The plot below shows the scatter plot of sales against youtube, along with the regression line.

plot(x = marketing$youtube, y = marketing$sales, col = "darkblue", main = "Regression Plot",
     xlab = "Youtube", ylab = "Sales")
abline(lm_model, col = "darkred")

Model accuracy:

We will apply the model to our test data to predict sales. Since we have the actual sales values, we can compare them with the predictions to assess the accuracy of the model.

pred <- predict(lm_model, test)
kable(pred, col.names = "sales")
sales
11 12.135788
14 14.047804
17 12.239305
22 22.566629
25 11.904397
26 24.119381
36 25.812185
38 12.659461
40 21.994242
43 25.988773
48 22.718860
49 21.945528
54 19.229734
56 20.222278
61 11.368546
62 24.021953
64 14.364444
65 16.093784
69 22.566629
70 21.312249
72 14.796779
74 15.990267
75 21.105215
80 15.174311
82 22.712770
83 12.695996
84 12.275840
86 19.875192
92 9.852329
96 18.054514
104 19.552463
109 8.908499
110 23.662689
114 20.873825
115 12.872584
116 12.683818
117 16.587011
126 13.420614
138 24.777017
141 12.580301
143 21.537550
145 13.968644
154 18.541652
158 17.232468
160 16.130319
161 18.614722
163 19.582909
166 22.390041
168 20.703326
172 18.127584
174 18.365064
175 21.653245
176 24.971872
177 23.236444
178 18.474670
179 24.959694
181 17.646536
182 21.415765
195 17.226379
198 18.888738

We will plot a line graph to compare the actual and predicted values.

# Combine predictions and actual sales, indexed by position in the test set
df <- data.frame(x = seq_len(nrow(test)), pred = pred, actual = test$sales)

# Draw both series as lines with points, one colour per series
g <- ggplot(df, aes(x = x)) +
  geom_line(aes(y = pred, colour = "Predicted")) +
  geom_point(aes(y = pred, colour = "Predicted")) +
  geom_line(aes(y = actual, colour = "Actual")) +
  geom_point(aes(y = actual, colour = "Actual")) +
  scale_colour_manual("", values = c(Predicted = "darkred", Actual = "darkgreen"))
g
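
The comparison between actual and predicted values can also be summarised with a single number. The sketch below defines two common error metrics, root mean squared error and mean absolute error; the helper names `rmse` and `mae` and the small demo vectors are hypothetical and are not taken from the analysis above.

```r
# Root mean squared error: penalises large errors more heavily
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Mean absolute error: average size of the errors
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Hypothetical values, used only to show the calls
actual_demo <- c(10, 12, 15, 20)
pred_demo <- c(11, 11, 16, 18)

print(rmse(actual_demo, pred_demo))  # sqrt(1.75), about 1.32
print(mae(actual_demo, pred_demo))   # 1.25

# With the objects from this analysis, the calls would be:
# rmse(test$sales, pred)
# mae(test$sales, pred)
```

Both metrics are in the same units as sales, so lower values mean the predictions sit closer to the actual figures.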

So, from the explanation above we learned:

  • What linear regression is
  • How to calculate a predicted value by hand using the linear regression equation
  • How to run a linear regression analysis using R