Linear Regression Analysis

Subhalaxmi Rout

12/20/2020

What is Linear Regression?

Linear regression models the relationship between two variables by fitting a linear equation to observed data. One variable is treated as the explanatory (independent) variable (X), and the other as the dependent variable (Y). Simple linear regression relates one independent variable (X) to one dependent variable (Y), and it is appropriate when the relationship between the variables can be described by a straight line.

What is linear regression used for?

Linear regression is a basic and commonly used type of predictive analysis. It is a way to model a relationship between two sets of variables. The result is a linear regression equation that can be used to make predictions about data.

How do you perform a linear regression analysis?

Linear regression equation: \(Y = a + bX\)

Where:

Y = dependent variable
X = independent variable
a = y-intercept
b = slope of the line

Formulas for a and b:

\[a = \frac{(\sum Y)(\sum X^2) - (\sum X)(\sum XY)}{n(\sum X^2) - (\sum X)^2}\]

\[b = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}\]

Here, n = number of observations.
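The formulas above can be checked with a few lines of R. This is a minimal sketch on made-up values (the vectors `x` and `y` are illustrative and are not taken from the marketing dataset); it computes a and b from the sums and compares them with what `lm()` returns.

```r
# Illustrative data, invented for this sketch
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Denominator shared by both formulas: n * sum(X^2) - (sum(X))^2
denom <- n * sum(x^2) - sum(x)^2

# Intercept (a) and slope (b) from the closed-form formulas
a <- (sum(y) * sum(x^2) - sum(x) * sum(x * y)) / denom
b <- (n * sum(x * y) - sum(x) * sum(y)) / denom

# Fit the same model with lm() and compare the coefficients
fit <- lm(y ~ x)
print(c(intercept = a, slope = b))
print(all.equal(unname(coef(fit)), c(a, b)))
```

For these values the hand calculation gives a = 0.05 and b = 1.99, and `all.equal()` confirms that `lm()` agrees up to floating-point tolerance.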

Linear regression analysis:

Let's perform a linear regression analysis using R. For this analysis, we use the marketing dataset from the datarium package. Load the data and inspect the features and observations.

data("marketing", package = "datarium")
head(marketing)

##   youtube facebook newspaper sales
## 1  276.12    45.36     83.04 26.52
## 2   53.40    47.16     54.12 12.48
## 3   20.64    55.08     83.16 11.16
## 4  181.80    49.56     70.20 22.20
## 5  216.96    12.96     70.08 15.48
## 6   10.44    58.68     90.00  8.64

dim(marketing)

## [1] 200   4

The dataset consists of 200 observations and 4 features. It contains the impact of three advertising media (youtube, facebook and newspaper) on sales. The first three columns are the advertising budgets in thousands of dollars, and the fourth column is sales.

We will perform 4 operations:

Exploratory data analysis
Data Preparation
Model building
Model accuracy analysis

Load necessary libraries.

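library(ggplot2)     # plotting
library(kableExtra)  # formatted tables, used with kable()
library(dplyr)       # data manipulation (%>% and select())
library(caTools)     # sample.split() for the train/test split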

kable(summary(marketing))

youtube	facebook	newspaper	sales
Min. : 0.84	Min. : 0.00	Min. : 0.36	Min. : 1.92
1st Qu.: 89.25	1st Qu.:11.97	1st Qu.: 15.30	1st Qu.:12.45
Median :179.70	Median :27.48	Median : 30.90	Median :15.48
Mean :176.45	Mean :27.92	Mean : 36.66	Mean :16.83
3rd Qu.:262.59	3rd Qu.:43.83	3rd Qu.: 54.12	3rd Qu.:20.88
Max. :355.68	Max. :59.52	Max. :136.80	Max. :32.40

Exploratory data analysis

Draw a scatterplot matrix to see the relationships between the variables.

plot(marketing, col="blue")

The plot above shows that the youtube and facebook budgets are positively associated with sales. Here we analyze the linear relationship between youtube and sales.

Data Preparation

We split the data into two sets: training and test. The training set holds 70% of the data and the test set holds the remaining 30%. We set a seed so that the split is reproducible.

set.seed(101)

marketing_new <- marketing %>% dplyr::select(c(youtube,sales))

sample = sample.split(marketing_new$youtube, SplitRatio = 0.7)

train = subset(marketing_new, sample == TRUE)
test = subset(marketing_new, sample == FALSE)

print(dim(train))

## [1] 140   2

print(dim(test))

## [1] 60  2
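For comparison, the same kind of 70/30 split can be sketched in base R without caTools. This is only an illustration under stated assumptions: `demo_df` is a simulated stand-in for `marketing_new`, and because `sample()` draws rows differently from `sample.split()`, it will not reproduce the exact rows selected above.

```r
set.seed(101)

# Simulated stand-in for marketing_new (hypothetical values, not the real data)
demo_df <- data.frame(youtube = runif(200, min = 0, max = 350))
demo_df$sales <- 8 + 0.05 * demo_df$youtube + rnorm(200, sd = 3)

# Draw 70% of the row indices at random for the training set
n <- nrow(demo_df)
train_idx <- sample(seq_len(n), size = round(0.7 * n))

train_demo <- demo_df[train_idx, ]   # rows selected for training
test_demo  <- demo_df[-train_idx, ]  # all remaining rows

print(dim(train_demo))
print(dim(test_demo))
```

Indexing with `-train_idx` guarantees that the two sets do not overlap and that together they cover every row.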

Model building

We will build a linear regression model to make the prediction.

lm_model <- lm(sales ~ youtube, data = train) 
summary(lm_model)

## 
## Call:
## lm(formula = sales ~ youtube, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0194 -2.0196 -0.1421  2.2338  8.1146 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.110811   0.567391   14.29   <2e-16 ***
## youtube     0.050744   0.002877   17.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.635 on 138 degrees of freedom
## Multiple R-squared:  0.6927, Adjusted R-squared:  0.6905 
## F-statistic: 311.1 on 1 and 138 DF,  p-value: < 2.2e-16

The model shows low p-values for both coefficients. \(R^2\) is 0.6927, meaning about 69% of the variability in sales is explained by the model, with a residual standard error of 3.635. Let's do a residual analysis to check the model assumptions.
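To make the summary statistics less of a black box, here is a small sketch of how \(R^2\) and the residual standard error are obtained from the residuals. It uses R's built-in `cars` dataset purely as a stand-in so the snippet runs on its own; the same steps should apply to `lm_model`.

```r
# Stand-in model on the built-in cars data (not the marketing data)
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares and total sum of squares
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared: share of the variability in the response explained by the model
r2 <- 1 - ss_res / ss_tot

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse <- sqrt(ss_res / df.residual(fit))

# Both should match the values reported by summary()
print(all.equal(r2, summary(fit)$r.squared))
print(all.equal(rse, summary(fit)$sigma))
```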

plot(lm_model)

hist(resid(lm_model), col = "steelblue", main = "Histogram of Residuals", xlab = "Residual")

The histogram of residuals is roughly symmetric around zero. The residuals vs fitted plot shows the residuals scattered randomly, with no clear pattern. The QQ-plot shows most points lying on the line with minimal deviation. Together these are consistent with the assumptions of a linear model.

So, we can write the equation: \[\text{sales} = 8.11 + 0.051 \times \text{youtube}\]
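As a quick illustration, the fitted equation can be applied by hand. The sketch below hard-codes the coefficients from the model summary and uses a helper named `predict_sales` (a hypothetical name introduced here) on the youtube budget of the first row shown by `head(marketing)`.

```r
# Coefficients copied from the summary(lm_model) output
intercept <- 8.110811
slope     <- 0.050744

# Hypothetical helper: predicted sales for a given youtube budget
predict_sales <- function(youtube) {
  intercept + slope * youtube
}

# youtube budget of the first observation in head(marketing)
print(predict_sales(276.12))
```

This gives a prediction of about 22.12, against an observed value of 26.52 for that row, so the residual for that observation is roughly 4.4.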

The plot below shows the scatter plot of sales against youtube along with the regression line.

plot(x = marketing$youtube, y = marketing$sales, col = "darkblue", main = "Regression Plot",
     xlab = "Youtube", ylab = "Sales")
abline(lm_model, col = "darkred")

Model accuracy:

We apply the model to the test data to predict the sales values. Since we have the actual sales values, we can compare the two to assess the accuracy of the model.

pred <- predict(lm_model, test)
kable(pred, col.names = "sales")

	sales
11	12.135788
14	14.047804
17	12.239305
22	22.566629
25	11.904397
26	24.119381
36	25.812185
38	12.659461
40	21.994242
43	25.988773
48	22.718860
49	21.945528
54	19.229734
56	20.222278
61	11.368546
62	24.021953
64	14.364444
65	16.093784
69	22.566629
70	21.312249
72	14.796779
74	15.990267
75	21.105215
80	15.174311
82	22.712770
83	12.695996
84	12.275840
86	19.875192
92	9.852329
96	18.054514
104	19.552463
109	8.908499
110	23.662689
114	20.873825
115	12.872584
116	12.683818
117	16.587011
126	13.420614
138	24.777017
141	12.580301
143	21.537550
145	13.968644
154	18.541652
158	17.232468
160	16.130319
161	18.614722
163	19.582909
166	22.390041
168	20.703326
172	18.127584
174	18.365064
175	21.653245
176	24.971872
177	23.236444
178	18.474670
179	24.959694
181	17.646536
182	21.415765
195	17.226379
198	18.888738

We plot a line graph to compare the actual and predicted values.

# Index of each test observation, used as the x-axis
x <- seq_len(nrow(test))
# Keep predictions and actual values in one data frame so aes() can refer to columns
df <- data.frame(x = x, pred = pred, actual = test$sales)
g <- ggplot(df, aes(x = x))
g <- g + geom_line(aes(y = pred, colour = "Predicted"))
g <- g + geom_point(aes(y = pred, colour = "Predicted"))
g <- g + geom_line(aes(y = actual, colour = "Actual"))
g <- g + geom_point(aes(y = actual, colour = "Actual"))
g <- g + scale_colour_manual("", values = c(Predicted = "darkred", Actual = "darkgreen"))
g
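The comparison of actual and predicted values can also be summarised numerically. The following is a minimal sketch, assuming root mean squared error (RMSE) and mean absolute error (MAE) are acceptable accuracy measures here; the helper names `rmse` and `mae` and the example vectors are introduced only for illustration and are not results from the marketing data.

```r
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Mean absolute error between actual and predicted values
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Illustrative vectors, invented for this sketch
actual_demo    <- c(10, 12, 15, 20)
predicted_demo <- c(11, 12, 14, 22)

print(rmse(actual_demo, predicted_demo))  # sqrt((1 + 0 + 1 + 4) / 4)
print(mae(actual_demo, predicted_demo))   # (1 + 0 + 1 + 2) / 4
```

With the objects built in this analysis, the same helpers could be called as `rmse(test$sales, pred)` and `mae(test$sales, pred)`.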

So, from the above explanation we learned:

Linear Regression
Calculate prediction variable by hand using linear regression equation
Linear regression analysis using R