library(datarium)
View(marketing)
x<-marketing[,c("youtube","sales")]
View(x)
Before building the model, check whether youtube and sales are linearly related.
library(ggplot2)
ggplot(marketing,aes(youtube,sales))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess'
From the above graph we see that sales increases roughly linearly with youtube, so a linear model is reasonable.
model<-lm(sales~youtube,data=marketing)
model
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Coefficients:
## (Intercept) youtube
## 8.43911 0.04754
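The coefficients printed by lm() define the model equation,
i.e. y = mx + b
y = 0.04754*x + 8.43911
y is the dependent variable, i.e. sales
x is the independent variable, i.e. youtube, so each additional unit of youtube is associated with an increase of about 0.0475 in predicted sales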
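As a quick check of the equation above, a prediction can be computed by hand from the two coefficients. This is only a sketch: the coefficients are hard-coded as printed by lm(), and predict_sales is an illustrative helper name, not a function from any package.

```r
# Coefficients copied from the lm() output above (rounded as printed)
intercept <- 8.43911
slope <- 0.04754

# Hypothetical helper: predicted sales for a given youtube value
predict_sales <- function(youtube) {
  intercept + slope * youtube
}

predict_sales(c(69, 257.64))
```

Because the coefficients are rounded, the results can differ in the last decimals from what predict(model, ...) returns.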
summary(model)
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0632 -2.3454 -0.2295 2.4805 8.6548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.439112 0.549412 15.36 <2e-16 ***
## youtube 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.91 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
The p-values of both the intercept and the slope are less than 0.05, so we reject the null hypothesis that each coefficient equals zero: the intercept (8.439112) and the slope (0.047537) are both different from zero.
Therefore, the regression coefficients (i.e. slope and intercept) are significant.
The R-squared is 0.6119, which means that youtube explains about 61.19% of the variance in sales.
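The numbers in the summary are tied together by a few simple formulas, and recomputing them is a useful sanity check. The sketch below hard-codes the values printed above, so small rounding differences are expected; the variable names are illustrative only.

```r
# Values copied from the summary(model) output above
est_slope <- 0.047537
se_slope  <- 0.002691
r2        <- 0.6119
df_resid  <- 198
n         <- df_resid + 2   # two estimated coefficients: intercept and slope

# t value = estimate / standard error
t_slope <- est_slope / se_slope

# Two-sided p-value from the t distribution
p_slope <- 2 * pt(-abs(t_slope), df = df_resid)

# With a single predictor, the F-statistic equals the squared t value of the slope
f_stat <- t_slope^2

# Adjusted R-squared penalises R-squared for the number of predictors (here 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

round(c(t = t_slope, F = f_stat, adj_r2 = adj_r2), 4)
p_slope
```

The recomputed t value, F-statistic and adjusted R-squared should agree with the printed 17.67, 312.1 and 0.6099 up to rounding.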
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
regr.eval(marketing$sales,model$fitted.values)
## mae mse rmse mape
## 3.059767 15.138220 3.890787 0.205766
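In the above values the mean absolute percentage error (MAPE) is 0.2058, i.e. about 20.58%.
Taking 1 - MAPE as a rough accuracy gives about 79.42%, which suggests the model fits reasonably well. Note that these errors are computed on the same data that was used to fit the model.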
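The four error metrics can also be computed in base R, which makes their definitions explicit. The sketch below uses the usual textbook formulas, which are assumed to match what regr.eval reports; regression_metrics is a hypothetical helper and the input vectors are made up.

```r
# Hypothetical helper computing the four metrics reported above
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)
  c(
    mae  = mean(abs(err)),            # mean absolute error
    mse  = mse,                       # mean squared error
    rmse = sqrt(mse),                 # root mean squared error
    mape = mean(abs(err / actual))    # a fraction, not a percentage; breaks if actual contains 0
  )
}

# Small made-up example (not the marketing data)
actual    <- c(10, 20, 30)
predicted <- c(12, 18, 33)
m <- regression_metrics(actual, predicted)
m
```

For the model above, the equivalent call would be regression_metrics(marketing$sales, fitted(model)).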
plot(model)
In the residuals vs fitted plot, the red line is nearly horizontal and stays close to zero, and the residuals are scattered around it without any systematic pattern.
In the normal Q-Q plot, the residuals lie close to the straight reference line (but let's check normality further using formal tests).
In the scale-location plot, the residuals are spread out evenly (i.e. the points are not clustered at one spot).
shapiro.test(model$residuals)
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.99053, p-value = 0.2133
library(nortest)
ad.test(model$residuals)
##
## Anderson-Darling normality test
##
## data: model$residuals
## A = 0.49121, p-value = 0.217
library(moments)
skewness(model$residuals)
## [1] -0.08863202
kurtosis(model$residuals)
## [1] 2.779015
Here the p-values of both the Shapiro-Wilk test (0.2133) and the Anderson-Darling test (0.217) are greater than 0.05, so we fail to reject the null hypothesis that the residuals are normally distributed.
The skewness is close to zero (-0.0886) and the kurtosis is close to 3 (2.779), which is also consistent with normally distributed residuals.
Therefore, NORMALITY IS MET on residuals.
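To see why a kurtosis near 3 (rather than 0) points to normality, it helps to write the two moments out. The sketch below uses moment-based definitions, which are assumed to be the ones the moments package uses; the helper names are illustrative and the residuals are simulated, not taken from the model.

```r
# Hypothetical helpers: moment-based skewness and kurtosis
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

sample_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2   # not excess kurtosis: a normal distribution gives about 3
}

# Simulated stand-in for model$residuals
set.seed(42)
res <- rnorm(200, mean = 0, sd = 3.9)

sample_skewness(res)
sample_kurtosis(res)
shapiro.test(res)   # shapiro.test() comes with base R (stats package)
```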
Here we check whether the residuals are correlated (dependent) or uncorrelated (independent) using the Durbin-Watson test.
library(car)
## Loading required package: carData
durbinWatsonTest(model)
## lag Autocorrelation D-W Statistic p-value
## 1 0.02342385 1.934689 0.61
## Alternative hypothesis: rho != 0
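Here the p-value (0.61) is greater than 0.05, so we fail to reject the null hypothesis that there is no autocorrelation among the residuals (i.e. the residuals are independent). The D-W statistic (1.93) is also close to 2, the value expected when there is no autocorrelation.
Therefore, INDEPENDENCE IS MET on residuals.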
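The Durbin-Watson statistic itself is simple to compute: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below shows how it moves away from 2 when residuals are autocorrelated; dw_statistic is a hypothetical helper and the residual vectors are made up.

```r
# Hypothetical helper: Durbin-Watson statistic from a residual vector
dw_statistic <- function(e) {
  sum(diff(e)^2) / sum(e^2)
}

# Perfectly alternating residuals: negative autocorrelation, statistic well above 2
dw_alternating <- dw_statistic(c(1, -1, 1, -1))

# Steadily drifting residuals: positive autocorrelation, statistic well below 2
dw_drifting <- dw_statistic(c(-2, -1, 0, 1, 2))

c(alternating = dw_alternating, drifting = dw_drifting)
```

Calling dw_statistic(residuals(model)) should reproduce the D-W Statistic column above. The p-value from durbinWatsonTest is, by default, estimated by bootstrap simulation, so it can vary slightly between runs.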
Here we check for influential observations using Cook's distance. An observation whose Cook's distance is much larger than that of the others is considered influential. Such observations can pull the fitted line toward themselves and distort the model.
plot(model,4)
Here observations 179 and 36 have Cook's distances that stand out from the rest, so they are potentially influential observations.
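Reading influential points off the plot can be complemented by listing them numerically with cooks.distance() from base R. The sketch below uses simulated data with one deliberately injected outlier; the 4 / n cutoff is a common rule of thumb and an assumption here, not something the analysis above relies on.

```r
# Simulated stand-in for the marketing data, with one injected outlier
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)
y[30] <- y[30] + 15          # make the last observation unusual on purpose

fit <- lm(y ~ x)
d <- cooks.distance(fit)     # base R (stats), one value per observation

# Rule-of-thumb cutoff (an assumption): flag distances above 4 / n
cutoff <- 4 / length(d)
flagged <- which(d > cutoff)

flagged
head(sort(d, decreasing = TRUE), 3)   # the three largest distances
```

A common follow-up is to refit the model without the flagged rows and compare the coefficients.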
The residual assumptions are met, so the model is taken as ready to deploy.
Let's predict the sales for the following youtube values.
you_tube<-data.frame(youtube=c(69,10.32,10.44,257.64,71.52))
you_tube
## youtube
## 1 69.00
## 2 10.32
## 3 10.44
## 4 257.64
## 5 71.52
pred_sales<-predict(model,you_tube)
you_tube$sales<-pred_sales
you_tube
## youtube sales
## 1 69.00 11.719140
## 2 10.32 8.929690
## 3 10.44 8.935395
## 4 257.64 20.686452
## 5 71.52 11.838933
These are the sales predicted by the model for the given youtube values.
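A point prediction does not show how uncertain it is. predict() can also return prediction intervals through its interval argument. The sketch below runs on simulated data that only loosely mimics the marketing data (the intercept, slope and noise level are made-up stand-ins), so the numbers will not match the table above.

```r
# Simulated stand-in for the marketing data (values are made up)
set.seed(123)
sim <- data.frame(youtube = runif(200, min = 0, max = 350))
sim$sales <- 8.4 + 0.0475 * sim$youtube + rnorm(200, sd = 3.9)

sim_model <- lm(sales ~ youtube, data = sim)

new_data <- data.frame(youtube = c(69, 10.32, 10.44, 257.64, 71.52))

# interval = "prediction" adds lower (lwr) and upper (upr) bounds for a new observation
pred <- predict(sim_model, newdata = new_data, interval = "prediction", level = 0.95)
cbind(new_data, pred)
```

With the fitted model from this analysis, the same call would be predict(model, you_tube, interval = "prediction").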