Data preprocessing

library(datarium)
View(marketing)
x<-marketing[,c("facebook","sales")]
View(x)

Select the method

As the variables in the dataset are continuous, we can use a regression technique.

One variable (facebook) is independent and the other (sales) is dependent, so first check whether the relationship between them is linear.

# Check whether the data is linear or not
library(ggplot2)
ggplot(x,aes(facebook,sales))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess'

The scatter plot shows that the two variables are approximately linearly related, so we can use a simple linear regression model.
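The visual check can be backed by a number. The sketch below computes the Pearson correlation and its test with base R's `cor()` and `cor.test()`. The data frame `demo` is a simulated stand-in (an assumption, not the marketing data); with the real data you would pass `x$facebook` and `x$sales` instead.

```r
# Simulated stand-in for the facebook/sales columns (NOT the real marketing data).
set.seed(42)
demo <- data.frame(facebook = runif(200, min = 0, max = 60))
demo$sales <- 11 + 0.2 * demo$facebook + rnorm(200, sd = 5)

# Pearson correlation: strength of the linear association, between -1 and 1.
r <- cor(demo$facebook, demo$sales)
r

# cor.test() adds a confidence interval and a test of H0: correlation = 0.
ct <- cor.test(demo$facebook, demo$sales)
ct$conf.int
ct$p.value
```

For a simple linear regression with an intercept, the square of this correlation equals the R-squared of the fitted model.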

Train the model

model<-lm(sales ~ facebook,data = x)
model
## 
## Call:
## lm(formula = sales ~ facebook, data = x)
## 
## Coefficients:
## (Intercept)     facebook  
##     11.1740       0.2025
y<-model$fitted.values
errors<-model$residuals

Test the model

Checking the significance of the coefficients

summary(model)
## 
## Call:
## lm(formula = sales ~ facebook, data = x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.8766  -2.5589   0.9248   3.3330   9.8173 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.17397    0.67548  16.542   <2e-16 ***
## facebook     0.20250    0.02041   9.921   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.13 on 198 degrees of freedom
## Multiple R-squared:  0.332,  Adjusted R-squared:  0.3287 
## F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16

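Both coefficients have p-values below 2e-16, so they are statistically significant and facebook is a useful predictor of sales. Significance alone does not show that the model fits well, so the checks that follow are still needed.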
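Besides reading p-values off the printed summary, the coefficient table and confidence intervals can be extracted programmatically. This is a minimal sketch on a simulated stand-in data frame `demo` (an assumption, not the marketing data); `fit` plays the role of `model` above.

```r
# Simulated stand-in for the facebook/sales columns (NOT the real marketing data).
set.seed(42)
demo <- data.frame(facebook = runif(200, min = 0, max = 60))
demo$sales <- 11 + 0.2 * demo$facebook + rnorm(200, sd = 5)
fit <- lm(sales ~ facebook, data = demo)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- coef(summary(fit))
coef_table

# 95% confidence intervals; an interval that excludes 0 agrees with a
# significant t-test at the 5% level.
ci <- confint(fit, level = 0.95)
ci
```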

Checking R-squared

R-squared is the proportion of the variance of the dependent variable that is explained by the independent variable. The closer R-squared is to 1, the better the model fits. From the summary above, R-squared is 0.332, so facebook explains only about 33% of the variance in sales. That is closer to 0 than to 1, hence this model is not the best.
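To make the definition concrete, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on a simulated stand-in data frame `demo` (an assumption, not the marketing data) and compares the result with the value reported by `summary()`.

```r
# Simulated stand-in for the facebook/sales columns (NOT the real marketing data).
set.seed(42)
demo <- data.frame(facebook = runif(200, min = 0, max = 60))
demo$sales <- 11 + 0.2 * demo$facebook + rnorm(200, sd = 5)
fit <- lm(sales ~ facebook, data = demo)

ss_res <- sum(resid(fit)^2)                        # variation left unexplained
ss_tot <- sum((demo$sales - mean(demo$sales))^2)   # total variation in sales
r2_manual <- 1 - ss_res / ss_tot

r2_manual
summary(fit)$r.squared   # should agree with the manual value
```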

Checking the Accuracy of model

Accuracy describes how well the model predicts the outcome.

library(DMwR)
## Loading required package: lattice
## Loading required package: grid
regr.eval(x$sales,model$fitted.values)
##        mae        mse       rmse       mape 
##  3.9842626 26.0530528  5.1042191  0.3381669

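For a small dataset we can use MAPE (mean absolute percentage error) as the accuracy measure. Here MAPE is 0.338, meaning the predictions are off by 33.8% on average.

Hence, taking accuracy as 100% minus MAPE, the accuracy of this model is 66.2%.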
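If `DMwR` is not available in your R installation, the same four measures are easy to compute in base R. The helper `regression_metrics()` below is a hypothetical name, and the formulas are the usual textbook definitions, which I assume match what `regr.eval()` reports. The input vectors are made-up numbers for illustration.

```r
# Hypothetical helper: standard error measures for numeric predictions.
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(
    mae  = mean(abs(err)),          # mean absolute error
    mse  = mean(err^2),             # mean squared error
    rmse = sqrt(mean(err^2)),       # root mean squared error
    mape = mean(abs(err / actual))  # mean absolute percentage error (as a fraction)
  )
}

# Toy example with made-up values.
m <- regression_metrics(actual = c(10, 20, 30), predicted = c(12, 18, 33))
round(m, 4)
```

With the model above, the call would be `regression_metrics(x$sales, model$fitted.values)`. MAPE divides by the actual values, so it is undefined when any actual value is 0.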

Checking the assumptions on the residuals

1. Linearity

plot(model)

In the Residuals vs Fitted plot, if the red line runs close to the dotted horizontal line and the residuals are scattered around it without any systematic pattern, the linearity assumption is met.
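`plot(model)` cycles through several diagnostic plots; `plot(model, which = 1)` shows only Residuals vs Fitted. As a numeric complement to the visual check, one option (an assumption on my part, not something the analysis above does) is to add a quadratic term and test whether it improves the fit. The sketch uses a simulated stand-in data frame `demo`, not the marketing data.

```r
# Simulated stand-in for the facebook/sales columns (NOT the real marketing data).
set.seed(42)
demo <- data.frame(facebook = runif(200, min = 0, max = 60))
demo$sales <- 11 + 0.2 * demo$facebook + rnorm(200, sd = 5)

fit_linear <- lm(sales ~ facebook, data = demo)
fit_quad   <- lm(sales ~ facebook + I(facebook^2), data = demo)

# F-test comparing the nested models: a small p-value would suggest
# curvature that the straight-line model misses.
cmp <- anova(fit_linear, fit_quad)
cmp
p_curvature <- cmp[["Pr(>F)"]][2]
p_curvature
```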

2. Normality of residuals

Statistical tests such as the Shapiro-Wilk test and the Anderson-Darling test are used to check the normality of the residuals.

Null hypothesis: the data are normally distributed.
Alternative hypothesis: the data are not normally distributed.

shapiro.test(model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.96072, p-value = 2.367e-05

As the p-value is < 0.05, we reject the null hypothesis. Hence the residuals are not normally distributed.

The Anderson-Darling test gives the same result: its p-value is also below 0.05, so normality of the residuals is rejected.

library(nortest)
ad.test(model$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  model$residuals
## A = 2.439, p-value = 3.467e-06
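A normal Q-Q plot is a common visual companion to these tests, since it shows where the residuals depart from normality (for example in one tail). The sketch below uses base R's `qqnorm()` and `qqline()` on a simulated stand-in data frame `demo` (an assumption, not the marketing data); with the real model you would pass `model$residuals`.

```r
# Simulated stand-in for the facebook/sales columns (NOT the real marketing data).
set.seed(42)
demo <- data.frame(facebook = runif(200, min = 0, max = 60))
demo$sales <- 11 + 0.2 * demo$facebook + rnorm(200, sd = 5)
fit <- lm(sales ~ facebook, data = demo)
res <- resid(fit)

# Points close to the reference line suggest approximately normal residuals.
qq <- qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res, col = "red")

# Correlation between theoretical and sample quantiles: values near 1
# mean the points lie close to a straight line.
qq_cor <- cor(qq$x, qq$y)
qq_cor
```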

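3. Homoscedasticity of residuals

Check whether the errors have constant variance. In the Scale-Location plot, a roughly horizontal red line with evenly spread points suggests that the variance is constant.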

plot(model)
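The plot can be supplemented with a formal test. The sketch below hand-rolls the studentized (Koenker) form of the Breusch-Pagan test in base R: regress the squared residuals on the predictor and compare n times the R-squared of that auxiliary regression against a chi-squared distribution. This test is my addition, not part of the analysis above; packages such as `lmtest` (`bptest()`) and `car` (`ncvTest()`) offer ready-made versions. `demo` is a simulated stand-in, not the marketing data.

```r
# Simulated stand-in for the facebook/sales columns (NOT the real marketing data).
set.seed(42)
demo <- data.frame(facebook = runif(200, min = 0, max = 60))
demo$sales <- 11 + 0.2 * demo$facebook + rnorm(200, sd = 5)
fit <- lm(sales ~ facebook, data = demo)

# Auxiliary regression: squared residuals on the predictor.
demo$sq_res <- resid(fit)^2
aux <- lm(sq_res ~ facebook, data = demo)

# Test statistic n * R^2, chi-squared with df = number of predictors (1 here).
bp_stat <- nrow(demo) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# H0: constant error variance. A small p-value points to heteroscedasticity.
c(statistic = bp_stat, p.value = bp_p)
```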

4. Independence of errors

To check for correlation between the errors we use the Durbin-Watson test.

Null hypothesis: no correlation between errors.
Alternative hypothesis: correlation between errors.

library(car)
## Loading required package: carData
durbinWatsonTest(model)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.02274019      1.945713     0.7
##  Alternative hypothesis: rho != 0

As the p-value is > 0.05, we fail to reject the null hypothesis, so there is no evidence of correlation between the errors.
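The D-W statistic itself is simple to compute: it is the sum of squared differences between successive residuals divided by the sum of squared residuals, and it is roughly 2(1 - r), where r is the lag-1 autocorrelation. A value near 2 therefore means little autocorrelation. The sketch below shows this in base R on a simulated stand-in data frame `demo` (an assumption, not the marketing data); it does not compute the p-value that `durbinWatsonTest()` reports.

```r
# Simulated stand-in for the facebook/sales columns (NOT the real marketing data).
set.seed(42)
demo <- data.frame(facebook = runif(200, min = 0, max = 60))
demo$sales <- 11 + 0.2 * demo$facebook + rnorm(200, sd = 5)
fit <- lm(sales ~ facebook, data = demo)
e <- resid(fit)

# Durbin-Watson statistic: ranges from 0 to 4, about 2 when errors are uncorrelated.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals (in row order).
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

c(dw = dw, lag1_autocorrelation = r1, approx_dw = 2 * (1 - r1))
```

The statistic depends on the order of the rows, so it is most meaningful when the observations have a natural ordering such as time.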

Checking for outliers

boxplot(model$residuals)
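The boxplot shows outliers as individual points; to see which observations they are, `boxplot.stats()` returns the same flagged values (by default, those more than 1.5 times the interquartile range beyond the quartiles). A minimal sketch on a simulated stand-in data frame `demo` (an assumption, not the marketing data):

```r
# Simulated stand-in for the facebook/sales columns (NOT the real marketing data).
set.seed(42)
demo <- data.frame(facebook = runif(200, min = 0, max = 60))
demo$sales <- 11 + 0.2 * demo$facebook + rnorm(200, sd = 5)
fit <- lm(sales ~ facebook, data = demo)
res <- resid(fit)

# Residual values that the boxplot would draw as individual points.
out_vals <- boxplot.stats(res)$out
out_vals

# Rows of the data that produced those residuals (may be empty).
out_rows <- which(res %in% out_vals)
demo[out_rows, ]
```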

Prediction of marketing data

facebook=2000
new_data<- data.frame(facebook)
pred_sales<-predict(model,newdata = new_data)
pred_sales
##        1 
## 416.1655

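The predicted sales value is 416.17. Note that a facebook value of 2000 is well outside the range of facebook values in the dataset, so this prediction is an extrapolation and should be treated with caution.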
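A point prediction hides how uncertain it is. The sketch below asks `predict()` for a 95% prediction interval and flags inputs that fall outside the range of the training data. It runs on a simulated stand-in data frame `demo` (an assumption, not the marketing data), and the `extrapolated` column is a hypothetical name used for illustration.

```r
# Simulated stand-in for the facebook/sales columns (NOT the real marketing data).
set.seed(42)
demo <- data.frame(facebook = runif(200, min = 0, max = 60))
demo$sales <- 11 + 0.2 * demo$facebook + rnorm(200, sd = 5)
fit <- lm(sales ~ facebook, data = demo)

# One value inside the observed range and one far outside it.
new_data <- data.frame(facebook = c(30, 2000))
new_data$extrapolated <- new_data$facebook < min(demo$facebook) |
  new_data$facebook > max(demo$facebook)

# Columns: fit (point prediction), lwr and upr (prediction interval bounds).
pred <- predict(fit, newdata = new_data, interval = "prediction", level = 0.95)
cbind(new_data, pred)
```

The interval widens as the input moves away from the observed data, which is one reason extrapolated predictions deserve less trust.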

Conclusion

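The fitted model does not satisfy all the assumptions on the residuals (the residuals are not normally distributed) and it explains only about a third of the variance in sales, so it may not be the most suitable model for this data.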