library(datarium)
View(marketing)
x<-marketing[,c("newspaper","sales")]
View(x)
One of the variables is the independent variable and the other is the dependent variable. So first check whether the relationship between the two is linear.
library(ggplot2)
ggplot(x,aes(newspaper,sales))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess'
As you can see, the variables in the sample data are not clearly linearly related, so simple linear regression is not the preferred method here. We fit it anyway to examine its diagnostics.
model<-lm(sales ~ newspaper,data = x)
model
##
## Call:
## lm(formula = sales ~ newspaper, data = x)
##
## Coefficients:
## (Intercept) newspaper
## 14.82169 0.05469
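From these coefficients, the fitted regression line is sales ≈ 14.82 + 0.0547 × newspaper: each additional unit of newspaper budget is associated with roughly 0.055 extra units of sales.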
y<-model$fitted.values    # predicted sales for each observation
errors<-model$residuals   # observed sales minus fitted sales
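As a sanity check, the observed values should equal the fitted values plus the residuals for every row; a minimal sketch:

all.equal(x$sales, unname(y + errors))   # should return TRUE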
summary(model)
##
## Call:
## lm(formula = sales ~ newspaper, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.473 -4.065 -1.007 4.207 15.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.82169 0.74570 19.88 < 2e-16 ***
## newspaper 0.05469 0.01658 3.30 0.00115 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.111 on 198 degrees of freedom
## Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
## F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148
Both coefficients of this model are statistically significant, so you can use the model; significance alone, however, does not guarantee a good fit.
R-squared tells us how much of the variance of the dependent variable is explained by the independent variable. An R-squared close to 1 indicates a better model. From the summary above, R-squared = 0.05, which is close to 0, so this model is not well suited: newspaper cannot explain sales properly.
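To make the definition concrete, R-squared can be computed directly from the residuals; a minimal sketch (ss_res and ss_tot are illustrative names, not from the original code):

ss_res <- sum(errors^2)                      # residual sum of squares
ss_tot <- sum((x$sales - mean(x$sales))^2)   # total sum of squares
1 - ss_res/ss_tot                            # matches Multiple R-squared, about 0.052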
Accuracy measures how well the model predicts the outcome.
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
regr.eval(x$sales,model$fitted.values)
## mae mse rmse mape
## 4.9758717 36.9705927 6.0803448 0.3860048
For smaller datasets we can use MAPE (mean absolute percentage error) as the accuracy measure. Here it is 0.386, i.e. the predictions are off by about 38.6% on average.
Hence, the accuracy of this model is roughly 61.4%.
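The same MAPE can be reproduced without DMwR; a minimal sketch:

mean(abs(errors)/x$sales)   # mean absolute percentage error, about 0.386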
plot(model)
In the Residuals vs Fitted plot, the red line is not drawn along the dotted line and the fitted values are not evenly scattered around it, so the linearity assumption is not met for the residuals.
Statistical tests such as the Shapiro-Wilk test and the Anderson-Darling test are used to check the normality of residuals.
Null hypothesis: the data is normally distributed. Alternative hypothesis: the data is not normally distributed.
shapiro.test(model$residuals)
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.98197, p-value = 0.0114
As the p-value is < 0.05, we reject the null hypothesis. Hence the residuals are not normally distributed.
The Anderson-Darling test gives the same conclusion:
library(nortest)
ad.test(model$residuals)
##
## Anderson-Darling normality test
##
## data: model$residuals
## A = 1.1601, p-value = 0.004848
Checking for a constant error variance (homoscedasticity):
plot(model)
As you can see in the Scale-Location plot, the residuals show high, non-constant variance.
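For a formal check of constant error variance, one option (not used in the original) is the Breusch-Pagan test from the lmtest package, whose null hypothesis is that the error variance is constant:

library(lmtest)
bptest(model)   # a p-value below 0.05 would indicate heteroscedasticity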
To check for correlation between the errors, we use the Durbin-Watson test.
Null hypothesis: no correlation between errors. Alternative hypothesis: correlation between errors.
library(car)
## Loading required package: carData
durbinWatsonTest(model)
## lag Autocorrelation D-W Statistic p-value
## 1 0.004787825 1.983434 0.932
## Alternative hypothesis: rho != 0
As the p-value is > 0.05, we fail to reject the null hypothesis, so there is no evidence of correlation between the errors.
boxplot(model$residuals)   # inspect the residuals for outliers
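To see which observations the boxplot flags, one can inspect the standardized residuals; a minimal sketch (the cutoff of 2 is a common rule of thumb, not taken from the original):

which(abs(rstandard(model)) > 2)   # rows with unusually large residuals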
newspaper=2000                                  # hypothetical new newspaper budget
new_data<- data.frame(newspaper)
pred_sales<-predict(model,newdata = new_data)   # predicted sales for the new budget
pred_sales
## 1
## 124.2079
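The prediction is simply the fitted line evaluated at the new value; note that a newspaper budget of 2000 is well outside the range observed in the marketing data, so this is an extrapolation. A minimal sketch:

coef(model)[1] + coef(model)[2]*2000   # about 124.2, matching predict()
range(x$newspaper)                     # confirms 2000 is outside the observed range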