DEVELOPING SEVERAL MODELS USING IN-BUILT DATASETS IN R

First of all, list the data sets that ship with R by running the code below:

data()

Running the code above opens a listing of the available data sets, each with a short description.

These data sets are bundled in the datasets package, which you can load explicitly by running:

library(datasets)

One of the data sets in the listing is called EuStockMarkets. Open it in the data viewer by running the code:

View(EuStockMarkets)

Now, we want to extract each of the four variables (columns) inside EuStockMarkets into its own object:

DAX = EuStockMarkets[ , 1]

SMI = EuStockMarkets[ , 2]

CAC = EuStockMarkets[ , 3]

FTSE = EuStockMarkets[ , 4]
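
Before relying on column positions, it may help to confirm the column order by name. The lines below are a minimal sketch; `stocks` and `DAX_by_name` are hypothetical names introduced here only for illustration.

```r
# EuStockMarkets ships with base R (package 'datasets')
colnames(EuStockMarkets)   # column names, in positional order
dim(EuStockMarkets)        # number of rows and columns

# 'stocks' is a hypothetical data-frame copy, convenient for lm(..., data = )
stocks <- as.data.frame(EuStockMarkets)

# Selecting by name avoids depending on the column position
DAX_by_name <- stocks$DAX
head(stocks)
```

Selecting by name gives the same values as EuStockMarkets[ , 1] while being easier to read.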

SIMPLE LINEAR REGRESSION MODEL

#Here, we regress DAX on SMI (DAX is the response, SMI is the predictor). Each command goes on its own line:

reg = lm(DAX ~ SMI, data=EuStockMarkets)

summary(reg)

The summary(reg) above prints the coefficient estimates with their standard errors, t-values and p-values, together with the R-squared and the overall F-statistic.

Now, to assess the validity of the estimated results, run the ANOVA test:

anova(reg)
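
If you want to use individual numbers from the output instead of reading them off the console, they can be pulled out of the fitted objects. This is only a sketch; the names `stocks`, `r2` and `f_p` are assumptions made for this example.

```r
# Data-frame copy of the built-in series (hypothetical name 'stocks')
stocks <- as.data.frame(EuStockMarkets)

# Simple linear regression of DAX on SMI
reg <- lm(DAX ~ SMI, data = stocks)

coef(reg)                            # intercept and slope
r2  <- summary(reg)$r.squared        # proportion of variance explained
f_p <- anova(reg)["SMI", "Pr(>F)"]   # p-value of the F-test for SMI

c(r_squared = r2, anova_p_value = f_p)
```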

MULTIPLE LINEAR REGRESSION MODEL

reg_2 = lm(DAX ~ SMI + CAC + FTSE, data=EuStockMarkets)

summary(reg_2)

##Now, in order to assess the validity of our parameter estimates, we need to perform the ANOVA test on this multiple linear regression model too. See the command below:

anova(reg_2)
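
Note that anova() on a single model reports sequential sums of squares, so each term is tested after the terms listed before it. A related check, sketched below under the assumption that both models are fitted to the same data, is to pass the two models together, which gives a partial F-test of whether CAC and FTSE jointly improve on the SMI-only model. The names `stocks` and `cmp` are hypothetical.

```r
# Data-frame copy of the built-in series (hypothetical name 'stocks')
stocks <- as.data.frame(EuStockMarkets)

reg   <- lm(DAX ~ SMI, data = stocks)               # smaller model
reg_2 <- lm(DAX ~ SMI + CAC + FTSE, data = stocks)  # larger model

# Partial F-test: do CAC and FTSE add explanatory power beyond SMI?
cmp <- anova(reg, reg_2)
print(cmp)
```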

BASIC ASSUMPTIONS FOR VALIDITY OF ANALYSIS OF VARIANCE (ANOVA) TO HOLD

NORMALITY ASSUMPTION

#There are several ways to check this assumption, but we will only deal with 2 plots and 2 tests:

Normal Q-Q Plot (It’s used to Confirm Normality Assumption)

#First of all, we will find the residuals of the model using the command:

res=resid(reg_2)

If you want to know these residual values, run the command below:

print(res)

Or, run the command below:

View(res)

#After that, we can plot a normal Quantile-Quantile (Q-Q) Plot as follows:

qqnorm(res, col=3, lwd=1, pch=19, col.main="blue", col.lab="purple")

#After the qqnorm command, we then add a reference line as follows:

qqline(res, col=2, lwd=2)

#After the line is added, we show the legend as follows:

legend(x="topleft", legend=c("Residual Quantiles","Q-Q Reference Line"), col=c(3,2), pch=c(19,NA), lty=c(NA,1), lwd=2, bg="brown")

#In the resulting plot, points lying close to the reference line indicate that the residuals follow a normal distribution; systematic departures from the line indicate non-normality.
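
A Q-Q plot is judged by eye, so a rough numeric companion can be handy. The sketch below reuses the values that qqnorm() returns and computes the correlation between theoretical and sample quantiles; values close to 1 go with points hugging the line. This is an informal summary, not a formal test, and the names `stocks`, `qq` and `qq_cor` are assumptions for this example.

```r
# Refit the multiple regression so the block runs on its own
stocks <- as.data.frame(EuStockMarkets)
reg_2  <- lm(DAX ~ SMI + CAC + FTSE, data = stocks)
res    <- resid(reg_2)

# qqnorm() draws the plot and also returns the plotted coordinates
qq <- qqnorm(res, col = 3, pch = 19, main = "Normal Q-Q Plot of Residuals")
qqline(res, col = 2, lwd = 2)

# Informal summary: correlation of theoretical (x) and sample (y) quantiles
qq_cor <- cor(qq$x, qq$y)
qq_cor
```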

Run the following command to open a new graphics window for the next plot (dev.new() works on every platform, whereas windows() is only available on Windows):

dev.new()

#After that, let's use a Histogram of the Standardized Residuals to check whether or not the residuals follow a normal distribution:

Histogram of the Standardized Residuals (It's used to Confirm Normality Assumption)

#rstandard() returns the standardized residuals, whereas resid() returns the raw ones:

std_res=rstandard(reg_2)

hist(std_res, prob=TRUE, col=c(1:8), main="Histogram of the Standardized Residuals", col.main=6, col.lab="purple", xlab="Standardized Residuals", sub="Figure XI")

In the resulting plot, a roughly symmetric, bell-shaped histogram centered at zero supports the normality assumption.
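
To make the comparison with the normal distribution easier to see, a standard normal density curve can be drawn over the histogram. This is a minimal sketch; the number of bins (30) and the names `stocks`, `std_res` and `h` are arbitrary choices for the example.

```r
# Refit the multiple regression so the block runs on its own
stocks  <- as.data.frame(EuStockMarkets)
reg_2   <- lm(DAX ~ SMI + CAC + FTSE, data = stocks)
std_res <- rstandard(reg_2)   # standardized residuals

# Density-scale histogram (total bar area is 1)
h <- hist(std_res, freq = FALSE, breaks = 30, col = "grey80",
          main = "Histogram of the Standardized Residuals",
          xlab = "Standardized Residuals")

# Overlay the N(0, 1) density for visual comparison
curve(dnorm(x), add = TRUE, col = 2, lwd = 2)
```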

Having explored the two plots/graphs, let’s use statistical tests to confirm the normality assumption:

Shapiro-Wilk’s Test (1965) for Normality Assumption:

Look at the hypotheses here:

#Ho: The residuals are normally distributed Vs H1: The residuals are NOT normally distributed

#The command is:

res=resid(reg_2)

shapiro.test(res)

The Decision Rule is: Reject Ho if p-value is less than or equal to 0.05

Now, the p-value (= 0.0000) is less than 0.05, which means that we reject the null hypothesis (Ho) of normality and conclude that the residuals are NOT normally distributed.
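
The decision rule can also be applied in code rather than by reading the printout. The sketch below assumes a significance level of 0.05, as in the rule above; `stocks`, `sw` and `alpha` are hypothetical names.

```r
# Refit the multiple regression so the block runs on its own
stocks <- as.data.frame(EuStockMarkets)
reg_2  <- lm(DAX ~ SMI + CAC + FTSE, data = stocks)
res    <- resid(reg_2)

# Shapiro-Wilk test; its null hypothesis is that the data are normal
sw    <- shapiro.test(res)
alpha <- 0.05   # assumed significance level

if (sw$p.value <= alpha) {
  message("Reject Ho: the residuals do not look normally distributed")
} else {
  message("Do not reject Ho: no evidence against normality")
}
```

Keep in mind that shapiro.test() only accepts between 3 and 5000 observations, which is enough for the 1860 residuals here.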

We can also use another statistical test to confirm normality as follows:

Kolmogorov-Smirnov Test for Normality Assumption

#Ho: The residuals are normally distributed Vs H1: The residuals are NOT normally distributed

#The command is:

res=resid(reg_2)

ks.test(res, pnorm, mean(res), sd(res))

The Decision Rule is: Reject Ho if p-value is less than or equal to 0.05

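Now, the p-value (= 0.0000) is less than 0.05, which means that we reject the null hypothesis (Ho) of normality and conclude that the residuals are NOT normally distributed. Because the mean and standard deviation are estimated from the same residuals, this p-value is only approximate.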
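
To see which way the hypotheses point, it can help to run the same test on data that are normal by construction and compare the p-values. This is only an illustrative sketch; the seed and the names `stocks`, `sim`, `ks_res` and `ks_sim` are assumptions for the example.

```r
# Refit the multiple regression so the block runs on its own
stocks <- as.data.frame(EuStockMarkets)
reg_2  <- lm(DAX ~ SMI + CAC + FTSE, data = stocks)
res    <- resid(reg_2)

# Kolmogorov-Smirnov test of the residuals against a fitted normal
ks_res <- ks.test(res, "pnorm", mean = mean(res), sd = sd(res))

# Same test on simulated normal data of the same length, for comparison
set.seed(1)
sim    <- rnorm(length(res))
ks_sim <- ks.test(sim, "pnorm", mean = mean(sim), sd = sd(sim))

c(residuals = ks_res$p.value, simulated = ks_sim$p.value)
```

A small p-value is evidence against normality, so data that really are normal would typically give a large one.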

HOMOSCEDASTICITY ASSUMPTION

#This is another crucial assumption: we should test and confirm that the variance is constant, i.e. that the data set is homoscedastic.

Bartlett’s Test for Homoscedasticity Assumption

length(DAX)

length(SMI)

length(CAC)

length(FTSE)

##Since all four variables are of the same length (1860), we can build a grouping factor with 4 levels of 1860 values each and stack the variables into one vector:

y1=gl(4,1860)

y2=c(DAX, SMI, CAC, FTSE)

#Set the hypotheses as follows:

Ho: The variances of the four variables are equal (the data set is homoscedastic) Vs H1: At least one variance differs (the data set is NOT homoscedastic)

bartlett.test(y2,y1)

OR

bartlett.test(y2~y1)

#NB: Both forms produce the same results.

The Decision Rule is: Reject Ho if p-value is less than or equal to 0.05

Now, the p-value (= 0.0000) is less than 0.05, which means that we reject the null hypothesis (Ho) and conclude that the data set is NOT homoscedastic: the four variables do not share a common variance.
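
The same test can be set up without typing the group size by hand. The sketch below uses stack() to build the long format with labelled groups, and prints the four sample variances so you can see what the test is comparing. The names `stocks`, `long` and `bt` are hypothetical.

```r
# Data-frame copy of the built-in series (hypothetical name 'stocks')
stocks <- as.data.frame(EuStockMarkets)

# stack() returns two columns: 'values' and the group label 'ind'
long <- stack(stocks)

# Bartlett's test; its null hypothesis is that all group variances are equal
bt <- bartlett.test(values ~ ind, data = long)
print(bt)

# The sample variances being compared
sapply(stocks, var)
```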

##For the same homoscedasticity assumption, we can use another statistical test, this time applied to the fitted regression model:

Breusch-Pagan Test for Heteroscedasticity Assumption

#First of all, let’s call the library as follows:

library(lmtest)

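After that, let's run the command below (Ho: the error variance is constant, i.e. homoscedastic Vs H1: the error variance is NOT constant, i.e. heteroscedastic):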

bptest(reg_2, studentize=FALSE)

The Decision Rule is: Reject Ho if p-value is less than or equal to 0.05

Now, the p-value (= 0.0000) is less than 0.05, which means that we reject the null hypothesis (Ho) of constant variance and conclude that the residuals are heteroscedastic, NOT homoscedastic.
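
To see what the test computes, here is a base-R sketch of the non-studentized Breusch-Pagan statistic, written from the textbook definition rather than taken from the lmtest source. It regresses the scaled squared residuals on the same predictors and refers half the explained sum of squares to a chi-squared distribution. All names (`stocks`, `aux`, `bp_stat`, `bp_df`, `bp_p`) are assumptions for this example.

```r
# Refit the multiple regression so the block runs on its own
stocks <- as.data.frame(EuStockMarkets)
reg_2  <- lm(DAX ~ SMI + CAC + FTSE, data = stocks)

e      <- resid(reg_2)
n      <- length(e)
sigma2 <- sum(e^2) / n        # ML estimate of the error variance
f      <- e^2 / sigma2 - 1    # scaled squared residuals, mean zero

# Auxiliary regression of the scaled squared residuals on the predictors
aux <- lm(f ~ SMI + CAC + FTSE, data = stocks)

bp_stat <- 0.5 * sum(fitted(aux)^2)   # half the explained sum of squares
bp_df   <- length(coef(aux)) - 1      # one degree of freedom per predictor
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(statistic = bp_stat, df = bp_df, p_value = bp_p)
```

If lmtest is installed, the result can be compared with bptest(reg_2, studentize=FALSE).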

##For the same homoscedasticity assumption, we can also use a PLOT such as:

Residual Plot for Heteroscedasticity Assumption

plot(res, col=c(3:9), main="Residual Plot for Heteroscedasticity Assumption", col.main="blue", sub="Figure III", col.sub="purple", pch=19, ylab="Residual Values", col.lab=2)

##NB: If the plot gives a structureless shape, it is an indication that the homoscedasticity assumption is NOT violated. A funnel shape or any other clear pattern indicates heteroscedasticity.
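
A common variant plots the residuals against the fitted values instead of the observation index, which makes a change in spread across the range of predictions easier to spot. A minimal sketch, with `stocks` as a hypothetical name:

```r
# Refit the multiple regression so the block runs on its own
stocks <- as.data.frame(EuStockMarkets)
reg_2  <- lm(DAX ~ SMI + CAC + FTSE, data = stocks)

# Residuals against fitted values; look for funnels or curves
plot(fitted(reg_2), resid(reg_2), pch = 19, col = 3,
     main = "Residuals vs Fitted Values",
     xlab = "Fitted Values", ylab = "Residual Values")

# Horizontal reference line at zero
abline(h = 0, lty = 2, col = 2)
```

The built-in call plot(reg_2, which = 1) draws a similar diagnostic plot with a smoothed trend line added.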

AUTOCORRELATION ASSUMPTION

This is another important assumption we should look at. EuStockMarkets is a time series (daily closing prices of four European stock indices), so the residuals of the model may be correlated over time and an autocorrelation test is appropriate here.

To test for it, use the command below:

Durbin-Watson Test for Autocorrelation

Before the code can run, we need to call the library “car” as follows:

library(car)

durbinWatsonTest(reg_2, max.lag=2)

The Decision Rule is: Reject Ho (there is no autocorrelation in the residuals) if p-value is less than or equal to 0.05

Now, the p-value (= 0.0000) is less than 0.05, which means that we reject the null hypothesis (Ho) and conclude that the result is significant, leading us to believe that there is an autocorrelation problem in the residuals of the model.
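
The Durbin-Watson statistic itself is simple to compute by hand, which can be useful when the car package is not available. The sketch below gives the lag-1 statistic only (no p-value) together with the lag-1 sample autocorrelation; the names `stocks`, `dw` and `rho1` are assumptions for this example.

```r
# Refit the multiple regression so the block runs on its own
stocks <- as.data.frame(EuStockMarkets)
reg_2  <- lm(DAX ~ SMI + CAC + FTSE, data = stocks)
e      <- resid(reg_2)

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 sample autocorrelation of the residuals (element 1 is lag 0)
rho1 <- acf(e, lag.max = 2, plot = FALSE)$acf[2]

c(durbin_watson = dw, lag1_autocorrelation = rho1)
```

The statistic lies between 0 and 4 and is roughly 2 * (1 - rho1): values near 2 suggest no first-order autocorrelation, while values near 0 suggest strong positive autocorrelation.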