In terms of selecting a statistical test, the most important question is “what is the main study hypothesis?” In some cases there is no hypothesis; the investigator just wants to “see what is there”. For example, in a prevalence study there is no hypothesis to test, and the size of the study is determined by how accurately the investigator wants to determine the prevalence. If there is no hypothesis, there is no statistical test. It is important to decide a priori which hypotheses are confirmatory (that is, testing some presupposed relationship) and which are exploratory (suggested by the data). No single study can support a whole series of hypotheses, so a sensible plan is to limit severely the number of confirmatory hypotheses. Although it is valid to use statistical tests on hypotheses suggested by the data, the P values should be used only as guidelines, and the results treated as tentative until confirmed by subsequent studies.

A useful guide is the Bonferroni correction, which states simply that if one is testing n independent hypotheses, one should use a significance level of 0.05/n. Thus if there were two independent hypotheses, a result would be declared significant only if P < 0.025. Note that, since tests are rarely independent, this is a very conservative procedure – i.e. one that is unlikely to reject the null hypothesis.

The investigator should then ask “are the data independent?” This can be difficult to decide, but as a rule of thumb results on the same individual, or from matched individuals, are not independent. Thus results from a crossover trial, or from a case-control study in which the controls were matched to the cases by age, sex and social class, are not independent.
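As a minimal sketch of the Bonferroni correction described above (the p-values here are made up purely for illustration), base R's p.adjust() gives the equivalent adjustment:

# Hypothetical p-values from n = 2 independent hypotheses
p_raw <- c(0.030, 0.012)
n <- length(p_raw)
p_raw < 0.05 / n                                 # compare each p-value to 0.05/n
p.adjust(p_raw, method = "bonferroni") < 0.05    # equivalent: adjust, then compare to 0.05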
Parametric tests assume a normal distribution of values, or a “bell-shaped curve”. For example, height is roughly normally distributed: if you were to graph the heights of a group of people, you would see a typical bell-shaped curve. This distribution is also called a Gaussian distribution. Parametric tests are in general more powerful (require a smaller sample size) than nonparametric tests. Common nonparametric counterparts include the Wilcoxon signed-rank test, the Mann–Whitney U test and the Kruskal–Wallis test.
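Before reaching for a parametric test, it can help to check the shape of the distribution first. The sketch below uses simulated heights (purely hypothetical data) with base R's hist() and shapiro.test():

# Simulated, roughly normal heights; a strongly skewed variable would instead
# point towards a nonparametric test.
set.seed(123)
heights <- rnorm(100, mean = 170, sd = 10)
hist(heights, main = "Simulated heights", xlab = "Height (cm)")
shapiro.test(heights)   # a large p-value gives no evidence against normality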
The One Sample t Test examines whether the mean of a population is statistically different from a known or hypothesized value. The One Sample t Test is a parametric test, also known as the Single Sample t Test. The variable used in this test is known as the test variable. In a One Sample t Test, the test variable’s mean is compared against a “test value”, which is a known or hypothesized value of the mean in the population. Test values may come from a literature review, a trusted research organization, legal requirements, or industry standards. For example:

1. A particular factory’s machines are supposed to fill bottles with 150 milliliters of product. A plant manager wants to test a random sample of bottles to ensure that the machines are not under- or over-filling the bottles.
2. The United States Environmental Protection Agency (EPA) sets clearance levels for the amount of lead present in homes: no more than 10 micrograms per square foot on floors and no more than 100 micrograms per square foot on window sills (as of December 2020). An inspector wants to test whether samples taken from units in an apartment building exceed the clearance level.
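As a hedged sketch of example 1 (the data here are simulated, not real bottling measurements), the one-sample t-test against the 150 ml target would look like this:

# Hypothetical sample of 30 bottle fill volumes, tested against mu = 150 ml
set.seed(42)
fill_ml <- rnorm(30, mean = 151, sd = 2)
t.test(fill_ml, mu = 150)   # H0: the true mean fill volume is 150 ml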
knitr::opts_chunk$set(comment = NA)
library(haven)
library(tidyverse)
library(agricolae)
library(stargazer)
library(highcharter)
library(ggplot2)
library(ggthemes)
library(plotrix)
library(tseries)
library(vars)
library(olsrr)
library(car)
library(grid)
library(rmarkdown)
library(ggtext)
library(RColorBrewer)
library(rhandsontable)
library(forecast)
library(quantmod)
library(xts)
library(PerformanceAnalytics)
library(rugarch)
library(dplyr)
library(magrittr)
library(gapminder)
library(plyr)
library(survival)
library(survminer)
library(ggpubr)
Use the command below to ensure that the output values are not written in scientific notation.
options(scipen=999)
theme_set(theme_economist())
theme_set(theme_base())   # note: this second call overrides theme_economist(); keep whichever default theme you prefer
Superstore_data<-read.csv("C:\\Users\\user\\Downloads\\Superstore_data.csv")
attach(Superstore_data)
head(Superstore_data,10)
tail(Superstore_data,10)
Sales<-ts(Superstore_data$Sales,start=1265,frequency = 1)
Profit<-ts(Superstore_data$Profit,start = 1265,frequency = 1)
ts.plot(Sales, main="A time series plot of Sales from Superstore company",xlab="Time in years",ylab="Sales")
A time series plot of Sales from Superstore company
ts.plot(Profit, main="A time series plot of Profit from Superstore company",xlab="Time in years",ylab="Profit")
A time series plot of Profit from Superstore company
The one-sample t-test is a statistical hypothesis test used to determine whether an unknown population mean is different from a specific value.
You can use the test for continuous data. Your data should be a random sample from a normal population.
If your sample sizes are very small, you might not be able to test for normality. You might need to rely on your understanding of the data. When you cannot safely assume normality, you can perform a nonparametric test that doesn’t assume normality.
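When normality is doubtful, the one-sample Wilcoxon signed-rank test is the usual nonparametric counterpart. As a sketch, it can be applied to the same Sales variable and the same test value of 200 used in the t-test that follows:

# Nonparametric counterpart of the one-sample t-test: tests whether Sales
# is centred at 200 without assuming normality.
wilcox.test(Superstore_data$Sales, mu = 200)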
The company assumes that from 2016 to 2019 it made average sales of approximately 200. Test this hypothesis to determine whether the company management was right in this assumption.
t.test(Superstore_data$Sales,mu=200,alternative="two.sided",conf.level=0.95)
One Sample t-test
data: Superstore_data$Sales
t = 4.7893, df = 9993, p-value = 0.000001698
alternative hypothesis: true mean is not equal to 200
95 percent confidence interval:
217.6375 242.0785
sample estimates:
mean of x
229.858
In the above output, the p-value of the test is approximately 0.0000017, which is far less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean of sales from 2016 to 2019 is not equal to 200 at a 5% level of significance.
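As a quick check on the arithmetic, the t statistic above can be reproduced by hand from its formula, t = (x̄ − μ0) / (s / √n):

# Hand computation of the one-sample t statistic and two-sided p-value
x <- Superstore_data$Sales
tval <- (mean(x) - 200) / (sd(x) / sqrt(length(x)))
tval                                     # matches t = 4.7893 in the output above
2 * pt(-abs(tval), df = length(x) - 1)   # matches the reported p-value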
The Independent Samples t Test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. The Independent Samples t Test is a parametric test.
data22<-Superstore_data %>%
dplyr::select(Region, Sales)%>%
filter(Region == "West"|
Region == "East")
head(data22,5)
tail(data22,5)
t.test(Sales ~ Region,alternative="two.sided",data = data22)
Welch Two Sample t-test
data: Sales by Region
t = 0.79611, df = 5603.9, p-value = 0.426
alternative hypothesis: true difference in means between group East and group West is not equal to 0
95 percent confidence interval:
-17.31977 41.00552
sample estimates:
mean in group East mean in group West
238.3361 226.4932
In the above output, the p-value of the test is 0.426, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis and conclude that there is no statistically significant difference in mean sales between the West and East regions at a 5% level of significance.
data23<-Superstore_data %>%
dplyr::select(Category, Sales)%>%
filter(Category == "Technology"|
Category == "Furniture")
head(data23,5)
tail(data23,5)
t.test(Sales ~ Category,alternative="two.sided",data = data23)
Welch Two Sample t-test
data: Sales by Category
t = -3.6721, df = 2497.7, p-value = 0.0002456
alternative hypothesis: true difference in means between group Furniture and group Technology is not equal to 0
95 percent confidence interval:
-157.8094 -47.9394
sample estimates:
mean in group Furniture mean in group Technology
349.8349 452.7093
In the above output, the p-value of the test is approximately 0.0002, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean difference in sales between the Technology and Furniture categories is not equal to zero at a 5% level of significance.
data24<-Superstore_data %>%
dplyr::select(Category, Sales)%>%
filter(Category == "Technology"|
Category == "Office Supplies")
head(data24,5)
t.test(Sales ~ Category,alternative="two.sided",data = data24)
Welch Two Sample t-test
data: Sales by Category
t = -12.694, df = 1982.1, p-value < 0.00000000000000022
alternative hypothesis: true difference in means between group Office Supplies and group Technology is not equal to 0
95 percent confidence interval:
-384.8897 -281.8807
sample estimates:
mean in group Office Supplies mean in group Technology
119.3241 452.7093
In the above output, the p-value of the test is far below 0.0001, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean difference in sales between the Technology and Office Supplies categories is not equal to zero at a 5% level of significance.
One-Way ANOVA (“analysis of variance”) compares the means of two or more independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. One-Way ANOVA is a parametric test.
anova_model<-aov(Sales~Category)
summary(anova_model)
Df Sum Sq Mean Sq F value Pr(>F)
Category 2 195881745 97940872 265.5 <0.0000000000000002 ***
Residuals 9991 3685743767 368906
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 265.5 with an associated p-value far below 0.05. This indicates that there exists a significant difference in average sales across the three categories at a 5% level of significance.
Tukey HSD is the commonly used post hoc test to determine which pairs of categories differ in mean sales from 2016 to 2019.
tukey_hsd<-TukeyHSD(anova_model)
tukey_hsd
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Sales ~ Category)
$Category
diff lwr upr p adj
Office Supplies-Furniture -230.5108 -266.45583 -194.5657 0.0000000
Technology-Furniture 102.8744 57.56303 148.1858 0.0000003
Technology-Office Supplies 333.3852 295.51937 371.2510 0.0000000
From the results above, a significant difference in average sales exists between every pair of categories, as indicated by an adjusted p-value below 0.05 for all pairs. These results can be visualized as shown below.
plot(tukey_hsd, las=1)
Tukey's honest significant difference
anova_model2<-aov(Sales~Sub.Category)
summary(anova_model2)
Df Sum Sq Mean Sq F value Pr(>F)
Sub.Category 16 776253968 48515873 155.9 <0.0000000000000002 ***
Residuals 9977 3105371543 311253
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 155.9 with an associated p-value far below 0.05. This indicates that there exists a significant difference in average sales across the seventeen sub-categories at a 5% level of significance.
tukey_hsd2<-TukeyHSD(anova_model2)
tukey_hsd2
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Sales ~ Sub.Category)
$Sub.Category
diff lwr upr p adj
Appliances-Accessories 14.7811064 -98.352371 127.914584 1.0000000
Art-Accessories -181.9057697 -279.299223 -84.512317 0.0000000
Binders-Accessories -82.4140438 -167.571615 2.743527 0.0707858
Bookcases-Accessories 287.8850290 142.479391 433.290667 0.0000000
Chairs-Accessories 316.3578159 212.227967 420.487665 0.0000000
Copiers-Accessories 1982.9670138 1738.872771 2227.061256 0.0000000
Envelopes-Accessories -151.1068795 -290.643768 -11.569991 0.0187882
Fasteners-Accessories -202.0378297 -350.263756 -53.811903 0.0003125
Furnishings-Accessories -120.1489362 -213.413396 -26.884476 0.0010412
Labels-Accessories -181.6715489 -304.305120 -59.037978 0.0000400
Machines-Accessories 1429.5787092 1236.717765 1622.439653 0.0000000
Paper-Accessories -158.6905119 -245.436934 -71.944090 0.0000000
Phones-Accessories 155.2369304 60.389846 250.084015 0.0000021
Storage-Accessories 48.6159493 -47.347208 144.579107 0.9455486
Supplies-Accessories 29.6755961 -126.561825 185.913017 0.9999997
Tables-Accessories 432.8201673 304.435984 561.204351 0.0000000
Art-Appliances -196.6868761 -309.258580 -84.115172 0.0000002
Binders-Appliances -97.1951502 -199.365184 4.974884 0.0839321
Bookcases-Appliances 273.1039226 117.124105 429.083741 0.0000002
Chairs-Appliances 301.5767095 183.128706 420.024713 0.0000000
Copiers-Appliances 1968.1859073 1717.648678 2218.723136 0.0000000
Envelopes-Appliances -165.8879859 -316.411895 -15.364077 0.0147424
Fasteners-Appliances -216.8189361 -375.431134 -58.206739 0.0002927
Furnishings-Appliances -134.9300426 -243.949139 -25.910946 0.0022497
Labels-Appliances -196.4526554 -331.455976 -61.449334 0.0000641
Machines-Appliances 1414.7976027 1213.844256 1615.750949 0.0000000
Paper-Appliances -173.4716183 -276.969665 -69.973572 0.0000009
Phones-Appliances 140.4558240 30.079770 250.831878 0.0013283
Storage-Appliances 33.8348429 -77.501726 145.171412 0.9998021
Supplies-Appliances 14.8944897 -151.229065 181.018045 1.0000000
Tables-Appliances 418.0390609 277.791414 558.286708 0.0000000
Binders-Art 99.4917259 15.081912 183.901540 0.0052743
Bookcases-Art 469.7907987 324.821821 614.759777 0.0000000
Chairs-Art 498.2635856 394.744359 601.782813 0.0000000
Copiers-Art 2164.8727835 1921.038405 2408.707162 0.0000000
Envelopes-Art 30.7988902 -108.282913 169.880694 0.9999974
Fasteners-Art -20.1320600 -167.929659 127.665539 1.0000000
Furnishings-Art 61.7568335 -30.825370 154.339037 0.6491218
Labels-Art 0.2342208 -121.881288 122.349730 1.0000000
Machines-Art 1611.4844789 1418.952537 1804.016420 0.0000000
Paper-Art 23.2152578 -62.797221 109.227737 0.9999589
Phones-Art 337.1427001 242.966407 431.318994 0.0000000
Storage-Art 230.5217190 135.221496 325.821942 0.0000000
Supplies-Art 211.5813658 55.750251 367.412481 0.0003411
Tables-Art 614.7259370 486.836518 742.615356 0.0000000
Bookcases-Binders 370.2990728 233.250425 507.347720 0.0000000
Chairs-Binders 398.7718597 306.671058 490.872661 0.0000000
Copiers-Binders 2065.3810576 1826.170834 2304.591281 0.0000000
Envelopes-Binders -68.6928357 -199.498322 62.112651 0.9265977
Fasteners-Binders -119.6237859 -259.661129 20.413557 0.2055723
Furnishings-Binders -37.7348924 -117.345140 41.875355 0.9698971
Labels-Binders -99.2575051 -211.856461 13.341450 0.1629283
Machines-Binders 1511.9927530 1325.351719 1698.633787 0.0000000
Paper-Binders -76.2764681 -148.140741 -4.412195 0.0244962
Phones-Binders 237.6509742 156.192387 319.109562 0.0000000
Storage-Binders 131.0299931 48.274572 213.785414 0.0000059
Supplies-Binders 112.0896399 -36.401652 260.580932 0.4181442
Tables-Binders 515.2342111 396.398061 634.070362 0.0000000
Chairs-Bookcases 28.4727869 -121.105104 178.050678 0.9999997
Copiers-Bookcases 1695.0819848 1428.412678 1961.751291 0.0000000
Envelopes-Bookcases -438.9919085 -615.063096 -262.920721 0.0000000
Fasteners-Bookcases -489.9228587 -672.956859 -306.888859 0.0000000
Furnishings-Bookcases -408.0339652 -550.261879 -265.806052 0.0000000
Labels-Bookcases -469.5565779 -632.558150 -306.555006 0.0000000
Machines-Bookcases 1141.6936801 920.954324 1362.433037 0.0000000
Paper-Bookcases -446.5755409 -584.617062 -308.534020 0.0000000
Phones-Bookcases -132.6480986 -275.918784 10.622587 0.1080946
Storage-Bookcases -239.2690797 -383.281050 -95.257110 0.0000013
Supplies-Bookcases -258.2094329 -447.789631 -68.629235 0.0003179
Tables-Bookcases 144.9351383 -22.435763 312.306039 0.1863844
Copiers-Chairs 1666.6091979 1420.006796 1913.211600 0.0000000
Envelopes-Chairs -467.4646954 -611.344120 -323.585271 0.0000000
Fasteners-Chairs -518.3956456 -670.716593 -366.074698 0.0000000
Furnishings-Chairs -436.5067521 -536.151146 -336.862358 0.0000000
Labels-Chairs -498.0293648 -625.582250 -370.476480 0.0000000
Machines-Chairs 1113.2208933 917.195157 1309.246630 0.0000000
Paper-Chairs -475.0483278 -568.620158 -381.476498 0.0000000
Phones-Chairs -161.1208855 -262.248108 -59.993663 0.0000049
Storage-Chairs -267.7418666 -369.916586 -165.567147 0.0000000
Supplies-Chairs -286.6822198 -446.809910 -126.554529 0.0000001
Tables-Chairs 116.4623514 -16.628761 249.553464 0.1723836
Envelopes-Copiers -2134.0738932 -2397.589097 -1870.558690 0.0000000
Fasteners-Copiers -2185.0048435 -2453.222376 -1916.787311 0.0000000
Furnishings-Copiers -2103.1159499 -2345.330687 -1860.901213 0.0000000
Labels-Copiers -2164.6385627 -2419.606624 -1909.670502 0.0000000
Machines-Copiers -553.3883046 -848.625604 -258.151005 0.0000000
Paper-Copiers -2141.6575257 -2381.437969 -1901.877082 0.0000000
Phones-Copiers -1827.7300833 -2070.558600 -1584.901566 0.0000000
Storage-Copiers -1934.3510645 -2177.617681 -1691.084447 0.0000000
Supplies-Copiers -1953.2914176 -2226.018114 -1680.564721 0.0000000
Tables-Copiers -1550.1468465 -1807.930124 -1292.363569 0.0000000
Fasteners-Envelopes -50.9309502 -229.338317 127.476417 0.9999128
Furnishings-Envelopes 30.9579433 -105.264385 167.180272 0.9999962
Labels-Envelopes -30.5646695 -188.353313 127.223974 0.9999996
Machines-Envelopes 1580.6855886 1363.767155 1797.604023 0.0000000
Paper-Envelopes -7.5836324 -139.429015 124.261751 1.0000000
Phones-Envelopes 306.3438099 169.033094 443.654526 0.0000000
Storage-Envelopes 199.7228288 61.638829 337.806829 0.0000749
Supplies-Envelopes 180.7824756 -4.334771 365.899722 0.0644947
Tables-Envelopes 583.9270468 421.628674 746.225420 0.0000000
Furnishings-Fasteners 81.8888935 -63.221082 226.998869 0.8728069
Labels-Fasteners 20.3662808 -145.156039 185.888600 1.0000000
Machines-Fasteners 1631.6165388 1409.009286 1854.223792 0.0000000
Paper-Fasteners 43.3473178 -97.661856 184.356492 0.9997704
Phones-Fasteners 357.2747601 211.142577 503.406943 0.0000000
Storage-Fasteners 250.6537790 103.794754 397.512804 0.0000005
Supplies-Fasteners 231.7134258 39.961562 423.465289 0.0034553
Tables-Fasteners 634.8579970 465.031191 804.684803 0.0000000
Labels-Furnishings -61.5226128 -180.371140 57.325914 0.9347411
Machines-Furnishings 1549.7276453 1359.251076 1740.204214 0.0000000
Paper-Furnishings -38.5415757 -119.849148 42.765997 0.9698815
Phones-Furnishings 275.3858666 185.486205 365.285528 0.0000000
Storage-Furnishings 168.7648855 77.688504 259.841267 0.0000000
Supplies-Furnishings 149.8245323 -3.459881 303.108945 0.0639163
Tables-Furnishings 552.9691034 428.195395 677.742812 0.0000000
Machines-Labels 1611.2502581 1404.799158 1817.701358 0.0000000
Paper-Labels 22.9810370 -90.824299 136.786373 0.9999993
Phones-Labels 336.9084794 216.814007 457.002952 0.0000000
Storage-Labels 230.2874982 109.309647 351.265350 0.0000000
Supplies-Labels 211.3471451 38.613696 384.080594 0.0027717
Tables-Labels 614.4917162 466.474111 762.509322 0.0000000
Paper-Machines -1588.2692211 -1775.640525 -1400.897917 0.0000000
Phones-Machines -1274.3417787 -1465.598238 -1083.085319 0.0000000
Storage-Machines -1380.9627599 -1572.775146 -1189.150374 0.0000000
Supplies-Machines -1399.9031130 -1627.923278 -1171.882949 0.0000000
Tables-Machines -996.7585419 -1206.676532 -786.840552 0.0000000
Phones-Paper 313.9274423 230.809266 397.045619 0.0000000
Storage-Paper 207.3064612 122.916951 291.695972 0.0000000
Supplies-Paper 188.3661080 38.957964 337.774252 0.0015901
Tables-Paper 591.5106792 471.530846 711.490513 0.0000000
Storage-Phones -106.6209811 -199.317352 -13.924610 0.0078092
Supplies-Phones -125.5613343 -279.813794 28.691125 0.2813586
Tables-Phones 277.5832369 151.622179 403.544295 0.0000000
Supplies-Storage -18.9403532 -173.881566 136.000860 1.0000000
Tables-Storage 384.2042180 257.400644 511.007792 0.0000000
Tables-Supplies 403.1445712 226.282053 580.007090 0.0000000
Make an interpretation of the results above.

## Plot
plot(tukey_hsd2, las=1)
A plot of Tukey's honest significant difference
ANOVA stands for analysis of variance and tests for differences in the effects of independent variables on a dependent variable. A two-way ANOVA test is a statistical test used to determine the effect of two nominal predictor variables on a continuous outcome variable.
Superstore_data$Category<-as.factor(Superstore_data$Category)
Superstore_data$Sub.Category<- as.factor((Superstore_data$Sub.Category))
linear_model<- lm(Sales~Category+Sub.Category, data = Superstore_data)
Two_factor_anova<-anova(linear_model)
Two_factor_anova
From the results above, the F statistic for Category is 314.67 with an associated p-value far below 0.01, indicating a significant difference in mean sales across categories at a 1% level of significance. On the other hand, the F statistic for Sub.Category is 133.19 with an associated p-value far below 0.01, a clear indication that there exists a significant difference in average sales across sub-categories at a 1% level of significance.
stargazer(Two_factor_anova, type="text")
=================================================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------------------------
Df 3 3,331.000 5,755.608 2 9,977
Sum Sq 3 1,293,875,171.000 1,580,537,112.000 195,881,745.000 3,105,371,543.000
Mean Sq 3 46,569,095.000 49,015,303.000 311,253.000 97,940,872.000
F value 2 223.927 128.325 133.188 314.666
Pr(> F) 2 0.000 0.000 0 0
---------------------------------------------------------------------------------
The Chi-square goodness of fit test is a statistical hypothesis test used to determine whether a variable is likely to come from a specified distribution or not. It is often used to evaluate whether sample data is representative of the full population.
Let us enter some data using the command below. The data, obtained from LinkedIn, gives the number of job openings in four different categories of professions.
jobs<-c(11091,11282,15378,12696)
Assign the following names to the categories:
names(jobs)<-c("Project Management","Supply chain","Services","Quality control")
Run the following command to view the available jobs from various categories of professions
jobs
Project Management Supply chain Services Quality control
11091 11282 15378 12696
It is important to calculate the proportion of job vacancies in various categories.
jobs/sum(jobs)
Project Management Supply chain Services Quality control
0.2198545 0.2236407 0.3048348 0.2516701
These are the proportions we get. Remember that these proportions may change on a daily basis: new jobs keep opening as others close, so the proportions are prone to variation. We would therefore like to find out whether the differences in these proportions are real or due to chance; in other words, whether they are statistically significant. Let us assume that the proportions are the same in all four categories, that is, approximately 25% in each. We create the object probability as shown below.
probability<-c(0.25,0.25,0.25,0.25)
For the chi-square goodness-of-fit test, the hypotheses are as follows:

* Null hypothesis: the proportions in the job categories are not statistically different from 0.25.
* Alternative hypothesis: the proportions in the job categories are statistically different from 0.25.

The alternative hypothesis indicates that proportions in some job categories fall below or above 0.25. Run the following code chunk to run the chi-square test.
chisq.test(jobs,p=probability)
Chi-squared test for given probabilities
data: jobs
X-squared = 930.89, df = 3, p-value < 0.00000000000000022
From the result, the chi-square test statistic is 930.89 with an associated p-value far below 0.05. There is therefore not sufficient evidence to support the null hypothesis, and we conclude that the proportions of jobs are statistically different from 0.25, that is, below or above 0.25, at a 5% level of significance. The degrees of freedom are 3 because there are four categories (n − 1).
Students in a social studies class hypothesize that the literacy rates across the world for every region are 82%. The table below shows the actual literacy rates across the world broken down by region. What are the test statistic and the degrees of freedom?
illit_lavel<-c(99,99.5,67.3,62.5,91,93.8,61.9,91.9,84.5,66.4)
Assign the following region names to the values:
names(illit_lavel)<-c("Developed Regions","Commonwealth of Independent States","Northern Africa", "Sub-saharan Africa","Latin American and the caribbean","Eastern Asia","Southern Asia","South-Eastern Asia","Western Asia","Oceania")
Run the following command to view the literacy rates for the various regions.
illit_lavel
Developed Regions Commonwealth of Independent States
99.0 99.5
Northern Africa Sub-saharan Africa
67.3 62.5
Latin American and the caribbean Eastern Asia
91.0 93.8
Southern Asia South-Eastern Asia
61.9 91.9
Western Asia Oceania
84.5 66.4
It is important to calculate each region's share of the total literacy rates.
illit_lavel/sum(illit_lavel)
Developed Regions Commonwealth of Independent States
0.12105649 0.12166789
Northern Africa Sub-saharan Africa
0.08229396 0.07642455
Latin American and the caribbean Eastern Asia
0.11127415 0.11469797
Southern Asia South-Eastern Asia
0.07569088 0.11237466
Western Asia Oceania
0.10332600 0.08119345
chisq.test(illit_lavel)
Chi-squared test for given probabilities
data: illit_lavel
X-squared = 26.449, df = 9, p-value = 0.001725
1-pchisq(26.449,9)
[1] 0.001724323
From the result, the chi-square test statistic is 26.449 with an associated p-value of approximately 0.002, which is far below 0.05. There is therefore not sufficient evidence to support the null hypothesis, and we conclude that the literacy rates across the regions are statistically different from a uniform 82%, that is, some are below or above 82%, at a 5% level of significance. The degrees of freedom are 9 because there are ten regions (n − 1).
Employers want to know which days of the week employees are absent in a five-day work week. Most employers would like to believe that employees are absent equally during the week. Suppose a random sample of 60 managers were asked on which day of the week they had the highest number of employee absences. The results were distributed as in the table below. For the population of employees, do the days for the highest number of absences occur with equal frequencies during a five-day work week? Test at a 5% significance level.
absenteeism<-c(15,12,9,9,15)
Assign the following day names to the values:
names(absenteeism)<-c("Monday","Tuesday","Wednesday","Thursday","Friday")
Run the following command to view the number of reported absences for each day of the week.
absenteeism
Monday Tuesday Wednesday Thursday Friday
15 12 9 9 15
It is important to calculate the proportion of absences on each day of the week.
absenteeism/sum(absenteeism)
Monday Tuesday Wednesday Thursday Friday
0.25 0.20 0.15 0.15 0.25
prob<-c(0.20,0.20,0.20,0.20,0.20)
chisq.test(absenteeism,p=prob)
Chi-squared test for given probabilities
data: absenteeism
X-squared = 3, df = 4, p-value = 0.5578
1- pchisq(3,4)
[1] 0.5578254
From the result, the chi-square test statistic is 3 with an associated p-value of approximately 0.5578, which is far above 0.05. There is therefore not sufficient evidence to reject the null hypothesis, so at a 5% level of significance we cannot conclude that the absence days occur with unequal frequencies. In other words, the data are consistent with absences occurring with equal frequency across the five days.

## For further practice consider the example below.
knitr::include_graphics("chi.png")
Distribution of voters
The Chi-Square statistic is commonly used for testing relationships between categorical variables. The null hypothesis of the Chi-Square test is that no relationship exists between the categorical variables in the population; they are independent. An example research question that could be answered using a Chi-Square analysis would be:

* Is there a significant relationship between voter intent and political party membership?
The Chi-Square statistic is most commonly used to evaluate Tests of Independence when using a cross tabulation (also known as a bivariate table). Cross tabulation presents the distributions of two categorical variables simultaneously, with the intersections of the categories of the variables appearing in the cells of the table. The Test of Independence assesses whether an association exists between the two variables by comparing the observed pattern of responses in the cells to the pattern that would be expected if the variables were truly independent of each other. Calculating the Chi-Square statistic and comparing it against a critical value from the Chi-Square distribution allows the researcher to assess whether the observed cell counts are significantly different from the expected cell counts.
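The observed and expected counts can be inspected directly from the object that chisq.test() returns. The sketch below uses a small made-up 2 × 2 table, not data from this document:

# Hypothetical 2 x 2 table of counts
tab <- rbind(c(30, 20),
             c(10, 40))
test <- chisq.test(tab)
test$observed   # the observed cell counts
test$expected   # the counts expected if the two variables were independent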
In other words, the chi-square test of independence checks whether two variables are likely to be related or not. We have counts for two categorical (nominal) variables and an idea that the two variables are not related; the test gives us a way to decide whether that idea is plausible. For instance, assume that we have two variables, gender and voting. Because we have two variables, we shall use the rbind function to store our data in the object voters.
voters<-rbind(c(1486,2131),c(2792,3591))
voters
[,1] [,2]
[1,] 1486 2131
[2,] 2792 3591
dimnames(voters)<-list(gender=c("male","female"),
voting=c("voted","didn't vote")
)
Run the code chunk to look at the data we have just entered
voters
voting
gender voted didn't vote
male 1486 2131
female 2792 3591
Run the code chunk below to run the chi square test of independence
chisq.test(voters)
Pearson's Chi-squared test with Yates' continuity correction
data: voters
X-squared = 6.5523, df = 1, p-value = 0.01047
From the result, the chi-square test statistic is 6.5523 with an associated p-value of approximately 0.01, which is below 0.05. There is therefore not sufficient evidence to support the null hypothesis, and we conclude that there is a statistically significant dependence of voting on gender at a 5% level of significance. We can reject the null hypothesis with 95% confidence.
Consider the information below
knitr::include_graphics("chi2.png")
Automobile owners
Let us enter the information into RStudio.
Buyer<-rbind(c(70,130,150),c(30,20,100))
Give names to the dimension columns and rows using the following code chunk
dimnames(Buyer)<-list(Gender=c("Male","Female"),
Reason=c("Styling","Engineering","Fuel economy"))
Buyer
Reason
Gender Styling Engineering Fuel economy
Male 70 130 150
Female 30 20 100
chisq.test(Buyer)
Pearson's Chi-squared test
data: Buyer
X-squared = 31.746, df = 2, p-value = 0.0000001278
From the result, the chi-square test statistic is 31.746 with an associated p-value far below 0.1. There is therefore not sufficient evidence to support the null hypothesis, and we conclude that there is a statistically significant dependence of the reason for purchase on gender at a 10% level of significance. We can reject the null hypothesis with 90% confidence. The output above has 2 degrees of freedom, calculated as follows:
\[ (\text{Number of columns}-1)\times(\text{Number of rows}-1) = (3-1)(2-1) = 2 \]
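The same degrees of freedom can be read straight off the dimensions of the table already stored in Buyer:

# Degrees of freedom for the test of independence
(ncol(Buyer) - 1) * (nrow(Buyer) - 1)   # (3 - 1) * (2 - 1) = 2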
A random sample of 400 people were surveyed and each person was asked to report the highest education level they obtained. The data that resulted from this survey is summarized in the following contingency table.
Education<-rbind(c(60,56,46,42),c(40,46,53,57))
Give names to the dimension columns and rows using the following code chunk
dimnames(Education)<-list(Gender=c("Male","Female"),
Educ_level=c("High school","Bachelors","Masters","PhD"))
Education
Educ_level
Gender High school Bachelors Masters PhD
Male 60 56 46 42
Female 40 46 53 57
chisq.test(Education)
Pearson's Chi-squared test
data: Education
X-squared = 7.5911, df = 3, p-value = 0.05526
From the result, the chi-square test statistic is 7.5911 with an associated p-value of approximately 0.055, which is slightly above 0.05. We therefore fail to reject the null hypothesis and conclude that there is no statistically significant dependence of education level on gender at a 5% level of significance. We retain the null hypothesis with 95% confidence.
Ch_sqt <- read_sav("C:/Users/user/Downloads/Effects of employees voice mechanism on academic staff turnover intentions1.sav")
attach(Ch_sqt)
head(Ch_sqt,10)
TBL<-table(Ch_sqt$Q4, Ch_sqt$Q6)
TBL
1 2 3 4
1 155 16 5 52
2 53 3 2 67
The two-way table above has no column or row names, so it is important to provide column and row names for easier identification of the dimensions.
dimnames(TBL)<-list(Employment_terms=c("Permanent","Contractual"),
Trade_union=c("UASU","KUSU","Any other","Not a member of any"))
TBL
Trade_union
Employment_terms UASU KUSU Any other Not a member of any
Permanent 155 16 5 52
Contractual 53 3 2 67
Run the following code chunk to run the chi square test
chisq.test(TBL)
Warning in chisq.test(TBL): Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: TBL
X-squared = 35.018, df = 3, p-value = 0.0000001208
From the result, the chi-square test statistic is 35.018 with an associated p-value far below 0.05. There is therefore not sufficient evidence to support the null hypothesis, and we conclude that there is a statistically significant dependence of trade union membership on terms of employment at a 5% level of significance. We can reject the null hypothesis with 95% confidence.
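Because of the warning above (some expected cell counts are small, so the chi-square approximation may be unreliable), a Monte Carlo p-value is a reasonable cross-check. This sketch is an addition, not part of the original analysis:

# Monte Carlo p-value for the same table, avoiding the chi-square approximation
set.seed(1)
chisq.test(TBL, simulate.p.value = TRUE, B = 10000)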
TAB<-table(Ch_sqt$Q6)
TAB
1 2 3 4
208 19 7 119
dimnames(TAB)<-list(Trade_Union=c("UASU","KUSU","Member of any other","Not a member of any"))
TAB
Trade_Union
UASU KUSU Member of any other Not a member of any
208 19 7 119
My_TAB<-count(Ch_sqt,'Q6')
My_TAB
rownames(My_TAB)<-c("UASU","KUSU","Member of any other","Not a member of any")
My_TAB
Statistical techniques are tools that enable us to answer questions about possible patterns in empirical data. It is not surprising, then, to learn that many important techniques of statistical analysis were developed by scientists who were interested in answering very specific empirical questions. So it was with regression analysis. The history of this particular statistical technique can be traced back to late nineteenth-century England and the pursuits of a gentleman scientist, Francis Galton. Galton was born into a wealthy family that produced more than its share of geniuses; he and Charles Darwin, the famous biologist, were first cousins. During his lifetime, Galton studied everything from fingerprint classification to meteorology, but he gained widespread recognition primarily for his work on inheritance. His most important insight came to him while he was studying the inheritance of one of the most obvious of all human characteristics: height. In order to understand how the characteristic of height was passed from one generation to the next, Galton collected data on the heights of individuals and the heights of their parents. After constructing frequency tables that classified these individuals both by their height and by the average height of their parents, Galton came to the unremarkable conclusion that tall people usually had tall parents and short people usually had short parents.
In order to run a linear regression, it is important first to establish that there is a linear association between the two variables; for instance, one can create a scatter plot of the two variables to assess the linear association. Consider the scatter plot below for Sales and Profit.
plot(Sales, Profit, main = "A scatter plot of Sales and Profit")
A scatter plot of Profit and Sales
plot(log(Sales),Profit)
plot(Sales, log(Profit))
Warning in log(Profit): NaNs produced
plot(log(Sales),log(Profit))
Warning in log(Profit): NaNs produced
Scatter plot of sales and profit
cor(Sales,Profit)
[1] 0.4790643
reg_model<-lm(Profit~Sales, data = Superstore_data)
summary(reg_model)
Call:
lm(formula = Profit ~ Sales, data = Superstore_data)
Residuals:
Min 1Q Median 3Q Max
-7397.5 2.6 14.6 21.7 5261.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.732867 2.192459 -5.808 0.00000000653 ***
Sales 0.180067 0.003301 54.555 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 205.6 on 9992 degrees of freedom
Multiple R-squared: 0.2295, Adjusted R-squared: 0.2294
F-statistic: 2976 on 1 and 9992 DF, p-value: < 0.00000000000000022
stargazer(reg_model, type="text")
===============================================
Dependent variable:
---------------------------
Profit
-----------------------------------------------
Sales 0.180***
(0.003)
Constant -12.733***
(2.192)
-----------------------------------------------
Observations 9,994
R2 0.230
Adjusted R2 0.229
Residual Std. Error 205.639 (df = 9992)
F Statistic 2,976.247*** (df = 1; 9992)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
anova(reg_model)
plot(reg_model)
Regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage
confint(reg_model)
2.5 % 97.5 %
(Intercept) -17.0305284 -8.4352058
Sales 0.1735967 0.1865366
Resid<-residuals(reg_model)
summary(Resid)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-7397.542 2.581 14.634 0.000 21.658 5261.551
From the results above, the residuals have a mean of 0.00, which indicates that the zero conditional mean condition is met. Next, it is important to check the covariance between the residuals and the independent variable; if the covariance is approximately zero, this is consistent with interpreting changes in the dependent variable as driven by the regressor.
cov(Resid,Sales)
[1] -0.00000000005062127
The covariance is approximately zero, consistent with the model's assumption that changes in Profit are driven by changes in Sales.
Resid<-ts(Resid, start = 1265, frequency = 1)
ts.plot(Resid, main="A time series plot of regression residuals",xlab="Time in years",ylab="Residuals")
abline(0,-0.00000000005062127)
Regression residuals
hist(log(Sales),main="Histogram showing the distribution of sales from Superstore",xlab="log of sales", ylab="Frequencies",breaks = 15)
summary(log(Sales))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.8119 2.8495 3.9980 4.1098 5.3468 10.0274
ggplot(data=Superstore_data, aes(log(Sales))) +
  geom_histogram(aes(y = after_stat(density)), color="red")+
  scale_x_continuous(limits = c(-2,12))+
  stat_function(fun = dnorm,
                args = list(mean = mean(log(Sales)),
                            sd = sd(log(Sales))),
                col = "#1b98e0",
                linewidth = 1)+
  labs(title = "Histogram showing the distribution of
       Sales for Superstore Company",
       subtitle = "Sales from 2016 to 2019",
       caption="Data Collected and compiled by the Superstore Company")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing missing values (`geom_bar()`).
Histogram showing the distribution of sales
ggplot(data=Superstore_data, aes(log(Sales))) +
  geom_histogram(aes(y = after_stat(density)), color="red")+
  scale_x_continuous(limits = c(-2,12))+
  stat_function(fun = dnorm,
                args = list(mean = mean(log(Sales)),
                            sd = sd(log(Sales))),
                col = "#1b98e0",
                linewidth = 1)+
  labs(title = "Histogram showing the distribution of Sales for
       Superstore Company",
       subtitle = "Sales from 2016 to 2019",
       caption="Data Collected and compiled by the Superstore Company")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.subtitle = element_text(hjust = 0.5))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing missing values (`geom_bar()`).
Histogram showing the distribution of Sales
Summarize the average sales from various regions
Average_sales<-Superstore_data %>%
  dplyr::select(Sales, Region)%>%
  filter(Region == "Central") %>%
  group_by(Region)%>%
  dplyr::summarise(Average_sale = mean(Sales))   # dplyr:: prefix avoids plyr's summarise masking the grouped version
Average_sales
Average_sales<-Superstore_data %>%
  dplyr::select(Sales, Region)%>%
  filter(Region == "South") %>%
  group_by(Region)%>%
  dplyr::summarise(Average_sale = mean(Sales))
Average_sales
Superstore_data %>% group_by(Region) %>%
  dplyr::summarize(average_sales=mean(Sales))
Superstore_data %>% group_by(Region) %>%
  drop_na() %>%
  dplyr::summarize(average_sales=mean(Sales))
# Create data for the graph.
average_sales <- c(215.7727, 236.1127, 241.8036,226.4932)
Regions <- c("Central","East","South","West")
piepercent<- round(100*average_sales/sum(average_sales), 1)
# Plot the chart.
pie3D(average_sales,labels = Regions,explode = 0.06, main = "Pie Chart of
      Sales from Various Regions of the United States")
average_sales <- c(215.7727, 236.1127, 241.8036,226.4932)
Regions <- c("Central","East","South","West")
piepercent<- round(100*average_sales/sum(average_sales), 1)
pie(average_sales, labels = piepercent, main = "Pie Chart of Sales from
Various regions of the United States",col = rainbow(length(average_sales)))
legend("topright", c("Central","East","South","West"), cex = 0.8,
fill = rainbow(length(average_sales)))
Pie chart of sales for various regions
Summarize the average sales from various categories
Superstore_data %>% group_by(Category) %>%
  drop_na() %>%
  dplyr::summarize(average_sales=mean(Sales))
# Create data for the graph.
average_sales <- c(347.7488, 119.0760,452.5782)
Sales_category <- c("Furniture","Office Supplies","Technology")
piepercent<- round(100*average_sales/sum(average_sales), 1)
# Plot the chart.
pie3D(average_sales,labels = Sales_category,explode = 0.06, main = "Pie Chart of
Sales from Various Categories")
pie(average_sales, labels = piepercent, main = "Pie Chart of Sales from
Various Categories",col = rainbow(length(average_sales)))
legend("topright", c("Furniture","Office Supplies","Technology"), cex = 0.8,
fill = rainbow(length(average_sales)))
Pie chart of sales for various categories
average_sales1 <- c(347.7488, 119.0760,452.5782)
Sales_category1 <- c("Furniture","Office Supplies","Technology")
barplot(average_sales1,names.arg=Sales_category1,xlab="Category",ylab="Average Sales",
col="blue",
main="Bar graph for the average sales for various category",border="red")
Bar graph for various categories
Superstore_data %>% group_by(Sub.Category) %>%
  drop_na() %>%
  dplyr::summarize(average_sales=mean(Sales))
average_sales2 <- c(216.13882,230.08435,34.10157,133.56056,486.67443,532.03556,2198.94162,65.11606,13.93677,95.82567,34.30305, 1645.55331,57.30044,370.17151,263.05245,245.65020,648.79477 )
sub_catgory_sales <- c("Accessories","Appliances","Art","Binders","Bookcases","Chairs","Copiers","Envelopes","Fasteners","Furnishings","Labels","Machines","Paper","Phones", "Storage","Supplies","Tables")
barplot(average_sales2,names.arg=sub_catgory_sales,xlab="sub_category",ylab="Average Sales",col="grey",main="Bar graph for the average sales for various sub categories",border="blue")
Bar graph for various sub categories
ggplot(Superstore_data,aes(x=Sub.Category, y=Sales))+
geom_bar(stat = "identity")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and
Sub categories")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(Superstore_data,aes(x=Sub.Category, y=mean(Sales)))+
geom_bar(stat = "identity")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")+
theme(axis.text.x = element_markdown(angle = 90))
Grouped Barplot of Sales for various Categories and Sub categories
Superstore_data %>%
filter(Category == "Furniture"|
Category == "Office Supplies"|
Category == "Technology")%>%
ggplot(aes(Category, fill= Sub.Category))+
geom_bar(position = "dodge",
alpha=0.5)+
theme_economist()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
labs(title = "Grouped Bar Graph",
y="Number",
x="Category")
Grouped bar graph
Superstore_data %>%
filter(Category == "Furniture"|
Category == "Office Supplies"|
Category == "Technology")%>%
ggplot(aes(Category, fill= Sub.Category))+
geom_bar(position = "fill",
alpha=0.5)+
theme_economist()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
labs(title = "Grouped Bar Graph",
y="Number",
x="Category")
Stacked Bar graphs 100%
Superstore_data %>%
filter(Category == "Furniture"|
Category == "Office Supplies"|
Category == "Technology")%>%
ggplot(aes(Category, fill= Sub.Category))+
geom_bar(alpha=0.5)+
theme_economist_white()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
labs(title = "Grouped Bar Graph",
y="Frequency",
x="Category")
Stacked bar graph
ggplot(Superstore_data,aes(x=Category, y=mean(Sales), fill=Sub.Category))+
geom_bar(stat = "identity")+
theme_calc()
Stacked Bar Graph for proportion
ggplot(Superstore_data,aes(x=Category, y=Sales, fill=Sub.Category))+
geom_bar(stat = "identity", position = "dodge")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")
Grouped Barplot of Sales for various Categories and Sub categories
ggplot(Superstore_data,aes(x=Category, y=Sales, fill=Sub.Category))+
geom_bar(stat = "identity", position = "dodge")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")+
facet_wrap(Category~.)
Grouped Barplot of Sales for various Categories and Sub categories
ggplot(Superstore_data,aes(x=Category, y=Sales, fill=Sub.Category))+
geom_bar(stat = "identity", position = "dodge")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")+
  theme(legend.position = "none")+
  facet_wrap(Category~.)
unemploymnt_data<-read.csv("C:\\Users\\user\\Downloads\\Unemployment.csv")
attach(unemploymnt_data)
head(unemploymnt_data, 10)
tail(unemploymnt_data,10)
This is a quarterly data from 1980 first quarter to 2021 last quarter.
Unemployment<-ts(unemploymnt_data$Unemployment,frequency = 1,start = 1859)
ts.plot(Unemployment,main="Yearly time series plot of Unemployment from 1859 to 2022", xlab="Time in quarters",ylab="Unemployment rate")
Yearly time series plot of Unemployment from 1859 to 2022
Inflation<-ts(unemploymnt_data$Inflation,frequency = 1,start = 1859)
ts.plot(Inflation,main="Yearly time series plot of Inflation from 1859 to 2022", xlab="Time in quarters",ylab="Inflation rate")
Yearly time series plot of Inflation from 1859 to 2022
FedRate<-ts(unemploymnt_data$FedRate,frequency = 1,start = 1859)
ts.plot(FedRate,main="Yearly time series plot of FedRate from 1859 to 2022", xlab="Time in quarters",ylab="Fed Funds rate")
Yearly time series plot of FedRate from 1859 to 2022
plot(Unemployment,Inflation)
Scatter plot of unemployment and inflation
plot(FedRate,Inflation)
Scatter plot of FedRate and Inflation
cor(Inflation,FedRate)
[1] 0.6716283
cor(Inflation,Unemployment)
[1] 0.2172026
hist(Inflation, main="Histogram showing the distribution of Inflation")
Histogram showing the distribution of Inflation
hist(Unemployment, main="Histogram showing the distribution of Unemployment")
Histogram showing the distribution of Unemployment
hist(FedRate, main = "Histogram showing the distribution of FedRate")
Histogram showing the distribution of FedRate
shapiro.test(Inflation)
Shapiro-Wilk normality test
data: Inflation
W = 0.91272, p-value = 0.00000002459
shapiro.test(Unemployment)
Shapiro-Wilk normality test
data: Unemployment
W = 0.96757, p-value = 0.0006884
shapiro.test(FedRate)
Shapiro-Wilk normality test
data: FedRate
W = 0.9102, p-value = 0.00000001704
Estimate the model using the log transformed values of inflation, unemployment and FedRate to bring about linearity between the dependent variable and independent variables.
REG_MODEL<-lm(log(Inflation)~log(Unemployment)+log(FedRate),data = unemploymnt_data)
stargazer(REG_MODEL,type = "text")
===============================================
Dependent variable:
---------------------------
log(Inflation)
-----------------------------------------------
log(Unemployment) 0.176
(0.157)
log(FedRate) 0.976***
(0.085)
Constant -0.902***
(0.291)
-----------------------------------------------
Observations 164
R2 0.473
Adjusted R2 0.466
Residual Std. Error 0.498 (df = 161)
F Statistic 72.167*** (df = 2; 161)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
outlierTest(REG_MODEL)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p
14 -3.172999 0.0018094 0.29673
The Bonferroni-adjusted p-value (0.297, well above 0.05) shows that there are no significant outliers in the data set.
qqPlot(REG_MODEL, main= "QQ Plot Showing the Possible Presence of outliers")
QQ Plot Showing the Possible Presence of outliers
[1] 14 15
leveragePlots(REG_MODEL)
Leverage plots for the regression model
vif(REG_MODEL)
log(Unemployment) log(FedRate)
1.034369 1.034369
A VIF of 1.034 for log(Unemployment) and 1.034 for log(FedRate) indicates that the predictors are essentially uncorrelated, so multicollinearity is not a concern.
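The reported VIFs can also be reproduced by hand from their definition, VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the other; this is just an illustrative check:

# VIF by hand for a model with two predictors
aux <- lm(log(Unemployment) ~ log(FedRate), data = unemploymnt_data)
1 / (1 - summary(aux)$r.squared)   # should match the VIF reported above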
ncvTest(REG_MODEL)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 2.623628, Df = 1, p = 0.10528
The test is not significant (p = 0.105), so we fail to reject the null hypothesis of constant error variance; the results are consistent with homoscedastic errors.
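Since lmtest is already loaded, the studentized Breusch–Pagan test offers a quick cross-check of the same constant-variance assumption; this sketch is an addition to the original analysis:

# Cross-check of homoscedasticity with the Breusch-Pagan test
bptest(REG_MODEL)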
spreadLevelPlot(REG_MODEL)
Warning in spreadLevelPlot.lm(REG_MODEL):
2 negative fitted values removed
Spread-Level Plot for Regression Model
Suggested power transformation: 1.024086
durbinWatsonTest(REG_MODEL)
lag Autocorrelation D-W Statistic p-value
1 0.7001071 0.5736665 0
Alternative hypothesis: rho != 0
The results show that there is serial correlation in the regression residuals across periods (D-W = 0.57, p ≈ 0).
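Because the Durbin–Watson test points to serial correlation, Newey–West (HAC) standard errors are a common remedy alongside the heteroscedasticity-robust HC0 errors used next. This is a sketch using the sandwich package, not part of the original analysis:

# HAC standard errors that allow for the serial correlation detected above
library(sandwich)
coeftest(REG_MODEL, NeweyWest(REG_MODEL))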
coeftest(REG_MODEL, hccm(REG_MODEL, type = "hc0"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.901611 0.272771 -3.3054 0.001169 **
log(Unemployment) 0.176483 0.148486 1.1885 0.236368
log(FedRate) 0.976433 0.074715 13.0688 < 0.00000000000000022 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
stargazer(coeftest(REG_MODEL, hccm(REG_MODEL, type = "hc0")),type="text")
=============================================
Dependent variable:
---------------------------
---------------------------------------------
log(Unemployment) 0.176
(0.148)
log(FedRate) 0.976***
(0.075)
Constant -0.902***
(0.273)
=============================================
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
RSD<-residuals(REG_MODEL)
summary(RSD)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.52073 -0.35028 -0.01588 0.00000 0.35593 1.13292
The zero conditional mean assumption is met, because the residuals mean is approximately zero.
cov(RSD,log(FedRate))
[1] 0.00000000000000007316962
cov(RSD,log(Unemployment))
[1] 0.0000000000000001019151
RSD<-ts(RSD, frequency = 1,start = 1859)
ts.plot(RSD)
Times series of residual
ts.plot(RSD)
abline(0,0.000)
Time series plot of residuals with an abline line
ggplot(data=unemploymnt_data,aes(x=year,y=Inflation))+geom_line()
Times series plot of inflation
ggplot(data=unemploymnt_data,aes(x=year,y=Inflation))+geom_line()+
labs(title="Time Series plot of Inflation Rate",
caption="source:Federal Reserve",
y="Inflation Rate", x="Year",
color=3) + # title and caption
theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 12), # rotate x axis text
axis.title=element_text(size=12,face="bold"),
panel.grid = element_blank())+
#theme(panel.grid.minor = element_blank())+#turn off minor grid(to run remove #be4 theme)
theme(legend.text = element_text(size=12,face="bold"))+
theme_set(theme_economist())
Time Series plot of Inflation Rate
ggplot(data=unemploymnt_data,aes(x=year,y=Unemployment))+geom_line()+
labs(title="Time Series plot of Unemployment Rate",
caption="source:Federal Reserve",
y="Unemployment Rate", x="Year",
color=3) + # title and caption
theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 12), # rotate x axis text
axis.title=element_text(size=12,face="bold"),
panel.grid = element_blank())+
#theme(panel.grid.minor = element_blank())+#turn off minor grid(to run remove #be4 theme)
theme(legend.text = element_text(size=12,face="bold"))+
theme_set(theme_economist())
Time Series plot of Unemployment Rate
ggplot(data=unemploymnt_data,aes(x=year,y=FedRate))+geom_line()+
labs(title="Time Series plot of Fed Fund Rate",
caption="source:Federal Reserve",
y="Fed Fund Rate", x="Year",
color=3) + # title and caption
theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 12), # rotate x axis text
axis.title=element_text(size=12,face="bold"),
panel.grid = element_blank())+
#theme(panel.grid.minor = element_blank())+#turn off minor grid(to run remove #be4 theme)
theme(legend.text = element_text(size=12,face="bold"))+
theme_set(theme_economist())
Time Series plot of Fed Fund Rate
ggplot(data=unemploymnt_data,aes(x=year))+
geom_line(aes(y=Inflation,colour="Inflation Rate"))+
geom_line(aes(y=FedRate,colour="FedRate"))+
geom_line(aes(y=Unemployment,colour="Unemployment Rate"))+
labs(title="Trends of Inflation, Unemployment and FedRate from 1859 to 2022",
caption="", y="Rate", x="Time in Years", color="Key")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 10))
Trends of Inflation, Unemployment and FedRate from 1859 to 2022
date<-seq(as.Date("1859-01-01"),by="1 year",length.out=length(unemploymnt_data$year))
date
[1] "1859-01-01" "1860-01-01" "1861-01-01" "1862-01-01" "1863-01-01"
[6] "1864-01-01" "1865-01-01" "1866-01-01" "1867-01-01" "1868-01-01"
[11] "1869-01-01" "1870-01-01" "1871-01-01" "1872-01-01" "1873-01-01"
[16] "1874-01-01" "1875-01-01" "1876-01-01" "1877-01-01" "1878-01-01"
[21] "1879-01-01" "1880-01-01" "1881-01-01" "1882-01-01" "1883-01-01"
[26] "1884-01-01" "1885-01-01" "1886-01-01" "1887-01-01" "1888-01-01"
[31] "1889-01-01" "1890-01-01" "1891-01-01" "1892-01-01" "1893-01-01"
[36] "1894-01-01" "1895-01-01" "1896-01-01" "1897-01-01" "1898-01-01"
[41] "1899-01-01" "1900-01-01" "1901-01-01" "1902-01-01" "1903-01-01"
[46] "1904-01-01" "1905-01-01" "1906-01-01" "1907-01-01" "1908-01-01"
[51] "1909-01-01" "1910-01-01" "1911-01-01" "1912-01-01" "1913-01-01"
[56] "1914-01-01" "1915-01-01" "1916-01-01" "1917-01-01" "1918-01-01"
[61] "1919-01-01" "1920-01-01" "1921-01-01" "1922-01-01" "1923-01-01"
[66] "1924-01-01" "1925-01-01" "1926-01-01" "1927-01-01" "1928-01-01"
[71] "1929-01-01" "1930-01-01" "1931-01-01" "1932-01-01" "1933-01-01"
[76] "1934-01-01" "1935-01-01" "1936-01-01" "1937-01-01" "1938-01-01"
[81] "1939-01-01" "1940-01-01" "1941-01-01" "1942-01-01" "1943-01-01"
[86] "1944-01-01" "1945-01-01" "1946-01-01" "1947-01-01" "1948-01-01"
[91] "1949-01-01" "1950-01-01" "1951-01-01" "1952-01-01" "1953-01-01"
[96] "1954-01-01" "1955-01-01" "1956-01-01" "1957-01-01" "1958-01-01"
[101] "1959-01-01" "1960-01-01" "1961-01-01" "1962-01-01" "1963-01-01"
[106] "1964-01-01" "1965-01-01" "1966-01-01" "1967-01-01" "1968-01-01"
[111] "1969-01-01" "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01"
[116] "1974-01-01" "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01"
[121] "1979-01-01" "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01"
[126] "1984-01-01" "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01"
[131] "1989-01-01" "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01"
[136] "1994-01-01" "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01"
[141] "1999-01-01" "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01"
[146] "2004-01-01" "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01"
[151] "2009-01-01" "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01"
[156] "2014-01-01" "2015-01-01" "2016-01-01" "2017-01-01" "2018-01-01"
[161] "2019-01-01" "2020-01-01" "2021-01-01" "2022-01-01"
ggplot(data=unemploymnt_data,aes(x=date))+
geom_line(aes(y=Inflation,colour="Inflation Rate"))+
labs(title="Trends of Inflation Rate from 1859 to 2022",
caption="", y="Inflation Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y", breaks = "4 year")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Trends of Inflation Rate from 1859 to 2022
ggplot(data=unemploymnt_data,aes(x=date))+
geom_line(aes(y=Unemployment,colour="Unemployment Rate"))+
labs(title="Trends of Unemployment Rate from 1859 to 2022",
caption="", y="Unemployment Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y", breaks = "4 year")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
ggplot(data=unemploymnt_data,aes(x=date))+
geom_line(aes(y=FedRate,colour="Fed Fund Rate"))+
labs(title="Trends of Fed Fund Rate from 1859 to 2022",
caption="", y="Fed Fund Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y", breaks = "4 year")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Trends of Fed Fund Rate from 1859 to 2022
ggplot(data=unemploymnt_data,aes(x=date))+
geom_line(aes(y=FedRate,colour="Fed Fund Rate"))+
geom_line(aes(y=Inflation,colour="Inflation Rate"))+
geom_line(aes(y=Unemployment,colour="Unemployment Rate"))+
labs(title="Trends of Fed Fund Rate, Inflation Rate and Unemployment
Rate from 1859 to 2022",
caption="", y="Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y", breaks = "4 year")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Trends of Fed Rate, Inflation and Unemployment from 1859 to 2022
plot_1<-ggplot(data=unemploymnt_data,aes(x=date,y=Inflation))+geom_line()+
labs(title="Inflation Rate from 1859 to 2022",
caption="", y="Inflation Rate", x="Time in Years", color=3)+
scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
plot_2<-ggplot(data=unemploymnt_data,aes(x=date,y=Unemployment))+geom_line()+
labs(title="Unemployment Rate from 1859 to 2022",
caption="", y="Unemployment Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
plot_3<-ggplot(data=unemploymnt_data,aes(x=date,y=FedRate))+geom_line()+
labs(title="Fed Fund Rates from 1859 to 2022",
caption="", y="Fed Fund Rates", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),ggplotGrob(plot_2),ggplotGrob(plot_3),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),ggplotGrob(plot_2),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),ggplotGrob(plot_3),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_2),ggplotGrob(plot_3),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_2),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_3),size="last"))
data11<-data.frame(unemploymnt_data$year,unemploymnt_data$Inflation,unemploymnt_data$FedRate,unemploymnt_data$Unemployment,RSD)
head(data11,10)
write.csv(data11,"C:\\Users\\user\\Downloads\\residual_data.csv", row.names = FALSE)
ggplot(data=data11,aes(x=year,y=RSD))+geom_line()+
labs(title="Regression Residuals",
caption="Source:Federal Reserve", y="Residuals", x="Time in Years", color=3)
Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.
residual_data<-read.csv("C:\\Users\\user\\Downloads\\residual_data.csv")
attach(residual_data)
The following object is masked _by_ .GlobalEnv:
RSD
head(residual_data,10)
Time<-seq(as.Date("1859-01-01"),by="1 year",length.out=length(residual_data$unemploymnt_data.year))
Time
[1] "1859-01-01" "1860-01-01" "1861-01-01" "1862-01-01" "1863-01-01"
[6] "1864-01-01" "1865-01-01" "1866-01-01" "1867-01-01" "1868-01-01"
[11] "1869-01-01" "1870-01-01" "1871-01-01" "1872-01-01" "1873-01-01"
[16] "1874-01-01" "1875-01-01" "1876-01-01" "1877-01-01" "1878-01-01"
[21] "1879-01-01" "1880-01-01" "1881-01-01" "1882-01-01" "1883-01-01"
[26] "1884-01-01" "1885-01-01" "1886-01-01" "1887-01-01" "1888-01-01"
[31] "1889-01-01" "1890-01-01" "1891-01-01" "1892-01-01" "1893-01-01"
[36] "1894-01-01" "1895-01-01" "1896-01-01" "1897-01-01" "1898-01-01"
[41] "1899-01-01" "1900-01-01" "1901-01-01" "1902-01-01" "1903-01-01"
[46] "1904-01-01" "1905-01-01" "1906-01-01" "1907-01-01" "1908-01-01"
[51] "1909-01-01" "1910-01-01" "1911-01-01" "1912-01-01" "1913-01-01"
[56] "1914-01-01" "1915-01-01" "1916-01-01" "1917-01-01" "1918-01-01"
[61] "1919-01-01" "1920-01-01" "1921-01-01" "1922-01-01" "1923-01-01"
[66] "1924-01-01" "1925-01-01" "1926-01-01" "1927-01-01" "1928-01-01"
[71] "1929-01-01" "1930-01-01" "1931-01-01" "1932-01-01" "1933-01-01"
[76] "1934-01-01" "1935-01-01" "1936-01-01" "1937-01-01" "1938-01-01"
[81] "1939-01-01" "1940-01-01" "1941-01-01" "1942-01-01" "1943-01-01"
[86] "1944-01-01" "1945-01-01" "1946-01-01" "1947-01-01" "1948-01-01"
[91] "1949-01-01" "1950-01-01" "1951-01-01" "1952-01-01" "1953-01-01"
[96] "1954-01-01" "1955-01-01" "1956-01-01" "1957-01-01" "1958-01-01"
[101] "1959-01-01" "1960-01-01" "1961-01-01" "1962-01-01" "1963-01-01"
[106] "1964-01-01" "1965-01-01" "1966-01-01" "1967-01-01" "1968-01-01"
[111] "1969-01-01" "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01"
[116] "1974-01-01" "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01"
[121] "1979-01-01" "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01"
[126] "1984-01-01" "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01"
[131] "1989-01-01" "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01"
[136] "1994-01-01" "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01"
[141] "1999-01-01" "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01"
[146] "2004-01-01" "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01"
[151] "2009-01-01" "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01"
[156] "2014-01-01" "2015-01-01" "2016-01-01" "2017-01-01" "2018-01-01"
[161] "2019-01-01" "2020-01-01" "2021-01-01" "2022-01-01"
residual_plot<-ggplot(data=residual_data,aes(x=Time,y=RSD))+geom_line()+
labs(title="Regression Residuals",
caption="", y="Residual", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
grid.newpage()
grid.draw(rbind(ggplotGrob(residual_plot),size="last"))
sample_sales<-sample(Superstore_data$Sales,size = 200, replace = FALSE)
sample_profit<-sample(Superstore_data$Profit,size = 200, replace = FALSE)
sample_quantity<-sample(Superstore_data$Quantity,size = 200, replace = FALSE)
sample_data<-data.frame(sample_sales,sample_profit,sample_quantity)
head(sample_data,10)
tail(sample_data,10)
In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The bias arises when the omitted variable is correlated with one or more of the included independent variables and with the dependent variable. Consider the data below.
research_data<-read.csv("C:\\Users\\user\\Downloads\\research_data.csv")
attach(research_data)
The following object is masked from unemploymnt_data (pos = 4):
year
The following object is masked from unemploymnt_data (pos = 5):
year
head(research_data,10)
tail(research_data,10)
mlr1 <- lm(Private_investment~income_tax+value_added_tax, data = research_data)
stargazer(mlr1,type = "text")
===============================================
Dependent variable:
---------------------------
Private_investment
-----------------------------------------------
income_tax -1.017***
(0.344)
value_added_tax 2.828***
(0.658)
Constant 40,402.690***
(12,398.920)
-----------------------------------------------
Observations 46
R2 0.712
Adjusted R2 0.698
Residual Std. Error 60,037.290 (df = 43)
F Statistic 53.103*** (df = 2; 43)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
mlr2<- lm(Private_investment~income_tax+value_added_tax+import_tax, data = research_data)
stargazer(mlr2,type = "text")
===============================================
Dependent variable:
---------------------------
Private_investment
-----------------------------------------------
income_tax -1.026***
(0.323)
value_added_tax 1.752**
(0.743)
import_tax 4.249**
(1.630)
Constant 15,523.390
(15,054.700)
-----------------------------------------------
Observations 46
R2 0.752
Adjusted R2 0.734
Residual Std. Error 56,362.240 (df = 42)
F Statistic 42.433*** (df = 3; 42)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
waldtest(mlr1,mlr2,test = "F")
The null hypothesis for this test is that the coefficient of the omitted variable is zero; if we fail to reject it, the variable was correctly omitted. The alternative hypothesis is that the coefficient of the omitted variable is not equal to zero, so rejecting the null hypothesis indicates that the variable was incorrectly omitted. Here the p-value is approximately 0.01, so we reject the null hypothesis and conclude that import_tax was incorrectly omitted: the omitted variable helps to explain the variation in the response variable.
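An equivalent way to run the same nested-model comparison is with anova(), or with a direct linear-hypothesis test on the larger model. A minimal sketch, not part of the original analysis, assuming mlr1 and mlr2 are the models fitted above and the car package is loaded:
#Nested-model F test: is the import_tax coefficient zero in the larger model?
anova(mlr1, mlr2)
#The same restriction tested directly on mlr2 (car package)
linearHypothesis(mlr2, "import_tax = 0")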
fdi_data<-read.csv("C:\\Users\\user\\Downloads\\fdi.csv")
attach(fdi_data)
The following object is masked from research_data:
year
The following object is masked from unemploymnt_data (pos = 5):
year
The following object is masked from unemploymnt_data (pos = 6):
year
head(fdi_data,10)
FDI_USA....of.GDP.<-ts(fdi_data$FDI_USA....of.GDP.,frequency = 1,start = 1970)
ts.plot(FDI_USA....of.GDP.,main="Yearly time series plot of Foreign Direct
Investment Net Flow in the USA from 1970 to 2020", xlab="Time in Years",ylab="Foreign Direct Investment")
Yearly time series of FDI in USA
FDI_UK....of.GDP.<-ts(fdi_data$FDI_UK....of.GDP.,frequency = 1,start = 1970)
ts.plot(FDI_UK....of.GDP.,main="Yearly time series plot of Foreign Direct
Investment Net Flow in the UK from 1970 to 2020", xlab="Time in Years",ylab="Foreign Direct Investment")
Yearly time series of FDI in UK
FDI_Kenya....of.GDP.<-ts(fdi_data$FDI_Kenya....of.GDP.,frequency = 1,start = 1970)
ts.plot(FDI_Kenya....of.GDP.,main="Yearly time series plot of Foreign Direct
Investment Net Flow in Kenya from 1970 to 2020", xlab="Time in Years",ylab="Foreign Direct Investment")
Yearly time series of FDI in Kenya
date_time<-seq(as.Date("1970-01-01"),by="1 year",length.out=length(fdi_data$year))
date_time
[1] "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01" "1974-01-01"
[6] "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01" "1979-01-01"
[11] "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01" "1984-01-01"
[16] "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01" "1989-01-01"
[21] "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01" "1994-01-01"
[26] "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01" "1999-01-01"
[31] "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01" "2004-01-01"
[36] "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01" "2009-01-01"
[41] "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01" "2014-01-01"
[46] "2015-01-01" "2016-01-01" "2017-01-01" "2018-01-01" "2019-01-01"
[51] "2020-01-01"
USA<-ggplot(data=fdi_data,aes(x=date_time,y=FDI_USA....of.GDP.))+geom_line()+
labs(title="FDI Inflow in the USA from 1970 to 2020",
caption="", y="Foreign Direct Investment", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "2 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Kenya<-ggplot(data=fdi_data,aes(x=date_time,y=FDI_Kenya....of.GDP.))+geom_line()+
labs(title="FDI Inflow in Kenya from 1970 to 2020",
caption="", y="Foreign Direct Investment", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "2 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
UK<-ggplot(data=fdi_data,aes(x=date_time,y=FDI_UK....of.GDP.))+geom_line()+
labs(title="FDI Inflow in UK from 1970 to 2020",
caption="", y="Foreign Direct Investment", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "2 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
grid.newpage()
grid.draw(rbind(ggplotGrob(USA),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(Kenya),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(UK),size="last"))
ggplot(data=fdi_data,aes(x=year))+
geom_line(aes(y=FDI_USA....of.GDP.,colour="FDI_USA"))+
geom_line(aes(y=FDI_UK....of.GDP.,colour="FDI_UK"))+
geom_line(aes(y=FDI_Kenya....of.GDP.,colour="FDI_Kenya"))+
labs(title="Trends of FDI inflow as a percentage of GDP from 1970 to 2020",
caption="", y="FDI", x="Time in Years",color="Key")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 10))
Trends of FDI inflow as a percentage of GDP from 1970 to 2020
ForDI<-read.csv("C:\\Users\\user\\Downloads\\ForDI.csv")
attach(ForDI)
The following object is masked _by_ .GlobalEnv:
Kenya
head(ForDI,10)
ForDI_TR<- ts(data=ForDI[,-1],start = min(ForDI$Year),end =max(ForDI$Year))
plot(ForDI_TR,plot.type = "single",col=1:7)
legend("topleft",legend = colnames(ForDI_TR),ncol = 2,lty = 1, col = 1:7,cex = 0.9)
Time series plot of Foreign Direct Investment
plot(ForDI_TR)
ARIMA models provide another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely used approaches to time series forecasting, and provide complementary approaches to the problem. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data. Before we introduce ARIMA models, we must first discuss the concept of stationarity and the technique of differencing time series.
A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time. Some cases can be confusing — a time series with cyclic behaviour (but with no trend or seasonality) is stationary. This is because the cycles are not of a fixed length, so before we observe the series we cannot be sure where the peaks and troughs of the cycles will be. In general, a stationary time series will have no predictable patterns in the long-term. Time plots will show the series to be roughly horizontal (although some cyclic behaviour is possible), with constant variance.
One way to determine more objectively whether differencing is required is to use a unit root test. These are statistical hypothesis tests of stationarity that are designed for determining whether differencing is required. A number of unit root tests are available, which are based on different assumptions and may lead to conflicting answers. In our analysis, we use the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test (Kwiatkowski, Phillips, Schmidt, & Shin, 1992). In this test, the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values (e.g., less than 0.05) suggest that differencing is required. The test can be computed using the ur.kpss() function from the urca package.
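A minimal sketch of the KPSS test described above, assuming the urca package is installed. It is shown here on a simulated random walk purely for illustration; the gdp_TS series constructed below would be a natural series to test in the same way.
library(urca)
set.seed(100)
y <- ts(cumsum(rnorm(200)))   # illustrative random-walk (non-stationary) series
kpss_out <- ur.kpss(y)        # null hypothesis: the series is stationary
summary(kpss_out)             # test statistic versus its critical values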
gdp_data<-read.csv("C:\\Users\\user\\Downloads\\ARIMA.csv")
head(gdp_data,5)
str(gdp_data)
'data.frame': 204 obs. of 2 variables:
$ Year : int 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 ...
$ realgdp: num 1610 1659 1723 1754 1774 ...
#Class of the data
class(gdp_data)
[1] "data.frame"
gdp_TS<- ts(data=gdp_data[,-1],start = min(gdp_data$Year),end =max(gdp_data$Year))
class(gdp_TS)
[1] "ts"
#Plot the dataset
plot(gdp_TS)
#Augmented Dickey-Fuller Test (unit root test)
adf.test(gdp_TS)
Warning in adf.test(gdp_TS): p-value greater than printed p-value
Augmented Dickey-Fuller Test
data: gdp_TS
Dickey-Fuller = 0.28354, Lag order = 5, p-value = 0.99
alternative hypothesis: stationary
From the result above, the p-value (0.99) is much larger than 0.05, so we fail to reject the null hypothesis of a unit root: the GDP series is not stationary.
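As a quick cross-check, not part of the original analysis, forecast::ndiffs() estimates how many differences are needed to make the series stationary, and the ADF test can be re-run on the differenced series; compare the result with the d = 2 selected by auto.arima below.
#How many differences are needed for stationarity?
ndiffs(gdp_TS)
#Re-test stationarity after second differencing
adf.test(diff(gdp_TS, differences = 2))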
#Check for autocorrelation (ACF)
acf(gdp_TS)
pacf(gdp_TS)
#Run auto.arima, which automatically converts the non-stationary data to stationary data by differencing
auto.arima(gdp_TS,ic="aic", trace = TRUE)
Fitting models using approximations to speed things up...
ARIMA(2,2,2) : 2039.769
ARIMA(0,2,0) : 2114.799
ARIMA(1,2,0) : 2073.859
ARIMA(0,2,1) : 2049.559
ARIMA(1,2,2) : 2047.647
ARIMA(2,2,1) : 2037.924
ARIMA(1,2,1) : 2046.277
ARIMA(2,2,0) : 2067.973
ARIMA(3,2,1) : 2041.463
ARIMA(3,2,0) : 2064.69
ARIMA(3,2,2) : 2043.092
Now re-fitting the best model(s) without approximations...
ARIMA(2,2,1) : 2054.273
Best model: ARIMA(2,2,1)
Series: gdp_TS
ARIMA(2,2,1)
Coefficients:
ar1 ar2 ma1
0.2633 0.1442 -0.9656
s.e. 0.0725 0.0726 0.0195
sigma^2 = 1477: log likelihood = -1023.14
AIC=2054.27 AICc=2054.48 BIC=2067.51
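As a sanity check on the selected order, the same specification can be fitted directly with forecast::Arima() and compared with a close competitor from the trace above. This is a minimal sketch, not part of the original analysis.
#Fit the chosen specification manually and inspect it
manual_fit <- Arima(gdp_TS, order = c(2, 2, 1))
summary(manual_fit)
#Compare the AIC of a competing specification from the trace
AIC(Arima(gdp_TS, order = c(2, 2, 2)))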
ARIMAMODEL<-auto.arima(gdp_TS,ic="aic", trace = TRUE)
Fitting models using approximations to speed things up...
ARIMA(2,2,2) : 2039.769
ARIMA(0,2,0) : 2114.799
ARIMA(1,2,0) : 2073.859
ARIMA(0,2,1) : 2049.559
ARIMA(1,2,2) : 2047.647
ARIMA(2,2,1) : 2037.924
ARIMA(1,2,1) : 2046.277
ARIMA(2,2,0) : 2067.973
ARIMA(3,2,1) : 2041.463
ARIMA(3,2,0) : 2064.69
ARIMA(3,2,2) : 2043.092
Now re-fitting the best model(s) without approximations...
ARIMA(2,2,1) : 2054.273
Best model: ARIMA(2,2,1)
ARIMAMODEL
Series: gdp_TS
ARIMA(2,2,1)
Coefficients:
ar1 ar2 ma1
0.2633 0.1442 -0.9656
s.e. 0.0725 0.0726 0.0195
sigma^2 = 1477: log likelihood = -1023.14
AIC=2054.27 AICc=2054.48 BIC=2067.51
summary(ARIMAMODEL)
Series: gdp_TS
ARIMA(2,2,1)
Coefficients:
ar1 ar2 ma1
0.2633 0.1442 -0.9656
s.e. 0.0725 0.0726 0.0195
sigma^2 = 1477: log likelihood = -1023.14
AIC=2054.27 AICc=2054.48 BIC=2067.51
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 2.803157 37.96217 29.06838 0.02152401 0.7154865 0.6297229
ACF1
Training set -0.008628346
#Check the model residuals for remaining autocorrelation (ACF/PACF)
acf(ts(ARIMAMODEL$residuals))
pacf(ts(ARIMAMODEL$residuals))
From the many articles we have read, the ACF is used to identify the order of the Moving Average (MA) term, and the PACF the order of the Autoregressive (AR) term. From the results above, the MA term has one lag, with the AR term having two lags.
#Forecasting
MYGDPF<-forecast(ARIMAMODEL,level=c(95), h=50)
MYGDPF
Point Forecast Lo 95 Hi 95
2021 9357.475 9282.142 9432.809
2022 9415.513 9292.099 9538.927
2023 9476.136 9306.002 9646.271
2024 9538.083 9325.159 9751.008
2025 9600.752 9347.949 9853.555
2026 9663.801 9373.544 9954.058
2027 9727.055 9401.215 10052.895
2028 9790.417 9430.445 10150.390
2029 9853.838 9460.849 10246.827
2030 9917.289 9492.143 10342.436
2031 9980.757 9524.114 10437.401
2032 10044.234 9556.599 10531.870
2033 10107.716 9589.474 10625.958
2034 10171.200 9622.640 10719.761
2035 10234.686 9656.019 10813.352
2036 10298.172 9689.550 10906.793
2037 10361.658 9723.181 11000.136
2038 10425.145 9756.870 11093.420
2039 10488.632 9790.583 11186.681
2040 10552.119 9824.290 11279.948
2041 10615.606 9857.967 11373.244
2042 10679.093 9891.594 11466.592
2043 10742.580 9925.152 11560.008
2044 10806.067 9958.626 11653.508
2045 10869.554 9992.003 11747.105
2046 10933.041 10025.271 11840.810
2047 10996.528 10058.422 11934.634
2048 11060.015 10091.445 12028.584
2049 11123.502 10124.335 12122.669
2050 11186.989 10157.083 12216.894
2051 11250.476 10189.685 12311.266
2052 11313.963 10222.136 12405.790
2053 11377.450 10254.430 12500.470
2054 11440.937 10286.565 12595.309
2055 11504.424 10318.536 12690.312
2056 11567.911 10350.341 12785.481
2057 11631.398 10381.976 12880.819
2058 11694.885 10413.441 12976.329
2059 11758.372 10444.733 13072.011
2060 11821.859 10475.849 13167.869
2061 11885.346 10506.789 13263.903
2062 11948.833 10537.552 13360.114
2063 12012.320 10568.135 13456.505
2064 12075.807 10598.538 13553.075
2065 12139.294 10628.761 13649.826
2066 12202.781 10658.803 13746.759
2067 12266.268 10688.663 13843.873
2068 12329.755 10718.340 13941.170
2069 12393.242 10747.834 14038.649
2070 12456.729 10777.146 14136.312
plot(MYGDPF,type="l",main="Time Series plot of Forecasted GDP (ARIMA 2,2,1)",xlab="Time in Years",ylab="MYGDPF")
ARIMA forecast
#Diagnosing the ARIMA Model/Validating the Model
Box.test(MYGDPF$resid, lag=5,type="Ljung-Box")
Box-Ljung test
data: MYGDPF$resid
X-squared = 1.6084, df = 5, p-value = 0.9002
Box.test(MYGDPF$resid, lag=2,type="Ljung-Box")
Box-Ljung test
data: MYGDPF$resid
X-squared = 0.025225, df = 2, p-value = 0.9875
Box.test(MYGDPF$resid, lag=20,type="Ljung-Box")
Box-Ljung test
data: MYGDPF$resid
X-squared = 23.863, df = 20, p-value = 0.2484
From the results, ARIMA(2,2,1) is adequate for prediction and shows no autocorrelation issue: the Box-Ljung tests give large p-values (e.g. 0.9875 at lag 2), so we fail to reject the null hypothesis of no residual autocorrelation.
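forecast::checkresiduals() bundles the same diagnostics (residual time plot, ACF, histogram, and a Ljung-Box test) into a single call; a minimal sketch, assuming the forecast package is loaded:
#One-call residual diagnostics for the fitted ARIMA model
checkresiduals(ARIMAMODEL)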
autoplot(forecast(MYGDPF))
ARIMA 2,2,1 Forecast
RESID<-residuals(ARIMAMODEL)
summary(RESID)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-127.529 -20.085 2.219 2.803 25.356 154.703
plot(RESID)
abline(2.803,0)
Plot of the residuals with a horizontal line at the residual mean (2.803)
In today’s lesson, you’ll learn the basics of the vector autoregressive model. We lay the foundation for getting started with this crucial multivariate time series model and cover the important details, from specification and lag selection to estimation, Granger causality, impulse response functions, and forecast error variance decomposition.
The vector autoregressive (VAR) model is a workhorse multivariate time series model that relates current observations of a variable to past observations of itself and past observations of the other variables in the system. VAR models differ from univariate autoregressive models because they allow feedback between the variables in the model. For example, we could use a VAR model to show how the inflation rate is a function of the unemployment rate and how the unemployment rate is, in turn, a function of the inflation rate. In economics we might also be interested in establishing a bi-directional relationship between personal income and personal consumption spending; a two-equation VAR system can be used to model the relationship between income and consumption over time.
VAR models are widely used in finance and econometrics because they offer a framework for accomplishing important modeling goals such as data description, forecasting, structural inference, and policy analysis (Stock and Watson 2001).
There are three broad types of VAR models: the reduced-form, the recursive, and the structural VAR model. The recursive model contains all the components of the reduced-form model but also allows some variables to be functions of other contemporaneous variables. By imposing these short-run relationships, the recursive model allows us to model structural shocks.
A VAR model is made up of a system of equations that represents the relationships between multiple variables. When referring to VAR models, we use special language to specify the number of variables and the number of lags: in general, a VAR(p) model is composed of n equations (one for each of the n endogenous variables) and includes p lags of the variables.
Lag selection is one of the most important aspects of VAR model specification. In practical applications, we generally choose a maximum number of lags and evaluate the performance of the model for each candidate lag length up to that maximum. The optimal model is then the VAR(p) which minimizes some lag selection criterion. The most commonly used lag selection criteria are the Akaike information criterion (AIC), the Hannan-Quinn criterion (HQ), the Schwarz (Bayesian) criterion (SC), and the final prediction error (FPE).
From an estimation standpoint, it is important to be deliberate about how many variables we include in our VAR model: each additional variable increases the number of coefficients to be estimated in every equation and for every lag.
Deciding what variables to include in a VAR model should be founded in theory, as much as possible. We can use additional tools, like Granger causality or Sims causality, to test the forecasting relevance of variables.
Granger causality tests whether a variable is “helpful” for forecasting the behavior of another variable. It is important to note that Granger causality only allows us to make inferences about forecasting capability, not about true causality.
Despite their seeming complexity, VAR models are quite easy to estimate. Each equation can be estimated using ordinary least squares given a few assumptions:
The error term has a conditional mean of zero.
The variables in the model are stationary.
Large outliers are unlikely.
No perfect multicollinearity.
Under these assumptions, the ordinary least squares estimates:
Will be consistent.
Can be evaluated using traditional t-statistics and p-values.
Can be used to jointly test restrictions across multiple equations.
One of the most important functions of VAR models is to generate forecasts. Forecasts are generated using an iterative algorithm: the one-step-ahead forecast is computed from each estimated equation, that forecast is then treated as an additional observation, and the process is repeated to build up forecasts at longer horizons.
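A minimal sketch of how such forecasts can be produced with the vars package, not part of the original analysis; it assumes a model has already been fitted with VAR(), such as the EST object estimated later in this section.
#Iterative h-step-ahead forecasts from a fitted vars::VAR object
var_fc <- predict(EST, n.ahead = 8, ci = 0.95)
var_fc            # point forecasts with lower/upper bounds for each equation
fanchart(var_fc)  # fan chart of the forecasts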
Often we are more interested in the dynamics that are predicted by our VAR models than in the actual coefficients that are estimated. For this reason, VAR studies most commonly report Granger-causality statistics, impulse response functions, and forecast error variance decompositions.
As we previously discussed, Granger-causality statistics test whether one variable is statistically significant when predicting another variable. In other words, the Granger-causality statistics are F-statistics that test whether the coefficients of all lags of a variable are jointly equal to zero in the equation for another variable. As the p-value of the F-statistic decreases, the evidence that a variable is relevant for predicting another variable increases.
For example, in the Granger-causality test of x on y, if the p-value is 0.02 we would say that x does help predict y at the 5% level. However, if the p-value is 0.3 we would say that there is no evidence that x helps predict y.
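A minimal sketch of a pairwise Granger test with lmtest::grangertest(), using hypothetical simulated series x and y so that the expected outcome is known by construction:
library(lmtest)
set.seed(42)
x <- rnorm(200)
y <- c(0, 0.5 * head(x, -1)) + rnorm(200)   # y depends on lagged x by construction
grangertest(y ~ x, order = 2)   # expected: small p-value, x helps predict y
grangertest(x ~ y, order = 2)   # reverse direction, expected to be insignificant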
The impulse response function traces the dynamic path of the variables in the system in response to a shock to one of the variables: one variable is shocked by one standard deviation (of its orthogonalized error), and the effect on every variable in the system is traced over subsequent periods.
Forecast error decomposition separates the forecast error variance into proportions attributed to each variable in the model. Intuitively, this measure helps us judge how much of an impact one variable has on another variable in the VAR model and how intertwined our variables’ dynamics are. For example, if X is responsible for 85% of the forecast error variance of Y, it is explaining a large amount of the forecast variation in Y. However, if X is only responsible for 20% of the forecast error variance of Y, much of the forecast error variance of Y is left unexplained by X.
VAR models are an essential component of multivariate time series modeling. After today’s blog, you should have a better understanding of the fundamentals of the VAR model, including specification and lag selection, estimation, Granger causality, impulse response functions, and forecast error variance decomposition.
var<-read.csv("C:\\Users\\user\\Downloads\\estimation.csv")
head(var,10)
Inflation<-ts(var$Inflation,start=c(2000,1),frequency = 12)
plot.ts(Inflation,type="l",main="Time Series plot Inflation",xlab="Year",ylab="inflation")
Time series plot of inflation
unemployment<-ts(var$unemployment,start=c(2000,1),frequency = 12)
plot.ts(unemployment,type="l",main="Time Series plot unemployment",xlab="Year",ylab="unemployment")
Time Series plot unemployment
The Augmented Dickey-Fuller test (ADF test) is a common statistical test used to determine whether a given time series is stationary. It is one of the most commonly used statistical tests for analyzing the stationarity of a series.
The null hypothesis, however, is the same as in the Dickey-Fuller test. A key point to remember is that the null hypothesis assumes the presence of a unit root (alpha = 1), so the p-value obtained should be less than the significance level (say 0.05) in order to reject the null hypothesis and infer that the series is stationary. A very common mistake analysts make with this test is to take a p-value below the significance level as evidence that the series is non-stationary, when it in fact indicates stationarity.
ADF<-adf.test(Inflation)
Warning in adf.test(Inflation): p-value smaller than printed p-value
ADF
Augmented Dickey-Fuller Test
data: Inflation
Dickey-Fuller = -5.7974, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
ADF1<-adf.test(unemployment)
Warning in adf.test(unemployment): p-value smaller than printed p-value
ADF1
Augmented Dickey-Fuller Test
data: unemployment
Dickey-Fuller = -4.0617, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
lag<-VARselect(var, lag.max=4)
lag
$selection
AIC(n) HQ(n) SC(n) FPE(n)
1 1 1 1
$criteria
1 2 3 4
AIC(n) -0.093318151 -0.08006220 -0.08372238 -0.07044136
HQ(n) -0.052691480 -0.01235108 0.01107318 0.05143865
SC(n) 0.007032298 0.08718855 0.15042866 0.23060999
FPE(n) 0.910908005 0.92307937 0.91974245 0.93210286
The four criteria above indicate that the appropriate number of lags is one; including more than one lag in the model would be inappropriate.
EST<-VAR(var, p=1, type = "none")
summary(EST)
VAR Estimation Results:
=========================
Endogenous variables: Inflation, unemployment
Deterministic variables: none
Sample size: 199
Log Likelihood: -549.51
Roots of the characteristic polynomial:
0.43 0.43
Call:
VAR(y = var, p = 1, type = "none")
Estimation results for equation Inflation:
==========================================
Inflation = Inflation.l1 + unemployment.l1
Estimate Std. Error t value Pr(>|t|)
Inflation.l1 0.25493 0.06887 3.702 0.000278 ***
unemployment.l1 -0.05589 0.04771 -1.171 0.242875
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9565 on 197 degrees of freedom
Multiple R-Squared: 0.06981, Adjusted R-squared: 0.06037
F-statistic: 7.392 on 2 and 197 DF, p-value: 0.0008023
Estimation results for equation unemployment:
=============================================
unemployment = Inflation.l1 + unemployment.l1
Estimate Std. Error t value Pr(>|t|)
Inflation.l1 0.58693 0.07050 8.325 0.000000000000014 ***
unemployment.l1 0.59648 0.04885 12.211 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9792 on 197 degrees of freedom
Multiple R-Squared: 0.5342, Adjusted R-squared: 0.5295
F-statistic: 113 on 2 and 197 DF, p-value: < 0.00000000000000022
Covariance matrix of residuals:
Inflation unemployment
Inflation 0.9104 -0.0391
unemployment -0.0391 0.9578
Correlation matrix of residuals:
Inflation unemployment
Inflation 1.00000 -0.04187
unemployment -0.04187 1.00000
#Use stargazer to organize the results
stargazer(EST[["varresult"]], type='text')
===========================================================
Dependent variable:
----------------------------
y
(1) (2)
-----------------------------------------------------------
Inflation.l1 0.255*** 0.587***
(0.069) (0.071)
unemployment.l1 -0.056 0.596***
(0.048) (0.049)
-----------------------------------------------------------
Observations 199 199
R2 0.070 0.534
Adjusted R2 0.060 0.530
Residual Std. Error (df = 197) 0.957 0.979
F Statistic (df = 2; 197) 7.392*** 112.979***
===========================================================
Note: *p<0.1; **p<0.05; ***p<0.01
In practice, you usually do not interpret the individual coefficients of a VAR model, and there is a fairly intuitive reason for that. As noted above, VARs assume that all relevant variables affect each other through time as one system, so much so that in practice a VAR estimates as many equations as there are variables. For example, if you estimate a 3-variable VAR, the output consists of 3 estimated equations. This means that each variable gets the “opportunity” of being endogenous while the others act as predictors. Since the VAR assumes this, interpreting individual coefficients would imply that an x-variable is causally affecting the y-variable, which, by construction, a VAR does not assume. That is why most analyses skip coefficient interpretation and jump to other tools, such as impulse-response functions and their graphs.
Strictly speaking, the coefficients of a reduced-form VAR have the same interpretation as the coefficients in any regression, i.e., they are the estimated partial derivatives of the dependent variable with respect to the lagged explanatory variables. However, we are typically not interested in these coefficients, but rather in the dynamic properties of the whole system. After all, a VAR is intended to summarize the dynamic behavior of a group of variables, all of which are endogenous with respect to one another. So, we typically focus on the impulse response functions (IRFs) of the VAR, which provide the (cumulative) total derivatives of the endogenous variables with respect to an exogenous shock to one of the variables. More technically, calculating IRFs is equivalent to inverting the vector autoregressive representation to obtain the vector moving average representation. Note that doing so requires the autoregressive representation to be invertible, meaning no unit roots in the underlying data.
However, there is an additional complication, which gets to the root of the difference between VARs and the linear simultaneous equation models of the 1970s that they replaced. Since only lags appear as regressors in a VAR, any contemporaneous correlations between variables are captured in the residual covariance matrix. But IRFs become invalid when the errors are correlated, since the IRF calculation explicitly assumes that only one variable is being shocked at t=0. Therefore, in addition to the reduced-form coefficient matrix being invertible, we require a means by which the residuals from the VAR can be rendered orthogonal (uncorrelated) with respect to one another. This is how we get from a reduced-form to a structural VAR whose IRFs are valid. Unfortunately, the orthogonality requirement alone is not sufficient to produce a unique structural representation. In other words, there exist infinitely many different structural representations of the VAR that (a) are consistent with the estimated reduced-form parameters of the VAR, and (b) possess orthogonal shocks. One popular example is the Cholesky factorization of the residual covariance matrix.
To summarize, the IRFs of a VAR are much easier to interpret than the parameters of the reduced-form VAR. But calculating valid IRFs requires (a) stationary (I(0)) endogenous variables (or cointegration of the variables) and (b) some form of factorization of the residual covariance matrix that renders the resulting shocks orthogonal with respect to one another. There are other more subtle requirements, such as linearity of the underlying data generating process, but these are the two big ones.
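To make the factorization idea concrete, here is a small sketch, not part of the original analysis, that computes a Cholesky factor from the reduced-form residuals of the EST model fitted above; vars::irf() performs an orthogonalization of this kind internally when ortho = TRUE, its default.
#Cholesky factor of the VAR residual covariance matrix
Sigma <- cov(residuals(EST))
P <- t(chol(Sigma))             # lower-triangular factor: P %*% t(P) = Sigma
P
round(P %*% t(P) - Sigma, 10)   # numerically zero: factorization check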
roots(EST, modulus=TRUE)
[1] 0.4299625 0.4299625
The model is stable since the roots lie inside the unit circle, that is, their moduli are less than one.
GRA<-causality(EST, cause = "Inflation")
GRA
$Granger
Granger causality H0: Inflation do not Granger-cause unemployment
data: VAR object EST
F-Test = 69.304, df1 = 1, df2 = 394, p-value = 0.000000000000001332
$Instant
H0: No instantaneous causality between: Inflation and unemployment
data: VAR object EST
Chi-squared = 0.38572, df = 1, p-value = 0.5346
GRA1<-causality(EST, cause = "unemployment")
GRA1
$Granger
Granger causality H0: unemployment do not Granger-cause Inflation
data: VAR object EST
F-Test = 1.372, df1 = 1, df2 = 394, p-value = 0.2422
$Instant
H0: No instantaneous causality between: unemployment and Inflation
data: VAR object EST
Chi-squared = 0.38572, df = 1, p-value = 0.5346
From the analysis, it was established that inflation Granger-causes unemployment; however, unemployment does not Granger-cause inflation.
IRF1<-irf(EST, impulse = "Inflation", response = "unemployment",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
IRF1
Impulse response coefficients
$Inflation
unemployment
[1,] -0.0431533864513
[2,] 0.5356735487953
[3,] 0.4640598613174
[4,] 0.2960802457858
[5,] 0.1662984086264
[6,] 0.0868538042711
[7,] 0.0432056980416
[8,] 0.0207296508563
[9,] 0.0096622621431
[10,] 0.0043943820861
[11,] 0.0019552162997
[12,] 0.0008523272496
[13,] 0.0003642305110
[14,] 0.0001525446726
[15,] 0.0000625448388
[16,] 0.0000250512251
[17,] 0.0000097665459
[18,] 0.0000036842516
[19,] 0.0000013313196
[20,] 0.0000004524105
[21,] 0.0000001390725
Lower Band, CI= 0.95
$Inflation
unemployment
[1,] -0.169171500436
[2,] 0.376654126141
[3,] 0.279313868516
[4,] 0.145139507875
[5,] 0.063398901744
[6,] 0.021161699257
[7,] -0.005481819418
[8,] -0.013567116845
[9,] -0.011108438313
[10,] -0.007477150215
[11,] -0.004163985660
[12,] -0.002051689657
[13,] -0.001004384246
[14,] -0.000461231348
[15,] -0.000180387432
[16,] -0.000080200060
[17,] -0.000015417320
[18,] -0.000006248758
[19,] -0.000015273511
[20,] -0.000011951222
[21,] -0.000008943788
Upper Band, CI= 0.95
$Inflation
unemployment
[1,] 0.0850139786
[2,] 0.6681237588
[3,] 0.5941520957
[4,] 0.4280824180
[5,] 0.2855338709
[6,] 0.1825943775
[7,] 0.1157773305
[8,] 0.0733199290
[9,] 0.0480346920
[10,] 0.0314476673
[11,] 0.0206537647
[12,] 0.0139051638
[13,] 0.0093772944
[14,] 0.0063337296
[15,] 0.0042801623
[16,] 0.0028938776
[17,] 0.0019362835
[18,] 0.0012880042
[19,] 0.0008568245
[20,] 0.0005700235
[21,] 0.0003792451
plot(IRF1, ylab="Unemployment",
main="Unemployment response to an Inflation shock")
Unemployment response to an inflation shock
Usually, impulse response functions are interpreted along the lines of “a one standard deviation shock to x causes a significant increase (decrease) in y for m periods (determined by the length of the period for which the SE bands stay above or below zero), after which the effect dissipates.” The IRF curve above shows that a one standard deviation shock to inflation causes unemployment to increase for roughly the first three months and then fall back; the effect does not last long, dissipating after about three months.
IRF2<-irf(EST, impulse = "unemployment", response = "Inflation",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF2, ylab="Inflation",
main="Inflation response to an unemployment shock")
Inflation response to an unemployment shock
From the graph above, a one standard deviation shock to unemployment causes inflation to decrease for up to about two months and then to increase. However, the confidence band contains zero, which implies that the effect of an unemployment shock on inflation is not statistically significant. This result is consistent with the Granger-causality results above, which indicated that inflation Granger-causes unemployment but unemployment does not Granger-cause inflation.
IRF3<-irf(EST, impulse = "Inflation", response = "Inflation",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF3, ylab="Inflation",
main="Inflation response to Inflation shock")
Inflation response to Inflation shock
IRF4<-irf(EST, impulse = "unemployment", response = "unemployment",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF4, ylab="unemployment",
main="unemployment response to unemployment shock")
unemployment response to unemployment shock
The variance decomposition indicates the amount of information each variable contributes to the other variables in the autoregression. It determines how much of the forecast error variance of each of the variables can be explained by exogenous shocks to the other variables.
VARD<-fevd(EST, n.ahead=20)
VARD
$Inflation
Inflation unemployment
[1,] 1.0000000 0.000000000
[2,] 0.9969451 0.003054924
[3,] 0.9947479 0.005252059
[4,] 0.9938673 0.006132705
[5,] 0.9935938 0.006406153
[6,] 0.9935202 0.006479764
[7,] 0.9935022 0.006497786
[8,] 0.9934981 0.006501898
[9,] 0.9934972 0.006502784
[10,] 0.9934970 0.006502966
[11,] 0.9934970 0.006503001
[12,] 0.9934970 0.006503008
[13,] 0.9934970 0.006503009
[14,] 0.9934970 0.006503010
[15,] 0.9934970 0.006503010
[16,] 0.9934970 0.006503010
[17,] 0.9934970 0.006503010
[18,] 0.9934970 0.006503010
[19,] 0.9934970 0.006503010
[20,] 0.9934970 0.006503010
$unemployment
Inflation unemployment
[1,] 0.001942077 0.9980579
[2,] 0.182061322 0.8179387
[3,] 0.265135962 0.7348640
[4,] 0.293687415 0.7063126
[5,] 0.302330557 0.6976694
[6,] 0.304683598 0.6953164
[7,] 0.305269961 0.6947300
[8,] 0.305405969 0.6945940
[9,] 0.305435712 0.6945643
[10,] 0.305441899 0.6945581
[11,] 0.305443130 0.6945569
[12,] 0.305443365 0.6945566
[13,] 0.305443408 0.6945566
[14,] 0.305443415 0.6945566
[15,] 0.305443416 0.6945566
[16,] 0.305443417 0.6945566
[17,] 0.305443417 0.6945566
[18,] 0.305443417 0.6945566
[19,] 0.305443417 0.6945566
[20,] 0.305443417 0.6945566
plot(VARD)
Variance decomposition function for inflation and unemployment
my_var<-read.csv("C:\\Users\\user\\Downloads\\my_var.csv")
attach(my_var)
The following objects are masked _by_ .GlobalEnv:
Inflation, Unemployment
The following objects are masked from unemploymnt_data (pos = 7):
Inflation, Unemployment
The following objects are masked from unemploymnt_data (pos = 8):
Inflation, Unemployment
head(my_var,15)
Inflation<-ts(my_var$Inflation,start=c(1960),frequency = 4)
plot.ts(Inflation,type="l",main="Time Series plot Inflation",xlab="Year",ylab="inflation")
Time series plot of inflation
Unemployment<-ts(my_var$Unemployment,start=c(1960),frequency = 4)
plot.ts(Unemployment,type="l",main="Time Series plot Unemployment",xlab="Year",ylab="Unemployment")
Time series plot of unemployment
Fed.Rate<-ts(my_var$Fed.Rate,start=c(1960),frequency = 4)
plot.ts(Fed.Rate,type="l",main="Time Series plot of FedRate",xlab="Year",ylab="FedRate")
Time series plot of FedRate
URT<-adf.test(my_var$Inflation)
URT
Augmented Dickey-Fuller Test
data: my_var$Inflation
Dickey-Fuller = -2.2805, Lag order = 5, p-value = 0.4593
alternative hypothesis: stationary
URT1<-adf.test(my_var$Unemployment)
URT1
Augmented Dickey-Fuller Test
data: my_var$Unemployment
Dickey-Fuller = -2.0581, Lag order = 5, p-value = 0.5521
alternative hypothesis: stationary
URT2<-adf.test(my_var$Fed.Rate)
URT2
Augmented Dickey-Fuller Test
data: my_var$Fed.Rate
Dickey-Fuller = -3.3068, Lag order = 5, p-value = 0.07246
alternative hypothesis: stationary
lag<-VARselect(my_var, lag.max=4)
lag
$selection
AIC(n) HQ(n) SC(n) FPE(n)
3 3 2 3
$criteria
1 2 3 4
AIC(n) -2.6746095 -3.15420316 -3.28434359 -3.25963208
HQ(n) -2.5809554 -2.99030845 -3.05020830 -2.95525620
SC(n) -2.4439714 -2.75058659 -2.70774850 -2.51005846
FPE(n) 0.0689359 0.04267955 0.03748351 0.03844391
VAR_mdl<-VAR(my_var, p=3, type = "none")
VAR_mdl
VAR Estimation Results:
=======================
Estimated coefficients for equation Unemployment:
=================================================
Call:
Unemployment = Unemployment.l1 + Inflation.l1 + Fed.Rate.l1 + Unemployment.l2 + Inflation.l2 + Fed.Rate.l2 + Unemployment.l3 + Inflation.l3 + Fed.Rate.l3
Unemployment.l1 Inflation.l1 Fed.Rate.l1 Unemployment.l2 Inflation.l2
1.540461079 0.036092543 0.015635976 -0.567700896 -0.026670658
Fed.Rate.l2 Unemployment.l3 Inflation.l3 Fed.Rate.l3
0.034688923 -0.011735675 -0.003682298 -0.019977419
Estimated coefficients for equation Inflation:
==============================================
Call:
Inflation = Unemployment.l1 + Inflation.l1 + Fed.Rate.l1 + Unemployment.l2 + Inflation.l2 + Fed.Rate.l2 + Unemployment.l3 + Inflation.l3 + Fed.Rate.l3
Unemployment.l1 Inflation.l1 Fed.Rate.l1 Unemployment.l2 Inflation.l2
-1.01182762 0.68232932 0.14845919 1.56424639 0.09094890
Fed.Rate.l2 Unemployment.l3 Inflation.l3 Fed.Rate.l3
-0.06004687 -0.57197523 0.18422864 -0.05318817
Estimated coefficients for equation Fed.Rate:
=============================================
Call:
Fed.Rate = Unemployment.l1 + Inflation.l1 + Fed.Rate.l1 + Unemployment.l2 + Inflation.l2 + Fed.Rate.l2 + Unemployment.l3 + Inflation.l3 + Fed.Rate.l3
Unemployment.l1 Inflation.l1 Fed.Rate.l1 Unemployment.l2 Inflation.l2
-1.5314206 0.0541425 0.9752619 1.3715074 0.2532385
Fed.Rate.l2 Unemployment.l3 Inflation.l3 Fed.Rate.l3
-0.3649440 0.1113407 -0.1461168 0.3350045
stargazer(VAR_mdl[["varresult"]], type='text')
====================================================================
Dependent variable:
-------------------------------------
y
(1) (2) (3)
--------------------------------------------------------------------
Unemployment.l1 1.540*** -1.012** -1.531***
(0.088) (0.389) (0.343)
Inflation.l1 0.036** 0.682*** 0.054
(0.018) (0.079) (0.070)
Fed.Rate.l1 0.016 0.148 0.975***
(0.022) (0.096) (0.084)
Unemployment.l2 -0.568*** 1.564** 1.372**
(0.147) (0.646) (0.570)
Inflation.l2 -0.027 0.091 0.253***
(0.022) (0.096) (0.085)
Fed.Rate.l2 0.035 -0.060 -0.365***
(0.030) (0.133) (0.117)
Unemployment.l3 -0.012 -0.572 0.111
(0.086) (0.376) (0.332)
Inflation.l3 -0.004 0.184** -0.146**
(0.019) (0.084) (0.074)
Fed.Rate.l3 -0.020 -0.053 0.335***
(0.022) (0.098) (0.087)
--------------------------------------------------------------------
Observations 161 161 161
R2 0.999 0.955 0.986
Adjusted R2 0.999 0.952 0.985
Residual Std. Error (df = 152) 0.231 1.015 0.895
F Statistic (df = 9; 152) 12,729.890*** 356.559*** 1,193.802***
====================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
roots(VAR_mdl, modulus=TRUE)
[1] 0.99513614 0.94079541 0.76174764 0.76174764 0.59992529 0.59992529 0.38418659
[8] 0.38418659 0.08928642
#GRANGER CAUSALITY TEST
GrI<-causality(VAR_mdl, cause = "Inflation")
GrI
$Granger
Granger causality H0: Inflation do not Granger-cause Unemployment
Fed.Rate
data: VAR object VAR_mdl
F-Test = 6.0224, df1 = 6, df2 = 456, p-value = 0.000004455
$Instant
H0: No instantaneous causality between: Inflation and Unemployment
Fed.Rate
data: VAR object VAR_mdl
Chi-squared = 4.219, df = 2, p-value = 0.1213
GrU<-causality(VAR_mdl, cause = "Unemployment")
GrU
$Granger
Granger causality H0: Unemployment do not Granger-cause Inflation
Fed.Rate
data: VAR object VAR_mdl
F-Test = 5.6633, df1 = 6, df2 = 456, p-value = 0.00001088
$Instant
H0: No instantaneous causality between: Unemployment and Inflation
Fed.Rate
data: VAR object VAR_mdl
Chi-squared = 23.73, df = 2, p-value = 0.000007031
GrF<-causality(VAR_mdl, cause = "Fed.Rate")
GrF
$Granger
Granger causality H0: Fed.Rate do not Granger-cause Unemployment
Inflation
data: VAR object VAR_mdl
F-Test = 3.2328, df1 = 6, df2 = 456, p-value = 0.004033
$Instant
H0: No instantaneous causality between: Fed.Rate and Unemployment
Inflation
data: VAR object VAR_mdl
Chi-squared = 26.054, df = 2, p-value = 0.0000022
From the analysis, it was established that inflation, unemployment and the fed rate all Granger-cause one another.
IRF_1<-irf(VAR_mdl, impulse = "Inflation", response = "Unemployment",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF_1, ylab="Unemployment",
main="Unemployment response to an Inflation shock")
Unemployment response to an inflation shock
VARD_1<-fevd(VAR_mdl, n.ahead=5)
VARD_1
$Unemployment
Unemployment Inflation Fed.Rate
[1,] 1.0000000 0.000000000 0.0000000000
[2,] 0.9906356 0.008466579 0.0008978494
[3,] 0.9689308 0.017920429 0.0131487360
[4,] 0.9348733 0.028723840 0.0364029017
[5,] 0.8903648 0.046527944 0.0631072218
$Inflation
Unemployment Inflation Fed.Rate
[1,] 0.001789413 0.9982106 0.000000000
[2,] 0.062362243 0.9290021 0.008635625
[3,] 0.092268984 0.8917762 0.015954800
[4,] 0.100141867 0.8864991 0.013359037
[5,] 0.105821679 0.8826761 0.011502228
$Fed.Rate
Unemployment Inflation Fed.Rate
[1,] 0.1722401 0.02083114 0.8069287
[2,] 0.3326729 0.02515215 0.6421749
[3,] 0.4415973 0.06171751 0.4966851
[4,] 0.5042660 0.07830998 0.4174240
[5,] 0.5349494 0.08422492 0.3808256
plot(VARD_1)
Variance decomposition function
Generalized Auto-Regressive Conditional Heteroskedasticity (GARCH) is a statistical model used in analyzing time-series data where the variance of the error term is believed to be serially autocorrelated. GARCH models assume that the variance of the error term follows an autoregressive moving average process.
Although GARCH models can be used in the analysis of a number of different types of financial data, such as macroeconomic data, financial institutions typically use them to estimate the volatility of returns for stocks, bonds, and market indices. They use the resulting information to help determine pricing and judge which assets will potentially provide higher returns, as well as to forecast the returns of current investments to help in their asset allocation, hedging, risk management, and portfolio optimization decisions.
GARCH models are used when the variance of the error term is not constant. That is, the error term is heteroskedastic. Heteroskedasticity describes the irregular pattern of variation of an error term, or variable, in a statistical model. Essentially, wherever there is heteroskedasticity, observations do not conform to a linear pattern. Instead, they tend to cluster. Therefore, if statistical models that assume constant variance are used on this data, then the conclusions and predictive value one can draw from the model will not be reliable.
The variance of the error term in GARCH models is assumed to vary systematically, conditional on the average size of the error terms in previous periods. In other words, it has conditional heteroskedasticity, and the reason for the heteroskedasticity is that the error term is following an auto-regressive moving average pattern. This means that it is a function of an average of its own past values.
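To illustrate the mechanism, the following is a short, purely illustrative simulation of a GARCH(1,1) process in base R: the conditional variance at time t depends on the previous squared shock and the previous conditional variance (the parameter values are made up for the sketch).
#Illustrative GARCH(1,1) simulation: sigma2[t] = omega + alpha*e[t-1]^2 + beta*sigma2[t-1]
set.seed(123)
n <- 1000
omega <- 0.00001; alpha <- 0.10; beta <- 0.85   # illustrative parameters
e <- numeric(n); sigma2 <- numeric(n)
sigma2[1] <- omega / (1 - alpha - beta)         # unconditional variance as starting value
e[1] <- sqrt(sigma2[1]) * rnorm(1)
for (t in 2:n) {
  sigma2[t] <- omega + alpha * e[t - 1]^2 + beta * sigma2[t - 1]
  e[t] <- sqrt(sigma2[t]) * rnorm(1)
}
plot.ts(e, main = "Simulated GARCH(1,1) returns: volatility clustering")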
GARCH was developed in 1986 by Dr. Tim Bollerslev, a doctoral student at the time, as a way to address the problem of forecasting volatility in asset prices. It built on economist Robert Engle’s breakthrough 1982 work introducing the Autoregressive Conditional Heteroskedasticity (ARCH) model. His model assumed that the variation of financial returns is not constant over time but is autocorrelated, or conditional on and dependent on its own past. For instance, one can see this in stock returns, where periods of volatility tend to be clustered together.
Since the original introduction, many variations of GARCH have emerged. These include Nonlinear (NGARCH), which addresses correlation and observed “volatility clustering” of returns, and Integrated GARCH (IGARCH), which restricts the volatility parameter. All the GARCH model variations seek to incorporate the direction, positive or negative, of returns in addition to the magnitude (addressed in the original model).
Each derivation of GARCH can be used to accommodate the specific qualities of the stock, industry, or economic data. When assessing risk, financial institutions incorporate GARCH models into their Value-at-Risk (VaR) estimates, i.e., the maximum expected loss (whether for a single investment or trading position, a portfolio, or at a division or firm-wide level) over a specified time period. GARCH models are viewed as providing better gauges of risk than can be obtained through tracking standard deviation alone.
Various studies have been conducted on the reliability of various GARCH models during different market conditions, including during the periods leading up to and after the Great Recession.
We are going to download the Apple stock price data from the internet using the command below.
apple<-getSymbols("AAPL",
from="2007-01-01",
to="2022-12-31")
plot(AAPL$AAPL.Close["2020"])
Plot of AAPL closing prices for 2020
head(AAPL,10)
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2007-01-03 3.081786 3.092143 2.925000 2.992857 1238319600 2.551165
2007-01-04 3.001786 3.069643 2.993571 3.059286 847260400 2.607790
2007-01-05 3.063214 3.078571 3.014286 3.037500 834741600 2.589220
2007-01-08 3.070000 3.090357 3.045714 3.052500 797106800 2.602005
2007-01-09 3.087500 3.320714 3.041071 3.306071 3349298400 2.818155
2007-01-10 3.383929 3.492857 3.337500 3.464286 2952880000 2.953020
2007-01-11 3.426429 3.456429 3.396429 3.421429 1440252800 2.916488
2007-01-12 3.378214 3.395000 3.329643 3.379286 1312690400 2.880563
2007-01-16 3.417143 3.473214 3.408929 3.467857 1244076400 2.956063
2007-01-17 3.484286 3.485714 3.386429 3.391071 1646260000 2.890609
tail(AAPL,10)
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2022-11-18 152.31 152.70 149.97 151.29 74794600 151.29
2022-11-21 150.16 150.37 147.72 148.01 58724100 148.01
2022-11-22 148.13 150.42 146.93 150.18 51804100 150.18
2022-11-23 149.45 151.83 149.34 151.07 58301400 151.07
2022-11-25 148.31 148.88 147.12 148.11 35195900 148.11
2022-11-28 145.14 146.64 143.38 144.22 69246000 144.22
2022-11-29 144.29 144.81 140.35 141.17 83763800 141.17
2022-11-30 141.40 148.72 140.55 148.03 111224400 148.03
2022-12-01 148.21 149.13 146.61 148.31 71250400 148.31
2022-12-02 145.96 148.00 145.65 147.81 65421400 147.81
chartSeries(AAPL)
Chart series of AAPL prices
chartSeries(AAPL["2020-12"])
Chart series showing AAPL prices for December 2020
Note that the yellow candlesticks indicate that AAPL experienced a lower closing price than its opening price. On the other hand, the green candlesticks show that the company experienced a higher closing price than its opening price.
chartSeries(AAPL["2008-12"])
Chart series showing AAPL prices for December 2008
Returns<-CalculateReturns(AAPL$AAPL.Close)
head(Returns,10)
AAPL.Close
2007-01-03 NA
2007-01-04 0.022195848
2007-01-05 -0.007121269
2007-01-08 0.004938272
2007-01-09 0.083069943
2007-01-10 0.047855899
2007-01-11 -0.012371092
2007-01-12 -0.012317368
2007-01-16 0.026209975
2007-01-17 -0.022142205
Returns<-Returns[-1]
head(Returns,10)
AAPL.Close
2007-01-04 0.022195848
2007-01-05 -0.007121269
2007-01-08 0.004938272
2007-01-09 0.083069943
2007-01-10 0.047855899
2007-01-11 -0.012371092
2007-01-12 -0.012317368
2007-01-16 0.026209975
2007-01-17 -0.022142205
2007-01-18 -0.061927338
hist(Returns, breaks = 15)
Histogram for the returns
chart.Histogram(Returns,
methods = c('add.density','add.normal'),
colorset = c('blue','green','red'))
Histogram showing the distribution of returns superimposed with the normal curve
chartSeries(Returns,theme = 'white')
Volatility chart series
The graph above shows that there is volatility in the series but no seasonality. High volatility is seen in 2008, specifically after the stock market crash in the United States triggered by the fall in housing prices. The graph also shows other periods of higher variability in the returns.
We can capture volatility in the market with the help of standard deviation. First, we calculate the standard deviation of our series as shown below
sd(Returns)
[1] 0.02038514
The return’s daily standard deviation is approximately 2.04%. To annualize it, assuming 252 trading days in a year, we multiply this standard deviation by the square root of 252.
sqrt(252)*sd(Returns)
[1] 0.3236041
Generally speaking, the average volatility in returns is approximately 32.36% per year.
sqrt(252)*sd(Returns["2008"])
[1] 0.5820481
Volatility in returns for the Apple company in the year 2008 was approximately 58.20%. Volatility measures how quickly returns on an investment change due to factors such as economic or political shocks. Compare the volatility across the years 2008, 2015 and 2020: in which year did AAPL experience the highest volatility in returns? (A compact way to compute all three at once is shown after the per-year calculations below.)
sqrt(252)*sd(Returns["2015"])
[1] 0.2673398
Volatility in returns for the Apple company in the year 2015 was approximately 26.73%.
sqrt(252)*sd(Returns["2020"])
[1] 0.4665857
Volatility in returns for the Apple company in the year 2020 was approximately 46.66%.
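A compact way to make the comparison asked for above is to compute the annualized volatility for all three years in one call; a small sketch:
# annualized volatility for several years in one call
sapply(c("2008", "2015", "2020"),
       function(y) sqrt(252) * sd(Returns[y]))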
chart.RollingPerformance(R=Returns["2007::2022"],
width = 22,
FUN = "sd.annualized",
scale=252,
main = "Apple's monthly rolling volatility")
Apple’s monthly rolling volatility
chart.RollingPerformance(R=Returns["2007::2022"],
width = 6,
FUN = "sd.annualized",
scale=252,
main = "Apple's monthly rolling volatility")
Apple’s weekly rolling volatility
chart.RollingPerformance(R=Returns["2007::2022"],
width = 252,
FUN = "sd.annualized",
scale=252,
main = "Apple's monthly rolling volatility")
Apple’s Annual rolling volatility
Before we fit the model, we create and store the model specification. This is done as follows;
s<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
variance.model = list(model="sGARCH"),
distribution.model = 'norm')
m<- ugarchfit(data=Returns,spec = s)
m
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : sGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : norm
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001951 0.000250 7.7928 0
omega 0.000014 0.000002 8.1685 0
alpha1 0.108084 0.004180 25.8595 0
beta1 0.857895 0.008443 101.6101 0
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001951 0.000352 5.5489 0.000000
omega 0.000014 0.000005 2.5905 0.009585
alpha1 0.108084 0.023389 4.6212 0.000004
beta1 0.857895 0.014386 59.6333 0.000000
LogLikelihood : 10385.77
Information Criteria
------------------------------------
Akaike -5.1805
Bayes -5.1742
Shibata -5.1805
Hannan-Quinn -5.1783
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 0.8287 0.3626
Lag[2*(p+q)+(p+q)-1][2] 1.3710 0.3921
Lag[4*(p+q)+(p+q)-1][5] 4.4402 0.2040
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.1499 0.6986
Lag[2*(p+q)+(p+q)-1][5] 1.4674 0.7479
Lag[4*(p+q)+(p+q)-1][9] 2.6763 0.8110
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.7406 0.500 2.000 0.3895
ARCH Lag[5] 2.4204 1.440 1.667 0.3856
ARCH Lag[7] 2.5880 2.315 1.543 0.5945
Nyblom stability test
------------------------------------
Joint Statistic: 14.7015
Individual Statistics:
mu 0.1656
omega 2.9902
alpha1 0.3782
beta1 0.4481
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.07 1.24 1.6
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 0.8319 0.40553
Negative Sign Bias 1.4287 0.15316
Positive Sign Bias 0.7851 0.43247
Joint Effect 9.7816 0.02052 **
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 139.0 0.00000000000000000002877
2 30 153.7 0.00000000000000000062105
3 40 177.0 0.00000000000000000017618
4 50 173.4 0.00000000000000082932719
Elapsed time : 0.529916
From the results above, the fitted model is the standard sGARCH(1,1). Remember we are using a constant mean model, ARFIMA(0,0,0), with normally distributed errors. The model has four parameters: mu, omega, alpha1 and beta1. From the output above, all of the parameter estimates are statistically significant.
We can write the mean equation in R Markdown as follows; \[ Returns_{t}= \mu + e_{t} \] With the estimated mean, the equation becomes \[ Returns_{t}= 0.001951 + e_{t} \] Remember the mean is constant. On the other hand, the equation for the conditional variance at time t is given as follows; \[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 + \beta \sigma_{t-1}^2 \]
If we replace the value from the output above into our equation, the new equation will be as shown below.
\[ \sigma_{t}^2 = 0.000014 + 0.108084\, e_{t-1}^2 + 0.857895\, \sigma_{t-1}^2 \]
The model above is our GARCH variance equation. Remember that if we drop the beta term, the model reduces to an ARCH(1) model, which is as shown below; \[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 \]
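As a quick sanity check (an added illustration, not part of the original output), the fitted coefficients imply a persistence of alpha + beta and a long-run daily variance of omega/(1 - alpha - beta); annualizing the implied long-run standard deviation roughly reproduces the 32% annual volatility computed earlier.
# implied persistence and long-run volatility from the sGARCH(1,1) estimates
omega <- 0.000014
alpha <- 0.108084
beta  <- 0.857895
persistence <- alpha + beta               # about 0.966, close to but below 1
longrun_var <- omega / (1 - persistence)  # unconditional daily variance
sqrt(252 * longrun_var)                   # about 0.32, i.e. roughly 32% per year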
This output also provides a lot of other useful information. One of the most useful pieces is the set of information criteria: the Akaike, Bayes, Hannan-Quinn and Shibata information criteria. Information criteria help us choose the best model without choosing an unnecessarily complex one, since they penalize additional parameters. The model with the lowest information criteria is preferred, so we shall choose the model with the lowest values.
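The rugarch package exposes these values through the infocriteria() accessor, which is convenient when comparing several fitted models:
infocriteria(m)   # Akaike, Bayes, Shibata and Hannan-Quinn for the fitted model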
Another important piece of information is the weighted Ljung-Box test on the standardized residuals. The null hypothesis is that there is no serial correlation. From the output above, there is insufficient evidence to reject the null hypothesis, so we conclude that there is no serial correlation in the standardized residuals. Likewise, there is no serial correlation in the standardized squared residuals; they behave more or less like a white noise process. This supports the validity of our GARCH model.
One thing we notice, however, is that the p-values of the adjusted Pearson goodness-of-fit test are all less than 0.05, indicating that the normal distribution we assumed for the residuals is not really a good choice. If the distribution were a good fit, the goodness-of-fit p-values would be greater than 0.05. Thus, there is scope to improve our model.
plot(m,which ='all')
The 12 plots for the distribution of residuals
The first plot shows the series with two conditional standard deviations superimposed; the red lines are drawn at plus and minus two standard deviations and therefore cover roughly 95% of the returns. The second plot shows the series with a 1% value-at-risk limit, so the red lines are drawn at the 1% level. The third plot shows the conditional standard deviation against the absolute returns; the blue line is the conditional standard deviation. The fourth plot shows the autocorrelation of the observations. In the fifth plot there is significant autocorrelation, as shown by the bars rising above the red bounds, and the sixth plot shows a similar situation. The seventh plot shows the cross-correlation between the squared observations and the actual observations. The eighth plot shows the histogram of the residuals; its tails deviate from the normal distribution, which is confirmed by the normal Q-Q plot, i.e. the standardized residuals deviate significantly from the normal distribution in both tails. The remaining plots show the autocorrelation of the standardized and squared standardized residuals, where we can see a clear improvement: essentially all values lie within the red bounds.
We shall therefore use this model to make our forecast, and we will store the forecast in the object f.
f<-ugarchforecast(fitORspec = m,
n.ahead = 20)
plot(fitted(f))
Fitted plot
Remember this model was estimated with a constant mean; in other words, the mean prediction will be constant. We can also plot the forecast volatility using the command below.
plot(sigma(f))
Variability plot
The graph shows that the forecast volatility in returns increases over the next 20 days.
A portfolio allocation example will help guide what percentage of the portfolio should go to a risky asset, such as the stock, and what percentage should go to a risk-free asset. This will be followed by four different variants of the model:
* Model 2: GARCH with the skewed Student's t distribution (sstd)
* Model 3: GJR-GARCH
* Model 4: AR(1) GJR-GARCH
* Model 5: GJR-GARCH in mean
Let us now calculate the annualized volatility using the code below and store the results in V; remember we shall assume that there are 252 trading days.
V<- sqrt(252) * sigma(m)
We shall also create an object W which represents the weight assigned to the risky asset. In this case, we shall use a volatility target of 5%, that is, 0.05.
W<-0.05/V
plot(merge(V,W),
multi.panel=T)
Weight assigned to risky assets
The upper panel shows the annualized volatility and the lower panel shows the weight we would assign to the risky asset. Notice that when volatility is very high, as in 2008 and 2020, the weight assigned to the risky asset is very low; when volatility is low, the weight is relatively high. In other words, we can increase the allocation to the risky asset when volatility is low. View the weights at the tail end of the series below.
tail(W,20)
[,1]
2022-11-04 0.09982105
2022-11-07 0.10669273
2022-11-08 0.11409389
2022-11-09 0.12183158
2022-11-10 0.11730929
2022-11-11 0.08268621
2022-11-14 0.08765183
2022-11-15 0.09345250
2022-11-16 0.09964281
2022-11-17 0.10601643
2022-11-18 0.11246150
2022-11-21 0.12014075
2022-11-22 0.12222863
2022-11-23 0.12846092
2022-11-25 0.13662344
2022-11-28 0.13810258
2022-11-29 0.13475537
2022-11-30 0.13549282
2022-12-01 0.11797602
2022-12-02 0.12593584
Let us assume that Apple has two million dollars ($2,000,000) to allocate on the trading day 2022-08-12, when the weight assigned to the risky asset was approximately 18.3%. In this case the model suggests that the company can allocate the following amount to the risky asset. \[ Amount_{riskyAssets}= 2000000*0.1831701 = 366340.2 \]
The calculation above shows that the company should have allocated $366,340.2 to the risky asset, as suggested by the model, and the remaining amount to the risk-free asset. Suggested risk-free assets include, but are not limited to, bank deposits.
\[ Amount_{freeRisk}=2000000-366340.2= 1633659.8 \]
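The same split can be checked quickly in R; a small sketch, using the capital and the 18.3% weight from the example above:
capital <- 2000000
w_risky <- 0.1831701                 # weight suggested by the model on that day
risky_amount <- capital * w_risky    # about 366340.2
risk_free_amount <- capital - risky_amount
c(risky = risky_amount, risk_free = risk_free_amount)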
Suppose we increase the volatility target assigned to the risky asset to 10%, that is, 0.1. Consider the graph below.
W_1<-0.1/V
plot(merge(V,W_1),
multi.panel=T)
Weight assigned to risky assets
As before, the upper panel shows volatility and the lower panel shows the weight assigned to the risky asset: the weight falls when volatility spikes (2008, 2020) and rises when volatility is low. With the 10% target, however, every weight is twice its value in the previous graph, as the tail end of the series below confirms.
tail(W_1,20)
[,1]
2022-11-04 0.1996421
2022-11-07 0.2133855
2022-11-08 0.2281878
2022-11-09 0.2436632
2022-11-10 0.2346186
2022-11-11 0.1653724
2022-11-14 0.1753037
2022-11-15 0.1869050
2022-11-16 0.1992856
2022-11-17 0.2120329
2022-11-18 0.2249230
2022-11-21 0.2402815
2022-11-22 0.2444573
2022-11-23 0.2569218
2022-11-25 0.2732469
2022-11-28 0.2762052
2022-11-29 0.2695107
2022-11-30 0.2709856
2022-12-01 0.2359520
2022-12-02 0.2518717
Let us again assume that Apple has two million dollars ($2,000,000) on the trading day 2022-08-12, when the weight assigned to the risky asset was approximately 36.6%. In this case the model suggests that the company can allocate the following amount to the risky asset.
2000000*0.3663402
[1] 732680.4
According to this model, the company should therefore allocate $732,680.4 to the risky asset and the remaining amount to the risk-free asset. This allocation only applies if the company sets 10% as the volatility target for the risky asset.
In the previous model we saw that the normal distribution is not a good fit for the residuals. Let us now make use of the skewed Student's t distribution to capture the residual distribution more accurately. We shall edit the specification to use the skewed Student's t distribution.
S<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
variance.model = list(model="sGARCH"),
distribution.model = 'sstd')
M<- ugarchfit(data=Returns,spec = S)
M
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : sGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : sstd
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001646 0.000255 6.4610 0.000000
omega 0.000008 0.000003 2.3373 0.019424
alpha1 0.098038 0.012390 7.9128 0.000000
beta1 0.889328 0.013202 67.3651 0.000000
skew 1.010442 0.021780 46.3936 0.000000
shape 5.039717 0.450983 11.1750 0.000000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001646 0.000257 6.3964 0.0000
omega 0.000008 0.000008 1.0067 0.3141
alpha1 0.098038 0.018950 5.1735 0.0000
beta1 0.889328 0.019562 45.4616 0.0000
skew 1.010442 0.020704 48.8036 0.0000
shape 5.039717 0.708013 7.1181 0.0000
LogLikelihood : 10543.89
Information Criteria
------------------------------------
Akaike -5.2584
Bayes -5.2490
Shibata -5.2584
Hannan-Quinn -5.2551
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 0.8252 0.3637
Lag[2*(p+q)+(p+q)-1][2] 1.3217 0.4047
Lag[4*(p+q)+(p+q)-1][5] 4.5496 0.1930
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.009601 0.9219
Lag[2*(p+q)+(p+q)-1][5] 1.122584 0.8314
Lag[4*(p+q)+(p+q)-1][9] 2.192556 0.8808
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.6271 0.500 2.000 0.4284
ARCH Lag[5] 2.0666 1.440 1.667 0.4567
ARCH Lag[7] 2.3500 2.315 1.543 0.6435
Nyblom stability test
------------------------------------
Joint Statistic: 4.1734
Individual Statistics:
mu 0.19445
omega 0.86119
alpha1 0.77193
beta1 0.90835
skew 0.04426
shape 1.06900
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.49 1.68 2.12
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.0081 0.31348
Negative Sign Bias 1.1927 0.23306
Positive Sign Bias 0.8992 0.36859
Joint Effect 10.3009 0.01617 **
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 28.51 0.07415
2 30 44.57 0.03234
3 40 56.77 0.03277
4 50 57.69 0.18472
Elapsed time : 1.52689
Having fitted the model with the skewed Student's t distribution, let us now look at the 12 diagnostic plots and examine the distribution of the residuals.
plot(M,which ='all')
The 12 plots for residual distribution
The plots above show that this model improved compared to the previous one, as indicated by the distribution of the residuals: in the sstd Q-Q plot the residuals align fairly well with the diagonal line. Besides, the information criteria also improved. Additionally, the skew estimate of approximately 1.01 is close to 1, indicating that the distribution is roughly symmetric.
V_1<- sqrt(252) * sigma(M)
We shall also create an object W_2 which represents the weight assigned to the risky asset. In this case, we again use a volatility target of 5%, that is, 0.05.
W_2<-0.05/V
plot(merge(V_1,W_2),
multi.panel=T)
Weight assigned to risky and non-risky assets
As before, the upper panel shows volatility and the lower panel shows the weight assigned to the risky asset; the weight falls when volatility is high (2008, 2020) and rises when volatility is low. View the weights at the tail end of the series below.
tail(W_2,20)
[,1]
2022-11-04 0.09982105
2022-11-07 0.10669273
2022-11-08 0.11409389
2022-11-09 0.12183158
2022-11-10 0.11730929
2022-11-11 0.08268621
2022-11-14 0.08765183
2022-11-15 0.09345250
2022-11-16 0.09964281
2022-11-17 0.10601643
2022-11-18 0.11246150
2022-11-21 0.12014075
2022-11-22 0.12222863
2022-11-23 0.12846092
2022-11-25 0.13662344
2022-11-28 0.13810258
2022-11-29 0.13475537
2022-11-30 0.13549282
2022-12-01 0.11797602
2022-12-02 0.12593584
Similarly, if we assume that Apple has two million dollars ($2,000,000) on the trading day 2022-08-12, when the weight assigned to the risky asset was approximately 18.3%, the model suggests allocating the following amount to the risky asset and the remainder to the risk-free asset.
2000000*0.1831701
[1] 366340.2
We are going to copy the model used above and modify it so as to estimate the GJR-GARCH model, which was developed by Glosten, Jagannathan and Runkle. Consider the results below.
GJR<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
variance.model = list(model="gjrGARCH"),
distribution.model = 'sstd')
M_1<- ugarchfit(data=Returns,spec = GJR)
M_1
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : sstd
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001439 0.000245 5.8735 0
omega 0.000010 0.000001 8.0910 0
alpha1 0.037620 0.001338 28.1231 0
beta1 0.872453 0.009578 91.0907 0
gamma1 0.141980 0.020040 7.0849 0
skew 1.005834 0.021604 46.5580 0
shape 5.363441 0.387189 13.8522 0
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001439 0.000268 5.3614 0.000000
omega 0.000010 0.000002 5.4874 0.000000
alpha1 0.037620 0.010679 3.5229 0.000427
beta1 0.872453 0.010833 80.5341 0.000000
gamma1 0.141980 0.025049 5.6680 0.000000
skew 1.005834 0.020659 48.6881 0.000000
shape 5.363441 0.491852 10.9046 0.000000
LogLikelihood : 10573.09
Information Criteria
------------------------------------
Akaike -5.2725
Bayes -5.2615
Shibata -5.2725
Hannan-Quinn -5.2686
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 1.770 0.1833
Lag[2*(p+q)+(p+q)-1][2] 2.193 0.2326
Lag[4*(p+q)+(p+q)-1][5] 5.259 0.1337
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.2066 0.6494
Lag[2*(p+q)+(p+q)-1][5] 1.3939 0.7660
Lag[4*(p+q)+(p+q)-1][9] 2.3991 0.8524
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.8719 0.500 2.000 0.3504
ARCH Lag[5] 2.1876 1.440 1.667 0.4312
ARCH Lag[7] 2.4394 2.315 1.543 0.6249
Nyblom stability test
------------------------------------
Joint Statistic: 21.3464
Individual Statistics:
mu 0.50717
omega 5.27138
alpha1 1.50521
beta1 1.49999
gamma1 1.00250
skew 0.03171
shape 1.42563
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.69 1.9 2.35
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.3229 0.1860
Negative Sign Bias 0.6039 0.5460
Positive Sign Bias 0.1287 0.8976
Joint Effect 2.7437 0.4329
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 24.07 0.1936
2 30 37.15 0.1424
3 40 42.72 0.3144
4 50 47.64 0.5284
Elapsed time : 2.497092
We can notice a very big change in the last of the 12 plots, as shown below;
plot(M_1,which ='all')
Plot number 12 changes drastically under the GJR-GARCH model. According to this model, when news has a positive impact on the stock price, the increase in volatility is gradual; when news has a negative impact, the effect on volatility is significantly larger. Therefore, as per this model, the impact of news on the stock is not symmetric: negative news raises volatility more than positive news of the same size.
The model parameters are all significant. The mean equation of this model is written as follows:
\[ Returns_{t} = \mu + e_{t} \]
When we plug in the estimated mean, the model becomes: \[ Returns_{t} = 0.001439 + e_{t} \]
Remember that in this model e_t has mean zero and conditional variance sigma_t squared (here with a skewed Student's t distribution). For the conditional variance we now have two regimes, one for negative news and one for positive news. When the previous shock is negative (e_{t-1} < 0), the variance equation is \[ \sigma_{t}^2 = \omega + (\alpha + \gamma)e_{t-1}^2 + \beta\sigma_{t-1}^2 \]
Plugging in the estimates from the output above gives \[ \sigma_{t}^2 = 0.000010 + (0.037620 + 0.141980)e_{t-1}^2 + 0.872453\, \sigma_{t-1}^2 \]
However, when the previous shock is positive (e_{t-1} > 0), the gamma term drops out and we are back to the standard model seen earlier,
\[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 + \beta\sigma_{t-1}^2 \]
which, with the estimated values, becomes
\[ \sigma_{t}^2 = 0.000010 + 0.037620\, e_{t-1}^2 + 0.872453\, \sigma_{t-1}^2 \]
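One way to quantify the asymmetry described above is to compare the coefficient on a negative shock, alpha + gamma, with the coefficient on a positive shock, alpha. A minimal sketch using the estimates from the output above:
# news-impact asymmetry implied by the gjrGARCH(1,1) estimates
alpha <- 0.037620
gamma <- 0.141980
(alpha + gamma) / alpha   # about 4.8: a negative shock raises next-day variance
                          # roughly 4.8 times as much as a positive shock of the same size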
Generally, the third model is better than the first two models. To validate this claim we shall focus on the following:
* the skewness statistic
* the information criteria
The skew estimate for this model is closer to one than in the previous models, and its information criteria are lower than those of the other two models. This is enough evidence to show that this model performs better than the first two.
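A compact way to back this comparison with numbers (a small sketch using the three fitted objects defined above) is to line up their information criteria side by side:
ic <- cbind(infocriteria(m),    # sGARCH(1,1), normal errors
            infocriteria(M),    # sGARCH(1,1), sstd errors
            infocriteria(M_1))  # gjrGARCH(1,1), sstd errors
colnames(ic) <- c("sGARCH_norm", "sGARCH_sstd", "gjrGARCH_sstd")
ic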
We shall edit the model used above to obtain the AR(1) GJR-GARCH model;
AR_GJR<-ugarchspec(mean.model = list(armaOrder =c(1,0)),
variance.model = list(model="gjrGARCH"),
distribution.model = 'sstd')
M_2<- ugarchfit(data=Returns,spec = AR_GJR)
M_2
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model : ARFIMA(1,0,0)
Distribution : sstd
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001420 0.000247 5.74515 0.00000
ar1 0.010494 0.015945 0.65813 0.51045
omega 0.000010 0.000001 8.24188 0.00000
alpha1 0.037181 0.001534 24.23671 0.00000
beta1 0.872005 0.009593 90.89994 0.00000
gamma1 0.143733 0.020281 7.08721 0.00000
skew 1.005822 0.021631 46.49968 0.00000
shape 5.382598 0.389205 13.82971 0.00000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001420 0.000268 5.29040 0.000000
ar1 0.010494 0.014629 0.71735 0.473160
omega 0.000010 0.000002 5.60750 0.000000
alpha1 0.037181 0.010548 3.52510 0.000423
beta1 0.872005 0.010817 80.61469 0.000000
gamma1 0.143733 0.025288 5.68388 0.000000
skew 1.005822 0.020836 48.27366 0.000000
shape 5.382598 0.490608 10.97128 0.000000
LogLikelihood : 10573.3
Information Criteria
------------------------------------
Akaike -5.2721
Bayes -5.2595
Shibata -5.2721
Hannan-Quinn -5.2677
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 0.5763 0.4478
Lag[2*(p+q)+(p+q)-1][2] 1.0005 0.7396
Lag[4*(p+q)+(p+q)-1][5] 4.0600 0.2160
d.o.f=1
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.2509 0.6164
Lag[2*(p+q)+(p+q)-1][5] 1.4228 0.7589
Lag[4*(p+q)+(p+q)-1][9] 2.4147 0.8502
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.8672 0.500 2.000 0.3517
ARCH Lag[5] 2.1471 1.440 1.667 0.4396
ARCH Lag[7] 2.4044 2.315 1.543 0.6322
Nyblom stability test
------------------------------------
Joint Statistic: 21.9003
Individual Statistics:
mu 0.50680
ar1 0.28772
omega 5.39492
alpha1 1.51052
beta1 1.50376
gamma1 1.01348
skew 0.03135
shape 1.42984
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.89 2.11 2.59
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.2808 0.2003
Negative Sign Bias 0.6068 0.5440
Positive Sign Bias 0.1426 0.8866
Joint Effect 2.5962 0.4582
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 25.76 0.1369
2 30 38.74 0.1068
3 40 45.37 0.2235
4 50 50.93 0.3975
Elapsed time : 2.785747
plot(M_2,which='all')
There is nothing dramatically new in the plots above; what is readily noticeable in the output, however, is that the AR(1) term we have just included is not statistically significant. The implication is that adding this term to the GARCH model is not helpful, and we should therefore consider removing it. In other words, we are no longer going to use this model.
As usual, we shall edit the model used above to obtain the GJR-GARCH-in-mean model.
GJR_mean<-ugarchspec(mean.model = list(armaOrder =c(0,0),
archm=T,
archpow=2),
variance.model = list(model="gjrGARCH"),
distribution.model = 'sstd')
M_3<- ugarchfit(data=Returns,spec = GJR_mean)
M_3
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : sstd
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001444 0.000325 4.443402 0.000009
archm -0.049490 0.932517 -0.053071 0.957675
omega 0.000010 0.000001 8.754815 0.000000
alpha1 0.037684 0.004911 7.673515 0.000000
beta1 0.872433 0.009586 91.011271 0.000000
gamma1 0.142023 0.020514 6.923123 0.000000
skew 1.005493 0.021781 46.164192 0.000000
shape 5.363234 0.397180 13.503282 0.000000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001444 0.000379 3.813922 0.000137
archm -0.049490 0.958897 -0.051611 0.958839
omega 0.000010 0.000002 6.413839 0.000000
alpha1 0.037684 0.009759 3.861426 0.000113
beta1 0.872433 0.010894 80.081843 0.000000
gamma1 0.142023 0.025209 5.633769 0.000000
skew 1.005493 0.020491 49.069703 0.000000
shape 5.363234 0.467440 11.473630 0.000000
LogLikelihood : 10573.09
Information Criteria
------------------------------------
Akaike -5.2720
Bayes -5.2594
Shibata -5.2720
Hannan-Quinn -5.2675
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 1.756 0.1851
Lag[2*(p+q)+(p+q)-1][2] 2.183 0.2341
Lag[4*(p+q)+(p+q)-1][5] 5.248 0.1344
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.209 0.6476
Lag[2*(p+q)+(p+q)-1][5] 1.404 0.7634
Lag[4*(p+q)+(p+q)-1][9] 2.414 0.8503
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.8757 0.500 2.000 0.3494
ARCH Lag[5] 2.1980 1.440 1.667 0.4291
ARCH Lag[7] 2.4523 2.315 1.543 0.6223
Nyblom stability test
------------------------------------
Joint Statistic: 22.1719
Individual Statistics:
mu 0.51432
archm 0.31673
omega 5.27461
alpha1 1.51079
beta1 1.50491
gamma1 1.00694
skew 0.03152
shape 1.42678
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.89 2.11 2.59
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.3120 0.1896
Negative Sign Bias 0.5963 0.5510
Positive Sign Bias 0.1378 0.8904
Joint Effect 2.7252 0.4360
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 24.24 0.1873
2 30 38.00 0.1223
3 40 42.74 0.3137
4 50 50.31 0.4214
Elapsed time : 4.44684
Again, the additional term (archm) in our model is not statistically significant. Thus we shall ignore this model and will not use it to make our prediction.
In this part we shall make use of the best of the five models above and use it to simulate Apple's returns. Of the five models that we ran, the best was the GJR-GARCH model with the skewed Student's t distribution, which was the third model we estimated. We shall therefore copy the specification we used and edit it to make our simulation.
GJR<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
variance.model = list(model="gjrGARCH"),
distribution.model = 'sstd')
M_1<- ugarchfit(data=Returns,spec = GJR)
We shall treat this model as our final model. Run the two command lines above and run the command lines below as well.
GJR_final<-GJR
setfixed(GJR_final)<-as.list(coef(M_1))
Run the command below to obtain the coefficients of the model M_1.
coef(M_1)
mu omega alpha1 beta1 gamma1
0.00143879804 0.00001029444 0.03762037087 0.87245345504 0.14198003252
skew shape
1.00583419409 5.36344066744
Now we want to compare one-year forecasts made at two different points in time: one using the data up to the end of 2008 and one using the data up to the end of 2022. In each case we forecast the next year, which has approximately 252 trading days.
f2008<-ugarchforecast(data = Returns["/2008-12"],
fitORspec = GJR_final,
n.ahead = 252)
f2022<-ugarchforecast(data = Returns["/2022-12"],
fitORspec = GJR_final,
n.ahead = 252)
par(mfrow= c(2,1))
plot(sigma(f2008))
plot(sigma(f2022))
For 2008, volatility was high and the forecast was that over the following year it would fall. At the end of 2022, volatility was low, so the forecast was that it would rise over the following year. In the long run the forecast volatility moves towards its long-run average. Now let us simulate return paths from the fitted model and store the results in the object SIM. Consider the command below;
SIM <- ugarchpath(spec = GJR_final,
m.sim = 3,
n.sim = 1*250,
rseed = 123)
plot.zoo(fitted(SIM))
The graph above shows three different simulated return paths based on our model. We can also plot the simulated volatility for the 250 simulated days ahead.
plot.zoo(sigma(SIM))
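The simulation above is done on returns; a natural extra step (an added sketch, not shown in the original output) is to convert each simulated return path into a price path starting from the last observed close:
# turn the simulated return paths into price paths
last_close <- as.numeric(tail(AAPL$AAPL.Close, 1))   # starting price for the paths
sim_returns <- fitted(SIM)                           # simulated returns, one column per path
sim_prices <- apply(sim_returns, 2,
                    function(r) last_close * cumprod(1 + r))
matplot(sim_prices, type = "l", lty = 1,
        xlab = "Day ahead", ylab = "Simulated AAPL price")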
The One Sample t Test examines whether the mean of a population is statistically different from a known or hypothesized value. The One Sample t Test is a parametric test. This test is also known as single Sample t Test. The variable used in this test is known as the test variable. In a One Sample t Test, the test variable’s mean is compared against a “test value”, which is a known or hypothesized value of the mean in the population. Test values may come from a literature review, a trusted research organization, legal requirements, or industry standards. For example:
A particular factory’s machines are supposed to fill bottles with 150 milliliters of product. A plant manager wants to test a random sample of bottles to ensure that the machines are not under- or over-filling the bottles. The United States Environmental Protection Agency (EPA) sets clearance levels for the amount of lead present in homes: no more than 10 micrograms per square foot on floors and no more than 100 micrograms per square foot on window sills (as of December 2020). An inspector wants to test if samples taken from units in an apartment building exceed the clearance level.
Enter the following data
score<-c(50,65,72,77,73,85,88,80,65,56,66,78,82,90)
score
[1] 50 65 72 77 73 85 88 80 65 56 66 78 82 90
length(score)
[1] 14
mean(score)
[1] 73.35714
t.test(score,mu=80)
One Sample t-test
data: score
t = -2.0988, df = 13, p-value = 0.05593
alternative hypothesis: true mean is not equal to 80
95 percent confidence interval:
66.51943 80.19486
sample estimates:
mean of x
73.35714
In the above output, the p-value of the test is 0.05593, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence, at the 5% level, that the true mean differs from 80.
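The t statistic reported above can be reproduced directly from its definition, t = (sample mean - test value) / (standard error of the mean); a quick check:
# reproduce the one-sample t statistic by hand
n <- length(score)
t_stat <- (mean(score) - 80) / (sd(score) / sqrt(n))
t_stat                       # about -2.0988, matching the output above
2 * pt(t_stat, df = n - 1)   # two-sided p-value, about 0.056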
t.test(score,mu=80,alternative="two.sided",conf.level=0.95)
One Sample t-test
data: score
t = -2.0988, df = 13, p-value = 0.05593
alternative hypothesis: true mean is not equal to 80
95 percent confidence interval:
66.51943 80.19486
sample estimates:
mean of x
73.35714
In the above output, the p-value of the test is 0.05593, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence, at the 5% level, that the true mean differs from 80.
t.test(score,mu=80,alternative="less",conf.level=0.95)
One Sample t-test
data: score
t = -2.0988, df = 13, p-value = 0.02797
alternative hypothesis: true mean is less than 80
95 percent confidence interval:
-Inf 78.96227
sample estimates:
mean of x
73.35714
In the above output, the p-value of the test is 0.02797, which is less than the significance level alpha = 0.05. We reject the null hypothesis and conclude that the true mean is less than 80.
t.test(score,mu=80,alternative="greater",conf.level=0.95)
One Sample t-test
data: score
t = -2.0988, df = 13, p-value = 0.972
alternative hypothesis: true mean is greater than 80
95 percent confidence interval:
67.75202 Inf
sample estimates:
mean of x
73.35714
In the above output, the p-value of the test is 0.972, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true mean is greater than 80.
Paired T-Test. The paired sample t-test, sometimes called the dependent sample t-test, is a statistical procedure used to determine whether the mean difference between two sets of observations is zero. In a paired sample t-test, each subject or entity is measured twice, resulting in pairs of observations.
weight_before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)
weight_after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)
myDatta<-data.frame(weight_before,weight_after)
head(myDatta,10)
t.test(weight_before,weight_after, alternative="two.sided",paired = TRUE)
Paired t-test
data: weight_before and weight_after
t = -7.9059, df = 9, p-value = 0.00002433
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-214.1284 -118.8516
sample estimates:
mean difference
-166.49
In the above output, the p-value of the test is 0.00002433, which is far less than the significance level alpha = 0.05. We can conclude that the true mean difference is not equal to zero.
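Equivalently, a paired t test is just a one-sample t test on the within-pair differences, which is sometimes easier to interpret; a quick check:
# paired t test expressed as a one-sample test on the differences
diffs <- weight_before - weight_after
t.test(diffs, mu = 0)   # same t = -7.9059 and df = 9 as the paired test above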
t.test(weight_before,weight_after, alternative="less",paired = TRUE)
Paired t-test
data: weight_before and weight_after
t = -7.9059, df = 9, p-value = 0.00001216
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
-Inf -127.8868
sample estimates:
mean difference
-166.49
In the above output, the p-value of the test is 0.00001216, which is far less than the significance level alpha = 0.05. We can conclude that the true mean difference is less than zero.
t.test(weight_before,weight_after, alternative="greater",paired = TRUE)
Paired t-test
data: weight_before and weight_after
t = -7.9059, df = 9, p-value = 1
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
-205.0932 Inf
sample estimates:
mean difference
-166.49
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true mean difference is greater than zero.
weight_before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)
weight_after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)
myData <- data.frame(
weight = c(weight_before, weight_after),
group = rep(c("weight_before", "weight_after"), each = 10)
)
myData
t.test(weight~group, alternative="two.sided",paired = TRUE, data=myData)
Paired t-test
data: weight by group
t = 7.9059, df = 9, p-value = 0.00002433
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
118.8516 214.1284
sample estimates:
mean difference
166.49
In the above output, the p-value of the test is 0.00002433, which is far less than the significance level alpha = 0.05. We can conclude that the true mean difference is not equal to zero. (Note that the sign of the difference is reversed here because the group factor is ordered alphabetically, so the reported difference is weight_after minus weight_before.)
t.test(weight~group, alternative="less",paired = TRUE, data=myData)
Paired t-test
data: weight by group
t = 7.9059, df = 9, p-value = 1
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
-Inf 205.0932
sample estimates:
mean difference
166.49
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true mean difference is less than zero.
t.test(weight~group, alternative="greater",paired = TRUE, data=myData)
Paired t-test
data: weight by group
t = 7.9059, df = 9, p-value = 0.00001216
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
127.8868 Inf
sample estimates:
mean difference
166.49
In the above output, the p-value of the test is 0.00001216, which is far less than the significance level alpha = 0.05. We can conclude that the true mean difference is greater than zero.
my_tbl2020 <- tibble::tribble(
~S.n, ~mathematics, ~arts,
1, 22, 53,
2, 37, 68,
3, 36, 42,
4, 38, 49,
5, 42, 51,
6, 58, 65,
7, 58, 51,
8, 60, 71,
9, 62, 55,
10, 65, 74,
11, 66, 68,
12, 56, 64,
13, 66, 67,
14, 67, 73,
15, 62, 65
)
require(knitr)
Loading required package: knitr
kable(my_tbl2020, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| S.n | mathematics | arts |
|---|---|---|
| 1 | 22 | 53 |
| 2 | 37 | 68 |
| 3 | 36 | 42 |
| 4 | 38 | 49 |
| 5 | 42 | 51 |
| 6 | 58 | 65 |
| 7 | 58 | 51 |
| 8 | 60 | 71 |
| 9 | 62 | 55 |
| 10 | 65 | 74 |
| 11 | 66 | 68 |
| 12 | 56 | 64 |
| 13 | 66 | 67 |
| 14 | 67 | 73 |
| 15 | 62 | 65 |
attach(my_tbl2020)
head(my_tbl2020,10)
TTEST<-t.test(mathematics,arts, alternative = "two.sided",paired = TRUE,data=my_tbl2020, conf.level=.95)
TTEST
Paired t-test
data: mathematics and arts
t = -2.8805, df = 14, p-value = 0.0121
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-14.073042 -2.060291
sample estimates:
mean difference
-8.066667
In the above output, the p-value of the test is approximately 0.0121, which is less than the significance level alpha = 0.05. We can conclude that the true mean difference between mathematics and arts is not equal to zero with a p-value which is approximately 0.0121 at a 5% level of significance.
before<-c(117,111,98,104,105,100,81,89,78)
after<-c(83,85,75,82,82,77,62,69,64)
My_framed_data<-data.frame(before,after)
My_framed_data
MY_TTEST<-t.test(before,after, alternative = "two.sided",paired = TRUE,data=My_framed_data, conf.level = 0.99)
MY_TTEST
Paired t-test
data: before and after
t = 12.52, df = 8, p-value = 0.000001551
alternative hypothesis: true mean difference is not equal to 0
99 percent confidence interval:
16.59186 28.74147
sample estimates:
mean difference
22.66667
In the above output, the p-value of the test is approximately 0.0000016, which is less than 0.01 (the test was run with a 99% confidence level, i.e. alpha = 0.01). We can therefore conclude that the true mean difference between the before and after measurements is not equal to zero.
MY_TTEST1<-t.test(before,after, alternative = "less",paired = TRUE,data=My_framed_data)
MY_TTEST1
Paired t-test
data: before and after
t = 12.52, df = 8, p-value = 1
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
-Inf 26.03331
sample estimates:
mean difference
22.66667
In the above output, the p-value of the test is approximately 1, which is far greater than the significance level alpha = 0.05. We therefore retain the null hypothesis: there is no evidence that the true mean difference between the before and after measurements is less than zero.
MY_TTEST2<-t.test(before,after, alternative = "greater",paired = TRUE,data=My_framed_data)
MY_TTEST2
Paired t-test
data: before and after
t = 12.52, df = 8, p-value = 0.0000007754
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
19.30002 Inf
sample estimates:
mean difference
22.66667
In the above output, the p-value of the test is approximately 0.0000008, which is far less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean difference between the before and after measurements is greater than zero.
What is an unpaired t-test? An unpaired t-test (also known as an independent t-test) is a statistical procedure that compares the averages/means of two independent or unrelated groups to determine if there is a significant difference between the two.
data("gapminder")
data2021<-gapminder%>%
dplyr::select(country, lifeExp)%>%
filter(country == "Kenya"|
country == "United States")
head(data2021,14)
t.test(lifeExp ~ country,alternative="two.sided",data = data2021)
Welch Two Sample t-test
data: lifeExp by country
t = -11.051, df = 17.966, p-value = 0.000000001917
alternative hypothesis: true difference in means between group Kenya and group United States is not equal to 0
95 percent confidence interval:
-24.75174 -16.84326
sample estimates:
mean in group Kenya mean in group United States
52.6810 73.4785
In the above output, the p-value of the test is approximately 0.000000002, which is far less than the significance level alpha = 0.05. We can conclude that the true difference in mean life expectancy between Kenya and the United States is not equal to zero.
data2021<-gapminder %>%
dplyr::select(country, lifeExp)%>%
filter(country == "Kenya"|
country == "United States")
head(data2021,14)
t.test(lifeExp ~ country,alternative="less",data = data2021)
Welch Two Sample t-test
data: lifeExp by country
t = -11.051, df = 17.966, p-value = 0.0000000009583
alternative hypothesis: true difference in means between group Kenya and group United States is less than 0
95 percent confidence interval:
-Inf -17.53385
sample estimates:
mean in group Kenya mean in group United States
52.6810 73.4785
In the above output, the p-value of the test is approximately 0.000000001, which is far less than the significance level alpha = 0.05. We can conclude that the true difference in mean life expectancy between Kenya and the United States is less than zero, that is, Kenya's mean life expectancy is lower than that of the United States.
data2021<-gapminder %>%
dplyr::select(country, lifeExp)%>%
filter(country == "Kenya"|
country == "United States")
head(data2021,14)
t.test(lifeExp ~ country,alternative="greater",data = data2021)
Welch Two Sample t-test
data: lifeExp by country
t = -11.051, df = 17.966, p-value = 1
alternative hypothesis: true difference in means between group Kenya and group United States is greater than 0
95 percent confidence interval:
-24.06115 Inf
sample estimates:
mean in group Kenya mean in group United States
52.6810 73.4785
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that Kenya's mean life expectancy is greater than that of the United States.
male<-c(135,180,108,128,160,143,175,170)
female<-c(205,195,185,150,175,190,180,220)
daatta<-data.frame(male,female)
daatta
TTEST1<-t.test(male,female, alternative = "two.sided",paired = FALSE,data=daatta)
TTEST1
Welch Two Sample t-test
data: male and female
t = -3.2304, df = 13.477, p-value = 0.006308
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-62.69717 -12.55283
sample estimates:
mean of x mean of y
149.875 187.500
In the above output, the p-value of the test is approximately 0.006308, which is less than the significance level alpha = 0.05. We can conclude that the true difference in mean weight between males and females is not equal to zero at the 5% level of significance.
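Note that t.test() performs the Welch test by default, which does not assume equal variances in the two groups. If you are willing to assume equal variances, the pooled (classical Student) version can be requested with var.equal = TRUE; a small sketch for comparison:
# pooled (equal-variance) two-sample t test for comparison
t.test(male, female, alternative = "two.sided", var.equal = TRUE)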
TTEST2<-t.test(male,female, alternative = "less",paired = FALSE,data=daatta)
TTEST2
Welch Two Sample t-test
data: male and female
t = -3.2304, df = 13.477, p-value = 0.003154
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -17.05415
sample estimates:
mean of x mean of y
149.875 187.500
In the above output, the p-value of the test is approximately 0.003154, which is less than the significance level alpha = 0.05. We can conclude that the true difference in mean weight between males and females is less than zero at the 5% level of significance.
TTEST3<-t.test(male,female, alternative = "greater",paired = FALSE,data=daatta)
TTEST3
Welch Two Sample t-test
data: male and female
t = -3.2304, df = 13.477, p-value = 0.9968
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-58.19585 Inf
sample estimates:
mean of x mean of y
149.875 187.500
In the above output, the p-value of the test is approximately 0.9968, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true difference in mean weight between males and females is greater than zero.
weight<-c(135,180,108,128,160,143,175,170,205,195,185,150,175,190,180,220)
weight
[1] 135 180 108 128 160 143 175 170 205 195 185 150 175 190 180 220
gender<-gl(2,8,labels=c("male","female"))
gender
[1] male male male male male male male male female female
[11] female female female female female female
Levels: male female
paired_data<-data.frame(weight,gender)
paired_data
TTEST4<-t.test(weight~gender, alternative = "two.sided",paired = FALSE,data=paired_data)
TTEST4
Welch Two Sample t-test
data: weight by gender
t = -3.2304, df = 13.477, p-value = 0.006308
alternative hypothesis: true difference in means between group male and group female is not equal to 0
95 percent confidence interval:
-62.69717 -12.55283
sample estimates:
mean in group male mean in group female
149.875 187.500
In the above output, the p-value of the test is approximately 0.006308, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true difference in mean weight between males and females is not equal to zero at the 5% level of significance.
TTEST5<-t.test(weight~gender, alternative = "less",paired = FALSE,data=paired_data)
TTEST5
Welch Two Sample t-test
data: weight by gender
t = -3.2304, df = 13.477, p-value = 0.003154
alternative hypothesis: true difference in means between group male and group female is less than 0
95 percent confidence interval:
-Inf -17.05415
sample estimates:
mean in group male mean in group female
149.875 187.500
In the above output, the p-value of the test is approximately 0.003154, which is less than the significance level alpha = 0.05. We can conclude that the true difference in mean weight between males and females is less than zero at the 5% level of significance.
TTEST6<-t.test(weight~gender, alternative = "greater",paired = FALSE,data=paired_data)
TTEST6
Welch Two Sample t-test
data: weight by gender
t = -3.2304, df = 13.477, p-value = 0.9968
alternative hypothesis: true difference in means between group male and group female is greater than 0
95 percent confidence interval:
-58.19585 Inf
sample estimates:
mean in group male mean in group female
149.875 187.500
In the above output, the p-value of the test is approximately 0.9968, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true difference in mean weight between males and females is greater than zero.
Analysis of Variance (ANOVA) is a statistical approach that compares the means (or averages) of different groups by partitioning the variance. It is used in a range of scenarios to determine whether there is any difference between the means of three or more groups.
head(gapminder,10)
attach(gapminder)
ANOVA<-aov(gdpPercap~continent)
summary(ANOVA)
Df Sum Sq Mean Sq F value Pr(>F)
continent 4 37990228667 9497557167 126.6 <0.0000000000000002 ***
Residuals 1699 127489276662 75037832
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 126.6 with an associated p-value of less than 0.0001, which is well below 0.05. This indicates that there is a significant difference in mean gdp per capita across continents at the 5% level of significance.
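To see the group means being compared, they can also be computed directly; a quick check that is not part of the original code:
# Mean gdp per capita by continent (the group means compared by the ANOVA above)
tapply(gdpPercap, continent, mean)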
#Test the significance of the difference # Load the package below
library(agricolae)
TK<-TukeyHSD(ANOVA)
TK
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = gdpPercap ~ continent)
$continent
diff lwr upr p adj
Americas-Africa 4942.3558 3280.4806 6604.231 0.0000000
Asia-Africa 5708.3958 4188.6340 7228.158 0.0000000
Europe-Africa 12275.7210 10710.1623 13841.280 0.0000000
Oceania-Africa 16427.8546 11507.4034 21348.306 0.0000000
Asia-Americas 766.0401 -1044.5150 2576.595 0.7767582
Europe-Americas 7333.3652 5484.2011 9182.529 0.0000000
Oceania-Americas 11485.4989 6467.6035 16503.394 0.0000000
Europe-Asia 6567.3251 4844.7530 8289.897 0.0000000
Oceania-Asia 10719.4588 5746.8215 15692.096 0.0000000
Oceania-Europe 4152.1337 -834.6909 9138.958 0.1539474
From the results above, a significant difference in mean gdp per capita exists between most pairs of continents, except between Asia and the Americas and between Oceania and Europe.
plot(TK)
plot(TK, las=1)
ANOVA2<-aov(lifeExp~continent)
summary(ANOVA2)
Df Sum Sq Mean Sq F value Pr(>F)
continent 4 139343 34836 408.7 <0.0000000000000002 ***
Residuals 1699 144805 85
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 408.7 with an associated p-value of less than 0.0001, which is well below 0.05. This indicates that there is a significant difference in mean life expectancy across continents at the 5% level of significance.
TK2<-TukeyHSD(ANOVA2)
TK2
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = lifeExp ~ continent)
$continent
diff lwr upr p adj
Americas-Africa 15.793407 14.022263 17.564550 0.0000000
Asia-Africa 11.199573 9.579887 12.819259 0.0000000
Europe-Africa 23.038356 21.369862 24.706850 0.0000000
Oceania-Africa 25.460878 20.216908 30.704848 0.0000000
Asia-Americas -4.593833 -6.523432 -2.664235 0.0000000
Europe-Americas 7.244949 5.274203 9.215696 0.0000000
Oceania-Americas 9.667472 4.319650 15.015293 0.0000086
Europe-Asia 11.838783 10.002952 13.674614 0.0000000
Oceania-Asia 14.261305 8.961718 19.560892 0.0000000
Oceania-Europe 2.422522 -2.892185 7.737230 0.7250559
From the results above, a significant difference in mean life expectancy exists between all pairs of continents except between Oceania and Europe.
plot(TK2)
plot(TK2, las=1)
library(haven)
MAIZE_YIELDS_two_way_ANOVA <- read_dta("D:/Data/data and other training stuff/STATA training DATA set/MAIZE YIELDS two way ANOVA.dta")
head(MAIZE_YIELDS_two_way_ANOVA,10)
my_2anova<-MAIZE_YIELDS_two_way_ANOVA
str(my_2anova)
tibble [24 × 3] (S3: tbl_df/tbl/data.frame)
$ maize_yields: num [1:24] 18.5 11.7 15.4 16.5 15.7 14.2 16.6 18.5 16.2 12.9 ...
..- attr(*, "format.stata")= chr "%10.0g"
$ treatment : dbl+lbl [1:24] 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5...
..@ format.stata: chr "%10.0g"
..@ labels : Named num [1:6] 1 2 3 4 5 6
.. ..- attr(*, "names")= chr [1:6] "TR 1" "TR 2" "TR 3" "TR 4" ...
$ blocks : dbl+lbl [1:24] 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3...
..@ format.stata: chr "%10.0g"
..@ labels : Named num [1:4] 1 2 3 4
.. ..- attr(*, "names")= chr [1:4] "BLOCK 1" "BLOCK 2" "BLOCK 3" "BLOCK 4"
my_2anova
attach(my_2anova)
my_2anova$treatment <- as.factor(my_2anova$treatment)
my_2anova$blocks <- as.factor(my_2anova$blocks)
str(my_2anova)
tibble [24 × 3] (S3: tbl_df/tbl/data.frame)
$ maize_yields: num [1:24] 18.5 11.7 15.4 16.5 15.7 14.2 16.6 18.5 16.2 12.9 ...
..- attr(*, "format.stata")= chr "%10.0g"
$ treatment : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
$ blocks : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
modell<-lm(maize_yields~treatment+blocks,data=my_2anova)
modell
Call:
lm(formula = maize_yields ~ treatment + blocks, data = my_2anova)
Coefficients:
(Intercept) treatment2 treatment3 treatment4 treatment5 treatment6
14.921 0.725 -1.200 0.575 0.675 0.900
blocks2 blocks3 blocks4
-1.467 2.767 1.117
summary(modell)
Call:
lm(formula = maize_yields ~ treatment + blocks, data = my_2anova)
Residuals:
Min 1Q Median 3Q Max
-2.5958 -1.7688 0.0292 1.2313 3.5792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.921 1.410 10.585 0.0000000235 ***
treatment2 0.725 1.628 0.445 0.6624
treatment3 -1.200 1.628 -0.737 0.4723
treatment4 0.575 1.628 0.353 0.7288
treatment5 0.675 1.628 0.415 0.6842
treatment6 0.900 1.628 0.553 0.5884
blocks2 -1.467 1.329 -1.104 0.2872
blocks3 2.767 1.329 2.082 0.0549 .
blocks4 1.117 1.329 0.840 0.4140
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.302 on 15 degrees of freedom
Multiple R-squared: 0.4681, Adjusted R-squared: 0.1843
F-statistic: 1.65 on 8 and 15 DF, p-value: 0.192
two_way<-anova(modell)
two_way
From the results above, the F statistic for treatment is 0.4672 with an associated p-value of approximately 0.79478, which is greater than 0.05. This indicates that there is no significant difference in mean maize yield across treatments at the 5% level of significance. On the other hand, the F statistic for blocks is 3.6208 with an associated p-value of 0.03802, which is less than 0.05. This indicates that there is a significant difference in mean maize yield across blocks at the 5% level of significance.
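If preferred, the same two-way table can be obtained by fitting the model directly with aov(); a minimal sketch, with the object name two_way_aov chosen only for illustration and its output omitted here:
# Two-way ANOVA fitted directly with aov(); gives the same F tests as anova(modell)
two_way_aov <- aov(maize_yields ~ treatment + blocks, data = my_2anova)
summary(two_way_aov)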
A Pearson’s correlation is used when the two statistics we want to analyze are both quantitative. This means we will be comparing quantitative variables to find a linear relationship (if the variables have a nonlinear relationship, a correlation is not appropriate). The Pearson correlation measures the strength of the linear relationship between two variables. It takes a value between -1 and 1, with -1 meaning a perfect negative linear correlation, 0 meaning no correlation, and +1 meaning a perfect positive linear correlation.
data("gapminder")
head(gapminder,10)
cor.test(gdpPercap, lifeExp, method = "pearson")
Pearson's product-moment correlation
data: gdpPercap and lifeExp
t = 29.658, df = 1702, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5515065 0.6141690
sample estimates:
cor
0.5837062
“t” is the value of the test statistic (t = 29.658). The p-value of the test is extremely small (p-value < 0.00000000000000022). The alternative hypothesis is a character string describing the alternative hypothesis (true correlation is not equal to 0). Sample estimates gives the correlation coefficient; for the Pearson correlation it is named cor (cor = 0.5837). From the results we therefore reject the null hypothesis and conclude that there is a moderate positive correlation between life expectancy and gdp per capita at the 5% level of significance.
library(haven)
Pearson <- read_sav("C:/Users/user/Downloads/Pearson.sav")
head(Pearson,10)
length(Pearson)
[1] 2
str(Pearson)
tibble [300 × 2] (S3: tbl_df/tbl/data.frame)
$ mpg : num [1:300] 18 15 18 16 17 15 14 14 14 15 ...
..- attr(*, "format.spss")= chr "F4.1"
$ weight: num [1:300] 3504 3693 3436 3433 3449 ...
..- attr(*, "format.spss")= chr "F4.0"
attach(Pearson)
The following object is masked _by_ .GlobalEnv:
weight
The following object is masked from package:ggplot2:
mpg
cor(Pearson)
mpg weight
mpg 1.0000000 -0.8782815
weight -0.8782815 1.0000000
cor.test(Pearson$mpg, Pearson$weight, method = "pearson", data=Pearson)
Pearson's product-moment correlation
data: Pearson$mpg and Pearson$weight
t = -31.709, df = 298, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9018288 -0.8495329
sample estimates:
cor
-0.8782815
“t” is the value of the test statistic (t = -31.709). The p-value of the test is extremely small (p-value < 0.00000000000000022). The alternative hypothesis is a character string describing the alternative hypothesis (true correlation is not equal to 0). Sample estimates gives the correlation coefficient; for the Pearson correlation it is named cor (cor = -0.8783). From the results we therefore reject the null hypothesis and conclude that the true correlation is not equal to zero. Further, the correlation coefficient shows a strong negative correlation between mpg and weight at the 5% level of significance.
There is sufficient evidence to conclude that there is a significant linear relationship between mpg and weight because the correlation coefficient is significantly different from zero.
plot(Pearson$mpg, Pearson$weight)
knitr::include_graphics("C:/Users/user/Downloads/my_scatter.png")
Warning in knitr::include_graphics("C:/Users/user/Downloads/my_scatter.png"): It
is highly recommended to use relative paths for images. You had absolute paths:
"C:/Users/user/Downloads/my_scatter.png"
My scatter plot showing negative correlation
knitr::include_graphics("my_scatter.png")
My scatter plot showing negative correlation
library(devtools)
Loading required package: usethis
my_tbl <- tibble::tribble(
~ln_population, ~ln_life_exp, ~ln_gdp_cap,
6.925587075, 1.45940757, 2.891786,
6.965715868, 1.48190105, 2.914265,
7.011447073, 1.50510926, 2.931,
7.062129255, 1.53173431, 2.922309,
7.116589814, 1.55736281, 2.869221,
7.172613788, 1.58476078, 2.895485,
7.109977092, 1.60047192, 2.990344,
7.142012486, 1.61089428, 2.930641,
7.212664826, 1.61986519, 2.812473,
7.346888958, 1.62079169, 2.803007,
7.402577829, 1.62458115, 2.861376,
7.503653471, 1.64175165, 2.988818,
6.108124079, 1.74217504, 3.204407,
6.169234922, 1.77290819, 3.288313,
6.237578169, 1.81170903, 3.364155,
6.297554802, 1.82098918, 3.44094,
6.35479086, 1.83052451, 3.520277,
6.39950897, 1.83840828, 3.548144,
6.444059949, 1.84769602, 3.560012
)
require(knitr)
kable(my_tbl, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| ln_population | ln_life_exp | ln_gdp_cap |
|---|---|---|
| 6.926 | 1.459 | 2.892 |
| 6.966 | 1.482 | 2.914 |
| 7.011 | 1.505 | 2.931 |
| 7.062 | 1.532 | 2.922 |
| 7.117 | 1.557 | 2.869 |
| 7.173 | 1.585 | 2.895 |
| 7.110 | 1.600 | 2.990 |
| 7.142 | 1.611 | 2.931 |
| 7.213 | 1.620 | 2.812 |
| 7.347 | 1.621 | 2.803 |
| 7.403 | 1.625 | 2.861 |
| 7.504 | 1.642 | 2.989 |
| 6.108 | 1.742 | 3.204 |
| 6.169 | 1.773 | 3.288 |
| 6.238 | 1.812 | 3.364 |
| 6.298 | 1.821 | 3.441 |
| 6.355 | 1.831 | 3.520 |
| 6.400 | 1.838 | 3.548 |
| 6.444 | 1.848 | 3.560 |
attach(my_tbl)
my_model<-lm(ln_life_exp~ln_gdp_cap,data=my_tbl)
stargazer(my_model,type="text")
===============================================
Dependent variable:
---------------------------
ln_life_exp
-----------------------------------------------
ln_gdp_cap 0.430***
(0.050)
Constant 0.328**
(0.155)
-----------------------------------------------
Observations 19
R2 0.813
Adjusted R2 0.802
Residual Std. Error 0.058 (df = 17)
F Statistic 73.702*** (df = 1; 17)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
exp(0.328) # back-transform the intercept of the log-log model from the log scale
[1] 1.388189
my_tbl1 <- tibble::tribble(
~Year, ~Income, ~Consumption,
1950, 791.8, 733.2,
1951, 819, 748.7,
1952, 844.3, 771.4,
1953, 880, 802.5,
1954, 894, 822.7,
1955, 944.5, 873.8,
1956, 989.4, 899.8,
1957, 1012.1, 919.7,
1958, 1028.8, 932.9,
1959, 1067.2, 979.4,
1960, 1091.1, 1005.1,
1961, 1123.2, 1025.2,
1962, 1170.2, 1069,
1963, 1207.3, 1108.4,
1964, 1291, 1170.6,
1965, 1365.7, 1236.4,
1966, 1431.3, 1298.9,
1967, 1493.2, 1337.7,
1968, 1551.3, 1405.9,
1969, 1599.8, 1456.7,
1970, 1688.1, 1492,
1971, 1728.4, 1538.8,
1972, 1797.4, 1621.9,
1973, 1916.3, 1689.6,
1974, 1896.6, 1674,
1975, 1931.7, 1711.9,
1976, 2001, 1803.9,
1977, 2066.6, 1883.8,
1978, 2167.4, 1961,
1979, 2216.2, 2004.4
)
require(knitr)
kable(my_tbl1, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| Year | Income | Consumption |
|---|---|---|
| 1950 | 791.8 | 733.2 |
| 1951 | 819.0 | 748.7 |
| 1952 | 844.3 | 771.4 |
| 1953 | 880.0 | 802.5 |
| 1954 | 894.0 | 822.7 |
| 1955 | 944.5 | 873.8 |
| 1956 | 989.4 | 899.8 |
| 1957 | 1012.1 | 919.7 |
| 1958 | 1028.8 | 932.9 |
| 1959 | 1067.2 | 979.4 |
| 1960 | 1091.1 | 1005.1 |
| 1961 | 1123.2 | 1025.2 |
| 1962 | 1170.2 | 1069.0 |
| 1963 | 1207.3 | 1108.4 |
| 1964 | 1291.0 | 1170.6 |
| 1965 | 1365.7 | 1236.4 |
| 1966 | 1431.3 | 1298.9 |
| 1967 | 1493.2 | 1337.7 |
| 1968 | 1551.3 | 1405.9 |
| 1969 | 1599.8 | 1456.7 |
| 1970 | 1688.1 | 1492.0 |
| 1971 | 1728.4 | 1538.8 |
| 1972 | 1797.4 | 1621.9 |
| 1973 | 1916.3 | 1689.6 |
| 1974 | 1896.6 | 1674.0 |
| 1975 | 1931.7 | 1711.9 |
| 1976 | 2001.0 | 1803.9 |
| 1977 | 2066.6 | 1883.8 |
| 1978 | 2167.4 | 1961.0 |
| 1979 | 2216.2 | 2004.4 |
attach(my_tbl1)
The following object is masked from ForDI:
Year
tail(my_tbl1)
head(my_tbl1)
str(my_tbl)
tibble [19 × 3] (S3: tbl_df/tbl/data.frame)
$ ln_population: num [1:19] 6.93 6.97 7.01 7.06 7.12 ...
$ ln_life_exp : num [1:19] 1.46 1.48 1.51 1.53 1.56 ...
$ ln_gdp_cap : num [1:19] 2.89 2.91 2.93 2.92 2.87 ...
Statistical techniques are tools that enable us to answer questions about possible patterns in empirical data. It is not surprising, then, to learn that many important techniques of statistical analysis were developed by scientists who were interested in answering very specific empirical questions. So it was with regression analysis. The history of this particular statistical technique can be traced back to late nineteenth-century England and the pursuits of a gentleman scientist, Francis Galton. Galton was born into a wealthy family that produced more than its share of geniuses; he and Charles Darwin, the famous biologist, were first cousins. During his lifetime, Galton studied everything from fingerprint classification to meteorology, but he gained widespread recognition primarily for his work on inheritance. His most important insight came to him while he was studying the inheritance of one of the most obvious of all human characteristics: height. In order to understand how the characteristic of height was passed from one generation to the next, Galton collected data on the heights of individuals and the heights of their parents. After constructing frequency tables that classified these individuals both by their height and by the average height of their parents, Galton came to the unremarkable conclusion that tall people usually had tall parents and short people usually had short parents.
myy_model<-lm(Consumption~Income,data = my_tbl1)
stargazer(myy_model,type="text")
===============================================
Dependent variable:
---------------------------
Consumption
-----------------------------------------------
Income 0.881***
(0.006)
Constant 31.935***
(9.004)
-----------------------------------------------
Observations 30
R2 0.999
Adjusted R2 0.999
Residual Std. Error 14.858 (df = 28)
F Statistic 20,657.590*** (df = 1; 28)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
myy_model<-lm(log(Consumption)~log(Income),data = my_tbl1)
stargazer(myy_model,type="text")
===============================================
Dependent variable:
---------------------------
log(Consumption)
-----------------------------------------------
log(Income) 0.972***
(0.006)
Constant 0.107**
(0.040)
-----------------------------------------------
Observations 30
R2 0.999
Adjusted R2 0.999
Residual Std. Error 0.010 (df = 28)
F Statistic 30,476.920*** (df = 1; 28)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
According to Relative Income Theory (RIT), the level of consumption expenditures is not determined by the absolute level of income but by the relative level of income, with the APC declining as relative income increases. More specifically, the RIT argues that the level of consumption spending is determined by the household’s level of current income relative to the highest level of income previously earned. Symbolically, it is shown:
knitr::include_graphics("consumption.png")
Symbolic Representation of RIT
where C represents the current level of consumption expenditures, Y, the current level of income, Yh, the highest level of income previously earned, and a and b, represent numerical constants which relate income to consumption. From the above equation, we find that when the households experience a temporary and short-run increase in current income above its previous peak level of income, it increases its consumption expenditures by an amount which is less than proportional to the increase in current income.
Consequently, when current income rises relative to peak income, the APC declines and the increase in total consumption expenditures is not proportional to the increase in total income. Again, when a household experiences current and peak income growing by the same percentage amount, it increases its consumption expenditures by an amount which is proportional to the increase in current income.
Consequently, the APC remains constant and the increase in total consumption expenditure is proportional to the increase in total income. Thus, according to the RIT, changes in current consumption are not proportional to the changes in current income only when current income increases relative to previous peak income.
If current and peak income grow together, changes in consumption are always proportional to the changes in income. However, it must be noted that RIT works for decreases as well as increases in the level of current income. The RIT is fundamentally different from AIT. The RIT explains away the short-run consumption function as a result of temporary deviations in current income, while the AIT explains away the long-run consumption function as the result of factors other than income on consumption.
From the results of the levels model above, autonomous consumption is about 31.935 US dollars. A one-dollar increase in income is associated with about a 0.881-dollar increase in consumption, i.e. a marginal propensity to consume of roughly 0.88. The effect is statistically significant at the 1% level of significance.
Pre<-predict(myy_model,data.frame(Income=2000)) # myy_model now holds the log-log specification, so this predicts log(Consumption) at Income = 2000
Pre
1
7.491265
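Because the prediction above is on the log scale, a small extra step, not in the original script, back-transforms it to consumption in levels:
# Back-transform the log-scale prediction to the original units of Consumption
exp(Pre) # approximately exp(7.49), i.e. about 1792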
my_tbl2021 <- tibble::tribble(
~YEAR, ~GDP, ~PIV, ~ICT, ~VAT, ~IMPTT,
1973, 15790, 3645, 1124.8, 639.8, 795.44,
1974, 18776, 4075, 1531.4, 937.26, 842.24,
1975, 21140, 4837, 1796.8, 1185.48, 983.62,
1976, 25562, 5808, 2149.4, 1308.44, 1057.18,
1977, 32699, 7800, 2846.8, 1855.26, 2083.94,
1978, 35601, 10280, 3121.4, 1995.38, 2025.48,
1979, 39543, 10809, 3437, 3098.14, 2049.64,
1980, 44648, 12451, 3951.6, 3588, 2919,
1981, 51641, 14508, 3993.4, 3895.9, 3674.24,
1982, 58214, 13364, 4624.6, 3917.5, 3305.84,
1983, 66218, 14349, 5023, 5062.38, 3424.38,
1984, 72550, 16143, 6019.4, 5471, 3043.58,
1985, 100831, 17631, 7162.4, 6065.86, 4236.8,
1986, 117472, 23064, 7714.6, 7950.4, 4934.2,
1987, 131169, 25735, 9089.6, 10399.14, 5473.72,
1988, 151194, 30359, 10240.4, 9774.92, 10240.5,
1989, 171589, 33056, 11983, 9461.94, 11983.06,
1990, 195536, 40560, 14261.6, 12140.28, 14261.56,
1991, 224232, 42671, 17027.8, 18555.4, 5118.78,
1992, 264476, 43777, 19970.4, 22142.72, 9183,
1993, 333616, 56580, 36767.2, 28994.34, 14792.78,
1994, 400700, 75616, 43505.8, 24533.86, 18598.28,
1995, 465654, 99497, 48082.3, 28403.72, 21175.68,
1996, 687998, 110142, 48375.02, 29850.08, 22594.06,
1997, 770312, 118535, 55577.9, 34468.1, 24567.06,
1998, 850808, 133366, 55234.9, 90204.76, 28443.93,
1999, 906928, 141403, 53316.99, 40944.11, 28605.16,
2000, 967838, 161714, 53428.93, 50220.92, 28803.74,
2001, 1020020, 185186, 55861.95, 50871.68, 27302,
2002, 1035370, 178466, 70140.28, 56135.25, 24936.09,
2003, 1138060, 179254, 77409.73, 58853.08, 25214,
2004, 1286460, 207196, 99312.47, 75995.66, 23531.72,
2005, 1445480, 264912, 114629.06, 79925.91, 20511.43,
2006, 1642400, 309402, 130719, 96497.01, 27927,
2007, 1833511, 309781, 165078, 111904.51, 32944.35,
2008, 2111173, 239525, 194155, 126854, 36181,
2009, 2365453, 292488, 288168, 146791, 41372,
2010, 2551161, 238938, 268291, 172360, 48903,
2011, 3725918, 213845, 290621.6, 163940.39, 48436.93,
2012, 4261370, 242365, 353693.71, 174716.17, 55397.81,
2013, 4745594, 234586, 422357.86, 218297.19, 61692.72,
2014, 5403471, 273641, 470101.98, 243156.2, 70245.12,
2015, 6284191, 293645, 540440.43, 276504.4, 75410.29,
2016, 7194163, 261548, 589921.37, 287766.52, 86329.96,
2017, 7749435, 269542, 557959.32, 309977.4, 85243.79,
2018, 7946248, 256345, 683377.33, 381419.9, 95354.98
)
require(knitr)
kable(my_tbl2021, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| YEAR | GDP | PIV | ICT | VAT | IMPTT |
|---|---|---|---|---|---|
| 1973 | 15790 | 3645 | 1124.80 | 639.80 | 795.44 |
| 1974 | 18776 | 4075 | 1531.40 | 937.26 | 842.24 |
| 1975 | 21140 | 4837 | 1796.80 | 1185.48 | 983.62 |
| 1976 | 25562 | 5808 | 2149.40 | 1308.44 | 1057.18 |
| 1977 | 32699 | 7800 | 2846.80 | 1855.26 | 2083.94 |
| 1978 | 35601 | 10280 | 3121.40 | 1995.38 | 2025.48 |
| 1979 | 39543 | 10809 | 3437.00 | 3098.14 | 2049.64 |
| 1980 | 44648 | 12451 | 3951.60 | 3588.00 | 2919.00 |
| 1981 | 51641 | 14508 | 3993.40 | 3895.90 | 3674.24 |
| 1982 | 58214 | 13364 | 4624.60 | 3917.50 | 3305.84 |
| 1983 | 66218 | 14349 | 5023.00 | 5062.38 | 3424.38 |
| 1984 | 72550 | 16143 | 6019.40 | 5471.00 | 3043.58 |
| 1985 | 100831 | 17631 | 7162.40 | 6065.86 | 4236.80 |
| 1986 | 117472 | 23064 | 7714.60 | 7950.40 | 4934.20 |
| 1987 | 131169 | 25735 | 9089.60 | 10399.14 | 5473.72 |
| 1988 | 151194 | 30359 | 10240.40 | 9774.92 | 10240.50 |
| 1989 | 171589 | 33056 | 11983.00 | 9461.94 | 11983.06 |
| 1990 | 195536 | 40560 | 14261.60 | 12140.28 | 14261.56 |
| 1991 | 224232 | 42671 | 17027.80 | 18555.40 | 5118.78 |
| 1992 | 264476 | 43777 | 19970.40 | 22142.72 | 9183.00 |
| 1993 | 333616 | 56580 | 36767.20 | 28994.34 | 14792.78 |
| 1994 | 400700 | 75616 | 43505.80 | 24533.86 | 18598.28 |
| 1995 | 465654 | 99497 | 48082.30 | 28403.72 | 21175.68 |
| 1996 | 687998 | 110142 | 48375.02 | 29850.08 | 22594.06 |
| 1997 | 770312 | 118535 | 55577.90 | 34468.10 | 24567.06 |
| 1998 | 850808 | 133366 | 55234.90 | 90204.76 | 28443.93 |
| 1999 | 906928 | 141403 | 53316.99 | 40944.11 | 28605.16 |
| 2000 | 967838 | 161714 | 53428.93 | 50220.92 | 28803.74 |
| 2001 | 1020020 | 185186 | 55861.95 | 50871.68 | 27302.00 |
| 2002 | 1035370 | 178466 | 70140.28 | 56135.25 | 24936.09 |
| 2003 | 1138060 | 179254 | 77409.73 | 58853.08 | 25214.00 |
| 2004 | 1286460 | 207196 | 99312.47 | 75995.66 | 23531.72 |
| 2005 | 1445480 | 264912 | 114629.06 | 79925.91 | 20511.43 |
| 2006 | 1642400 | 309402 | 130719.00 | 96497.01 | 27927.00 |
| 2007 | 1833511 | 309781 | 165078.00 | 111904.51 | 32944.35 |
| 2008 | 2111173 | 239525 | 194155.00 | 126854.00 | 36181.00 |
| 2009 | 2365453 | 292488 | 288168.00 | 146791.00 | 41372.00 |
| 2010 | 2551161 | 238938 | 268291.00 | 172360.00 | 48903.00 |
| 2011 | 3725918 | 213845 | 290621.60 | 163940.39 | 48436.93 |
| 2012 | 4261370 | 242365 | 353693.71 | 174716.17 | 55397.81 |
| 2013 | 4745594 | 234586 | 422357.86 | 218297.19 | 61692.72 |
| 2014 | 5403471 | 273641 | 470101.98 | 243156.20 | 70245.12 |
| 2015 | 6284191 | 293645 | 540440.43 | 276504.40 | 75410.29 |
| 2016 | 7194163 | 261548 | 589921.37 | 287766.52 | 86329.96 |
| 2017 | 7749435 | 269542 | 557959.32 | 309977.40 | 85243.79 |
| 2018 | 7946248 | 256345 | 683377.33 | 381419.90 | 95354.98 |
head(my_tbl2021,10)
summary(my_tbl2021)
YEAR GDP PIV ICT
Min. :1973 Min. : 15790 Min. : 3645 Min. : 1125
1st Qu.:1984 1st Qu.: 79620 1st Qu.: 16515 1st Qu.: 6305
Median :1996 Median : 576826 Median :104820 Median : 48229
Mean :1996 Mean :1542657 Mean :124401 Mean :128339
3rd Qu.:2007 3rd Qu.:1785733 3rd Qu.:237850 3rd Qu.:156488
Max. :2018 Max. :7946248 Max. :309781 Max. :683377
VAT IMPTT
Min. : 639.8 Min. : 795.4
1st Qu.: 5619.7 1st Qu.: 3814.9
Median : 29422.2 Median :20843.6
Mean : 75848.5 Mean :25351.1
3rd Qu.:108052.6 3rd Qu.:31909.2
Max. :381419.9 Max. :95355.0
attach(my_tbl2021)
MDL<-lm(log(PIV)~log(ICT)+log(VAT)+log(IMPTT),data=my_tbl2021)
stargazer(MDL,type="text")
===============================================
Dependent variable:
---------------------------
log(PIV)
-----------------------------------------------
log(ICT) -0.135
(0.177)
log(VAT) 0.647***
(0.211)
log(IMPTT) 0.356**
(0.162)
Constant 2.532***
(0.398)
-----------------------------------------------
Observations 46
R2 0.958
Adjusted R2 0.955
Residual Std. Error 0.303 (df = 42)
F Statistic 317.211*** (df = 3; 42)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
From the analysis it was found that value added tax has a significantly positive effect on the level of private investment in Kenya. This conclusion is based on the positive coefficient of value added tax. Other studies show that an increase in value added tax increases the level of private investment (Adejare, 2017), which is an indicator that an increase in value added tax increases the level of private investment. According to Gitahi (2010), however, value added tax negatively influences the level of private investment. From the table of coefficients, value added tax had a statistically significant effect on the level of private investment, as opposed to the null hypothesis, which stated that value added tax has no statistically significant effect on the level of private investment. This implies that we reject the null hypothesis and conclude that value added tax has a statistically significant effect on the level of private investment at the 5% level of significance. Because the model is specified in logs, the coefficient of 0.647 indicates that a 1% increase in value added tax is associated with approximately a 0.647% increase in the level of private investment.
From table 4.8, the coefficient for income tax shows that income tax has a negative effect on the level of private investment. The argument here is that an increase in the rate of income tax constrains private investment. The assumption, following Keynesian economists, is that an increase in income tax reduces disposable income. A decrease in disposable income implies a decrease in the level of saving, and whenever the level of saving decreases, private investment decreases, because Keynes (1936) argued that in the long run savings equal investment. An increase in the income tax charged results in a decrease in the level of private investment in a country (Gitahi, 2010). The results in the coefficient table show that the effect of income tax on private investment is significant: the p-value for income tax, given as 0.028, is less than 0.05, implying that we reject the null hypothesis and conclude that income tax has a statistically significant effect on the level of private investment at the 5% level of significance. Because the Kenyan government subjects its citizens to higher income tax rates than other countries in the region, capital flight to other countries has accelerated, shrinking the level of private investment in Kenya.
The effect of import tax on private investment was found to be positive and statistically significant. In relation to the existing literature, the results show that import tax encouraged rather than deterred private investment. Newbery (1987) and Coady (1997) argued that import taxes and barriers on international trade of any sort tend to encourage domestic production of final consumer goods while permitting relatively free imports of capital or intermediate goods. This tends to be associated with high rates of effective protection and high costs of domestic production, and it creates a bias against exports. Import tax charged on final consumer goods leads to the production of import substitutes. The production of import substitutes in the country enables the expansion of local business enterprises, leading to an increase in the level of private investment (Gitahi, 2010). Consequently, while reducing the dependence of the country on imports of final consumption goods, the economy becomes highly dependent on imports of intermediate goods. This scenario leads to the expansion of local business enterprises and consequently increases aggregate demand, which is reflected in the positive effect of import tax on private investment. From table 4.8, import tax is statistically significant, as shown by the p-value of 0.034, which is less than 0.05, implying that we reject the null hypothesis and conclude that import tax has a statistically significant effect on the level of private investment in Kenya at the 5% level of significance.
RESID<-residuals(MDL)
mean(RESID)
[1] -0.00000000000000001027399
cov(RESID,log(ICT))
[1] 0.0000000000000000004384995
cov(RESID,log(VAT))
[1] -0.000000000000000001207078
cov(RESID,log(IMPTT))
[1] 0.00000000000000000800864
The residuals of the fitted model have a mean of essentially zero and are uncorrelated with each of the regressors; the tiny non-zero values above are only floating-point rounding, confirming the usual OLS residual properties.
Rehearsal_and_non_rehearsal_groups <- read_sav("D:/Training/Data/Rehearsal and non rehearsal groups.sav")
head(Rehearsal_and_non_rehearsal_groups,10)
data222<-Rehearsal_and_non_rehearsal_groups
head(data222,10)
attach(data222)
var.test(No.ofwordsrecalled~Groups, ratio=1,
alternative=c("two.sided","less","greater"),
conf.level=0.95, data=data222)
F test to compare two variances
data: No.ofwordsrecalled by Groups
F = 1.1019, num df = 9, denom df = 9, p-value = 0.8875
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2736844 4.4360491
sample estimates:
ratio of variances
1.101852
In the above output, the p-value of the test is 0.8875, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis of equal variances: there is no evidence of a difference in variance between the two groups (p-value = 0.8875).
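Because the F test gives no evidence against equal variances, a pooled-variance (Student's) two-sample t-test would be a reasonable follow-up; a minimal sketch using the same variables, output not shown here:
# Pooled-variance t-test, justified by the equal-variance F test above
t.test(No.ofwordsrecalled ~ Groups, var.equal = TRUE, data = data222)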
The Kruskal-Wallis test, proposed by Kruskal and Wallis in 1952, is a nonparametric method for testing whether samples originate from the same distribution. It extends the Mann-Whitney U test to more than two groups. The null hypothesis of the Kruskal-Wallis test is that the mean ranks of the groups are the same. As the nonparametric equivalent of one-way ANOVA, the Kruskal-Wallis test is also called one-way ANOVA on ranks. Unlike the analogous one-way ANOVA, the nonparametric Kruskal-Wallis test does not assume a normal distribution of the underlying data. For this reason the Kruskal-Wallis test is well suited to the analysis of microbiome data: post-sequencing microbiome data are often not normally distributed and contain strong outliers, so it is more appropriate to use ranks rather than actual values to avoid the test being affected by outliers or by the non-normal distribution of the data.
The Kruskal-Wallis test is the nonparametric alternative to the one-way ANOVA. Nonparametric means that the test doesn't assume your data come from a particular distribution. The H test is used when the assumptions for ANOVA aren't met (such as the assumption of normality). It is sometimes called the one-way ANOVA on ranks, as the ranks of the data values are used in the test rather than the actual data points.
H_test_miles_per_gassolin <- read_sav("D:/Data/data and other training stuff/SPSS_training data sets/H-test ,miles per gassolin.sav")
data333<-H_test_miles_per_gassolin
head(data333,10)
attach(data333)
data333$gassolin<-as.factor(data333$gassolin) # convert the grouping variable to a factor inside the data frame used in the formula
kruskal.test(miles~gassolin, data=data333)
Kruskal-Wallis rank sum test
data: miles by gassolin
Kruskal-Wallis chi-squared = 0.86831, df = 2, p-value = 0.6478
In the above output, the p-value of the test is 0.6478, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: the median miles covered using the three types of gasoline are not significantly different (p-value = 0.6478).
The paired samples Wilcoxon test (also known as the Wilcoxon signed-rank test) is a non-parametric alternative to the paired t-test, used to compare paired data. It is used when the data are not normally distributed. This tutorial describes how to compute the paired samples Wilcoxon test in R.
before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)
after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)
myData <- data.frame(
group = rep(c("before", "after"), each = 10),
weight = c(before, after)
)
head(myData,10)
attach(myData)
The following object is masked _by_ .GlobalEnv:
weight
The following object is masked from Pearson:
weight
wilcox.test(weight~group, data=myData)
Wilcoxon rank sum exact test
data: weight by group
W = 98, p-value = 0.0000433
alternative hypothesis: true location shift is not equal to 0
In the above output, the p-value of the test is 0.0000433, which is less than the significance level alpha = 0.05. We can conclude that the median weight of the mice before treatment is significantly different from the median weight after treatment (p-value = 4.33e-05). Note that, because paired = TRUE was not specified, this is the unpaired Wilcoxon rank sum test; the paired versions follow below.
result = wilcox.test(weight ~ group,
data = myData,
paired = TRUE,
alternative = "less")
print(result)
Wilcoxon signed rank exact test
data: weight by group
V = 55, p-value = 1
alternative hypothesis: true location shift is less than 0
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the median weight of the mice after treatment is less than the median weight before treatment (p-value = 1).
result2 = wilcox.test(weight ~ group,
data = myData,
paired = TRUE,
alternative = "greater")
print(result2)
Wilcoxon signed rank exact test
data: weight by group
V = 55, p-value = 0.0009766
alternative hypothesis: true location shift is greater than 0
In the above output, the p-value of the test is 0.0009766, which is less than the significance level alpha = 0.05. We can conclude that the median weight of the mice after treatment is significantly greater than the median weight before treatment (p-value = 0.0009766).
weight <-c(27.6,30.6,32.2,25.3,30.9,31.0,28.9,28.9,28.9,28.2)
print(weight)
[1] 27.6 30.6 32.2 25.3 30.9 31.0 28.9 28.9 28.9 28.2
result3 = wilcox.test(weight, mu = 25)
Warning in wilcox.test.default(weight, mu = 25): cannot compute exact p-value
with ties
print(result3)
Wilcoxon signed rank test with continuity correction
data: weight
V = 55, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
wilcox.test(myData$weight, mu = 25,
alternative = "less")
Wilcoxon signed rank exact test
data: myData$weight
V = 210, p-value = 1
alternative hypothesis: true location is less than 25
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the median of the combined weight measurements is less than 25 (p-value = 1). Note that this is a one-sample test of myData$weight against mu = 25, not a comparison of before and after treatment.
wilcox.test(myData$weight, mu = 25,
alternative = "greater")
Wilcoxon signed rank exact test
data: myData$weight
V = 210, p-value = 0.0000009537
alternative hypothesis: true location is greater than 25
In the above output, the p-value of the test is approximately 0.000001, which is far less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the median of the weight measurements is significantly greater than 25 (p-value ≈ 9.5e-07).
The sign test is used to test the null hypothesis that the median of a distribution is equal to some value. It can be used a) in place of a one-sample t-test b) in place of a paired t-test or c) for ordered categorical data where a numerical scale is inappropriate but where it is possible to rank the observations.
weight <-c(27.6,30.6,32.2,25.3,30.9,31.0,28.9,28.9,28.9,28.2)
print(weight)
[1] 27.6 30.6 32.2 25.3 30.9 31.0 28.9 28.9 28.9 28.2
library(BSDA)
Loading required package: lattice
Attaching package: 'BSDA'
The following object is masked _by_ '.GlobalEnv':
Pearson
The following object is masked from 'data222':
Groups
The following object is masked from 'my_tbl1':
Income
The following object is masked from 'Superstore_data':
Region
The following objects are masked from 'package:carData':
Vocab, Wool
The following object is masked from 'package:datasets':
Orange
result4 = SIGN.test(weight, md = 25) # the hypothesised median is passed via the md argument
print(result4)
One-sample Sign-Test
data: weight
s = 10, p-value = 0.001953
alternative hypothesis: true median is not equal to 25
95 percent confidence interval:
27.79467 30.96756
sample estimates:
median of x
28.9
Achieved and Interpolated Confidence Intervals:
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.8906 28.2000 30.9000
Interpolated CI 0.9500 27.7947 30.9676
Upper Achieved CI 0.9785 27.6000 31.0000
In the above output, the p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true median weight is not equal to 25; the sample median is 28.9.
before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)
after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)
myDatta<-data.frame(before,after)
head(myDatta,10)
SIGN.test(before,after,md=0, alternative ="greater",paired=T, data=myDatta)
Dependent-samples Sign-Test
data: before and after
S = 0, p-value = 1
alternative hypothesis: true median difference is greater than 0
95 percent confidence interval:
-206.608 Inf
sample estimates:
median of x-y
-176.2
Achieved and Interpolated Confidence Intervals:
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.9453 -202.800 Inf
Interpolated CI 0.9500 -206.608 Inf
Upper Achieved CI 0.9893 -238.500 Inf
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true median difference between weight before and after treatment is greater than zero (p-value = 1).
SIGN.test(before,after,md=0, alternative ="less",paired=T, data=myDatta)
Dependent-samples Sign-Test
data: before and after
S = 0, p-value = 0.0009766
alternative hypothesis: true median difference is less than 0
95 percent confidence interval:
-Inf -119.7507
sample estimates:
median of x-y
-176.2
Achieved and Interpolated Confidence Intervals:
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.9453 -Inf -122.3000
Interpolated CI 0.9500 -Inf -119.7507
Upper Achieved CI 0.9893 -Inf -98.4000
In the above output, the p-value of the test is 0.0009766, which is less than the significance level alpha = 0.05. We can conclude that the true median difference between weight before and after treatment is less than zero (p-value = 0.0009766).
SIGN.test(before,after,md=0, alternative ="two.sided",paired=T, data=myDatta)
Dependent-samples Sign-Test
data: before and after
S = 0, p-value = 0.001953
alternative hypothesis: true median difference is not equal to 0
95 percent confidence interval:
-226.9173 -106.1542
sample estimates:
median of x-y
-176.2
Achieved and Interpolated Confidence Intervals:
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.8906 -202.8000 -122.3000
Interpolated CI 0.9500 -226.9173 -106.1542
Upper Achieved CI 0.9785 -238.5000 -98.4000
In the above output, the p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We can conclude that the true median difference between weight before and after is not equal to zero with a p-value = 0.001953.
The Mann-Whitney U test is used to compare differences between two independent groups when the dependent variable is either ordinal or continuous, but not normally distributed. For example, you could use the Mann-Whitney U test to understand whether attitudes towards pay discrimination, where attitudes are measured on an ordinal scale, differ based on gender (i.e., your dependent variable would be “attitudes towards pay discrimination” and your independent variable would be “gender”, which has two groups: “male” and “female”). Alternately, you could use the Mann-Whitney U test to understand whether salaries, measured on a continuous scale, differed based on educational level (i.e., your dependent variable would be “salary” and your independent variable would be “educational level”, which has two groups: “high school” and “university”). The Mann-Whitney U test is often considered the nonparametric alternative to the independent t-test although this is not always the case.
Unlike the independent-samples t-test, the Mann-Whitney U test allows you to draw different conclusions about your data depending on the assumptions you make about your data’s distribution. These conclusions can range from simply stating whether the two populations differ through to determining if there are differences in medians between groups. These different conclusions hinge on the shape of the distributions of your data, which we explain more about later.
red_bulb <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8)
orange_bulb <- c(47.8, 60, 63.4, 76, 89.4, 67.3, 61.3, 62.4)
BULB_PRICE = c(red_bulb, orange_bulb)
BULB_TYPE = rep(c("red", "orange"), each = 8)
DATASET <- data.frame(BULB_TYPE, BULB_PRICE, stringsAsFactors = TRUE)
DATASET
RSLT<-wilcox.test(BULB_PRICE~ BULB_TYPE,
data = DATASET,
exact = FALSE)
RSLT
Wilcoxon rank sum test with continuity correction
data: BULB_PRICE by BULB_TYPE
W = 44.5, p-value = 0.2072
alternative hypothesis: true location shift is not equal to 0
Here the p-value is 0.2072, which is greater than the significance level of 0.05, so we fail to reject the null hypothesis. We can conclude that the distribution of prices of red and orange bulbs is the same; that is, prices for both orange and red bulbs can be assumed to come from the same distribution.
Spearman’s rank correlation is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.
My_data<-read.csv("C:\\Users\\user\\Downloads\\gapminder.csv")
head(My_data,10)
attach(My_data)
The following objects are masked from gapminder:
continent, country, gdpPercap, lifeExp, pop, year
The following object is masked from fdi_data:
year
The following object is masked from research_data:
year
The following object is masked from unemploymnt_data (pos = 23):
year
The following object is masked from unemploymnt_data (pos = 24):
year
res <- cor.test(gdpPercap, lifeExp, method = "spearman")
Warning in cor.test.default(gdpPercap, lifeExp, method = "spearman"): Cannot
compute exact p-value with ties
res
Spearman's rank correlation rho
data: gdpPercap and lifeExp
S = 143096490, p-value < 0.00000000000000022
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8264712
S is the value of the test statistic (S = 143096490). The p-value of the test is extremely small (p-value < 0.00000000000000022). The alternative hypothesis is a character string describing the alternative hypothesis (true rho is not equal to 0). Sample estimates gives the correlation coefficient; for the Spearman correlation it is named rho (rho = 0.8265). From the results we therefore reject the null hypothesis and conclude that life expectancy and gdp per capita are dependent, and that the true rho is not equal to 0.
A matrix is a set of numbers arranged in rows and columns so as to form a rectangular array. The numbers are called the elements, or entries, of the matrix. Matrices have wide applications in engineering, physics, economics, and statistics, as well as in various branches of mathematics. Matrices also have important applications in computer graphics, where they have been used to represent rotations and other transformations of images.
A<-matrix(c(4,19,16,21,14,8,10,18,14),nrow=3,ncol=3,byrow=T)
A
[,1] [,2] [,3]
[1,] 4 19 16
[2,] 21 14 8
[3,] 10 18 14
The determinant is a special number that can be calculated from a matrix. The matrix has to be square (same number of rows and columns) like this one:
Mat<-matrix(c(3,8,4,6),nrow=2,byrow = T)
Mat
[,1] [,2]
[1,] 3 8
[2,] 4 6
\[ \det(\mathrm{Mat}) = (3 \times 6) - (8 \times 4) = 18 - 32 = -14 \]
det(Mat)
[1] -14
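In general, the determinant of a 2 × 2 matrix is the product of the main-diagonal entries minus the product of the off-diagonal entries:
\[ \det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc \]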
mtrx <- matrix( c(-1, -3, -5, -7, -8, 9, 11, 13, 15),nrow = 3,byrow = T)
mtrx
[,1] [,2] [,3]
[1,] -1 -3 -5
[2,] -7 -8 9
[3,] 11 13 15
det(mtrx)
[1] -360
A<-matrix(c(4,19,16,21,14,8,10,18,14),nrow=3,ncol=3,byrow=T)
A
[,1] [,2] [,3]
[1,] 4 19 16
[2,] 21 14 8
[3,] 10 18 14
B<-matrix(c(161,191,123,165,125,155,100,200,155),nrow=3,ncol=3,byrow=T)
B
[,1] [,2] [,3]
[1,] 161 191 123
[2,] 165 125 155
[3,] 100 200 155
solve(B) # solve() with a single matrix argument returns the inverse of B
[,1] [,2] [,3]
[1,] 0.009121582 0.003927184 -0.011165601
[2,] 0.007905371 -0.009929774 0.003656479
[3,] -0.016085370 0.010278944 0.008937189
zapsmall(B%*%solve(B)) # B multiplied by its inverse gives the identity matrix; zapsmall() rounds tiny floating-point values to zero
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
A+B
[,1] [,2] [,3]
[1,] 165 210 139
[2,] 186 139 163
[3,] 110 218 169
B-A
[,1] [,2] [,3]
[1,] 157 172 107
[2,] 144 111 147
[3,] 90 182 141
A%*%B
[,1] [,2] [,3]
[1,] 5379 6339 5917
[2,] 6491 7361 5993
[3,] 5980 6960 6190
B/A
[,1] [,2] [,3]
[1,] 40.250000 10.052632 7.68750
[2,] 7.857143 8.928571 19.37500
[3,] 10.000000 11.111111 11.07143
M1<-matrix(c(16,121,166,121,129,177,163,135,198),nrow=3,ncol=3,byrow=T)
M1
[,1] [,2] [,3]
[1,] 16 121 166
[2,] 121 129 177
[3,] 163 135 198
M2<-matrix(c(100,170,188,131,175,178,151,199,502),nrow=3,ncol=3,byrow=T)
M2
[,1] [,2] [,3]
[1,] 100 170 188
[2,] 131 175 178
[3,] 151 199 502
S<-solve(A)
S
[,1] [,2] [,3]
[1,] -1.04 -0.44 1.44
[2,] 4.28 2.08 -6.08
[3,] -4.76 -2.36 6.86
solve(B)
[,1] [,2] [,3]
[1,] 0.009121582 0.003927184 -0.011165601
[2,] 0.007905371 -0.009929774 0.003656479
[3,] -0.016085370 0.010278944 0.008937189
cbind(M1,M2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 16 121 166 100 170 188
[2,] 121 129 177 131 175 178
[3,] 163 135 198 151 199 502
rbind(M1,M2)
[,1] [,2] [,3]
[1,] 16 121 166
[2,] 121 129 177
[3,] 163 135 198
[4,] 100 170 188
[5,] 131 175 178
[6,] 151 199 502
zapsmall(A%*%S)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
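Besides inverting a matrix, solve() can also solve a linear system directly, which is usually preferable to forming the inverse explicitly; a minimal sketch with an illustrative right-hand side b:
b <- c(1, 2, 3) # illustrative right-hand side
x <- solve(A, b) # solves A %*% x = b
zapsmall(A %*% x) # recovers b, up to floating-point rounding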
S<-matrix(c(2.879,10.002,-1.810,10.002,199.798,-5.627,-1.810,-5.627,3.628),nrow=3)
S
[,1] [,2] [,3]
[1,] 2.879 10.002 -1.810
[2,] 10.002 199.798 -5.627
[3,] -1.810 -5.627 3.628
S_inv<-solve(S)
S_inv
[,1] [,2] [,3]
[1,] 0.58648233 -0.022083814 0.258342724
[2,] -0.02208381 0.006065228 -0.001610437
[3,] 0.25834272 -0.001610437 0.402022712
zapsmall(S%*%S_inv)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
R<-matrix(c(4,3,4,5,8,4,7,8,2,5,3,8,9,5,5,6),nrow = 4)
R
[,1] [,2] [,3] [,4]
[1,] 4 8 2 9
[2,] 3 4 5 5
[3,] 4 7 3 5
[4,] 5 8 8 6
det(R)
[1] -71
solve(R)
[,1] [,2] [,3] [,4]
[1,] -0.87323944 1.5211268 1.6901408 -1.36619718
[2,] 0.33802817 -0.9436620 -0.4929577 0.69014085
[3,] 0.07042254 -0.1549296 -0.3943662 0.35211268
[4,] 0.18309859 0.1971831 -0.2253521 -0.08450704
zapsmall(R%*%solve(R))
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
my_mat<-matrix(c(25,22,18,26,12,16,24,20,26,28,30,28,12,
17,16,17,10,15,16,20,14,19,17,14),ncol=3,nrow = 8)
my_mat
[,1] [,2] [,3]
[1,] 25 26 10
[2,] 22 28 15
[3,] 18 30 16
[4,] 26 28 20
[5,] 12 12 14
[6,] 16 17 19
[7,] 24 16 17
[8,] 20 17 14
colnames(my_mat) <- c("sales", "profit","quantity")
rownames(my_mat)<- c("John","Chang","Michael","Bryan","Jose","McTon","Jayden","Jane")
my_mat
sales profit quantity
John 25 26 10
Chang 22 28 15
Michael 18 30 16
Bryan 26 28 20
Jose 12 12 14
McTon 16 17 19
Jayden 24 16 17
Jane 20 17 14
my_mat<-as.data.frame(my_mat)
my_mat
Flour<-c(230,181,165,150,97,182,191,199,172,170)
Cooking_oil<-c(125,99,97,115,120,100,80,90,95,125)
Kerosene<-c(200,55,105,85,125,150,85,120,110,130)
Rice<-c(109,107,98,71,82,103,111,93,86,78)
data_frame<-data.frame(Flour,Cooking_oil,Kerosene,Rice)
data_frame
cov(data_frame)
Flour Cooking_oil Kerosene Rice
Flour 1196.4556 -127.3556 469.94444 309.60000
Cooking_oil -127.3556 244.2667 316.22222 -101.75556
Kerosene 469.9444 316.2222 1589.16667 69.77778
Rice 309.6000 -101.7556 69.77778 197.06667
var(data_frame) # for a data frame, var() returns the same covariance matrix as cov()
Flour Cooking_oil Kerosene Rice
Flour 1196.4556 -127.3556 469.94444 309.60000
Cooking_oil -127.3556 244.2667 316.22222 -101.75556
Kerosene 469.9444 316.2222 1589.16667 69.77778
Rice 309.6000 -101.7556 69.77778 197.06667
mean(data_frame$Flour)
[1] 173.7
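The covariance matrix above can be rescaled to a correlation matrix; one way, not part of the original code, is cov2cor(), which agrees with calling cor() on the data frame:
# Correlation matrix corresponding to the covariance matrix above
cov2cor(cov(data_frame))
cor(data_frame) # same result, computed directly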
income<-c(25,26,32,31,25,26,24,36,21,26,21,23,25)
expenditure<-c(15,16,14,12,20,16,19,18,14,13,19,18,17)
investment<-c(10,12,14,12,12,14,13,16,21,14,15,19,17)
data.frame(income,expenditure,investment)
Regression_model=lm(expenditure~income+investment)
summary(Regression_model)
Call:
lm(formula = expenditure ~ income + investment)
Residuals:
Min 1Q Median 3Q Max
-3.5335 -1.4506 -0.2696 1.9866 3.5648
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.284694 6.894461 2.942 0.0147 *
income -0.150291 0.183800 -0.818 0.4326
investment -0.007682 0.259844 -0.030 0.9770
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.67 on 10 degrees of freedom
Multiple R-squared: 0.06582, Adjusted R-squared: -0.121
F-statistic: 0.3523 on 2 and 10 DF, p-value: 0.7115
stargazer(Regression_model, type="text")
===============================================
Dependent variable:
---------------------------
expenditure
-----------------------------------------------
income -0.150
(0.184)
investment -0.008
(0.260)
Constant 20.285**
(6.894)
-----------------------------------------------
Observations 13
R2 0.066
Adjusted R2 -0.121
Residual Std. Error 2.670 (df = 10)
F Statistic 0.352 (df = 2; 10)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
From the output above, neither income nor investment has a statistically significant effect on expenditure, and the overall F statistic of 0.352 (p-value = 0.7115) indicates that the model as a whole is not significant at the 5% level of significance.
In many cancer studies, the main outcome under assessment is the time to an event of interest. The generic name for the time is survival time, although it may be applied equally to the time ‘survived’ from complete remission to relapse or progression and to the time from diagnosis to death. If the event occurred in all individuals, many methods of analysis would be applicable. However, it is usual that at the end of follow-up some of the individuals have not had the event of interest, and thus their true time to event is unknown. Further, survival data are rarely Normally distributed, but are skewed and typically comprise many early events and relatively few late ones. It is these features of the data that make the special methods called survival analysis necessary.
This chapter introduces and explains the basic concepts of survival analysis. Most survival analyses in cancer journals use some or all of Kaplan–Meier (KM) plots, logrank tests, and Cox (proportional hazards) regression. We will discuss the background to, and interpretation of, each of these methods, as well as other approaches to analysis that deserve to be used more often. We first present the basic concepts of survival analysis, including how to produce and interpret survival curves and how to quantify and test survival differences between two or more groups of patients; multivariate analysis and some more advanced concepts are covered later. More detailed accounts of these methods can be found in books written specifically about survival analysis, for example, Collett (1994), Parmar and Machin (1995) and Kleinbaum (1996). Several introductory texts also describe the basics of survival analysis, for example, Altman (2003) and Piantadosi (1997).
In other words, survival analysis is a branch of statistics for analysing the expected duration of time until an event happens, such as death in biological organisms or failure in mechanical systems. It is a very useful topic with wide application: the same methods can be used to predict the survival of cancer patients, customer churn, students' dropout rates, or machine failures.
my_data<-ovarian
data(cancer, package="survival")
head(my_data,10)
tail(my_data,10)
fustat = 1 means the event (death) occurred; fustat = 0 means no event (the observation is censored). When fitting a model, make sure you know what 0 and 1 code for. The time recorded here is survival after diagnosis.
futime: survival or censoring time
fustat: censoring status
age: in years
resid.ds: residual disease present (1=no,2=yes)
rx: treatment group
ecog.ps: ECOG performance status (1 is better, see reference)
A patient is scored as censored if he or she did not suffer the outcome of interest. In survival analysis, patients who do not have an “event” during a specified period are said to have censored observation.
names(my_data)
[1] "futime" "fustat" "age" "resid.ds" "rx" "ecog.ps"
attach(my_data)
The Kaplan-Meier estimate is one of the best options for measuring the fraction of subjects living for a certain amount of time after treatment. In clinical trials or community trials, the effect of an intervention is assessed by measuring the number of subjects who survived or were saved after that intervention over a period of time. The time starting from a defined point to the occurrence of a given event, for example death, is called the survival time, and the analysis of such grouped data is called survival analysis. This can be complicated by subjects who are uncooperative and refuse to remain in the study, by subjects who do not experience the event or die before the end of the study although they would have done so had observation continued, or by subjects with whom we lose touch midway through the study. We label these situations censored observations. The Kaplan-Meier estimate is the simplest way of computing survival over time in spite of all these difficulties. The survival curve can be created assuming various situations. It involves computing the probabilities of occurrence of the event at certain points in time and multiplying these successive probabilities by any earlier computed probabilities to get the final estimate. This can be calculated for two groups of subjects, along with the statistical difference between their survival curves. This can be used in Ayurveda research when comparing two drugs and looking at the survival of subjects.
km.model<-survfit(Surv(futime,fustat)~1,
type="kaplan-meier")
Since we have no x-variable, we must put a 1 on the right-hand side of the formula: it is just survival, with no variable to relate it to. In the command above we specified the type explicitly; however, if we do not specify the type, survfit fits the Kaplan-Meier model by default.
km.model
Call: survfit(formula = Surv(futime, fustat) ~ 1, type = "kaplan-meier")
n events median 0.95LCL 0.95UCL
[1,] 26 12 638 464 NA
This gives the MEDIAN survival time and a confidence interval (CI) for it. From the results above, we have 26 individuals and 12 events. The estimated median survival time is 638 days; that is, roughly half of the patients are estimated to survive beyond 638 days.
summary(km.model)
Call: survfit(formula = Surv(futime, fustat) ~ 1, type = "kaplan-meier")
time n.risk n.event survival std.err lower 95% CI upper 95% CI
59 26 1 0.962 0.0377 0.890 1.000
115 25 1 0.923 0.0523 0.826 1.000
156 24 1 0.885 0.0627 0.770 1.000
268 23 1 0.846 0.0708 0.718 0.997
329 22 1 0.808 0.0773 0.670 0.974
353 21 1 0.769 0.0826 0.623 0.949
365 20 1 0.731 0.0870 0.579 0.923
431 17 1 0.688 0.0919 0.529 0.894
464 15 1 0.642 0.0965 0.478 0.862
475 14 1 0.596 0.0999 0.429 0.828
563 12 1 0.546 0.1032 0.377 0.791
638 11 1 0.497 0.1051 0.328 0.752
From the summary table above, at time 59 days, 26 individuals were at risk and one event occurred; in other words, an individual died. Beyond 59 days, the estimated probability of an individual surviving is 96.2%. Further, we are 95% confident that the chance of survival beyond 59 days is between 89% and 100%. At time 115 days, 25 individuals were at risk and one died. The probability of survival beyond 115 days is 92.3%, with a standard error of 0.0523, and we are 95% confident that the chance of survival beyond 115 days is between 82.6% and 100%.
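If survival estimates are wanted at specific follow-up times rather than at every event time, summary() on a survfit object accepts a times argument; for example (output not shown here):
# Kaplan-Meier survival estimates at selected follow-up times (in days)
summary(km.model, times = c(180, 365, 730))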
plot(km.model,conf.int = F,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model")
In the command above, we set conf.int = F when we do not want the confidence interval drawn in the plot. Consider the plot below with the confidence interval included.
plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1)
The graph above shows why the median survival time is 638 days: that is where the survival curve crosses a survival probability of 0.5.
plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1)
abline(h=0.5,col="red")
Remember that we can also include a ‘tick’ wherever there is a censored observation by using the mark.time argument.
plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1, mark.time = T)
The graph above shows every point where there was a censored observation.
cancer<-read.csv("C:\\Users\\user\\Downloads\\cancer.csv")
head(cancer,10)
tail(cancer,10)
km.model2<-survfit(Surv(futime,fustat)~over50,
type="kaplan-meier", data = cancer)
plot(km.model2,conf.int = F,xlab = "Time(days)",
ylab="%Alive=S(t)",main="K-M Model: Survival Analysis", las=1)
ggsurvplot(km.model2,data = cancer, pval = TRUE)
plot(km.model2, conf.int = F, xlab = "Time(days)", ylab = "Survival Probability",
     main = "K-M model: Survival Analysis", col = c("red", "blue"),
     las = 1, lwd = 2, mark.time = T)
## legend
legend(100, 0.2, legend = c("below50", "above50"), lty = 1, lwd = 2,
       col = c("red", "blue"), bty = "n", cex = 0.6)
km.model2
Call: survfit(formula = Surv(futime, fustat) ~ over50, data = cancer,
type = "kaplan-meier")
n events median 0.95LCL 0.95UCL
over50=0 6 1 NA NA NA
over50=1 20 11 563 431 NA
From the summary statistics above, six individuals were under 50 years of age, and only one event occurred among them. On the other hand, 20 individuals were at risk in the over-50 group, and 11 events occurred among them. This suggests there may be a relationship between age and survival, which we examine more formally below.
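A quick cross-tabulation makes the 1-out-of-6 versus 11-out-of-20 comparison explicit (a sketch assuming fustat = 1 marks an event):
table(over50 = cancer$over50, event = cancer$fustat)  # events and censorings by age group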
summary(km.model2)
Call: survfit(formula = Surv(futime, fustat) ~ over50, data = cancer,
type = "kaplan-meier")
over50=0
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  329      6       1    0.833   0.152        0.583        1.000
over50=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
59 20 1 0.950 0.0487 0.859 1.000
115 19 1 0.900 0.0671 0.778 1.000
156 18 1 0.850 0.0798 0.707 1.000
268 17 1 0.800 0.0894 0.643 0.996
353 16 1 0.750 0.0968 0.582 0.966
365 15 1 0.700 0.1025 0.525 0.933
431 12 1 0.642 0.1093 0.460 0.896
464 10 1 0.577 0.1157 0.390 0.855
475 9 1 0.513 0.1193 0.326 0.809
563 7 1 0.440 0.1227 0.255 0.760
638 6 1 0.367 0.1222 0.191 0.705
From the summary table above, at time 329 days, six individuals below 50 years were at risk and one event occurred; in other words, one individual died. Beyond 329 days, the estimated probability of survival for an individual below 50 years is 83.3%, and we are 95% confident that the probability of surviving beyond 329 days in this group lies between 58.3% and 100%. On the other hand, at time 59 days, 20 individuals above 50 years were at risk and one died. Beyond 59 days, the estimated probability of survival for an individual above 50 years is 95%, and we are 95% confident that the probability of surviving beyond 59 days in this group lies between 85.9% and 100%. Note that the tick marks in the plot above show censored observations.
From the graph, survival appears to differ across the two age categories: individuals below 50 years seem to have a higher survival probability than individuals above 50 years. Whether this difference is statistically significant is tested below with the log-rank test.
survdiff(Surv(futime,fustat)~over50, data = cancer)
Call:
survdiff(formula = Surv(futime, fustat) ~ over50, data = cancer)
N Observed Expected (O-E)^2/E (O-E)^2/V
over50=0 6 1 3.6 1.876 2.75
over50=1 20 11 8.4 0.804 2.75
Chisq= 2.7 on 1 degrees of freedom, p= 0.1
From the inferential results above, there is no evidence to reject the null hypothesis. In other words, there is no statistically significant evidence that survival differs between the two groups, as shown by the p-value of 0.1, which is greater than 0.05.
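As a side note, the reported p-value is simply the upper tail of a chi-square distribution with 1 degree of freedom evaluated at the log-rank statistic:
pchisq(2.7, df = 1, lower.tail = FALSE)  # about 0.10, matching the survdiff output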
The Cox proportional-hazards model (Cox, 1972) is essentially a regression model, commonly used in medical research, for investigating the association between the survival time of patients and one or more predictor variables.
In the previous chapter (survival analysis basics), we described the basic concepts of survival analysis and methods for analyzing and summarizing survival data, including Kaplan-Meier curves and the log-rank test.
These methods - Kaplan-Meier curves and log-rank tests - are examples of univariate analysis. They describe survival according to one factor under investigation but ignore the impact of any others. Additionally, Kaplan-Meier curves and log-rank tests are useful only when the predictor variable is categorical (e.g. treatment A vs treatment B; males vs females). They do not work easily for quantitative predictors such as gene expression, weight, or age. An alternative is Cox proportional hazards regression, which works for both quantitative and categorical predictor variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time.
The Cox proportional hazards model is called a semi-parametric model because it makes no assumptions about the shape of the baseline hazard function. There are, however, other assumptions (i.e., independence of observations, changes in predictors produce proportional changes in the hazard regardless of time, and a linear association between the natural logarithm of the relative hazard and the predictors). There are other regression models used in survival analysis that assume specific distributions for the survival times, such as the exponential, Weibull, Gompertz and log-normal distributions. The exponential regression survival model, for example, assumes that the hazard function is constant; other distributions assume that the hazard increases over time, decreases over time, or increases initially and then decreases. The example below illustrates estimation of a Cox proportional hazards regression model and discusses the interpretation of the regression coefficients.
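For comparison, such fully parametric survival models can be fitted with survreg() from the survival package. The sketch below reuses the cancer variables from earlier and picks the Weibull distribution purely for illustration; dist = "exponential" would impose a constant hazard instead.
weib.model <- survreg(Surv(futime, fustat) ~ over50,
                      data = cancer, dist = "weibull")
summary(weib.model)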
data("lung")
Warning in data("lung"): data set 'lung' not found
head(lung)
tail(lung,5)
names(lung)
[1] "inst" "time" "status" "age" "sex" "ph.ecog"
[7] "ph.karno" "pat.karno" "meal.cal" "wt.loss"
The data set above is a built-in data set (from the survival package) that can be used to fit a Cox proportional hazards model; in this case, however, we are going to use the data below.
cancer<-read.csv("C:\\Users\\user\\Downloads\\cancer.csv")
head(cancer,10)
cancer$over50<-as.factor(cancer$over50)
cancer$ecog.ps<-as.factor(cancer$ecog.ps)
attach(cancer)
The following object is masked _by_ .GlobalEnv:
gender
The following objects are masked from my_data:
age, ecog.ps, fustat, futime, resid.ds, rx
cox.model<-coxph(Surv(futime,fustat)~over50+ecog.ps)
summary(cox.model)
Call:
coxph(formula = Surv(futime, fustat) ~ over50 + ecog.ps)
n= 26, number of events= 12
coef exp(coef) se(coef) z Pr(>|z|)
over501 1.5356 4.6442 1.0541 1.457 0.145
ecog.ps2 0.2628 1.3006 0.5893 0.446 0.656
exp(coef) exp(-coef) lower .95 upper .95
over501 4.644 0.2153 0.5884 36.655
ecog.ps2 1.301 0.7689 0.4098 4.128
Concordance= 0.603 (se = 0.081 )
Likelihood ratio test= 3.64 on 2 df, p=0.2
Wald test = 2.46 on 2 df, p=0.3
Score (logrank) test = 2.95 on 2 df, p=0.2
Remember, the Cox proportional hazards model does not have an intercept coefficient.
At any given time, someone who is above 50 years is 4.6442 times as likely to die as someone who is under 50 years, adjusting for ECOG performance status. ECOG performance status describes a patient's level of functioning in terms of their ability to care for themselves, daily activity, and physical ability (walking, working, etc.); researchers worldwide consider the ECOG performance status scale when planning cancer clinical trials to study new treatments. The quantity (4.6442 - 1) indicates that someone who is over 50 years has a 364.42% higher hazard of death than someone who is under 50 years, adjusted for ECOG performance status. In other words, if we pick two individuals with the same ECOG performance status, the one above 50 years has a 364.42% higher hazard of dying than the one below 50 years. Conversely, someone who is below 50 years is 0.2153 times as likely to die as someone who is above 50 years, adjusting for ECOG performance status.
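The hazard ratios and confidence intervals quoted above can also be extracted directly from the fitted model, for example:
exp(coef(cox.model))     # hazard ratios, e.g. about 4.64 for over50
exp(confint(cox.model))  # 95% confidence intervals on the hazard-ratio scale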
The likelihood ratio, Wald, and score (log-rank) tests at the bottom of the output assess the overall significance of the model. The concordance statistic is a goodness-of-fit statistic for survival analysis; in logistic regression, the equivalent quantity is the area under the ROC curve.
cox.model<-coxph(Surv(futime,fustat)~over50+ecog.ps)
cox.model2<-coxph(Surv(futime,fustat)~over50)
The likelihood ratio test below aims to establish whether dropping ECOG performance status significantly changes the fit of the model. Consider the command below.
anova(cox.model2,cox.model,test = "LRT")
From the results above (p-value > 0.05), we can drop ECOG performance status without significant loss of predictive power from the model. In other words, ECOG performance status is not very important in this model.
data("lung")
Warning in data("lung"): data set 'lung' not found
head(lung)
write.csv(lung, "C:\\Users\\user\\Downloads\\lung.csv", row.names=FALSE)
# lung_data.csv is assumed to be a separately prepared copy of the lung data with an added over50 indicator
lung_data<-read.csv("C:\\Users\\user\\Downloads\\lung_data.csv")
attach(lung_data)
The following objects are masked from cancer:
age, over50
The following object is masked from my_data:
age
head(lung_data,10)
cox.model3<-coxph(Surv(time,status)~over50+sex)
summary(cox.model3)
Call:
coxph(formula = Surv(time, status) ~ over50 + sex)
n= 228, number of events= 165
coef exp(coef) se(coef) z Pr(>|z|)
over50 0.2803 1.3236 0.3003 0.933 0.35057
sex -0.5270 0.5904 0.1672 -3.151 0.00162 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
over50 1.3236 0.7555 0.7347 2.3844
sex 0.5904 1.6938 0.4254 0.8193
Concordance= 0.585 (se = 0.022 )
Likelihood ratio test= 11.58 on 2 df, p=0.003
Wald test = 10.94 on 2 df, p=0.004
Score (logrank) test = 11.19 on 2 df, p=0.004
At any given time, someone who is above 50 years is 1.3236 times as likely to die as someone who is under 50 years, adjusting for gender. The quantity (1.3236 - 1) indicates that someone who is over 50 years has a 32.36% higher hazard of death than someone who is under 50 years, adjusted for gender. In other words, if we pick two individuals of the same gender, the one above 50 years has a 32.36% higher hazard of dying than the one below 50 years. Conversely, someone who is below 50 years is 0.7555 times as likely to die as someone who is above 50 years, adjusting for gender.
As before, the likelihood ratio, Wald, and score tests assess the overall significance of the model, and the concordance statistic is the goodness-of-fit measure for survival analysis (the analogue of the area under the ROC curve in logistic regression).
cox.model3<-coxph(Surv(time,status)~over50+sex)
cox.model4<-coxph(Surv(time,status)~over50)
The likelihood ratio test below aims to establish whether dropping gender significantly changes the fit of the model. Consider the command below.
anova(cox.model4,cox.model3,test = "LRT")
From the results above (p-value < 0.05), we cannot drop the X-variable sex from the model; doing so would lead to a loss of predictive power. In other words, the X-variable sex is important in this model.
Now that we have worked with categorical X-variables, let us add numeric X-variables to the model.
cox.model5<-coxph(Surv(time,status)~age+wt.loss)
summary(cox.model5)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.021697 1.021934 0.009627 2.254 0.0242 *
wt.loss 0.001339 1.001340 0.006239 0.215 0.8301
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.022 0.9785 1.0028 1.041
wt.loss 1.001 0.9987 0.9892 1.014
Concordance= 0.563 (se = 0.026 )
Likelihood ratio test= 5.27 on 2 df, p=0.07
Wald test = 5.12 on 2 df, p=0.08
Score (logrank) test = 5.14 on 2 df, p=0.08
At any given instant in time, the hazard of dying for someone who is one year older is about 2% higher than for someone who is one year younger, adjusting for weight loss in the past six months. In other words, if we pick two individuals with similar weight loss in the past six months, the hazard of dying for the one who is one year older is about 2% higher than for the one who is one year younger.
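Because the per-year effect compounds multiplicatively, the hazard ratio for a larger age gap is obtained by scaling the coefficient before exponentiating; for example, for a ten-year age difference:
exp(10 * coef(cox.model5)["age"])  # about 1.24, i.e. roughly a 24% higher hazard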
The Cox proportional hazards model makes two assumptions:
* survival curves for different strata must have hazard functions that are proportional over the time t;
* the relationship between the log hazard and each covariate is linear, which can be verified with residual plots.
Covariates can be categorical, such as race or treatment group, or continuous, such as biomarker concentrations.
If you recall, we assume that the relationship between the covariates (X-variables) and the log hazard is linear. We can check this assumption in the same way we check linearity in a linear, logistic, or Poisson regression model: by looking at a residual plot, with the predicted values on the x-axis and the residuals on the y-axis.
plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
xlab="fitted values",ylab="Martingale residuals",
main="Residual Plot",las=1)
plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
xlab="fitted values",ylab="Martingale residuals",
main="Residual Plot",las=1)
abline(h=0)
plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
xlab="fitted values",ylab="Martingale residuals",
main="Residual Plot",las=1)
abline(h=0)
lines(smooth.spline(predict(cox.model5),residuals(cox.model5,type = "martingale")),col="red")
plot(predict(cox.model5),residuals(cox.model5,type = "deviance"))
plot(predict(cox.model5),residuals(cox.model5,type = "deviance"))
abline(h=0)
lines(smooth.spline(predict(cox.model5),
residuals(cox.model5,type = "deviance")),col="red")
From the plots above, the smoothed residual curve stays close to zero, suggesting the relationship between the covariates (X-variables) and the log hazard is approximately linear.
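If you want a more formal check than the residual plot, one option (a sketch, not part of the original analysis) is to let age enter the model through a penalized spline via pspline() from the survival package and inspect the test for the non-linear part:
cox.spline <- coxph(Surv(time, status) ~ pspline(age) + wt.loss, data = lung_data)
cox.spline  # the "nonlin" line tests for departure from linearity in age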
This assumption is tested using the Schoenfeld test, which has the following hypotheses: the null hypothesis is that the hazards are proportional (the coefficient does not change over time), and the alternative hypothesis is that the hazards are not proportional. Performing this test returns a test for each X-variable and for the overall model.
cox.zph(cox.model5)
chisq df p
age 0.6910 1 0.41
wt.loss 0.0121 1 0.91
GLOBAL 0.7209 2 0.70
The p-values for the covariates and for the overall model are greater than 0.05, so there is no statistically significant evidence to reject the null hypothesis. We therefore conclude that the hazards can be assumed proportional over time.
We can also plot these tests (one plot for each parameter). These are plots of the estimated coefficient "b" allowed to vary over time: if "b" changes over time, the proportional hazards assumption does not hold, because the effect is not constant over time. However, pay less attention to the extremes of the time axis, as the smoothed line is sensitive there.
plot(cox.zph(cox.model5))
The smooth line in the plot tells us how the hazard coefficient would change if we allowed it to change over time. The dotted lines on either side of the solid line are the confidence limits. Let us add a horizontal reference line through zero.
plot(cox.zph(cox.model5)[1], main="Plot of Hazards")
abline(h=0,col="red")
From the graph above, the reference line (abline) lies within the confidence bounds for at least a third of the time. However, that is not enough to show that the coefficient for age is not changing over time, so the proportional hazards assumption seems only partially met for age.
plot(cox.zph(cox.model5)[2], main="Plot of Hazards")
abline(h=0,col="red")
From the graph above, the reference line (abline) lies within the confidence bounds throughout the follow-up time. This is a clear indication that the coefficient for weight loss is not changing over time; in other words, the proportional hazards assumption is fully met for weight loss.