Set up RStudio

Setting up RMarkdown enables you to create dynamic, reproducible, and visually appealing reports, presentations, and documents that can help you communicate your data analysis and research findings more effectively.

INTRODUCTION

In terms of selecting a statistical test, the most important question is “what is the main study hypothesis?” In some cases there is no hypothesis; the investigator just wants to “see what is there”. For example, in a prevalence study there is no hypothesis to test, and the size of the study is determined by how accurately the investigator wants to determine the prevalence. If there is no hypothesis, then there is no statistical test. It is important to decide a priori which hypotheses are confirmatory (that is, are testing some presupposed relationship) and which are exploratory (are suggested by the data). No single study can support a whole series of hypotheses, so a sensible plan is to severely limit the number of confirmatory hypotheses. Although it is valid to use statistical tests on hypotheses suggested by the data, the P values should be used only as guidelines, and the results treated as tentative until confirmed by subsequent studies. A useful guide is the Bonferroni correction, which states simply that if one is testing n independent hypotheses, one should use a significance level of 0.05/n. Thus if there were two independent hypotheses, a result would be declared significant only if P<0.025. Note that, since tests are rarely independent, this is a very conservative procedure – i.e. one that is unlikely to reject the null hypothesis.

The investigator should then ask “are the data independent?” This can be difficult to decide, but as a rule of thumb, results on the same individual, or from matched individuals, are not independent. Thus results from a crossover trial, or from a case-control study in which the controls were matched to the cases by age, sex and social class, are not independent.
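
As a small numeric illustration of the Bonferroni rule mentioned above, the sketch below applies the 0.05/n threshold to a set of made-up p-values (the values and the number of hypotheses are assumptions chosen purely for illustration).

p_values <- c(0.012, 0.030, 0.004, 0.210)   # hypothetical p-values from 4 confirmatory hypotheses
n_tests  <- length(p_values)
alpha_bonferroni <- 0.05 / n_tests          # Bonferroni-corrected significance level (0.0125 here)
p_values < alpha_bonferroni                 # which results survive the correction
p.adjust(p_values, method = "bonferroni")   # equivalent: adjust the p-values and compare them to 0.05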

Parametric Tests

Parametric tests assume a normal distribution of values, or a “bell-shaped curve.” For example, height is roughly normally distributed: if you were to graph the heights of a group of people, you would see a typical bell-shaped curve. This distribution is also called a Gaussian distribution. Parametric tests are in general more powerful (require a smaller sample size) than nonparametric tests. Parametric tests include the following.

One sample t-test

The One Sample t Test examines whether the mean of a population is statistically different from a known or hypothesized value. It is a parametric test, also known as the single sample t test. The variable used in this test is known as the test variable. In a One Sample t Test, the test variable’s mean is compared against a “test value”, which is a known or hypothesized value of the mean in the population. Test values may come from a literature review, a trusted research organization, legal requirements, or industry standards. For example:

1. A particular factory’s machines are supposed to fill bottles with 150 milliliters of product. A plant manager wants to test a random sample of bottles to ensure that the machines are not under- or over-filling the bottles (a simulated version of this example is sketched below).

2. The United States Environmental Protection Agency (EPA) sets clearance levels for the amount of lead present in homes: no more than 10 micrograms per square foot on floors and no more than 100 micrograms per square foot on window sills (as of December 2020). An inspector wants to test whether samples taken from units in an apartment building exceed the clearance levels.
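
Before applying the test to the Superstore data later in this document, here is a minimal sketch of the bottle-filling example with simulated fill volumes (the sample size, mean, and standard deviation are assumptions made purely for illustration).

set.seed(42)
fill_ml <- rnorm(30, mean = 151, sd = 2)   # hypothetical fill volumes for 30 sampled bottles
t.test(fill_ml, mu = 150)                  # H0: the true mean fill volume is 150 ml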

Run the command below to remove the ## prefix that RMarkdown normally adds to printed output in the knitted document.

knitr::opts_chunk$set(comment = NA)

Load the required packages

library(haven)
library(tidyverse)
library(agricolae)
library(stargazer)
library(highcharter)
library(ggplot2)
library(ggthemes)
library(plotrix)
library(tseries)
library(vars)
library(olsrr)
library(car)
library(grid)
library(rmarkdown)
library(ggtext)
library(RColorBrewer)
library(rhandsontable)
library(forecast)
library(quantmod)
library(xts)
library(PerformanceAnalytics)
library(rugarch)
library(dplyr)
library(magrittr)
library(gapminder)
library(plyr)
library(survival)
library(survminer)
library(ggpubr)

Set up the script to remove scientific notation

Use the command below to ensure that the output values are not written in scientific notation.

options(scipen=999)

Optionally set a default ggplot2 theme. Each theme_set() call overrides the previous one, so only the last theme applied (theme_base() from the ggthemes package) remains in effect.

theme_set(theme_dark())
theme_set(theme_economist())
theme_set(theme_base())

Load the CSV data set

Superstore_data<-read.csv("C:\\Users\\user\\Downloads\\Superstore_data.csv")
attach(Superstore_data)

View the first and last few observations of your data set

head(Superstore_data,10)
   Row.ID       Order.ID Order.Date  Ship.Date      Ship.Mode Customer.ID
1       1 CA-2018-152156  11/8/2018 11/11/2018   Second Class    CG-12520
2       2 CA-2018-152156  11/8/2018 11/11/2018   Second Class    CG-12520
3       3 CA-2018-138688  6/12/2018  6/16/2018   Second Class    DV-13045
4       4 US-2017-108966 10/11/2017 10/18/2017 Standard Class    SO-20335
5       5 US-2017-108966 10/11/2017 10/18/2017 Standard Class    SO-20335
6       6 CA-2016-115812   6/9/2016  6/14/2016 Standard Class    BH-11710
7       7 CA-2016-115812   6/9/2016  6/14/2016 Standard Class    BH-11710
8       8 CA-2016-115812   6/9/2016  6/14/2016 Standard Class    BH-11710
9       9 CA-2016-115812   6/9/2016  6/14/2016 Standard Class    BH-11710
10     10 CA-2016-115812   6/9/2016  6/14/2016 Standard Class    BH-11710
     Customer.Name   Segment Country.Region            City      State
1      Claire Gute  Consumer  United States       Henderson   Kentucky
2      Claire Gute  Consumer  United States       Henderson   Kentucky
3  Darrin Van Huff Corporate  United States     Los Angeles California
4   Sean O'Donnell  Consumer  United States Fort Lauderdale    Florida
5   Sean O'Donnell  Consumer  United States Fort Lauderdale    Florida
6  Brosina Hoffman  Consumer  United States     Los Angeles California
7  Brosina Hoffman  Consumer  United States     Los Angeles California
8  Brosina Hoffman  Consumer  United States     Los Angeles California
9  Brosina Hoffman  Consumer  United States     Los Angeles California
10 Brosina Hoffman  Consumer  United States     Los Angeles California
   Postal.Code Region      Product.ID        Category Sub.Category
1        42420  South FUR-BO-10001798       Furniture    Bookcases
2        42420  South FUR-CH-10000454       Furniture       Chairs
3        90036   West OFF-LA-10000240 Office Supplies       Labels
4        33311  South FUR-TA-10000577       Furniture       Tables
5        33311  South OFF-ST-10000760 Office Supplies      Storage
6        90032   West FUR-FU-10001487       Furniture  Furnishings
7        90032   West OFF-AR-10002833 Office Supplies          Art
8        90032   West TEC-PH-10002275      Technology       Phones
9        90032   West OFF-BI-10003910 Office Supplies      Binders
10       90032   West OFF-AP-10002892 Office Supplies   Appliances
                                                       Product.Name    Sales
1                                 Bush Somerset Collection Bookcase 261.9600
2       Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
3         Self-Adhesive Address Labels for Typewriters by Universal  14.6200
4                     Bretford CR4500 Series Slim Rectangular Table 957.5775
5                                    Eldon Fold 'N Roll Cart System  22.3680
6  Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood  48.8600
7                                                        Newell 322   7.2800
8                                    Mitel 5320 IP Phone VoIP phone 907.1520
9              DXL Angle-View Binders with Locking Rings by Samsill  18.5040
10                                 Belkin F5C206VTEL 6 Outlet Surge 114.9000
   Quantity Discount    Profit
1         2     0.00   41.9136
2         3     0.00  219.5820
3         2     0.00    6.8714
4         5     0.45 -383.0310
5         2     0.20    2.5164
6         7     0.00   14.1694
7         4     0.00    1.9656
8         6     0.20   90.7152
9         3     0.20    5.7825
10        5     0.00   34.4700
tail(Superstore_data,10)
     Row.ID       Order.ID Order.Date  Ship.Date      Ship.Mode Customer.ID
9985   9985 CA-2017-100251  5/17/2017  5/23/2017 Standard Class    DV-13465
9986   9986 CA-2017-100251  5/17/2017  5/23/2017 Standard Class    DV-13465
9987   9987 CA-2018-125794  9/29/2018  10/3/2018 Standard Class    ML-17410
9988   9988 CA-2019-163629 11/17/2019 11/21/2019 Standard Class    RA-19885
9989   9989 CA-2019-163629 11/17/2019 11/21/2019 Standard Class    RA-19885
9990   9990 CA-2016-110422  1/21/2016  1/23/2016   Second Class    TB-21400
9991   9991 CA-2019-121258  2/26/2019   3/3/2019 Standard Class    DB-13060
9992   9992 CA-2019-121258  2/26/2019   3/3/2019 Standard Class    DB-13060
9993   9993 CA-2019-121258  2/26/2019   3/3/2019 Standard Class    DB-13060
9994   9994 CA-2019-119914   5/4/2019   5/9/2019   Second Class    CC-12220
        Customer.Name   Segment Country.Region        City      State
9985 Dianna Vittorini  Consumer  United States  Long Beach   New York
9986 Dianna Vittorini  Consumer  United States  Long Beach   New York
9987     Maris LaWare  Consumer  United States Los Angeles California
9988     Ruben Ausman Corporate  United States      Athens    Georgia
9989     Ruben Ausman Corporate  United States      Athens    Georgia
9990 Tom Boeckenhauer  Consumer  United States       Miami    Florida
9991      Dave Brooks  Consumer  United States  Costa Mesa California
9992      Dave Brooks  Consumer  United States  Costa Mesa California
9993      Dave Brooks  Consumer  United States  Costa Mesa California
9994     Chris Cortes  Consumer  United States Westminster California
     Postal.Code Region      Product.ID        Category Sub.Category
9985       11561   East OFF-LA-10003766 Office Supplies       Labels
9986       11561   East OFF-SU-10000898 Office Supplies     Supplies
9987       90008   West TEC-AC-10003399      Technology  Accessories
9988       30605  South TEC-AC-10001539      Technology  Accessories
9989       30605  South TEC-PH-10004006      Technology       Phones
9990       33180  South FUR-FU-10001889       Furniture  Furnishings
9991       92627   West FUR-FU-10000747       Furniture  Furnishings
9992       92627   West TEC-PH-10003645      Technology       Phones
9993       92627   West OFF-PA-10004041 Office Supplies        Paper
9994       92683   West OFF-AP-10002684 Office Supplies   Appliances
                                                                          Product.Name
9985                                                    Self-Adhesive Removable Labels
9986 Acme Hot Forged Carbon Steel Scissors with Nickel-Plated Handles, 3 7/8" Cut, 8"L
9987                               Memorex Mini Travel Drive 64 GB USB 2.0 Flash Drive
9988             Logitech G430 Surround Sound Gaming Headset with Dolby 7.1 Technology
9989                                                   Panasonic KX - TS880B Telephone
9990                                                            Ultra Door Pull Handle
9991                                Tenex B1-RE Series Chair Mats for Low Pile Carpets
9992                                                             Aastra 57i VoIP phone
9993                                 It's Hot Message Books with Stickers, 2 3/4" x 5"
9994         Acco 7-Outlet Masterpiece Power Center, Wihtout Fax/Phone Line Protection
       Sales Quantity Discount  Profit
9985  31.500       10      0.0 15.1200
9986  55.600        4      0.0 16.1240
9987  36.240        1      0.0 15.2208
9988  79.990        1      0.0 28.7964
9989 206.100        5      0.0 55.6470
9990  25.248        3      0.2  4.1028
9991  91.960        2      0.0 15.6332
9992 258.576        2      0.2 19.3932
9993  29.600        4      0.0 13.3200
9994 243.160        2      0.0 72.9480

Declare some variables as time series

Sales<-ts(Superstore_data$Sales,start=1265,frequency = 1)
Profit<-ts(Superstore_data$Profit,start = 1265,frequency = 1)

Plot the series

ts.plot(Sales, main="A time series plot of Sales from Superstore company",xlab="Time in years",ylab="Sales")
A time series plot of Sales from Superstore company

ts.plot(Profit, main="A time series plot of Profit from Superstore company",xlab="Time in years",ylab="Profit")
A time series plot of Profit from Superstore company

The One-Sample t-Test

What is the one-sample t-test?

The one-sample t-test is a statistical hypothesis test used to determine whether an unknown population mean is different from a specific value.

When can I use the test?

You can use the test for continuous data. Your data should be a random sample from a normal population.
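
A quick informal check of this assumption is to look at a histogram and a normal Q-Q plot of the variable; shapiro.test() can also be used, but it accepts at most 5,000 observations, so below it is applied to a random subsample. This is a sketch using the Sales variable loaded earlier (the subsample size and seed are arbitrary choices).

hist(Superstore_data$Sales, main = "Distribution of Sales", xlab = "Sales")
qqnorm(Superstore_data$Sales); qqline(Superstore_data$Sales)   # strong curvature away from the line indicates non-normality
set.seed(1)
shapiro.test(sample(Superstore_data$Sales, 5000))              # formal normality test on a random subsample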

What if my data isn’t nearly normally distributed?

If your sample sizes are very small, you might not be able to test for normality. You might need to rely on your understanding of the data. When you cannot safely assume normality, you can perform a nonparametric test that doesn’t assume normality.
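
For example, the one-sample Wilcoxon signed-rank test is a common nonparametric counterpart of the one-sample t-test. A minimal sketch against the same hypothesized value of 200 used below (this assumes the Superstore_data set loaded earlier):

wilcox.test(Superstore_data$Sales, mu = 200, alternative = "two.sided")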

The company assumes that from 2016 to 2019, it made average Sales of approximately 200. Test this hypothesis to determine whether the company’s management was right in this assumption.

t.test(Superstore_data$Sales,mu=200,alternative="two.sided",conf.level=0.95)

    One Sample t-test

data:  Superstore_data$Sales
t = 4.7893, df = 9993, p-value = 0.000001698
alternative hypothesis: true mean is not equal to 200
95 percent confidence interval:
 217.6375 242.0785
sample estimates:
mean of x 
  229.858 

Interpretation

In the output above, the p-value of the test is approximately 0.0000017, which is far below the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean of Sales from 2016 to 2019 is not equal to 200 at the 5% level of significance.

Independent T-Test

The Independent Samples t Test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. The Independent Samples t Test is a parametric test.

Sales per Region (West and East Regions)

data22<-Superstore_data %>%
  dplyr::select(Region, Sales)%>%
  filter(Region == "West"|
           Region == "East")

View the data set

head(data22,5)
  Region   Sales
1   West  14.620
2   West  48.860
3   West   7.280
4   West 907.152
5   West  18.504
tail(data22,5)
     Region   Sales
6047   West  36.240
6048   West  91.960
6049   West 258.576
6050   West  29.600
6051   West 243.160

Run the independent samples t-test

t.test(Sales ~ Region,alternative="two.sided",data = data22)

    Welch Two Sample t-test

data:  Sales by Region
t = 0.79611, df = 5603.9, p-value = 0.426
alternative hypothesis: true difference in means between group East and group West is not equal to 0
95 percent confidence interval:
 -17.31977  41.00552
sample estimates:
mean in group East mean in group West 
          238.3361           226.4932 

Interpretation

In the output above, the p-value of the test is 0.426, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true mean Sales differ between the West and East regions at the 5% level of significance.

Sales per Category (Technology and Furniture)

data23<-Superstore_data %>%
  dplyr::select(Category, Sales)%>%
  filter(Category == "Technology"|
           Category == "Furniture")

head(data23,5)
    Category    Sales
1  Furniture 261.9600
2  Furniture 731.9400
3  Furniture 957.5775
4  Furniture  48.8600
5 Technology 907.1520
tail(data23,5)
       Category   Sales
3964 Technology  79.990
3965 Technology 206.100
3966  Furniture  25.248
3967  Furniture  91.960
3968 Technology 258.576
t.test(Sales ~ Category,alternative="two.sided",data = data23)

    Welch Two Sample t-test

data:  Sales by Category
t = -3.6721, df = 2497.7, p-value = 0.0002456
alternative hypothesis: true difference in means between group Furniture and group Technology is not equal to 0
95 percent confidence interval:
 -157.8094  -47.9394
sample estimates:
 mean in group Furniture mean in group Technology 
                349.8349                 452.7093 

Interpretation

In the output above, the p-value of the test is approximately 0.0002, which is less than the significance level alpha = 0.05. We can conclude that the true mean difference in Sales between the Technology and Furniture categories is not equal to zero at the 5% level of significance.

Sales per Category (Technology and Office Supplies)

data24<-Superstore_data %>%
  dplyr::select(Category, Sales)%>%
  filter(Category == "Technology"|
           Category == "Office Supplies")

head(data24,5)
         Category   Sales
1 Office Supplies  14.620
2 Office Supplies  22.368
3 Office Supplies   7.280
4      Technology 907.152
5 Office Supplies  18.504
t.test(Sales ~ Category,alternative="two.sided",data = data24)

    Welch Two Sample t-test

data:  Sales by Category
t = -12.694, df = 1982.1, p-value < 0.00000000000000022
alternative hypothesis: true difference in means between group Office Supplies and group Technology is not equal to 0
95 percent confidence interval:
 -384.8897 -281.8807
sample estimates:
mean in group Office Supplies      mean in group Technology 
                     119.3241                      452.7093 

Interpretation

In the output above, the p-value of the test is far below 0.0001, which is less than the significance level alpha = 0.05. We can conclude that the true mean difference in Sales between the Technology and Office Supplies categories is not equal to zero at the 5% level of significance.

ONE-WAY ANALYSIS OF VARIANCE

One-Way ANOVA (“analysis of variance”) compares the means of two or more independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. One-Way ANOVA is a parametric test.

The company assumes that from 2016 to 2019, the average sales made across the three categories (Technology, Furniture, and Office Supplies) were equal. Test this assumption.

anova_model<-aov(Sales~Category)
summary(anova_model)
              Df     Sum Sq  Mean Sq F value              Pr(>F)    
Category       2  195881745 97940872   265.5 <0.0000000000000002 ***
Residuals   9991 3685743767   368906                                
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above, the F statistic is 265.5 with an associated p-value far below 0.0001, which is well under 0.05. This indicates that there is a significant difference in average Sales across the three categories at the 5% level of significance.

Post Hoc Test

Tukey’s HSD is a commonly used post hoc test to determine which pairs of categories differ in Sales from 2016 to 2019.

tukey_hsd<-TukeyHSD(anova_model)
tukey_hsd
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Sales ~ Category)

$Category
                                diff        lwr       upr     p adj
Office Supplies-Furniture  -230.5108 -266.45583 -194.5657 0.0000000
Technology-Furniture        102.8744   57.56303  148.1858 0.0000003
Technology-Office Supplies  333.3852  295.51937  371.2510 0.0000000

From the results above, a significant difference in average Sales exists between every pair of categories, as indicated by adjusted p-values below 0.05 for all pairs. These results can be visualized as shown below.

GGTukey.1<-function(tukey_hsd){
  A<-require("tidyverse")
  if(A==TRUE){
    library(tidyverse)
  } else {
    install.packages("tidyverse")
    library(tidyverse)
  }
  B<-as.data.frame(tukey_hsd[1])
  colnames(B)[2:4]<-c("min",
                      "max",
                      "p")
  C<-data.frame(id=row.names(B),
                min=B$min,
                max=B$max,
                idt=ifelse(B$p<0.05,
                           "significant",
                           "not significant")
                )
  D<-C%>%
    ggplot(aes(id,color=idt))+
    geom_errorbar(aes(ymin=min,
                      ymax=max),
                  width = 0.5,
                  size=1.25)+
    labs(x=NULL,
         color=NULL)+
    scale_color_manual(values=c("green",
                                "red")
                       )+
    coord_flip()+
    theme(text=element_text(family="TimesNewRoman"),
            title=element_text(color="black",size=15),
            axis.text = element_text(color="black",size=10),
            axis.title = element_text(color="black",size=10),
            panel.grid=element_line(color="grey75"),
            axis.line=element_blank(),
            plot.background=element_rect(fill="white",color="white"),
            panel.background=element_rect(fill="white"),
            panel.border = element_rect(colour = "black", fill = NA,size=0.59),
            legend.key= element_rect(color="white",fill="white")
          )
  return(D)
}
GGTukey.1(tukey_hsd)
Tukey’s honestly significant difference

plot(tukey_hsd,las =1)

The company also assumes that from 2016 to 2019, the average sales made across all seventeen sub-categories were equal. Test this assumption.

anova_model2<-aov(Sales~Sub.Category)
summary(anova_model2)
               Df     Sum Sq  Mean Sq F value              Pr(>F)    
Sub.Category   16  776253968 48515873   155.9 <0.0000000000000002 ***
Residuals    9977 3105371543   311253                                
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above, the F statistic is 155.9 with an associated p-value far below 0.0001, which is well under 0.05. This indicates that there is a significant difference in average Sales across the seventeen sub-categories at the 5% level of significance.

Post Hoc Test

tukey_hsd2<-TukeyHSD(anova_model2)
tukey_hsd2
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Sales ~ Sub.Category)

$Sub.Category
                                 diff          lwr          upr     p adj
Appliances-Accessories     14.7811064   -98.352371   127.914584 1.0000000
Art-Accessories          -181.9057697  -279.299223   -84.512317 0.0000000
Binders-Accessories       -82.4140438  -167.571615     2.743527 0.0707858
Bookcases-Accessories     287.8850290   142.479391   433.290667 0.0000000
Chairs-Accessories        316.3578159   212.227967   420.487665 0.0000000
Copiers-Accessories      1982.9670138  1738.872771  2227.061256 0.0000000
Envelopes-Accessories    -151.1068795  -290.643768   -11.569991 0.0187882
Fasteners-Accessories    -202.0378297  -350.263756   -53.811903 0.0003125
Furnishings-Accessories  -120.1489362  -213.413396   -26.884476 0.0010412
Labels-Accessories       -181.6715489  -304.305120   -59.037978 0.0000400
Machines-Accessories     1429.5787092  1236.717765  1622.439653 0.0000000
Paper-Accessories        -158.6905119  -245.436934   -71.944090 0.0000000
Phones-Accessories        155.2369304    60.389846   250.084015 0.0000021
Storage-Accessories        48.6159493   -47.347208   144.579107 0.9455486
Supplies-Accessories       29.6755961  -126.561825   185.913017 0.9999997
Tables-Accessories        432.8201673   304.435984   561.204351 0.0000000
Art-Appliances           -196.6868761  -309.258580   -84.115172 0.0000002
Binders-Appliances        -97.1951502  -199.365184     4.974884 0.0839321
Bookcases-Appliances      273.1039226   117.124105   429.083741 0.0000002
Chairs-Appliances         301.5767095   183.128706   420.024713 0.0000000
Copiers-Appliances       1968.1859073  1717.648678  2218.723136 0.0000000
Envelopes-Appliances     -165.8879859  -316.411895   -15.364077 0.0147424
Fasteners-Appliances     -216.8189361  -375.431134   -58.206739 0.0002927
Furnishings-Appliances   -134.9300426  -243.949139   -25.910946 0.0022497
Labels-Appliances        -196.4526554  -331.455976   -61.449334 0.0000641
Machines-Appliances      1414.7976027  1213.844256  1615.750949 0.0000000
Paper-Appliances         -173.4716183  -276.969665   -69.973572 0.0000009
Phones-Appliances         140.4558240    30.079770   250.831878 0.0013283
Storage-Appliances         33.8348429   -77.501726   145.171412 0.9998021
Supplies-Appliances        14.8944897  -151.229065   181.018045 1.0000000
Tables-Appliances         418.0390609   277.791414   558.286708 0.0000000
Binders-Art                99.4917259    15.081912   183.901540 0.0052743
Bookcases-Art             469.7907987   324.821821   614.759777 0.0000000
Chairs-Art                498.2635856   394.744359   601.782813 0.0000000
Copiers-Art              2164.8727835  1921.038405  2408.707162 0.0000000
Envelopes-Art              30.7988902  -108.282913   169.880694 0.9999974
Fasteners-Art             -20.1320600  -167.929659   127.665539 1.0000000
Furnishings-Art            61.7568335   -30.825370   154.339037 0.6491218
Labels-Art                  0.2342208  -121.881288   122.349730 1.0000000
Machines-Art             1611.4844789  1418.952537  1804.016420 0.0000000
Paper-Art                  23.2152578   -62.797221   109.227737 0.9999589
Phones-Art                337.1427001   242.966407   431.318994 0.0000000
Storage-Art               230.5217190   135.221496   325.821942 0.0000000
Supplies-Art              211.5813658    55.750251   367.412481 0.0003411
Tables-Art                614.7259370   486.836518   742.615356 0.0000000
Bookcases-Binders         370.2990728   233.250425   507.347720 0.0000000
Chairs-Binders            398.7718597   306.671058   490.872661 0.0000000
Copiers-Binders          2065.3810576  1826.170834  2304.591281 0.0000000
Envelopes-Binders         -68.6928357  -199.498322    62.112651 0.9265977
Fasteners-Binders        -119.6237859  -259.661129    20.413557 0.2055723
Furnishings-Binders       -37.7348924  -117.345140    41.875355 0.9698971
Labels-Binders            -99.2575051  -211.856461    13.341450 0.1629283
Machines-Binders         1511.9927530  1325.351719  1698.633787 0.0000000
Paper-Binders             -76.2764681  -148.140741    -4.412195 0.0244962
Phones-Binders            237.6509742   156.192387   319.109562 0.0000000
Storage-Binders           131.0299931    48.274572   213.785414 0.0000059
Supplies-Binders          112.0896399   -36.401652   260.580932 0.4181442
Tables-Binders            515.2342111   396.398061   634.070362 0.0000000
Chairs-Bookcases           28.4727869  -121.105104   178.050678 0.9999997
Copiers-Bookcases        1695.0819848  1428.412678  1961.751291 0.0000000
Envelopes-Bookcases      -438.9919085  -615.063096  -262.920721 0.0000000
Fasteners-Bookcases      -489.9228587  -672.956859  -306.888859 0.0000000
Furnishings-Bookcases    -408.0339652  -550.261879  -265.806052 0.0000000
Labels-Bookcases         -469.5565779  -632.558150  -306.555006 0.0000000
Machines-Bookcases       1141.6936801   920.954324  1362.433037 0.0000000
Paper-Bookcases          -446.5755409  -584.617062  -308.534020 0.0000000
Phones-Bookcases         -132.6480986  -275.918784    10.622587 0.1080946
Storage-Bookcases        -239.2690797  -383.281050   -95.257110 0.0000013
Supplies-Bookcases       -258.2094329  -447.789631   -68.629235 0.0003179
Tables-Bookcases          144.9351383   -22.435763   312.306039 0.1863844
Copiers-Chairs           1666.6091979  1420.006796  1913.211600 0.0000000
Envelopes-Chairs         -467.4646954  -611.344120  -323.585271 0.0000000
Fasteners-Chairs         -518.3956456  -670.716593  -366.074698 0.0000000
Furnishings-Chairs       -436.5067521  -536.151146  -336.862358 0.0000000
Labels-Chairs            -498.0293648  -625.582250  -370.476480 0.0000000
Machines-Chairs          1113.2208933   917.195157  1309.246630 0.0000000
Paper-Chairs             -475.0483278  -568.620158  -381.476498 0.0000000
Phones-Chairs            -161.1208855  -262.248108   -59.993663 0.0000049
Storage-Chairs           -267.7418666  -369.916586  -165.567147 0.0000000
Supplies-Chairs          -286.6822198  -446.809910  -126.554529 0.0000001
Tables-Chairs             116.4623514   -16.628761   249.553464 0.1723836
Envelopes-Copiers       -2134.0738932 -2397.589097 -1870.558690 0.0000000
Fasteners-Copiers       -2185.0048435 -2453.222376 -1916.787311 0.0000000
Furnishings-Copiers     -2103.1159499 -2345.330687 -1860.901213 0.0000000
Labels-Copiers          -2164.6385627 -2419.606624 -1909.670502 0.0000000
Machines-Copiers         -553.3883046  -848.625604  -258.151005 0.0000000
Paper-Copiers           -2141.6575257 -2381.437969 -1901.877082 0.0000000
Phones-Copiers          -1827.7300833 -2070.558600 -1584.901566 0.0000000
Storage-Copiers         -1934.3510645 -2177.617681 -1691.084447 0.0000000
Supplies-Copiers        -1953.2914176 -2226.018114 -1680.564721 0.0000000
Tables-Copiers          -1550.1468465 -1807.930124 -1292.363569 0.0000000
Fasteners-Envelopes       -50.9309502  -229.338317   127.476417 0.9999128
Furnishings-Envelopes      30.9579433  -105.264385   167.180272 0.9999962
Labels-Envelopes          -30.5646695  -188.353313   127.223974 0.9999996
Machines-Envelopes       1580.6855886  1363.767155  1797.604023 0.0000000
Paper-Envelopes            -7.5836324  -139.429015   124.261751 1.0000000
Phones-Envelopes          306.3438099   169.033094   443.654526 0.0000000
Storage-Envelopes         199.7228288    61.638829   337.806829 0.0000749
Supplies-Envelopes        180.7824756    -4.334771   365.899722 0.0644947
Tables-Envelopes          583.9270468   421.628674   746.225420 0.0000000
Furnishings-Fasteners      81.8888935   -63.221082   226.998869 0.8728069
Labels-Fasteners           20.3662808  -145.156039   185.888600 1.0000000
Machines-Fasteners       1631.6165388  1409.009286  1854.223792 0.0000000
Paper-Fasteners            43.3473178   -97.661856   184.356492 0.9997704
Phones-Fasteners          357.2747601   211.142577   503.406943 0.0000000
Storage-Fasteners         250.6537790   103.794754   397.512804 0.0000005
Supplies-Fasteners        231.7134258    39.961562   423.465289 0.0034553
Tables-Fasteners          634.8579970   465.031191   804.684803 0.0000000
Labels-Furnishings        -61.5226128  -180.371140    57.325914 0.9347411
Machines-Furnishings     1549.7276453  1359.251076  1740.204214 0.0000000
Paper-Furnishings         -38.5415757  -119.849148    42.765997 0.9698815
Phones-Furnishings        275.3858666   185.486205   365.285528 0.0000000
Storage-Furnishings       168.7648855    77.688504   259.841267 0.0000000
Supplies-Furnishings      149.8245323    -3.459881   303.108945 0.0639163
Tables-Furnishings        552.9691034   428.195395   677.742812 0.0000000
Machines-Labels          1611.2502581  1404.799158  1817.701358 0.0000000
Paper-Labels               22.9810370   -90.824299   136.786373 0.9999993
Phones-Labels             336.9084794   216.814007   457.002952 0.0000000
Storage-Labels            230.2874982   109.309647   351.265350 0.0000000
Supplies-Labels           211.3471451    38.613696   384.080594 0.0027717
Tables-Labels             614.4917162   466.474111   762.509322 0.0000000
Paper-Machines          -1588.2692211 -1775.640525 -1400.897917 0.0000000
Phones-Machines         -1274.3417787 -1465.598238 -1083.085319 0.0000000
Storage-Machines        -1380.9627599 -1572.775146 -1189.150374 0.0000000
Supplies-Machines       -1399.9031130 -1627.923278 -1171.882949 0.0000000
Tables-Machines          -996.7585419 -1206.676532  -786.840552 0.0000000
Phones-Paper              313.9274423   230.809266   397.045619 0.0000000
Storage-Paper             207.3064612   122.916951   291.695972 0.0000000
Supplies-Paper            188.3661080    38.957964   337.774252 0.0015901
Tables-Paper              591.5106792   471.530846   711.490513 0.0000000
Storage-Phones           -106.6209811  -199.317352   -13.924610 0.0078092
Supplies-Phones          -125.5613343  -279.813794    28.691125 0.2813586
Tables-Phones             277.5832369   151.622179   403.544295 0.0000000
Supplies-Storage          -18.9403532  -173.881566   136.000860 1.0000000
Tables-Storage            384.2042180   257.400644   511.007792 0.0000000
Tables-Supplies           403.1445712   226.282053   580.007090 0.0000000
GGTukey.2<-function(tukey_hsd2){
  A<-require("tidyverse")
  if(A==TRUE){
    library(tidyverse)
  } else {
    install.packages("tidyverse")
    library(tidyverse)
  }
  B<-as.data.frame(tukey_hsd2[1])
  colnames(B)[2:4]<-c("min",
                      "max",
                      "p")
  C<-data.frame(id=row.names(B),
                min=B$min,
                max=B$max,
                idt=ifelse(B$p<0.05,
                           "significant",
                           "not significant")
                )
  D<-C%>%
    ggplot(aes(id,color=idt))+
    geom_errorbar(aes(ymin=min,
                      ymax=max),
                  width = 0.5,
                  size=1.25)+
    labs(x=NULL,
         color=NULL)+
    scale_color_manual(values=c("red",
                                "green")
                       )+
    coord_flip()+
    theme(text=element_text(family="TimesNewRoman"),
            title=element_text(color="black",size=15),
            axis.text = element_text(color="black",size=10),
            axis.title = element_text(color="black",size=10),
            panel.grid=element_line(color="grey75"),
            axis.line=element_blank(),
            plot.background=element_rect(fill="white",color="white"),
            panel.background=element_rect(fill="white"),
            panel.border = element_rect(colour = "black", fill = NA,size=0.59),
            legend.key= element_rect(color="white",fill="white")
          )
  return(D)
}

Make an interpretation of the results above.

Plot

GGTukey.2(tukey_hsd2)

Two-Way Analysis of Variance

What Is a 2-Way ANOVA?

ANOVA stands for analysis of variance and tests for differences in the effects of independent variables on a dependent variable. A two-way ANOVA test is a statistical test used to determine the effect of two nominal predictor variables on a continuous outcome variable.

Convert the string variables of interest into factors.

Superstore_data$Category<-as.factor(Superstore_data$Category)
Superstore_data$Sub.Category<- as.factor((Superstore_data$Sub.Category))

Run the multiple linear model.

linear_model<- lm(Sales~Category+Sub.Category, data = Superstore_data)

Convert the linear model into its equivalent two-factor anova.

Two_factor_anova<-anova(linear_model)
Two_factor_anova
Analysis of Variance Table

Response: Sales
               Df     Sum Sq  Mean Sq F value                Pr(>F)    
Category        2  195881745 97940872  314.67 < 0.00000000000000022 ***
Sub.Category   14  580372224 41455159  133.19 < 0.00000000000000022 ***
Residuals    9977 3105371543   311253                                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above, the F statistic for Category is 314.67 with an associated p-value far below 0.0001, which is less than 0.01. This indicates that there is a significant difference in mean Sales across the categories at the 1% level of significance. On the other hand, the F statistic for Sub.Category is 133.19 with an associated p-value far below 0.0001, also less than 0.01. This is a clear indication that a significant difference in average Sales exists across the sub-categories at the 1% level of significance.

Visualize the results using stargazer

stargazer(Two_factor_anova, type="text")

=================================================================================
Statistic N       Mean            St. Dev.            Min              Max       
---------------------------------------------------------------------------------
Df        3     3,331.000         5,755.608            2              9,977      
Sum Sq    3 1,293,875,171.000 1,580,537,112.000 195,881,745.000 3,105,371,543.000
Mean Sq   3  46,569,095.000    49,015,303.000     311,253.000    97,940,872.000  
F value   2      223.927           128.325          133.188          314.666     
Pr(> F)   2       0.000             0.000              0                0        
---------------------------------------------------------------------------------

CHI-SQUARE TEST OF GOODNESS OF FIT AND INDEPENDENCE

The Chi-square goodness of fit test is a statistical hypothesis test used to determine whether a variable is likely to come from a specified distribution or not. It is often used to evaluate whether sample data is representative of the full population.

CHI-SQUARE TEST OF GOODNESS OF FIT

Illustration one

Let us enter some data using the command below. The data come from LinkedIn and show the number of open jobs in various professions. There are four different categories of professions.

jobs<-c(11091,11282,15378,12696)

Assign the following names to the job categories

names(jobs)<-c("Project Management","Supply chain","Services","Quality control")

Run the following command to view the available jobs from various categories of professions

jobs
Project Management       Supply chain           Services    Quality control 
             11091              11282              15378              12696 

Proportion

It is important to calculate the proportion of job vacancies in various categories.

jobs/sum(jobs)
Project Management       Supply chain           Services    Quality control 
         0.2198545          0.2236407          0.3048348          0.2516701 

These are the proportions that we get. Remember that these proportions may change on a daily basis: new jobs keep opening as others close, so the proportions are prone to variation. We would therefore like to find out whether the differences in these proportions are real or due to chance; in other words, we want to establish whether they are statistically significant. Let us assume that the proportions are the same in all four categories, that is, approximately 25% in each. We create the probability object as shown below.

probability<-c(0.25,0.25,0.25,0.25)

For the chi-square test of goodness of fit, we state the hypotheses as follows:

* Null hypothesis: the proportions in the job categories are not statistically different from 0.25.
* Alternative hypothesis: the proportions in the job categories are statistically different from 0.25.

The alternative hypothesis says that the proportions in the job categories fall below or above 0.25. Run the following code chunk to carry out the chi-square test.

chisq.test(jobs,p=probability)

    Chi-squared test for given probabilities

data:  jobs
X-squared = 930.89, df = 3, p-value < 0.00000000000000022

Interpretation

From the result, the chi-square test statistic is 930.89 with an associated p-value far below 0.001, which is well under 0.05. There is therefore insufficient evidence to support the null hypothesis, and we conclude that the proportions of jobs are statistically different from 0.25 (that is, below or above 0.25) at the 5% level of significance. The degrees of freedom are 3 because we have four categories (n - 1 = 3).
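
If you prefer to compare the test statistic with a critical value rather than the p-value, you can look up the 95th percentile of the chi-square distribution with 3 degrees of freedom; a quick check:

qchisq(0.95, df = 3)   # critical value of about 7.81; 930.89 far exceeds it, so we reject the null hypothesis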

Illustration two

Students in a social studies class hypothesize that the literacy rates across the world for every region are 82%. The table below shows the actual literacy rates across the world broken down by region. What are the test statistic and the degrees of freedom?

illit_lavel<-c(99,99.5,67.3,62.5,91,93.8,61.9,91.9,84.5,66.4)

Assign the following names to the regions

names(illit_lavel)<-c("Developed Regions","Commonwealth of Independent States","Northern Africa", "Sub-saharan Africa","Latin American and the caribbean","Eastern Asia","Southern Asia","South-Eastern Asia","Western Asia","Oceania")

Run the following command to view the literacy rates for the various regions

illit_lavel
                 Developed Regions Commonwealth of Independent States 
                              99.0                               99.5 
                   Northern Africa                 Sub-saharan Africa 
                              67.3                               62.5 
  Latin American and the caribbean                       Eastern Asia 
                              91.0                               93.8 
                     Southern Asia                 South-Eastern Asia 
                              61.9                               91.9 
                      Western Asia                            Oceania 
                              84.5                               66.4 

Proportion

It is important to calculate the proportion that each region contributes to the total of the literacy rates.

illit_lavel/sum(illit_lavel)
                 Developed Regions Commonwealth of Independent States 
                        0.12105649                         0.12166789 
                   Northern Africa                 Sub-saharan Africa 
                        0.08229396                         0.07642455 
  Latin American and the caribbean                       Eastern Asia 
                        0.11127415                         0.11469797 
                     Southern Asia                 South-Eastern Asia 
                        0.07569088                         0.11237466 
                      Western Asia                            Oceania 
                        0.10332600                         0.08119345 
chisq.test(illit_lavel)

    Chi-squared test for given probabilities

data:  illit_lavel
X-squared = 26.449, df = 9, p-value = 0.001725

Confirm the p-value (The area to the right)

1-pchisq(26.449,9)
[1] 0.001724323

Interpretation

From the result, the chi-square test statistic is 26.449 with an associated p-value of approximately 0.002, which is far below 0.05. There is therefore insufficient evidence to support the null hypothesis, and we conclude that literacy rates across the world are statistically different from 82% (that is, below or above 82%) at the 5% level of significance. The degrees of freedom are 9 because we have ten regions (n - 1 = 9).

Illustration three

Employers want to know which days of the week employees are absent in a five-day work week. Most employers would like to believe that employees are absent equally during the week. Suppose a random sample of 60 managers were asked on which day of the week they had the highest number of employee absences. The results were distributed as in the table below. For the population of employees, do the days for the highest number of absences occur with equal frequencies during a five-day work week? Test at a 5% significance level.

absenteeism<-c(15,12,9,9,15)

Input the following categories that have the following names

names(absenteeism)<-c("Monday","Tuesday","Wednesday","Thursday","Friday")

Run the following command to view the number of reported absences for each day of the week

absenteeism
   Monday   Tuesday Wednesday  Thursday    Friday 
       15        12         9         9        15 

Proportion

It is important to calculate the proportion of absences that fall on each day.

absenteeism/sum(absenteeism)
   Monday   Tuesday Wednesday  Thursday    Friday 
     0.25      0.20      0.15      0.15      0.25 

Probabilities

prob<-c(0.20,0.20,0.20,0.20,0.20)

Calculate the chi-square test

chisq.test(absenteeism,p=prob)

    Chi-squared test for given probabilities

data:  absenteeism
X-squared = 3, df = 4, p-value = 0.5578

Confirm the p-value

1- pchisq(3,4)
[1] 0.5578254

Interpretation

From the result, the chi-square test statistic is 3 with an associated p-value of approximately 0.5578, which is far above 0.05. There is therefore not enough evidence to reject the null hypothesis: at the 5% level of significance, the sample data do not suggest that the days with the highest number of absences occur with unequal frequencies. In other words, absences appear to occur with roughly equal frequency across the five-day work week.

For further practice, consider the example below.

knitr::include_graphics("chi.png")
Distribution of voters

CHI-SQUARE TEST OF INDEPENDENCE

Using Chi-Square Statistic in Research

The Chi Square statistic is commonly used for testing relationships between categorical variables.
The null hypothesis of the Chi-Square test is that no relationship exists between the categorical variables in the population; they are independent. An example research question that could be answered using a Chi-Square analysis is: Is there a significant relationship between voter intent and political party membership?

How does the Chi-Square statistic work?

The Chi-Square statistic is most commonly used to evaluate Tests of Independence when using a cross tabulation (also known as a bivariate table). Cross tabulation presents the distributions of two categorical variables simultaneously, with the intersections of the categories of the variables appearing in the cells of the table. The Test of Independence assesses whether an association exists between the two variables by comparing the observed pattern of responses in the cells to the pattern that would be expected if the variables were truly independent of each other. Calculating the Chi-Square statistic and comparing it against a critical value from the Chi-Square distribution allows the researcher to assess whether the observed cell counts are significantly different from the expected cell counts.

In other words, the chi-square test of independence checks whether two variables are likely to be related. We have counts for two categorical (nominal) variables and an idea that the two variables are not related; the test gives us a way to decide whether that idea is plausible. For instance, assume that we have two variables, gender and voting. Because the data form a two-way table, we shall use the rbind function to store the counts in the object voters.

voters<-rbind(c(1486,2131),c(2792,3591))
voters
     [,1] [,2]
[1,] 1486 2131
[2,] 2792 3591

Provide the dimension names

dimnames(voters)<-list(gender=c("male","female"),
                       voting=c("voted","didn't vote")
                       )

View the data

Run the code chunk to look at the data we have just entered

voters
        voting
gender   voted didn't vote
  male    1486        2131
  female  2792        3591

State the null and alternative hypothesis

Run the code chunk below to run the chi square test of independence

chisq.test(voters)

    Pearson's Chi-squared test with Yates' continuity correction

data:  voters
X-squared = 6.5523, df = 1, p-value = 0.01047

From the result, the chi-square test statistic is 6.5523 with an associated p-value of approximately 0.01, which is below 0.05. There is therefore insufficient evidence to support the null hypothesis, and we conclude that there is a statistically significant dependence of voting on gender at the 5% level of significance. We can reject the null hypothesis with 95% confidence.
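
To see the expected cell counts that the test compared the observed counts against, you can inspect the expected component of the chisq.test result; a short sketch:

chisq.test(voters)$expected   # expected counts under independence of gender and voting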

Second illustration

Consider the information below

knitr::include_graphics("chi2.png")
Automobile owners

Let us enter the information into RStudio

Buyer<-rbind(c(70,130,150),c(30,20,100))

Give names to the dimension columns and rows using the following code chunk

dimnames(Buyer)<-list(Gender=c("Male","Female"),
                      Reason=c("Styling","Engineering","Fuel economy"))

View the entered data

Buyer
        Reason
Gender   Styling Engineering Fuel economy
  Male        70         130          150
  Female      30          20          100

Null and alternative hypothesis

Run the chi square test using the code chunk below

chisq.test(Buyer)

    Pearson's Chi-squared test

data:  Buyer
X-squared = 31.746, df = 2, p-value = 0.0000001278

Interpretation

From the result, the chi-square test statistic is 31.746 with an associated p-value far below 0.001. There is therefore insufficient evidence to support the null hypothesis, and we conclude that there is a statistically significant dependence of the reason for purchase on gender at the 10% level of significance. We can reject the null hypothesis with 90% confidence. The output above has 2 degrees of freedom, calculated as follows:

\[ (\text{Number of columns}-1)\times(\text{Number of rows}-1) = (3-1)(2-1) = 2 \times 1 = 2 \]

Third illustration

A random sample of 400 people were surveyed and each person was asked to report the highest education level they obtained. The data that resulted from this survey is summarized in the following contingency table.

Education<-rbind(c(60,56,46,42),c(40,46,53,57))

Give names to the dimension columns and rows using the following code chunk

dimnames(Education)<-list(Gender=c("Male","Female"),
                      Educ_level=c("High school","Bachelors","Masterd","PhD"))

View the data set

Education
        Educ_level
Gender   High school Bachelors Masterd PhD
  Male            60        56      46  42
  Female          40        46      53  57

Null and alternative hypothesis

Run the chi square test of dependence

chisq.test(Education)

    Pearson's Chi-squared test

data:  Education
X-squared = 7.5911, df = 3, p-value = 0.05526

Interpretation

From the result, the chi-square test statistic is 7.5911 with an associated p-value of approximately 0.055, which is slightly above 0.05. There is therefore insufficient evidence to reject the null hypothesis, and we conclude that there is no statistically significant dependence of education level on gender at the 5% level of significance. We therefore retain the null hypothesis with 95% confidence.

EMPIRICAL DATA FROM THE FIELD

Import the data set using the code chunk below

Ch_sqt <- read_sav("C:/Users/user/Downloads/Effects of employees voice mechanism on academic staff turnover intentions1.sav")

Attach the data set

attach(Ch_sqt)

View the data set

head(Ch_sqt,10)
# A tibble: 10 × 141
      ID Q1      Q2      Q3      Q4      Q5      Q6      Q7      Q8      ESa    
   <dbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l>
 1     1 2 [fem… 5 [51-… 2 [Ph.… 2 [con… 2 [Ass… 2 [Ken… 3 [11-… 4 [6-1… 1 [str…
 2     2 1 [mal… 3 [31-… 2 [Ph.… 1 [Per… 3 [Sen… 1 [Uni… 4 [21-… 3 [4-5… 3 [Nei…
 3     3 1 [mal… 5 [51-… 3 [Mas… 1 [Per… 3 [Sen… 2 [Ken… 2 [2-1… 2 [2-3… 4 [Agr…
 4     4 1 [mal… 4 [41-… 2 [Ph.… 1 [Per… 3 [Sen… 2 [Ken… 2 [2-1… 5 [11-… 3 [Nei…
 5     5 1 [mal… 4 [41-… 2 [Ph.… 2 [con… 4 [Lec… 1 [Uni… 6 [41-… 4 [6-1… 3 [Nei…
 6     6 1 [mal… 5 [51-… 1 [Pos… 2 [con… 4 [Lec… 1 [Uni… 4 [21-… 2 [2-3… 3 [Nei…
 7     7 1 [mal… 4 [41-… 2 [Ph.… 1 [Per… 4 [Lec… 1 [Uni… 2 [2-1… 2 [2-3… 1 [str…
 8     8 1 [mal… 2 [21-… 3 [Mas… 2 [con… 4 [Lec… 4 [Not… 2 [2-1… 6 [21 … 2 [Dis…
 9     9 2 [fem… 4 [41-… 2 [Ph.… 1 [Per… 4 [Lec… 1 [Uni… 2 [2-1… 2 [2-3… 5 [Str…
10    10 1 [mal… 3 [31-… 3 [Mas… 1 [Per… 4 [Lec… 2 [Ken… 2 [2-1… 5 [11-… 2 [Dis…
# … with 131 more variables: ESb <dbl+lbl>, ESc <dbl+lbl>, ESd <dbl+lbl>,
#   ESe <dbl+lbl>, ESf <dbl+lbl>, ESg <dbl+lbl>, ESh <dbl+lbl>, ESi <dbl+lbl>,
#   ERa <dbl+lbl>, ERb <dbl+lbl>, ERc <dbl+lbl>, ERd <dbl+lbl>, ERe <dbl+lbl>,
#   ERf <dbl+lbl>, ERg <dbl+lbl>, QCa <dbl+lbl>, QCb <dbl+lbl>, QCc <dbl+lbl>,
#   QCd <dbl+lbl>, QCe <dbl+lbl>, QCf <dbl+lbl>, QCg <dbl+lbl>, QCh <dbl+lbl>,
#   QCi <dbl+lbl>, QCj <dbl+lbl>, QCk <dbl+lbl>, SSa <dbl+lbl>, SSb <dbl+lbl>,
#   SSc <dbl+lbl>, SSd <dbl+lbl>, SSe <dbl+lbl>, SSf <dbl+lbl>, …

Generate the chi square table (two-way table)

TBL<-table(Ch_sqt$Q4, Ch_sqt$Q6) 
TBL
   
      1   2   3   4
  1 155  16   5  52
  2  53   3   2  67

The two-way table above has no row or column labels, so it is important to provide row and column names to make the dimensions easier to identify.

Provide names to the columns and row dimension

dimnames(TBL)<-list(Employment_terms=c("Permanent","Contractual"),
                      Trade_union=c("UASU","KUSU","Any other","Not a member of any"))

View the data entered

TBL
                Trade_union
Employment_terms UASU KUSU Any other Not a member of any
     Permanent    155   16         5                  52
     Contractual   53    3         2                  67

Run the following code chunk to run the chi square test

chisq.test(TBL)

    Pearson's Chi-squared test

data:  TBL
X-squared = 35.018, df = 3, p-value = 0.0000001208

From the result, the chi-square test statistic is 35.018 with an associated p-value far below 0.001, which is well under 0.05. There is therefore insufficient evidence to support the null hypothesis, and we conclude that there is a statistically significant dependence of trade union membership on terms of employment at the 5% level of significance. We can reject the null hypothesis with 95% confidence.

Create a one way table

TAB<-table(Ch_sqt$Q6)
TAB

  1   2   3   4 
208  19   7 119 

Name the table dimension and its categories

dimnames(TAB)<-list(Trade_Union=c("UASU","KUSU","Member of any other","Not a member of any"))

View the data set

TAB
Trade_Union
               UASU                KUSU Member of any other Not a member of any 
                208                  19                   7                 119 

Linear Regression

Statistical techniques are tools that enable us to answer questions about possible patterns in empirical data. It is not surprising, then, to learn that many important techniques of statistical analysis were developed by scientists who were interested in answering very specific empirical questions. So it was with regression analysis. The history of this particular statistical technique can be traced back to late nineteenth-century England and the pursuits of a gentleman scientist, Francis Galton. Galton was born into a wealthy family that produced more than its share of geniuses; he and Charles Darwin, the famous biologist, were first cousins. During his lifetime, Galton studied everything from fingerprint classification to meteorology, but he gained widespread recognition primarily for his work on inheritance. His most important insight came to him while he was studying the inheritance of one of the most obvious of all human characteristics: height. In order to understand how the characteristic of height was passed from one generation to the next, Galton collected data on the heights of individuals and the heights of their parents. After constructing frequency tables that classified these individuals both by their height and by the average height of their parents, Galton came to the unremarkable conclusion that tall people usually had tall parents and short people usually had short parents.

Before running a linear regression, it is important to confirm that there is a reasonably linear association between the two variables; for instance, you can create a scatter plot of the two variables to inspect the association. Consider the scatter plot below for Sales and Profit.

Scatter Plot

plot(Sales, Profit, main = "A scatter plot of Sales and Profit")
A scatter plot of Profit and Sales

Transform one or both variables and recreate the scatter plot.

plot(log(Sales),Profit)

plot(Sales, log(Profit))

Log-transform both variables

plot(log(Sales),log(Profit))
Scatter plot of sales and profit

Run the correlation to determine the strength of association between the two variables of interest

cor(Sales,Profit)
[1] 0.4790643
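
If you also want a formal significance test for this correlation, cor.test provides one; a brief sketch:

cor.test(Superstore_data$Sales, Superstore_data$Profit)   # tests H0: the Pearson correlation is zero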

Run the Regression model

reg_model<-lm(Profit~Sales, data = Superstore_data)
summary(reg_model)

Call:
lm(formula = Profit ~ Sales, data = Superstore_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7397.5     2.6    14.6    21.7  5261.6 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -12.732867   2.192459  -5.808        0.00000000653 ***
Sales         0.180067   0.003301  54.555 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 205.6 on 9992 degrees of freedom
Multiple R-squared:  0.2295,    Adjusted R-squared:  0.2294 
F-statistic:  2976 on 1 and 9992 DF,  p-value: < 0.00000000000000022

Visualize the results

stargazer(reg_model, type="text")

===============================================
                        Dependent variable:    
                    ---------------------------
                              Profit           
-----------------------------------------------
Sales                        0.180***          
                              (0.003)          
                                               
Constant                    -12.733***         
                              (2.192)          
                                               
-----------------------------------------------
Observations                   9,994           
R2                             0.230           
Adjusted R2                    0.229           
Residual Std. Error     205.639 (df = 9992)    
F Statistic         2,976.247*** (df = 1; 9992)
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

Generate the anova model for the regression model

anova(reg_model)
Analysis of Variance Table

Response: Profit
            Df    Sum Sq   Mean Sq F value                Pr(>F)    
Sales        1 125857839 125857839  2976.2 < 0.00000000000000022 ***
Residuals 9992 422535997     42287                                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Get the four diagnostic plots, including the normal probability plot, of the residuals:

plot(reg_model)
Diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage

Confidence Intervals for all parameters:

confint(reg_model)
                  2.5 %     97.5 %
(Intercept) -17.0305284 -8.4352058
Sales         0.1735967  0.1865366
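
The fitted model can also be used to produce predictions with interval estimates. The Sales values below are hypothetical and chosen purely to illustrate predict(); they are not part of the original analysis.

# Hypothetical new Sales values for illustration
new_sales <- data.frame(Sales = c(100, 500, 1000))
# Predicted Profit with 95% confidence intervals for the mean response
predict(reg_model, newdata = new_sales, interval = "confidence", level = 0.95)
# Prediction intervals for individual observations (wider than confidence intervals)
predict(reg_model, newdata = new_sales, interval = "prediction", level = 0.95)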

Generate the regression residuals

Resid<-residuals(reg_model)

Summarize the residuals

summary(Resid)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-7397.542     2.581    14.634     0.000    21.658  5261.551 

From the results above, the residuals have a mean of 0.00, which indicates that the zero conditional mean assumption is met. Next, it is important to examine the covariance between the residuals and the independent variable: a covariance of approximately zero indicates that the regressor is uncorrelated with the error term, which supports a causal interpretation of the effect on the dependent variable.

cov(Resid,Sales)
[1] -0.00000000005062127

The covariance is approximately zero, an indication that changes in Profit are driven by changes in Sales; in other words, the estimated effect of Sales on Profit can be given a causal interpretation.

Declare the residuals a time series

Resid<-ts(Resid, start = 1265, frequency = 1)

Plot the residuals

ts.plot(Resid, main="A time series plot of regression residuals",xlab="Time in years",ylab="Residuals")
abline(0,-0.00000000005062127)
Regression residuals

Data Visualization

Histogram

hist(log(Sales),main="Histogram showing the distribution of sales from Superstore",xlab="log of sales", ylab="Frequencies",breaks = 15)

summary(log(Sales))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.8119  2.8495  3.9980  4.1098  5.3468 10.0274 
ggplot(data=Superstore_data, aes(log(Sales))) +        
  geom_histogram(aes(y = ..density..),color="red")+
  scale_x_continuous(limits = c(-2,12))+
  stat_function(fun = dnorm,
                args = list(mean = mean(log(Sales)),
                            sd = sd(log(Sales))),
                col = "#1b98e0",
                size = 1)+
 labs(title = "Histogram showing the distribution of 
      Sales for Superstore Company",
      subtitle = "Sales from 2016 to 2019",
       caption="Data Collected and compiled by the Superstore Company")
Histogram showing the distribution of sales

Place the title and subtitle at the centre of the graph

ggplot(data=Superstore_data, aes(log(Sales))) +        
  geom_histogram(aes(y = ..density..),color="red")+
  scale_x_continuous(limits = c(-2,12))+
  stat_function(fun = dnorm,
                args = list(mean = mean(log(Sales)),
                            sd = sd(log(Sales))),
                col = "#1b98e0",
                size = 1)+
 labs(title = "Histogram showing the distribution of Sales for 
      Superstore Company",
      subtitle = "Sales from 2016 to 2019",
       caption="Data Collected and compiled by the Superstore Company")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.subtitle = element_text(hjust = 0.5))
Histogram showing the distribution of Sales

Pie Chart (Average Sale per Region)

Summarize the average sales from various regions

group_mean_sales <- aggregate(Sales ~ Region, data = Superstore_data, mean)
group_mean_sales
   Region    Sales
1 Central 215.7727
2    East 238.3361
3   South 241.8036
4    West 226.4932
# Create data for the graph.
average_sales <-  c(215.7727, 238.3361, 241.8036, 226.4932)
Regions <-  c("Central","East","South","West")
piepercent<- round(100*average_sales/sum(average_sales), 1)
# Plot the chart.
pie3D(average_sales, labels = Regions, explode = 0.06, main = "Pie Chart of Sales from Various Regions of the United States")

Create data for the graph.

average_sales <-  c(215.7727, 238.3361, 241.8036, 226.4932)
Regions <-  c("Central","East","South","West")
piepercent<- round(100*average_sales/sum(average_sales), 1)

Plot the chart.

pie(average_sales, labels = piepercent, main = "Pie Chart of Sales from 
    Various regions of the United States",col = rainbow(length(average_sales)))
legend("topright", c("Central","East","South","West"), cex = 0.8,
   fill = rainbow(length(average_sales)))
Pie chart of sales for various regions

Pie Chart (Average Sales per category)

Summarize the average sales from various categories

group_mean_sales1 <- aggregate(Sales ~ Category, data = Superstore_data, mean)
group_mean_sales1
         Category    Sales
1       Furniture 349.8349
2 Office Supplies 119.3241
3      Technology 452.7093
pie_data <- data.frame(
  category = c("Furniture", "Office Supplies", "Technology"),
  value = c(349.8349, 119.3241, 452.7093)
)

# Create ggplot object
pie_chart <- ggplot(pie_data, aes(x = "", y = value, fill = category))

# Create pie chart
pie_chart + geom_bar(stat = "identity") + coord_polar(theta = "y")+
  labs(title = "Pie Chart Average Sales Across Categories")

# Customize pie chart
pie_chart + 
  geom_bar(stat = "identity") + 
  coord_polar(theta = "y") +
  labs(title = "Pie Chart", fill = "Category") +
  scale_fill_manual(values = c("blue", "green", "orange")) +
  theme_void()+
  labs(title = "Pie Chart Average Sales Across Categories")

Alternatively

Create the data vectors

# Create data for the graph.
average_sales <-  c(349.8349, 119.3241,452.7093)
Sales_category <-  c("Furniture","Office Supplies","Technology")
piepercent<- round(100*average_sales/sum(average_sales), 1)
# Plot the chart.
pie3D(average_sales,labels = Sales_category,explode = 0.06, main = "Pie Chart of 
      Sales from Various Categories")

Create sales Pie chart with percentages

pie(average_sales, labels = piepercent, main = "Pie Chart of Sales from 
    Various Categories",col = rainbow(length(average_sales)))
legend("topright", c("Furniture","Office Supplies","Technology"), cex = 0.8,
   fill = rainbow(length(average_sales)))
Pie chart of sales for various categories

Bar graphs

Create the data for the chart (average sales per category)

average_sales1 <-  c(349.8349, 119.3241, 452.7093)
Sales_category1 <-  c("Furniture","Office Supplies","Technology")

Plot the bar chart

barplot(average_sales1,names.arg=Sales_category1,xlab="Category",ylab="Average Sales",
        col="blue",
main="Bar graph for the average sales for various category",border="red")
Bar graph for various categories

Bar graph2

group_mean_sales2 <- aggregate(Sales ~ Sub.Category, data = Superstore_data, mean)
group_mean_sales2
   Sub.Category      Sales
1   Accessories  215.97460
2    Appliances  230.75571
3           Art   34.06883
4       Binders  133.56056
5     Bookcases  503.85963
6        Chairs  532.33242
7       Copiers 2198.94162
8     Envelopes   64.86772
9     Fasteners   13.93677
10  Furnishings   95.82567
11       Labels   34.30305
12     Machines 1645.55331
13        Paper   57.28409
14       Phones  371.21153
15      Storage  264.59055
16     Supplies  245.65020
17       Tables  648.79477

Create the data for the chart (average sales per sub category)

average_sales2 <-  c(215.97460, 230.75571, 34.06883, 133.56056, 503.85963, 532.33242, 2198.94162, 64.86772, 13.93677, 95.82567, 34.30305, 1645.55331, 57.28409, 371.21153, 264.59055, 245.65020, 648.79477)
sub_category_sales <-  c("Accessories","Appliances","Art","Binders","Bookcases","Chairs","Copiers","Envelopes","Fasteners","Furnishings","Labels","Machines","Paper","Phones", "Storage","Supplies","Tables")

Plot the bar graph 1

barplot(average_sales2,names.arg=sub_category_sales,xlab="sub_category",ylab="Average Sales",col="grey",main="Bar graph for the average sales for various sub categories",border="blue")
Bar graph for various sub categories

Bar graph 2

ggplot(Superstore_data,aes(x=Sub.Category, y=Sales))+
  geom_bar(stat = "identity")+
  theme_bw()+
  labs(title = "Grouped Barplot of Sales for various Categories and 
       Sub categories")+
  theme(plot.title = element_text(hjust = 0.5))

Change the angle of the x-axis

ggplot(Superstore_data,aes(x=Sub.Category, y=mean(Sales)))+
  geom_bar(stat = "identity")+
  theme_bw()+
  labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")+
  theme(axis.text.x = element_markdown(angle = 90))
Grouped Barplot of Sales for various Categories and Sub categories

Grouped Bar graphs

Superstore_data %>%
  filter(Category == "Furniture"|
           Category == "Office Supplies"|
         Category == "Technology")%>%
  ggplot(aes(Category, fill= Sub.Category))+
  geom_bar(position = "dodge",
           alpha=0.5)+
  theme_economist()+
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())+
  labs(title = "Grouped Bar Graph",
       y="Number",
       x="Category")
Grouped bar graph

Stacked Bar graphs (100%)

Superstore_data %>%
  filter(Category == "Furniture"|
           Category == "Office Supplies"|
         Category == "Technology")%>%
  ggplot(aes(Category, fill= Sub.Category))+
  geom_bar(position = "fill",
           alpha=0.5)+
  theme_economist()+
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())+
  labs(title = "Grouped Bar Graph",
       y="Number",
       x="Category")
Stacked Bar graphs 100%

Stacked bar graph (Proportion)

Superstore_data %>%
  filter(Category == "Furniture"|
           Category == "Office Supplies"|
         Category == "Technology")%>%
  ggplot(aes(Category, fill= Sub.Category))+
  geom_bar(alpha=0.5)+
  theme_economist_white()+
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())+
  labs(title = "Grouped Bar Graph",
       y="Frequency",
       x="Category")
Stacked bar graph

Stacked Bar Graphs Part 2

ggplot(Superstore_data,aes(x=Category, y=mean(Sales), fill=Sub.Category))+
  geom_bar(stat = "identity")+
  theme_calc()
Stacked Bar Graph for proportion

Grouped bar graph

ggplot(Superstore_data,aes(x=Category, y=Sales, fill=Sub.Category))+
  geom_bar(stat = "identity", position = "dodge")+
  theme_bw()+
  labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")
Grouped Barplot of Sales for various Categories and Sub categories

Facetting

ggplot(Superstore_data,aes(x=Category, y=Sales, fill=Sub.Category))+
  geom_bar(stat = "identity", position = "dodge")+
  theme_bw()+
  labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")+
  facet_wrap(Category~.)+
  theme(axis.text.x = element_markdown(angle = 90))
Grouped Barplot of Sales for various Categories and Sub categories

Facet and remove legend

ggplot(Superstore_data,aes(x=Category, y=Sales, fill=Sub.Category))+
  geom_bar(stat = "identity", position = "dodge")+
  theme_bw()+
  labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")+
  theme(legend.position = "none")+
  facet_wrap(Category~.)

LINEAR REGRESSION PART TWO

Import the data on inflation, unemployment and Fed Fund Rate

unemploymnt_data<-read.csv("C:\\Users\\user\\Downloads\\Unemployment.csv")
attach(unemploymnt_data)

View the data set

head(unemploymnt_data, 10)
   year Unemployment Inflation  FedRate
1  1859     5.133333 0.9084719 3.933333
2  1860     5.233333 1.8107772 3.696667
3  1861     5.533333 1.6227203 2.936667
4  1862     6.266667 1.7953352 2.296667
5  1863     6.800000 0.5370330 2.003333
6  1864     7.000000 0.7149242 1.733333
7  1865     6.766667 0.8918621 1.683333
8  1866     6.200000 1.0676163 2.400000
9  1867     5.633333 2.3034394 2.456667
10 1868     5.533333 1.2348411 2.606667
tail(unemploymnt_data,10)
    year Unemployment Inflation  FedRate
155 2013     4.533333  1.395350 5.533333
156 2014     4.433333  1.081917 4.860000
157 2015     4.266667  1.694265 4.733333
158 2016     4.300000  1.342605 4.746667
159 2017     4.233333  1.376280 5.093333
160 2018     4.100000  1.789712 5.306667
161 2019     4.033333  3.668536 5.680000
162 2020     4.000000  2.102699 6.273333
163 2021     4.066667  1.868115 6.520000
164 2022     3.966667  1.748108 6.473333

This is annual data running from 1859 to 2022.

Attach the data

attach(unemploymnt_data)

Declare the Unemployment data an annual time series starting in 1859.

Unemployment<-ts(unemploymnt_data$Unemployment,frequency = 1,start = 1859)

Plot the series

ts.plot(Unemployment,main="Yearly time series plot of Unemployment from 1859 to 2022", xlab="Time in years",ylab="Unemployment rate")
Yearly time series plot of Unemployment from 1859 to 2022

Declare the Inflation data an annual time series starting in 1859.

Inflation<-ts(unemploymnt_data$Inflation,frequency = 1,start = 1859)

Plot the series

ts.plot(Inflation,main="Yearly time series plot of Inflation from 1859 to 2022", xlab="Time in years",ylab="Inflation rate")
Yearly time series plot of Inflation from 1859 to 2022

Declare the FedRate data an annual time series starting in 1859.

FedRate<-ts(unemploymnt_data$FedRate,frequency = 1,start = 1859)

Plot the series

ts.plot(FedRate,main="Yearly time series plot of FedRate from 1859 to 2022", xlab="Time in years",ylab="Fed Funds rate")
Yearly time series plot of FedRate from 1859 to 2022

Establish the association between Inflation and Unemployment

plot(Unemployment,Inflation)
Scatter plot of unemployment and inflation

Establish the association between Inflation and FedRate

plot(FedRate,Inflation)
Scatter plot of FedRate and Inflation

Correlation

cor(Inflation,FedRate)
[1] 0.6716283
cor(Inflation,Unemployment)
[1] 0.2172026

Histogram

hist(Inflation, main="Histogram showing the distribution of Inflation")
Histogram showing the distribution of Inflation

hist(Unemployment, main="Histogram showing the distribution of Unemployment")
Histogram showing the distribution of Unemployment

hist(FedRate, main = "Histogram showing the distribution of FedRate")
Histogram showing the distribution of FedRate

Test Normality

shapiro.test(Inflation)

    Shapiro-Wilk normality test

data:  Inflation
W = 0.91272, p-value = 0.00000002459
shapiro.test(Unemployment)

    Shapiro-Wilk normality test

data:  Unemployment
W = 0.96757, p-value = 0.0006884
shapiro.test(FedRate)

    Shapiro-Wilk normality test

data:  FedRate
W = 0.9102, p-value = 0.00000001704

Estimate the Model

Estimate the model using the log transformed values of inflation, unemployment and FedRate to bring about linearity between the dependent variable and independent variables.

REG_MODEL<-lm(log(Inflation)~log(Unemployment)+log(FedRate),data = unemploymnt_data)

Display the results

stargazer(REG_MODEL, report = "vc*stp",type = "text",out = "./q7results.txt")

===============================================
                        Dependent variable:    
                    ---------------------------
                          log(Inflation)       
-----------------------------------------------
log(Unemployment)              0.176           
                              (0.157)          
                             t = 1.127         
                             p = 0.262         
                                               
log(FedRate)                 0.976***          
                              (0.085)          
                            t = 11.555         
                             p = 0.000         
                                               
Constant                     -0.902***         
                              (0.291)          
                            t = -3.102         
                             p = 0.003         
                                               
-----------------------------------------------
Observations                    164            
R2                             0.473           
Adjusted R2                    0.466           
Residual Std. Error      0.498 (df = 161)      
F Statistic           72.167*** (df = 2; 161)  
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

Model Diagnostic

1. Test for the presence of outliers in the model

outlierTest(REG_MODEL)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
    rstudent unadjusted p-value Bonferroni p
14 -3.172999          0.0018094      0.29673

The Bonferroni-adjusted p-value (0.297) is greater than 0.05, which indicates that there are no significant outliers in the data set.

qqPlot(REG_MODEL, main= "QQ Plot Showing the Possible Presence of outliers")
QQ Plot Showing the Possible Presence of outliers

[1] 14 15
leveragePlots(REG_MODEL)
Leverage plots for the regression model

2. Test for the presence of multicollinearity in the model

vif(REG_MODEL) 
log(Unemployment)      log(FedRate) 
         1.034369          1.034369 

A VIF of approximately 1.03 for both log(Unemployment) and log(FedRate) indicates that the predictors are not correlated with each other.


3. Testing for heteroscedasticity

ncvTest(REG_MODEL)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 2.623628, Df = 1, p = 0.10528

The p-value (0.105) is greater than 0.05, so we fail to reject the null hypothesis of constant variance: the variance of the error terms is constant.

spreadLevelPlot(REG_MODEL)
Spread-Level Plot for Regression Model


Suggested power transformation:  1.024086 

4. Autocorrelation

durbinWatsonTest(REG_MODEL)
 lag Autocorrelation D-W Statistic p-value
   1       0.7001071     0.5736665       0
 Alternative hypothesis: rho != 0

The results show that the regression residuals are correlated across periods: the Durbin-Watson p-value is below 0.05, so the null hypothesis of no autocorrelation is rejected.
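
Because the Durbin-Watson test points to autocorrelated residuals, one possible remedy (a sketch, not part of the original analysis) is to report heteroscedasticity- and autocorrelation-consistent (HAC) standard errors, for example Newey-West errors from the sandwich package, alongside the correction shown below.

# Assumes the sandwich and lmtest packages are installed
library(sandwich)
library(lmtest)
# Coefficient tests using Newey-West (HAC) standard errors
coeftest(REG_MODEL, vcov = NeweyWest(REG_MODEL))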

Correcting Econometric problems

Heteroscedasticity

coeftest(REG_MODEL, hccm(REG_MODEL, type = "hc0"))

t test of coefficients:

                   Estimate Std. Error t value              Pr(>|t|)    
(Intercept)       -0.901611   0.272771 -3.3054              0.001169 ** 
log(Unemployment)  0.176483   0.148486  1.1885              0.236368    
log(FedRate)       0.976433   0.074715 13.0688 < 0.00000000000000022 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
stargazer(coeftest(REG_MODEL, hccm(REG_MODEL, type = "hc0")),type="text")

=============================================
                      Dependent variable:    
                  ---------------------------
                                             
---------------------------------------------
log(Unemployment)            0.176           
                            (0.148)          
                                             
log(FedRate)               0.976***          
                            (0.075)          
                                             
Constant                   -0.902***         
                            (0.273)          
                                             
=============================================
=============================================
Note:             *p<0.1; **p<0.05; ***p<0.01

TEST THE ZERO CONDITIONAL MEAN AND COVARIANCE

RSD<-residuals(REG_MODEL)
summary(RSD)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.52073 -0.35028 -0.01588  0.00000  0.35593  1.13292 

The zero conditional mean assumption is met, because the residuals mean is approximately zero.

Covariance

cov(RSD,log(FedRate))
[1] 0.00000000000000007316962
cov(RSD,log(Unemployment))
[1] 0.0000000000000001019151

Plot the residuals with the mean as the reference line

RSD<-ts(RSD, frequency = 1,start = 1859)
ts.plot(RSD)
Time series plot of the residuals

Add a reference line to the residual plot

ts.plot(RSD)
abline(0,0.000)
Time series plot of residuals with a reference line

Visualize graphs using ggplot2 and economist theme

1

ggplot(data=unemploymnt_data,aes(x=year,y=Inflation))+geom_line()
Time series plot of inflation

2

ggplot(data=unemploymnt_data,aes(x=year,y=Inflation))+geom_line()+
  labs(title="Time Series plot of Inflation Rate",
       caption="source:Federal Reserve",
       y="Inflation Rate", x="Year",
       color=3) + # title and caption
  theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 12), # rotate x axis text
        axis.title=element_text(size=12,face="bold"),
        panel.grid = element_blank())+
  #theme(panel.grid.minor = element_blank())+#turn off minor grid(to run remove #be4 theme)
  theme(legend.text = element_text(size=12,face="bold"))+
  theme_set(theme_economist())
Time Series plot of Inflation Rate

3

ggplot(data=unemploymnt_data,aes(x=year,y=Unemployment))+geom_line()+
  labs(title="Time Series plot of Unemployment Rate",
       caption="source:Federal Reserve",
       y="Unemployment Rate", x="Year",
       color=3) + # title and caption
  theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 12), # rotate x axis text
        axis.title=element_text(size=12,face="bold"),
        panel.grid = element_blank())+
  #theme(panel.grid.minor = element_blank())+#turn off minor grid(to run remove #be4 theme)
  theme(legend.text = element_text(size=12,face="bold"))+
  theme_set(theme_economist())
Time Series plot of Unemployment Rate

4

ggplot(data=unemploymnt_data,aes(x=year,y=FedRate))+geom_line()+
  labs(title="Time Series plot of Fed Fund Rate",
       caption="source:Federal Reserve",
       y="Fed Fund Rate", x="Year",
       color=3) + # title and caption
  theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 12), # rotate x axis text
        axis.title=element_text(size=12,face="bold"),
        panel.grid = element_blank())+
  #theme(panel.grid.minor = element_blank())+#turn off minor grid(to run remove #be4 theme)
  theme(legend.text = element_text(size=12,face="bold"))+
  theme_set(theme_economist())
Time Series plot of Fed Fund Rate

Make a plot of all the four plots

ggplot(data=unemploymnt_data,aes(x=year))+
  geom_line(aes(y=Inflation,colour="Inflation Rate"))+
  geom_line(aes(y=FedRate,colour="FedRate"))+
  geom_line(aes(y=Unemployment,colour="Unemployment Rate"))+
  labs(title="Trends of Inflation, Unemployment and FedRate from 1859 to 2022",
       caption="", y="Rate", x="Time in Years", color="Key")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 10))
Trends of Inflation, Unemployment and FedRate from 1859 to 2022

Create a date variable for the time axis

date<-seq(as.Date("1859-01-01"),by="1 year",length.out=length(unemploymnt_data$year))
date
  [1] "1859-01-01" "1860-01-01" "1861-01-01" "1862-01-01" "1863-01-01"
  [6] "1864-01-01" "1865-01-01" "1866-01-01" "1867-01-01" "1868-01-01"
 [11] "1869-01-01" "1870-01-01" "1871-01-01" "1872-01-01" "1873-01-01"
 [16] "1874-01-01" "1875-01-01" "1876-01-01" "1877-01-01" "1878-01-01"
 [21] "1879-01-01" "1880-01-01" "1881-01-01" "1882-01-01" "1883-01-01"
 [26] "1884-01-01" "1885-01-01" "1886-01-01" "1887-01-01" "1888-01-01"
 [31] "1889-01-01" "1890-01-01" "1891-01-01" "1892-01-01" "1893-01-01"
 [36] "1894-01-01" "1895-01-01" "1896-01-01" "1897-01-01" "1898-01-01"
 [41] "1899-01-01" "1900-01-01" "1901-01-01" "1902-01-01" "1903-01-01"
 [46] "1904-01-01" "1905-01-01" "1906-01-01" "1907-01-01" "1908-01-01"
 [51] "1909-01-01" "1910-01-01" "1911-01-01" "1912-01-01" "1913-01-01"
 [56] "1914-01-01" "1915-01-01" "1916-01-01" "1917-01-01" "1918-01-01"
 [61] "1919-01-01" "1920-01-01" "1921-01-01" "1922-01-01" "1923-01-01"
 [66] "1924-01-01" "1925-01-01" "1926-01-01" "1927-01-01" "1928-01-01"
 [71] "1929-01-01" "1930-01-01" "1931-01-01" "1932-01-01" "1933-01-01"
 [76] "1934-01-01" "1935-01-01" "1936-01-01" "1937-01-01" "1938-01-01"
 [81] "1939-01-01" "1940-01-01" "1941-01-01" "1942-01-01" "1943-01-01"
 [86] "1944-01-01" "1945-01-01" "1946-01-01" "1947-01-01" "1948-01-01"
 [91] "1949-01-01" "1950-01-01" "1951-01-01" "1952-01-01" "1953-01-01"
 [96] "1954-01-01" "1955-01-01" "1956-01-01" "1957-01-01" "1958-01-01"
[101] "1959-01-01" "1960-01-01" "1961-01-01" "1962-01-01" "1963-01-01"
[106] "1964-01-01" "1965-01-01" "1966-01-01" "1967-01-01" "1968-01-01"
[111] "1969-01-01" "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01"
[116] "1974-01-01" "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01"
[121] "1979-01-01" "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01"
[126] "1984-01-01" "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01"
[131] "1989-01-01" "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01"
[136] "1994-01-01" "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01"
[141] "1999-01-01" "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01"
[146] "2004-01-01" "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01"
[151] "2009-01-01" "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01"
[156] "2014-01-01" "2015-01-01" "2016-01-01" "2017-01-01" "2018-01-01"
[161] "2019-01-01" "2020-01-01" "2021-01-01" "2022-01-01"

Plot the series 1

ggplot(data=unemploymnt_data,aes(x=date))+
  geom_line(aes(y=Inflation,colour="Inflation Rate"))+
  labs(title="Trends of Inflation Rate from 1859 to 2022",
       caption="", y="Inflation Rate", x="Time in Years", color="Key")+
  scale_x_date( date_labels = "%Y", breaks = "4 year")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Trends of Inflation Rate from 1859 to 2022

Plot the series 2

ggplot(data=unemploymnt_data,aes(x=date))+
  geom_line(aes(y=Unemployment,colour="Unemployment Rate"))+
  labs(title="Trends of Unemployment Rate from 1859 to 2022",
       caption="", y="Unemployment Rate", x="Time in Years", color="Key")+
  scale_x_date( date_labels = "%Y", breaks = "4 year")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))

Plot the series 3

ggplot(data=unemploymnt_data,aes(x=date))+
  geom_line(aes(y=FedRate,colour="Fed Fund Rate"))+
  labs(title="Trends of Fed Fund Rate from 1859 to 2022",
       caption="", y="Fed Fund Rate", x="Time in Years", color="Key")+
  scale_x_date( date_labels = "%Y", breaks = "4 year")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Trends of Fed Fund Rate from 1859 to 2022

Combined Plot 5

ggplot(data=unemploymnt_data,aes(x=date))+
  geom_line(aes(y=FedRate,colour="Fed Fund Rate"))+
  geom_line(aes(y=Inflation,colour="Inflation Rate"))+
  geom_line(aes(y=Unemployment,colour="Unemployment Rate"))+
  labs(title="Trends of Fed Fund Rate, Inflation Rate and Unemployment
       Rate from 1859 to 2022",
       caption="", y="Rate", x="Time in Years", color="Key")+
  scale_x_date( date_labels = "%Y", breaks = "4 year")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Trends of Fed Rate, Inflation and Unemployment from 1859 to 2022

Create plot1

plot_1<-ggplot(data=unemploymnt_data,aes(x=date,y=Inflation))+geom_line()+
  labs(title="Inflation Rate from 1859 to 2022",
       caption="", y="Inflation Rate", x="Time in Years", color=3)+
  scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))

Create plot2

plot_2<-ggplot(data=unemploymnt_data,aes(x=date,y=Unemployment))+geom_line()+
  labs(title="Unemployment Rate from 1859 to 2022",
       caption="", y="Unemployment Rate", x="Time in Years", color="Key")+
  scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))

Create Plot3

plot_3<-ggplot(data=unemploymnt_data,aes(x=date,y=FedRate))+geom_line()+
  labs(title="Fed Fund Rates from 1859 to 2022",
       caption="", y="Fed Fund Rates", x="Time in Years", color="Key")+
  scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))

Create the three plots above in one figure.

grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),ggplotGrob(plot_2),ggplotGrob(plot_3),size="last"))

grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),ggplotGrob(plot_2),size="last"))

grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),ggplotGrob(plot_3),size="last"))

grid.newpage()
grid.draw(rbind(ggplotGrob(plot_2),ggplotGrob(plot_3),size="last"))

grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),size="last"))

grid.newpage()
grid.draw(rbind(ggplotGrob(plot_2),size="last"))

grid.newpage()
grid.draw(rbind(ggplotGrob(plot_3),size="last"))

Combine the residuals with the original variables in a data frame

data11<-data.frame(unemploymnt_data$year,unemploymnt_data$Inflation,unemploymnt_data$FedRate,unemploymnt_data$Unemployment,RSD)
head(data11,10)
   unemploymnt_data.year unemploymnt_data.Inflation unemploymnt_data.FedRate
1                   1859                  0.9084719                 3.933333
2                   1860                  1.8107772                 3.696667
3                   1861                  1.6227203                 2.936667
4                   1862                  1.7953352                 2.296667
5                   1863                  0.5370330                 2.003333
6                   1864                  0.7149242                 1.733333
7                   1865                  0.8918621                 1.683333
8                   1866                  1.0676163                 2.400000
9                   1867                  2.3034394                 2.456667
10                  1868                  1.2348411                 2.606667
   unemploymnt_data.Unemployment         RSD
1                       5.133333 -0.82027544
2                       5.233333 -0.07333963
3                       5.533333  0.03190288
4                       6.266667  0.35104981
5                       6.800000 -0.73682650
6                       7.000000 -0.31447083
7                       6.766667 -0.05877223
8                       6.200000 -0.20979873
9                       5.633333  0.55330508
10                      5.533333 -0.12486512

Export the data set above

write.csv(data11,"C:\\Users\\user\\Downloads\\residual_data.csv", row.names = FALSE)

Plot the residuals

ggplot(data=data11,aes(x=year,y=RSD))+geom_line()+
  labs(title="Regression Residuals",
       caption="Source:Federal Reserve", y="Residuals", x="Time in Years", color=3)

Import the residuals

residual_data<-read.csv("C:\\Users\\user\\Downloads\\residual_data.csv")
attach(residual_data)

View the data set

head(residual_data,10)
   unemploymnt_data.year unemploymnt_data.Inflation unemploymnt_data.FedRate
1                   1859                  0.9084719                 3.933333
2                   1860                  1.8107772                 3.696667
3                   1861                  1.6227203                 2.936667
4                   1862                  1.7953352                 2.296667
5                   1863                  0.5370330                 2.003333
6                   1864                  0.7149242                 1.733333
7                   1865                  0.8918621                 1.683333
8                   1866                  1.0676163                 2.400000
9                   1867                  2.3034394                 2.456667
10                  1868                  1.2348411                 2.606667
   unemploymnt_data.Unemployment         RSD
1                       5.133333 -0.82027544
2                       5.233333 -0.07333963
3                       5.533333  0.03190288
4                       6.266667  0.35104981
5                       6.800000 -0.73682650
6                       7.000000 -0.31447083
7                       6.766667 -0.05877223
8                       6.200000 -0.20979873
9                       5.633333  0.55330508
10                      5.533333 -0.12486512

Create a time series object

Time<-seq(as.Date("1859-01-01"),by="1 year",length.out=length(residual_data$unemploymnt_data.year))
Time
  [1] "1859-01-01" "1860-01-01" "1861-01-01" "1862-01-01" "1863-01-01"
  [6] "1864-01-01" "1865-01-01" "1866-01-01" "1867-01-01" "1868-01-01"
 [11] "1869-01-01" "1870-01-01" "1871-01-01" "1872-01-01" "1873-01-01"
 [16] "1874-01-01" "1875-01-01" "1876-01-01" "1877-01-01" "1878-01-01"
 [21] "1879-01-01" "1880-01-01" "1881-01-01" "1882-01-01" "1883-01-01"
 [26] "1884-01-01" "1885-01-01" "1886-01-01" "1887-01-01" "1888-01-01"
 [31] "1889-01-01" "1890-01-01" "1891-01-01" "1892-01-01" "1893-01-01"
 [36] "1894-01-01" "1895-01-01" "1896-01-01" "1897-01-01" "1898-01-01"
 [41] "1899-01-01" "1900-01-01" "1901-01-01" "1902-01-01" "1903-01-01"
 [46] "1904-01-01" "1905-01-01" "1906-01-01" "1907-01-01" "1908-01-01"
 [51] "1909-01-01" "1910-01-01" "1911-01-01" "1912-01-01" "1913-01-01"
 [56] "1914-01-01" "1915-01-01" "1916-01-01" "1917-01-01" "1918-01-01"
 [61] "1919-01-01" "1920-01-01" "1921-01-01" "1922-01-01" "1923-01-01"
 [66] "1924-01-01" "1925-01-01" "1926-01-01" "1927-01-01" "1928-01-01"
 [71] "1929-01-01" "1930-01-01" "1931-01-01" "1932-01-01" "1933-01-01"
 [76] "1934-01-01" "1935-01-01" "1936-01-01" "1937-01-01" "1938-01-01"
 [81] "1939-01-01" "1940-01-01" "1941-01-01" "1942-01-01" "1943-01-01"
 [86] "1944-01-01" "1945-01-01" "1946-01-01" "1947-01-01" "1948-01-01"
 [91] "1949-01-01" "1950-01-01" "1951-01-01" "1952-01-01" "1953-01-01"
 [96] "1954-01-01" "1955-01-01" "1956-01-01" "1957-01-01" "1958-01-01"
[101] "1959-01-01" "1960-01-01" "1961-01-01" "1962-01-01" "1963-01-01"
[106] "1964-01-01" "1965-01-01" "1966-01-01" "1967-01-01" "1968-01-01"
[111] "1969-01-01" "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01"
[116] "1974-01-01" "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01"
[121] "1979-01-01" "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01"
[126] "1984-01-01" "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01"
[131] "1989-01-01" "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01"
[136] "1994-01-01" "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01"
[141] "1999-01-01" "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01"
[146] "2004-01-01" "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01"
[151] "2009-01-01" "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01"
[156] "2014-01-01" "2015-01-01" "2016-01-01" "2017-01-01" "2018-01-01"
[161] "2019-01-01" "2020-01-01" "2021-01-01" "2022-01-01"

Create the plot for the residuals

residual_plot<-ggplot(data=residual_data,aes(x=Time,y=RSD))+geom_line()+
  labs(title="Regression Residuals",
       caption="", y="Residual", x="Time in Years", color="Key")+
  scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))

Plot the residual

grid.newpage()
grid.draw(rbind(ggplotGrob(residual_plot),size="last"))

Additional Tests

Random Sampling

sample_sales<-sample(Superstore_data$Sales,size = 200, replace = FALSE)
sample_profit<-sample(Superstore_data$Profit,size = 200, replace = FALSE)
sample_quantity<-sample(Superstore_data$Quantity,size = 200, replace = FALSE)
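
Note that sampling each column separately breaks the link between a transaction's Sales, Profit and Quantity. If the goal is a random sample of whole transactions, a minimal sketch (an alternative to the approach above) samples row indices instead:

# Sample 200 rows without replacement so Sales, Profit and Quantity stay paired
set.seed(123)  # seed chosen arbitrarily for reproducibility
row_idx <- sample(nrow(Superstore_data), size = 200, replace = FALSE)
sample_rows <- Superstore_data[row_idx, c("Sales", "Profit", "Quantity")]
head(sample_rows)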

Data frame the variables

sample_data<-data.frame(sample_sales,sample_profit,sample_quantity)

View the sample

head(sample_data,10)
   sample_sales sample_profit sample_quantity
1        14.940        2.2240               2
2        45.960        7.2000               3
3        30.816       10.2480               3
4       119.960       24.8832               5
5       719.960       15.7640               3
6       181.860        1.3068               7
7       479.970       57.5016               5
8         6.848      -12.4146               1
9        10.368       18.6606               4
10       23.744       11.1900               5
tail(sample_data,10)
    sample_sales sample_profit sample_quantity
191       22.580        4.3600               3
192       56.280        1.0430               4
193       23.976       14.2956               3
194       41.960        1.7640               2
195       25.344       18.0378               8
196      899.970        1.4742               5
197       49.080        3.6018               3
198        9.088      146.3880               4
199      129.920        2.3094               3
200       95.144       22.2264               5

OMITTED VARIABLE BIAS

In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The bias arises when the omitted variable is correlated with one or more of the included independent variables and also affects the dependent variable. Consider the data below.
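
Before turning to the imported data, a small simulation (purely illustrative, with made-up coefficients) shows the mechanism: when x2 both affects y and is correlated with x1, omitting x2 biases the estimated coefficient on x1.

set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)          # x2 is correlated with the included regressor x1
y  <- 2 * x1 + 3 * x2 + rnorm(n)   # both x1 and x2 affect y
# Full model: the coefficient on x1 is close to its true value of 2
coef(lm(y ~ x1 + x2))
# Omitting x2: the x1 coefficient is biased towards 2 + 3 * 0.8 = 4.4
coef(lm(y ~ x1))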

Load the dataset

research_data<-read.csv("C:\\Users\\user\\Downloads\\research_data.csv")
attach(research_data)

View the data set

head(research_data,10)
   year   gdp Private_investment income_tax value_added_tax import_tax
1  1973 15790               3645     1124.8          639.80     795.44
2  1974 18776               4075     1531.4          937.26     842.24
3  1975 21140               4837     1796.8         1185.48     983.62
4  1976 25562               5808     2149.4         1308.44    1057.18
5  1977 32699               7800     2846.8         1855.26    2083.94
6  1978 35601              10280     3121.4         1995.38    2025.48
7  1979 39543              10809     3437.0         3098.14    2049.64
8  1980 44648              12451     3951.6         3588.00    2919.00
9  1981 51641              14508     3993.4         3895.90    3674.24
10 1982 58214              13364     4624.6         3917.50    3305.84
tail(research_data,10)
   year     gdp Private_investment income_tax value_added_tax import_tax
37 2009 2365453             292488   288168.0        146791.0   41372.00
38 2010 2551161             238938   268291.0        172360.0   48903.00
39 2011 3725918             213845   290621.6        163940.4   48436.93
40 2012 4261370             242365   353693.7        174716.2   55397.81
41 2013 4745594             234586   422357.9        218297.2   61692.72
42 2014 5403471             273641   470102.0        243156.2   70245.12
43 2015 6284191             293645   540440.4        276504.4   75410.29
44 2016 7194163             261548   589921.4        287766.5   86329.96
45 2017 7749435             269542   557959.3        309977.4   85243.79
46 2018 7946248             256345   683377.3        381419.9   95354.98

Estimate the first Multiple Regression Equation

mlr1 <- lm(Private_investment~income_tax+value_added_tax, data = research_data)
stargazer(mlr1,report = "vc*stp",type = "text",out = "./q7results.txt")

===============================================
                        Dependent variable:    
                    ---------------------------
                        Private_investment     
-----------------------------------------------
income_tax                   -1.017***         
                              (0.344)          
                            t = -2.954         
                             p = 0.006         
                                               
value_added_tax              2.828***          
                              (0.658)          
                             t = 4.301         
                            p = 0.0001         
                                               
Constant                   40,402.690***       
                           (12,398.920)        
                             t = 3.259         
                             p = 0.003         
                                               
-----------------------------------------------
Observations                    46             
R2                             0.712           
Adjusted R2                    0.698           
Residual Std. Error    60,037.290 (df = 43)    
F Statistic           53.103*** (df = 2; 43)   
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

Estimate the second model with the omitted variable

mlr2<- lm(Private_investment~income_tax+value_added_tax+import_tax, data = research_data)
stargazer(mlr2, report = "vc*stp",type = "text",out = "./q7results.txt")

===============================================
                        Dependent variable:    
                    ---------------------------
                        Private_investment     
-----------------------------------------------
income_tax                   -1.026***         
                              (0.323)          
                            t = -3.174         
                             p = 0.003         
                                               
value_added_tax               1.752**          
                              (0.743)          
                             t = 2.358         
                             p = 0.024         
                                               
import_tax                    4.249**          
                              (1.630)          
                             t = 2.606         
                             p = 0.013         
                                               
Constant                    15,523.390         
                           (15,054.700)        
                             t = 1.031         
                             p = 0.309         
                                               
-----------------------------------------------
Observations                    46             
R2                             0.752           
Adjusted R2                    0.734           
Residual Std. Error    56,362.240 (df = 42)    
F Statistic           42.433*** (df = 3; 42)   
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

Test whether there is an omitted variable bias

waldtest(mlr1,mlr2,test = "F")
Wald test

Model 1: Private_investment ~ income_tax + value_added_tax
Model 2: Private_investment ~ income_tax + value_added_tax + import_tax
  Res.Df Df      F  Pr(>F)  
1     43                    
2     42  1 6.7904 0.01263 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation

The null hypothesis for this test states that the coefficient of the omitted variable is zero; the implication is that if we fail to reject the null hypothesis, the variable was correctly omitted. The alternative hypothesis states that the coefficient of the omitted variable is not equal to zero, so rejecting the null hypothesis indicates that the variable was incorrectly omitted. From the results above, the p-value is approximately 0.013, which indicates that we should reject the null hypothesis and conclude that import_tax was incorrectly omitted: the omitted variable helps to explain the variation in the response variable.
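
As a cross-check (a sketch, equivalent here because only one regressor is added), the two nested models can also be compared with anova(), which carries out the same F-test.

# F-test comparing the restricted model (mlr1) with the unrestricted model (mlr2)
anova(mlr1, mlr2)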

Foreign Direct Investment

fdi_data<-read.csv("C:\\Users\\user\\Downloads\\fdi.csv")
attach(fdi_data)

View data set

head(fdi_data,10)
   year FDI_USA....of.GDP. FDI_Kenya....of.GDP. FDI_UK....of.GDP.
1  1970         0.11366781            0.8606457         1.1387286
2  1971         0.06610293            0.4161064         1.1958529
3  1972         0.09928779            0.2989637         0.7107128
4  1973         0.13540287            0.6879239         1.4141275
5  1974         0.22909018            0.7885675         2.1218233
6  1975         0.13709980            0.5264477         1.3726762
7  1976         0.15533156            1.3346175         1.2923558
8  1977         0.14026148            1.2581322         1.6826721
9  1978         0.23388341            0.6488659         1.1275862
10 1979         0.30639436            1.3475238         1.4737058

Declare the FDI_USA data an annual time series starting in 1970.

FDI_USA....of.GDP.<-ts(fdi_data$FDI_USA....of.GDP.,frequency = 1,start = 1970)

Plot the series

ts.plot(FDI_USA....of.GDP.,main="Yearly time series plot of Foreign Direct 
        Investment Net Flow in the USA from 1970 to 2020", xlab="Time in Years",ylab="Foreign Direct Investment")
Yearly time series of FDI in USA

Declare FDI_UK a time series

FDI_UK....of.GDP.<-ts(fdi_data$FDI_UK....of.GDP.,frequency = 1,start = 1970)

Plot the series

ts.plot(FDI_UK....of.GDP.,main="Yearly time series plot of Foreign Direct 
        Investment Net Flow in the UK from 1970 to 2020", xlab="Time in Years",ylab="Foreign Direct Investment")
Yearly time series of FDI in UK

Declare FDI_Kenya a time series

FDI_Kenya....of.GDP.<-ts(fdi_data$FDI_Kenya....of.GDP.,frequency = 1,start = 1970)

Plot the series

ts.plot(FDI_Kenya....of.GDP.,main="Yearly time series plot of Foreign Direct 
        Investment Net Flow in Kenya from 1970 to 2020", xlab="Time in Years",ylab="Foreign Direct Investment")
Yearly time series of FDI in Kenya

Create a time series object

date_time<-seq(as.Date("1970-01-01"),by="1 year",length.out=length(fdi_data$year))
date_time
 [1] "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01" "1974-01-01"
 [6] "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01" "1979-01-01"
[11] "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01" "1984-01-01"
[16] "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01" "1989-01-01"
[21] "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01" "1994-01-01"
[26] "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01" "1999-01-01"
[31] "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01" "2004-01-01"
[36] "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01" "2009-01-01"
[41] "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01" "2014-01-01"
[46] "2015-01-01" "2016-01-01" "2017-01-01" "2018-01-01" "2019-01-01"
[51] "2020-01-01"

Visualize the plots using ggplot

The United States

USA<-ggplot(data=fdi_data,aes(x=date_time,y=FDI_USA....of.GDP.))+geom_line()+
  labs(title="FDI Inflow in the USA from 1970 to 2020",
       caption="", y="Foreign Direct Investment", x="Time in Years", color="Key")+
  scale_x_date( date_labels = "%Y-%b", breaks = "2 years")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))

Kenya

Kenya<-ggplot(data=fdi_data,aes(x=date_time,y=FDI_Kenya....of.GDP.))+geom_line()+
  labs(title="FDI Inflow in Kenya from 1970 to 2020",
       caption="", y="Foreign Direct Investment", x="Time in Years", color="Key")+
  scale_x_date( date_labels = "%Y-%b", breaks = "2 years")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))

The United Kingdom (UK)

UK<-ggplot(data=fdi_data,aes(x=date_time,y=FDI_UK....of.GDP.))+geom_line()+
  labs(title="FDI Inflow in UK from 1970 to 2020",
       caption="", y="Foreign Direct Investment", x="Time in Years", color="Key")+
  scale_x_date( date_labels = "%Y-%b", breaks = "2 years")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))

Plot the graph for the USA

grid.newpage()
grid.draw(rbind(ggplotGrob(USA),size="last"))

Plot the graph for Kenya

grid.newpage()
grid.draw(rbind(ggplotGrob(Kenya),size="last"))

Plot the graph for the UK

grid.newpage()
grid.draw(rbind(ggplotGrob(UK),size="last"))

Plot the three graphs all together

ggplot(data=fdi_data,aes(x=year))+
  geom_line(aes(y=FDI_USA....of.GDP.,colour="FDI_USA"))+
  geom_line(aes(y=FDI_UK....of.GDP.,colour="FDI_UK"))+
  geom_line(aes(y=FDI_Kenya....of.GDP.,colour="FDI_Kenya"))+
  labs(title="Trends of FDI inflow as a percentage of GDP from 1970 to 2020",
       caption="", y="FDI", x="Time in Years",color="Key")+
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 10))
Trends of FDI inflow as a percentage of GDP from 1970 to 2020

FOREIGN DIRECT INVESTMENT

ForDI<-read.csv("C:\\Users\\user\\Downloads\\ForDI.csv")
attach(ForDI)

View the data set

head(ForDI,10)
   Year United.States United.Kingdom    Germany     France South.Africa
1  1970    5270000000      189599849         NA         NA            0
2  1971    4850000000      216920701   96932868         NA            0
3  1972    6060000000      809492007  -92717508         NA            0
4  1973    7410000000     2258637186 -138757334         NA            0
5  1974    1620000000        2246137  -54769412         NA            0
6  1975   11420000000     -317828335 1490736257 -225901953            0
7  1976    8410000000     1331645274 1287619730  660768470            0
8  1977    8360000000     -253403562 1457079065 -893591988            0
9  1978    8870000000     3027697093 2336498531 -570258945            0
10 1979   16670000000     6069585369 3282971189 -601262768            0
       Kenya    Nigeria
1         NA         NA
2         NA         NA
3         NA         NA
4         NA         NA
5         NA         NA
6  -15796942         NA
7  -42069308         NA
8  -53887117 -440514243
9  -32085354 -210933271
10 -78123859 -304632042

Declare the data time series

ForDI_TR<- ts(data=ForDI[,-1],start = min(ForDI$Year),end =max(ForDI$Year))

Plot the series

plot(ForDI_TR,plot.type = "single",col=1:7)
legend("topleft",legend = colnames(ForDI_TR),ncol = 2,lty = 1, col = 1:7,cex = 0.9)
Time series plot of Foreign Direct Investment

Plot the graphs separately

plot(ForDI_TR)

Multivariate Time series Forecasting

ARIMA Modelling

ARIMA models provide another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely used approaches to time series forecasting, and provide complementary approaches to the problem. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data. Before we introduce ARIMA models, we must first discuss the concept of stationarity and the technique of differencing time series.

Stationarity

A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time. Some cases can be confusing — a time series with cyclic behaviour (but with no trend or seasonality) is stationary. This is because the cycles are not of a fixed length, so before we observe the series we cannot be sure where the peaks and troughs of the cycles will be. In general, a stationary time series will have no predictable patterns in the long-term. Time plots will show the series to be roughly horizontal (although some cyclic behaviour is possible), with constant variance.
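
A small simulated example (not part of the original data) makes the distinction concrete: white noise is stationary, while a random walk, its cumulative sum, is not.

set.seed(1)
white_noise <- ts(rnorm(200))          # stationary: constant mean and variance
random_walk <- ts(cumsum(rnorm(200)))  # non-stationary: wanders with no fixed mean
par(mfrow = c(1, 2))
ts.plot(white_noise, main = "White noise (stationary)")
ts.plot(random_walk, main = "Random walk (non-stationary)")
par(mfrow = c(1, 1))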

Unit root tests

One way to determine more objectively whether differencing is required is to use a unit root test. These are statistical hypothesis tests of stationarity that are designed for determining whether differencing is required. A number of unit root tests are available, which are based on different assumptions and may lead to conflicting answers. One widely used option is the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test (Kwiatkowski, Phillips, Schmidt, & Shin, 1992). In this test, the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false; consequently, small p-values (e.g., less than 0.05) suggest that differencing is required. The test can be computed using the ur.kpss() function from the urca package. In the analysis below, however, we apply the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is the opposite: the series contains a unit root (is non-stationary), so small p-values suggest stationarity.
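
A minimal sketch of the KPSS test described above, applied to a simulated random walk (non-stationary by construction) rather than to the study data; it assumes the urca package is installed.

library(urca)
set.seed(2)
y <- cumsum(rnorm(150))   # simulated random walk, used only for illustration
kpss_fit <- ur.kpss(y)
# A test statistic larger than the critical values rejects the null of stationarity
summary(kpss_fit)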

Import the data set

gdp_data<-read.csv("C:\\Users\\user\\Downloads\\ARIMA.csv")

View the data

head(gdp_data,5)
  year realgdp
1 1817  1610.5
2 1818  1658.8
3 1819  1723.0
4 1820  1753.9
5 1821  1773.5

The structure of the data

str(gdp_data)
'data.frame':   204 obs. of  2 variables:
 $ year   : int  1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 ...
 $ realgdp: num  1610 1659 1723 1754 1774 ...

Class of the data

class(gdp_data)
[1] "data.frame"

Declare the data time series

gdp_TS<-ts(gdp_data$realgdp,frequency = 1,start = 1817)

The structure of the new data set

class(gdp_TS)
[1] "ts"

Plot the dataset

plot(gdp_TS)

Augmented Dickey-Fuller Test (Unit root test)

adf.test(gdp_TS)

    Augmented Dickey-Fuller Test

data:  gdp_TS
Dickey-Fuller = 0.28354, Lag order = 5, p-value = 0.99
alternative hypothesis: stationary

From the result above, the p-value (0.99) is far greater than 0.05, so we fail to reject the null hypothesis of a unit root: the GDP series is highly correlated with its own past values and is therefore not stationary.
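
A common follow-up (a sketch, not in the original output) is to estimate how many differences are needed and to re-test the differenced series. ndiffs() comes from the forecast package that is used below for auto.arima().

# Estimated number of differences required to achieve stationarity
ndiffs(gdp_TS)
# Re-run the ADF test on the twice-differenced series (auto.arima selects d = 2 below)
adf.test(diff(gdp_TS, differences = 2))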

Check for autocorrelation (ACF)

acf(gdp_TS)

pacf(gdp_TS) 

Run the ARIMA by automatically Converting non-Stationary data to stationary data

auto.arima(gdp_TS,ic="aic", trace = TRUE)

 Fitting models using approximations to speed things up...

 ARIMA(2,2,2)                    : 2039.769
 ARIMA(0,2,0)                    : 2114.799
 ARIMA(1,2,0)                    : 2073.859
 ARIMA(0,2,1)                    : 2049.559
 ARIMA(1,2,2)                    : 2047.647
 ARIMA(2,2,1)                    : 2037.924
 ARIMA(1,2,1)                    : 2046.277
 ARIMA(2,2,0)                    : 2067.973
 ARIMA(3,2,1)                    : 2041.463
 ARIMA(3,2,0)                    : 2064.69
 ARIMA(3,2,2)                    : 2043.092

 Now re-fitting the best model(s) without approximations...

 ARIMA(2,2,1)                    : 2054.273

 Best model: ARIMA(2,2,1)                    
Series: gdp_TS 
ARIMA(2,2,1) 

Coefficients:
         ar1     ar2      ma1
      0.2633  0.1442  -0.9656
s.e.  0.0725  0.0726   0.0195

sigma^2 = 1477:  log likelihood = -1023.14
AIC=2054.27   AICc=2054.48   BIC=2067.51

Fit the model

ARIMAMODEL<-auto.arima(gdp_TS,ic="aic", trace = TRUE)

 Fitting models using approximations to speed things up...

 ARIMA(2,2,2)                    : 2039.769
 ARIMA(0,2,0)                    : 2114.799
 ARIMA(1,2,0)                    : 2073.859
 ARIMA(0,2,1)                    : 2049.559
 ARIMA(1,2,2)                    : 2047.647
 ARIMA(2,2,1)                    : 2037.924
 ARIMA(1,2,1)                    : 2046.277
 ARIMA(2,2,0)                    : 2067.973
 ARIMA(3,2,1)                    : 2041.463
 ARIMA(3,2,0)                    : 2064.69
 ARIMA(3,2,2)                    : 2043.092

 Now re-fitting the best model(s) without approximations...

 ARIMA(2,2,1)                    : 2054.273

 Best model: ARIMA(2,2,1)                    

View the model

ARIMAMODEL
Series: gdp_TS 
ARIMA(2,2,1) 

Coefficients:
         ar1     ar2      ma1
      0.2633  0.1442  -0.9656
s.e.  0.0725  0.0726   0.0195

sigma^2 = 1477:  log likelihood = -1023.14
AIC=2054.27   AICc=2054.48   BIC=2067.51
summary(ARIMAMODEL)
Series: gdp_TS 
ARIMA(2,2,1) 

Coefficients:
         ar1     ar2      ma1
      0.2633  0.1442  -0.9656
s.e.  0.0725  0.0726   0.0195

sigma^2 = 1477:  log likelihood = -1023.14
AIC=2054.27   AICc=2054.48   BIC=2067.51

Training set error measures:
                   ME     RMSE      MAE        MPE      MAPE      MASE
Training set 2.803157 37.96217 29.06838 0.02152401 0.7154865 0.6297229
                     ACF1
Training set -0.008628346

Check the model residuals for autocorrelation (ACF and PACF)

acf(ts(ARIMAMODEL$residuals))

pacf(ts(ARIMAMODEL$residuals))

From the literature, the ACF is used to identify the order of the Moving Average (MA) term, and the PACF the order of the Autoregressive (AR) term. From the results above, the MA term has one lag, and the AR term has two lags.
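As a minimal sketch of this identification approach applied to the series itself (our addition; d = 2 matches the differencing order selected by auto.arima above):

d2_gdp <- diff(gdp_TS, differences = 2)   # twice-differenced GDP series
acf(d2_gdp)    # significant spikes suggest the MA order
pacf(d2_gdp)   # significant spikes suggest the AR order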

Forecasting

MYGDPF<-forecast(ARIMAMODEL,level=c(95), h=50)
MYGDPF
     Point Forecast     Lo 95     Hi 95
2021       9357.475  9282.142  9432.809
2022       9415.513  9292.099  9538.927
2023       9476.136  9306.002  9646.271
2024       9538.083  9325.159  9751.008
2025       9600.752  9347.949  9853.555
2026       9663.801  9373.544  9954.058
2027       9727.055  9401.215 10052.895
2028       9790.417  9430.445 10150.390
2029       9853.838  9460.849 10246.827
2030       9917.289  9492.143 10342.436
2031       9980.757  9524.114 10437.401
2032      10044.234  9556.599 10531.870
2033      10107.716  9589.474 10625.958
2034      10171.200  9622.640 10719.761
2035      10234.686  9656.019 10813.352
2036      10298.172  9689.550 10906.793
2037      10361.658  9723.181 11000.136
2038      10425.145  9756.870 11093.420
2039      10488.632  9790.583 11186.681
2040      10552.119  9824.290 11279.948
2041      10615.606  9857.967 11373.244
2042      10679.093  9891.594 11466.592
2043      10742.580  9925.152 11560.008
2044      10806.067  9958.626 11653.508
2045      10869.554  9992.003 11747.105
2046      10933.041 10025.271 11840.810
2047      10996.528 10058.422 11934.634
2048      11060.015 10091.445 12028.584
2049      11123.502 10124.335 12122.669
2050      11186.989 10157.083 12216.894
2051      11250.476 10189.685 12311.266
2052      11313.963 10222.136 12405.790
2053      11377.450 10254.430 12500.470
2054      11440.937 10286.565 12595.309
2055      11504.424 10318.536 12690.312
2056      11567.911 10350.341 12785.481
2057      11631.398 10381.976 12880.819
2058      11694.885 10413.441 12976.329
2059      11758.372 10444.733 13072.011
2060      11821.859 10475.849 13167.869
2061      11885.346 10506.789 13263.903
2062      11948.833 10537.552 13360.114
2063      12012.320 10568.135 13456.505
2064      12075.807 10598.538 13553.075
2065      12139.294 10628.761 13649.826
2066      12202.781 10658.803 13746.759
2067      12266.268 10688.663 13843.873
2068      12329.755 10718.340 13941.170
2069      12393.242 10747.834 14038.649
2070      12456.729 10777.146 14136.312

Plot the predicted series

plot(MYGDPF,type="l",main="Time Series plot of Forecasted GDP (ARIMA 2,2,1)",xlab="Time in Years",ylab="MYGDPF")
ARIMA forecast

Diagnosing the ARIMA Model/Validating the Model

Box.test(MYGDPF$resid, lag=5,type="Ljung-Box")

    Box-Ljung test

data:  MYGDPF$resid
X-squared = 1.6084, df = 5, p-value = 0.9002
Box.test(MYGDPF$resid, lag=2,type="Ljung-Box")

    Box-Ljung test

data:  MYGDPF$resid
X-squared = 0.025225, df = 2, p-value = 0.9875
Box.test(MYGDPF$resid, lag=20,type="Ljung-Box")

    Box-Ljung test

data:  MYGDPF$resid
X-squared = 23.863, df = 20, p-value = 0.2484

From the results, ARIMA(2,2,1) is adequate for prediction and shows no residual autocorrelation: the Box-Ljung tests give p-values well above 0.05 (e.g., 0.9875 at lag 2), so we fail to reject the null hypothesis of no autocorrelation.
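As an additional residual check (a sketch of ours, not part of the original analysis), the checkresiduals() function from the forecast package combines a residual time plot, an ACF plot and a histogram with a Ljung-Box test in one call:

library(forecast)
# Assumes the ARIMAMODEL object fitted above with auto.arima().
checkresiduals(ARIMAMODEL)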

Plot the forecast using ggplot2 (autoplot)

autoplot(forecast(MYGDPF))
ARIMA 2,2,1 Forecast

Extract the residuals

RESID<-residuals(ARIMAMODEL)

Summarize the residuals

summary(RESID)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-127.529  -20.085    2.219    2.803   25.356  154.703 

Plot the residuals

plot(RESID)
abline(2.803,0)
The plot of residuals with an abline

VECTOR AUTOREGRESSIVE MODEL (VAR)

In today’s lesson, you’ll learn the basics of the vector autoregressive model. We lay the foundation for getting started with this crucial multivariate time series model and cover the important details below.

What is a vector autoregressive model?

The vector autoregressive (VAR) model is a workhorse multivariate time series model that relates current observations of a variable to past observations of itself and past observations of the other variables in the system. VAR models differ from univariate autoregressive models because they allow feedback between the variables in the model. For example, we could use a VAR model to show how the inflation rate is a function of the unemployment rate and how the unemployment rate is, in turn, a function of the inflation rate. In economics we might be interested in establishing a bi-directional relationship between personal income and personal consumption spending. A two-equation VAR system is used to model the relationship between income and consumption over time.
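As an illustration (the notation here is ours, not from the original source), such a two-equation reduced-form VAR(1) could be written as:

\[ Income_{t} = a_{11} Income_{t-1} + a_{12} Consumption_{t-1} + e_{1t} \]

\[ Consumption_{t} = a_{21} Income_{t-1} + a_{22} Consumption_{t-1} + e_{2t} \]

Each variable depends on one lag of itself and one lag of the other variable, which is exactly the feedback described above.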

Who uses VAR models?

VAR models are traditionally widely used in finance and econometrics because they offer a framework for accomplishing important modeling goals, including (Stock and Watson 2001):

  • Data description.
  • Forecasting.
  • Structural inference.
  • Policy analysis.

However, more recently VAR models have been gaining traction in other fields like epidemiology, medicine, and biology.

The reduced form, recursive, and structural VAR

There are three broad types of VAR models, the reduced form, the recursive form, and the structural VAR model.

Reduced form VAR models consider each variable to be a function of:

  • Its own past values.
  • The past values of other variables in the model.
While reduced-form models are the simplest of the VAR models, they do come with disadvantages.

Recursive VAR models.

These models contain all the components of the reduced-form model, but also allow some variables to be functions of other contemporaneous variables. By imposing these short-run relationships, the recursive model allows us to model structural shocks.

Specifying a VAR model

What makes up a VAR model?

A VAR model is made up of a system of equations that represents the relationships between multiple variables. When referring to VAR models, we often use special language to specify:

  • How many endogenous variables are included.
  • How many autoregressive terms (lags) are included.

For example, if we have two endogenous variables and two autoregressive terms, we say the model is a bivariate VAR(2) model. If we have three endogenous variables and four autoregressive terms, we say the model is a trivariate VAR(4) model.

In general, a VAR model is composed of n-equations (representing n endogenous variables) and includes p-lags of the variables.
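In compact matrix form (notation introduced here purely for illustration), a VAR(p) with the n endogenous variables stacked in a vector y_t can be written as:

\[ y_{t} = c + A_{1} y_{t-1} + A_{2} y_{t-2} + \dots + A_{p} y_{t-p} + e_{t} \]

where c is a vector of intercepts, each A_i is an n x n coefficient matrix and e_t is an n-dimensional error vector.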

How do we choose the number of lags in a VAR model?

Lag selection is one of the important aspects of VAR model specification. In practical applications, we generally choose a maximum number of lags and evaluate the performance of the model for every lag length up to that maximum. The optimal model is then the VAR(p) that minimizes the chosen lag selection criterion. The most commonly used lag selection criteria are:

  • Akaike (AIC)
  • Schwarz-Bayesian (BIC)
  • Hannan-Quinn (HQ)

These methods are usually built into software, and lag selection is now almost completely automated; a minimal sketch is given below.
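A sketch with the vars package (our addition; the simulated data is purely illustrative, and the real estimation examples appear later in this document):

library(vars)
set.seed(123)
toy <- data.frame(x = rnorm(200), y = rnorm(200))        # illustration only
VARselect(toy, lag.max = 8, type = "const")$selection    # lags preferred by AIC, HQ, SC and FPE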

How do we decide what endogenous variables to include in our VAR model?

From an estimation standpoint, it is important to be deliberate about how many variables we include in our VAR model. Adding additional variables:

  • Increases the number of coefficients to be estimated for each equation and each additional lag.
  • Introduces additional estimation error.

Deciding what variables to include in a VAR model should be founded in theory, as much as possible. We can use additional tools, like Granger causality or Sims causality, to test the forecasting relevance of variables.

Granger causality tests

Tests whether a variable is “helpful” for forecasting the behavior of another variable. It’s important to note that Granger causality only allows us to make inferences about forecasting capabilities – not about true causality.

Estimating and inference in VAR models

Despite their seeming complexity, VAR models are quite easy to estimate. Each equation can be estimated using ordinary least squares given a few assumptions:

  • The error term has a conditional mean of zero.

  • The variables in the model are stationary.

  • Large outliers are unlikely.

  • No perfect multicollinearity.

Under these assumptions, the ordinary least squares estimates:

  • Will be consistent.

  • Can be evaluated using traditional t-statistics and p-values.

  • Can be used to jointly test restrictions across multiple equations.

Forecasting

One of the most important functions of VAR models is to generate forecasts. Forecasts are generated for VAR models using an iterative forecasting algorithm (a minimal sketch in R follows the list):

  • Estimate the VAR model using OLS for each equation.
  • Compute the one-period-ahead forecast for all variables.
  • Compute the two-period-ahead forecasts, using the one-period-ahead forecasts.
  • Iterate until the h-step-ahead forecasts are computed.
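A minimal sketch (our illustration) of this iterative forecasting using the vars package, again with simulated data standing in for a fitted model:

library(vars)
set.seed(1)
toy <- data.frame(x = rnorm(200), y = rnorm(200))   # illustration only
fit <- VAR(toy, p = 1, type = "const")              # OLS equation by equation
fc  <- predict(fit, n.ahead = 8, ci = 0.95)         # iterative h-step-ahead forecasts
fanchart(fc)                                        # plot forecasts with interval fans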

Reporting and evaluating VAR models

Often we are more interested in the dynamics that are predicted by our VAR models than the actual coefficients that are estimated. For this reason, it is most common that VAR studies report:

  • Granger-causality statistics.
  • Impulse response functions.
  • Forecast error decompositions.

Granger-causality statistics

As we previously discussed, Granger-causality statistics test whether one variable is statistically significant when predicting another variable. In other words, the Granger-causality statistics are F-statistics that test whether the coefficients of all lags of a variable are jointly equal to zero in the equation for another variable. As the p-value of the F-statistic decreases, the evidence that a variable is relevant for predicting another variable increases.

For example, in the Granger-causality test of X on Y, if the p-value is 0.02 we would say that X does help predict Y at the 5% level. However, if the p-value is 0.3 we would say that there is no evidence that X helps predict Y.

Impulse response functions

The impulse response function traces the dynamic path of variables in the system to shocks to other variables in the system. This is done by:

  • Estimating the VAR model.
  • Implementing a one-unit increase in the error of one of the variables in the model, while holding the other errors equal to zero.
  • Predicting the impacts h-period ahead of the error shock.
  • Plotting the forecasted impacts, along with the one-standard-deviation confidence intervals.

Forecast error decomposition

Forecast error decomposition separates the forecast error variance into proportions attributed to each variable in the model. Intuitively, this measure helps us judge how much of an impact one variable has on another variable in the VAR model and how intertwined our variables’ dynamics are. For example, if X is responsible for 85% of the forecast error variance of Y, it is explaining a large amount of the forecast variation in Y. However, if X is only responsible for 20% of the forecast error variance of Y, much of the forecast error variance of Y is left unexplained by X.

Conclusion

VAR models are an essential component of multivariate time series modeling. After today’s blog, you should have a better understanding of the fundamentals of the VAR model including:

  • What a VAR model is.
  • Who uses VAR models.
  • Basic types of VAR models.
  • How to specify a VAR model.
  • Estimation and forecasting with VAR models.

ESTIMATING VAR MODEL IN RSTUDIO

Load the dataset
var<-read.csv("C:\\Users\\user\\Downloads\\estimation.csv")

View data set

head(var,10)
    Inflation unemployment
1   0.0000000    0.0000000
2  -1.2058566    0.7304148
3  -0.3206307   -1.8469421
4   0.5431992   -1.5017315
5  -0.6235946   -1.2889562
6  -0.5725677   -1.1614199
7  -0.6476382   -0.8271906
8   0.5254592   -0.9146628
9   0.1391321    0.6737464
10 -1.3313781   -0.1606959

WHAT TO CONSIDER WHEN ESTIMATING VAR MODEL

  • Check the integrating properties
  • Check the required lag length (optimal lag length)
  • Use the Akaike Information Criterion (AIC) and related criteria
  • Estimate the VAR model
  • Test the stability of the model
  • Granger causality test
  • Generate the impulse response functions

Converting variables into time series objects

Inflation<-ts(var$Inflation,start=c(2000,1),frequency = 12)
plot.ts(Inflation,type="l",main="Time Series plot Inflation",xlab="Year",ylab="inflation")
The time Series plot Inflation

unemployment<-ts(var$unemployment,start=c(2000,1),frequency = 12)
plot.ts(unemployment,type="l",main="Time Series plot unemployment",xlab="Year",ylab="unemployment")
Time Series plot unemployment

Augmented Dickey-Fuller Test

The Augmented Dickey-Fuller test (ADF test) is a common statistical test used to check whether a given time series is stationary or not. It is one of the most commonly used statistical tests when it comes to analyzing the stationarity of a series.

The null hypothesis is the same as for the Dickey-Fuller test: the series has a unit root (alpha = 1). The p-value therefore needs to be less than the significance level (say 0.05) in order to reject the null hypothesis and conclude that the series is stationary. A very common mistake analysts make with this test is to invert this logic and take a p-value below the significance level as evidence that the series is non-stationary.

ADF<-adf.test(Inflation)
ADF

    Augmented Dickey-Fuller Test

data:  Inflation
Dickey-Fuller = -5.7974, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
ADF1<-adf.test(unemployment)
ADF1

    Augmented Dickey-Fuller Test

data:  unemployment
Dickey-Fuller = -4.0617, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary

Optimal lag length

lag<-VARselect(var, lag.max=4)
lag
$selection
AIC(n)  HQ(n)  SC(n) FPE(n) 
     1      1      1      1 

$criteria
                  1           2           3           4
AIC(n) -0.093318151 -0.08006220 -0.08372238 -0.07044136
HQ(n)  -0.052691480 -0.01235108  0.01107318  0.05143865
SC(n)   0.007032298  0.08718855  0.15042866  0.23060999
FPE(n)  0.910908005  0.92307937  0.91974245  0.93210286

The four criteria above indicate that the appropriate number of lags is one; including more than one lag in the model would be inappropriate.

VAR Model Estimation

EST<-VAR(var, p=1, type = "none")

Summarize the Model

summary(EST)

VAR Estimation Results:
========================= 
Endogenous variables: Inflation, unemployment 
Deterministic variables: none 
Sample size: 199 
Log Likelihood: -549.51 
Roots of the characteristic polynomial:
 0.43  0.43
Call:
VAR(y = var, p = 1, type = "none")


Estimation results for equation Inflation: 
========================================== 
Inflation = Inflation.l1 + unemployment.l1 

                Estimate Std. Error t value Pr(>|t|)    
Inflation.l1     0.25493    0.06887   3.702 0.000278 ***
unemployment.l1 -0.05589    0.04771  -1.171 0.242875    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.9565 on 197 degrees of freedom
Multiple R-Squared: 0.06981,    Adjusted R-squared: 0.06037 
F-statistic: 7.392 on 2 and 197 DF,  p-value: 0.0008023 


Estimation results for equation unemployment: 
============================================= 
unemployment = Inflation.l1 + unemployment.l1 

                Estimate Std. Error t value             Pr(>|t|)    
Inflation.l1     0.58693    0.07050   8.325    0.000000000000014 ***
unemployment.l1  0.59648    0.04885  12.211 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.9792 on 197 degrees of freedom
Multiple R-Squared: 0.5342, Adjusted R-squared: 0.5295 
F-statistic:   113 on 2 and 197 DF,  p-value: < 0.00000000000000022 



Covariance matrix of residuals:
             Inflation unemployment
Inflation       0.9104      -0.0391
unemployment   -0.0391       0.9578

Correlation matrix of residuals:
             Inflation unemployment
Inflation      1.00000     -0.04187
unemployment  -0.04187      1.00000

Use stargazer to organize the results

stargazer(EST[["varresult"]], type='text')

===========================================================
                                   Dependent variable:     
                               ----------------------------
                                            y              
                                    (1)            (2)     
-----------------------------------------------------------
Inflation.l1                      0.255***      0.587***   
                                  (0.069)        (0.071)   
                                                           
unemployment.l1                    -0.056       0.596***   
                                  (0.048)        (0.049)   
                                                           
-----------------------------------------------------------
Observations                        199            199     
R2                                 0.070          0.534    
Adjusted R2                        0.060          0.530    
Residual Std. Error (df = 197)     0.957          0.979    
F Statistic (df = 2; 197)         7.392***     112.979***  
===========================================================
Note:                           *p<0.1; **p<0.05; ***p<0.01

Interpretation of the coefficients

In practice, you usually don’t interpret the coefficients of a VAR model, and there is a fairly intuitive reason for that. As you recall, VARs assume that all relevant variables affect each other through time as one system, so much so that in practice a VAR estimates as many equations as there are variables. For example, if you estimate a 3-variable VAR, you get an output consisting of 3 estimated equations. This means that each variable gets the “opportunity” of being endogenous while the others act as regressors. Since the VAR assumes this, you would not want to interpret individual coefficients: doing so would imply that an x-variable is causally affecting the y-variable, which, by construction, the VAR does not assume. That is why most analyses skip the coefficients and jump to other tools, like impulse response functions and their graphs.

Additionally

Strictly speaking, the coefficients of a reduced-form VAR have the same interpretation as the coefficients in any regression, i.e., they are the estimated partial derivatives of the dependent variable with respect to the lagged explanatory variables. However, we are typically not interested in these coefficients, but rather in the dynamic properties of the whole system. After all, a VAR is intended to summarize the dynamic behavior of a group of variables, all of which are endogenous with respect to one another. So, we typically focus on the impulse response functions (IRFs) of the VAR, which provide the (cumulative) total derivatives of the endogenous variables with respect to an exogenous shock to one of the variables. More technically, calculating IRFs is equivalent to inverting the vector autoregressive representation to obtain the vector moving average representation. Note that doing so requires that the autoregressive lag polynomial be invertible, meaning no unit roots in the underlying data.

However, there is an additional complication, which gets to the root of the difference between VARs and the linear simultaneous equation models of the 1970s that they replaced. Since only lags appear as regressors in a VAR, any contemporaneous correlations between variables are captured in the residual covariance matrix. But IRFs become invalid when errors are correlated, since the IRF explicitly assumes that only one variable is being shocked at t=0. Therefore, in addition to the reduced-form coefficient matrix being invertible, we require a means by which the residuals from the VAR can be rendered orthogonal (uncorrelated) with respect to one another. This is how we get from a reduced-form to a structural VAR whose IRFs are valid. Unfortunately, the orthogonality requirement alone is not sufficient to produce a unique structural representation. In other words, there exists an infinity of different structural representations of the VAR that (a) are consistent with the estimated reduced-form parameters of the VAR, and (b) possess orthogonal shocks. One popular example is the Cholesky factorization of the residual covariance matrix.

To summarize, the IRFs of VAR are much easier to interpret than the parameters of the reduced-form VAR. But calculating valid IRFs requires (a) stationary ( I(0) ) endogenous variables (or cointegration of the variables) and (b) some form of factorization of the residual covariance matrix that renders the resulting shocks orthogonal with respect to one another. There are other more subtle requirements such as linearity of the underlying data generating process, but these are the 2 big ones.

Testing the stability of the model: the moduli of the eigenvalues (roots) should be less than one

roots(EST, modulus=TRUE) 
[1] 0.4299625 0.4299625

The model is stable since the roots lie inside the unit circle, that is, their moduli are less than one.

GRANGER CAUSALITY TEST

GRA<-causality(EST, cause = "Inflation")
GRA
$Granger

    Granger causality H0: Inflation do not Granger-cause unemployment

data:  VAR object EST
F-Test = 69.304, df1 = 1, df2 = 394, p-value = 0.000000000000001332


$Instant

    H0: No instantaneous causality between: Inflation and unemployment

data:  VAR object EST
Chi-squared = 0.38572, df = 1, p-value = 0.5346
GRA1<-causality(EST, cause = "unemployment")
GRA1
$Granger

    Granger causality H0: unemployment do not Granger-cause Inflation

data:  VAR object EST
F-Test = 1.372, df1 = 1, df2 = 394, p-value = 0.2422


$Instant

    H0: No instantaneous causality between: unemployment and Inflation

data:  VAR object EST
Chi-squared = 0.38572, df = 1, p-value = 0.5346

From the analysis, it was established that inflation Granger-causes unemployment; however, unemployment does not Granger-cause inflation.

Variance Decomposition and Impulse Response Function

IRF1<-irf(EST, impulse = "Inflation", response = "unemployment",
         n.ahead=20,boot = TRUE, run=100, ci=0.95)

IRF1

Impulse response coefficients
$Inflation
          unemployment
 [1,] -0.0431533864513
 [2,]  0.5356735487953
 [3,]  0.4640598613174
 [4,]  0.2960802457858
 [5,]  0.1662984086264
 [6,]  0.0868538042711
 [7,]  0.0432056980416
 [8,]  0.0207296508563
 [9,]  0.0096622621431
[10,]  0.0043943820861
[11,]  0.0019552162997
[12,]  0.0008523272496
[13,]  0.0003642305110
[14,]  0.0001525446726
[15,]  0.0000625448388
[16,]  0.0000250512251
[17,]  0.0000097665459
[18,]  0.0000036842516
[19,]  0.0000013313196
[20,]  0.0000004524105
[21,]  0.0000001390725


Lower Band, CI= 0.95 
$Inflation
         unemployment
 [1,] -0.184075991840
 [2,]  0.321066563874
 [3,]  0.315294143889
 [4,]  0.166563256552
 [5,]  0.058447095129
 [6,]  0.002749794944
 [7,] -0.013138211838
 [8,] -0.011395791645
 [9,] -0.009979606392
[10,] -0.007013775897
[11,] -0.004085913025
[12,] -0.001923095405
[13,] -0.000775530661
[14,] -0.000368941838
[15,] -0.000179202716
[16,] -0.000080672026
[17,] -0.000045792068
[18,] -0.000021790150
[19,] -0.000007270298
[20,] -0.000003185182
[21,] -0.000002275386


Upper Band, CI= 0.95 
$Inflation
      unemployment
 [1,] 0.0885639204
 [2,] 0.7114816015
 [3,] 0.6124293905
 [4,] 0.4025760799
 [5,] 0.2535375202
 [6,] 0.1595431549
 [7,] 0.1059810511
 [8,] 0.0714119748
 [9,] 0.0483910704
[10,] 0.0332510192
[11,] 0.0228477491
[12,] 0.0156285875
[13,] 0.0105281439
[14,] 0.0070995015
[15,] 0.0047940361
[16,] 0.0032382581
[17,] 0.0021387138
[18,] 0.0013962543
[19,] 0.0009116266
[20,] 0.0005953314
[21,] 0.0003893387

Inflation response to unemployment shock

plot(IRF1, ylab="Inflation",
     main="Inflation response to unemployment shock")
Inflation response to the unemployment shock

Interpretation of the impulse response function above

Usually, impulse response functions are interpreted along the lines of “a one standard deviation shock to x causes significant increases (decreases) in y for m periods (determined by how long the standard error bands stay above or below zero), after which the effect dissipates.” Given the way IRF1 was specified (impulse = Inflation, response = unemployment), the curve above traces the response of unemployment to a one standard deviation inflation shock: unemployment rises significantly for roughly the first three months, after which the effect dies out. The effect does not last long; it lasts for about three months.

Unemployment response to inflation shock

IRF2<-irf(EST, impulse = "unemployment", response = "Inflation",
         n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF2, ylab="unemployment",
     main="unemployment response to Inflation shock")
unemployment response to Inflation shock

Interpretation of the IRF graph above

From the graph above, a one standard deviation shock to unemployment (impulse = unemployment, response = Inflation) causes inflation to decrease for up to about two months before it starts to recover. However, the confidence band contains zero, which implies that the effect of the unemployment shock on inflation is not statistically significant. This conforms to the results obtained above, which indicated that inflation Granger-causes unemployment but unemployment does not Granger-cause inflation.

Effect of inflation on itself

IRF3<-irf(EST, impulse = "Inflation", response = "Inflation",
         n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF3, ylab="Inflation",
     main="Inflation response to Inflation shock")
Inflation response to Inflation shock

The response of unemployment to an unemployment shock

IRF4<-irf(EST, impulse = "unemployment", response = "unemployment",
         n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF4, ylab="unemployment",
     main="unemployment response to unemployment shock")
unemployment response to unemployment shock

Variance Decomposition

The variance decomposition indicates the amount of information each variable contributes to the other variables in the autoregression. It determines how much of the forecast error variance of each of the variables can be explained by exogenous shocks to the other variables.

VARD<-fevd(EST, n.ahead=20)
VARD
$Inflation
      Inflation unemployment
 [1,] 1.0000000  0.000000000
 [2,] 0.9969451  0.003054924
 [3,] 0.9947479  0.005252059
 [4,] 0.9938673  0.006132705
 [5,] 0.9935938  0.006406153
 [6,] 0.9935202  0.006479764
 [7,] 0.9935022  0.006497786
 [8,] 0.9934981  0.006501898
 [9,] 0.9934972  0.006502784
[10,] 0.9934970  0.006502966
[11,] 0.9934970  0.006503001
[12,] 0.9934970  0.006503008
[13,] 0.9934970  0.006503009
[14,] 0.9934970  0.006503010
[15,] 0.9934970  0.006503010
[16,] 0.9934970  0.006503010
[17,] 0.9934970  0.006503010
[18,] 0.9934970  0.006503010
[19,] 0.9934970  0.006503010
[20,] 0.9934970  0.006503010

$unemployment
        Inflation unemployment
 [1,] 0.001942077    0.9980579
 [2,] 0.182061322    0.8179387
 [3,] 0.265135962    0.7348640
 [4,] 0.293687415    0.7063126
 [5,] 0.302330557    0.6976694
 [6,] 0.304683598    0.6953164
 [7,] 0.305269961    0.6947300
 [8,] 0.305405969    0.6945940
 [9,] 0.305435712    0.6945643
[10,] 0.305441899    0.6945581
[11,] 0.305443130    0.6945569
[12,] 0.305443365    0.6945566
[13,] 0.305443408    0.6945566
[14,] 0.305443415    0.6945566
[15,] 0.305443416    0.6945566
[16,] 0.305443417    0.6945566
[17,] 0.305443417    0.6945566
[18,] 0.305443417    0.6945566
[19,] 0.305443417    0.6945566
[20,] 0.305443417    0.6945566
plot(VARD)
Variance decomposition function for inflation and unemployment

VECTOR AUTOREGRESSIVE MODEL (Inflation, Unemployment and Fed Rate)

Import the data set

my_var<-read.csv("C:\\Users\\user\\Downloads\\my_var.csv")
attach(my_var)

View the data set

head(my_var,15)
   Unemployment Inflation Fed.Rate
1         5.133     0.908    3.933
2         5.233     1.811    3.697
3         5.533     1.623    2.937
4         6.267     1.795    2.297
5         6.800     0.537    2.003
6         7.000     0.715    1.733
7         6.767     0.892    1.683
8         6.200     1.068    2.400
9         5.633     2.303    2.457
10        5.533     1.235    2.607
11        5.567     1.055    2.847
12        5.533     1.228    2.923
13        5.767     1.573    2.967
14        5.733     0.349    2.963
15        5.500     0.523    3.330

Declare the data as time series and plot

Inflation<-ts(my_var$Inflation,start=c(1960),frequency = 4)
plot.ts(Inflation,type="l",main="Time Series plot Inflation",xlab="Year",ylab="inflation")
Time series plot of inflation

Unemployment<-ts(my_var$Unemployment,start=c(1960),frequency = 4)
plot.ts(Unemployment,type="l",main="Time Series plot Unemployment",xlab="Year",ylab="Unemployment")
Time series plot of unemployment

Fed.Rate<-ts(my_var$Fed.Rate,start=c(1960),frequency = 4)
plot.ts(Fed.Rate,type="l",main="Time Series plot of FedRate",xlab="Year",ylab="FedRate")
Time series plot of FedRate

Unit Root Test (Inflation)

URT<-adf.test(my_var$Inflation)
URT

    Augmented Dickey-Fuller Test

data:  my_var$Inflation
Dickey-Fuller = -2.2805, Lag order = 5, p-value = 0.4593
alternative hypothesis: stationary

Unit Root Test (Unemployment)

URT1<-adf.test(my_var$Unemployment)
URT1

    Augmented Dickey-Fuller Test

data:  my_var$Unemployment
Dickey-Fuller = -2.0581, Lag order = 5, p-value = 0.5521
alternative hypothesis: stationary

Unit Root Test (FedRate)

URT2<-adf.test(my_var$Fed.Rate)
URT2

    Augmented Dickey-Fuller Test

data:  my_var$Fed.Rate
Dickey-Fuller = -3.3068, Lag order = 5, p-value = 0.07246
alternative hypothesis: stationary

Lag Length determination

lag<-VARselect(my_var, lag.max=4)
lag
$selection
AIC(n)  HQ(n)  SC(n) FPE(n) 
     3      3      2      3 

$criteria
                1           2           3           4
AIC(n) -2.6746095 -3.15420316 -3.28434359 -3.25963208
HQ(n)  -2.5809554 -2.99030845 -3.05020830 -2.95525620
SC(n)  -2.4439714 -2.75058659 -2.70774850 -2.51005846
FPE(n)  0.0689359  0.04267955  0.03748351  0.03844391

VAR Model Estimation

VAR_mdl<-VAR(my_var, p=3, type = "none")
VAR_mdl

VAR Estimation Results:
======================= 

Estimated coefficients for equation Unemployment: 
================================================= 
Call:
Unemployment = Unemployment.l1 + Inflation.l1 + Fed.Rate.l1 + Unemployment.l2 + Inflation.l2 + Fed.Rate.l2 + Unemployment.l3 + Inflation.l3 + Fed.Rate.l3 

Unemployment.l1    Inflation.l1     Fed.Rate.l1 Unemployment.l2    Inflation.l2 
    1.540461079     0.036092543     0.015635976    -0.567700896    -0.026670658 
    Fed.Rate.l2 Unemployment.l3    Inflation.l3     Fed.Rate.l3 
    0.034688923    -0.011735675    -0.003682298    -0.019977419 


Estimated coefficients for equation Inflation: 
============================================== 
Call:
Inflation = Unemployment.l1 + Inflation.l1 + Fed.Rate.l1 + Unemployment.l2 + Inflation.l2 + Fed.Rate.l2 + Unemployment.l3 + Inflation.l3 + Fed.Rate.l3 

Unemployment.l1    Inflation.l1     Fed.Rate.l1 Unemployment.l2    Inflation.l2 
    -1.01182762      0.68232932      0.14845919      1.56424639      0.09094890 
    Fed.Rate.l2 Unemployment.l3    Inflation.l3     Fed.Rate.l3 
    -0.06004687     -0.57197523      0.18422864     -0.05318817 


Estimated coefficients for equation Fed.Rate: 
============================================= 
Call:
Fed.Rate = Unemployment.l1 + Inflation.l1 + Fed.Rate.l1 + Unemployment.l2 + Inflation.l2 + Fed.Rate.l2 + Unemployment.l3 + Inflation.l3 + Fed.Rate.l3 

Unemployment.l1    Inflation.l1     Fed.Rate.l1 Unemployment.l2    Inflation.l2 
     -1.5314206       0.0541425       0.9752619       1.3715074       0.2532385 
    Fed.Rate.l2 Unemployment.l3    Inflation.l3     Fed.Rate.l3 
     -0.3649440       0.1113407      -0.1461168       0.3350045 

Use stargazer to organize the results

stargazer(VAR_mdl[["varresult"]], type='text')

====================================================================
                                        Dependent variable:         
                               -------------------------------------
                                                 y                  
                                    (1)         (2)         (3)     
--------------------------------------------------------------------
Unemployment.l1                  1.540***     -1.012**   -1.531***  
                                  (0.088)     (0.389)     (0.343)   
                                                                    
Inflation.l1                      0.036**     0.682***     0.054    
                                  (0.018)     (0.079)     (0.070)   
                                                                    
Fed.Rate.l1                        0.016       0.148      0.975***  
                                  (0.022)     (0.096)     (0.084)   
                                                                    
Unemployment.l2                  -0.568***    1.564**     1.372**   
                                  (0.147)     (0.646)     (0.570)   
                                                                    
Inflation.l2                      -0.027       0.091      0.253***  
                                  (0.022)     (0.096)     (0.085)   
                                                                    
Fed.Rate.l2                        0.035       -0.060    -0.365***  
                                  (0.030)     (0.133)     (0.117)   
                                                                    
Unemployment.l3                   -0.012       -0.572      0.111    
                                  (0.086)     (0.376)     (0.332)   
                                                                    
Inflation.l3                      -0.004      0.184**     -0.146**  
                                  (0.019)     (0.084)     (0.074)   
                                                                    
Fed.Rate.l3                       -0.020       -0.053     0.335***  
                                  (0.022)     (0.098)     (0.087)   
                                                                    
--------------------------------------------------------------------
Observations                        161         161         161     
R2                                 0.999       0.955       0.986    
Adjusted R2                        0.999       0.952       0.985    
Residual Std. Error (df = 152)     0.231       1.015       0.895    
F Statistic (df = 9; 152)      12,729.890*** 356.559*** 1,193.802***
====================================================================
Note:                                    *p<0.1; **p<0.05; ***p<0.01

Testing the stability of the model: the moduli of the eigenvalues (roots) should be less than one

roots(VAR_mdl, modulus=TRUE)
[1] 0.99513614 0.94079541 0.76174764 0.76174764 0.59992529 0.59992529 0.38418659
[8] 0.38418659 0.08928642

The model is stable since all the roots lie inside the unit circle (their moduli are less than one), although the largest root (0.995) is very close to the boundary.

GRANGER CAUSALITY TEST

GrI<-causality(VAR_mdl, cause = "Inflation")
GrI
$Granger

    Granger causality H0: Inflation do not Granger-cause Unemployment
    Fed.Rate

data:  VAR object VAR_mdl
F-Test = 6.0224, df1 = 6, df2 = 456, p-value = 0.000004455


$Instant

    H0: No instantaneous causality between: Inflation and Unemployment
    Fed.Rate

data:  VAR object VAR_mdl
Chi-squared = 4.219, df = 2, p-value = 0.1213
GrU<-causality(VAR_mdl, cause = "Unemployment")
GrU
$Granger

    Granger causality H0: Unemployment do not Granger-cause Inflation
    Fed.Rate

data:  VAR object VAR_mdl
F-Test = 5.6633, df1 = 6, df2 = 456, p-value = 0.00001088


$Instant

    H0: No instantaneous causality between: Unemployment and Inflation
    Fed.Rate

data:  VAR object VAR_mdl
Chi-squared = 23.73, df = 2, p-value = 0.000007031
GrF<-causality(VAR_mdl, cause = "Fed.Rate")
GrF
$Granger

    Granger causality H0: Fed.Rate do not Granger-cause Unemployment
    Inflation

data:  VAR object VAR_mdl
F-Test = 3.2328, df1 = 6, df2 = 456, p-value = 0.004033


$Instant

    H0: No instantaneous causality between: Fed.Rate and Unemployment
    Inflation

data:  VAR object VAR_mdl
Chi-squared = 26.054, df = 2, p-value = 0.0000022

From the analysis, it was established that inflation, unemployment and the fed rate all Granger-cause one another.

Variance Decomposition and Impulse Response Function

IRF_1<-irf(VAR_mdl, impulse = "Inflation", response = "Unemployment",
         n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF_1, ylab="Inflation",
     main="Inflation response to Unemployment shock")
Inflation response to unemployment shock

Variance Decomposition (Cholesky Decomposition)

VARD_1<-fevd(VAR_mdl, n.ahead=5)
VARD_1
$Unemployment
     Unemployment   Inflation     Fed.Rate
[1,]    1.0000000 0.000000000 0.0000000000
[2,]    0.9906356 0.008466579 0.0008978494
[3,]    0.9689308 0.017920429 0.0131487360
[4,]    0.9348733 0.028723840 0.0364029017
[5,]    0.8903648 0.046527944 0.0631072218

$Inflation
     Unemployment Inflation    Fed.Rate
[1,]  0.001789413 0.9982106 0.000000000
[2,]  0.062362243 0.9290021 0.008635625
[3,]  0.092268984 0.8917762 0.015954800
[4,]  0.100141867 0.8864991 0.013359037
[5,]  0.105821679 0.8826761 0.011502228

$Fed.Rate
     Unemployment  Inflation  Fed.Rate
[1,]    0.1722401 0.02083114 0.8069287
[2,]    0.3326729 0.02515215 0.6421749
[3,]    0.4415973 0.06171751 0.4966851
[4,]    0.5042660 0.07830998 0.4174240
[5,]    0.5349494 0.08422492 0.3808256

Plot the variance decomposition

plot(VARD_1)
Variance decomposition function

GARCH MODEL

Generalized Autoregressive Conditional Heteroskedasticity (GARCH) is a statistical model used in analyzing time-series data where the error variance is believed to be serially autocorrelated. GARCH models assume that the variance of the error term follows an autoregressive moving average process.

Understanding Generalized AutoRegressive Conditional Heteroskedasticity (GARCH)

Although GARCH models can be used in the analysis of a number of different types of financial data, such as macroeconomic data, financial institutions typically use them to estimate the volatility of returns for stocks, bonds, and market indices. They use the resulting information to help determine pricing and judge which assets will potentially provide higher returns, as well as to forecast the returns of current investments to help in their asset allocation, hedging, risk management, and portfolio optimization decisions.

GARCH models are used when the variance of the error term is not constant. That is, the error term is heteroskedastic. Heteroskedasticity describes the irregular pattern of variation of an error term, or variable, in a statistical model. Essentially, wherever there is heteroskedasticity, observations do not conform to a linear pattern. Instead, they tend to cluster. Therefore, if statistical models that assume constant variance are used on this data, then the conclusions and predictive value one can draw from the model will not be reliable.

The variance of the error term in GARCH models is assumed to vary systematically, conditional on the average size of the error terms in previous periods. In other words, it has conditional heteroskedasticity, and the reason for the heteroskedasticity is that the error term is following an auto-regressive moving average pattern. This means that it is a function of an average of its own past values.

History of GARCH

GARCH was developed in 1986 by Dr. Tim Bollerslev, a doctoral student at the time, as a way to address the problem of forecasting volatility in asset prices. It built on economist Robert Engle’s breakthrough 1982 work introducing the Autoregressive Conditional Heteroskedasticity (ARCH) model. His model assumed that the variation of financial returns was not constant over time but autocorrelated, or conditional on/dependent on its own past. For instance, one can see this in stock returns, where periods of volatility tend to be clustered together.

Since the original introduction, many variations of GARCH have emerged. These include Nonlinear (NGARCH), which addresses correlation and observed “volatility clustering” of returns, and Integrated GARCH (IGARCH), which restricts the volatility parameter. All the GARCH model variations seek to incorporate the direction, positive or negative, of returns in addition to the magnitude (addressed in the original model).

Each derivation of GARCH can be used to accommodate the specific qualities of the stock, industry, or economic data. When assessing risk, financial institutions incorporate GARCH models into their Value-at-Risk (VaR) estimates, the maximum expected loss (whether for a single investment or trading position, a portfolio, or at a division or firm-wide level) over a specified time period. GARCH models are viewed as providing better gauges of risk than can be obtained through tracking standard deviation alone.

Various studies have been conducted on the reliability of various GARCH models during different market conditions, including during the periods leading up to and after the Great Recession.

Practical example using APPLE stock prices

We are going to download the Apple stock price data (ticker AAPL) from the internet using getSymbols() from the quantmod package, with the command below.

apple<-getSymbols("AAPL",
           from="2007-01-01",
           to="2023-03-10")
plot(AAPL$AAPL.Close["2020"])
The plot of Returns for 2020

View the first few observations of the data set

head(AAPL,10)
           AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2007-01-03  3.081786  3.092143 2.925000   2.992857  1238319600      2.547275
2007-01-04  3.001786  3.069643 2.993571   3.059286   847260400      2.603815
2007-01-05  3.063214  3.078571 3.014286   3.037500   834741600      2.585272
2007-01-08  3.070000  3.090357 3.045714   3.052500   797106800      2.598039
2007-01-09  3.087500  3.320714 3.041071   3.306071  3349298400      2.813858
2007-01-10  3.383929  3.492857 3.337500   3.464286  2952880000      2.948517
2007-01-11  3.426429  3.456429 3.396429   3.421429  1440252800      2.912041
2007-01-12  3.378214  3.395000 3.329643   3.379286  1312690400      2.876172
2007-01-16  3.417143  3.473214 3.408929   3.467857  1244076400      2.951556
2007-01-17  3.484286  3.485714 3.386429   3.391071  1646260000      2.886203

View the last few observations of the data set

tail(AAPL,10)
           AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2023-02-24    147.11    147.19   145.72     146.71    55469600        146.71
2023-02-27    147.71    149.17   147.45     147.92    44998500        147.92
2023-02-28    147.05    149.08   146.83     147.41    50547000        147.41
2023-03-01    146.83    147.23   145.01     145.31    55479000        145.31
2023-03-02    144.38    146.71   143.90     145.91    52238100        145.91
2023-03-03    148.04    151.11   147.33     151.03    70732300        151.03
2023-03-06    153.79    156.30   153.46     153.83    87558000        153.83
2023-03-07    153.70    154.03   151.13     151.60    56182000        151.60
2023-03-08    152.81    153.47   151.83     152.87    47204800        152.87
2023-03-09    153.56    154.54   150.23     150.59    53833600        150.59

Plot the general series

chartSeries(AAPL)
Chartseries for returns

Plot the series/chart for a specific year (2020)

chartSeries(AAPL["2020-12"])
Chart series showing returns for 2020/12

Note that the yellow candlesticks indicate days on which AAPL had a lower closing price than opening price. On the other hand, the green candlesticks show days on which the company had a higher closing price than opening price.

chartSeries(AAPL["2008-12"])
Chart series showing returns for 2008/12

Daily Returns

NOTE: It is difficult to work with the raw price data as it is, so what we usually do is convert the series to daily returns, as shown below.

Returns<-CalculateReturns(AAPL$AAPL.Close)
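For reference (our annotation), CalculateReturns() from the PerformanceAnalytics package computes, with its default discrete method, simple daily returns from the closing prices:

\[ r_{t} = \frac{P_{t} - P_{t-1}}{P_{t-1}} \]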

View the returns

head(Returns,10)
             AAPL.Close
2007-01-03           NA
2007-01-04  0.022195848
2007-01-05 -0.007121269
2007-01-08  0.004938272
2007-01-09  0.083069943
2007-01-10  0.047855899
2007-01-11 -0.012371092
2007-01-12 -0.012317368
2007-01-16  0.026209975
2007-01-17 -0.022142205

Removing the missing value from the returns

Returns<-Returns[-1]

View the new returns

head(Returns,10)
             AAPL.Close
2007-01-04  0.022195848
2007-01-05 -0.007121269
2007-01-08  0.004938272
2007-01-09  0.083069943
2007-01-10  0.047855899
2007-01-11 -0.012371092
2007-01-12 -0.012317368
2007-01-16  0.026209975
2007-01-17 -0.022142205
2007-01-18 -0.061927338

Plot the histogram of our returns

hist(Returns, breaks = 15)
Histogram for the returns

Histogram superimposed with a normal curve

chart.Histogram(Returns,
                methods = c('add.density','add.normal'),
                colorset = c('blue','green','red'))
Histogram showing the distribution of returns superimposed with the normal curve

Seasonality in the series

chartSeries(Returns,theme = 'white')
Volatility chart series

The graph above shows that there is volatility in the series; however, the chart shows no seasonality. High volatility is seen in 2008, specifically after the stock market crash in the United States caused by the fall in housing prices. Similarly, the graph shows other periods with higher variability in the returns.

Annualized Volatility.

We can capture volatility in the market with the help of standard deviation. First, we calculate the standard deviation of our series as shown below

sd(Returns)
[1] 0.02034422

The returns’ standard deviation is approximately 2.03%. Now assume there are 252 trading days in a year. To annualize the volatility, this daily standard deviation is multiplied by the square root of 252.
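In formula form (our annotation), the annualized volatility is:

\[ \sigma_{annual} = \sqrt{252} \times \sigma_{daily} \approx \sqrt{252} \times 0.0203 \approx 0.32 \]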

sqrt(252)*sd(Returns)
[1] 0.3229545

Generally speaking, the annualized volatility of the returns is approximately 32.30% per year.

Annualized volatility for a specific year, say 2008

sqrt(252)*sd(Returns["2008"])
[1] 0.5820481

Volatility in returns for Apple in the year 2008 was approximately 58.20%. Volatility measures how quickly returns from an investment change due to factors such as economic or political shocks. Compare the volatility across the years 2008, 2015 and 2020: in which year did AAPL experience the highest volatility in returns?

sqrt(252)*sd(Returns["2015"])
[1] 0.2673398

Volatility in returns for Apple in the year 2015 was approximately 26.73%.

sqrt(252)*sd(Returns["2020"])
[1] 0.4665857

Volatility in returns for Apple in the year 2020 was approximately 46.66%.

We can also make a nice-looking plot of the rolling volatility of returns from 2007 to 2023, computed month-wise, assuming 22 monthly trading days and 252 annual trading days.

Monthly

chart.RollingPerformance(R=Returns["2007::2023"],
                         width = 22,
                         FUN = "sd.annualized",
                         scale=252,
                         main = "Apple's monthly rolling volatility")
Apple's monthly rolling volatility

Weekly

chart.RollingPerformance(R=Returns["2007::2023"],
                         width = 6,
                         FUN = "sd.annualized",
                         scale=252,
                         main = "Apple's monthly rolling volatility")
Apple's weekly rolling volatility

Annually

chart.RollingPerformance(R=Returns["2007::2023"],
                         width = 252,
                         FUN = "sd.annualized",
                         scale=252,
                         main = "Apple's monthly rolling volatility")
Apple's Annual rolling volatility

GARCH MODEL: Standard Model with interpretation

1. GARCH model with a constant mean

Before we fit the model, we first create the model specification and store it. This is done as shown below.

s<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
              variance.model = list(model="sGARCH"),
              distribution.model = 'norm')
m<- ugarchfit(data=Returns,spec = s)
m

*---------------------------------*
*          GARCH Model Fit        *
*---------------------------------*

Conditional Variance Dynamics   
-----------------------------------
GARCH Model : sGARCH(1,1)
Mean Model  : ARFIMA(0,0,0)
Distribution    : norm 

Optimal Parameters
------------------------------------
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001939    0.000250   7.7438        0
omega   0.000014    0.000001   9.5552        0
alpha1  0.105801    0.001912  55.3270        0
beta1   0.860030    0.008249 104.2613        0

Robust Standard Errors:
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001939    0.000328   5.9202 0.000000
omega   0.000014    0.000004   3.3442 0.000825
alpha1  0.105801    0.017987   5.8823 0.000000
beta1   0.860030    0.013021  66.0519 0.000000

LogLikelihood : 10554.42 

Information Criteria
------------------------------------
                    
Akaike       -5.1807
Bayes        -5.1745
Shibata      -5.1807
Hannan-Quinn -5.1785

Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
                        statistic p-value
Lag[1]                     0.9189  0.3378
Lag[2*(p+q)+(p+q)-1][2]    1.4893  0.3635
Lag[4*(p+q)+(p+q)-1][5]    4.3042  0.2185
d.o.f=0
H0 : No serial correlation

Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
                        statistic p-value
Lag[1]                     0.1366  0.7117
Lag[2*(p+q)+(p+q)-1][5]    1.3903  0.7668
Lag[4*(p+q)+(p+q)-1][9]    2.6138  0.8206
d.o.f=2

Weighted ARCH LM Tests
------------------------------------
            Statistic Shape Scale P-Value
ARCH Lag[3]    0.6404 0.500 2.000  0.4236
ARCH Lag[5]    2.3106 1.440 1.667  0.4066
ARCH Lag[7]    2.4778 2.315 1.543  0.6170

Nyblom stability test
------------------------------------
Joint Statistic:  15.8903
Individual Statistics:             
mu     0.1630
omega  3.2265
alpha1 0.3831
beta1  0.4430

Asymptotic Critical Values (10% 5% 1%)
Joint Statistic:         1.07 1.24 1.6
Individual Statistic:    0.35 0.47 0.75

Sign Bias Test
------------------------------------
                   t-value    prob sig
Sign Bias           0.7662 0.44359    
Negative Sign Bias  1.4747 0.14036    
Positive Sign Bias  0.8164 0.41431    
Joint Effect        9.8021 0.02033  **


Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
  group statistic              p-value(g-1)
1    20     136.5 0.00000000000000000008531
2    30     153.3 0.00000000000000000073860
3    40     172.2 0.00000000000000000116447
4    50     171.7 0.00000000000000156776602


Elapsed time : 0.66782 

Model interpretation

From the results above, the standard GARCH model is sGARCH(1,1). Remember we are using a constant-mean model, ARFIMA(0,0,0), with a normal distribution for the errors. The model has four parameters: mu, omega, alpha1 and beta1. From the estimates and p-values above, all four parameters are statistically significant.

We can write the mean equation in RMarkdown as follows: \[ Returns_{t}= \mu + e_{t} \] Substituting the estimate of mu, the mean equation becomes \[ Returns_{t}= 0.001939 + e_{t} \] Remember the mean is constant. On the other hand, the equation for the conditional variance at time t is given as follows: \[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 + \beta \sigma_{t-1}^2 \]

If we replace the value from the output above into our equation, the new equation will be as shown below.

\[ \sigma_{t}^2 = 0.000014 + 0.105801 e_{t-1}^2 + 0.860030 \sigma_{t-1}^2 \]

The model above is our GARCH variance model. Remember that if we do not have the beta term in the model, the entire model reduces to an ARCH(1) model, which is basically as shown below: \[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 \]

Model Output

Information criteria

This model also provides a lot of other useful output. One of the most useful pieces of information is the set of information criteria, which includes the Akaike, Bayes, Hannan-Quinn and Shibata information criteria. The information criteria help us choose the best model without necessarily choosing a very complex one: among models that fit comparably well, the simpler model has a lower information criterion. Therefore, when comparing candidate models for the returns, we shall choose the model with the lowest information criterion values.

Ljung Box Test

Another important information we have is the Ljung-Box test on standardized residuals. The null hypothesis in this case is that there is no serial correlation. From the output above, there is no sufficient evidence to reject the null hypothesis. As a result, we can conclude that there is no serial correlation between the standardized residuals. Besides, there is no serial correlation in the standardized squared residuals, and they more or less behave like the white noise process. This also justifies that our GARCH model is valid.

Adjusted Pearson Goodness of Fit test

One thing we can notice is that the p-values are all less than 0.05, indicating that the normal distribution we assumed for the residuals is not really a good choice. If the distribution were a good fit, the p-values for the goodness-of-fit test would be greater than 0.05. Thus, there is scope to improve our model.

The Twelve (12) Plots

s<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
              variance.model = list(model="sGARCH"),
              distribution.model = 'norm')
m<- ugarchfit(data=Returns,spec = s)
m

*---------------------------------*
*          GARCH Model Fit        *
*---------------------------------*

Conditional Variance Dynamics   
-----------------------------------
GARCH Model : sGARCH(1,1)
Mean Model  : ARFIMA(0,0,0)
Distribution    : norm 

Optimal Parameters
------------------------------------
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001939    0.000250   7.7438        0
omega   0.000014    0.000001   9.5552        0
alpha1  0.105801    0.001912  55.3270        0
beta1   0.860030    0.008249 104.2613        0

Robust Standard Errors:
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001939    0.000328   5.9202 0.000000
omega   0.000014    0.000004   3.3442 0.000825
alpha1  0.105801    0.017987   5.8823 0.000000
beta1   0.860030    0.013021  66.0519 0.000000

LogLikelihood : 10554.42 

Information Criteria
------------------------------------
                    
Akaike       -5.1807
Bayes        -5.1745
Shibata      -5.1807
Hannan-Quinn -5.1785

Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
                        statistic p-value
Lag[1]                     0.9189  0.3378
Lag[2*(p+q)+(p+q)-1][2]    1.4893  0.3635
Lag[4*(p+q)+(p+q)-1][5]    4.3042  0.2185
d.o.f=0
H0 : No serial correlation

Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
                        statistic p-value
Lag[1]                     0.1366  0.7117
Lag[2*(p+q)+(p+q)-1][5]    1.3903  0.7668
Lag[4*(p+q)+(p+q)-1][9]    2.6138  0.8206
d.o.f=2

Weighted ARCH LM Tests
------------------------------------
            Statistic Shape Scale P-Value
ARCH Lag[3]    0.6404 0.500 2.000  0.4236
ARCH Lag[5]    2.3106 1.440 1.667  0.4066
ARCH Lag[7]    2.4778 2.315 1.543  0.6170

Nyblom stability test
------------------------------------
Joint Statistic:  15.8903
Individual Statistics:             
mu     0.1630
omega  3.2265
alpha1 0.3831
beta1  0.4430

Asymptotic Critical Values (10% 5% 1%)
Joint Statistic:         1.07 1.24 1.6
Individual Statistic:    0.35 0.47 0.75

Sign Bias Test
------------------------------------
                   t-value    prob sig
Sign Bias           0.7662 0.44359    
Negative Sign Bias  1.4747 0.14036    
Positive Sign Bias  0.8164 0.41431    
Joint Effect        9.8021 0.02033  **


Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
  group statistic              p-value(g-1)
1    20     136.5 0.00000000000000000008531
2    30     153.3 0.00000000000000000073860
3    40     172.2 0.00000000000000000116447
4    50     171.7 0.00000000000000156776602


Elapsed time : 0.8176961 
plot(m,which ='all')

please wait...calculating quantiles...
The 12 plots for the distribution of residuals

Note: The command above produces 12 different plots.

The first plot shows the return series with 2 conditional standard deviations superimposed; the red lines are drawn at 2 standard deviations and enclose roughly 95% of the returns. The second plot shows the series with 1% value-at-risk limits, so the red lines are drawn at the 1% level. The third plot shows the conditional standard deviation against the absolute returns; the blue line is the conditional standard deviation. The fourth plot shows the autocorrelation of the observations. The fifth plot shows significant autocorrelation in the squared observations, with the bars above the red line, and the sixth plot shows a similar pattern. The seventh plot shows the cross-correlation between the squared and the actual observations. The eighth plot shows the histogram of the standardized residuals, whose tails deviate from the normal distribution; this is confirmed by the normal Q-Q plot, where the standardized residuals deviate from the normal distribution in both the lower and the upper tail. The remaining plots show the autocorrelation of the standardized residuals and of the squared standardized residuals, where the residuals have improved markedly and the values fall within the red bands.
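If you prefer to inspect a single panel rather than all twelve at once, plot() also accepts the panel number (panel numbering as used by rugarch, where 9 is the Q-Q plot of the standardized residuals and 11 the ACF of their squares).

plot(m, which = 9)    # Q-Q plot of standardized residuals
plot(m, which = 11)   # ACF of squared standardized residuals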

FORECAST

We shall now use this model to make our forecast, storing the result in the object f.

f<-ugarchforecast(fitORspec = m,
                  n.ahead = 20)

Plot the forecast

plot(fitted(f))
Fitted plot

Remember this model was estimated with a constant mean, so the forecast of the mean is constant. We can also plot the forecast volatility using the command below.

plot(sigma(f))
Variability plot

The graph shows that the volatility of the returns is expected to increase over the next 20 forecast days.

VARIOUS VARIANTS OF GARCH MODEL

Application example of portfolio allocation.

The portfolio allocation example will guide what percentage of the portfolio should go to risky assets, such as stocks, and what percentage should go to risk-free assets. This will be followed by four different variants:

* Model-2: GARCH with sstd
* Model-3: GJR-GARCH
* Model-4: AR(1) GJR-GARCH
* Model-5: GJR-GARCH in mean

Application example; Portfolio Allocation

Let us now calculate the annualized volatility using the code below and store the results in V; we shall assume that there are 252 trading days in a year.

V<- sqrt(252) * sigma(m)

We shall also create an object W which represents the weight assigned to the risky assets. In this case, we shall use a volatility target of 5%, that is, 0.05.

W<-0.05/V

Merge the two and plot

plot(merge(V,W),
     multi.panel=T)
Weight assigned to risky assets

The upper panel shows the volatility and the lower panel shows the weight we would assign to the risky assets. Notice that when volatility is very high, as in 2008 and 2020, the weight assigned to risky assets is very low, whereas when volatility is low the weight assigned to risky assets is high. In other words, we can increase the allocation to risky assets when volatility is low. View the weight assigned to the risky assets at the tail end.

tail(W,20)
                [,1]
2023-02-09 0.1640393
2023-02-10 0.1710039
2023-02-13 0.1801228
2023-02-14 0.1797704
2023-02-15 0.1875874
2023-02-16 0.1911529
2023-02-17 0.1940037
2023-02-21 0.1991621
2023-02-22 0.1772824
2023-02-23 0.1864209
2023-02-24 0.1954879
2023-02-27 0.1884858
2023-02-28 0.1959915
2023-03-01 0.2037025
2023-03-02 0.2003178
2023-03-03 0.2090398
2023-03-06 0.1746004
2023-03-07 0.1752742
2023-03-08 0.1760180
2023-03-09 0.1837988

Let us assume that an investor holds 2 million dollars ($2,000,000) of AAPL on the trading day 2022-08-12, when the weight assigned to risky assets was approximately 18.3%. In this case the model suggests allocating the following amount to the risky assets. \[ Amount_{riskyAssets}= 2000000 \times 0.1831701 = 366340.2 \]

The calculation above shows that the investor should allocate $366340.2 to the risky assets, as suggested by the model, and the following amount to risk-free assets (which include, but are not limited to, bank accounts):

\[ Amount_{riskFree}=2000000-366340.2= 1633659.8 \]
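As a small sketch of the same arithmetic in R (the helper name allocate and its inputs are purely illustrative):

# Split a portfolio value into risky and risk-free parts given a weight from W
allocate <- function(portfolio_value, weight) {
  risky     <- portfolio_value * weight
  risk_free <- portfolio_value - risky
  c(risky = risky, risk_free = risk_free)
}
allocate(2000000, 0.1831701)
#     risky risk_free
#  366340.2 1633659.8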

Suppose we increase the volatility target for the risky assets to 10%, that is, 0.1. Consider the graph below.

W_1<-0.1/V

Merge the two and plot

plot(merge(V,W_1),
     multi.panel=T)
Weight assigned to risky assets

As before, the upper panel shows the volatility and the lower panel shows the weight assigned to the risky assets: the weight is low when volatility is high, as in 2008 and 2020, and high when volatility is low. With the 10% target, however, the weight assigned to the risky assets is twice that of the previous graph. View the weights at the tail end.

tail(W_1,20)
                [,1]
2023-02-09 0.3280787
2023-02-10 0.3420079
2023-02-13 0.3602457
2023-02-14 0.3595409
2023-02-15 0.3751747
2023-02-16 0.3823057
2023-02-17 0.3880073
2023-02-21 0.3983243
2023-02-22 0.3545649
2023-02-23 0.3728417
2023-02-24 0.3909758
2023-02-27 0.3769716
2023-02-28 0.3919831
2023-03-01 0.4074049
2023-03-02 0.4006355
2023-03-03 0.4180797
2023-03-06 0.3492009
2023-03-07 0.3505484
2023-03-08 0.3520361
2023-03-09 0.3675977

Again, let us assume that the investor holds 2 million dollars ($2,000,000) of AAPL on the trading day 2022-08-12, when the weight assigned to risky assets was approximately 36.6%. In this case the model suggests allocating the following amount to the risky assets.

2000000*0.3663402
[1] 732680.4

According to this model, the investor should therefore allocate $732680.4 to the risky assets and the remaining amount to risk-free assets. This allocation only applies if the investor accepts 10% as the volatility target for the risky assets.

Model-2: GARCH Model with skewed student’s t-distribution

In the previous model we saw that the normal distribution is not a good distribution for the residuals. Let us now use the skewed Student's t-distribution to capture the distribution of the residuals more accurately. We shall edit the specification to use the skewed Student's t-distribution (sstd).

S<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
              variance.model = list(model="sGARCH"),
              distribution.model = 'sstd')
M<- ugarchfit(data=Returns,spec = S)
M

*---------------------------------*
*          GARCH Model Fit        *
*---------------------------------*

Conditional Variance Dynamics   
-----------------------------------
GARCH Model : sGARCH(1,1)
Mean Model  : ARFIMA(0,0,0)
Distribution    : sstd 

Optimal Parameters
------------------------------------
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001636    0.000253   6.4593 0.000000
omega   0.000007    0.000003   2.3531 0.018617
alpha1  0.095489    0.011945   7.9939 0.000000
beta1   0.891651    0.012970  68.7478 0.000000
skew    1.009803    0.021639  46.6649 0.000000
shape   5.103620    0.450734  11.3229 0.000000

Robust Standard Errors:
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001636    0.000255   6.4029  0.00000
omega   0.000007    0.000007   1.0254  0.30519
alpha1  0.095489    0.017466   5.4672  0.00000
beta1   0.891651    0.019261  46.2933  0.00000
skew    1.009803    0.020503  49.2508  0.00000
shape   5.103620    0.686130   7.4383  0.00000

LogLikelihood : 10711.67 

Information Criteria
------------------------------------
                    
Akaike       -5.2569
Bayes        -5.2476
Shibata      -5.2569
Hannan-Quinn -5.2536

Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
                        statistic p-value
Lag[1]                     0.8989  0.3431
Lag[2*(p+q)+(p+q)-1][2]    1.4222  0.3794
Lag[4*(p+q)+(p+q)-1][5]    4.4234  0.2058
d.o.f=0
H0 : No serial correlation

Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
                        statistic p-value
Lag[1]                   0.004239  0.9481
Lag[2*(p+q)+(p+q)-1][5]  1.043397  0.8498
Lag[4*(p+q)+(p+q)-1][9]  2.107863  0.8917
d.o.f=2

Weighted ARCH LM Tests
------------------------------------
            Statistic Shape Scale P-Value
ARCH Lag[3]    0.5377 0.500 2.000  0.4634
ARCH Lag[5]    1.9446 1.440 1.667  0.4836
ARCH Lag[7]    2.2279 2.315 1.543  0.6691

Nyblom stability test
------------------------------------
Joint Statistic:  3.7977
Individual Statistics:              
mu     0.19072
omega  0.74030
alpha1 0.74349
beta1  0.86433
skew   0.04262
shape  1.03582

Asymptotic Critical Values (10% 5% 1%)
Joint Statistic:         1.49 1.68 2.12
Individual Statistic:    0.35 0.47 0.75

Sign Bias Test
------------------------------------
                   t-value    prob sig
Sign Bias           1.0011 0.31684    
Negative Sign Bias  1.2064 0.22773    
Positive Sign Bias  0.8987 0.36885    
Joint Effect       10.3737 0.01564  **


Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
  group statistic p-value(g-1)
1    20     27.25      0.09906
2    30     41.89      0.05744
3    40     49.85      0.11431
4    50     53.92      0.29176


Elapsed time : 1.565172 

Having fitted the model with the skewed Student's t-distribution, let us now produce the 12 plots and examine the distribution of the residuals.

plot(M,which ='all')

please wait...calculating quantiles...
The 12 plots for residual distribution

The plots above show that the model has improved compared to the previous one, as indicated by the distribution of the residuals: in the sstd Q-Q plot the standardized residuals align reasonably well with the diagonal line. The information criteria also improved. Additionally, the skew estimate of about 1.01 is very close to 1, indicating that the distribution is close to symmetric.

V_1<- sqrt(252) * sigma(M)

We shall also create an object W_2 which represents the weight assigned to the risky assets, again using a volatility target of 5%, that is, 0.05.

W_2<-0.05/V

Merge the two and plot

plot(merge(V_1,W_2),
     multi.panel=T)
Weight assigned to risky and non-risky assets

Again the upper panel shows the volatility and the lower panel the weight assigned to the risky assets: low weights when volatility is high (as in 2008 and 2020) and high weights when volatility is low. View the weights at the tail end.

tail(W_2,20)
                [,1]
2023-02-09 0.1640393
2023-02-10 0.1710039
2023-02-13 0.1801228
2023-02-14 0.1797704
2023-02-15 0.1875874
2023-02-16 0.1911529
2023-02-17 0.1940037
2023-02-21 0.1991621
2023-02-22 0.1772824
2023-02-23 0.1864209
2023-02-24 0.1954879
2023-02-27 0.1884858
2023-02-28 0.1959915
2023-03-01 0.2037025
2023-03-02 0.2003178
2023-03-03 0.2090398
2023-03-06 0.1746004
2023-03-07 0.1752742
2023-03-08 0.1760180
2023-03-09 0.1837988

Similarly, if we assume the investor holds 2 million dollars ($2,000,000) of AAPL on the trading day 2022-08-12, when the weight assigned to risky assets was approximately 18.3%, the model suggests allocating the following amount to the risky assets and the remainder to risk-free assets.

2000000*0.1831701
[1] 366340.2

Model-3: GJR-GARCH

We are going to copy the model we used above and modify it so as to estimate the GJR-GARCH model, developed by Glosten, Jagannathan and Runkle. Consider the results below.

GJR<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
              variance.model = list(model="gjrGARCH"),
              distribution.model = 'sstd')
M_1<- ugarchfit(data=Returns,spec = GJR)
M_1

*---------------------------------*
*          GARCH Model Fit        *
*---------------------------------*

Conditional Variance Dynamics   
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model  : ARFIMA(0,0,0)
Distribution    : sstd 

Optimal Parameters
------------------------------------
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001423    0.000244   5.8404        0
omega   0.000010    0.000001   8.0677        0
alpha1  0.037218    0.001654  22.5009        0
beta1   0.874450    0.009426  92.7727        0
gamma1  0.137875    0.019465   7.0831        0
skew    1.004774    0.021448  46.8466        0
shape   5.434271    0.393137  13.8228        0

Robust Standard Errors:
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001423    0.000267   5.3386  0.00000
omega   0.000010    0.000002   5.4622  0.00000
alpha1  0.037218    0.010454   3.5603  0.00037
beta1   0.874450    0.010689  81.8099  0.00000
gamma1  0.137875    0.024481   5.6319  0.00000
skew    1.004774    0.020432  49.1765  0.00000
shape   5.434271    0.495804  10.9605  0.00000

LogLikelihood : 10740.52 

Information Criteria
------------------------------------
                    
Akaike       -5.2706
Bayes        -5.2597
Shibata      -5.2706
Hannan-Quinn -5.2667

Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
                        statistic p-value
Lag[1]                      1.885  0.1698
Lag[2*(p+q)+(p+q)-1][2]     2.319  0.2150
Lag[4*(p+q)+(p+q)-1][5]     5.179  0.1394
d.o.f=0
H0 : No serial correlation

Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
                        statistic p-value
Lag[1]                     0.1851  0.6671
Lag[2*(p+q)+(p+q)-1][5]    1.3287  0.7819
Lag[4*(p+q)+(p+q)-1][9]    2.3390  0.8609
d.o.f=2

Weighted ARCH LM Tests
------------------------------------
            Statistic Shape Scale P-Value
ARCH Lag[3]    0.7895 0.500 2.000  0.3743
ARCH Lag[5]    2.1081 1.440 1.667  0.4478
ARCH Lag[7]    2.3698 2.315 1.543  0.6394

Nyblom stability test
------------------------------------
Joint Statistic:  21.7937
Individual Statistics:              
mu     0.49535
omega  5.11751
alpha1 1.44995
beta1  1.43854
gamma1 0.97126
skew   0.03061
shape  1.36371

Asymptotic Critical Values (10% 5% 1%)
Joint Statistic:         1.69 1.9 2.35
Individual Statistic:    0.35 0.47 0.75

Sign Bias Test
------------------------------------
                   t-value   prob sig
Sign Bias           1.2770 0.2017    
Negative Sign Bias  0.5576 0.5771    
Positive Sign Bias  0.1498 0.8809    
Joint Effect        2.6469 0.4493    


Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
  group statistic p-value(g-1)
1    20     23.23       0.2274
2    30     34.73       0.2135
3    40     40.62       0.3990
4    50     48.86       0.4786


Elapsed time : 2.870686 

We notice a very big change in the last of the 12 plots, as shown below;

plot(M_1,which ='all')

please wait...calculating quantiles...

Plot number 12, the news-impact curve, changed drastically because of the GJR-GARCH specification. According to this model, positive news (positive shocks) increases the conditional variance only gradually, whereas negative news increases it much more strongly. In other words, the impact of news on the stock's volatility is not symmetric: the impact of negative news is larger than that of positive news.

The model parameters are all significant. The mean equation of this model is written as follows:

\[ Returns_{t} = \mu + e_{t} \]

When we plug in the estimated value of mu, the mean equation becomes:

\[ Returns_{t} = 0.001423 + e_{t} \]

Remember that in this model e_t has mean zero and conditional variance sigma_t^2, with the standardized innovations following the skewed Student's t-distribution. For sigma_t^2, however, we have two equations, one for negative and one for positive news. When the news is negative, that is e_{t-1} < 0, the variance equation is:

\[ \sigma_{t}^2 = \omega + (\alpha + \gamma)e_{t-1}^2 + \beta\sigma_{t-1}^2 \]

Plugging in the estimates, this becomes:

\[ \sigma_{t}^2 = 0.000010 + (0.037218 + 0.137875)e_{t-1}^2 + 0.874450\sigma_{t-1}^2 \]

When the news is positive, that is e_{t-1} > 0, we recover the standard GARCH variance equation seen earlier:

\[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 + \beta\sigma_{t-1}^2 \]

With the estimates plugged in, this is:

\[ \sigma_{t}^2 = 0.000010 + 0.037218e_{t-1}^2 + 0.874450\sigma_{t-1}^2 \]
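This asymmetry can be visualised directly with rugarch's news-impact curve, which plots the conditional variance implied by the fitted GJR-GARCH model M_1 as a function of the previous shock; a sketch using the newsimpact() accessor is shown below.

# News-impact curve of the fitted GJR-GARCH model
ni <- newsimpact(object = M_1)
plot(ni$zx, ni$zy, type = "l", lwd = 2,
     xlab = "Previous shock e[t-1]", ylab = "Conditional variance")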

Generally, the third model is better than the first two models. To validate this claim we shall focus on the following:
* skewness statistic
* information criteria
The skew estimate for this model is close to one, and its information criteria values are lower than those of the other two models. This is enough evidence to show that this model performs better than the first two models.
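A compact way to make that comparison (a sketch assuming the three fitted objects m, M and M_1 from above are still in the workspace) is to line their information criteria up side by side.

# Information criteria of the three fitted models; lower is better
# (row order assumed to follow the summary tables: Akaike, Bayes, Shibata, Hannan-Quinn)
data.frame(sGARCH_norm   = as.numeric(infocriteria(m)),
           sGARCH_sstd   = as.numeric(infocriteria(M)),
           gjrGARCH_sstd = as.numeric(infocriteria(M_1)),
           row.names = c("Akaike", "Bayes", "Shibata", "Hannan-Quinn"))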

Model-4: AR(1) GJR GARCH Model

We shall edit the model we used above to have the AR(1) GJR GARCH model;

AR_GJR<-ugarchspec(mean.model = list(armaOrder =c(1,0)),
              variance.model = list(model="gjrGARCH"),
              distribution.model = 'sstd')
M_2<- ugarchfit(data=Returns,spec = AR_GJR)
M_2

*---------------------------------*
*          GARCH Model Fit        *
*---------------------------------*

Conditional Variance Dynamics   
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model  : ARFIMA(1,0,0)
Distribution    : sstd 

Optimal Parameters
------------------------------------
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001413    0.000246  5.74103  0.00000
ar1     0.010777    0.015836  0.68056  0.49615
omega   0.000010    0.000001  8.19257  0.00000
alpha1  0.036718    0.001790 20.51119  0.00000
beta1   0.874146    0.009433 92.66428  0.00000
gamma1  0.139487    0.019706  7.07832  0.00000
skew    1.005077    0.021486 46.77744  0.00000
shape   5.458812    0.396628 13.76304  0.00000

Robust Standard Errors:
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001413    0.000267  5.29314 0.000000
ar1     0.010777    0.014465  0.74508 0.456226
omega   0.000010    0.000002  5.55291 0.000000
alpha1  0.036718    0.010358  3.54498 0.000393
beta1   0.874146    0.010674 81.89345 0.000000
gamma1  0.139487    0.024749  5.63612 0.000000
skew    1.005077    0.020616 48.75308 0.000000
shape   5.458812    0.497708 10.96789 0.000000

LogLikelihood : 10740.75 

Information Criteria
------------------------------------
                    
Akaike       -5.2702
Bayes        -5.2578
Shibata      -5.2702
Hannan-Quinn -5.2658

Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
                        statistic p-value
Lag[1]                     0.6056  0.4365
Lag[2*(p+q)+(p+q)-1][2]    1.0428  0.7140
Lag[4*(p+q)+(p+q)-1][5]    3.8959  0.2425
d.o.f=1
H0 : No serial correlation

Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
                        statistic p-value
Lag[1]                     0.2268  0.6339
Lag[2*(p+q)+(p+q)-1][5]    1.3456  0.7778
Lag[4*(p+q)+(p+q)-1][9]    2.3370  0.8612
d.o.f=2

Weighted ARCH LM Tests
------------------------------------
            Statistic Shape Scale P-Value
ARCH Lag[3]    0.7797 0.500 2.000  0.3772
ARCH Lag[5]    2.0551 1.440 1.667  0.4592
ARCH Lag[7]    2.3206 2.315 1.543  0.6497

Nyblom stability test
------------------------------------
Joint Statistic:  22.3087
Individual Statistics:              
mu     0.49114
ar1    0.26770
omega  5.18582
alpha1 1.44965
beta1  1.43671
gamma1 0.97868
skew   0.03063
shape  1.36409

Asymptotic Critical Values (10% 5% 1%)
Joint Statistic:         1.89 2.11 2.59
Individual Statistic:    0.35 0.47 0.75

Sign Bias Test
------------------------------------
                   t-value   prob sig
Sign Bias           1.2416 0.2145    
Negative Sign Bias  0.5609 0.5749    
Positive Sign Bias  0.1590 0.8737    
Joint Effect        2.5170 0.4722    


Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
  group statistic p-value(g-1)
1    20     25.73       0.1378
2    30     34.63       0.2170
3    40     44.41       0.2544
4    50     51.59       0.3729


Elapsed time : 3.45856 

Create the 12 plots

plot(M_2,which='all')

please wait...calculating quantiles...

There is nothing new in the plots above; what is readily noticeable, however, is that the AR(1) term we have just included is not statistically significant. The implication is that adding this term to the GARCH model is not helpful, and we should consider removing it. In other words, we are no longer going to use this model.

Model-5: GJR-GARCH in mean

As usual, we shall edit the model we used above to obtain the GJR-GARCH-in-mean model.

GJR_mean<-ugarchspec(mean.model = list(armaOrder =c(0,0),
                                       archm=T,
                                       archpow=2),
              variance.model = list(model="gjrGARCH"),
              distribution.model = 'sstd')
M_3<- ugarchfit(data=Returns,spec = GJR_mean)
M_3

*---------------------------------*
*          GARCH Model Fit        *
*---------------------------------*

Conditional Variance Dynamics   
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model  : ARFIMA(0,0,0)
Distribution    : sstd 

Optimal Parameters
------------------------------------
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001456    0.000327  4.45794 0.000008
archm  -0.123404    0.941008 -0.13114 0.895664
omega   0.000010    0.000001  8.51083 0.000000
alpha1  0.037186    0.004869  7.63665 0.000000
beta1   0.874803    0.009427 92.80199 0.000000
gamma1  0.137798    0.019951  6.90689 0.000000
skew    1.004600    0.021629 46.44748 0.000000
shape   5.436195    0.403567 13.47038 0.000000

Robust Standard Errors:
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001456    0.000382  3.81118 0.000138
archm  -0.123404    0.974138 -0.12668 0.899194
omega   0.000010    0.000002  6.17301 0.000000
alpha1  0.037186    0.009680  3.84166 0.000122
beta1   0.874803    0.010795 81.03857 0.000000
gamma1  0.137798    0.024679  5.58354 0.000000
skew    1.004600    0.020274 49.55001 0.000000
shape   5.436195    0.474629 11.45355 0.000000

LogLikelihood : 10740.53 

Information Criteria
------------------------------------
                    
Akaike       -5.2701
Bayes        -5.2577
Shibata      -5.2701
Hannan-Quinn -5.2657

Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
                        statistic p-value
Lag[1]                      1.837  0.1753
Lag[2*(p+q)+(p+q)-1][2]     2.286  0.2195
Lag[4*(p+q)+(p+q)-1][5]     5.150  0.1415
d.o.f=0
H0 : No serial correlation

Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
                        statistic p-value
Lag[1]                     0.1814  0.6701
Lag[2*(p+q)+(p+q)-1][5]    1.3255  0.7827
Lag[4*(p+q)+(p+q)-1][9]    2.3359  0.8613
d.o.f=2

Weighted ARCH LM Tests
------------------------------------
            Statistic Shape Scale P-Value
ARCH Lag[3]    0.7881 0.500 2.000  0.3747
ARCH Lag[5]    2.1081 1.440 1.667  0.4478
ARCH Lag[7]    2.3712 2.315 1.543  0.6391

Nyblom stability test
------------------------------------
Joint Statistic:  22.2572
Individual Statistics:              
mu     0.50324
archm  0.32431
omega  4.97450
alpha1 1.44914
beta1  1.43572
gamma1 0.97241
skew   0.03083
shape  1.36238

Asymptotic Critical Values (10% 5% 1%)
Joint Statistic:         1.89 2.11 2.59
Individual Statistic:    0.35 0.47 0.75

Sign Bias Test
------------------------------------
                   t-value   prob sig
Sign Bias           1.2901 0.1971    
Negative Sign Bias  0.5520 0.5810    
Positive Sign Bias  0.1468 0.8833    
Joint Effect        2.6969 0.4408    


Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
  group statistic p-value(g-1)
1    20     22.96       0.2389
2    30     35.57       0.1864
3    40     40.64       0.3982
4    50     50.58       0.4108


Elapsed time : 4.48175 

Again the additional term (archm) is not statistically significant, so we shall not use this model for our predictions either.

GARCH Model in R

Simulating stock prices

In this part we shall make use of the best of the five models above to simulate Apple's stock prices. Of the five models we fitted, the best was the GJR-GARCH model, the third model we estimated. We shall therefore copy that specification and edit it to run our simulation.

GJR<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
              variance.model = list(model="gjrGARCH"),
              distribution.model = 'sstd')
M_1<- ugarchfit(data=Returns,spec = GJR)
M_1

*---------------------------------*
*          GARCH Model Fit        *
*---------------------------------*

Conditional Variance Dynamics   
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model  : ARFIMA(0,0,0)
Distribution    : sstd 

Optimal Parameters
------------------------------------
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001423    0.000244   5.8404        0
omega   0.000010    0.000001   8.0677        0
alpha1  0.037218    0.001654  22.5009        0
beta1   0.874450    0.009426  92.7727        0
gamma1  0.137875    0.019465   7.0831        0
skew    1.004774    0.021448  46.8466        0
shape   5.434271    0.393137  13.8228        0

Robust Standard Errors:
        Estimate  Std. Error  t value Pr(>|t|)
mu      0.001423    0.000267   5.3386  0.00000
omega   0.000010    0.000002   5.4622  0.00000
alpha1  0.037218    0.010454   3.5603  0.00037
beta1   0.874450    0.010689  81.8099  0.00000
gamma1  0.137875    0.024481   5.6319  0.00000
skew    1.004774    0.020432  49.1765  0.00000
shape   5.434271    0.495804  10.9605  0.00000

LogLikelihood : 10740.52 

Information Criteria
------------------------------------
                    
Akaike       -5.2706
Bayes        -5.2597
Shibata      -5.2706
Hannan-Quinn -5.2667

Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
                        statistic p-value
Lag[1]                      1.885  0.1698
Lag[2*(p+q)+(p+q)-1][2]     2.319  0.2150
Lag[4*(p+q)+(p+q)-1][5]     5.179  0.1394
d.o.f=0
H0 : No serial correlation

Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
                        statistic p-value
Lag[1]                     0.1851  0.6671
Lag[2*(p+q)+(p+q)-1][5]    1.3287  0.7819
Lag[4*(p+q)+(p+q)-1][9]    2.3390  0.8609
d.o.f=2

Weighted ARCH LM Tests
------------------------------------
            Statistic Shape Scale P-Value
ARCH Lag[3]    0.7895 0.500 2.000  0.3743
ARCH Lag[5]    2.1081 1.440 1.667  0.4478
ARCH Lag[7]    2.3698 2.315 1.543  0.6394

Nyblom stability test
------------------------------------
Joint Statistic:  21.7937
Individual Statistics:              
mu     0.49535
omega  5.11751
alpha1 1.44995
beta1  1.43854
gamma1 0.97126
skew   0.03061
shape  1.36371

Asymptotic Critical Values (10% 5% 1%)
Joint Statistic:         1.69 1.9 2.35
Individual Statistic:    0.35 0.47 0.75

Sign Bias Test
------------------------------------
                   t-value   prob sig
Sign Bias           1.2770 0.2017    
Negative Sign Bias  0.5576 0.5771    
Positive Sign Bias  0.1498 0.8809    
Joint Effect        2.6469 0.4493    


Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
  group statistic p-value(g-1)
1    20     23.23       0.2274
2    30     34.73       0.2135
3    40     40.62       0.3990
4    50     48.86       0.4786


Elapsed time : 2.729416 

We shall treat this model as our final model. Run the two command lines above and run the command lines below as well.

GJR_final<-GJR
setfixed(GJR_final)<-as.list(coef(M_1))

Run the command below to obtain the coefficients of the model M_1.

coef(M_1)
           mu         omega        alpha1         beta1        gamma1 
0.00142344481 0.00001015981 0.03721790057 0.87445039679 0.13787499650 
         skew         shape 
1.00477372417 5.43427134460 

Now we want to compare one-year forecasts made at two different points in time: the end of 2008 and the end of 2022. Each forecast is for the following year, which has approximately 252 trading days.

f2008<-ugarchforecast(data = Returns["/2008-12"],
                      fitORspec = GJR_final,
                      n.ahead = 252)
f2022<-ugarchforecast(data = Returns["/2022-12"],
                      fitORspec = GJR_final,
                      n.ahead = 252)
par(mfrow= c(2,1))
plot(sigma(f2008))

plot(sigma(f2022))

At the end of 2008 volatility was high, and the forecast was that over the following year volatility would fall. At the end of 2022 volatility was low, and the forecast was therefore that volatility would increase during 2023. In the long run, volatility is expected to move towards its average level. Now let us simulate returns from the model and store the results in the object SIM. Consider the command below;

Note: m.sim is the number of simulated return series you want; here we ask for 3 different simulated series. n.sim is the horizon, roughly one trading year (here 250 days). We set a random seed (rseed) for reproducibility.

SIM <- ugarchpath(spec = GJR_final,
                  m.sim = 3,
                  n.sim = 1*250,
                  rseed = 123)
plot.zoo(fitted(SIM))

The graph above shows three different simulated return series based on our model. We can also plot the simulated volatility over the same horizon in 2023.

plot.zoo(sigma(SIM))

Convert to actual Apple share price

It is possible that you do not want the simulated returns or simulated volatility, but rather the actual Apple share prices. First, run the following command;

tail(AAPL)
           AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2023-03-02    144.38    146.71   143.90     145.91    52238100        145.91
2023-03-03    148.04    151.11   147.33     151.03    70732300        151.03
2023-03-06    153.79    156.30   153.46     153.83    87558000        153.83
2023-03-07    153.70    154.03   151.13     151.60    56182000        151.60
2023-03-08    152.81    153.47   151.83     152.87    47204800        152.87
2023-03-09    153.56    154.54   150.23     150.59    53833600        150.59

We are going to use Apple's closing price for the last trading day shown above as the starting value for our simulation of real share prices, and the cumulative sum of the simulated daily returns to convert them to cumulative values.

P <- 161.3*apply(fitted(SIM), 2,'cumsum')+291.52
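The command above uses hard-coded scaling values. A minimal alternative sketch, assuming the simulated values in SIM are simple daily returns and taking the last closing price printed in the tail(AAPL) output above (150.59 dollars) as the starting value, compounds the simulated returns instead:

# Compound the simulated daily returns into price paths starting from the last close
last_close <- 150.59
P_alt <- last_close * apply(1 + fitted(SIM), 2, cumprod)
head(P_alt)

If the returns were computed as log returns, the compounding line would instead be last_close * exp(apply(fitted(SIM), 2, cumsum)).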

Make the plot of the cumulative share prices

matplot(P, type = "l", lwd = 3)
Stock prices prediction

The plot shows the three simulated price paths; in this simulation the Apple stock price rises throughout the period up to 2023-08-29.

PARAMETRIC TESTS AND REGRESSION ANALYSIS

INTRODUCTION

In terms of selecting a statistical test, the most important question is “what is the main study hypothesis?” In some cases there is no hypothesis; the investigator just wants to “see what is there”. For example, in a prevalence study there is no hypothesis to test, and the size of the study is determined by how accurately the investigator wants to determine the prevalence. If there is no hypothesis, then there is no statistical test. It is important to decide a priori which hypotheses are confirmatory (that is, are testing some presupposed relationship), and which are exploratory (are suggested by the data). No single study can support a whole series of hypotheses. A sensible plan is to limit severely the number of confirmatory hypotheses. Although it is valid to use statistical tests on hypotheses suggested by the data, the P values should be used only as guidelines, and the results treated as tentative until confirmed by subsequent studies. A useful guide is to use a Bonferroni correction, which states simply that if one is testing n independent hypotheses, one should use a significance level of 0.05/n. Thus if there were two independent hypotheses a result would be declared significant only if P<0.025. Note that, since tests are rarely independent, this is a very conservative procedure – i.e. one that is unlikely to reject the null hypothesis. The investigator should then ask “are the data independent?” This can be difficult to decide but as a rule of thumb results on the same individual, or from matched individuals, are not independent. Thus results from a crossover trial, or from a case-control study in which the controls were matched to the cases by age, sex and social class, are not independent.

Parametric Tests

Parametric tests assume a normal distribution of values, or a “bell-shaped curve.” For example, height is roughly a normal distribution in that if you were to graph height from a group of people, one would see a typical bell-shaped curve. This distribution is also called a Gaussian distribution. Parametric tests are in general more powerful (require a smaller sample size) than nonparametric tests. Non parametric tests include the following

  • One sample t-test
  • Paired and unpaired t-test
  • Analysis of Variance (ANOVA)
  • Pearson correlation

One sample t-test

The One Sample t Test examines whether the mean of a population is statistically different from a known or hypothesized value. The One Sample t Test is a parametric test. This test is also known as single Sample t Test. The variable used in this test is known as the test variable. In a One Sample t Test, the test variable’s mean is compared against a “test value”, which is a known or hypothesized value of the mean in the population. Test values may come from a literature review, a trusted research organization, legal requirements, or industry standards. For example:

A particular factory’s machines are supposed to fill bottles with 150 milliliters of product. A plant manager wants to test a random sample of bottles to ensure that the machines are not under- or over-filling the bottles. The United States Environmental Protection Agency (EPA) sets clearance levels for the amount of lead present in homes: no more than 10 micrograms per square foot on floors and no more than 100 micrograms per square foot on window sills (as of December 2020). An inspector wants to test if samples taken from units in an apartment building exceed the clearance level.

One sample t-test

Enter the following data

score<-c(50,65,72,77,73,85,88,80,65,56,66,78,82,90)
score
 [1] 50 65 72 77 73 85 88 80 65 56 66 78 82 90

Length and mean

length(score)
[1] 14
mean(score)
[1] 73.35714

Default t-test

t.test(score,mu=80)

    One Sample t-test

data:  score
t = -2.0988, df = 13, p-value = 0.05593
alternative hypothesis: true mean is not equal to 80
95 percent confidence interval:
 66.51943 80.19486
sample estimates:
mean of x 
 73.35714 

In the above output, the p-value of the test is 0.05593, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis that the true mean is equal to 80.
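The quantities quoted above can also be pulled out of the t.test() result programmatically, since it is returned as an "htest" list.

res <- t.test(score, mu = 80)
res$p.value    # 0.05593
res$conf.int   # 95% confidence interval for the mean
res$estimate   # sample mean of the scores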

Customized t-test

t.test(score,mu=80,alternative="two.sided",conf.level=0.95)

    One Sample t-test

data:  score
t = -2.0988, df = 13, p-value = 0.05593
alternative hypothesis: true mean is not equal to 80
95 percent confidence interval:
 66.51943 80.19486
sample estimates:
mean of x 
 73.35714 

As before, the p-value of the test is 0.05593, which is greater than the significance level alpha = 0.05, so we fail to reject the null hypothesis that the true mean is equal to 80.

One tail to the left

t.test(score,mu=80,alternative="less",conf.level=0.95)

    One Sample t-test

data:  score
t = -2.0988, df = 13, p-value = 0.02797
alternative hypothesis: true mean is less than 80
95 percent confidence interval:
     -Inf 78.96227
sample estimates:
mean of x 
 73.35714 

In the above output, the p-value of the test is 0.02797, which is less than the significance level alpha = 0.05. We can conclude that the true mean is less than 80.

One tailed to the right

t.test(score,mu=80,alternative="greater",conf.level=0.95)

    One Sample t-test

data:  score
t = -2.0988, df = 13, p-value = 0.972
alternative hypothesis: true mean is greater than 80
95 percent confidence interval:
 67.75202      Inf
sample estimates:
mean of x 
 73.35714 

In the above output, the p-value of the test is 0.972, which is greater than the significance level alpha = 0.05. We fail to reject the null hypothesis; there is no evidence that the true mean is greater than 80.

TWO SAMPLE T-TEST

Paired t-test

Paired T-Test. The paired sample t-test, sometimes called the dependent sample t-test, is a statistical procedure used to determine whether the mean difference between two sets of observations is zero. In a paired sample t-test, each subject or entity is measured twice, resulting in pairs of observations.

Create the data set(By variables)

weight_before <-c(190.1, 190.9, 172.7, 213, 231.4,
        196.9, 172.2, 285.5, 225.2, 113.7)
 
weight_after <-c(392.9, 313.2, 345.1, 393, 434,
        227.9, 422, 383.9, 392.3, 352.2)

Data frame

myDatta<-data.frame(weight_before,weight_after)
head(myDatta,10)
   weight_before weight_after
1          190.1        392.9
2          190.9        313.2
3          172.7        345.1
4          213.0        393.0
5          231.4        434.0
6          196.9        227.9
7          172.2        422.0
8          285.5        383.9
9          225.2        392.3
10         113.7        352.2

Perform the test (two sided)

t.test(weight_before,weight_after, alternative="two.sided",paired = TRUE)

    Paired t-test

data:  weight_before and weight_after
t = -7.9059, df = 9, p-value = 0.00002433
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -214.1284 -118.8516
sample estimates:
mean difference 
        -166.49 

In the above output, the p-value of the test is approximately 0.00002, which is less than the significance level alpha = 0.05. We can conclude that the true mean difference is not equal to zero.
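As a sanity check, the paired t-test is equivalent to a one-sample t-test on the within-pair differences.

# One-sample t-test on the differences reproduces the paired test above
t.test(weight_before - weight_after, mu = 0)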

Perform the test (one tailed to the left)

t.test(weight_before,weight_after, alternative="less",paired = TRUE)

    Paired t-test

data:  weight_before and weight_after
t = -7.9059, df = 9, p-value = 0.00001216
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
      -Inf -127.8868
sample estimates:
mean difference 
        -166.49 

In the above output, the p-value of the test is approximately 0.00001, which is less than the significance level alpha = 0.05. We can conclude that the true mean difference is less than zero.

Perform the test (one tailed to the right)

t.test(weight_before,weight_after, alternative="greater",paired = TRUE)

    Paired t-test

data:  weight_before and weight_after
t = -7.9059, df = 9, p-value = 1
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
 -205.0932       Inf
sample estimates:
mean difference 
        -166.49 

In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We fail to reject the null hypothesis; there is no evidence that the true mean difference is greater than zero.

Create the data set(By group)

weight_before <-c(190.1, 190.9, 172.7, 213, 231.4,
        196.9, 172.2, 285.5, 225.2, 113.7)
 
weight_after <-c(392.9, 313.2, 345.1, 393, 434,
        227.9, 422, 383.9, 392.3, 352.2)

Data frame the data set

myData <- data.frame(
  weight = c(weight_before, weight_after),
group = rep(c("weight_before", "weight_after"), each = 10)
)

View data set

myData
   weight         group
1   190.1 weight_before
2   190.9 weight_before
3   172.7 weight_before
4   213.0 weight_before
5   231.4 weight_before
6   196.9 weight_before
7   172.2 weight_before
8   285.5 weight_before
9   225.2 weight_before
10  113.7 weight_before
11  392.9  weight_after
12  313.2  weight_after
13  345.1  weight_after
14  393.0  weight_after
15  434.0  weight_after
16  227.9  weight_after
17  422.0  weight_after
18  383.9  weight_after
19  392.3  weight_after
20  352.2  weight_after

Perform the test (two sided)

t.test(weight~group, alternative="two.sided",paired = TRUE, data=myData)

    Paired t-test

data:  weight by group
t = 7.9059, df = 9, p-value = 0.00002433
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 118.8516 214.1284
sample estimates:
mean difference 
         166.49 

In the above output, the p-value of the test is approximately 0.00002, which is less than the significance level alpha = 0.05. We can conclude that the true mean difference between the two groups is not equal to zero.

Perform the test (left sided)

t.test(weight~group, alternative="less",paired = TRUE, data=myData)

    Paired t-test

data:  weight by group
t = 7.9059, df = 9, p-value = 1
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
     -Inf 205.0932
sample estimates:
mean difference 
         166.49 

In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We fail to reject the null hypothesis; there is no evidence that the true mean difference is less than zero.

Perform the test (right sided)

t.test(weight~group, alternative="greater",paired = TRUE, data=myData)

    Paired t-test

data:  weight by group
t = 7.9059, df = 9, p-value = 0.00001216
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
 127.8868      Inf
sample estimates:
mean difference 
         166.49 

In the above output, the p-value of the test is approximately 0.00001, which is less than the significance level alpha = 0.05. We can conclude that the true mean difference is greater than zero.

Import a table containing the data set

my_tbl2020 <- tibble::tribble(
  ~S.n, ~mathematics, ~arts,
     1,           22,    53,
     2,           37,    68,
     3,           36,    42,
     4,           38,    49,
     5,           42,    51,
     6,           58,    65,
     7,           58,    51,
     8,           60,    71,
     9,           62,    55,
    10,           65,    74,
    11,           66,    68,
    12,           56,    64,
    13,           66,    67,
    14,           67,    73,
    15,           62,    65
  )

require(knitr)
kable(my_tbl2020, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)
 S.n   mathematics   arts
   1            22     53
   2            37     68
   3            36     42
   4            38     49
   5            42     51
   6            58     65
   7            58     51
   8            60     71
   9            62     55
  10            65     74
  11            66     68
  12            56     64
  13            66     67
  14            67     73
  15            62     65

Attach the data set

attach(my_tbl2020)

View the data set

head(my_tbl2020,10)
# A tibble: 10 × 3
     S.n mathematics  arts
   <dbl>       <dbl> <dbl>
 1     1          22    53
 2     2          37    68
 3     3          36    42
 4     4          38    49
 5     5          42    51
 6     6          58    65
 7     7          58    51
 8     8          60    71
 9     9          62    55
10    10          65    74

Perform the paired test

TTEST<-t.test(mathematics,arts, alternative = "two.sided",paired = TRUE,data=my_tbl2020, conf.level=.95)
TTEST

    Paired t-test

data:  mathematics and arts
t = -2.8805, df = 14, p-value = 0.0121
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -14.073042  -2.060291
sample estimates:
mean difference 
      -8.066667 

In the above output, the p-value of the test is 0.0121, which is less than the significance level alpha = 0.05. We can conclude that the true mean difference between the mathematics and arts scores is not equal to zero at the 5% level of significance.

Additional Paired Tests

Generate the data set

before<-c(117,111,98,104,105,100,81,89,78)
after<-c(83,85,75,82,82,77,62,69,64)

Data Frame

My_framed_data<-data.frame(before,after)
My_framed_data
  before after
1    117    83
2    111    85
3     98    75
4    104    82
5    105    82
6    100    77
7     81    62
8     89    69
9     78    64

Perform the two sided test

MY_TTEST<-t.test(before,after, alternative = "two.sided",paired = TRUE,data=My_framed_data, conf.level = 0.99)
MY_TTEST

    Paired t-test

data:  before and after
t = 12.52, df = 8, p-value = 0.000001551
alternative hypothesis: true mean difference is not equal to 0
99 percent confidence interval:
 16.59186 28.74147
sample estimates:
mean difference 
       22.66667 

In the above output, the p-value of the test is approximately 0.0000016, which is less than the significance level alpha = 0.05. We can therefore conclude that the true mean difference between the before and after measurements is not equal to zero at the 5% level of significance (note that the reported confidence interval is at the 99% level).

Perform the left tailed test

MY_TTEST1<-t.test(before,after, alternative = "less",paired = TRUE,data=My_framed_data)
MY_TTEST1

    Paired t-test

data:  before and after
t = 12.52, df = 8, p-value = 1
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
     -Inf 26.03331
sample estimates:
mean difference 
       22.66667 

In the above output, the p-value of the test is 1, which is much greater than the significance level alpha = 0.05. We therefore retain the null hypothesis; there is no evidence that the true mean difference between the before and after measurements is less than zero.

Perform the right tailed test

MY_TTEST2<-t.test(before,after, alternative = "greater",paired = TRUE,data=My_framed_data)
MY_TTEST2

    Paired t-test

data:  before and after
t = 12.52, df = 8, p-value = 0.0000007754
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
 19.30002      Inf
sample estimates:
mean difference 
       22.66667 

In the above output, the p-value of the test is approximately 0.0000008, which is much less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean difference between the before and after measurements is greater than zero.

UNPAIRED T-TEST

What is an unpaired t-test? An unpaired t-test (also known as an independent t-test) is a statistical procedure that compares the averages/means of two independent or unrelated groups to determine if there is a significant difference between the two.

Extract the data set and run the test (two sided)

data("gapminder")
data2021<-gapminder%>%
  dplyr::select(country, lifeExp)%>%
  filter(country == "Kenya"|
           country == "United States")

head(data2021,14)
# A tibble: 14 × 2
   country       lifeExp
   <fct>           <dbl>
 1 Kenya            42.3
 2 Kenya            44.7
 3 Kenya            47.9
 4 Kenya            50.7
 5 Kenya            53.6
 6 Kenya            56.2
 7 Kenya            58.8
 8 Kenya            59.3
 9 Kenya            59.3
10 Kenya            54.4
11 Kenya            51.0
12 Kenya            54.1
13 United States    68.4
14 United States    69.5
t.test(lifeExp ~ country,alternative="two.sided",data = data2021)

    Welch Two Sample t-test

data:  lifeExp by country
t = -11.051, df = 17.966, p-value = 0.000000001917
alternative hypothesis: true difference in means between group Kenya and group United States is not equal to 0
95 percent confidence interval:
 -24.75174 -16.84326
sample estimates:
        mean in group Kenya mean in group United States 
                    52.6810                     73.4785 

In the above output, the p-value of the test is far below the significance level alpha = 0.05. We can conclude that the true difference in mean life expectancy between Kenya and the United States is not equal to zero.
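Note that t.test() applies the Welch correction for unequal variances by default; if equal variances are assumed, setting var.equal = TRUE gives the classic two-sample Student's t-test.

# Classic (pooled-variance) two-sample t-test instead of the Welch version
t.test(lifeExp ~ country, data = data2021, var.equal = TRUE)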

Perform the test (left sided)

data2021<-gapminder %>%
  dplyr::select(country, lifeExp)%>%
  filter(country == "Kenya"|
           country == "United States")

head(data2021,14)
# A tibble: 14 × 2
   country       lifeExp
   <fct>           <dbl>
 1 Kenya            42.3
 2 Kenya            44.7
 3 Kenya            47.9
 4 Kenya            50.7
 5 Kenya            53.6
 6 Kenya            56.2
 7 Kenya            58.8
 8 Kenya            59.3
 9 Kenya            59.3
10 Kenya            54.4
11 Kenya            51.0
12 Kenya            54.1
13 United States    68.4
14 United States    69.5
t.test(lifeExp ~ country,alternative="less",data = data2021)

    Welch Two Sample t-test

data:  lifeExp by country
t = -11.051, df = 17.966, p-value = 0.0000000009583
alternative hypothesis: true difference in means between group Kenya and group United States is less than 0
95 percent confidence interval:
      -Inf -17.53385
sample estimates:
        mean in group Kenya mean in group United States 
                    52.6810                     73.4785 

In the above output, the p-value of the test is far below the significance level alpha = 0.05. We can conclude that the true difference in mean life expectancy between Kenya and the United States is less than zero, that is, Kenya's mean life expectancy is lower than that of the United States.

Perform the test (right sided)

data2021<-gapminder %>%
  dplyr::select(country, lifeExp)%>%
  filter(country == "Kenya"|
           country == "United States")

head(data2021,14)
# A tibble: 14 × 2
   country       lifeExp
   <fct>           <dbl>
 1 Kenya            42.3
 2 Kenya            44.7
 3 Kenya            47.9
 4 Kenya            50.7
 5 Kenya            53.6
 6 Kenya            56.2
 7 Kenya            58.8
 8 Kenya            59.3
 9 Kenya            59.3
10 Kenya            54.4
11 Kenya            51.0
12 Kenya            54.1
13 United States    68.4
14 United States    69.5
t.test(lifeExp ~ country,alternative="greater",data = data2021)

    Welch Two Sample t-test

data:  lifeExp by country
t = -11.051, df = 17.966, p-value = 1
alternative hypothesis: true difference in means between group Kenya and group United States is greater than 0
95 percent confidence interval:
 -24.06115       Inf
sample estimates:
        mean in group Kenya mean in group United States 
                    52.6810                     73.4785 

In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We fail to reject the null hypothesis; there is no evidence that the difference in mean life expectancy between Kenya and the United States is greater than zero.

Additional Tests

Additional unpaired test

male<-c(135,180,108,128,160,143,175,170)
female<-c(205,195,185,150,175,190,180,220)

Data Frame

daatta<-data.frame(male,female)
daatta
  male female
1  135    205
2  180    195
3  108    185
4  128    150
5  160    175
6  143    190
7  175    180
8  170    220

Perform the two tailed test

TTEST1<-t.test(male,female, alternative = "two.sided",paired = FALSE,data=daatta)
TTEST1

    Welch Two Sample t-test

data:  male and female
t = -3.2304, df = 13.477, p-value = 0.006308
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -62.69717 -12.55283
sample estimates:
mean of x mean of y 
  149.875   187.500 

In the above output, the p-value of the test is approximately 0.006308, which is less than the significance level alpha = 0.05. We can conclude that the true mean difference in height between male and female is not equal to zero with a p-value which is approximately 0.006308 at a 5% level of significance.

Perform the left tailed test

TTEST2<-t.test(male,female, alternative = "less",paired = FALSE,data=daatta)
TTEST2

    Welch Two Sample t-test

data:  male and female
t = -3.2304, df = 13.477, p-value = 0.003154
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -17.05415
sample estimates:
mean of x mean of y 
  149.875   187.500 

In the above output, the p-value of the test is approximately 0.003154, which is less than the significance level alpha = 0.05. We can conclude that the true mean difference in height between male and female is less than zero with a p-value which is approximately 0.003154 at a 5% level of significance.

Perform the right tailed test

TTEST3<-t.test(male,female, alternative = "greater",paired = FALSE,data=daatta)
TTEST3

    Welch Two Sample t-test

data:  male and female
t = -3.2304, df = 13.477, p-value = 0.9968
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -58.19585       Inf
sample estimates:
mean of x mean of y 
  149.875   187.500 

In the above output, the p-value of the test is approximately 0.9968, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis; there is no evidence that the true mean difference in height between males and females is greater than zero.

Additional data set

weight<-c(135,180,108,128,160,143,175,170,205,195,185,150,175,190,180,220)
weight
 [1] 135 180 108 128 160 143 175 170 205 195 185 150 175 190 180 220

Generate the grouping variable

gender<-gl(2,8,labels=c("male","female"))
gender
 [1] male   male   male   male   male   male   male   male   female female
[11] female female female female female female
Levels: male female

Data frame the observations

paired_data<-data.frame(weight,gender)
paired_data
   weight gender
1     135   male
2     180   male
3     108   male
4     128   male
5     160   male
6     143   male
7     175   male
8     170   male
9     205 female
10    195 female
11    185 female
12    150 female
13    175 female
14    190 female
15    180 female
16    220 female
TTEST4<-t.test(weight~gender, alternative = "two.sided",paired = FALSE,data=paired_data)
TTEST4

    Welch Two Sample t-test

data:  weight by gender
t = -3.2304, df = 13.477, p-value = 0.006308
alternative hypothesis: true difference in means between group male and group female is not equal to 0
95 percent confidence interval:
 -62.69717 -12.55283
sample estimates:
  mean in group male mean in group female 
             149.875              187.500 

In the above output, the p-value of the test is approximately 0.006308, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true difference in mean weight between the male and female groups is not equal to zero at the 5% level of significance.

Perform the left tailed test

TTEST5<-t.test(weight~gender, alternative = "less",paired = FALSE,data=paired_data)
TTEST5

    Welch Two Sample t-test

data:  weight by gender
t = -3.2304, df = 13.477, p-value = 0.003154
alternative hypothesis: true difference in means between group male and group female is less than 0
95 percent confidence interval:
      -Inf -17.05415
sample estimates:
  mean in group male mean in group female 
             149.875              187.500 

In the above output, the p-value of the test is approximately 0.003154, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true difference in mean weight between the male and female groups is less than zero at the 5% level of significance.

Perform the right tailed test

TTEST6<-t.test(weight~gender, alternative = "greater",paired = FALSE,data=paired_data)
TTEST6

    Welch Two Sample t-test

data:  weight by gender
t = -3.2304, df = 13.477, p-value = 0.9968
alternative hypothesis: true difference in means between group male and group female is greater than 0
95 percent confidence interval:
 -58.19585       Inf
sample estimates:
  mean in group male mean in group female 
             149.875              187.500 

In the above output, the p-value of the test is approximately 0.9968, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis that the true difference in mean weight between the male and female groups is less than or equal to zero (p-value ≈ 0.9968).

ANALYSIS OF VARIANCE

Analysis of Variance (ANOVA) is a statistical approach for comparing the means of different groups by partitioning the overall variation in the data. It is used in a wide range of scenarios to determine whether there is any difference between the means of three or more groups.

Load gapminder data set

head(gapminder,10)
# A tibble: 10 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.

Perform the test (GDP per capita against continent)

attach(gapminder)
ANOVA<-aov(gdpPercap~continent)
summary(ANOVA)
              Df       Sum Sq    Mean Sq F value              Pr(>F)    
continent      4  37990228667 9497557167   126.6 <0.0000000000000002 ***
Residuals   1699 127489276662   75037832                                
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above, the F statistic is 126.6 with an associated p-value of less than 0.0001, which is well below 0.05. This indicates that there is a significant difference in mean GDP per capita across continents at the 5% level of significance.
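
Before relying on this result, the usual ANOVA assumptions can be inspected informally from the fitted object; a minimal sketch using the base R diagnostic plots:

plot(ANOVA, 1)  # residuals vs fitted values, to eyeball equal variance across groups
plot(ANOVA, 2)  # normal Q-Q plot of the residuals, to eyeball normality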

Test the significance of the difference

Load the package below

library(agricolae)

Test the significant difference

TK<-TukeyHSD(ANOVA)
TK
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = gdpPercap ~ continent)

$continent
                       diff        lwr       upr     p adj
Americas-Africa   4942.3558  3280.4806  6604.231 0.0000000
Asia-Africa       5708.3958  4188.6340  7228.158 0.0000000
Europe-Africa    12275.7210 10710.1623 13841.280 0.0000000
Oceania-Africa   16427.8546 11507.4034 21348.306 0.0000000
Asia-Americas      766.0401 -1044.5150  2576.595 0.7767582
Europe-Americas   7333.3652  5484.2011  9182.529 0.0000000
Oceania-Americas 11485.4989  6467.6035 16503.394 0.0000000
Europe-Asia       6567.3251  4844.7530  8289.897 0.0000000
Oceania-Asia     10719.4588  5746.8215 15692.096 0.0000000
Oceania-Europe    4152.1337  -834.6909  9138.958 0.1539474

From the results above, a significant difference in mean GDP per capita exists between every pair of continents except for the Asia-Americas and Oceania-Europe comparisons.

Plot

plot(TK)

plot(TK, las=1)

ANOVA (Life Expectancy against continent)

ANOVA2<-aov(lifeExp~continent)
summary(ANOVA2)
              Df Sum Sq Mean Sq F value              Pr(>F)    
continent      4 139343   34836   408.7 <0.0000000000000002 ***
Residuals   1699 144805      85                                
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above, the F statistic is 408.7 with an associated p-value of less than 0.0001, which is well below 0.05. This indicates that there is a significant difference in mean life expectancy across continents at the 5% level of significance.

Test the significant difference

TK2<-TukeyHSD(ANOVA2)
TK2
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = lifeExp ~ continent)

$continent
                      diff       lwr       upr     p adj
Americas-Africa  15.793407 14.022263 17.564550 0.0000000
Asia-Africa      11.199573  9.579887 12.819259 0.0000000
Europe-Africa    23.038356 21.369862 24.706850 0.0000000
Oceania-Africa   25.460878 20.216908 30.704848 0.0000000
Asia-Americas    -4.593833 -6.523432 -2.664235 0.0000000
Europe-Americas   7.244949  5.274203  9.215696 0.0000000
Oceania-Americas  9.667472  4.319650 15.015293 0.0000086
Europe-Asia      11.838783 10.002952 13.674614 0.0000000
Oceania-Asia     14.261305  8.961718 19.560892 0.0000000
Oceania-Europe    2.422522 -2.892185  7.737230 0.7250559

From the results above, a significant difference in mean life expectancy exists between every pair of continents except for the Oceania-Europe comparison.

Plot the significant difference

plot(TK2)

plot(TK2, las=1)

TWO WAY ANOVA (Two factor ANOVA)

library(haven)
MAIZE_YIELDS_two_way_ANOVA <- read_dta("D:/Data/data and other training stuff/STATA training DATA set/MAIZE YIELDS two way ANOVA.dta")
head(MAIZE_YIELDS_two_way_ANOVA,10)
# A tibble: 10 × 3
   maize_yields treatment blocks     
          <dbl> <dbl+lbl> <dbl+lbl>  
 1         18.5 1 [TR 1]  1 [BLOCK 1]
 2         11.7 1 [TR 1]  2 [BLOCK 2]
 3         15.4 1 [TR 1]  3 [BLOCK 3]
 4         16.5 1 [TR 1]  4 [BLOCK 4]
 5         15.7 2 [TR 2]  1 [BLOCK 1]
 6         14.2 2 [TR 2]  2 [BLOCK 2]
 7         16.6 2 [TR 2]  3 [BLOCK 3]
 8         18.5 2 [TR 2]  4 [BLOCK 4]
 9         16.2 3 [TR 3]  1 [BLOCK 1]
10         12.9 3 [TR 3]  2 [BLOCK 2]

Rename the data set

my_2anova<-MAIZE_YIELDS_two_way_ANOVA
str(my_2anova)
tibble [24 × 3] (S3: tbl_df/tbl/data.frame)
 $ maize_yields: num [1:24] 18.5 11.7 15.4 16.5 15.7 14.2 16.6 18.5 16.2 12.9 ...
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ treatment   : dbl+lbl [1:24] 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5...
   ..@ format.stata: chr "%10.0g"
   ..@ labels      : Named num [1:6] 1 2 3 4 5 6
   .. ..- attr(*, "names")= chr [1:6] "TR 1" "TR 2" "TR 3" "TR 4" ...
 $ blocks      : dbl+lbl [1:24] 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3...
   ..@ format.stata: chr "%10.0g"
   ..@ labels      : Named num [1:4] 1 2 3 4
   .. ..- attr(*, "names")= chr [1:4] "BLOCK 1" "BLOCK 2" "BLOCK 3" "BLOCK 4"
my_2anova
# A tibble: 24 × 3
   maize_yields treatment blocks     
          <dbl> <dbl+lbl> <dbl+lbl>  
 1         18.5 1 [TR 1]  1 [BLOCK 1]
 2         11.7 1 [TR 1]  2 [BLOCK 2]
 3         15.4 1 [TR 1]  3 [BLOCK 3]
 4         16.5 1 [TR 1]  4 [BLOCK 4]
 5         15.7 2 [TR 2]  1 [BLOCK 1]
 6         14.2 2 [TR 2]  2 [BLOCK 2]
 7         16.6 2 [TR 2]  3 [BLOCK 3]
 8         18.5 2 [TR 2]  4 [BLOCK 4]
 9         16.2 3 [TR 3]  1 [BLOCK 1]
10         12.9 3 [TR 3]  2 [BLOCK 2]
# … with 14 more rows
attach(my_2anova)

Convert the treatment and blocks variables to factors

my_2anova$treatment <- as.factor(my_2anova$treatment) 
my_2anova$blocks <- as.factor(my_2anova$blocks)

Confirm the structure of the data set

str(my_2anova)
tibble [24 × 3] (S3: tbl_df/tbl/data.frame)
 $ maize_yields: num [1:24] 18.5 11.7 15.4 16.5 15.7 14.2 16.6 18.5 16.2 12.9 ...
  ..- attr(*, "format.stata")= chr "%10.0g"
 $ treatment   : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ blocks      : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
modell<-lm(maize_yields~treatment+blocks,data=my_2anova)
modell

Call:
lm(formula = maize_yields ~ treatment + blocks, data = my_2anova)

Coefficients:
(Intercept)   treatment2   treatment3   treatment4   treatment5   treatment6  
     14.921        0.725       -1.200        0.575        0.675        0.900  
    blocks2      blocks3      blocks4  
     -1.467        2.767        1.117  

Model summary

summary(modell)

Call:
lm(formula = maize_yields ~ treatment + blocks, data = my_2anova)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5958 -1.7688  0.0292  1.2313  3.5792 

Coefficients:
            Estimate Std. Error t value     Pr(>|t|)    
(Intercept)   14.921      1.410  10.585 0.0000000235 ***
treatment2     0.725      1.628   0.445       0.6624    
treatment3    -1.200      1.628  -0.737       0.4723    
treatment4     0.575      1.628   0.353       0.7288    
treatment5     0.675      1.628   0.415       0.6842    
treatment6     0.900      1.628   0.553       0.5884    
blocks2       -1.467      1.329  -1.104       0.2872    
blocks3        2.767      1.329   2.082       0.0549 .  
blocks4        1.117      1.329   0.840       0.4140    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.302 on 15 degrees of freedom
Multiple R-squared:  0.4681,    Adjusted R-squared:  0.1843 
F-statistic:  1.65 on 8 and 15 DF,  p-value: 0.192

Create the two-factor ANOVA table

two_way<-anova(modell)
two_way
Analysis of Variance Table

Response: maize_yields
          Df Sum Sq Mean Sq F value  Pr(>F)  
treatment  5 12.377  2.4754  0.4672 0.79478  
blocks     3 57.555 19.1849  3.6208 0.03802 *
Residuals 15 79.478  5.2985                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above, the F statistic for treatment is 0.4672 with an associated p-value of approximately 0.79478, which is greater than 0.05. This indicates that there is no significant difference in mean maize yield across treatments at the 5% level of significance. On the other hand, the F statistic for blocks is 3.6208 with an associated p-value of 0.03802, which is less than 0.05, a clear indication that there is a significant difference in mean maize yield across blocks at the 5% level of significance.
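
The same table should be reproduced by fitting the model with aov() and summarizing it; a minimal sketch on the same data (the object name two_way_aov is illustrative):

two_way_aov <- aov(maize_yields ~ treatment + blocks, data = my_2anova)
summary(two_way_aov)  # same treatment and blocks F tests as anova(modell)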

PEARSON CORRELATION

A Pearson correlation is used when the two variables we want to analyze are both quantitative. This means we will be comparing quantitative variables to find a linear relationship (if the variables have a nonlinear relationship, a Pearson correlation is not appropriate). The Pearson correlation measures the strength of the linear relationship between two variables. It takes a value between -1 and 1, with -1 meaning a perfect negative linear correlation, 0 meaning no linear correlation, and +1 meaning a perfect positive linear correlation.
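
As a quick illustration of the definition, the coefficient is simply the covariance of the two variables scaled by their standard deviations; a minimal sketch with made-up vectors x and y:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
cov(x, y) / (sd(x) * sd(y))  # identical to cor(x, y)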

Load the data set

data("gapminder")
head(gapminder,10)
# A tibble: 10 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.

Perform the pearson correlation

cor.test(gdpPercap, lifeExp, method = "pearson")

    Pearson's product-moment correlation

data:  gdpPercap and lifeExp
t = 29.658, df = 1702, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5515065 0.6141690
sample estimates:
      cor 
0.5837062 

In the output, “t” is the value of the test statistic (t = 29.658), the p-value of the test is far below 0.0001, the alternative hypothesis is a character string describing the alternative (true correlation is not equal to 0), and the sample estimate is the correlation coefficient, named cor for the Pearson correlation (cor = 0.5837). Since the p-value is well below 0.05, we reject the null hypothesis and conclude that there is a moderate positive correlation between life expectancy and GDP per capita at the 5% level of significance.

Additional Pearson correlation

library(haven)
Pearson <- read_sav("C:/Users/user/Downloads/Pearson.sav")
head(Pearson,10)
# A tibble: 10 × 2
     mpg weight
   <dbl>  <dbl>
 1    18   3504
 2    15   3693
 3    18   3436
 4    16   3433
 5    17   3449
 6    15   4341
 7    14   4354
 8    14   4312
 9    14   4425
10    15   3850
length(Pearson)
[1] 2
str(Pearson)
tibble [300 × 2] (S3: tbl_df/tbl/data.frame)
 $ mpg   : num [1:300] 18 15 18 16 17 15 14 14 14 15 ...
  ..- attr(*, "format.spss")= chr "F4.1"
 $ weight: num [1:300] 3504 3693 3436 3433 3449 ...
  ..- attr(*, "format.spss")= chr "F4.0"

Perform the test

attach(Pearson)
cor(Pearson)
              mpg     weight
mpg     1.0000000 -0.8782815
weight -0.8782815  1.0000000
cor.test(Pearson$mpg, Pearson$weight, method = "pearson", data=Pearson)

    Pearson's product-moment correlation

data:  Pearson$mpg and Pearson$weight
t = -31.709, df = 298, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9018288 -0.8495329
sample estimates:
       cor 
-0.8782815 

In the output, “t” is the value of the test statistic (t = -31.709), the p-value of the test is far below 0.0001, the alternative hypothesis is a character string describing the alternative (true correlation is not equal to 0), and the sample estimate is the correlation coefficient, named cor for the Pearson correlation (cor = -0.8783). We therefore reject the null hypothesis and conclude that the true correlation is not equal to zero. Further, the correlation coefficient shows that there is a strong negative correlation between mpg and weight at the 5% level of significance.
There is sufficient evidence to conclude that there is a significant linear relationship between mpg and weight because the correlation coefficient is significantly different from zero.

Scatter plot

plot(Pearson$mpg, Pearson$weight)

Inserting an image in Rmarkdown

knitr::include_graphics("C:/Users/user/Downloads/my_scatter.png")
My scatter plot showing negative correlation
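
The image above is included from a file saved earlier; a minimal sketch of how such a file could be produced from the same data (the file name my_scatter.png is illustrative):

png("my_scatter.png")                 # open a PNG graphics device
plot(Pearson$mpg, Pearson$weight,
     xlab = "mpg", ylab = "weight")   # draw the scatter plot
dev.off()                             # close the device and write the file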

LINEAR REGRESSION

Insert data in form of a table

library(devtools)
my_tbl <- tibble::tribble(
  ~ln_population, ~ln_life_exp, ~ln_gdp_cap,
     6.925587075,   1.45940757,    2.891786,
     6.965715868,   1.48190105,    2.914265,
     7.011447073,   1.50510926,       2.931,
     7.062129255,   1.53173431,    2.922309,
     7.116589814,   1.55736281,    2.869221,
     7.172613788,   1.58476078,    2.895485,
     7.109977092,   1.60047192,    2.990344,
     7.142012486,   1.61089428,    2.930641,
     7.212664826,   1.61986519,    2.812473,
     7.346888958,   1.62079169,    2.803007,
     7.402577829,   1.62458115,    2.861376,
     7.503653471,   1.64175165,    2.988818,
     6.108124079,   1.74217504,    3.204407,
     6.169234922,   1.77290819,    3.288313,
     6.237578169,   1.81170903,    3.364155,
     6.297554802,   1.82098918,     3.44094,
      6.35479086,   1.83052451,    3.520277,
      6.39950897,   1.83840828,    3.548144,
     6.444059949,   1.84769602,    3.560012
  )

require(knitr)
kable(my_tbl, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)
ln_population ln_life_exp ln_gdp_cap
6.926 1.459 2.892
6.966 1.482 2.914
7.011 1.505 2.931
7.062 1.532 2.922
7.117 1.557 2.869
7.173 1.585 2.895
7.110 1.600 2.990
7.142 1.611 2.931
7.213 1.620 2.812
7.347 1.621 2.803
7.403 1.625 2.861
7.504 1.642 2.989
6.108 1.742 3.204
6.169 1.773 3.288
6.238 1.812 3.364
6.298 1.821 3.441
6.355 1.831 3.520
6.400 1.838 3.548
6.444 1.848 3.560
attach(my_tbl)
my_model<-lm(ln_life_exp~ln_gdp_cap,data=my_tbl)
stargazer(my_model,report = "vc*stp",type = "text",out = "./q7results.txt")

===============================================
                        Dependent variable:    
                    ---------------------------
                            ln_life_exp        
-----------------------------------------------
ln_gdp_cap                   0.430***          
                              (0.050)          
                             t = 8.585         
                            p = 0.00000        
                                               
Constant                      0.328**          
                              (0.155)          
                             t = 2.111         
                             p = 0.050         
                                               
-----------------------------------------------
Observations                    19             
R2                             0.813           
Adjusted R2                    0.802           
Residual Std. Error       0.058 (df = 17)      
F Statistic           73.702*** (df = 1; 17)   
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
exp(0.328)
[1] 1.388189
my_tbl1 <- tibble::tribble(
  ~Year, ~Income, ~Consumption,
   1950,   791.8,        733.2,
   1951,     819,        748.7,
   1952,   844.3,        771.4,
   1953,     880,        802.5,
   1954,     894,        822.7,
   1955,   944.5,        873.8,
   1956,   989.4,        899.8,
   1957,  1012.1,        919.7,
   1958,  1028.8,        932.9,
   1959,  1067.2,        979.4,
   1960,  1091.1,       1005.1,
   1961,  1123.2,       1025.2,
   1962,  1170.2,         1069,
   1963,  1207.3,       1108.4,
   1964,    1291,       1170.6,
   1965,  1365.7,       1236.4,
   1966,  1431.3,       1298.9,
   1967,  1493.2,       1337.7,
   1968,  1551.3,       1405.9,
   1969,  1599.8,       1456.7,
   1970,  1688.1,         1492,
   1971,  1728.4,       1538.8,
   1972,  1797.4,       1621.9,
   1973,  1916.3,       1689.6,
   1974,  1896.6,         1674,
   1975,  1931.7,       1711.9,
   1976,    2001,       1803.9,
   1977,  2066.6,       1883.8,
   1978,  2167.4,         1961,
   1979,  2216.2,       2004.4
  )

require(knitr)
kable(my_tbl1, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)
Year Income Consumption
1950 791.8 733.2
1951 819.0 748.7
1952 844.3 771.4
1953 880.0 802.5
1954 894.0 822.7
1955 944.5 873.8
1956 989.4 899.8
1957 1012.1 919.7
1958 1028.8 932.9
1959 1067.2 979.4
1960 1091.1 1005.1
1961 1123.2 1025.2
1962 1170.2 1069.0
1963 1207.3 1108.4
1964 1291.0 1170.6
1965 1365.7 1236.4
1966 1431.3 1298.9
1967 1493.2 1337.7
1968 1551.3 1405.9
1969 1599.8 1456.7
1970 1688.1 1492.0
1971 1728.4 1538.8
1972 1797.4 1621.9
1973 1916.3 1689.6
1974 1896.6 1674.0
1975 1931.7 1711.9
1976 2001.0 1803.9
1977 2066.6 1883.8
1978 2167.4 1961.0
1979 2216.2 2004.4

View the data set

attach(my_tbl1)
tail(my_tbl1)
# A tibble: 6 × 3
   Year Income Consumption
  <dbl>  <dbl>       <dbl>
1  1974  1897.       1674 
2  1975  1932.       1712.
3  1976  2001        1804.
4  1977  2067.       1884.
5  1978  2167.       1961 
6  1979  2216.       2004.
head(my_tbl1)
# A tibble: 6 × 3
   Year Income Consumption
  <dbl>  <dbl>       <dbl>
1  1950   792.        733.
2  1951   819         749.
3  1952   844.        771.
4  1953   880         802.
5  1954   894         823.
6  1955   944.        874.
str(my_tbl)
tibble [19 × 3] (S3: tbl_df/tbl/data.frame)
 $ ln_population: num [1:19] 6.93 6.97 7.01 7.06 7.12 ...
 $ ln_life_exp  : num [1:19] 1.46 1.48 1.51 1.53 1.56 ...
 $ ln_gdp_cap   : num [1:19] 2.89 2.91 2.93 2.92 2.87 ...

Estimate the model (Linear Regression Model)

Statistical techniques are tools that enable us to answer questions about possible patterns in empirical data. It is not surprising, then, to learn that many important techniques of statistical analysis were developed by scientists who were interested in answering very specific empirical questions. So it was with regression analysis. The history of this particular statistical technique can be traced back to late nineteenth-century England and the pursuits of a gentleman scientist, Francis Galton. Galton was born into a wealthy family that produced more than its share of geniuses; he and Charles Darwin, the famous biologist, were first cousins. During his lifetime, Galton studied everything from fingerprint classification to meteorology, but he gained widespread recognition primarily for his work on inheritance. His most important insight came to him while he was studying the inheritance of one of the most obvious of all human characteristics: height. In order to understand how the characteristic of height was passed from one generation to the next, Galton collected data on the heights of individuals and the heights of their parents. After constructing frequency tables that classified these individuals both by their height and by the average height of their parents, Galton came to the unremarkable conclusion that tall people usually had tall parents and short people usually had short parents.

myy_model<-lm(Consumption~Income,data = my_tbl1)
stargazer(myy_model,report = "vc*stp",type = "text",out = "./q7results.txt")

===============================================
                        Dependent variable:    
                    ---------------------------
                            Consumption        
-----------------------------------------------
Income                       0.881***          
                              (0.006)          
                            t = 143.727        
                             p = 0.000         
                                               
Constant                     31.935***         
                              (9.004)          
                             t = 3.547         
                             p = 0.002         
                                               
-----------------------------------------------
Observations                    30             
R2                             0.999           
Adjusted R2                    0.999           
Residual Std. Error      14.858 (df = 28)      
F Statistic         20,657.590*** (df = 1; 28) 
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
myy_model<-lm(log(Consumption)~log(Income),data = my_tbl1)
stargazer(myy_model,report = "vc*stp",type = "text",out = "./q7results.txt")

===============================================
                        Dependent variable:    
                    ---------------------------
                         log(Consumption)      
-----------------------------------------------
log(Income)                  0.972***          
                              (0.006)          
                            t = 174.576        
                             p = 0.000         
                                               
Constant                      0.107**          
                              (0.040)          
                             t = 2.667         
                             p = 0.013         
                                               
-----------------------------------------------
Observations                    30             
R2                             0.999           
Adjusted R2                    0.999           
Residual Std. Error       0.010 (df = 28)      
F Statistic         30,476.920*** (df = 1; 28) 
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

Interpretation of the model

According to Relative Income Theory (RIT), the level of consumption expenditures is not determined by the absolute level of income but by the relative level of income, with the APC declining as relative income increases. More specifically, the RIT argues that the level of consumption spending is determined by the household’s level of current income relative to the highest level of income previously earned. Symbolically, it is shown:

knitr::include_graphics("consumption.png")
Symbolic Representation of RIT

where C represents the current level of consumption expenditures, Y, the current level of income, Y^d, the highest level of income previously earned, and a and b, represent numerical constants which relate income to consumption. From the above equation, we find that when the households experience a temporary and short-run increase in current income above its previous peak level of income, it increases its consumption expenditures by an amount which is less than proportional to the increase in current income.

Consequently, when current income rises relative to peak income, the APC declines and the increase in total consumption expenditures is not proportional to the increase in total income. Again, when a household experiences current and peak income growing by the same percentage amount, it increases its consumption expenditures by an amount which is proportional to the increase in current income.

Consequently, the APC remains constant and the increase in total consumption expenditure is proportional to the increase in total income. Thus, according to the RIT, changes in current consumption are not proportional to the changes in current income only when current income increases relative to previous peak income.

If current and peak income grow together, changes in consumption are always proportional to the changes in income. However, it must be noted that RIT works for decreases as well as increases in the level of current income. The RIT is fundamentally different from AIT. The RIT explains away the short-run consumption function as a result of temporary deviations in current income, while the AIT explains away the long-run consumption function as the result of factors other than income on consumption.

From the results above, autonomous consumption is about 31.935 US dollars. On the other hand, a one-unit change in income results in a change of about 0.881 units in consumption (the marginal propensity to consume). The effect is statistically significant at a 1% level of significance.
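
As a quick check of this interpretation, the level specification can be re-fitted and used to predict consumption at a chosen income; a minimal sketch (the object name level_model is illustrative):

level_model <- lm(Consumption ~ Income, data = my_tbl1)
predict(level_model, data.frame(Income = 2000))  # roughly 31.935 + 0.881 * 2000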

Prediction

Pre<-predict(myy_model,data.frame(Income=2000))
Pre
       1 
7.491265 
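
Note that myy_model was last re-estimated as the log-log specification, so the predicted value above is on the log scale; a minimal sketch of the back-transformation:

exp(Pre)  # converts the predicted log(Consumption) back to the level of consumption
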
my_tbl2021 <- tibble::tribble(
  ~YEAR,    ~GDP,   ~PIV,      ~ICT,      ~VAT,   ~IMPTT,
   1973,   15790,   3645,    1124.8,     639.8,   795.44,
   1974,   18776,   4075,    1531.4,    937.26,   842.24,
   1975,   21140,   4837,    1796.8,   1185.48,   983.62,
   1976,   25562,   5808,    2149.4,   1308.44,  1057.18,
   1977,   32699,   7800,    2846.8,   1855.26,  2083.94,
   1978,   35601,  10280,    3121.4,   1995.38,  2025.48,
   1979,   39543,  10809,      3437,   3098.14,  2049.64,
   1980,   44648,  12451,    3951.6,      3588,     2919,
   1981,   51641,  14508,    3993.4,    3895.9,  3674.24,
   1982,   58214,  13364,    4624.6,    3917.5,  3305.84,
   1983,   66218,  14349,      5023,   5062.38,  3424.38,
   1984,   72550,  16143,    6019.4,      5471,  3043.58,
   1985,  100831,  17631,    7162.4,   6065.86,   4236.8,
   1986,  117472,  23064,    7714.6,    7950.4,   4934.2,
   1987,  131169,  25735,    9089.6,  10399.14,  5473.72,
   1988,  151194,  30359,   10240.4,   9774.92,  10240.5,
   1989,  171589,  33056,     11983,   9461.94, 11983.06,
   1990,  195536,  40560,   14261.6,  12140.28, 14261.56,
   1991,  224232,  42671,   17027.8,   18555.4,  5118.78,
   1992,  264476,  43777,   19970.4,  22142.72,     9183,
   1993,  333616,  56580,   36767.2,  28994.34, 14792.78,
   1994,  400700,  75616,   43505.8,  24533.86, 18598.28,
   1995,  465654,  99497,   48082.3,  28403.72, 21175.68,
   1996,  687998, 110142,  48375.02,  29850.08, 22594.06,
   1997,  770312, 118535,   55577.9,   34468.1, 24567.06,
   1998,  850808, 133366,   55234.9,  90204.76, 28443.93,
   1999,  906928, 141403,  53316.99,  40944.11, 28605.16,
   2000,  967838, 161714,  53428.93,  50220.92, 28803.74,
   2001, 1020020, 185186,  55861.95,  50871.68,    27302,
   2002, 1035370, 178466,  70140.28,  56135.25, 24936.09,
   2003, 1138060, 179254,  77409.73,  58853.08,    25214,
   2004, 1286460, 207196,  99312.47,  75995.66, 23531.72,
   2005, 1445480, 264912, 114629.06,  79925.91, 20511.43,
   2006, 1642400, 309402,    130719,  96497.01,    27927,
   2007, 1833511, 309781,    165078, 111904.51, 32944.35,
   2008, 2111173, 239525,    194155,    126854,    36181,
   2009, 2365453, 292488,    288168,    146791,    41372,
   2010, 2551161, 238938,    268291,    172360,    48903,
   2011, 3725918, 213845,  290621.6, 163940.39, 48436.93,
   2012, 4261370, 242365, 353693.71, 174716.17, 55397.81,
   2013, 4745594, 234586, 422357.86, 218297.19, 61692.72,
   2014, 5403471, 273641, 470101.98,  243156.2, 70245.12,
   2015, 6284191, 293645, 540440.43,  276504.4, 75410.29,
   2016, 7194163, 261548, 589921.37, 287766.52, 86329.96,
   2017, 7749435, 269542, 557959.32,  309977.4, 85243.79,
   2018, 7946248, 256345, 683377.33,  381419.9, 95354.98
  )

require(knitr)
kable(my_tbl2021, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)
YEAR GDP PIV ICT VAT IMPTT
1973 15790 3645 1124.80 639.80 795.44
1974 18776 4075 1531.40 937.26 842.24
1975 21140 4837 1796.80 1185.48 983.62
1976 25562 5808 2149.40 1308.44 1057.18
1977 32699 7800 2846.80 1855.26 2083.94
1978 35601 10280 3121.40 1995.38 2025.48
1979 39543 10809 3437.00 3098.14 2049.64
1980 44648 12451 3951.60 3588.00 2919.00
1981 51641 14508 3993.40 3895.90 3674.24
1982 58214 13364 4624.60 3917.50 3305.84
1983 66218 14349 5023.00 5062.38 3424.38
1984 72550 16143 6019.40 5471.00 3043.58
1985 100831 17631 7162.40 6065.86 4236.80
1986 117472 23064 7714.60 7950.40 4934.20
1987 131169 25735 9089.60 10399.14 5473.72
1988 151194 30359 10240.40 9774.92 10240.50
1989 171589 33056 11983.00 9461.94 11983.06
1990 195536 40560 14261.60 12140.28 14261.56
1991 224232 42671 17027.80 18555.40 5118.78
1992 264476 43777 19970.40 22142.72 9183.00
1993 333616 56580 36767.20 28994.34 14792.78
1994 400700 75616 43505.80 24533.86 18598.28
1995 465654 99497 48082.30 28403.72 21175.68
1996 687998 110142 48375.02 29850.08 22594.06
1997 770312 118535 55577.90 34468.10 24567.06
1998 850808 133366 55234.90 90204.76 28443.93
1999 906928 141403 53316.99 40944.11 28605.16
2000 967838 161714 53428.93 50220.92 28803.74
2001 1020020 185186 55861.95 50871.68 27302.00
2002 1035370 178466 70140.28 56135.25 24936.09
2003 1138060 179254 77409.73 58853.08 25214.00
2004 1286460 207196 99312.47 75995.66 23531.72
2005 1445480 264912 114629.06 79925.91 20511.43
2006 1642400 309402 130719.00 96497.01 27927.00
2007 1833511 309781 165078.00 111904.51 32944.35
2008 2111173 239525 194155.00 126854.00 36181.00
2009 2365453 292488 288168.00 146791.00 41372.00
2010 2551161 238938 268291.00 172360.00 48903.00
2011 3725918 213845 290621.60 163940.39 48436.93
2012 4261370 242365 353693.71 174716.17 55397.81
2013 4745594 234586 422357.86 218297.19 61692.72
2014 5403471 273641 470101.98 243156.20 70245.12
2015 6284191 293645 540440.43 276504.40 75410.29
2016 7194163 261548 589921.37 287766.52 86329.96
2017 7749435 269542 557959.32 309977.40 85243.79
2018 7946248 256345 683377.33 381419.90 95354.98

View the data set

head(my_tbl2021,10)
# A tibble: 10 × 6
    YEAR   GDP   PIV   ICT   VAT IMPTT
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  1973 15790  3645 1125.  640.  795.
 2  1974 18776  4075 1531.  937.  842.
 3  1975 21140  4837 1797. 1185.  984.
 4  1976 25562  5808 2149. 1308. 1057.
 5  1977 32699  7800 2847. 1855. 2084.
 6  1978 35601 10280 3121. 1995. 2025.
 7  1979 39543 10809 3437  3098. 2050.
 8  1980 44648 12451 3952. 3588  2919 
 9  1981 51641 14508 3993. 3896. 3674.
10  1982 58214 13364 4625. 3918. 3306.

Descriptive Statistics

summary(my_tbl2021)
      YEAR           GDP               PIV              ICT        
 Min.   :1973   Min.   :  15790   Min.   :  3645   Min.   :  1125  
 1st Qu.:1984   1st Qu.:  79620   1st Qu.: 16515   1st Qu.:  6305  
 Median :1996   Median : 576826   Median :104820   Median : 48229  
 Mean   :1996   Mean   :1542657   Mean   :124401   Mean   :128339  
 3rd Qu.:2007   3rd Qu.:1785733   3rd Qu.:237850   3rd Qu.:156488  
 Max.   :2018   Max.   :7946248   Max.   :309781   Max.   :683377  
      VAT               IMPTT        
 Min.   :   639.8   Min.   :  795.4  
 1st Qu.:  5619.7   1st Qu.: 3814.9  
 Median : 29422.2   Median :20843.6  
 Mean   : 75848.5   Mean   :25351.1  
 3rd Qu.:108052.6   3rd Qu.:31909.2  
 Max.   :381419.9   Max.   :95355.0  

Attach the data set

attach(my_tbl2021)

Perform the multiple regression model

MDL<-lm(log(PIV)~log(ICT)+log(VAT)+log(IMPTT),data=my_tbl2021)
stargazer(MDL,report = "vc*stp",type = "text",out = "./q7results.txt")

===============================================
                        Dependent variable:    
                    ---------------------------
                             log(PIV)          
-----------------------------------------------
log(ICT)                      -0.135           
                              (0.177)          
                            t = -0.761         
                             p = 0.452         
                                               
log(VAT)                     0.647***          
                              (0.211)          
                             t = 3.060         
                             p = 0.004         
                                               
log(IMPTT)                    0.356**          
                              (0.162)          
                             t = 2.193         
                             p = 0.034         
                                               
Constant                     2.532***          
                              (0.398)          
                             t = 6.357         
                            p = 0.00000        
                                               
-----------------------------------------------
Observations                    46             
R2                             0.958           
Adjusted R2                    0.955           
Residual Std. Error       0.303 (df = 42)      
F Statistic           317.211*** (df = 3; 42)  
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

Model interpretation

The Effect of Value Added Tax (VAT) on Private Investment.

From the analysis it was found that value added tax has a significantly positive effect on the level of private investment in Kenya. This conclusion rests on the fact that value added tax has a positive and significant coefficient. Other studies show that an increase in value added tax increases the level of private investment (Adejare, 2017), whereas according to Gitahi (2010), value added tax negatively influences the level of private investment. From the regression output above, value added tax has a statistically significant effect on the level of private investment, contrary to the null hypothesis that value added tax has no statistically significant effect on the level of private investment. We therefore reject the null hypothesis and conclude that value added tax has a statistically significant effect on the level of private investment at the 5% level of significance. Because the model is estimated in logs, the coefficient of 0.647 indicates that a 1% increase in value added tax is associated with approximately a 0.647% increase in the level of private investment.

The Effect of Income Tax on Private Investment.

From the regression output above, the coefficient on income tax is negative, suggesting that income tax dampens the level of private investment. The argument here is that an increase in the rate of income tax constrains private investment. The assumption, following Keynesian economists, is that an increase in income tax reduces disposable income; a decrease in disposable income implies a decrease in saving, and whenever saving decreases, private investment decreases, because Keynes (1936) argued that in the long run savings equal investment. An increase in income tax charged results in a decrease in the level of private investment in a country (Gitahi, 2010). In this estimation, however, the effect is not statistically significant: the p-value for income tax is 0.452, which is greater than 0.05, so we fail to reject the null hypothesis that income tax has no statistically significant effect on the level of private investment at the 5% level of significance. To the extent that the Kenyan government subjects its citizens to higher income tax rates than other countries in the region, this may accelerate capital flight to other countries and shrink the level of private investment in Kenya.

The Effect of Import Tax on private Investment

The effect of import tax on private investment was found to be positive and statistically significant. From the existing literature, the results show that import tax encouraged rather than deterred private investment. Newbery (1987) and Coady (1997) argued that import taxes and barriers on international trade of any sort tend to encourage domestic production of final consumer goods while permitting relatively free imports of capital or intermediate goods. This tends to be associated with high rates of effective protection, high costs of domestic production, and a bias against exports. Import tax charged on final consumer goods leads to the production of import substitutes. The production of import substitutes in the country allows local business enterprises to expand, leading to an increase in the level of private investment (Gitahi, 2010). Consequently, while reducing the dependence of the country on imports of final consumption goods, the economy becomes highly dependent on imports of intermediate goods. This scenario leads to the expansion of local business enterprises and consequently increases aggregate demand, which is reflected in the positive effect of import tax on private investment. From the regression output above, import tax is statistically significant, shown by the p-value of 0.034, which is less than 0.05, implying that we reject the null hypothesis and conclude that import tax has a statistically significant effect on the level of private investment in Kenya at the 5% level of significance.

Extract the residuals

RESID<-residuals(MDL)

Summarize the residuals

mean(RESID)
[1] -0.00000000000000001027399

Covariance

RESID and log(ICT)

cov(RESID,log(ICT))
[1] 0.0000000000000000004384995

RESID and log(VAT)

cov(RESID,log(VAT))
[1] -0.000000000000000001207078

RESID and log(IMPTT)

cov(RESID,log(IMPTT))
[1] 0.00000000000000000800864
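
A near-zero mean of the residuals and near-zero covariances between the residuals and each regressor hold by construction for OLS with an intercept, so these values mainly confirm the fit was computed correctly. For a visual check of the remaining assumptions, the standard diagnostic plots can be drawn from the fitted object; a minimal sketch:

plot(MDL, 1)  # residuals vs fitted values
plot(MDL, 2)  # normal Q-Q plot of the residuals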

NON PARAMETRIC TESTS

TEST OF VARIANCE EQUALITY

Import the data set

Rehearsal_and_non_rehearsal_groups <- read_sav("D:/Training/Data/Rehearsal and non rehearsal groups.sav")
head(Rehearsal_and_non_rehearsal_groups,10)
# A tibble: 10 × 2
   No.ofwordsrecalled Groups             
                <dbl> <dbl+lbl>          
 1                 24 1 [Rehearsal group]
 2                 21 1 [Rehearsal group]
 3                 19 1 [Rehearsal group]
 4                 27 1 [Rehearsal group]
 5                 16 1 [Rehearsal group]
 6                 15 1 [Rehearsal group]
 7                 27 1 [Rehearsal group]
 8                 18 1 [Rehearsal group]
 9                 22 1 [Rehearsal group]
10                 26 1 [Rehearsal group]

Rename the data set

data222<-Rehearsal_and_non_rehearsal_groups

View the data set

head(data222,10)
# A tibble: 10 × 2
   No.ofwordsrecalled Groups             
                <dbl> <dbl+lbl>          
 1                 24 1 [Rehearsal group]
 2                 21 1 [Rehearsal group]
 3                 19 1 [Rehearsal group]
 4                 27 1 [Rehearsal group]
 5                 16 1 [Rehearsal group]
 6                 15 1 [Rehearsal group]
 7                 27 1 [Rehearsal group]
 8                 18 1 [Rehearsal group]
 9                 22 1 [Rehearsal group]
10                 26 1 [Rehearsal group]

Attach the data set

attach(data222)

Test of variance equality

var.test(No.ofwordsrecalled~Groups, ratio=1,
         alternative=c("two.sided","less","greater"),
         conf.level=0.95, data=data222)

    F test to compare two variances

data:  No.ofwordsrecalled by Groups
F = 1.1019, num df = 9, denom df = 9, p-value = 0.8875
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.2736844 4.4360491
sample estimates:
ratio of variances 
          1.101852 

In the above output, the p-value of the test is 0.8875, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis of equal variances; there is no evidence that the variances of the two groups differ (p-value = 0.8875).

KRUSKAL WALLIS TEST

The Kruskal-Wallis test, proposed by Kruskal and Wallis in 1952, is a nonparametric method for testing whether samples originate from the same distribution. It extends the Mann-Whitney U test to more than two groups. The null hypothesis of the Kruskal-Wallis test is that the mean ranks of the groups are the same. As the nonparametric equivalent of one-way ANOVA, the Kruskal-Wallis test is often called one-way ANOVA on ranks. Unlike the analogous one-way ANOVA, it does not assume a normal distribution of the underlying data, which also makes it well suited to data such as post-sequencing microbiome measurements that are often not normally distributed and contain strong outliers; using ranks rather than actual values prevents the test from being unduly affected by outliers or non-normality.

The Kruskal Wallis test is the non parametric alternative to the One Way ANOVA. Non parametric means that the test doesn’t assume your data comes from a particular distribution. The H test is used when the assumptions for ANOVA aren’t met (like the assumption of normality). It is sometimes called the one-way ANOVA on ranks, as the ranks of the data values are used in the test rather than the actual data points.

H_test_miles_per_gassolin <- read_sav("D:/Data/data and other training stuff/SPSS_training data sets/H-test ,miles per gassolin.sav")

Rename the data set

data333<-H_test_miles_per_gassolin

View the data set

head(data333,10)
# A tibble: 10 × 2
   miles gassolin 
   <dbl> <dbl+lbl>
 1    20 1 [A]    
 2    31 1 [A]    
 3    24 1 [A]    
 4    33 1 [A]    
 5    23 1 [A]    
 6    24 1 [A]    
 7    28 1 [A]    
 8    16 1 [A]    
 9    19 1 [A]    
10    26 1 [A]    

Attach the data set

attach(data333)

Change gassolin to a factor

data333$gassolin <- as.factor(data333$gassolin)

KRUSKAL WALLIS TEST

kruskal.test(miles~gassolin, data=data333)

    Kruskal-Wallis rank sum test

data:  miles by gassolin
Kruskal-Wallis chi-squared = 0.86831, df = 2, p-value = 0.6478

In the above output, the p-value of the test is 0.6478, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis; the median miles covered with the three types of gasoline are not significantly different (p-value = 0.6478).
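
Had the test been significant, pairwise comparisons with a multiplicity correction would be the natural follow-up; a minimal sketch using base R:

pairwise.wilcox.test(data333$miles, data333$gassolin,
                     p.adjust.method = "bonferroni")  # all pairwise Wilcoxon comparisons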

WILCOXON PAIRED SAMPLE

The paired samples Wilcoxon test (also known as Wilcoxon signed-rank test) is a non-parametric alternative to paired t-test used to compare paired data. It’s used when your data are not normally distributed. This tutorial describes how to compute paired samples Wilcoxon test in R.

before <-c(190.1, 190.9, 172.7, 213, 231.4,
        196.9, 172.2, 285.5, 225.2, 113.7)
 
after <-c(392.9, 313.2, 345.1, 393, 434,
        227.9, 422, 383.9, 392.3, 352.2)

Data frame the data set

myData <- data.frame(
group = rep(c("before", "after"), each = 10),
weight = c(before, after)
)

View the data set

head(myData,10)
    group weight
1  before  190.1
2  before  190.9
3  before  172.7
4  before  213.0
5  before  231.4
6  before  196.9
7  before  172.2
8  before  285.5
9  before  225.2
10 before  113.7

Attach the data set

attach(myData)

Perform the wilcoxon test (two.sided test); note that without paired = TRUE the call below runs the unpaired rank sum version

wilcox.test(weight~group, data=myData)

    Wilcoxon rank sum exact test

data:  weight by group
W = 98, p-value = 0.0000433
alternative hypothesis: true location shift is not equal to 0

In the above output, the p-value of the test is 0.0000433, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the weights of the mice before treatment differ significantly from the weights after treatment (p-value = 0.0000433).

Perform the wilcoxon test for paired data (one.sided test to the left)

result = wilcox.test(weight ~ group,
                    data = myData,
                    paired = TRUE,
                    alternative = "less")
 
print(result)

    Wilcoxon signed rank exact test

data:  weight by group
V = 55, p-value = 1
alternative hypothesis: true location shift is less than 0

In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis; there is no evidence that the median weight after treatment is less than the median weight before treatment (p-value = 1). (Because the group factor is ordered alphabetically, the paired differences are taken as after minus before.)

Perform the wilcoxon test for paired data (one.sided test to the right)

result2 = wilcox.test(weight ~ group,
                    data = myData,
                    paired = TRUE,
                    alternative = "greater")

print(result2)

    Wilcoxon signed rank exact test

data:  weight by group
V = 55, p-value = 0.0009766
alternative hypothesis: true location shift is greater than 0

In the above output, the p-value of the test is 0.0009766, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the median weight after treatment is significantly greater than the median weight before treatment (p-value = 0.0009766).

Perform the wilcoxon test for one sample (two.sided)

weight <-c(27.6,30.6,32.2,25.3,30.9,31.0,28.9,28.9,28.9,28.2)

Perform the test

result3 = wilcox.test(weight, mu = 25)
print(result3)

    Wilcoxon signed rank test with continuity correction

data:  weight
V = 55, p-value = 0.005793
alternative hypothesis: true location is not equal to 25

In the above output, the p-value of the test is 0.005793, which is less than the significance level alpha = 0.05, so we reject the null hypothesis and conclude that the true median weight is not equal to 25.

One-sided test to the left (applied here to the combined weight column of myData)

wilcox.test(myData$weight, mu = 25,
            alternative = "less")

    Wilcoxon signed rank exact test

data:  myData$weight
V = 210, p-value = 1
alternative hypothesis: true location is less than 25

In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis; there is no evidence that the median weight is less than 25 (p-value = 1).

One-sided test to the right

wilcox.test(myData$weight, mu = 25,
            alternative = "greater")

    Wilcoxon signed rank exact test

data:  myData$weight
V = 210, p-value = 0.0000009537
alternative hypothesis: true location is greater than 25

In the above output, the p-value of the test is approximately 0.0000009537, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the median weight is significantly greater than 25.

SIGN TEST

The sign test is used to test the null hypothesis that the median of a distribution is equal to some value. It can be used a) in place of a one-sample t-test b) in place of a paired t-test or c) for ordered categorical data where a numerical scale is inappropriate but where it is possible to rank the observations.

Perform the sign test for one sample (two.sided)

weight <-c(27.6,30.6,32.2,25.3,30.9,31.0,28.9,28.9,28.9,28.2)

Perform the test

Load the package BSDA

library(BSDA)
result4 = SIGN.test(weight, mu = 25)
print(result4)

    One-sample Sign-Test

data:  weight
s = 10, p-value = 0.001953
alternative hypothesis: true median is not equal to 0
95 percent confidence interval:
 27.79467 30.96756
sample estimates:
median of x 
       28.9 

Achieved and Interpolated Confidence Intervals: 

                  Conf.Level  L.E.pt  U.E.pt
Lower Achieved CI     0.8906 28.2000 30.9000
Interpolated CI       0.9500 27.7947 30.9676
Upper Achieved CI     0.9785 27.6000 31.0000
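
Note that SIGN.test() from BSDA takes its hypothesized median through the md argument rather than mu, so the mu = 25 above is silently ignored and the printout shows a test against a median of 0. A minimal sketch of the intended test against 25:

SIGN.test(weight, md = 25)  # one-sample sign test of H0: median = 25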

Sign Test of paired data

before <-c(190.1, 190.9, 172.7, 213, 231.4,
        196.9, 172.2, 285.5, 225.2, 113.7)
 
after <-c(392.9, 313.2, 345.1, 393, 434,
        227.9, 422, 383.9, 392.3, 352.2)

Bind the data

myDatta<-data.frame(before,after)
head(myDatta,10)
   before after
1   190.1 392.9
2   190.9 313.2
3   172.7 345.1
4   213.0 393.0
5   231.4 434.0
6   196.9 227.9
7   172.2 422.0
8   285.5 383.9
9   225.2 392.3
10  113.7 352.2

Paired sample sign test to the right

SIGN.test(before,after,md=0, alternative ="greater",paired=T, data=myDatta)

    Dependent-samples Sign-Test

data:  before and after
S = 0, p-value = 1
alternative hypothesis: true median difference is greater than 0
95 percent confidence interval:
 -206.608      Inf
sample estimates:
median of x-y 
       -176.2 

Achieved and Interpolated Confidence Intervals: 

                  Conf.Level   L.E.pt U.E.pt
Lower Achieved CI     0.9453 -202.800    Inf
Interpolated CI       0.9500 -206.608    Inf
Upper Achieved CI     0.9893 -238.500    Inf

In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis; there is no evidence that the true median difference (before minus after) is greater than zero (p-value = 1).

Paired sample sign test to the left

SIGN.test(before,after,md=0, alternative ="less",paired=T, data=myDatta)

    Dependent-samples Sign-Test

data:  before and after
S = 0, p-value = 0.0009766
alternative hypothesis: true median difference is less than 0
95 percent confidence interval:
      -Inf -119.7507
sample estimates:
median of x-y 
       -176.2 

Achieved and Interpolated Confidence Intervals: 

                  Conf.Level L.E.pt    U.E.pt
Lower Achieved CI     0.9453   -Inf -122.3000
Interpolated CI       0.9500   -Inf -119.7507
Upper Achieved CI     0.9893   -Inf  -98.4000

In the above output, the p-value of the test is 0.0009766, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true median difference (before minus after) is less than zero (p-value = 0.0009766).

Paired sample sign test (two sided)

SIGN.test(before,after,md=0, alternative ="two.sided",paired=T, data=myDatta)

    Dependent-samples Sign-Test

data:  before and after
S = 0, p-value = 0.001953
alternative hypothesis: true median difference is not equal to 0
95 percent confidence interval:
 -226.9173 -106.1542
sample estimates:
median of x-y 
       -176.2 

Achieved and Interpolated Confidence Intervals: 

                  Conf.Level    L.E.pt    U.E.pt
Lower Achieved CI     0.8906 -202.8000 -122.3000
Interpolated CI       0.9500 -226.9173 -106.1542
Upper Achieved CI     0.9785 -238.5000  -98.4000

In the above output, the p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We can conclude that the true median difference between weight before and after is not equal to zero with a p-value = 0.001953.

MANN-WHITNEY WILCOXON TEST

The Mann-Whitney U test is used to compare differences between two independent groups when the dependent variable is either ordinal or continuous, but not normally distributed. For example, you could use the Mann-Whitney U test to understand whether attitudes towards pay discrimination, where attitudes are measured on an ordinal scale, differ based on gender (i.e., your dependent variable would be “attitudes towards pay discrimination” and your independent variable would be “gender”, which has two groups: “male” and “female”). Alternately, you could use the Mann-Whitney U test to understand whether salaries, measured on a continuous scale, differed based on educational level (i.e., your dependent variable would be “salary” and your independent variable would be “educational level”, which has two groups: “high school” and “university”). The Mann-Whitney U test is often considered the nonparametric alternative to the independent t-test although this is not always the case.

Unlike the independent-samples t-test, the Mann-Whitney U test allows you to draw different conclusions about your data depending on the assumptions you make about your data’s distribution. These conclusions can range from simply stating whether the two populations differ through to determining if there are differences in medians between groups. These different conclusions hinge on the shape of the distributions of your data, which we explain more about later.

Create the data set

red_bulb <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8)
orange_bulb <- c(47.8, 60, 63.4, 76, 89.4, 67.3, 61.3, 62.4)

Combine the prices into one column and create the bulb type labels

BULB_PRICE = c(red_bulb, orange_bulb)
BULB_TYPE = rep(c("red", "orange"), each = 8)

Create the data frame

DATASET <- data.frame(BULB_TYPE, BULB_PRICE, stringsAsFactors = TRUE)

Print the data frame

DATASET
   BULB_TYPE BULB_PRICE
1        red       38.9
2        red       61.2
3        red       73.3
4        red       21.8
5        red       63.4
6        red       64.6
7        red       48.4
8        red       48.8
9     orange       47.8
10    orange       60.0
11    orange       63.4
12    orange       76.0
13    orange       89.4
14    orange       67.3
15    orange       61.3
16    orange       62.4

Conduct the test

RSLT<-wilcox.test(BULB_PRICE~ BULB_TYPE,
                   data = DATASET,
                   exact = FALSE)
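
The test result is stored in RSLT but not displayed above; printing the object shows the W statistic and the p-value for the comparison of bulb prices by bulb type:

RSLT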

SPEARMAN’S RANK CORRELATION

Spearman’s rank correlation is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described by a monotonic function.

Import the data set

My_data<-read.csv("C:\\Users\\user\\Downloads\\gapminder.csv")
head(My_data,10)
       country continent year lifeExp      pop gdpPercap
1  Afghanistan      Asia 1952  28.801  8425333  779.4453
2  Afghanistan      Asia 1957  30.332  9240934  820.8530
3  Afghanistan      Asia 1962  31.997 10267083  853.1007
4  Afghanistan      Asia 1967  34.020 11537966  836.1971
5  Afghanistan      Asia 1972  36.088 13079460  739.9811
6  Afghanistan      Asia 1977  38.438 14880372  786.1134
7  Afghanistan      Asia 1982  39.854 12881816  978.0114
8  Afghanistan      Asia 1987  40.822 13867957  852.3959
9  Afghanistan      Asia 1992  41.674 16317921  649.3414
10 Afghanistan      Asia 1997  41.763 22227415  635.3414

Attach the data set

attach(My_data)

Perform the test

res <- cor.test(gdpPercap, lifeExp, method = "spearman")
res

    Spearman's rank correlation rho

data:  gdpPercap and lifeExp
S = 143096490, p-value < 0.00000000000000022
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.8264712 

In the output, S is the value of the test statistic (S = 143096490), the p-value of the test is far below 0.0001, the alternative hypothesis is a character string describing the alternative (true rho is not equal to 0), and the sample estimate is the correlation coefficient, named rho for the Spearman correlation (rho ≈ 0.8265). From the results we therefore reject the null hypothesis and conclude that life expectancy and GDP per capita are dependent, i.e. the true rho is not equal to 0.

MATRIX

A matrix is a set of numbers arranged in rows and columns so as to form a rectangular array. The numbers are called the elements, or entries, of the matrix. Matrices have wide applications in engineering, physics, economics, and statistics, as well as in various branches of mathematics. Matrices also have important applications in computer graphics, where they have been used to represent rotations and other transformations of images.

An example of a matrix

A<-matrix(c(4,19,16,21,14,8,10,18,14),nrow=3,ncol=3,byrow=T)
A
     [,1] [,2] [,3]
[1,]    4   19   16
[2,]   21   14    8
[3,]   10   18   14

Determinant of a matrix (2x2)

The determinant is a special number that can be calculated from a matrix. The matrix has to be square (same number of rows and columns) like this one:

Mat<-matrix(c(3,8,4,6),nrow=2,byrow = T)
Mat
     [,1] [,2]
[1,]    3    8
[2,]    4    6

\[ (3\times 6)-(8\times 4)= 18-32 = -14 \]

Alternatively

det(Mat)
[1] -14

Determinant of a matrix (3x3)

mtrx <- matrix( c(-1, -3, -5, -7, -8, 9, 11, 13, 15),nrow = 3,byrow = T)
mtrx
     [,1] [,2] [,3]
[1,]   -1   -3   -5
[2,]   -7   -8    9
[3,]   11   13   15

the determinant

det(mtrx)
[1] -360
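
As a check, the cofactor expansion along the first row gives the same value:

\[ \det = (-1)\big[(-8)(15)-(9)(13)\big] - (-3)\big[(-7)(15)-(9)(11)\big] + (-5)\big[(-7)(13)-(-8)(11)\big] = 237 - 612 + 15 = -360 \]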

Matrix Operation

3x3 Matrix

A<-matrix(c(4,19,16,21,14,8,10,18,14),nrow=3,ncol=3,byrow=T)
A
     [,1] [,2] [,3]
[1,]    4   19   16
[2,]   21   14    8
[3,]   10   18   14
B<-matrix(c(161,191,123,165,125,155,100,200,155),nrow=3,ncol=3,byrow=T)
B
     [,1] [,2] [,3]
[1,]  161  191  123
[2,]  165  125  155
[3,]  100  200  155

Inverse of the matrix B

solve(B)
             [,1]         [,2]         [,3]
[1,]  0.009121582  0.003927184 -0.011165601
[2,]  0.007905371 -0.009929774  0.003656479
[3,] -0.016085370  0.010278944  0.008937189

Verify the inverse: B multiplied by its inverse should give the identity matrix

zapsmall(B%*%solve(B))
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
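
A common use of the inverse is solving a linear system B x = b. In R it is preferable to pass the right-hand side directly to solve() rather than multiplying by the inverse; the vector b below is a hypothetical example:

b <- c(1, 2, 3)       # hypothetical right-hand side
x <- solve(B, b)      # solves B %*% x = b
zapsmall(B %*% x)     # returns b, confirming the solution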

matrix operation

addition

A+B
     [,1] [,2] [,3]
[1,]  165  210  139
[2,]  186  139  163
[3,]  110  218  169

subtraction

B-A
     [,1] [,2] [,3]
[1,]  157  172  107
[2,]  144  111  147
[3,]   90  182  141

Matrix multiplication (%*%)

A%*%B
     [,1] [,2] [,3]
[1,] 5379 6339 5917
[2,] 6491 7361 5993
[3,] 5980 6960 6190

Division (element-wise)

B/A
          [,1]      [,2]     [,3]
[1,] 40.250000 10.052632  7.68750
[2,]  7.857143  8.928571 19.37500
[3,] 10.000000 11.111111 11.07143
M1<-matrix(c(16,121,166,121,129,177,163,135,198),nrow=3,ncol=3,byrow=T)
M1
     [,1] [,2] [,3]
[1,]   16  121  166
[2,]  121  129  177
[3,]  163  135  198
M2<-matrix(c(100,170,188,131,175,178,151,199,502),nrow=3,ncol=3,byrow=T)
M2
     [,1] [,2] [,3]
[1,]  100  170  188
[2,]  131  175  178
[3,]  151  199  502
S<-solve(A)
S
      [,1]  [,2]  [,3]
[1,] -1.04 -0.44  1.44
[2,]  4.28  2.08 -6.08
[3,] -4.76 -2.36  6.86
solve(B)
             [,1]         [,2]         [,3]
[1,]  0.009121582  0.003927184 -0.011165601
[2,]  0.007905371 -0.009929774  0.003656479
[3,] -0.016085370  0.010278944  0.008937189

matrix binding

column binding

cbind(M1,M2)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   16  121  166  100  170  188
[2,]  121  129  177  131  175  178
[3,]  163  135  198  151  199  502

Row binding

rbind(M1,M2)
     [,1] [,2] [,3]
[1,]   16  121  166
[2,]  121  129  177
[3,]  163  135  198
[4,]  100  170  188
[5,]  131  175  178
[6,]  151  199  502

identity matrix

zapsmall(A%*%S)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

Additional Matrix

S<-matrix(c(2.879,10.002,-1.810,10.002,199.798,-5.627,-1.810,-5.627,3.628),nrow=3)
S
       [,1]    [,2]   [,3]
[1,]  2.879  10.002 -1.810
[2,] 10.002 199.798 -5.627
[3,] -1.810  -5.627  3.628
S_inv<-solve(S)
S_inv
            [,1]         [,2]         [,3]
[1,]  0.58648233 -0.022083814  0.258342724
[2,] -0.02208381  0.006065228 -0.001610437
[3,]  0.25834272 -0.001610437  0.402022712

Identity matrix

zapsmall(S%*%S_inv)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

4x4 Matrix

R<-matrix(c(4,3,4,5,8,4,7,8,2,5,3,8,9,5,5,6),nrow = 4)
R
     [,1] [,2] [,3] [,4]
[1,]    4    8    2    9
[2,]    3    4    5    5
[3,]    4    7    3    5
[4,]    5    8    8    6

The determinant

det(R)
[1] -71

Find the inverse

solve(R)
            [,1]       [,2]       [,3]        [,4]
[1,] -0.87323944  1.5211268  1.6901408 -1.36619718
[2,]  0.33802817 -0.9436620 -0.4929577  0.69014085
[3,]  0.07042254 -0.1549296 -0.3943662  0.35211268
[4,]  0.18309859  0.1971831 -0.2253521 -0.08450704

Find the identity matrix

zapsmall(R%*%solve(R))
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1

Giving the Column and Row Names

my_mat<-matrix(c(25,22,18,26,12,16,24,20,26,28,30,28,12,
                 17,16,17,10,15,16,20,14,19,17,14),ncol=3,nrow = 8)
my_mat
     [,1] [,2] [,3]
[1,]   25   26   10
[2,]   22   28   15
[3,]   18   30   16
[4,]   26   28   20
[5,]   12   12   14
[6,]   16   17   19
[7,]   24   16   17
[8,]   20   17   14

Give names to the columns and rows

colnames(my_mat) <- c("sales", "profit","quantity")
rownames(my_mat)<- c("John","Chang","Michael","Bryan","Jose","McTon","Jayden","Jane")
my_mat
        sales profit quantity
John       25     26       10
Chang      22     28       15
Michael    18     30       16
Bryan      26     28       20
Jose       12     12       14
McTon      16     17       19
Jayden     24     16       17
Jane       20     17       14

Convert the matrix into a data frame

my_mat<-as.data.frame(my_mat)
my_mat
        sales profit quantity
John       25     26       10
Chang      22     28       15
Michael    18     30       16
Bryan      26     28       20
Jose       12     12       14
McTon      16     17       19
Jayden     24     16       17
Jane       20     17       14
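
After the conversion, the columns behave like ordinary data frame variables and can be accessed with the $ operator (a minimal sketch):

my_mat$sales             # the sales column as a numeric vector
summary(my_mat$profit)   # five-number summary plus the mean of profit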

Data framing/ Data binding

Flour<-c(230,181,165,150,97,182,191,199,172,170)
Cooking_oil<-c(125,99,97,115,120,100,80,90,95,125)
Kerosene<-c(200,55,105,85,125,150,85,120,110,130)
Rice<-c(109,107,98,71,82,103,111,93,86,78)

Data frame

data_frame<-data.frame(Flour,Cooking_oil,Kerosene,Rice)
data_frame
   Flour Cooking_oil Kerosene Rice
1    230         125      200  109
2    181          99       55  107
3    165          97      105   98
4    150         115       85   71
5     97         120      125   82
6    182         100      150  103
7    191          80       85  111
8    199          90      120   93
9    172          95      110   86
10   170         125      130   78

Further Operation

Covariance

cov(data_frame)
                Flour Cooking_oil   Kerosene       Rice
Flour       1196.4556   -127.3556  469.94444  309.60000
Cooking_oil -127.3556    244.2667  316.22222 -101.75556
Kerosene     469.9444    316.2222 1589.16667   69.77778
Rice         309.6000   -101.7556   69.77778  197.06667

Variance

Note that when var() is applied to a data frame, it returns the same covariance matrix as cov(), which is why the output below is identical to the one above.

var(data_frame)
                Flour Cooking_oil   Kerosene       Rice
Flour       1196.4556   -127.3556  469.94444  309.60000
Cooking_oil -127.3556    244.2667  316.22222 -101.75556
Kerosene     469.9444    316.2222 1589.16667   69.77778
Rice         309.6000   -101.7556   69.77778  197.06667
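
To put the covariance matrix on a unitless scale, it can be converted to a correlation matrix. cor() computes it directly, and cov2cor() rescales an existing covariance matrix, so the two results below should agree:

cor(data_frame)              # correlation matrix of the four commodities
cov2cor(cov(data_frame))     # the same matrix, obtained by rescaling the covariances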

mean

mean(data_frame$Flour)
[1] 173.7

Linear Regression from matrix

income<-c(25,26,32,31,25,26,24,36,21,26,21,23,25)
expenditure<-c(15,16,14,12,20,16,19,18,14,13,19,18,17)
investment<-c(10,12,14,12,12,14,13,16,21,14,15,19,17)

Data Frame

data.frame(income,expenditure,investment)
   income expenditure investment
1      25          15         10
2      26          16         12
3      32          14         14
4      31          12         12
5      25          20         12
6      26          16         14
7      24          19         13
8      36          18         16
9      21          14         21
10     26          13         14
11     21          19         15
12     23          18         19
13     25          17         17

Estimate the regression model

Regression_model=lm(expenditure~income+investment)
summary(Regression_model)

Call:
lm(formula = expenditure ~ income + investment)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5335 -1.4506 -0.2696  1.9866  3.5648 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) 20.284694   6.894461   2.942   0.0147 *
income      -0.150291   0.183800  -0.818   0.4326  
investment  -0.007682   0.259844  -0.030   0.9770  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.67 on 10 degrees of freedom
Multiple R-squared:  0.06582,   Adjusted R-squared:  -0.121 
F-statistic: 0.3523 on 2 and 10 DF,  p-value: 0.7115
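
Since this chapter started from matrices, it is worth noting that the same coefficients can be reproduced with the normal equations, beta = (X'X)^(-1) X'y. A minimal sketch using the vectors defined above (the column of ones supplies the intercept):

X <- cbind(Intercept = 1, income, investment)   # design matrix
y <- expenditure
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y    # normal equations
beta_hat                                        # matches the lm() estimates above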

View the results using stargazer

library(stargazer)   # for formatted regression tables
stargazer(Regression_model, type="text")

===============================================
                        Dependent variable:    
                    ---------------------------
                            expenditure        
-----------------------------------------------
income                        -0.150           
                              (0.184)          
                                               
investment                    -0.008           
                              (0.260)          
                                               
Constant                     20.285**          
                              (6.894)          
                                               
-----------------------------------------------
Observations                    13             
R2                             0.066           
Adjusted R2                   -0.121           
Residual Std. Error       2.670 (df = 10)      
F Statistic             0.352 (df = 2; 10)     
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

SURVIVAL ANALYSIS (Part one)

In many cancer studies, the main outcome under assessment is the time to an event of interest. The generic name for the time is survival time, although it may apply equally to the time ‘survived’ from complete remission to relapse or progression as to the time from diagnosis to death. If the event occurred in all individuals, many methods of analysis would be applicable. However, it is usual that at the end of follow-up some of the individuals have not had the event of interest, and thus their true time to event is unknown. Further, survival data are rarely Normally distributed, but are skewed, typically comprising many early events and relatively few late ones. It is these features of the data that make the special methods called survival analysis necessary.

This paper is the first of a series of four articles that aim to introduce and explain the basic concepts of survival analysis. Most survival analyses in cancer journals use some or all of Kaplan–Meier (KM) plots, logrank tests, and Cox (proportional hazards) regression. We will discuss the background to, and interpretation of, each of these methods but also other approaches to analysis that deserve to be used more often. In this first article, we will present the basic concepts of survival analysis, including how to produce and interpret survival curves, and how to quantify and test survival differences between two or more groups of patients. Future papers in the series cover multivariate analysis and the last paper introduces some more advanced concepts in a brief question and answer format. More detailed accounts of these methods can be found in books written specifically about survival analysis, for example, Collett (1994), Parmar and Machin (1995) and Kleinbaum (1996). In addition, individual references for the methods are presented throughout the series. Several introductory texts also describe the basis of survival analysis, for example, Altman (2003) and Piantadosi (1997).

In other words, survival analysis is a branch of statistics for analyzing the expected duration of time until an event happens, such as death in biological organisms or failure in mechanical systems. Today, I have gone through this topic and found it very interesting: the same methods can be used to predict the survival of cancer patients, customer churn, students’ dropout rates, or machine failure.

Load the dataset

data(cancer, package="survival")   # makes the survival data sets, including ovarian, available
my_data<-ovarian

View the first few observations

head(my_data,10)
   futime fustat     age resid.ds rx ecog.ps
1      59      1 72.3315        2  1       1
2     115      1 74.4932        2  1       1
3     156      1 66.4658        2  1       2
4     421      0 53.3644        2  2       1
5     431      1 50.3397        2  1       1
6     448      0 56.4301        1  1       2
7     464      1 56.9370        2  2       2
8     475      1 59.8548        2  2       2
9     477      0 64.1753        2  1       1
10    563      1 55.1781        1  2       2

View the last few observations

tail(my_data,10)
   futime fustat     age resid.ds rx ecog.ps
17   1040      0 38.8932        2  1       2
18   1106      0 44.6000        1  1       1
19   1129      0 53.9068        1  2       1
20   1206      0 44.2055        2  2       1
21   1227      0 59.5890        1  2       2
22    268      1 74.5041        2  1       2
23    329      1 43.1370        2  1       1
24    353      1 63.2192        1  2       2
25    365      1 64.4247        2  2       1
26    377      0 58.3096        1  2       1

Variable definition

Note:

fustat: 1 = event occurred (death); 0 = no event (censored). When fitting a model, make sure you know what 0 and 1 code for. This is survival after diagnosis.

futime: survival or censoring time

fustat: censoring status

age: in years

resid.ds: residual disease present (1=no,2=yes)

rx: treatment group

ecog.ps: ECOG performance status (1 is better, see reference)

Note:

A patient is scored as censored if he or she did not suffer the outcome of interest. In survival analysis, patients who do not have an “event” during the specified period are said to have a censored observation.

Read variable name

names(my_data)
[1] "futime"   "fustat"   "age"      "resid.ds" "rx"       "ecog.ps" 

Attach the dataset

attach(my_data)

Kaplan Meier Survival Model

The Kaplan-Meier estimate is one of the best options for measuring the fraction of subjects living for a certain amount of time after treatment. In clinical trials or community trials, the effect of an intervention is assessed by measuring the number of subjects who survived or were saved after that intervention over a period of time. The time starting from a defined point to the occurrence of a given event, for example death, is called the survival time, and the analysis of such data is called survival analysis. The analysis can be affected by subjects who are uncooperative and refuse to remain in the study, by subjects who do not experience the event or die before the end of the study (although they would have if observation had continued), or by subjects we lose touch with midway through the study. We label these situations censored observations. The Kaplan-Meier estimate is the simplest way of computing survival over time in spite of all these difficulties. The survival curve can be created assuming various situations. It involves computing the probability of the event occurring at a certain point in time and multiplying these successive probabilities by any earlier computed probabilities to get the final estimate. This can be calculated for two groups of subjects, and the statistical difference in their survival can also be tested. This can be used in Ayurveda research when comparing two drugs and looking at the survival of subjects.

Fit the model

km.model<-survfit(Surv(futime,fustat)~1,
                  type="kaplan-meier")

What is ~1

Since we have no x-variable, we must put a 1 there: it is just overall survival, with no variable to relate it to. In the command above we specified the type explicitly; if we do not specify it, survfit() fits the Kaplan-Meier model by default.

Summaries of the model

km.model
Call: survfit(formula = Surv(futime, fustat) ~ 1, type = "kaplan-meier")

      n events median 0.95LCL 0.95UCL
[1,] 26     12    638     464      NA

This gives the MEDIAN survival time and a 95% confidence interval (CI) for it. From the results above, we have 26 individuals and 12 events. The estimated median survival time is 638 days, which is the first time point at which the estimated survival probability drops to 50% or below; the upper confidence limit is reported as NA because the upper confidence band never falls to 50% within the follow-up period.

Model summary

summary(km.model)
Call: survfit(formula = Surv(futime, fustat) ~ 1, type = "kaplan-meier")

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   59     26       1    0.962  0.0377        0.890        1.000
  115     25       1    0.923  0.0523        0.826        1.000
  156     24       1    0.885  0.0627        0.770        1.000
  268     23       1    0.846  0.0708        0.718        0.997
  329     22       1    0.808  0.0773        0.670        0.974
  353     21       1    0.769  0.0826        0.623        0.949
  365     20       1    0.731  0.0870        0.579        0.923
  431     17       1    0.688  0.0919        0.529        0.894
  464     15       1    0.642  0.0965        0.478        0.862
  475     14       1    0.596  0.0999        0.429        0.828
  563     12       1    0.546  0.1032        0.377        0.791
  638     11       1    0.497  0.1051        0.328        0.752

From the summary table above, at time 59 days, 26 individuals were at risk and one event occurred; in other words, an individual died. Additionally, beyond 59 days, the probability of an individual surviving is 96.2%. Further, we are 95% confident that the chance of survival beyond 59 days is between 89% and 100%. At time 115 days, 25 individuals were at risk and one died. The probability of survival beyond 115 days is 92.3% with a standard error of 0.0523, and we are 95% confident that the chance of survival beyond 115 days is between 82.6% and 100%.
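
If specific time points are of interest, summary() on a survfit object also accepts a times argument; the time points below are illustrative:

summary(km.model, times = c(100, 365, 730))   # survival estimates at roughly 3, 12 and 24 months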

Plot the model

plot(km.model,conf.int = F,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model")

In the command above, we set conf.int = F because we do not want confidence bands in the plot. Consider the plot below, which includes the confidence bands.

plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1)

From the graph above, you can see why the median survival time is 638 days.

plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1)
abline(h=0.5,col="red")

Remember, we can also include a ‘tick’ wherever there is a censored observation by using the mark.time argument.

plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1, mark.time = T)

The graph above shows every point at which an observation was censored.

Add another variable (over 50 years)

Import the data set

cancer<-read.csv("C:\\Users\\user\\Downloads\\cancer.csv")

View the dataset

head(cancer,10)
   futime fustat     age resid.ds rx ecog.ps over50 gender
1      59      1 72.3315        2  1       1      1      1
2     115      1 74.4932        2  1       1      1      0
3     156      1 66.4658        2  1       2      1      0
4     421      0 53.3644        2  2       1      1      0
5     431      1 50.3397        2  1       1      1      0
6     448      0 56.4301        1  1       2      1      1
7     464      1 56.9370        2  2       2      1      0
8     475      1 59.8548        2  2       2      1      0
9     477      0 64.1753        2  1       1      1      1
10    563      1 55.1781        1  2       2      1      1
tail(cancer,10)
   futime fustat     age resid.ds rx ecog.ps over50 gender
17   1040      0 38.8932        2  1       2      0      1
18   1106      0 44.6000        1  1       1      0      0
19   1129      0 53.9068        1  2       1      1      1
20   1206      0 44.2055        2  2       1      0      0
21   1227      0 59.5890        1  2       2      1      0
22    268      1 74.5041        2  1       2      1      1
23    329      1 43.1370        2  1       1      0      1
24    353      1 63.2192        1  2       2      1      0
25    365      1 64.4247        2  2       1      1      0
26    377      0 58.3096        1  2       1      1      1
Let us now fit a KM model relating the over/under-50 indicator to survival and see whether there is a relationship between survival and age.

km.model2<-survfit(Surv(futime,fustat)~over50,
                  type="kaplan-meier", data = cancer)

Plot the model

plot(km.model2,conf.int = F,xlab = "Time(days)",
ylab="%Alive=S(t)",main="K-M Model: Survival Analysis", las=1)

library(survminer)   # ggsurvplot() comes from the survminer package
ggsurvplot(km.model2,data = cancer, pval = TRUE)

Alternatively,

plot(km.model2, conf.int = F,xlab = "Time(days)",ylab = "Survival Probability",main="K-M model: Survival Analysis", col = c("red","blue"),las=1,lwd=2,mark.time = T)

## legend 
legend(100, 0.2, legend = c("below50","above50"),lty = 1,lwd = 2,col=c("red","blue"),bty = "n",cex = 0.6)

Display the summary

km.model2
Call: survfit(formula = Surv(futime, fustat) ~ over50, data = cancer, 
    type = "kaplan-meier")

          n events median 0.95LCL 0.95UCL
over50=0  6      1     NA      NA      NA
over50=1 20     11    563     431      NA

From the summary statistics above, six individuals were under 50 years of age, and only one event occurred among them. On the other hand, 20 individuals over 50 years were at risk, and 11 events occurred among them. This suggests a possible relationship between age and survival, which we test formally below.

Display the model summary

summary(km.model2)
Call: survfit(formula = Surv(futime, fustat) ~ over50, data = cancer, 
    type = "kaplan-meier")

                over50=0 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  329      6       1    0.833   0.152        0.583        1.000

                over50=1 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   59     20       1    0.950  0.0487        0.859        1.000
  115     19       1    0.900  0.0671        0.778        1.000
  156     18       1    0.850  0.0798        0.707        1.000
  268     17       1    0.800  0.0894        0.643        0.996
  353     16       1    0.750  0.0968        0.582        0.966
  365     15       1    0.700  0.1025        0.525        0.933
  431     12       1    0.642  0.1093        0.460        0.896
  464     10       1    0.577  0.1157        0.390        0.855
  475      9       1    0.513  0.1193        0.326        0.809
  563      7       1    0.440  0.1227        0.255        0.760
  638      6       1    0.367  0.1222        0.191        0.705

From the summary table above, at time 329 days, 6 individuals below 50 years were at risk and one event occurred; in other words, an individual died. Additionally, beyond 329 days, the probability of an individual below 50 years surviving is 83.3%, and we are 95% confident that the chance of survival beyond 329 days for individuals below 50 years is between 58.3% and 100%. On the other hand, at time 59 days, 20 individuals above 50 years were at risk and one event occurred. Beyond 59 days, the probability of an individual above 50 years surviving is 95%, and we are 95% confident that the chance of survival beyond 59 days for individuals above 50 years is between 85.9% and 100%. It is important to note that the tick marks in the plot above show censored observations.

Discussion
  • Are the two survival functions statistically different?
  • What do you think, just visually comparing the two curves?

From the graph, individuals below 50 years appear to have a higher survival probability than individuals above 50 years. Whether this visual difference is statistically significant is tested formally below.

Formal Test (Log-Rank Test)

Ho: Survival in the two age groups is the same

Ha: Survival in the two age groups is not the same

survdiff(Surv(futime,fustat)~over50, data = cancer)
Call:
survdiff(formula = Surv(futime, fustat) ~ over50, data = cancer)

          N Observed Expected (O-E)^2/E (O-E)^2/V
over50=0  6        1      3.6     1.876      2.75
over50=1 20       11      8.4     0.804      2.75

 Chisq= 2.7  on 1 degrees of freedom, p= 0.1 

From the inferential results above, there is not enough evidence to reject the null hypothesis. In other words, there is no statistically sufficient evidence that survival differs between the two age groups, as shown by the p-value of 0.1, which is greater than 0.05.

SURVIVAL ANALYSIS (Part Two)

Cox Proportional Hazards Model (a)

The Cox proportional-hazards model (Cox, 1972) is essentially a regression model commonly used in medical research for investigating the association between the survival time of patients and one or more predictor variables.

In the previous chapter (survival analysis basics), we described the basic concepts of survival analyses and methods for analyzing and summarizing survival data, including:

  • the definition of hazard and survival functions,
  • the construction of Kaplan-Meier survival curves for different patient groups
  • the logrank test for comparing two or more survival curves

The above mentioned methods - Kaplan-Meier curves and logrank tests - are examples of univariate analysis. They describe the survival according to one factor under investigation, but ignore the impact of any others. Additionally, Kaplan-Meier curves and logrank tests are useful only when the predictor variable is categorical (e.g., treatment A vs treatment B; males vs females). They don’t work easily for quantitative predictors such as gene expression, weight, or age. An alternative method is the Cox proportional hazards regression analysis, which works both for quantitative predictor variables and for categorical variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time.

The Cox proportional hazards model is called a semi-parametric model, because there are no assumptions about the shape of the baseline hazard function. There are, however, other assumptions as noted above (i.e., independence, changes in predictors produce proportional changes in the hazard regardless of time, and a linear association between the natural logarithm of the relative hazard and the predictors). There are other regression models used in survival analysis that assume specific distributions for the survival times, such as the exponential, Weibull, Gompertz and log-normal distributions. The exponential regression survival model, for example, assumes that the hazard function is constant. Other distributions assume that the hazard is increasing over time, decreasing over time, or increasing initially and then decreasing. The examples below illustrate estimation of a Cox proportional hazards regression model and discuss the interpretation of the regression coefficients.
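
Formally, the Cox model specifies the hazard at time t for an individual with covariates x1, …, xp as a baseline hazard h0(t) scaled by an exponential function of the covariates; the exponentiated coefficients reported by coxph() are the hazard ratios.

\[ h(t \mid x_1, \dots, x_p) = h_0(t)\exp(\beta_1 x_1 + \dots + \beta_p x_p) \]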

Load the data set

data("lung")
head(lung)
  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1    3  306      2  74   1       1       90       100     1175      NA
2    3  455      2  68   1       0       90        90     1225      15
3    3 1010      1  56   1       0       90        90       NA      15
4    5  210      2  57   1       1       90        60     1150      11
5    1  883      2  60   1       0      100        90       NA       0
6   12 1022      1  74   1       1       50        80      513       0
tail(lung,5)
    inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
224    1  188      1  77   1       1       80        60       NA       3
225   13  191      1  39   1       0       90        90     2350      -5
226   32  105      1  75   2       2       60        70     1025       5
227    6  174      1  66   1       1       90       100     1075       1
228   22  177      1  58   2       1       80        90     1060       0
names(lung)
 [1] "inst"      "time"      "status"    "age"       "sex"       "ph.ecog"  
 [7] "ph.karno"  "pat.karno" "meal.cal"  "wt.loss"  

The data set above is a built-in data set from the survival package that can be used to run a Cox proportional hazards model; in this case, however, we are going to use the data below.

Load the data set

cancer<-read.csv("C:\\Users\\user\\Downloads\\cancer.csv")

View the data set

head(cancer,10)
   futime fustat     age resid.ds rx ecog.ps over50 gender
1      59      1 72.3315        2  1       1      1      1
2     115      1 74.4932        2  1       1      1      0
3     156      1 66.4658        2  1       2      1      0
4     421      0 53.3644        2  2       1      1      0
5     431      1 50.3397        2  1       1      1      0
6     448      0 56.4301        1  1       2      1      1
7     464      1 56.9370        2  2       2      1      0
8     475      1 59.8548        2  2       2      1      0
9     477      0 64.1753        2  1       1      1      1
10    563      1 55.1781        1  2       2      1      1

Change the variables of interest into factors

cancer$over50<-as.factor(cancer$over50)
cancer$ecog.ps<-as.factor(cancer$ecog.ps)

Attach the data set

attach(cancer)

Fit the model

cox.model<-coxph(Surv(futime,fustat)~over50+ecog.ps)

Display the model summary

summary(cox.model)
Call:
coxph(formula = Surv(futime, fustat) ~ over50 + ecog.ps)

  n= 26, number of events= 12 

           coef exp(coef) se(coef)     z Pr(>|z|)
over501  1.5356    4.6442   1.0541 1.457    0.145
ecog.ps2 0.2628    1.3006   0.5893 0.446    0.656

         exp(coef) exp(-coef) lower .95 upper .95
over501      4.644     0.2153    0.5884    36.655
ecog.ps2     1.301     0.7689    0.4098     4.128

Concordance= 0.603  (se = 0.081 )
Likelihood ratio test= 3.64  on 2 df,   p=0.2
Wald test            = 2.46  on 2 df,   p=0.3
Score (logrank) test = 2.95  on 2 df,   p=0.2

Remember, the Cox proportional hazards model does not estimate an intercept; the baseline hazard takes its place.

Interpretation of the exponentiated coefficients

At any given time, someone who is above 50 years is 4.6442 times as likely to die as someone who is under 50 years, adjusting for ECOG performance status. ECOG performance status describes a patient’s level of functioning in terms of their ability to care for themselves, daily activity, and physical ability (walking, working, etc.); researchers worldwide consider the ECOG Performance Status Scale when planning cancer clinical trials to study new treatments. The quantity (4.6442 - 1) indicates that someone who is over 50 years has a 364.42% higher hazard of death than someone who is under 50 years, adjusted for ECOG performance status. In other words, if we pick two individuals with the same ECOG performance status, the one above 50 years is 4.6442 times as likely to die as the one below 50 years. Additionally, someone who is below 50 years is 0.2153 times as likely to die as someone who is above 50 years, adjusting for ECOG performance status.
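
The hazard ratios and their confidence intervals can also be extracted directly from the fitted object (a minimal sketch; these reproduce the exp(coef) columns in the summary above):

exp(coef(cox.model))      # hazard ratios
exp(confint(cox.model))   # 95% confidence intervals for the hazard ratios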

Interpretation of the Likelihood Ratio Test

This tests the overall model significance (whether all coefficients are jointly zero). The concordance statistic is a goodness-of-fit statistic for survival analysis; in logistic regression, its equivalent is the area under the ROC curve.

Comparing the nested models using likelihood ratio test (LRT); [can we drop ECOG performance status]

cox.model<-coxph(Surv(futime,fustat)~over50+ecog.ps)
cox.model2<-coxph(Surv(futime,fustat)~over50)

Perform the Likelihood Ratio Test

The test aims to establish whether dropping ECOG performance status would significantly worsen the model. Consider the command below.

anova(cox.model2,cox.model,test = "LRT")
Analysis of Deviance Table
 Cox model: response is  Surv(futime, fustat)
 Model 1: ~ over50
 Model 2: ~ over50 + ecog.ps
   loglik  Chisq Df Pr(>|Chi|)
1 -33.266                     
2 -33.166 0.2013  1     0.6537

From the results above (p-value > 0.05), we can drop ECOG performance status without a significant loss of predictive power. In other words, ECOG performance status is not very important in this model.
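
As a complementary check (not shown in the original output), the two nested fits can also be compared with AIC, which penalises the extra parameter; the model with the lower AIC is preferred:

AIC(cox.model2, cox.model)   # lower AIC indicates the better-fitting model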

Additional Test

Load the data set

data("lung")
head(lung)
  inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1    3  306      2  74   1       1       90       100     1175      NA
2    3  455      2  68   1       0       90        90     1225      15
3    3 1010      1  56   1       0       90        90       NA      15
4    5  210      2  57   1       1       90        60     1150      11
5    1  883      2  60   1       0      100        90       NA       0
6   12 1022      1  74   1       1       50        80      513       0
write.csv(lung, "C:\\Users\\user\\Downloads\\lung.csv", row.names=FALSE)

Now import the data set (a modified version of the exported file, in which status and sex appear to have been recoded to 0/1 and an over50 indicator added)

lung_data<-read.csv("C:\\Users\\user\\Downloads\\lung_data.csv")

Attach the data set

attach(lung_data)

View the data set

head(lung_data,10)
   inst time status age over50 sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1     3  306      1  74      1   0       1       90       100     1175      NA
2     3  455      1  68      1   0       0       90        90     1225      15
3     3 1010      0  56      1   0       0       90        90       NA      15
4     5  210      1  57      1   0       1       90        60     1150      11
5     1  883      1  60      1   0       0      100        90       NA       0
6    12 1022      0  74      1   0       1       50        80      513       0
7     7  310      1  68      1   1       2       70        60      384      10
8    11  361      1  71      1   1       2       60        80      538       1
9     1  218      1  53      1   0       1       70        80      825      16
10    7  166      1  61      1   0       2       70        70      271      34

Cox proportional hazard model (b)

Fit the model with binary (0/1) X-variables

cox.model3<-coxph(Surv(time,status)~over50+sex)

Display summary statistics

summary(cox.model3)
Call:
coxph(formula = Surv(time, status) ~ over50 + sex)

  n= 228, number of events= 165 

          coef exp(coef) se(coef)      z Pr(>|z|)   
over50  0.2803    1.3236   0.3003  0.933  0.35057   
sex    -0.5270    0.5904   0.1672 -3.151  0.00162 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

       exp(coef) exp(-coef) lower .95 upper .95
over50    1.3236     0.7555    0.7347    2.3844
sex       0.5904     1.6938    0.4254    0.8193

Concordance= 0.585  (se = 0.022 )
Likelihood ratio test= 11.58  on 2 df,   p=0.003
Wald test            = 10.94  on 2 df,   p=0.004
Score (logrank) test = 11.19  on 2 df,   p=0.004

Interpretation of the exponentiated coefficients

At any given time, someone who is above 50 years is 1.3236 times as likely to die as someone who is under 50 years, adjusting for sex. The quantity (1.3236 - 1) indicates that someone who is over 50 years has a 32.36% higher hazard of death than someone who is under 50 years, adjusted for sex. In other words, if we pick two individuals of the same sex, the one above 50 years is 32.36% more likely to die than the one below 50 years. Additionally, someone who is below 50 years is 0.7555 times as likely to die as someone who is above 50 years, adjusting for sex.

Interpretation of the Likelihood Ratio Test

This tests the overall model significance. As before, the concordance statistic is a goodness-of-fit statistic for survival analysis; in logistic regression, its equivalent is the area under the ROC curve.

Comparing the nested models using likelihood ratio test (LRT); [can we drop gender]

cox.model3<-coxph(Surv(time,status)~over50+sex)
cox.model4<-coxph(Surv(time,status)~over50)

Perform the Likelihood Ratio Test

The test aims to establish whether dropping sex would significantly worsen the model. Consider the command below.

anova(cox.model4,cox.model3,test = "LRT")
Analysis of Deviance Table
 Cox model: response is  Surv(time, status)
 Model 1: ~ over50
 Model 2: ~ over50 + sex
   loglik  Chisq Df Pr(>|Chi|)   
1 -749.35                        
2 -744.12 10.464  1   0.001217 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results above (p-value < 0.05), we cannot drop the X-variable sex from the model; doing so would lead to a significant loss of predictive power. In other words, the X-variable sex is very important in this model.

So far we have worked with categorical (binary) X-variables; let us now add numeric X-variables to the model.

Run the model

cox.model5<-coxph(Surv(time,status)~age+wt.loss)

Display Summary statistics

summary(cox.model5)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss)

  n= 214, number of events= 152 
   (14 observations deleted due to missingness)

            coef exp(coef) se(coef)     z Pr(>|z|)  
age     0.021697  1.021934 0.009627 2.254   0.0242 *
wt.loss 0.001339  1.001340 0.006239 0.215   0.8301  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.022     0.9785    1.0028     1.041
wt.loss     1.001     0.9987    0.9892     1.014

Concordance= 0.563  (se = 0.026 )
Likelihood ratio test= 5.27  on 2 df,   p=0.07
Wald test            = 5.12  on 2 df,   p=0.08
Score (logrank) test = 5.14  on 2 df,   p=0.08

Interpretation of the exponentiated age coefficient

At any given instant in time, the hazard of dying for someone who is one year older is about 2.2% higher than for someone who is one year younger, adjusting for weight loss in the past six months. What this means is that if we pick two individuals with similar weight loss in the past six months, the hazard of dying for the one who is one year older is about 2.2% higher than for the one who is one year younger.
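
For a continuous covariate, the hazard ratio for a larger increment is obtained by exponentiating the coefficient multiplied by that increment; for example, a hypothetical ten-year age difference:

exp(coef(cox.model5)["age"])        # hazard ratio per one-year increase in age
exp(10 * coef(cox.model5)["age"])   # hazard ratio for a ten-year increase in age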

Assumptions of Cox-Proportional Hazard Model

The Cox proportional hazards model makes two assumptions:

  • survival curves for different strata must have hazard functions that are proportional over the time t;
  • the relationship between the log hazard and each covariate is linear, which can be verified with residual plots.

Examples of covariates can be categorical, such as race or treatment group, or continuous, such as biomarker concentrations.

Linearity

Fit the model

cox.model5<-coxph(Surv(time,status)~age+wt.loss)

Display the summary of the model

summary(cox.model5)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss)

  n= 214, number of events= 152 
   (14 observations deleted due to missingness)

            coef exp(coef) se(coef)     z Pr(>|z|)  
age     0.021697  1.021934 0.009627 2.254   0.0242 *
wt.loss 0.001339  1.001340 0.006239 0.215   0.8301  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.022     0.9785    1.0028     1.041
wt.loss     1.001     0.9987    0.9892     1.014

Concordance= 0.563  (se = 0.026 )
Likelihood ratio test= 5.27  on 2 df,   p=0.07
Wald test            = 5.12  on 2 df,   p=0.08
Score (logrank) test = 5.14  on 2 df,   p=0.08

If you recall, we assume that the relationship between the covariates (the X-variables) and the log hazard is linear. We can check this assumption in the same way we check linearity in the linear regression, logistic regression, or Poisson regression models: by looking at a residual plot, with the fitted values on the x-axis and the residuals on the y-axis.

plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
     xlab="fitted values",ylab="Martingale residuals",
     main="Residual Plot",las=1)

Add the horizontal line at zero

plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
     xlab="fitted values",ylab="Martingale residuals",
     main="Residual Plot",las=1)
abline(h=0)

Fit a smoother through the points

plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
     xlab="fitted values",ylab="Martingale residuals",
     main="Residual Plot",las=1)
abline(h=0)
lines(smooth.spline(predict(cox.model5),residuals(cox.model5,type = "martingale")),col="red")

Check linearity using the deviance residuals

plot(predict(cox.model5),residuals(cox.model5,type = "deviance"))

Add the abline and smoother

plot(predict(cox.model5),residuals(cox.model5,type = "deviance"))
abline(h=0)
lines(smooth.spline(predict(cox.model5),
                    residuals(cox.model5,type = "deviance")),col="red")

From the plots above, the smoothed residuals stay close to zero, suggesting that the relationship between the covariates (X-variables) and the log hazard is approximately linear.

Proportional Hazards Assumption

This assumption is tested using the Schoenfeld test, which has the following hypotheses:

Ho: hazards are proportional (the hazard ratio is constant over time)
Ha: hazards are not proportional (the hazard ratio changes over time)

Performing this test returns a test for each X-variable and a global test for the overall model.

cox.zph(cox.model5)
         chisq df    p
age     0.6910  1 0.41
wt.loss 0.0121  1 0.91
GLOBAL  0.7209  2 0.70

The p-values for the covariates and for the overall (GLOBAL) test are greater than 0.05. This indicates that there is no statistically significant evidence against the null hypothesis, so we fail to reject it and conclude that the hazards are proportional over time.

Plot of the Schoenfeld test

We can see a plot of these as well (one plot for each parameter). These are plots of the coefficient “b” allowed to change over time: recall that if “b” varies over time, the proportional hazards assumption does not hold, because the effect is not constant over time. However, pay less attention to the extremes of the time axis, where the smoothed line is sensitive to the few observations there.

plot(cox.zph(cox.model5))

The smooth line in the plot shows how the coefficient (and hence the hazard) would change if we allowed it to change over time. The dotted lines on either side of the solid line are the confidence limits. Let us add a horizontal reference line through zero.

plot(cox.zph(cox.model5)[1], main="Plot of Hazards")
abline(h=0,col="red")

From the graph above, the reference line lies within the confidence band for much of the follow-up time. There is no strong graphical evidence that the coefficient for age changes over time, so, together with the Schoenfeld test above, the proportional hazards assumption appears to be reasonably met for age.

plot(cox.zph(cox.model5)[2], main="Plot of Hazards")
abline(h=0,col="red")

From the graph above, the reference line lies within the confidence band throughout the follow-up time. This is a clear indication that the coefficient for weight loss is not changing over time. In other words, the proportional hazards assumption is fully met for weight loss.