Setting up RMarkdown enables you to create dynamic, reproducible, and visually appealing reports, presentations, and documents that can help you communicate your data analysis and research findings more effectively.
In terms of selecting a statistical test, the most important question is "what is the main study hypothesis?" In some cases there is no hypothesis; the investigator just wants to "see what is there". For example, in a prevalence study there is no hypothesis to test, and the size of the study is determined by how accurately the investigator wants to determine the prevalence. If there is no hypothesis, then there is no statistical test. It is important to decide a priori which hypotheses are confirmatory (that is, are testing some presupposed relationship) and which are exploratory (are suggested by the data). No single study can support a whole series of hypotheses. A sensible plan is to severely limit the number of confirmatory hypotheses. Although it is valid to use statistical tests on hypotheses suggested by the data, the P values should be used only as guidelines, and the results treated as tentative until confirmed by subsequent studies. A useful guide is the Bonferroni correction, which states simply that if one is testing n independent hypotheses, one should use a significance level of 0.05/n. Thus if there were two independent hypotheses, a result would be declared significant only if P<0.025. Note that, since tests are rarely independent, this is a very conservative procedure, i.e. one that is unlikely to reject the null hypothesis.

The investigator should then ask "are the data independent?" This can be difficult to decide, but as a rule of thumb results on the same individual, or from matched individuals, are not independent. Thus results from a crossover trial, or from a case-control study in which the controls were matched to the cases by age, sex and social class, are not independent.
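Returning to the Bonferroni correction mentioned above, a quick sketch in R compares a set of hypothetical p-values (invented purely for illustration) against the adjusted threshold; p.adjust() from base R gives the equivalent adjusted p-values.
# Hypothetical p-values from three confirmatory hypotheses (illustrative only)
p_values <- c(0.012, 0.030, 0.200)
# Bonferroni-adjusted significance threshold: 0.05 divided by the number of tests
0.05 / length(p_values)
# Equivalently, adjust the p-values themselves and compare them against 0.05
p.adjust(p_values, method = "bonferroni")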
Parametric tests assume a normal distribution of values, or a "bell-shaped curve." For example, height is roughly normally distributed: if you were to graph the heights of a group of people, you would see a typical bell-shaped curve. This distribution is also called a Gaussian distribution. Parametric tests are in general more powerful (require a smaller sample size) than nonparametric tests, but they rely on this distributional assumption; nonparametric tests do not.
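As a quick sketch of how one might check the normality assumption in R before choosing between a parametric and a nonparametric test, the example below uses simulated heights (the data are invented for illustration); the same shapiro.test() function is applied to real data later in this document.
# Simulated adult heights in centimetres (illustrative data only)
set.seed(42)
heights <- rnorm(200, mean = 170, sd = 8)
# Visual check of the bell shape, followed by a formal Shapiro-Wilk test
hist(heights, main = "Histogram of simulated heights", xlab = "Height (cm)")
shapiro.test(heights)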
The One Sample t Test examines whether the mean of a population is statistically different from a known or hypothesized value. The One Sample t Test is a parametric test, also known as the single-sample t test. The variable used in this test is known as the test variable. In a One Sample t Test, the test variable's mean is compared against a "test value", which is a known or hypothesized value of the mean in the population. Test values may come from a literature review, a trusted research organization, legal requirements, or industry standards. For example:
1. A particular factory's machines are supposed to fill bottles with 150 milliliters of product. A plant manager wants to test a random sample of bottles to ensure that the machines are not under- or over-filling the bottles.
2. The United States Environmental Protection Agency (EPA) sets clearance levels for the amount of lead present in homes: no more than 10 micrograms per square foot on floors and no more than 100 micrograms per square foot on window sills (as of December 2020). An inspector wants to test if samples taken from units in an apartment building exceed the clearance level.
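A minimal sketch of how the bottle-filling example (item 1 above) could be tested in R, assuming a simulated vector of fill volumes (the numbers are invented for illustration); the same t.test() call is applied to real data later in this document.
# Simulated fill volumes in millilitres for 30 bottles (illustrative data only)
set.seed(123)
fill_volumes <- rnorm(30, mean = 149.5, sd = 1.2)
# One Sample t Test of the mean fill volume against the target of 150 ml
t.test(fill_volumes, mu = 150, alternative = "two.sided")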
knitr::opts_chunk$set(comment = NA)
library(haven)
library(tidyverse)
library(agricolae)
library(stargazer)
library(highcharter)
library(ggplot2)
library(ggthemes)
library(plotrix)
library(tseries)
library(vars)
library(olsrr)
library(car)
library(grid)
library(rmarkdown)
library(ggtext)
library(RColorBrewer)
library(rhandsontable)
library(forecast)
library(quantmod)
library(xts)
library(PerformanceAnalytics)
library(rugarch)
library(dplyr)
library(magrittr)
library(gapminder)
library(plyr)
library(survival)
library(survminer)
library(ggpubr)
Use the command below to ensure that the output values are not written in scientific notation.
options(scipen=999)
theme_set(theme_economist())
theme_set(theme_base())
Superstore_data<-read.csv("C:\\Users\\user\\Downloads\\Superstore_data.csv")
attach(Superstore_data)
head(Superstore_data,10)
Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
1 1 CA-2018-152156 11/8/2018 11/11/2018 Second Class CG-12520
2 2 CA-2018-152156 11/8/2018 11/11/2018 Second Class CG-12520
3 3 CA-2018-138688 6/12/2018 6/16/2018 Second Class DV-13045
4 4 US-2017-108966 10/11/2017 10/18/2017 Standard Class SO-20335
5 5 US-2017-108966 10/11/2017 10/18/2017 Standard Class SO-20335
6 6 CA-2016-115812 6/9/2016 6/14/2016 Standard Class BH-11710
7 7 CA-2016-115812 6/9/2016 6/14/2016 Standard Class BH-11710
8 8 CA-2016-115812 6/9/2016 6/14/2016 Standard Class BH-11710
9 9 CA-2016-115812 6/9/2016 6/14/2016 Standard Class BH-11710
10 10 CA-2016-115812 6/9/2016 6/14/2016 Standard Class BH-11710
Customer.Name Segment Country.Region City State
1 Claire Gute Consumer United States Henderson Kentucky
2 Claire Gute Consumer United States Henderson Kentucky
3 Darrin Van Huff Corporate United States Los Angeles California
4 Sean O'Donnell Consumer United States Fort Lauderdale Florida
5 Sean O'Donnell Consumer United States Fort Lauderdale Florida
6 Brosina Hoffman Consumer United States Los Angeles California
7 Brosina Hoffman Consumer United States Los Angeles California
8 Brosina Hoffman Consumer United States Los Angeles California
9 Brosina Hoffman Consumer United States Los Angeles California
10 Brosina Hoffman Consumer United States Los Angeles California
Postal.Code Region Product.ID Category Sub.Category
1 42420 South FUR-BO-10001798 Furniture Bookcases
2 42420 South FUR-CH-10000454 Furniture Chairs
3 90036 West OFF-LA-10000240 Office Supplies Labels
4 33311 South FUR-TA-10000577 Furniture Tables
5 33311 South OFF-ST-10000760 Office Supplies Storage
6 90032 West FUR-FU-10001487 Furniture Furnishings
7 90032 West OFF-AR-10002833 Office Supplies Art
8 90032 West TEC-PH-10002275 Technology Phones
9 90032 West OFF-BI-10003910 Office Supplies Binders
10 90032 West OFF-AP-10002892 Office Supplies Appliances
Product.Name Sales
1 Bush Somerset Collection Bookcase 261.9600
2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
4 Bretford CR4500 Series Slim Rectangular Table 957.5775
5 Eldon Fold 'N Roll Cart System 22.3680
6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
7 Newell 322 7.2800
8 Mitel 5320 IP Phone VoIP phone 907.1520
9 DXL Angle-View Binders with Locking Rings by Samsill 18.5040
10 Belkin F5C206VTEL 6 Outlet Surge 114.9000
Quantity Discount Profit
1 2 0.00 41.9136
2 3 0.00 219.5820
3 2 0.00 6.8714
4 5 0.45 -383.0310
5 2 0.20 2.5164
6 7 0.00 14.1694
7 4 0.00 1.9656
8 6 0.20 90.7152
9 3 0.20 5.7825
10 5 0.00 34.4700
tail(Superstore_data,10)
Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
9985 9985 CA-2017-100251 5/17/2017 5/23/2017 Standard Class DV-13465
9986 9986 CA-2017-100251 5/17/2017 5/23/2017 Standard Class DV-13465
9987 9987 CA-2018-125794 9/29/2018 10/3/2018 Standard Class ML-17410
9988 9988 CA-2019-163629 11/17/2019 11/21/2019 Standard Class RA-19885
9989 9989 CA-2019-163629 11/17/2019 11/21/2019 Standard Class RA-19885
9990 9990 CA-2016-110422 1/21/2016 1/23/2016 Second Class TB-21400
9991 9991 CA-2019-121258 2/26/2019 3/3/2019 Standard Class DB-13060
9992 9992 CA-2019-121258 2/26/2019 3/3/2019 Standard Class DB-13060
9993 9993 CA-2019-121258 2/26/2019 3/3/2019 Standard Class DB-13060
9994 9994 CA-2019-119914 5/4/2019 5/9/2019 Second Class CC-12220
Customer.Name Segment Country.Region City State
9985 Dianna Vittorini Consumer United States Long Beach New York
9986 Dianna Vittorini Consumer United States Long Beach New York
9987 Maris LaWare Consumer United States Los Angeles California
9988 Ruben Ausman Corporate United States Athens Georgia
9989 Ruben Ausman Corporate United States Athens Georgia
9990 Tom Boeckenhauer Consumer United States Miami Florida
9991 Dave Brooks Consumer United States Costa Mesa California
9992 Dave Brooks Consumer United States Costa Mesa California
9993 Dave Brooks Consumer United States Costa Mesa California
9994 Chris Cortes Consumer United States Westminster California
Postal.Code Region Product.ID Category Sub.Category
9985 11561 East OFF-LA-10003766 Office Supplies Labels
9986 11561 East OFF-SU-10000898 Office Supplies Supplies
9987 90008 West TEC-AC-10003399 Technology Accessories
9988 30605 South TEC-AC-10001539 Technology Accessories
9989 30605 South TEC-PH-10004006 Technology Phones
9990 33180 South FUR-FU-10001889 Furniture Furnishings
9991 92627 West FUR-FU-10000747 Furniture Furnishings
9992 92627 West TEC-PH-10003645 Technology Phones
9993 92627 West OFF-PA-10004041 Office Supplies Paper
9994 92683 West OFF-AP-10002684 Office Supplies Appliances
Product.Name
9985 Self-Adhesive Removable Labels
9986 Acme Hot Forged Carbon Steel Scissors with Nickel-Plated Handles, 3 7/8" Cut, 8"L
9987 Memorex Mini Travel Drive 64 GB USB 2.0 Flash Drive
9988 Logitech G430 Surround Sound Gaming Headset with Dolby 7.1 Technology
9989 Panasonic KX - TS880B Telephone
9990 Ultra Door Pull Handle
9991 Tenex B1-RE Series Chair Mats for Low Pile Carpets
9992 Aastra 57i VoIP phone
9993 It's Hot Message Books with Stickers, 2 3/4" x 5"
9994 Acco 7-Outlet Masterpiece Power Center, Wihtout Fax/Phone Line Protection
Sales Quantity Discount Profit
9985 31.500 10 0.0 15.1200
9986 55.600 4 0.0 16.1240
9987 36.240 1 0.0 15.2208
9988 79.990 1 0.0 28.7964
9989 206.100 5 0.0 55.6470
9990 25.248 3 0.2 4.1028
9991 91.960 2 0.0 15.6332
9992 258.576 2 0.2 19.3932
9993 29.600 4 0.0 13.3200
9994 243.160 2 0.0 72.9480
Sales<-ts(Superstore_data$Sales,start=1265,frequency = 1)
Profit<-ts(Superstore_data$Profit,start = 1265,frequency = 1)
ts.plot(Sales, main="A time series plot of Sales from Superstore company",xlab="Time in years",ylab="Sales")
A time series plot of Sales from Superstore company
ts.plot(Profit, main="A time series plot of Profit from Superstore company",xlab="Time in years",ylab="Profit")
A time series plot of Profit from Superstore company
The one-sample t-test is a statistical hypothesis test used to determine whether an unknown population mean is different from a specific value.
You can use the test for continuous data. Your data should be a random sample from a normal population.
If your sample sizes are very small, you might not be able to test for normality. You might need to rely on your understanding of the data. When you cannot safely assume normality, you can perform a nonparametric test that doesn’t assume normality.
The company assumes that from 2016 to 2019 it made an average sale of approximately 200 per order. Test this hypothesis to determine whether management's assumption is correct.
t.test(Superstore_data$Sales,mu=200,alternative="two.sided",conf.level=0.95)
One Sample t-test
data: Superstore_data$Sales
t = 4.7893, df = 9993, p-value = 0.000001698
alternative hypothesis: true mean is not equal to 200
95 percent confidence interval:
217.6375 242.0785
sample estimates:
mean of x
229.858
In the output above, the p-value of the test is approximately 0.0000017, which is far below the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean of Sales from 2016 to 2019 is not equal to 200 at the 5% level of significance.
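When the normality assumption is in doubt, a one-sample Wilcoxon signed-rank test is a common nonparametric alternative; the sketch below applies it to the same Sales variable and test value, noting that it tests the location (median) rather than the mean.
# Nonparametric alternative to the one-sample t-test (tests the median/location)
wilcox.test(Superstore_data$Sales, mu = 200, alternative = "two.sided")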
The Independent Samples t Test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. The Independent Samples t Test is a parametric test.
data22<-Superstore_data %>%
dplyr::select(Region, Sales)%>%
filter(Region == "West"|
Region == "East")
head(data22,5)
Region Sales
1 West 14.620
2 West 48.860
3 West 7.280
4 West 907.152
5 West 18.504
tail(data22,5)
Region Sales
6047 West 36.240
6048 West 91.960
6049 West 258.576
6050 West 29.600
6051 West 243.160
t.test(Sales ~ Region,alternative="two.sided",data = data22)
Welch Two Sample t-test
data: Sales by Region
t = 0.79611, df = 5603.9, p-value = 0.426
alternative hypothesis: true difference in means between group East and group West is not equal to 0
95 percent confidence interval:
-17.31977 41.00552
sample estimates:
mean in group East mean in group West
238.3361 226.4932
In the output above, the p-value of the test is 0.426, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis and conclude that there is no statistically significant difference in mean Sales between the West and East regions at the 5% level of significance.
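R's t.test() defaults to the Welch version, which does not assume equal variances. If you want to check whether a pooled-variance t-test would also be reasonable, one option is an F test of equal variances; a minimal sketch using the same data22 object:
# F test comparing the variance of Sales in the East and West regions
var.test(Sales ~ Region, data = data22)
# If the variances look comparable, a pooled-variance t-test can be requested explicitly
t.test(Sales ~ Region, alternative = "two.sided", data = data22, var.equal = TRUE)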
data23<-Superstore_data %>%
dplyr::select(Category, Sales)%>%
filter(Category == "Technology"|
Category == "Furniture")
head(data23,5)
Category Sales
1 Furniture 261.9600
2 Furniture 731.9400
3 Furniture 957.5775
4 Furniture 48.8600
5 Technology 907.1520
tail(data23,5)
Category Sales
3964 Technology 79.990
3965 Technology 206.100
3966 Furniture 25.248
3967 Furniture 91.960
3968 Technology 258.576
t.test(Sales ~ Category,alternative="two.sided",data = data23)
Welch Two Sample t-test
data: Sales by Category
t = -3.6721, df = 2497.7, p-value = 0.0002456
alternative hypothesis: true difference in means between group Furniture and group Technology is not equal to 0
95 percent confidence interval:
-157.8094 -47.9394
sample estimates:
mean in group Furniture mean in group Technology
349.8349 452.7093
In the output above, the p-value of the test is approximately 0.0002, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true difference in mean Sales between the Technology and Furniture categories is not equal to zero at the 5% level of significance.
data24<-Superstore_data %>%
dplyr::select(Category, Sales)%>%
filter(Category == "Technology"|
Category == "Office Supplies")
head(data24,5)
Category Sales
1 Office Supplies 14.620
2 Office Supplies 22.368
3 Office Supplies 7.280
4 Technology 907.152
5 Office Supplies 18.504
t.test(Sales ~ Category,alternative="two.sided",data = data24)
Welch Two Sample t-test
data: Sales by Category
t = -12.694, df = 1982.1, p-value < 0.00000000000000022
alternative hypothesis: true difference in means between group Office Supplies and group Technology is not equal to 0
95 percent confidence interval:
-384.8897 -281.8807
sample estimates:
mean in group Office Supplies mean in group Technology
119.3241 452.7093
In the output above, the p-value of the test is far less than 0.001, well below the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true difference in mean Sales between the Technology and Office Supplies categories is not equal to zero at the 5% level of significance.
One-Way ANOVA (“analysis of variance”) compares the means of two or more independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. One-Way ANOVA is a parametric test.
anova_model<-aov(Sales~Category)
summary(anova_model)
Df Sum Sq Mean Sq F value Pr(>F)
Category 2 195881745 97940872 265.5 <0.0000000000000002 ***
Residuals 9991 3685743767 368906
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 265.5 with an associated p-value far less than 0.001, which is well below 0.05. This indicates that there is a significant difference in average Sales across the three categories at the 5% level of significance.
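Before relying on the ANOVA result, it is good practice to check the homogeneity-of-variance assumption; the sketch below uses leveneTest() from the car package (already loaded above) and Bartlett's test from base R.
# Levene's test for equal variances across categories (robust to non-normality)
car::leveneTest(Sales ~ as.factor(Category), data = Superstore_data)
# Bartlett's test as an alternative (more sensitive to departures from normality)
bartlett.test(Sales ~ as.factor(Category), data = Superstore_data)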
Tukey HSD is the commonly used post hoc test to determine which two categories are differing in sales from 2016 to 2019.
tukey_hsd<-TukeyHSD(anova_model)
tukey_hsd
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Sales ~ Category)
$Category
diff lwr upr p adj
Office Supplies-Furniture -230.5108 -266.45583 -194.5657 0.0000000
Technology-Furniture 102.8744 57.56303 148.1858 0.0000003
Technology-Office Supplies 333.3852 295.51937 371.2510 0.0000000
From the results above, a significant difference in average Sales exists between every pair of categories, as indicated by adjusted p-values below 0.05 for all pairs. These results can be visualized as shown below.
GGTukey.1<-function(tukey_hsd){
A<-require("tidyverse")
if(A==TRUE){
library(tidyverse)
} else {
install.packages("tidyverse")
library(tidyverse)
}
B<-as.data.frame(tukey_hsd[1])
colnames(B)[2:4]<-c("min",
"max",
"p")
C<-data.frame(id=row.names(B),
min=B$min,
max=B$max,
idt=ifelse(B$p<0.05,
"significant",
"not significant")
)
D<-C%>%
ggplot(aes(id,color=idt))+
geom_errorbar(aes(ymin=min,
ymax=max),
width = 0.5,
size=1.25)+
labs(x=NULL,
color=NULL)+
scale_color_manual(values=c("green",
"red")
)+
coord_flip()+
theme(text=element_text(family="TimesNewRoman"),
title=element_text(color="black",size=15),
axis.text = element_text(color="black",size=10),
axis.title = element_text(color="black",size=10),
panel.grid=element_line(color="grey75"),
axis.line=element_blank(),
plot.background=element_rect(fill="white",color="white"),
panel.background=element_rect(fill="white"),
panel.border = element_rect(colour = "black", fill = NA,size=0.59),
legend.key= element_rect(color="white",fill="white")
)
return(D)
}
GGTukey.1(tukey_hsd)
Tukey's honestly significant difference
plot(tukey_hsd,las =1)
anova_model2<-aov(Sales~Sub.Category)
summary(anova_model2)
Df Sum Sq Mean Sq F value Pr(>F)
Sub.Category 16 776253968 48515873 155.9 <0.0000000000000002 ***
Residuals 9977 3105371543 311253
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 155.9 with an associated p-value far less than 0.001, which is well below 0.05. This indicates that there is a significant difference in average Sales across the seventeen sub-categories at the 5% level of significance.
tukey_hsd2<-TukeyHSD(anova_model2)
tukey_hsd2
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Sales ~ Sub.Category)
$Sub.Category
diff lwr upr p adj
Appliances-Accessories 14.7811064 -98.352371 127.914584 1.0000000
Art-Accessories -181.9057697 -279.299223 -84.512317 0.0000000
Binders-Accessories -82.4140438 -167.571615 2.743527 0.0707858
Bookcases-Accessories 287.8850290 142.479391 433.290667 0.0000000
Chairs-Accessories 316.3578159 212.227967 420.487665 0.0000000
Copiers-Accessories 1982.9670138 1738.872771 2227.061256 0.0000000
Envelopes-Accessories -151.1068795 -290.643768 -11.569991 0.0187882
Fasteners-Accessories -202.0378297 -350.263756 -53.811903 0.0003125
Furnishings-Accessories -120.1489362 -213.413396 -26.884476 0.0010412
Labels-Accessories -181.6715489 -304.305120 -59.037978 0.0000400
Machines-Accessories 1429.5787092 1236.717765 1622.439653 0.0000000
Paper-Accessories -158.6905119 -245.436934 -71.944090 0.0000000
Phones-Accessories 155.2369304 60.389846 250.084015 0.0000021
Storage-Accessories 48.6159493 -47.347208 144.579107 0.9455486
Supplies-Accessories 29.6755961 -126.561825 185.913017 0.9999997
Tables-Accessories 432.8201673 304.435984 561.204351 0.0000000
Art-Appliances -196.6868761 -309.258580 -84.115172 0.0000002
Binders-Appliances -97.1951502 -199.365184 4.974884 0.0839321
Bookcases-Appliances 273.1039226 117.124105 429.083741 0.0000002
Chairs-Appliances 301.5767095 183.128706 420.024713 0.0000000
Copiers-Appliances 1968.1859073 1717.648678 2218.723136 0.0000000
Envelopes-Appliances -165.8879859 -316.411895 -15.364077 0.0147424
Fasteners-Appliances -216.8189361 -375.431134 -58.206739 0.0002927
Furnishings-Appliances -134.9300426 -243.949139 -25.910946 0.0022497
Labels-Appliances -196.4526554 -331.455976 -61.449334 0.0000641
Machines-Appliances 1414.7976027 1213.844256 1615.750949 0.0000000
Paper-Appliances -173.4716183 -276.969665 -69.973572 0.0000009
Phones-Appliances 140.4558240 30.079770 250.831878 0.0013283
Storage-Appliances 33.8348429 -77.501726 145.171412 0.9998021
Supplies-Appliances 14.8944897 -151.229065 181.018045 1.0000000
Tables-Appliances 418.0390609 277.791414 558.286708 0.0000000
Binders-Art 99.4917259 15.081912 183.901540 0.0052743
Bookcases-Art 469.7907987 324.821821 614.759777 0.0000000
Chairs-Art 498.2635856 394.744359 601.782813 0.0000000
Copiers-Art 2164.8727835 1921.038405 2408.707162 0.0000000
Envelopes-Art 30.7988902 -108.282913 169.880694 0.9999974
Fasteners-Art -20.1320600 -167.929659 127.665539 1.0000000
Furnishings-Art 61.7568335 -30.825370 154.339037 0.6491218
Labels-Art 0.2342208 -121.881288 122.349730 1.0000000
Machines-Art 1611.4844789 1418.952537 1804.016420 0.0000000
Paper-Art 23.2152578 -62.797221 109.227737 0.9999589
Phones-Art 337.1427001 242.966407 431.318994 0.0000000
Storage-Art 230.5217190 135.221496 325.821942 0.0000000
Supplies-Art 211.5813658 55.750251 367.412481 0.0003411
Tables-Art 614.7259370 486.836518 742.615356 0.0000000
Bookcases-Binders 370.2990728 233.250425 507.347720 0.0000000
Chairs-Binders 398.7718597 306.671058 490.872661 0.0000000
Copiers-Binders 2065.3810576 1826.170834 2304.591281 0.0000000
Envelopes-Binders -68.6928357 -199.498322 62.112651 0.9265977
Fasteners-Binders -119.6237859 -259.661129 20.413557 0.2055723
Furnishings-Binders -37.7348924 -117.345140 41.875355 0.9698971
Labels-Binders -99.2575051 -211.856461 13.341450 0.1629283
Machines-Binders 1511.9927530 1325.351719 1698.633787 0.0000000
Paper-Binders -76.2764681 -148.140741 -4.412195 0.0244962
Phones-Binders 237.6509742 156.192387 319.109562 0.0000000
Storage-Binders 131.0299931 48.274572 213.785414 0.0000059
Supplies-Binders 112.0896399 -36.401652 260.580932 0.4181442
Tables-Binders 515.2342111 396.398061 634.070362 0.0000000
Chairs-Bookcases 28.4727869 -121.105104 178.050678 0.9999997
Copiers-Bookcases 1695.0819848 1428.412678 1961.751291 0.0000000
Envelopes-Bookcases -438.9919085 -615.063096 -262.920721 0.0000000
Fasteners-Bookcases -489.9228587 -672.956859 -306.888859 0.0000000
Furnishings-Bookcases -408.0339652 -550.261879 -265.806052 0.0000000
Labels-Bookcases -469.5565779 -632.558150 -306.555006 0.0000000
Machines-Bookcases 1141.6936801 920.954324 1362.433037 0.0000000
Paper-Bookcases -446.5755409 -584.617062 -308.534020 0.0000000
Phones-Bookcases -132.6480986 -275.918784 10.622587 0.1080946
Storage-Bookcases -239.2690797 -383.281050 -95.257110 0.0000013
Supplies-Bookcases -258.2094329 -447.789631 -68.629235 0.0003179
Tables-Bookcases 144.9351383 -22.435763 312.306039 0.1863844
Copiers-Chairs 1666.6091979 1420.006796 1913.211600 0.0000000
Envelopes-Chairs -467.4646954 -611.344120 -323.585271 0.0000000
Fasteners-Chairs -518.3956456 -670.716593 -366.074698 0.0000000
Furnishings-Chairs -436.5067521 -536.151146 -336.862358 0.0000000
Labels-Chairs -498.0293648 -625.582250 -370.476480 0.0000000
Machines-Chairs 1113.2208933 917.195157 1309.246630 0.0000000
Paper-Chairs -475.0483278 -568.620158 -381.476498 0.0000000
Phones-Chairs -161.1208855 -262.248108 -59.993663 0.0000049
Storage-Chairs -267.7418666 -369.916586 -165.567147 0.0000000
Supplies-Chairs -286.6822198 -446.809910 -126.554529 0.0000001
Tables-Chairs 116.4623514 -16.628761 249.553464 0.1723836
Envelopes-Copiers -2134.0738932 -2397.589097 -1870.558690 0.0000000
Fasteners-Copiers -2185.0048435 -2453.222376 -1916.787311 0.0000000
Furnishings-Copiers -2103.1159499 -2345.330687 -1860.901213 0.0000000
Labels-Copiers -2164.6385627 -2419.606624 -1909.670502 0.0000000
Machines-Copiers -553.3883046 -848.625604 -258.151005 0.0000000
Paper-Copiers -2141.6575257 -2381.437969 -1901.877082 0.0000000
Phones-Copiers -1827.7300833 -2070.558600 -1584.901566 0.0000000
Storage-Copiers -1934.3510645 -2177.617681 -1691.084447 0.0000000
Supplies-Copiers -1953.2914176 -2226.018114 -1680.564721 0.0000000
Tables-Copiers -1550.1468465 -1807.930124 -1292.363569 0.0000000
Fasteners-Envelopes -50.9309502 -229.338317 127.476417 0.9999128
Furnishings-Envelopes 30.9579433 -105.264385 167.180272 0.9999962
Labels-Envelopes -30.5646695 -188.353313 127.223974 0.9999996
Machines-Envelopes 1580.6855886 1363.767155 1797.604023 0.0000000
Paper-Envelopes -7.5836324 -139.429015 124.261751 1.0000000
Phones-Envelopes 306.3438099 169.033094 443.654526 0.0000000
Storage-Envelopes 199.7228288 61.638829 337.806829 0.0000749
Supplies-Envelopes 180.7824756 -4.334771 365.899722 0.0644947
Tables-Envelopes 583.9270468 421.628674 746.225420 0.0000000
Furnishings-Fasteners 81.8888935 -63.221082 226.998869 0.8728069
Labels-Fasteners 20.3662808 -145.156039 185.888600 1.0000000
Machines-Fasteners 1631.6165388 1409.009286 1854.223792 0.0000000
Paper-Fasteners 43.3473178 -97.661856 184.356492 0.9997704
Phones-Fasteners 357.2747601 211.142577 503.406943 0.0000000
Storage-Fasteners 250.6537790 103.794754 397.512804 0.0000005
Supplies-Fasteners 231.7134258 39.961562 423.465289 0.0034553
Tables-Fasteners 634.8579970 465.031191 804.684803 0.0000000
Labels-Furnishings -61.5226128 -180.371140 57.325914 0.9347411
Machines-Furnishings 1549.7276453 1359.251076 1740.204214 0.0000000
Paper-Furnishings -38.5415757 -119.849148 42.765997 0.9698815
Phones-Furnishings 275.3858666 185.486205 365.285528 0.0000000
Storage-Furnishings 168.7648855 77.688504 259.841267 0.0000000
Supplies-Furnishings 149.8245323 -3.459881 303.108945 0.0639163
Tables-Furnishings 552.9691034 428.195395 677.742812 0.0000000
Machines-Labels 1611.2502581 1404.799158 1817.701358 0.0000000
Paper-Labels 22.9810370 -90.824299 136.786373 0.9999993
Phones-Labels 336.9084794 216.814007 457.002952 0.0000000
Storage-Labels 230.2874982 109.309647 351.265350 0.0000000
Supplies-Labels 211.3471451 38.613696 384.080594 0.0027717
Tables-Labels 614.4917162 466.474111 762.509322 0.0000000
Paper-Machines -1588.2692211 -1775.640525 -1400.897917 0.0000000
Phones-Machines -1274.3417787 -1465.598238 -1083.085319 0.0000000
Storage-Machines -1380.9627599 -1572.775146 -1189.150374 0.0000000
Supplies-Machines -1399.9031130 -1627.923278 -1171.882949 0.0000000
Tables-Machines -996.7585419 -1206.676532 -786.840552 0.0000000
Phones-Paper 313.9274423 230.809266 397.045619 0.0000000
Storage-Paper 207.3064612 122.916951 291.695972 0.0000000
Supplies-Paper 188.3661080 38.957964 337.774252 0.0015901
Tables-Paper 591.5106792 471.530846 711.490513 0.0000000
Storage-Phones -106.6209811 -199.317352 -13.924610 0.0078092
Supplies-Phones -125.5613343 -279.813794 28.691125 0.2813586
Tables-Phones 277.5832369 151.622179 403.544295 0.0000000
Supplies-Storage -18.9403532 -173.881566 136.000860 1.0000000
Tables-Storage 384.2042180 257.400644 511.007792 0.0000000
Tables-Supplies 403.1445712 226.282053 580.007090 0.0000000
GGTukey.2<-function(tukey_hsd2){
A<-require("tidyverse")
if(A==TRUE){
library(tidyverse)
} else {
install.packages("tidyverse")
library(tidyverse)
}
B<-as.data.frame(tukey_hsd2[1])
colnames(B)[2:4]<-c("min",
"max",
"p")
C<-data.frame(id=row.names(B),
min=B$min,
max=B$max,
idt=ifelse(B$p<0.05,
"significant",
"not significant")
)
D<-C%>%
ggplot(aes(id,color=idt))+
geom_errorbar(aes(ymin=min,
ymax=max),
width = 0.5,
size=1.25)+
labs(x=NULL,
color=NULL)+
scale_color_manual(values=c("red",
"green")
)+
coord_flip()+
theme(text=element_text(family="TimesNewRoman"),
title=element_text(color="black",size=15),
axis.text = element_text(color="black",size=10),
axis.title = element_text(color="black",size=10),
panel.grid=element_line(color="grey75"),
axis.line=element_blank(),
plot.background=element_rect(fill="white",color="white"),
panel.background=element_rect(fill="white"),
panel.border = element_rect(colour = "black", fill = NA,size=0.59),
legend.key= element_rect(color="white",fill="white")
)
return(D)
}
Make an interpretation of the results above.
GGTukey.2(tukey_hsd2)
ANOVA stands for analysis of variance and tests for differences in the effects of independent variables on a dependent variable. A two-way ANOVA test is a statistical test used to determine the effect of two nominal predictor variables on a continuous outcome variable.
Superstore_data$Category<-as.factor(Superstore_data$Category)
Superstore_data$Sub.Category<- as.factor((Superstore_data$Sub.Category))
linear_model<- lm(Sales~Category+Sub.Category, data = Superstore_data)
Two_factor_anova<-anova(linear_model)
Two_factor_anova
Analysis of Variance Table
Response: Sales
Df Sum Sq Mean Sq F value Pr(>F)
Category 2 195881745 97940872 314.67 < 0.00000000000000022 ***
Sub.Category 14 580372224 41455159 133.19 < 0.00000000000000022 ***
Residuals 9977 3105371543 311253
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic for Category is 314.67 with an associated p-value far less than 0.001, which is well below 0.01. This indicates a significant difference in mean Sales across the categories at the 1% level of significance. On the other hand, the F statistic for Sub.Category is 133.19 with an associated p-value far less than 0.001, which is also below 0.01, a clear indication that there is a significant difference in average Sales across the sub-categories at the 1% level of significance.
stargazer(Two_factor_anova, type="text")
=================================================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------------------------
Df 3 3,331.000 5,755.608 2 9,977
Sum Sq 3 1,293,875,171.000 1,580,537,112.000 195,881,745.000 3,105,371,543.000
Mean Sq 3 46,569,095.000 49,015,303.000 311,253.000 97,940,872.000
F value 2 223.927 128.325 133.188 314.666
Pr(> F) 2 0.000 0.000 0 0
---------------------------------------------------------------------------------
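Because every Sub.Category belongs to exactly one Category, an interaction between those two factors is not meaningful here. As an illustration of a two-way ANOVA with an interaction term, the sketch below uses Region as the second factor instead; this choice is an assumption made for demonstration and not part of the original analysis.
# Illustrative two-way ANOVA with an interaction term, using Region as the second factor
Superstore_data$Region <- as.factor(Superstore_data$Region)
interaction_model <- aov(Sales ~ Category * Region, data = Superstore_data)
summary(interaction_model)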
The Chi-square goodness of fit test is a statistical hypothesis test used to determine whether a variable is likely to come from a specified distribution or not. It is often used to evaluate whether sample data is representative of the full population.
Let us enter some data using the command below. The data, taken from LinkedIn, give the number of open jobs in four different professions.
jobs<-c(11091,11282,15378,12696)
Assign the following names to the categories:
names(jobs)<-c("Project Management","Supply chain","Services","Quality control")
Run the following command to view the available jobs from various categories of professions
jobs
Project Management Supply chain Services Quality control
11091 11282 15378 12696
It is important to calculate the proportion of job vacancies in various categories.
jobs/sum(jobs)
Project Management Supply chain Services Quality control
0.2198545 0.2236407 0.3048348 0.2516701
These proportions may change on a daily basis: new jobs keep opening as others close, so the proportions are prone to variation. We would therefore like to establish whether the differences between these proportions are statistically significant or simply due to chance. As a null scenario, assume the proportion is the same in all four categories, that is, approximately 25% in each. We create the object probability as shown below.
probability<-c(0.25,0.25,0.25,0.25)
For the chi-square goodness-of-fit test, the hypotheses are:
* Null hypothesis: the proportion of jobs in each category is not statistically different from 0.25.
* Alternative hypothesis: the proportion of jobs in at least one category is statistically different from 0.25.
The alternative hypothesis indicates that the proportion in at least one job category is below or above 0.25. Run the following code chunk to carry out the chi-square test.
chisq.test(jobs,p=probability)
Chi-squared test for given probabilities
data: jobs
X-squared = 930.89, df = 3, p-value < 0.00000000000000022
From the result, the chi-square test statistic is 930.89 with an associated p-value far less than 0.001, well below 0.05. There is therefore insufficient evidence to support the null hypothesis, and we conclude that the proportions of jobs are statistically different from 0.25, that is, below or above 0.25 in at least one category, at the 5% level of significance. The degrees of freedom are 3 because we have four categories (n - 1).
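The statistic can also be verified by hand from its definition, the sum over categories of (observed - expected)^2 / expected, where the expected counts are the total number of jobs multiplied by the hypothesized proportions; a minimal sketch:
# Manual verification of the chi-square goodness-of-fit statistic
expected <- sum(jobs) * probability
chi_sq <- sum((jobs - expected)^2 / expected)
chi_sq
# Associated p-value with (number of categories - 1) degrees of freedom
1 - pchisq(chi_sq, df = length(jobs) - 1)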
Students in a social studies class hypothesize that the literacy rates across the world for every region are 82%. The table below shows the actual literacy rates across the world broken down by region. What are the test statistic and the degrees of freedom?
illit_lavel<-c(99,99.5,67.3,62.5,91,93.8,61.9,91.9,84.5,66.4)
Assign the following names to the regions:
names(illit_lavel)<-c("Developed Regions","Commonwealth of Independent States","Northern Africa", "Sub-saharan Africa","Latin American and the caribbean","Eastern Asia","Southern Asia","South-Eastern Asia","Western Asia","Oceania")
Run the following command to view the literacy rates for the various regions.
illit_lavel
Developed Regions Commonwealth of Independent States
99.0 99.5
Northern Africa Sub-saharan Africa
67.3 62.5
Latin American and the caribbean Eastern Asia
91.0 93.8
Southern Asia South-Eastern Asia
61.9 91.9
Western Asia Oceania
84.5 66.4
It is important to calculate each region's proportion of the total of the reported literacy rates.
illit_lavel/sum(illit_lavel)
Developed Regions Commonwealth of Independent States
0.12105649 0.12166789
Northern Africa Sub-saharan Africa
0.08229396 0.07642455
Latin American and the caribbean Eastern Asia
0.11127415 0.11469797
Southern Asia South-Eastern Asia
0.07569088 0.11237466
Western Asia Oceania
0.10332600 0.08119345
chisq.test(illit_lavel)
Chi-squared test for given probabilities
data: illit_lavel
X-squared = 26.449, df = 9, p-value = 0.001725
1-pchisq(26.449,9)
[1] 0.001724323
From the result, the chi-square test statistic is 26.449 with an associated p-value of approximately 0.002, which is well below 0.05. There is therefore insufficient evidence to support the null hypothesis, and we conclude that literacy rates across the world are statistically different from 82%, that is, below or above 82% in at least one region, at the 5% level of significance. The degrees of freedom are 9 because we have ten regions (n - 1).
Employers want to know which days of the week employees are absent in a five-day work week. Most employers would like to believe that employees are absent equally during the week. Suppose a random sample of 60 managers were asked on which day of the week they had the highest number of employee absences. The results were distributed as in the table below. For the population of employees, do the days for the highest number of absences occur with equal frequencies during a five-day work week? Test at a 5% significance level.
absenteeism<-c(15,12,9,9,15)
Assign the following names to the days of the week:
names(absenteeism)<-c("Monday","Tuesday","Wednesday","Thursday","Friday")
Run the following command to view the reported absences for each day of the week.
absenteeism
Monday Tuesday Wednesday Thursday Friday
15 12 9 9 15
It is important to calculate the proportion of absences reported on each day.
absenteeism/sum(absenteeism)
Monday Tuesday Wednesday Thursday Friday
0.25 0.20 0.15 0.15 0.25
prob<-c(0.20,0.20,0.20,0.20,0.20)
chisq.test(absenteeism,p=prob)
Chi-squared test for given probabilities
data: absenteeism
X-squared = 3, df = 4, p-value = 0.5578
1- pchisq(3,4)
[1] 0.5578254
From the result, the chi-square test statistic is 3 with an associated p-value of approximately 0.5578, which is well above 0.05. We therefore fail to reject the null hypothesis: at the 5% level of significance, the sample data do not provide sufficient evidence to conclude that the days with the highest number of absences occur with unequal frequencies. In other words, the data are consistent with absences being spread equally across the five-day work week.
## For further practice consider the example below.
knitr::include_graphics("chi.png")
Distribution of voters
The Chi-square statistic is commonly used for testing relationships between categorical variables. The null hypothesis of the Chi-square test is that no relationship exists between the categorical variables in the population; they are independent. An example research question that could be answered using a Chi-square analysis would be:
* Is there a significant relationship between voter intent and political party membership?
The Chi-Square statistic is most commonly used to evaluate Tests of Independence when using a cross tabulation (also known as a bivariate table). Cross tabulation presents the distributions of two categorical variables simultaneously, with the intersections of the categories of the variables appearing in the cells of the table. The Test of Independence assesses whether an association exists between the two variables by comparing the observed pattern of responses in the cells to the pattern that would be expected if the variables were truly independent of each other. Calculating the Chi-Square statistic and comparing it against a critical value from the Chi-Square distribution allows the researcher to assess whether the observed cell counts are significantly different from the expected cell counts.
In other words, the Chi-square test of independence checks whether two variables are likely to be related or not. We have counts for two categorical or nominal variables, together with an idea that the two variables are not related, and the test gives us a way to decide whether that idea is plausible. For instance, assume that we have two variables, gender and voting. Because we have two variables, we shall use the rbind function to store the counts in an object called voters.
voters<-rbind(c(1486,2131),c(2792,3591))
voters
[,1] [,2]
[1,] 1486 2131
[2,] 2792 3591
dimnames(voters)<-list(gender=c("male","female"),
voting=c("voted","didn't vote")
)
Run the code chunk to look at the data we have just entered
voters
voting
gender voted didn't vote
male 1486 2131
female 2792 3591
Run the code chunk below to run the chi square test of independence
chisq.test(voters)
Pearson's Chi-squared test with Yates' continuity correction
data: voters
X-squared = 6.5523, df = 1, p-value = 0.01047
From the result, the chi-square test statistic is 6.5523 with an associated p-value of approximately 0.01, which is below 0.05. There is therefore insufficient evidence to support the null hypothesis, and we conclude that voting is significantly associated with gender at the 5% level of significance. We can reject the null hypothesis with 95% confidence.
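It can also be informative to inspect the expected counts and the Pearson residuals behind the test; a minimal sketch that stores the test object first:
# Store the test object to examine expected counts and Pearson residuals
voters_test <- chisq.test(voters)
voters_test$expected   # counts expected under independence
voters_test$residuals  # Pearson residuals: (observed - expected) / sqrt(expected)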
Consider the information below
knitr::include_graphics("chi2.png")
Automobile owners
Let us enter the information in RStudio.
Buyer<-rbind(c(70,130,150),c(30,20,100))
Give names to the dimension columns and rows using the following code chunk
dimnames(Buyer)<-list(Gender=c("Male","Female"),
Reason=c("Styling","Engineering","Fuel economy"))
Buyer
Reason
Gender Styling Engineering Fuel economy
Male 70 130 150
Female 30 20 100
chisq.test(Buyer)
Pearson's Chi-squared test
data: Buyer
X-squared = 31.746, df = 2, p-value = 0.0000001278
From the result, the chi-square test statistic is 31.746 with an associated p-value of approximately 0.0000001, far below 0.05. There is therefore insufficient evidence to support the null hypothesis, and we conclude that the reason for purchase depends significantly on gender at the 5% level of significance; we can reject the null hypothesis with 95% confidence. The output above has 2 degrees of freedom, calculated as follows;
\[ (\text{Number of columns}-1)\times(\text{Number of rows}-1) = (3-1)(2-1) = 2 \times 1 = 2 \]
A random sample of 400 people were surveyed and each person was asked to report the highest education level they obtained. The data that resulted from this survey is summarized in the following contingency table.
Education<-rbind(c(60,56,46,42),c(40,46,53,57))
Give names to the dimension columns and rows using the following code chunk
dimnames(Education)<-list(Gender=c("Male","Female"),
Educ_level=c("High school","Bachelors","Masters","PhD"))
Education
Educ_level
Gender High school Bachelors Masters PhD
Male 60 56 46 42
Female 40 46 53 57
chisq.test(Education)
Pearson's Chi-squared test
data: Education
X-squared = 7.5911, df = 3, p-value = 0.05526
From the result, the chi-square test statistic is 7.5911 with an associated p-value of approximately 0.055, which is slightly above 0.05. We therefore fail to reject the null hypothesis and conclude that there is no statistically significant dependence of education level on gender at the 5% level of significance. We retain the null hypothesis with 95% confidence.
Ch_sqt <- read_sav("C:/Users/user/Downloads/Effects of employees voice mechanism on academic staff turnover intentions1.sav")
attach(Ch_sqt)
head(Ch_sqt,10)
# A tibble: 10 × 141
ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 ESa
<dbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l>
1 1 2 [fem… 5 [51-… 2 [Ph.… 2 [con… 2 [Ass… 2 [Ken… 3 [11-… 4 [6-1… 1 [str…
2 2 1 [mal… 3 [31-… 2 [Ph.… 1 [Per… 3 [Sen… 1 [Uni… 4 [21-… 3 [4-5… 3 [Nei…
3 3 1 [mal… 5 [51-… 3 [Mas… 1 [Per… 3 [Sen… 2 [Ken… 2 [2-1… 2 [2-3… 4 [Agr…
4 4 1 [mal… 4 [41-… 2 [Ph.… 1 [Per… 3 [Sen… 2 [Ken… 2 [2-1… 5 [11-… 3 [Nei…
5 5 1 [mal… 4 [41-… 2 [Ph.… 2 [con… 4 [Lec… 1 [Uni… 6 [41-… 4 [6-1… 3 [Nei…
6 6 1 [mal… 5 [51-… 1 [Pos… 2 [con… 4 [Lec… 1 [Uni… 4 [21-… 2 [2-3… 3 [Nei…
7 7 1 [mal… 4 [41-… 2 [Ph.… 1 [Per… 4 [Lec… 1 [Uni… 2 [2-1… 2 [2-3… 1 [str…
8 8 1 [mal… 2 [21-… 3 [Mas… 2 [con… 4 [Lec… 4 [Not… 2 [2-1… 6 [21 … 2 [Dis…
9 9 2 [fem… 4 [41-… 2 [Ph.… 1 [Per… 4 [Lec… 1 [Uni… 2 [2-1… 2 [2-3… 5 [Str…
10 10 1 [mal… 3 [31-… 3 [Mas… 1 [Per… 4 [Lec… 2 [Ken… 2 [2-1… 5 [11-… 2 [Dis…
# … with 131 more variables: ESb <dbl+lbl>, ESc <dbl+lbl>, ESd <dbl+lbl>,
# ESe <dbl+lbl>, ESf <dbl+lbl>, ESg <dbl+lbl>, ESh <dbl+lbl>, ESi <dbl+lbl>,
# ERa <dbl+lbl>, ERb <dbl+lbl>, ERc <dbl+lbl>, ERd <dbl+lbl>, ERe <dbl+lbl>,
# ERf <dbl+lbl>, ERg <dbl+lbl>, QCa <dbl+lbl>, QCb <dbl+lbl>, QCc <dbl+lbl>,
# QCd <dbl+lbl>, QCe <dbl+lbl>, QCf <dbl+lbl>, QCg <dbl+lbl>, QCh <dbl+lbl>,
# QCi <dbl+lbl>, QCj <dbl+lbl>, QCk <dbl+lbl>, SSa <dbl+lbl>, SSb <dbl+lbl>,
# SSc <dbl+lbl>, SSd <dbl+lbl>, SSe <dbl+lbl>, SSf <dbl+lbl>, …
TBL<-table(Ch_sqt$Q4, Ch_sqt$Q6)
TBL
1 2 3 4
1 155 16 5 52
2 53 3 2 67
The two-way table above has no column or row names, so it is important to provide them to make the dimensions easier to read and identify.
dimnames(TBL)<-list(Employment_terms=c("Permanent","Contractual"),
Trade_union=c("UASU","KUSU","Any other","Not a member of any"))
TBL
Trade_union
Employment_terms UASU KUSU Any other Not a member of any
Permanent 155 16 5 52
Contractual 53 3 2 67
Run the following code chunk to run the chi square test
chisq.test(TBL)
Pearson's Chi-squared test
data: TBL
X-squared = 35.018, df = 3, p-value = 0.0000001208
From the result, the chi-square test statistic is 35.018 with an associated p-value far less than 0.001, well below 0.05. There is therefore insufficient evidence to support the null hypothesis, and we conclude that trade union membership depends significantly on the terms of employment at the 5% level of significance. We can reject the null hypothesis with 95% confidence.
TAB<-table(Ch_sqt$Q6)
TAB
1 2 3 4
208 19 7 119
dimnames(TAB)<-list(Trade_Union=c("UASU","KUSU","Member of any other","Not a member of any"))
TAB
Trade_Union
UASU KUSU Member of any other Not a member of any
208 19 7 119
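As a further exercise, the one-way table above can itself be tested against equal membership proportions with a chi-square goodness-of-fit test; a minimal sketch:
# Test whether the four trade union categories occur with equal proportions
chisq.test(TAB)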
Statistical techniques are tools that enable us to answer questions about possible patterns in empirical data. It is not surprising, then, to learn that many important techniques of statistical analysis were developed by scientists who were interested in answering very specific empirical questions. So it was with regression analysis. The history of this particular statistical technique can be traced back to late nineteenth-century England and the pursuits of a gentleman scientist, Francis Galton. Galton was born into a wealthy family that produced more than its share of geniuses; he and Charles Darwin, the famous biologist, were first cousins. During his lifetime, Galton studied everything from fingerprint classification to meteorology, but he gained widespread recognition primarily for his work on inheritance. His most important insight came to him while he was studying the inheritance of one of the most obvious of all human characteristics: height. In order to understand how the characteristic of height was passed from one generation to the next, Galton collected data on the heights of individuals and the heights of their parents. After constructing frequency tables that classified these individuals both by their height and by the average height of their parents, Galton came to the unremarkable conclusion that tall people usually had tall parents and short people usually had short parents.
In order to run a linear regression, it is important to first establish that there is a linear association between the two variables; for instance, one can create a scatter plot of the two variables to assess the linear association. Consider the scatter plot below for Sales and Profit.
plot(Sales, Profit, main = "A scatter plot of Sales and Profit")
A scatter plot of Sales and Profit
plot(log(Sales),Profit)
plot(Sales, log(Profit))
plot(log(Sales),log(Profit))
Scatter plot of sales and profit
cor(Sales,Profit)
[1] 0.4790643
reg_model<-lm(Profit~Sales, data = Superstore_data)
summary(reg_model)
Call:
lm(formula = Profit ~ Sales, data = Superstore_data)
Residuals:
Min 1Q Median 3Q Max
-7397.5 2.6 14.6 21.7 5261.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.732867 2.192459 -5.808 0.00000000653 ***
Sales 0.180067 0.003301 54.555 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 205.6 on 9992 degrees of freedom
Multiple R-squared: 0.2295, Adjusted R-squared: 0.2294
F-statistic: 2976 on 1 and 9992 DF, p-value: < 0.00000000000000022
stargazer(reg_model, type="text")
===============================================
Dependent variable:
---------------------------
Profit
-----------------------------------------------
Sales 0.180***
(0.003)
Constant -12.733***
(2.192)
-----------------------------------------------
Observations 9,994
R2 0.230
Adjusted R2 0.229
Residual Std. Error 205.639 (df = 9992)
F Statistic 2,976.247*** (df = 1; 9992)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
anova(reg_model)
Analysis of Variance Table
Response: Profit
Df Sum Sq Mean Sq F value Pr(>F)
Sales 1 125857839 125857839 2976.2 < 0.00000000000000022 ***
Residuals 9992 422535997 42287
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(reg_model)
Residuals vs Fitted
Normal Q-Q
Scale-Location
Residuals vs Leverage
confint(reg_model)
2.5 % 97.5 %
(Intercept) -17.0305284 -8.4352058
Sales 0.1735967 0.1865366
Resid<-residuals(reg_model)
summary(Resid)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-7397.542 2.581 14.634 0.000 21.658 5261.551
From the results above, the residuals have a mean of 0.00, indicating that the zero conditional mean condition holds in the fitted sample. Next, we examine the covariance between the residuals and the independent variable; by construction, OLS residuals should be uncorrelated with the regressor.
cov(Resid,Sales)
[1] -0.00000000005062127
The covariance is approximately zero, confirming that the residuals are uncorrelated with Sales, as the least-squares fit requires. Note that this orthogonality holds by construction, so it validates the fit rather than establishing that changes in Sales cause changes in Profit.
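Further assumption checks can be run on the fitted model; the sketch below uses two functions from the car package, which is already loaded above.
# Test for non-constant error variance (heteroscedasticity) in the residuals
car::ncvTest(reg_model)
# Test for first-order autocorrelation in the residuals
car::durbinWatsonTest(reg_model)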
Resid<-ts(Resid, start = 1265, frequency = 1)
ts.plot(Resid, main="A time series plot of regression residuals",xlab="Time in years",ylab="Residuals")
abline(0,-0.00000000005062127)
Regression residuals
hist(log(Sales),main="Histogram showing the distribution of sales from Superstore",xlab="log of sales", ylab="Frequencies",breaks = 15)
summary(log(Sales))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.8119 2.8495 3.9980 4.1098 5.3468 10.0274
ggplot(data=Superstore_data, aes(log(Sales))) +
geom_histogram(aes(y = ..density..),color="red")+
scale_x_continuous(limits = c(-2,12))+
stat_function(fun = dnorm,
args = list(mean = mean(log(Sales)),
sd = sd(log(Sales))),
col = "#1b98e0",
size = 1)+
labs(title = "Histogram showing the distribution of
Sales for Superstore Company",
subtitle = "Sales from 2016 to 2019",
caption="Data Collected and compiled by the Superstore Company")
Histogram showing the distribution of sales
ggplot(data=Superstore_data, aes(log(Sales))) +
geom_histogram(aes(y = ..density..),color="red")+
scale_x_continuous(limits = c(-2,12))+
stat_function(fun = dnorm,
args = list(mean = mean(log(Sales)),
sd = sd(log(Sales))),
col = "#1b98e0",
size = 1)+
labs(title = "Histogram showing the distribution of Sales for
Superstore Company",
subtitle = "Sales from 2016 to 2019",
caption="Data Collected and compiled by the Superstore Company")+
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.subtitle = element_text(hjust = 0.5))
Histogram showing the distribution of Sales
Summarize the average sales from various regions
group_mean_sales <- aggregate(Sales ~ Region, data = Superstore_data, mean)
group_mean_sales
Region Sales
1 Central 215.7727
2 East 238.3361
3 South 241.8036
4 West 226.4932
# Create data for the graph.
average_sales <- c(215.7727, 238.3361, 241.8036, 226.4932)
Regions <- c("Central","East","South","West")
piepercent<- round(100*average_sales/sum(average_sales), 1)
# Plot the chart.
pie3D(average_sales,labels = Regions,explode = 0.06, main = "Pie Chart of
Sales from Various Regions of the United States")
average_sales <- c(215.7727, 238.3361, 241.8036, 226.4932)
Regions <- c("Central","East","South","West")
piepercent<- round(100*average_sales/sum(average_sales), 1)
pie(average_sales, labels = piepercent, main = "Pie Chart of Sales from
Various regions of the United States",col = rainbow(length(average_sales)))
legend("topright", c("Central","East","South","West"), cex = 0.8,
fill = rainbow(length(average_sales)))
Pie chart of sales for various regions
Summarize the average sales from various categories
group_mean_sales1 <- aggregate(Sales ~ Category, data = Superstore_data, mean)
group_mean_sales1
Category Sales
1 Furniture 349.8349
2 Office Supplies 119.3241
3 Technology 452.7093
pie_data <- data.frame(
category = c("Furniture", "Office Supplies", "Technology"),
value = c(349.8349, 119.3241, 452.7093)
)
# Create ggplot object
pie_chart <- ggplot(pie_data, aes(x = "", y = value, fill = category))
# Create pie chart
pie_chart + geom_bar(stat = "identity") + coord_polar(theta = "y")+
labs(title = "Pie Chart Average Sales Across Categories")
# Customize pie chart
pie_chart +
geom_bar(stat = "identity") +
coord_polar(theta = "y") +
labs(title = "Pie Chart", fill = "Category") +
scale_fill_manual(values = c("blue", "green", "orange")) +
theme_void()+
labs(title = "Pie Chart Average Sales Across Categories")
# Create data for the graph.
average_sales <- c(349.8349, 119.3241,452.7093)
Sales_category <- c("Furniture","Office Supplies","Technology")
piepercent<- round(100*average_sales/sum(average_sales), 1)
# Plot the chart.
pie3D(average_sales,labels = Sales_category,explode = 0.06, main = "Pie Chart of
Sales from Various Categories")
pie(average_sales, labels = piepercent, main = "Pie Chart of Sales from
Various Categories",col = rainbow(length(average_sales)))
legend("topright", c("Furniture","Office Supplies","Technology"), cex = 0.8,
fill = rainbow(length(average_sales)))
Pie chart of sales for various categories
average_sales1 <- c(349.8349, 119.3241, 452.7093)
Sales_category1 <- c("Furniture","Office Supplies","Technology")
barplot(average_sales1,names.arg=Sales_category1,xlab="Category",ylab="Average Sales",
col="blue",
main="Bar graph for the average sales for various category",border="red")
Bar graph for various categories
group_mean_sales2 <- aggregate(Sales ~ Sub.Category, data = Superstore_data, mean)
group_mean_sales2
Sub.Category Sales
1 Accessories 215.97460
2 Appliances 230.75571
3 Art 34.06883
4 Binders 133.56056
5 Bookcases 503.85963
6 Chairs 532.33242
7 Copiers 2198.94162
8 Envelopes 64.86772
9 Fasteners 13.93677
10 Furnishings 95.82567
11 Labels 34.30305
12 Machines 1645.55331
13 Paper 57.28409
14 Phones 371.21153
15 Storage 264.59055
16 Supplies 245.65020
17 Tables 648.79477
barplot(group_mean_sales2$Sales, names.arg = as.character(group_mean_sales2$Sub.Category),
xlab="Sub category", ylab="Average Sales", col="grey",
main="Bar graph for the average sales for various sub categories", border="blue")
Bar graph for various sub categories
ggplot(Superstore_data,aes(x=Sub.Category, y=Sales))+
geom_bar(stat = "identity")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and
Sub categories")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(Superstore_data,aes(x=Sub.Category, y=Sales))+
stat_summary(fun = mean, geom = "bar")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")+
theme(axis.text.x = element_markdown(angle = 90))
Grouped Barplot of Sales for various Categories and Sub categories
Superstore_data %>%
filter(Category == "Furniture"|
Category == "Office Supplies"|
Category == "Technology")%>%
ggplot(aes(Category, fill= Sub.Category))+
geom_bar(position = "dodge",
alpha=0.5)+
theme_economist()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
labs(title = "Grouped Bar Graph",
y="Number",
x="Category")
Grouped bar graph
Superstore_data %>%
filter(Category == "Furniture"|
Category == "Office Supplies"|
Category == "Technology")%>%
ggplot(aes(Category, fill= Sub.Category))+
geom_bar(position = "fill",
alpha=0.5)+
theme_economist()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
labs(title = "Grouped Bar Graph",
y="Number",
x="Category")
Stacked Bar graphs 100%
Superstore_data %>%
filter(Category == "Furniture"|
Category == "Office Supplies"|
Category == "Technology")%>%
ggplot(aes(Category, fill= Sub.Category))+
geom_bar(alpha=0.5)+
theme_economist_white()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
labs(title = "Grouped Bar Graph",
y="Frequency",
x="Category")
Stacked bar graph
ggplot(Superstore_data,aes(x=Category, y=mean(Sales), fill=Sub.Category))+
geom_bar(stat = "identity")+
theme_calc()
Stacked Bar Graph for proportion
ggplot(Superstore_data,aes(x=Category, y=Sales, fill=Sub.Category))+
geom_bar(stat = "identity", position = "dodge")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")
Grouped Barplot of Sales for various Categories and Sub categories
ggplot(Superstore_data,aes(x=Category, y=Sales, fill=Sub.Category))+
geom_bar(stat = "identity", position = "dodge")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")+
facet_wrap(Category~.)+
theme(axis.text.x = element_markdown(angle = 90))
Grouped Barplot of Sales for various Categories and Sub categories
ggplot(Superstore_data,aes(x=Category, y=Sales, fill=Sub.Category))+
geom_bar(stat = "identity", position = "dodge")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")+
theme(legend.position = "none")+
facet_wrap(Category~.)
Grouped Barplot of Sales for various Categories and Sub categories
unemploymnt_data<-read.csv("C:\\Users\\user\\Downloads\\Unemployment.csv")
attach(unemploymnt_data)
head(unemploymnt_data, 10)
year Unemployment Inflation FedRate
1 1859 5.133333 0.9084719 3.933333
2 1860 5.233333 1.8107772 3.696667
3 1861 5.533333 1.6227203 2.936667
4 1862 6.266667 1.7953352 2.296667
5 1863 6.800000 0.5370330 2.003333
6 1864 7.000000 0.7149242 1.733333
7 1865 6.766667 0.8918621 1.683333
8 1866 6.200000 1.0676163 2.400000
9 1867 5.633333 2.3034394 2.456667
10 1868 5.533333 1.2348411 2.606667
tail(unemploymnt_data,10)
year Unemployment Inflation FedRate
155 2013 4.533333 1.395350 5.533333
156 2014 4.433333 1.081917 4.860000
157 2015 4.266667 1.694265 4.733333
158 2016 4.300000 1.342605 4.746667
159 2017 4.233333 1.376280 5.093333
160 2018 4.100000 1.789712 5.306667
161 2019 4.033333 3.668536 5.680000
162 2020 4.000000 2.102699 6.273333
163 2021 4.066667 1.868115 6.520000
164 2022 3.966667 1.748108 6.473333
This is annual data covering the years 1859 to 2022.
attach(unemploymnt_data)
Unemployment<-ts(unemploymnt_data$Unemployment,frequency = 1,start = 1859)
ts.plot(Unemployment,main="Yearly time series plot of Unemployment from 1859 to 2022", xlab="Time in years",ylab="Unemployment rate")
Yearly time series plot of Unemployment from 1859 to 2022
Inflation<-ts(unemploymnt_data$Inflation,frequency = 1,start = 1859)
ts.plot(Inflation,main="Yearly time series plot of Inflation from 1859 to 2022", xlab="Time in years",ylab="Inflation rate")
Yearly time series plot of Inflation from 1859 to 2022
FedRate<-ts(unemploymnt_data$FedRate,frequency = 1,start = 1859)
ts.plot(FedRate,main="Yearly time series plot of FedRate from 1859 to 2022", xlab="Time in years",ylab="Fed Funds rate")
Yearly time series plot of FedRate from 1859 to 2022
plot(Unemployment,Inflation)
Scatter plot of unemployment and inflation
plot(FedRate,Inflation)
Scatter plot of FedRate and Inflation
cor(Inflation,FedRate)
[1] 0.6716283
cor(Inflation,Unemployment)
[1] 0.2172026
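The values above are point estimates only. If a formal significance check is wanted, base R's cor.test() reports each correlation together with a t-test and confidence interval; a minimal sketch:
# Test whether each correlation with Inflation differs from zero
cor.test(Inflation, FedRate)
cor.test(Inflation, Unemployment)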
hist(Inflation, main="Histogram showing the distribution of Inflation")
Histogram showing the distribution of Inflation
hist(Unemployment, main="Histogram showing the distribution of Unemployment")
Histogram showing the distribution of Unemployment
hist(FedRate, main = "Histogram showing the distribution of FedRate")
Histogram showing the distribution of FedRate
shapiro.test(Inflation)
Shapiro-Wilk normality test
data: Inflation
W = 0.91272, p-value = 0.00000002459
shapiro.test(Unemployment)
Shapiro-Wilk normality test
data: Unemployment
W = 0.96757, p-value = 0.0006884
shapiro.test(FedRate)
Shapiro-Wilk normality test
data: FedRate
W = 0.9102, p-value = 0.00000001704
Next, we estimate the model using the log-transformed values of Inflation, Unemployment, and FedRate to improve the linearity of the relationship between the dependent variable and the independent variables.
REG_MODEL<-lm(log(Inflation)~log(Unemployment)+log(FedRate),data = unemploymnt_data)
stargazer(REG_MODEL, report = "vc*stp",type = "text",out = "./q7results.txt")
===============================================
Dependent variable:
---------------------------
log(Inflation)
-----------------------------------------------
log(Unemployment) 0.176
(0.157)
t = 1.127
p = 0.262
log(FedRate) 0.976***
(0.085)
t = 11.555
p = 0.000
Constant -0.902***
(0.291)
t = -3.102
p = 0.003
-----------------------------------------------
Observations 164
R2 0.473
Adjusted R2 0.466
Residual Std. Error 0.498 (df = 161)
F Statistic 72.167*** (df = 2; 161)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
outlierTest(REG_MODEL)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p
14 -3.172999 0.0018094 0.29673
The Bonferroni-adjusted p-value (0.297) is not significant at the 5% level, so there is no strong evidence of influential outliers in the data set.
qqPlot(REG_MODEL, main= "QQ Plot Showing the Possible Presence of outliers")
QQ Plot Showing the Possible Presence of outliers
[1] 14 15
leveragePlots(REG_MODEL)
Leverage plots for the regression model
vif(REG_MODEL)
log(Unemployment) log(FedRate)
1.034369 1.034369
A VIF of approximately 1.03 for both log(Unemployment) and log(FedRate) indicates that the predictors are essentially uncorrelated, so multicollinearity is not a concern.
ncvTest(REG_MODEL)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 2.623628, Df = 1, p = 0.10528
With a p-value of 0.105, we fail to reject the null hypothesis of constant error variance, so there is no evidence of heteroskedasticity.
spreadLevelPlot(REG_MODEL)
Spread-Level Plot for Regression Model
Suggested power transformation: 1.024086
durbinWatsonTest(REG_MODEL)
lag Autocorrelation D-W Statistic p-value
1 0.7001071 0.5736665 0
Alternative hypothesis: rho != 0
The Durbin-Watson statistic (0.57, p-value approximately 0) indicates strong positive autocorrelation in the regression residuals across periods.
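Given the serial correlation detected above, heteroskedasticity-robust (HC) standard errors alone may not be sufficient. A minimal sketch of heteroskedasticity- and autocorrelation-consistent (HAC, Newey-West) standard errors, assuming the sandwich package is installed alongside lmtest:
library(sandwich)   # provides NeweyWest()
library(lmtest)     # provides coeftest()
# Re-test the coefficients using a Newey-West HAC covariance matrix
coeftest(REG_MODEL, NeweyWest(REG_MODEL))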
coeftest(REG_MODEL, hccm(REG_MODEL, type = "hc0"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.901611 0.272771 -3.3054 0.001169 **
log(Unemployment) 0.176483 0.148486 1.1885 0.236368
log(FedRate) 0.976433 0.074715 13.0688 < 0.00000000000000022 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
stargazer(coeftest(REG_MODEL, hccm(REG_MODEL, type = "hc0")),type="text")
=============================================
Dependent variable:
---------------------------
---------------------------------------------
log(Unemployment) 0.176
(0.148)
log(FedRate) 0.976***
(0.075)
Constant -0.902***
(0.273)
=============================================
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
RSD<-residuals(REG_MODEL)
summary(RSD)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.52073 -0.35028 -0.01588 0.00000 0.35593 1.13292
The residual mean is approximately zero and, as the covariances below confirm, the residuals are uncorrelated with the regressors, so the zero conditional mean assumption appears reasonable.
cov(RSD,log(FedRate))
[1] 0.00000000000000007316962
cov(RSD,log(Unemployment))
[1] 0.0000000000000001019151
RSD<-ts(RSD, frequency = 1,start = 1859)
ts.plot(RSD)
Times series of residual
ts.plot(RSD)
abline(0,0.000)
Time series plot of residuals with an abline line
ggplot(data=unemploymnt_data,aes(x=year,y=Inflation))+geom_line()
Times series plot of inflation
ggplot(data=unemploymnt_data,aes(x=year,y=Inflation))+geom_line()+
labs(title="Time Series plot of Inflation Rate",
caption="source:Federal Reserve",
y="Inflation Rate", x="Year",
color=3) + # title and caption
theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 12), # rotate x axis text
axis.title=element_text(size=12,face="bold"),
panel.grid = element_blank())+
#theme(panel.grid.minor = element_blank())+#turn off minor grid(to run remove #be4 theme)
theme(legend.text = element_text(size=12,face="bold"))+
theme_set(theme_economist())
Time Series plot of Inflation Rate
ggplot(data=unemploymnt_data,aes(x=year,y=Unemployment))+geom_line()+
labs(title="Time Series plot of Unemployment Rate",
caption="source:Federal Reserve",
y="Unemployment Rate", x="Year",
color=3) + # title and caption
theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 12), # rotate x axis text
axis.title=element_text(size=12,face="bold"),
panel.grid = element_blank())+
#theme(panel.grid.minor = element_blank())+#turn off minor grid(to run remove #be4 theme)
theme(legend.text = element_text(size=12,face="bold"))+
theme_set(theme_economist())
Time Series plot of Unemployment Rate
ggplot(data=unemploymnt_data,aes(x=year,y=FedRate))+geom_line()+
labs(title="Time Series plot of Fed Fund Rate",
caption="source:Federal Reserve",
y="Fed Fund Rate", x="Year",
color=3) + # title and caption
theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 12), # rotate x axis text
axis.title=element_text(size=12,face="bold"),
panel.grid = element_blank())+
#theme(panel.grid.minor = element_blank())+#turn off minor grid(to run remove #be4 theme)
theme(legend.text = element_text(size=12,face="bold"))+
theme_set(theme_economist())
Time Series plot of Fed Fund Rate
ggplot(data=unemploymnt_data,aes(x=year))+
geom_line(aes(y=Inflation,colour="Inflation Rate"))+
geom_line(aes(y=FedRate,colour="FedRate"))+
geom_line(aes(y=Unemployment,colour="Unemployment Rate"))+
labs(title="Trends of Inflation, Unemployment and FedRate from 1859 to 2022",
caption="", y="Rate", x="Time in Years", color="Key")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 10))
Trends of Inflation, Unemployment and FedRate from 1859 to 2022
date<-seq(as.Date("1859-01-01"),by="1 year",length.out=length(unemploymnt_data$year))
date
[1] "1859-01-01" "1860-01-01" "1861-01-01" "1862-01-01" "1863-01-01"
[6] "1864-01-01" "1865-01-01" "1866-01-01" "1867-01-01" "1868-01-01"
[11] "1869-01-01" "1870-01-01" "1871-01-01" "1872-01-01" "1873-01-01"
[16] "1874-01-01" "1875-01-01" "1876-01-01" "1877-01-01" "1878-01-01"
[21] "1879-01-01" "1880-01-01" "1881-01-01" "1882-01-01" "1883-01-01"
[26] "1884-01-01" "1885-01-01" "1886-01-01" "1887-01-01" "1888-01-01"
[31] "1889-01-01" "1890-01-01" "1891-01-01" "1892-01-01" "1893-01-01"
[36] "1894-01-01" "1895-01-01" "1896-01-01" "1897-01-01" "1898-01-01"
[41] "1899-01-01" "1900-01-01" "1901-01-01" "1902-01-01" "1903-01-01"
[46] "1904-01-01" "1905-01-01" "1906-01-01" "1907-01-01" "1908-01-01"
[51] "1909-01-01" "1910-01-01" "1911-01-01" "1912-01-01" "1913-01-01"
[56] "1914-01-01" "1915-01-01" "1916-01-01" "1917-01-01" "1918-01-01"
[61] "1919-01-01" "1920-01-01" "1921-01-01" "1922-01-01" "1923-01-01"
[66] "1924-01-01" "1925-01-01" "1926-01-01" "1927-01-01" "1928-01-01"
[71] "1929-01-01" "1930-01-01" "1931-01-01" "1932-01-01" "1933-01-01"
[76] "1934-01-01" "1935-01-01" "1936-01-01" "1937-01-01" "1938-01-01"
[81] "1939-01-01" "1940-01-01" "1941-01-01" "1942-01-01" "1943-01-01"
[86] "1944-01-01" "1945-01-01" "1946-01-01" "1947-01-01" "1948-01-01"
[91] "1949-01-01" "1950-01-01" "1951-01-01" "1952-01-01" "1953-01-01"
[96] "1954-01-01" "1955-01-01" "1956-01-01" "1957-01-01" "1958-01-01"
[101] "1959-01-01" "1960-01-01" "1961-01-01" "1962-01-01" "1963-01-01"
[106] "1964-01-01" "1965-01-01" "1966-01-01" "1967-01-01" "1968-01-01"
[111] "1969-01-01" "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01"
[116] "1974-01-01" "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01"
[121] "1979-01-01" "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01"
[126] "1984-01-01" "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01"
[131] "1989-01-01" "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01"
[136] "1994-01-01" "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01"
[141] "1999-01-01" "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01"
[146] "2004-01-01" "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01"
[151] "2009-01-01" "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01"
[156] "2014-01-01" "2015-01-01" "2016-01-01" "2017-01-01" "2018-01-01"
[161] "2019-01-01" "2020-01-01" "2021-01-01" "2022-01-01"
ggplot(data=unemploymnt_data,aes(x=date))+
geom_line(aes(y=Inflation,colour="Inflation Rate"))+
labs(title="Trends of Inflation Rate from 1859 to 2022",
caption="", y="Inflation Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y", breaks = "4 year")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Trends of Inflation Rate from 1859 to 2022
ggplot(data=unemploymnt_data,aes(x=date))+
geom_line(aes(y=Unemployment,colour="Unemployment Rate"))+
labs(title="Trends of Unemployment Rate from 1859 to 2022",
caption="", y="Unemployment Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y", breaks = "4 year")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
ggplot(data=unemploymnt_data,aes(x=date))+
geom_line(aes(y=FedRate,colour="Fed Fund Rate"))+
labs(title="Trends of Fed Fund Rate from 1859 to 2022",
caption="", y="Fed Fund Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y", breaks = "4 year")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Trends of Fed Fund Rate from 1859 to 2022
ggplot(data=unemploymnt_data,aes(x=date))+
geom_line(aes(y=FedRate,colour="Fed Fund Rate"))+
geom_line(aes(y=Inflation,colour="Inflation Rate"))+
geom_line(aes(y=Unemployment,colour="Unemployment Rate"))+
labs(title="Trends of Fed Fund Rate, Inflation Rate and Unemployment
Rate from 1859 to 2022",
caption="", y="Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y", breaks = "4 year")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Trends of Fed Rate, Inflation and Unemployment from 1859 to 2022
plot_1<-ggplot(data=unemploymnt_data,aes(x=date,y=Inflation))+geom_line()+
labs(title="Inflation Rate from 1859 to 2022",
caption="", y="Inflation Rate", x="Time in Years", color=3)+
scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
plot_2<-ggplot(data=unemploymnt_data,aes(x=date,y=Unemployment))+geom_line()+
labs(title="Unemployment Rate from 1859 to 2022",
caption="", y="Unemployment Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
plot_3<-ggplot(data=unemploymnt_data,aes(x=date,y=FedRate))+geom_line()+
labs(title="Fed Fund Rates from 1859 to 2022",
caption="", y="Fed Fund Rates", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),ggplotGrob(plot_2),ggplotGrob(plot_3),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),ggplotGrob(plot_2),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),ggplotGrob(plot_3),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_2),ggplotGrob(plot_3),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_2),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_3),size="last"))
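An equivalent, somewhat more concise layout can be produced with gridExtra::grid.arrange(); a minimal sketch, assuming the gridExtra package is installed:
library(gridExtra)
# Stack the three yearly plots in a single column
grid.arrange(plot_1, plot_2, plot_3, ncol = 1)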
data11<-data.frame(unemploymnt_data$year,unemploymnt_data$Inflation,unemploymnt_data$FedRate,unemploymnt_data$Unemployment,RSD)
head(data11,10)
unemploymnt_data.year unemploymnt_data.Inflation unemploymnt_data.FedRate
1 1859 0.9084719 3.933333
2 1860 1.8107772 3.696667
3 1861 1.6227203 2.936667
4 1862 1.7953352 2.296667
5 1863 0.5370330 2.003333
6 1864 0.7149242 1.733333
7 1865 0.8918621 1.683333
8 1866 1.0676163 2.400000
9 1867 2.3034394 2.456667
10 1868 1.2348411 2.606667
unemploymnt_data.Unemployment RSD
1 5.133333 -0.82027544
2 5.233333 -0.07333963
3 5.533333 0.03190288
4 6.266667 0.35104981
5 6.800000 -0.73682650
6 7.000000 -0.31447083
7 6.766667 -0.05877223
8 6.200000 -0.20979873
9 5.633333 0.55330508
10 5.533333 -0.12486512
write.csv(data11,"C:\\Users\\user\\Downloads\\residual_data.csv", row.names = FALSE)
ggplot(data=data11,aes(x=year,y=RSD))+geom_line()+
labs(title="Regression Residuals",
caption="Source:Federal Reserve", y="Residuals", x="Time in Years", color=3)
residual_data<-read.csv("C:\\Users\\user\\Downloads\\residual_data.csv")
attach(residual_data)
head(residual_data,10)
unemploymnt_data.year unemploymnt_data.Inflation unemploymnt_data.FedRate
1 1859 0.9084719 3.933333
2 1860 1.8107772 3.696667
3 1861 1.6227203 2.936667
4 1862 1.7953352 2.296667
5 1863 0.5370330 2.003333
6 1864 0.7149242 1.733333
7 1865 0.8918621 1.683333
8 1866 1.0676163 2.400000
9 1867 2.3034394 2.456667
10 1868 1.2348411 2.606667
unemploymnt_data.Unemployment RSD
1 5.133333 -0.82027544
2 5.233333 -0.07333963
3 5.533333 0.03190288
4 6.266667 0.35104981
5 6.800000 -0.73682650
6 7.000000 -0.31447083
7 6.766667 -0.05877223
8 6.200000 -0.20979873
9 5.633333 0.55330508
10 5.533333 -0.12486512
Time<-seq(as.Date("1859-01-01"),by="1 year",length.out=length(residual_data$unemploymnt_data.year))
Time
[1] "1859-01-01" "1860-01-01" "1861-01-01" "1862-01-01" "1863-01-01"
[6] "1864-01-01" "1865-01-01" "1866-01-01" "1867-01-01" "1868-01-01"
[11] "1869-01-01" "1870-01-01" "1871-01-01" "1872-01-01" "1873-01-01"
[16] "1874-01-01" "1875-01-01" "1876-01-01" "1877-01-01" "1878-01-01"
[21] "1879-01-01" "1880-01-01" "1881-01-01" "1882-01-01" "1883-01-01"
[26] "1884-01-01" "1885-01-01" "1886-01-01" "1887-01-01" "1888-01-01"
[31] "1889-01-01" "1890-01-01" "1891-01-01" "1892-01-01" "1893-01-01"
[36] "1894-01-01" "1895-01-01" "1896-01-01" "1897-01-01" "1898-01-01"
[41] "1899-01-01" "1900-01-01" "1901-01-01" "1902-01-01" "1903-01-01"
[46] "1904-01-01" "1905-01-01" "1906-01-01" "1907-01-01" "1908-01-01"
[51] "1909-01-01" "1910-01-01" "1911-01-01" "1912-01-01" "1913-01-01"
[56] "1914-01-01" "1915-01-01" "1916-01-01" "1917-01-01" "1918-01-01"
[61] "1919-01-01" "1920-01-01" "1921-01-01" "1922-01-01" "1923-01-01"
[66] "1924-01-01" "1925-01-01" "1926-01-01" "1927-01-01" "1928-01-01"
[71] "1929-01-01" "1930-01-01" "1931-01-01" "1932-01-01" "1933-01-01"
[76] "1934-01-01" "1935-01-01" "1936-01-01" "1937-01-01" "1938-01-01"
[81] "1939-01-01" "1940-01-01" "1941-01-01" "1942-01-01" "1943-01-01"
[86] "1944-01-01" "1945-01-01" "1946-01-01" "1947-01-01" "1948-01-01"
[91] "1949-01-01" "1950-01-01" "1951-01-01" "1952-01-01" "1953-01-01"
[96] "1954-01-01" "1955-01-01" "1956-01-01" "1957-01-01" "1958-01-01"
[101] "1959-01-01" "1960-01-01" "1961-01-01" "1962-01-01" "1963-01-01"
[106] "1964-01-01" "1965-01-01" "1966-01-01" "1967-01-01" "1968-01-01"
[111] "1969-01-01" "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01"
[116] "1974-01-01" "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01"
[121] "1979-01-01" "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01"
[126] "1984-01-01" "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01"
[131] "1989-01-01" "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01"
[136] "1994-01-01" "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01"
[141] "1999-01-01" "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01"
[146] "2004-01-01" "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01"
[151] "2009-01-01" "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01"
[156] "2014-01-01" "2015-01-01" "2016-01-01" "2017-01-01" "2018-01-01"
[161] "2019-01-01" "2020-01-01" "2021-01-01" "2022-01-01"
residual_plot<-ggplot(data=residual_data,aes(x=Time,y=RSD))+geom_line()+
labs(title="Regression Residuals",
caption="", y="Residual", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
grid.newpage()
grid.draw(rbind(ggplotGrob(residual_plot),size="last"))
sample_sales<-sample(Superstore_data$Sales,size = 200, replace = FALSE)
sample_profit<-sample(Superstore_data$Profit,size = 200, replace = FALSE)
sample_quantity<-sample(Superstore_data$Quantity,size = 200, replace = FALSE)
sample_data<-data.frame(sample_sales,sample_profit,sample_quantity)
head(sample_data,10)
sample_sales sample_profit sample_quantity
1 14.940 2.2240 2
2 45.960 7.2000 3
3 30.816 10.2480 3
4 119.960 24.8832 5
5 719.960 15.7640 3
6 181.860 1.3068 7
7 479.970 57.5016 5
8 6.848 -12.4146 1
9 10.368 18.6606 4
10 23.744 11.1900 5
tail(sample_data,10)
sample_sales sample_profit sample_quantity
191 22.580 4.3600 3
192 56.280 1.0430 4
193 23.976 14.2956 3
194 41.960 1.7640 2
195 25.344 18.0378 8
196 899.970 1.4742 5
197 49.080 3.6018 3
198 9.088 146.3880 4
199 129.920 2.3094 3
200 95.144 22.2264 5
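Note that sampling each column independently breaks the link between Sales, Profit, and Quantity for any given order. If the goal is a random sample of orders, a minimal sketch that samples whole rows instead (row_idx and sample_rows are illustrative names):
set.seed(123)   # illustrative seed for a reproducible draw
row_idx <- sample(nrow(Superstore_data), size = 200, replace = FALSE)
sample_rows <- Superstore_data[row_idx, c("Sales", "Profit", "Quantity")]
head(sample_rows, 10)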
In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The bias arises when the omitted variable is correlated with one or more of the included independent variables and with the dependent variable. Consider the data below.
research_data<-read.csv("C:\\Users\\user\\Downloads\\research_data.csv")
attach(research_data)
head(research_data,10)
year gdp Private_investment income_tax value_added_tax import_tax
1 1973 15790 3645 1124.8 639.80 795.44
2 1974 18776 4075 1531.4 937.26 842.24
3 1975 21140 4837 1796.8 1185.48 983.62
4 1976 25562 5808 2149.4 1308.44 1057.18
5 1977 32699 7800 2846.8 1855.26 2083.94
6 1978 35601 10280 3121.4 1995.38 2025.48
7 1979 39543 10809 3437.0 3098.14 2049.64
8 1980 44648 12451 3951.6 3588.00 2919.00
9 1981 51641 14508 3993.4 3895.90 3674.24
10 1982 58214 13364 4624.6 3917.50 3305.84
tail(research_data,10)
year gdp Private_investment income_tax value_added_tax import_tax
37 2009 2365453 292488 288168.0 146791.0 41372.00
38 2010 2551161 238938 268291.0 172360.0 48903.00
39 2011 3725918 213845 290621.6 163940.4 48436.93
40 2012 4261370 242365 353693.7 174716.2 55397.81
41 2013 4745594 234586 422357.9 218297.2 61692.72
42 2014 5403471 273641 470102.0 243156.2 70245.12
43 2015 6284191 293645 540440.4 276504.4 75410.29
44 2016 7194163 261548 589921.4 287766.5 86329.96
45 2017 7749435 269542 557959.3 309977.4 85243.79
46 2018 7946248 256345 683377.3 381419.9 95354.98
mlr1 <- lm(Private_investment~income_tax+value_added_tax, data = research_data)
stargazer(mlr1,report = "vc*stp",type = "text",out = "./q7results.txt")
===============================================
Dependent variable:
---------------------------
Private_investment
-----------------------------------------------
income_tax -1.017***
(0.344)
t = -2.954
p = 0.006
value_added_tax 2.828***
(0.658)
t = 4.301
p = 0.0001
Constant 40,402.690***
(12,398.920)
t = 3.259
p = 0.003
-----------------------------------------------
Observations 46
R2 0.712
Adjusted R2 0.698
Residual Std. Error 60,037.290 (df = 43)
F Statistic 53.103*** (df = 2; 43)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
mlr2<- lm(Private_investment~income_tax+value_added_tax+import_tax, data = research_data)
stargazer(mlr2, report = "vc*stp",type = "text",out = "./q7results.txt")
===============================================
Dependent variable:
---------------------------
Private_investment
-----------------------------------------------
income_tax -1.026***
(0.323)
t = -3.174
p = 0.003
value_added_tax 1.752**
(0.743)
t = 2.358
p = 0.024
import_tax 4.249**
(1.630)
t = 2.606
p = 0.013
Constant 15,523.390
(15,054.700)
t = 1.031
p = 0.309
-----------------------------------------------
Observations 46
R2 0.752
Adjusted R2 0.734
Residual Std. Error 56,362.240 (df = 42)
F Statistic 42.433*** (df = 3; 42)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
waldtest(mlr1,mlr2,test = "F")
Wald test
Model 1: Private_investment ~ income_tax + value_added_tax
Model 2: Private_investment ~ income_tax + value_added_tax + import_tax
Res.Df Df F Pr(>F)
1 43
2 42 1 6.7904 0.01263 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis for this test states that the coefficient of the omitted variable is zero; failing to reject it would imply that the variable was correctly omitted. The alternative hypothesis states that the coefficient of the omitted variable is not zero, so rejecting the null hypothesis indicates that the variable was incorrectly omitted. From the results above, the p-value is approximately 0.013, so we reject the null hypothesis and conclude that import_tax was incorrectly omitted from the first model: the omitted variable helps to explain the variation in the response variable.
fdi_data<-read.csv("C:\\Users\\user\\Downloads\\fdi.csv")
attach(fdi_data)
head(fdi_data,10)
year FDI_USA....of.GDP. FDI_Kenya....of.GDP. FDI_UK....of.GDP.
1 1970 0.11366781 0.8606457 1.1387286
2 1971 0.06610293 0.4161064 1.1958529
3 1972 0.09928779 0.2989637 0.7107128
4 1973 0.13540287 0.6879239 1.4141275
5 1974 0.22909018 0.7885675 2.1218233
6 1975 0.13709980 0.5264477 1.3726762
7 1976 0.15533156 1.3346175 1.2923558
8 1977 0.14026148 1.2581322 1.6826721
9 1978 0.23388341 0.6488659 1.1275862
10 1979 0.30639436 1.3475238 1.4737058
FDI_USA....of.GDP.<-ts(fdi_data$FDI_USA....of.GDP.,frequency = 1,start = 1970)
ts.plot(FDI_USA....of.GDP.,main="Yearly time series plot of Foreign Direct
Investment Net Flow in the USA from 1970 to 2020", xlab="Time in Years",ylab="Foreign Direct Investment")
Yearly time series of FDI in USA
FDI_UK....of.GDP.<-ts(fdi_data$FDI_UK....of.GDP.,frequency = 1,start = 1970)
ts.plot(FDI_UK....of.GDP.,main="Yearly time series plot of Foreign Direct
Investment Net Flow in the UK from 1970 to 2020", xlab="Time in Years",ylab="Foreign Direct Investment")
Yearly time series of FDI in UK
FDI_Kenya....of.GDP.<-ts(fdi_data$FDI_Kenya....of.GDP.,frequency = 1,start = 1970)
ts.plot(FDI_Kenya....of.GDP.,main="Yearly time series plot of Foreign Direct
Investment Net Flow in Kenya from 1970 to 2020", xlab="Time in Years",ylab="Foreign Direct Investment")
Yearly time series of FDI in Kenya
date_time<-seq(as.Date("1970-01-01"),by="1 year",length.out=length(fdi_data$year))
date_time
[1] "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01" "1974-01-01"
[6] "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01" "1979-01-01"
[11] "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01" "1984-01-01"
[16] "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01" "1989-01-01"
[21] "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01" "1994-01-01"
[26] "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01" "1999-01-01"
[31] "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01" "2004-01-01"
[36] "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01" "2009-01-01"
[41] "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01" "2014-01-01"
[46] "2015-01-01" "2016-01-01" "2017-01-01" "2018-01-01" "2019-01-01"
[51] "2020-01-01"
USA<-ggplot(data=fdi_data,aes(x=date_time,y=FDI_USA....of.GDP.))+geom_line()+
labs(title="FDI Inflow in the USA from 1970 to 2020",
caption="", y="Foreign Direct Investment", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "2 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Kenya<-ggplot(data=fdi_data,aes(x=date_time,y=FDI_Kenya....of.GDP.))+geom_line()+
labs(title="FDI Inflow in Kenya from 1970 to 2020",
caption="", y="Foreign Direct Investment", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "2 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
UK<-ggplot(data=fdi_data,aes(x=date_time,y=FDI_UK....of.GDP.))+geom_line()+
labs(title="FDI Inflow in UK from 1970 to 2020",
caption="", y="Foreign Direct Investment", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "2 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
grid.newpage()
grid.draw(rbind(ggplotGrob(USA),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(Kenya),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(UK),size="last"))
ggplot(data=fdi_data,aes(x=year))+
geom_line(aes(y=FDI_USA....of.GDP.,colour="FDI_USA"))+
geom_line(aes(y=FDI_UK....of.GDP.,colour="FDI_UK"))+
geom_line(aes(y=FDI_Kenya....of.GDP.,colour="FDI_Kenya"))+
labs(title="Trends of FDI inflow as a percentage of GDP from 1970 to 2020",
caption="", y="FDI", x="Time in Years",color="Key")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 10))
Trends of FDI inflow as a percentage of GDP from 1970 to 2020
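Rather than layering three geom_line() calls, the same plot can be built by reshaping the data to long format first; a minimal sketch, assuming tidyr is available (fdi_long is an illustrative name):
library(tidyr)
# One row per year-country pair, with the FDI value in a single column
fdi_long <- fdi_data %>%
pivot_longer(cols = -year, names_to = "country", values_to = "fdi")
ggplot(fdi_long, aes(x = year, y = fdi, colour = country)) +
geom_line() +
labs(title = "Trends of FDI inflow as a percentage of GDP from 1970 to 2020",
y = "FDI", x = "Time in Years", colour = "Key")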
ForDI<-read.csv("C:\\Users\\user\\Downloads\\ForDI.csv")
attach(ForDI)
head(ForDI,10)
Year United.States United.Kingdom Germany France South.Africa
1 1970 5270000000 189599849 NA NA 0
2 1971 4850000000 216920701 96932868 NA 0
3 1972 6060000000 809492007 -92717508 NA 0
4 1973 7410000000 2258637186 -138757334 NA 0
5 1974 1620000000 2246137 -54769412 NA 0
6 1975 11420000000 -317828335 1490736257 -225901953 0
7 1976 8410000000 1331645274 1287619730 660768470 0
8 1977 8360000000 -253403562 1457079065 -893591988 0
9 1978 8870000000 3027697093 2336498531 -570258945 0
10 1979 16670000000 6069585369 3282971189 -601262768 0
Kenya Nigeria
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 -15796942 NA
7 -42069308 NA
8 -53887117 -440514243
9 -32085354 -210933271
10 -78123859 -304632042
ForDI_TR<- ts(data=ForDI[,-1],start = min(ForDI$Year),end =max(ForDI$Year))
plot(ForDI_TR,plot.type = "single",col=1:7)
legend("topleft",legend = colnames(ForDI_TR),ncol = 2,lty = 1, col = 1:7,cex = 0.9)
Time series plot Foreign Direct Investment
plot(ForDI_TR)
ARIMA models provide another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely used approaches to time series forecasting, and provide complementary approaches to the problem. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data. Before we introduce ARIMA models, we must first discuss the concept of stationarity and the technique of differencing time series.
A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary: the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary: it does not matter when you observe it, it should look much the same at any point in time. Some cases can be confusing: a time series with cyclic behaviour (but with no trend or seasonality) is stationary. This is because the cycles are not of a fixed length, so before we observe the series we cannot be sure where the peaks and troughs of the cycles will be. In general, a stationary time series will have no predictable patterns in the long-term. Time plots will show the series to be roughly horizontal (although some cyclic behaviour is possible), with constant variance.
One way to determine more objectively whether differencing is required is to use a unit root test. These are statistical hypothesis tests of stationarity that are designed for determining whether differencing is required. A number of unit root tests are available, which are based on different assumptions and may lead to conflicting answers. In our analysis, we use the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test (Kwiatkowski, Phillips, Schmidt, & Shin, 1992). In this test, the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values (e.g., less than 0.05) suggest that differencing is required. The test can be computed using the ur.kpss() function from the urca package.
gdp_data<-read.csv("C:\\Users\\user\\Downloads\\ARIMA.csv")
head(gdp_data,5)
year realgdp
1 1817 1610.5
2 1818 1658.8
3 1819 1723.0
4 1820 1753.9
5 1821 1773.5
str(gdp_data)
'data.frame': 204 obs. of 2 variables:
$ year : int 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 ...
$ realgdp: num 1610 1659 1723 1754 1774 ...
class(gdp_data)
[1] "data.frame"
gdp_TS<-ts(gdp_data$realgdp,frequency = 1,start = 1817)
class(gdp_TS)
[1] "ts"
plot(gdp_TS)
adf.test(gdp_TS)
Augmented Dickey-Fuller Test
data: gdp_TS
Dickey-Fuller = 0.28354, Lag order = 5, p-value = 0.99
alternative hypothesis: stationary
From the result above, we fail to reject the null hypothesis of a unit root (p-value = 0.99), so the GDP series is not stationary.
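As described earlier, the KPSS test offers a complementary check (its null hypothesis is that the series is stationary), and the number of differences required can be estimated directly. A minimal sketch, assuming the urca and forecast packages are installed:
library(urca)       # provides ur.kpss()
library(forecast)   # provides ndiffs()
summary(ur.kpss(gdp_TS))   # a large test statistic rejects stationarity
ndiffs(gdp_TS)             # estimated number of differences needed for stationarity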
acf(gdp_TS)
pacf(gdp_TS)
auto.arima(gdp_TS,ic="aic", trace = TRUE)
Fitting models using approximations to speed things up...
ARIMA(2,2,2) : 2039.769
ARIMA(0,2,0) : 2114.799
ARIMA(1,2,0) : 2073.859
ARIMA(0,2,1) : 2049.559
ARIMA(1,2,2) : 2047.647
ARIMA(2,2,1) : 2037.924
ARIMA(1,2,1) : 2046.277
ARIMA(2,2,0) : 2067.973
ARIMA(3,2,1) : 2041.463
ARIMA(3,2,0) : 2064.69
ARIMA(3,2,2) : 2043.092
Now re-fitting the best model(s) without approximations...
ARIMA(2,2,1) : 2054.273
Best model: ARIMA(2,2,1)
Series: gdp_TS
ARIMA(2,2,1)
Coefficients:
ar1 ar2 ma1
0.2633 0.1442 -0.9656
s.e. 0.0725 0.0726 0.0195
sigma^2 = 1477: log likelihood = -1023.14
AIC=2054.27 AICc=2054.48 BIC=2067.51
ARIMAMODEL<-auto.arima(gdp_TS,ic="aic", trace = TRUE)
Fitting models using approximations to speed things up...
ARIMA(2,2,2) : 2039.769
ARIMA(0,2,0) : 2114.799
ARIMA(1,2,0) : 2073.859
ARIMA(0,2,1) : 2049.559
ARIMA(1,2,2) : 2047.647
ARIMA(2,2,1) : 2037.924
ARIMA(1,2,1) : 2046.277
ARIMA(2,2,0) : 2067.973
ARIMA(3,2,1) : 2041.463
ARIMA(3,2,0) : 2064.69
ARIMA(3,2,2) : 2043.092
Now re-fitting the best model(s) without approximations...
ARIMA(2,2,1) : 2054.273
Best model: ARIMA(2,2,1)
ARIMAMODEL
Series: gdp_TS
ARIMA(2,2,1)
Coefficients:
ar1 ar2 ma1
0.2633 0.1442 -0.9656
s.e. 0.0725 0.0726 0.0195
sigma^2 = 1477: log likelihood = -1023.14
AIC=2054.27 AICc=2054.48 BIC=2067.51
summary(ARIMAMODEL)
Series: gdp_TS
ARIMA(2,2,1)
Coefficients:
ar1 ar2 ma1
0.2633 0.1442 -0.9656
s.e. 0.0725 0.0726 0.0195
sigma^2 = 1477: log likelihood = -1023.14
AIC=2054.27 AICc=2054.48 BIC=2067.51
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 2.803157 37.96217 29.06838 0.02152401 0.7154865 0.6297229
ACF1
Training set -0.008628346
acf(ts(ARIMAMODEL$residuals))
pacf(ts(ARIMAMODEL$residuals))
The ACF is commonly used to identify the order of the moving average (MA) term and the PACF the order of the autoregressive (AR) term. Consistent with the fitted model above, the MA component has one lag and the AR component has two lags.
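If one wanted to impose this order directly rather than rely on automatic selection, forecast::Arima() fits a user-specified model that can be compared with the auto.arima() choice. A minimal sketch (manual_fit is an illustrative name):
# Manually specified ARIMA(2,2,1) for comparison with the automatic selection
manual_fit <- Arima(gdp_TS, order = c(2, 2, 1))
summary(manual_fit)   # coefficients, sigma^2, AIC/AICc/BIC and accuracy measures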
MYGDPF<-forecast(ARIMAMODEL,level=c(95), h=50)
MYGDPF
Point Forecast Lo 95 Hi 95
2021 9357.475 9282.142 9432.809
2022 9415.513 9292.099 9538.927
2023 9476.136 9306.002 9646.271
2024 9538.083 9325.159 9751.008
2025 9600.752 9347.949 9853.555
2026 9663.801 9373.544 9954.058
2027 9727.055 9401.215 10052.895
2028 9790.417 9430.445 10150.390
2029 9853.838 9460.849 10246.827
2030 9917.289 9492.143 10342.436
2031 9980.757 9524.114 10437.401
2032 10044.234 9556.599 10531.870
2033 10107.716 9589.474 10625.958
2034 10171.200 9622.640 10719.761
2035 10234.686 9656.019 10813.352
2036 10298.172 9689.550 10906.793
2037 10361.658 9723.181 11000.136
2038 10425.145 9756.870 11093.420
2039 10488.632 9790.583 11186.681
2040 10552.119 9824.290 11279.948
2041 10615.606 9857.967 11373.244
2042 10679.093 9891.594 11466.592
2043 10742.580 9925.152 11560.008
2044 10806.067 9958.626 11653.508
2045 10869.554 9992.003 11747.105
2046 10933.041 10025.271 11840.810
2047 10996.528 10058.422 11934.634
2048 11060.015 10091.445 12028.584
2049 11123.502 10124.335 12122.669
2050 11186.989 10157.083 12216.894
2051 11250.476 10189.685 12311.266
2052 11313.963 10222.136 12405.790
2053 11377.450 10254.430 12500.470
2054 11440.937 10286.565 12595.309
2055 11504.424 10318.536 12690.312
2056 11567.911 10350.341 12785.481
2057 11631.398 10381.976 12880.819
2058 11694.885 10413.441 12976.329
2059 11758.372 10444.733 13072.011
2060 11821.859 10475.849 13167.869
2061 11885.346 10506.789 13263.903
2062 11948.833 10537.552 13360.114
2063 12012.320 10568.135 13456.505
2064 12075.807 10598.538 13553.075
2065 12139.294 10628.761 13649.826
2066 12202.781 10658.803 13746.759
2067 12266.268 10688.663 13843.873
2068 12329.755 10718.340 13941.170
2069 12393.242 10747.834 14038.649
2070 12456.729 10777.146 14136.312
plot(MYGDPF,type="l",main="Time Series plot of Forecasted GDP (ARIMA 2,2,1)",xlab="Time in Years",ylab="MYGDPF")
ARIMA forecast
Box.test(MYGDPF$resid, lag=5,type="Ljung-Box")
Box-Ljung test
data: MYGDPF$resid
X-squared = 1.6084, df = 5, p-value = 0.9002
Box.test(MYGDPF$resid, lag=2,type="Ljung-Box")
Box-Ljung test
data: MYGDPF$resid
X-squared = 0.025225, df = 2, p-value = 0.9875
Box.test(MYGDPF$resid, lag=20,type="Ljung-Box")
Box-Ljung test
data: MYGDPF$resid
X-squared = 23.863, df = 20, p-value = 0.2484
From the results, ARIMA(2,2,1) is adequate for prediction and shows no residual autocorrelation: the Box-Ljung tests give p-values of 0.90, 0.99, and 0.25 at lags 5, 2, and 20 respectively, none of which rejects the null hypothesis of no autocorrelation.
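The same residual diagnostics can be obtained in a single step with forecast::checkresiduals(), which plots the residuals, their ACF, and a histogram, and reports a Ljung-Box test. A minimal sketch:
# Combined residual plot, ACF, histogram and Ljung-Box test for the fitted model
checkresiduals(ARIMAMODEL)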
autoplot(forecast(MYGDPF))
ARIMA 2,2,1 Forecast
RESID<-residuals(ARIMAMODEL)
summary(RESID)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-127.529 -20.085 2.219 2.803 25.356 154.703
plot(RESID)
abline(2.803,0)
The plot of residuals with an abline
In this section, you will learn the basics of the vector autoregressive model. We lay the foundation for getting started with this key multivariate time series model and cover the important details, including model specification, lag selection, estimation, and interpretation.
The vector autoregressive (VAR) model is a workhorse multivariate time series model that relates current observations of a variable to past observations of itself and past observations of other variables in the system. VAR models differ from univariate autoregressive models because they allow feedback between the variables in the model. For example, we could use a VAR model to show how the inflation rate is a function of the unemployment rate and how the unemployment rate is, in turn, a function of the inflation rate. In economics we might also be interested in establishing a bi-directional relationship between personal income and personal consumption spending; a two-equation VAR system can be used to model the relationship between income and consumption over time.
VAR models have traditionally been widely used in finance and econometrics because they offer a framework for accomplishing important modeling goals, including data description, forecasting, structural inference, and policy analysis (Stock and Watson 2001).
There are three broad types of VAR models: the reduced-form, the recursive, and the structural VAR model.
The recursive model contains all the components of the reduced-form model but also allows some variables to be functions of other concurrent variables. By imposing these short-run relationships, the recursive model allows us to model structural shocks.
A VAR model is made up of a system of equations that represents the relationships between multiple variables. When referring to VAR models, we often use special language to specify the number of endogenous variables and the number of lags: in general, a VAR model is composed of n equations (one for each of n endogenous variables) and includes p lags of the variables.
Lag selection is one of the most important aspects of VAR model specification. In practical applications, we generally choose a maximum number of lags, p_max, and evaluate the performance of every model from VAR(1) up to VAR(p_max). The optimal model is then the VAR(p) that minimizes some lag selection criterion. The most commonly used criteria are the Akaike information criterion (AIC), the Hannan-Quinn criterion (HQ), the Schwarz criterion (SC), and the final prediction error (FPE).
From an estimation standpoint, it is important to be deliberate about how many variables we include in our VAR model, since each additional variable increases the number of coefficients to be estimated in every equation and at every lag.
Deciding what variables to include in a VAR model should be founded in theory, as much as possible. We can use additional tools, like Granger causality or Sims causality, to test the forecasting relevance of variables.
Granger causality tests whether a variable is "helpful" for forecasting the behavior of another variable. It is important to note that Granger causality only allows us to make inferences about forecasting relevance, not about true causality.
Despite their seeming complexity, VAR models are quite easy to estimate. Each equation can be estimated using ordinary least squares given a few assumptions:
The error term has a conditional mean of zero.
The variables in the model are stationary.
Large outliers are unlikely.
No perfect multicollinearity.
Under these assumptions, the ordinary least squares estimates:
Will be consistent.
Can be evaluated using traditional t-statistics and p-values.
Can be used to jointly test restrictions across multiple equations.
One of the most important functions of VAR models is to generate forecasts. Forecasts are generated from a VAR using an iterative algorithm: the one-step-ahead forecast is computed from the estimated equations, and that forecast is then fed back into the model as data to produce the two-step-ahead forecast, and so on for longer horizons; a forecasting sketch for the estimated VAR appears later in this section.
Often we are more interested in the dynamics predicted by our VAR models than in the actual coefficients that are estimated. For this reason, VAR studies most commonly report Granger-causality tests, impulse response functions, and forecast error variance decompositions.
As we previously discussed, Granger-causality statistics test whether one variable is statistically significant when predicting another variable. In other words, the Granger-causality statistics are F-statistics that test whether the coefficients of all lags of a variable are jointly equal to zero in the equation for another variable. As the p-value of the F-statistic decreases, the evidence that a variable is relevant for predicting another variable increases.
For example, in the Granger-causality test of X on Y, if the p-value is 0.02 we would say that X helps predict Y at the 5% level. However, if the p-value is 0.3 we would say that there is no evidence that X helps predict Y.
The impulse response function traces the dynamic path of the variables in the system following a shock to one of the variables. This is done by applying a one-time shock to a single variable and tracing the effect of that shock on the current and future values of every variable in the system.
Forecast error decomposition separates the forecast error variance into proportions attributed to each variable in the model. Intuitively, this measure helps us judge how much of an impact one variable has on another variable in the VAR model and how intertwined our variables' dynamics are. For example, if X is responsible for 85% of the forecast error variance of Y, it explains a large amount of the forecast variation in Y. However, if X is only responsible for 20% of the forecast error variance of Y, much of the forecast error variance of Y is left unexplained by X.
VAR models are an essential component of multivariate time series modeling. At this point you should have a better understanding of the fundamentals of the VAR model, including specification, lag selection, estimation, Granger causality, impulse response functions, and forecast error variance decomposition.
### Load the dataset
var<-read.csv("C:\\Users\\user\\Downloads\\estimation.csv")
head(var,10)
Inflation unemployment
1 0.0000000 0.0000000
2 -1.2058566 0.7304148
3 -0.3206307 -1.8469421
4 0.5431992 -1.5017315
5 -0.6235946 -1.2889562
6 -0.5725677 -1.1614199
7 -0.6476382 -0.8271906
8 0.5254592 -0.9146628
9 0.1391321 0.6737464
10 -1.3313781 -0.1606959
Inflation<-ts(var$Inflation,start=c(2000,1),frequency = 12)
plot.ts(Inflation,type="l",main="Time Series plot Inflation",xlab="Year",ylab="inflation")
The time Series plot Inflation
unemployment<-ts(var$unemployment,start=c(2000,1),frequency = 12)
plot.ts(unemployment,type="l",main="Time Series plot unemployment",xlab="Year",ylab="unemployment")
Time Series plot unemployment
The Augmented Dickey-Fuller test (ADF test) is a common statistical test used to determine whether a given time series is stationary. It is one of the most widely used tests for analyzing the stationarity of a series.
Its null hypothesis is the same as that of the Dickey-Fuller test: the series contains a unit root (alpha = 1). The p-value obtained therefore has to be less than the significance level (say 0.05) in order to reject the null hypothesis and infer that the series is stationary. A very common mistake analysts make with this test is to read a p-value below the significance level as evidence that the series is non-stationary, when it in fact indicates stationarity.
ADF<-adf.test(Inflation)
ADF
Augmented Dickey-Fuller Test
data: Inflation
Dickey-Fuller = -5.7974, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
ADF1<-adf.test(unemployment)
ADF1
Augmented Dickey-Fuller Test
data: unemployment
Dickey-Fuller = -4.0617, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
lag<-VARselect(var, lag.max=4)
lag
$selection
AIC(n) HQ(n) SC(n) FPE(n)
1 1 1 1
$criteria
1 2 3 4
AIC(n) -0.093318151 -0.08006220 -0.08372238 -0.07044136
HQ(n) -0.052691480 -0.01235108 0.01107318 0.05143865
SC(n) 0.007032298 0.08718855 0.15042866 0.23060999
FPE(n) 0.910908005 0.92307937 0.91974245 0.93210286
All four criteria above indicate that the appropriate number of lags is one, so we estimate the VAR with a single lag; including more lags would be unnecessary.
EST<-VAR(var, p=1, type = "none")
summary(EST)
VAR Estimation Results:
=========================
Endogenous variables: Inflation, unemployment
Deterministic variables: none
Sample size: 199
Log Likelihood: -549.51
Roots of the characteristic polynomial:
0.43 0.43
Call:
VAR(y = var, p = 1, type = "none")
Estimation results for equation Inflation:
==========================================
Inflation = Inflation.l1 + unemployment.l1
Estimate Std. Error t value Pr(>|t|)
Inflation.l1 0.25493 0.06887 3.702 0.000278 ***
unemployment.l1 -0.05589 0.04771 -1.171 0.242875
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9565 on 197 degrees of freedom
Multiple R-Squared: 0.06981, Adjusted R-squared: 0.06037
F-statistic: 7.392 on 2 and 197 DF, p-value: 0.0008023
Estimation results for equation unemployment:
=============================================
unemployment = Inflation.l1 + unemployment.l1
Estimate Std. Error t value Pr(>|t|)
Inflation.l1 0.58693 0.07050 8.325 0.000000000000014 ***
unemployment.l1 0.59648 0.04885 12.211 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9792 on 197 degrees of freedom
Multiple R-Squared: 0.5342, Adjusted R-squared: 0.5295
F-statistic: 113 on 2 and 197 DF, p-value: < 0.00000000000000022
Covariance matrix of residuals:
Inflation unemployment
Inflation 0.9104 -0.0391
unemployment -0.0391 0.9578
Correlation matrix of residuals:
Inflation unemployment
Inflation 1.00000 -0.04187
unemployment -0.04187 1.00000
stargazer(EST[["varresult"]], type='text')
===========================================================
Dependent variable:
----------------------------
y
(1) (2)
-----------------------------------------------------------
Inflation.l1 0.255*** 0.587***
(0.069) (0.071)
unemployment.l1 -0.056 0.596***
(0.048) (0.049)
-----------------------------------------------------------
Observations 199 199
R2 0.070 0.534
Adjusted R2 0.060 0.530
Residual Std. Error (df = 197) 0.957 0.979
F Statistic (df = 2; 197) 7.392*** 112.979***
===========================================================
Note: *p<0.1; **p<0.05; ***p<0.01
In practice, you usually do not interpret the individual coefficients of a VAR model, and there is a fairly intuitive reason for that. As noted above, VARs assume that all relevant variables affect each other over time as a single system, so a VAR estimates as many equations as there are variables; for example, a three-variable VAR produces three estimated equations. Each variable in turn plays the role of the endogenous variable while the others enter as regressors. Interpreting an individual coefficient as the causal effect of one variable on another would contradict this construction, which is why most analyses skip coefficient-by-coefficient interpretation and move on to impulse response functions and related graphs.
Strictly speaking, the coefficients of a reduced-form VAR have the same interpretation as the coefficients in any regression, i.e., they are the estimated partial derivatives of the dependent variable with respect to the lagged explanatory variables. However, we are typically not interested in these coefficients, but rather in the dynamic properties of the whole system. After all, a VAR is intended to summarize the dynamic behavior of a group of variables, all of which are endogenous with respect to one another. So we typically focus on the impulse response functions (IRFs) of the VAR, which provide the (cumulative) total derivatives of the endogenous variables with respect to an exogenous shock to one of the variables. More technically, calculating IRFs is equivalent to inverting the vector autoregressive representation to obtain the vector moving average representation. Note that doing so requires that the autoregressive lag polynomial be invertible, meaning no unit roots in the underlying data.
However, there is an additional complication, which gets to the root of the difference between VARs and the linear simultaneous-equation models of the 1970s that they replaced. Since only lags appear as regressors in a VAR, any contemporaneous correlations between variables are captured in the residual covariance matrix. But IRFs become invalid when the errors are correlated, since an IRF explicitly assumes that only one variable is being shocked at t=0. Therefore, in addition to the reduced-form coefficient matrix being invertible, we require a means by which the residuals from the VAR can be rendered orthogonal (uncorrelated) with respect to one another. This is how we get from a reduced-form VAR to a structural VAR whose IRFs are valid. Unfortunately, the orthogonality requirement alone is not sufficient to produce a unique structural representation; in other words, there exists an infinity of different structural representations of the VAR that (a) are consistent with the estimated reduced-form parameters and (b) possess orthogonal shocks. One popular choice is the Cholesky factorization of the residual covariance matrix.
To summarize, the IRFs of a VAR are much easier to interpret than the parameters of the reduced-form VAR. But calculating valid IRFs requires (a) stationary (I(0)) endogenous variables (or cointegration among the variables) and (b) some form of factorization of the residual covariance matrix that renders the resulting shocks orthogonal with respect to one another. There are other, more subtle requirements, such as linearity of the underlying data-generating process, but these are the two big ones.
roots(EST, modulus=TRUE)
[1] 0.4299625 0.4299625
The model is stable since the roots of the characteristic polynomial lie inside the unit circle, that is, their moduli are less than one.
GRA<-causality(EST, cause = "Inflation")
GRA
$Granger
Granger causality H0: Inflation do not Granger-cause unemployment
data: VAR object EST
F-Test = 69.304, df1 = 1, df2 = 394, p-value = 0.000000000000001332
$Instant
H0: No instantaneous causality between: Inflation and unemployment
data: VAR object EST
Chi-squared = 0.38572, df = 1, p-value = 0.5346
GRA1<-causality(EST, cause = "unemployment")
GRA1
$Granger
Granger causality H0: unemployment do not Granger-cause Inflation
data: VAR object EST
F-Test = 1.372, df1 = 1, df2 = 394, p-value = 0.2422
$Instant
H0: No instantaneous causality between: unemployment and Inflation
data: VAR object EST
Chi-squared = 0.38572, df = 1, p-value = 0.5346
From the analysis, it was established that inflation Granger-causes unemployment; however, unemployment does not Granger-cause inflation.
IRF1<-irf(EST, impulse = "Inflation", response = "unemployment",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
IRF1
Impulse response coefficients
$Inflation
unemployment
[1,] -0.0431533864513
[2,] 0.5356735487953
[3,] 0.4640598613174
[4,] 0.2960802457858
[5,] 0.1662984086264
[6,] 0.0868538042711
[7,] 0.0432056980416
[8,] 0.0207296508563
[9,] 0.0096622621431
[10,] 0.0043943820861
[11,] 0.0019552162997
[12,] 0.0008523272496
[13,] 0.0003642305110
[14,] 0.0001525446726
[15,] 0.0000625448388
[16,] 0.0000250512251
[17,] 0.0000097665459
[18,] 0.0000036842516
[19,] 0.0000013313196
[20,] 0.0000004524105
[21,] 0.0000001390725
Lower Band, CI= 0.95
$Inflation
unemployment
[1,] -0.184075991840
[2,] 0.321066563874
[3,] 0.315294143889
[4,] 0.166563256552
[5,] 0.058447095129
[6,] 0.002749794944
[7,] -0.013138211838
[8,] -0.011395791645
[9,] -0.009979606392
[10,] -0.007013775897
[11,] -0.004085913025
[12,] -0.001923095405
[13,] -0.000775530661
[14,] -0.000368941838
[15,] -0.000179202716
[16,] -0.000080672026
[17,] -0.000045792068
[18,] -0.000021790150
[19,] -0.000007270298
[20,] -0.000003185182
[21,] -0.000002275386
Upper Band, CI= 0.95
$Inflation
unemployment
[1,] 0.0885639204
[2,] 0.7114816015
[3,] 0.6124293905
[4,] 0.4025760799
[5,] 0.2535375202
[6,] 0.1595431549
[7,] 0.1059810511
[8,] 0.0714119748
[9,] 0.0483910704
[10,] 0.0332510192
[11,] 0.0228477491
[12,] 0.0156285875
[13,] 0.0105281439
[14,] 0.0070995015
[15,] 0.0047940361
[16,] 0.0032382581
[17,] 0.0021387138
[18,] 0.0013962543
[19,] 0.0009116266
[20,] 0.0005953314
[21,] 0.0003893387
plot(IRF1, ylab="unemployment",
main="Unemployment response to an inflation shock")
Unemployment response to an inflation shock
Usually, impulse response functions are interpreted along the lines of "a one standard deviation shock to x causes a significant increase (or decrease) in y for m periods (the number of periods for which the confidence bands stay above or below zero), after which the effect dissipates." The IRF above shows that a one standard deviation shock to inflation causes unemployment to rise and remain significantly above zero for roughly the first three to five months, after which the effect dies out.
IRF2<-irf(EST, impulse = "unemployment", response = "Inflation",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF2, ylab="Inflation",
main="Inflation response to an unemployment shock")
Inflation response to an unemployment shock
From the graph above, a one standard deviation shock to unemployment causes inflation to fall for about two months before it starts to recover. However, the confidence band contains zero, implying that the effect of an unemployment shock on inflation is not statistically significant. This agrees with the Granger-causality results above, which indicated that inflation Granger-causes unemployment while unemployment does not Granger-cause inflation.
IRF3<-irf(EST, impulse = "Inflation", response = "Inflation",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF3, ylab="Inflation",
main="Inflation response to Inflation shock")
Inflation response to Inflation shock
IRF4<-irf(EST, impulse = "unemployment", response = "unemployment",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF4, ylab="unemployment",
main="unemployment response to unemployment shock")
unemployment response to unemployment shock
The variance decomposition indicates the amount of information each variable contributes to the other variables in the autoregression. It determines how much of the forecast error variance of each of the variables can be explained by exogenous shocks to the other variables.
VARD<-fevd(EST, n.ahead=20)
VARD
$Inflation
Inflation unemployment
[1,] 1.0000000 0.000000000
[2,] 0.9969451 0.003054924
[3,] 0.9947479 0.005252059
[4,] 0.9938673 0.006132705
[5,] 0.9935938 0.006406153
[6,] 0.9935202 0.006479764
[7,] 0.9935022 0.006497786
[8,] 0.9934981 0.006501898
[9,] 0.9934972 0.006502784
[10,] 0.9934970 0.006502966
[11,] 0.9934970 0.006503001
[12,] 0.9934970 0.006503008
[13,] 0.9934970 0.006503009
[14,] 0.9934970 0.006503010
[15,] 0.9934970 0.006503010
[16,] 0.9934970 0.006503010
[17,] 0.9934970 0.006503010
[18,] 0.9934970 0.006503010
[19,] 0.9934970 0.006503010
[20,] 0.9934970 0.006503010
$unemployment
Inflation unemployment
[1,] 0.001942077 0.9980579
[2,] 0.182061322 0.8179387
[3,] 0.265135962 0.7348640
[4,] 0.293687415 0.7063126
[5,] 0.302330557 0.6976694
[6,] 0.304683598 0.6953164
[7,] 0.305269961 0.6947300
[8,] 0.305405969 0.6945940
[9,] 0.305435712 0.6945643
[10,] 0.305441899 0.6945581
[11,] 0.305443130 0.6945569
[12,] 0.305443365 0.6945566
[13,] 0.305443408 0.6945566
[14,] 0.305443415 0.6945566
[15,] 0.305443416 0.6945566
[16,] 0.305443417 0.6945566
[17,] 0.305443417 0.6945566
[18,] 0.305443417 0.6945566
[19,] 0.305443417 0.6945566
[20,] 0.305443417 0.6945566
plot(VARD)
Variance decomposition function for inflation and unemployment
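The iterative forecasting algorithm described earlier can be applied directly to the estimated VAR; a minimal sketch using the vars package's predict() and fanchart() (the 12-period horizon is an arbitrary illustration):
# Forecast 12 periods ahead from the estimated VAR and plot fan charts
VAR_FCST <- predict(EST, n.ahead = 12, ci = 0.95)
fanchart(VAR_FCST)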
my_var<-read.csv("C:\\Users\\user\\Downloads\\my_var.csv")
attach(my_var)
head(my_var,15)
Unemployment Inflation Fed.Rate
1 5.133 0.908 3.933
2 5.233 1.811 3.697
3 5.533 1.623 2.937
4 6.267 1.795 2.297
5 6.800 0.537 2.003
6 7.000 0.715 1.733
7 6.767 0.892 1.683
8 6.200 1.068 2.400
9 5.633 2.303 2.457
10 5.533 1.235 2.607
11 5.567 1.055 2.847
12 5.533 1.228 2.923
13 5.767 1.573 2.967
14 5.733 0.349 2.963
15 5.500 0.523 3.330
Inflation<-ts(my_var$Inflation,start=c(1960),frequency = 4)
plot.ts(Inflation,type="l",main="Time Series plot Inflation",xlab="Year",ylab="inflation")
Time series plot of inflation
Unemployment<-ts(my_var$Unemployment,start=c(1960),frequency = 4)
plot.ts(Unemployment,type="l",main="Time Series plot Unemployment",xlab="Year",ylab="Unemployment")
Time series plot of unemployment
Fed.Rate<-ts(my_var$Fed.Rate,start=c(1960),frequency = 4)
plot.ts(Fed.Rate,type="l",main="Time Series plot of FedRate",xlab="Year",ylab="FedRate")
Time series plot of FedRate
URT<-adf.test(my_var$Inflation)
URT
Augmented Dickey-Fuller Test
data: my_var$Inflation
Dickey-Fuller = -2.2805, Lag order = 5, p-value = 0.4593
alternative hypothesis: stationary
URT1<-adf.test(my_var$Unemployment)
URT1
Augmented Dickey-Fuller Test
data: my_var$Unemployment
Dickey-Fuller = -2.0581, Lag order = 5, p-value = 0.5521
alternative hypothesis: stationary
URT2<-adf.test(my_var$Fed.Rate)
URT2
Augmented Dickey-Fuller Test
data: my_var$Fed.Rate
Dickey-Fuller = -3.3068, Lag order = 5, p-value = 0.07246
alternative hypothesis: stationary
lag<-VARselect(my_var, lag.max=4)
lag
$selection
AIC(n) HQ(n) SC(n) FPE(n)
3 3 2 3
$criteria
1 2 3 4
AIC(n) -2.6746095 -3.15420316 -3.28434359 -3.25963208
HQ(n) -2.5809554 -2.99030845 -3.05020830 -2.95525620
SC(n) -2.4439714 -2.75058659 -2.70774850 -2.51005846
FPE(n) 0.0689359 0.04267955 0.03748351 0.03844391
VAR_mdl<-VAR(my_var, p=3, type = "none")
VAR_mdl
VAR Estimation Results:
=======================
Estimated coefficients for equation Unemployment:
=================================================
Call:
Unemployment = Unemployment.l1 + Inflation.l1 + Fed.Rate.l1 + Unemployment.l2 + Inflation.l2 + Fed.Rate.l2 + Unemployment.l3 + Inflation.l3 + Fed.Rate.l3
Unemployment.l1 Inflation.l1 Fed.Rate.l1 Unemployment.l2 Inflation.l2
1.540461079 0.036092543 0.015635976 -0.567700896 -0.026670658
Fed.Rate.l2 Unemployment.l3 Inflation.l3 Fed.Rate.l3
0.034688923 -0.011735675 -0.003682298 -0.019977419
Estimated coefficients for equation Inflation:
==============================================
Call:
Inflation = Unemployment.l1 + Inflation.l1 + Fed.Rate.l1 + Unemployment.l2 + Inflation.l2 + Fed.Rate.l2 + Unemployment.l3 + Inflation.l3 + Fed.Rate.l3
Unemployment.l1 Inflation.l1 Fed.Rate.l1 Unemployment.l2 Inflation.l2
-1.01182762 0.68232932 0.14845919 1.56424639 0.09094890
Fed.Rate.l2 Unemployment.l3 Inflation.l3 Fed.Rate.l3
-0.06004687 -0.57197523 0.18422864 -0.05318817
Estimated coefficients for equation Fed.Rate:
=============================================
Call:
Fed.Rate = Unemployment.l1 + Inflation.l1 + Fed.Rate.l1 + Unemployment.l2 + Inflation.l2 + Fed.Rate.l2 + Unemployment.l3 + Inflation.l3 + Fed.Rate.l3
Unemployment.l1 Inflation.l1 Fed.Rate.l1 Unemployment.l2 Inflation.l2
-1.5314206 0.0541425 0.9752619 1.3715074 0.2532385
Fed.Rate.l2 Unemployment.l3 Inflation.l3 Fed.Rate.l3
-0.3649440 0.1113407 -0.1461168 0.3350045
stargazer(VAR_mdl[["varresult"]], type='text')
====================================================================
Dependent variable:
-------------------------------------
y
(1) (2) (3)
--------------------------------------------------------------------
Unemployment.l1 1.540*** -1.012** -1.531***
(0.088) (0.389) (0.343)
Inflation.l1 0.036** 0.682*** 0.054
(0.018) (0.079) (0.070)
Fed.Rate.l1 0.016 0.148 0.975***
(0.022) (0.096) (0.084)
Unemployment.l2 -0.568*** 1.564** 1.372**
(0.147) (0.646) (0.570)
Inflation.l2 -0.027 0.091 0.253***
(0.022) (0.096) (0.085)
Fed.Rate.l2 0.035 -0.060 -0.365***
(0.030) (0.133) (0.117)
Unemployment.l3 -0.012 -0.572 0.111
(0.086) (0.376) (0.332)
Inflation.l3 -0.004 0.184** -0.146**
(0.019) (0.084) (0.074)
Fed.Rate.l3 -0.020 -0.053 0.335***
(0.022) (0.098) (0.087)
--------------------------------------------------------------------
Observations 161 161 161
R2 0.999 0.955 0.986
Adjusted R2 0.999 0.952 0.985
Residual Std. Error (df = 152) 0.231 1.015 0.895
F Statistic (df = 9; 152) 12,729.890*** 356.559*** 1,193.802***
====================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
roots(VAR_mdl, modulus=TRUE)
[1] 0.99513614 0.94079541 0.76174764 0.76174764 0.59992529 0.59992529 0.38418659
[8] 0.38418659 0.08928642
GrI<-causality(VAR_mdl, cause = "Inflation")
GrI
$Granger
Granger causality H0: Inflation do not Granger-cause Unemployment
Fed.Rate
data: VAR object VAR_mdl
F-Test = 6.0224, df1 = 6, df2 = 456, p-value = 0.000004455
$Instant
H0: No instantaneous causality between: Inflation and Unemployment
Fed.Rate
data: VAR object VAR_mdl
Chi-squared = 4.219, df = 2, p-value = 0.1213
GrU<-causality(VAR_mdl, cause = "Unemployment")
GrU
$Granger
Granger causality H0: Unemployment do not Granger-cause Inflation
Fed.Rate
data: VAR object VAR_mdl
F-Test = 5.6633, df1 = 6, df2 = 456, p-value = 0.00001088
$Instant
H0: No instantaneous causality between: Unemployment and Inflation
Fed.Rate
data: VAR object VAR_mdl
Chi-squared = 23.73, df = 2, p-value = 0.000007031
GrF<-causality(VAR_mdl, cause = "Fed.Rate")
GrF
$Granger
Granger causality H0: Fed.Rate do not Granger-cause Unemployment
Inflation
data: VAR object VAR_mdl
F-Test = 3.2328, df1 = 6, df2 = 456, p-value = 0.004033
$Instant
H0: No instantaneous causality between: Fed.Rate and Unemployment
Inflation
data: VAR object VAR_mdl
Chi-squared = 26.054, df = 2, p-value = 0.0000022
From the analysis, it was established that inflation, unemployment and the fed rate all Granger-cause one another.
IRF_1<-irf(VAR_mdl, impulse = "Inflation", response = "Unemployment",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF_1, ylab="Unemployment",
main="Unemployment response to Inflation shock")
Unemployment response to an inflation shock
VARD_1<-fevd(VAR_mdl, n.ahead=5)
VARD_1
$Unemployment
Unemployment Inflation Fed.Rate
[1,] 1.0000000 0.000000000 0.0000000000
[2,] 0.9906356 0.008466579 0.0008978494
[3,] 0.9689308 0.017920429 0.0131487360
[4,] 0.9348733 0.028723840 0.0364029017
[5,] 0.8903648 0.046527944 0.0631072218
$Inflation
Unemployment Inflation Fed.Rate
[1,] 0.001789413 0.9982106 0.000000000
[2,] 0.062362243 0.9290021 0.008635625
[3,] 0.092268984 0.8917762 0.015954800
[4,] 0.100141867 0.8864991 0.013359037
[5,] 0.105821679 0.8826761 0.011502228
$Fed.Rate
Unemployment Inflation Fed.Rate
[1,] 0.1722401 0.02083114 0.8069287
[2,] 0.3326729 0.02515215 0.6421749
[3,] 0.4415973 0.06171751 0.4966851
[4,] 0.5042660 0.07830998 0.4174240
[5,] 0.5349494 0.08422492 0.3808256
plot(VARD_1)
Variance decomposition function
Generalized Autoregressive Conditional Heteroskedasticity (GARCH) is a statistical model used in analyzing time-series data where the variance of the error term is believed to be serially autocorrelated. GARCH models assume that the variance of the error term follows an autoregressive moving average process.
Although GARCH models can be used in the analysis of a number of different types of financial data, such as macroeconomic data, financial institutions typically use them to estimate the volatility of returns for stocks, bonds, and market indices. They use the resulting information to help determine pricing and judge which assets will potentially provide higher returns, as well as to forecast the returns of current investments to help in their asset allocation, hedging, risk management, and portfolio optimization decisions.
GARCH models are used when the variance of the error term is not constant. That is, the error term is heteroskedastic. Heteroskedasticity describes the irregular pattern of variation of an error term, or variable, in a statistical model. Essentially, wherever there is heteroskedasticity, observations do not conform to a linear pattern. Instead, they tend to cluster. Therefore, if statistical models that assume constant variance are used on this data, then the conclusions and predictive value one can draw from the model will not be reliable.
The variance of the error term in GARCH models is assumed to vary systematically, conditional on the average size of the error terms in previous periods. In other words, it has conditional heteroskedasticity, and the reason for the heteroskedasticity is that the error term is following an auto-regressive moving average pattern. This means that it is a function of an average of its own past values.
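In symbols, the standard GARCH(1,1) model sketched above can be written as follows (this is the generic textbook form, using the same notation as the fitted models later in this section): \[ Returns_{t} = \mu + e_{t}, \qquad e_{t} = \sigma_{t} z_{t} \] \[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 + \beta \sigma_{t-1}^2 \] where z_t is an i.i.d. shock with mean zero and unit variance, so that today's conditional variance depends on yesterday's squared error and yesterday's conditional variance.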
GARCH was developed in 1986 by Dr. Tim Bollerslev, a doctoral student at the time, as a way to address the problem of forecasting volatility in asset prices. It built on economist Robert Engle’s breakthrough 1982 work introducing the Autoregressive Conditional Heteroskedasticity (ARCH) model. Engle’s model assumed that the variation of financial returns was not constant over time but autocorrelated, or conditional on and dependent on past values. For instance, one can see this in stock returns, where periods of volatility tend to be clustered together.
Since the original introduction, many variations of GARCH have emerged. These include Nonlinear (NGARCH), which addresses correlation and observed “volatility clustering” of returns, and Integrated GARCH (IGARCH), which restricts the volatility parameter. All the GARCH model variations seek to incorporate the direction, positive or negative, of returns in addition to the magnitude (addressed in the original model).
Each derivation of GARCH can be used to accommodate the specific qualities of the stock, industry, or economic data. When assessing risk, financial institutions incorporate GARCH models into their Value-at-Risk (VaR) estimates of the maximum expected loss (whether for a single investment or trading position, a portfolio, or at a division or firm-wide level) over a specified time period. GARCH models are viewed as providing better gauges of risk than can be obtained through tracking the standard deviation alone.
Various studies have been conducted on the reliability of various GARCH models during different market conditions, including during the periods leading up to and after the Great Recession.
We are going to download the data for the apple stock price from the internet. We shall therefore use the command below.
apple<-getSymbols("AAPL",
from="2007-01-01",
to="2023-03-10")
plot(AAPL$AAPL.Close["2020"])
Plot of AAPL closing prices for 2020
head(AAPL,10)
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2007-01-03 3.081786 3.092143 2.925000 2.992857 1238319600 2.547275
2007-01-04 3.001786 3.069643 2.993571 3.059286 847260400 2.603815
2007-01-05 3.063214 3.078571 3.014286 3.037500 834741600 2.585272
2007-01-08 3.070000 3.090357 3.045714 3.052500 797106800 2.598039
2007-01-09 3.087500 3.320714 3.041071 3.306071 3349298400 2.813858
2007-01-10 3.383929 3.492857 3.337500 3.464286 2952880000 2.948517
2007-01-11 3.426429 3.456429 3.396429 3.421429 1440252800 2.912041
2007-01-12 3.378214 3.395000 3.329643 3.379286 1312690400 2.876172
2007-01-16 3.417143 3.473214 3.408929 3.467857 1244076400 2.951556
2007-01-17 3.484286 3.485714 3.386429 3.391071 1646260000 2.886203
tail(AAPL,10)
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2023-02-24 147.11 147.19 145.72 146.71 55469600 146.71
2023-02-27 147.71 149.17 147.45 147.92 44998500 147.92
2023-02-28 147.05 149.08 146.83 147.41 50547000 147.41
2023-03-01 146.83 147.23 145.01 145.31 55479000 145.31
2023-03-02 144.38 146.71 143.90 145.91 52238100 145.91
2023-03-03 148.04 151.11 147.33 151.03 70732300 151.03
2023-03-06 153.79 156.30 153.46 153.83 87558000 153.83
2023-03-07 153.70 154.03 151.13 151.60 56182000 151.60
2023-03-08 152.81 153.47 151.83 152.87 47204800 152.87
2023-03-09 153.56 154.54 150.23 150.59 53833600 150.59
chartSeries(AAPL)
Chart series for AAPL prices
chartSeries(AAPL["2020-12"])
Chart series showing prices for December 2020
Note that the yellow candlesticks indicate days on which AAPL closed lower than it opened, while the green candlesticks indicate days on which the closing price was higher than the opening price.
chartSeries(AAPL["2008-12"])
Chart series showing prices for December 2008
Returns<-CalculateReturns(AAPL$AAPL.Close)
head(Returns,10)
AAPL.Close
2007-01-03 NA
2007-01-04 0.022195848
2007-01-05 -0.007121269
2007-01-08 0.004938272
2007-01-09 0.083069943
2007-01-10 0.047855899
2007-01-11 -0.012371092
2007-01-12 -0.012317368
2007-01-16 0.026209975
2007-01-17 -0.022142205
Returns<-Returns[-1]
head(Returns,10)
AAPL.Close
2007-01-04 0.022195848
2007-01-05 -0.007121269
2007-01-08 0.004938272
2007-01-09 0.083069943
2007-01-10 0.047855899
2007-01-11 -0.012371092
2007-01-12 -0.012317368
2007-01-16 0.026209975
2007-01-17 -0.022142205
2007-01-18 -0.061927338
hist(Returns, breaks = 15)
Histogram for the returns
chart.Histogram(Returns,
methods = c('add.density','add.normal'),
colorset = c('blue','green','red'))
Histogram showing the distribution of returns superimposed with the normal curve
chartSeries(Returns,theme = 'white')
Volatility chart series
The graph above shows that there is volatility in the series; however, there is no apparent seasonality. High volatility is seen in 2008, specifically after the stock market crash in the United States that followed the fall in housing prices. The graph also shows other periods of higher variability in the returns.
We can capture volatility in the market with the help of standard deviation. First, we calculate the standard deviation of our series as shown below
sd(Returns)
[1] 0.02034422
The returns’ standard deviation is approximately 2.03% per day. Assuming there are 252 trading days in a year, we annualize it by multiplying by the square root of 252.
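In equation form, the annualization is simply \[ \sigma_{annual} = \sqrt{252}\times\sigma_{daily} \approx \sqrt{252}\times 0.0203 \approx 0.323 \] which is what the command below computes.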
sqrt(252)*sd(Returns)
[1] 0.3229545
Generally speaking, the annualized volatility of the returns is approximately 32.3% per year.
sqrt(252)*sd(Returns["2008"])
[1] 0.5820481
Volatility in returns for Apple in 2008 was approximately 58.20%. Volatility measures how quickly returns from an investment change due to factors such as economic or political shocks. Compare the volatility across the years 2008, 2015 and 2020: in which year did AAPL experience the highest volatility in returns?
sqrt(252)*sd(Returns["2015"])
[1] 0.2673398
Volatility in returns for Apple in 2015 was approximately 26.73%.
sqrt(252)*sd(Returns["2020"])
[1] 0.4665857
Volatility in returns for Apple in 2020 was approximately 46.66%.
chart.RollingPerformance(R=Returns["2007::2023"],
width = 22,
FUN = "sd.annualized",
scale=252,
main = "Apple's monthly rolling volatility")
Apple’s monthly rolling volatility
chart.RollingPerformance(R=Returns["2007::2023"],
width = 6,
FUN = "sd.annualized",
scale=252,
main = "Apple's monthly rolling volatility")
Apple’s weekly rolling volatility
chart.RollingPerformance(R=Returns["2007::2023"],
width = 252,
FUN = "sd.annualized",
scale=252,
main = "Apple's monthly rolling volatility")
Apple’s Annual rolling volatility
Before we fit the model, we create the model specification and store it as follows:
s<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
variance.model = list(model="sGARCH"),
distribution.model = 'norm')
m<- ugarchfit(data=Returns,spec = s)
m
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : sGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : norm
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001939 0.000250 7.7438 0
omega 0.000014 0.000001 9.5552 0
alpha1 0.105801 0.001912 55.3270 0
beta1 0.860030 0.008249 104.2613 0
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001939 0.000328 5.9202 0.000000
omega 0.000014 0.000004 3.3442 0.000825
alpha1 0.105801 0.017987 5.8823 0.000000
beta1 0.860030 0.013021 66.0519 0.000000
LogLikelihood : 10554.42
Information Criteria
------------------------------------
Akaike -5.1807
Bayes -5.1745
Shibata -5.1807
Hannan-Quinn -5.1785
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 0.9189 0.3378
Lag[2*(p+q)+(p+q)-1][2] 1.4893 0.3635
Lag[4*(p+q)+(p+q)-1][5] 4.3042 0.2185
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.1366 0.7117
Lag[2*(p+q)+(p+q)-1][5] 1.3903 0.7668
Lag[4*(p+q)+(p+q)-1][9] 2.6138 0.8206
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.6404 0.500 2.000 0.4236
ARCH Lag[5] 2.3106 1.440 1.667 0.4066
ARCH Lag[7] 2.4778 2.315 1.543 0.6170
Nyblom stability test
------------------------------------
Joint Statistic: 15.8903
Individual Statistics:
mu 0.1630
omega 3.2265
alpha1 0.3831
beta1 0.4430
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.07 1.24 1.6
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 0.7662 0.44359
Negative Sign Bias 1.4747 0.14036
Positive Sign Bias 0.8164 0.41431
Joint Effect 9.8021 0.02033 **
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 136.5 0.00000000000000000008531
2 30 153.3 0.00000000000000000073860
3 40 172.2 0.00000000000000000116447
4 50 171.7 0.00000000000000156776602
Elapsed time : 0.66782
From the results above, the fitted standard GARCH model is sGARCH(1,1). Remember we are using a constant mean model, ARFIMA(0,0,0), with normally distributed errors. The model has four parameters: mu, omega, alpha1 and beta1. From the results above, all the parameter estimates are statistically significant.
We can write the mean equation in RMarkdown using the command below: \[ Returns_{t}= \mu + e_{t} \] With the estimated mean plugged in, the equation becomes \[ Returns_{t}= 0.001939 + e_{t} \] Remember the mean is constant. On the other hand, the equation for the conditional variance at time t is \[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 + \beta \sigma_{t-1}^2 \]
If we replace the values from the output above in this equation, we obtain
\[ \sigma_{t}^2 = 0.000014 + 0.105801 e_{t-1}^2 + 0.860030 \sigma_{t-1}^2 \]
This is our fitted GARCH variance model. Remember that if we drop the beta term, the model reduces to an ARCH(1) model, which is simply \[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 \]
This output also provides a lot of other useful information. One of the most useful items is the set of information criteria, which includes the Akaike, Bayes, Hannan-Quinn and Shibata information criteria. The information criteria help us choose the best model without necessarily choosing a very complex one; a simpler model that fits the data well has a lower information criterion value. Therefore, when comparing models, we shall choose the one with the lowest information criteria values.
Another important piece of information is the weighted Ljung-Box test on the standardized residuals. The null hypothesis is that there is no serial correlation. From the output above, there is insufficient evidence to reject this null hypothesis, so we can conclude that there is no serial correlation in the standardized residuals. Likewise, there is no serial correlation in the standardized squared residuals; they behave more or less like a white noise process. This supports the validity of our GARCH model.
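As a cross-check of this conclusion, we can also run the plain (unweighted) Ljung-Box test ourselves on the extracted standardized residuals; the sketch below assumes the fitted object m from above and an arbitrary choice of 10 lags.
z <- as.numeric(residuals(m, standardize = TRUE))  # standardized residuals of the fit
Box.test(z, lag = 10, type = "Ljung-Box")          # H0: no serial correlation in residuals
Box.test(z^2, lag = 10, type = "Ljung-Box")        # H0: no serial correlation in squared residuals
Large p-values here would agree with the weighted tests reported in the summary.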
One thing we can notice, however, is that the p-values of the adjusted Pearson goodness-of-fit test are all less than 0.05, indicating that the normal distribution we assumed for the residuals is not a good choice. If the distribution were a good fit, the p-values of the goodness-of-fit test would be greater than 0.05. Thus, there is scope to improve our model.
plot(m,which ='all')
please wait...calculating quantiles...
The 12 plots for the distribution of residuals
The first plot shows the return series with two conditional standard deviations superimposed; the red lines are drawn at two standard deviations and cover roughly 95% of the returns. The second plot shows the series with a 1% value-at-risk limit, so the red lines mark the 1% limits. The third plot shows the conditional standard deviation against the absolute returns; the blue line is the conditional standard deviation. The fourth plot shows the autocorrelation of the observations. The fifth plot shows significant autocorrelation, with the bars extending above the red significance line, and the sixth plot shows a similar pattern. The seventh plot shows the cross-correlation of the squared observations with the actual observations. The eighth plot is a histogram of the residuals; its tails deviate from the normal distribution, which can also be seen in the normal Q-Q plot, where the standardized residuals deviate from the normal distribution at both low and high values. The ninth plot shows the autocorrelation of the standardized residuals, which has improved considerably, and for the squared standardized residuals all the values lie within the red bounds.
We shall therefore use this model to make our forecast and store the forecast in the object f.
f<-ugarchforecast(fitORspec = m,
n.ahead = 20)
plot(fitted(f))
Fitted plot
Remember this model was estimated with a constant mean, so the forecast of the mean is constant. We can also plot the forecast variability using the command below.
plot(sigma(f))
Variability plot
The graph shows that the volatility of the returns is forecast to increase over the next 20 days.
A portfolio allocation example will help guide what percentage of a portfolio should go to riskier assets such as stocks and what percentage should go to risk-free assets. This will be followed by four different variants:
* Model-2: GARCH with sstd
* Model-3: GJR-GARCH
* Model-4: AR(1) GJR-GARCH
* Model-5: GJR-GARCH in mean
Let us now calculate the annualized volatility using the code below and store the result in the object V; remember we assume there are 252 trading days in a year.
V<- sqrt(252) * sigma(m)
We shall also create an object W which represents the weight assigned to the risky asset. In this case, we target an annualized volatility of 5%, that is, 0.05.
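In other words, on each day the weight is the ratio of the volatility target to the model's annualized volatility: \[ W_{t} = \frac{0.05}{\sigma_{t}^{annualized}} \] which is exactly what the command below computes.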
W<-0.05/V
plot(merge(V,W),
multi.panel=T)
Weight assigned to risky assets
The upper panel shows the annualized volatility and the lower panel shows the weight assigned to the risky asset. Notice that when volatility is very high, as in 2008 and 2020, the weight assigned to the risky asset is very low; when volatility is low, the weight is relatively high. In other words, we can increase the allocation to the risky asset when volatility is low. View the weights at the tail end of the series below.
tail(W,20)
[,1]
2023-02-09 0.1640393
2023-02-10 0.1710039
2023-02-13 0.1801228
2023-02-14 0.1797704
2023-02-15 0.1875874
2023-02-16 0.1911529
2023-02-17 0.1940037
2023-02-21 0.1991621
2023-02-22 0.1772824
2023-02-23 0.1864209
2023-02-24 0.1954879
2023-02-27 0.1884858
2023-02-28 0.1959915
2023-03-01 0.2037025
2023-03-02 0.2003178
2023-03-03 0.2090398
2023-03-06 0.1746004
2023-03-07 0.1752742
2023-03-08 0.1760180
2023-03-09 0.1837988
Let us assume that Apple has two million dollars ($2,000,000) to allocate on the trading day 2022-08-12, when the weight assigned to the risky asset was approximately 18.3%. In this case the model suggests allocating the following amount to the risky asset. \[ Amount_{riskyAssets}= 2000000 \times 0.1831701 = 366340.2 \]
The calculation above shows that the company should have allocated $366,340.2 to the risky asset, as suggested by the model, and the following amount to risk-free assets; examples of risk-free assets include, but are not limited to, bank accounts.
\[ Amount_{freeRisk}=2000000-366340.2= 1633659.8 \]
Suppose we increase the volatility target assigned to the risky asset to 10%, that is, 0.1. Consider the graph below.
W_1<-0.1/V
plot(merge(V,W_1),
multi.panel=T)
Weight assigned to risky assets
As before, the upper panel shows volatility and the lower panel shows the weight assigned to the risky asset: the weight is low when volatility is high (2008, 2020) and high when volatility is low. However, each weight is now twice the corresponding weight in the previous graph. View the weights at the tail end of the series below.
tail(W_1,20)
[,1]
2023-02-09 0.3280787
2023-02-10 0.3420079
2023-02-13 0.3602457
2023-02-14 0.3595409
2023-02-15 0.3751747
2023-02-16 0.3823057
2023-02-17 0.3880073
2023-02-21 0.3983243
2023-02-22 0.3545649
2023-02-23 0.3728417
2023-02-24 0.3909758
2023-02-27 0.3769716
2023-02-28 0.3919831
2023-03-01 0.4074049
2023-03-02 0.4006355
2023-03-03 0.4180797
2023-03-06 0.3492009
2023-03-07 0.3505484
2023-03-08 0.3520361
2023-03-09 0.3675977
Let us again assume that Apple has two million dollars ($2,000,000) on the trading day 2022-08-12, when the weight assigned to the risky asset was approximately 36.6%. In this case the model suggests allocating the following amount to the risky asset.
2000000*0.3663402
[1] 732680.4
According to this model, the company should therefore allocate $732,680.4 to the risky asset and the remaining amount to risk-free assets. This allocation applies only if the company sets 10% as the volatility target for the risky asset.
In the previous model we saw that the normal distribution is not a good fit for the residuals. Let us now use the skewed Student's t-distribution to capture the residual distribution more accurately. We shall edit the specification to use the skewed Student's t-distribution.
S<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
variance.model = list(model="sGARCH"),
distribution.model = 'sstd')
M<- ugarchfit(data=Returns,spec = S)
M
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : sGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : sstd
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001636 0.000253 6.4593 0.000000
omega 0.000007 0.000003 2.3531 0.018617
alpha1 0.095489 0.011945 7.9939 0.000000
beta1 0.891651 0.012970 68.7478 0.000000
skew 1.009803 0.021639 46.6649 0.000000
shape 5.103620 0.450734 11.3229 0.000000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001636 0.000255 6.4029 0.00000
omega 0.000007 0.000007 1.0254 0.30519
alpha1 0.095489 0.017466 5.4672 0.00000
beta1 0.891651 0.019261 46.2933 0.00000
skew 1.009803 0.020503 49.2508 0.00000
shape 5.103620 0.686130 7.4383 0.00000
LogLikelihood : 10711.67
Information Criteria
------------------------------------
Akaike -5.2569
Bayes -5.2476
Shibata -5.2569
Hannan-Quinn -5.2536
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 0.8989 0.3431
Lag[2*(p+q)+(p+q)-1][2] 1.4222 0.3794
Lag[4*(p+q)+(p+q)-1][5] 4.4234 0.2058
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.004239 0.9481
Lag[2*(p+q)+(p+q)-1][5] 1.043397 0.8498
Lag[4*(p+q)+(p+q)-1][9] 2.107863 0.8917
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.5377 0.500 2.000 0.4634
ARCH Lag[5] 1.9446 1.440 1.667 0.4836
ARCH Lag[7] 2.2279 2.315 1.543 0.6691
Nyblom stability test
------------------------------------
Joint Statistic: 3.7977
Individual Statistics:
mu 0.19072
omega 0.74030
alpha1 0.74349
beta1 0.86433
skew 0.04262
shape 1.03582
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.49 1.68 2.12
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.0011 0.31684
Negative Sign Bias 1.2064 0.22773
Positive Sign Bias 0.8987 0.36885
Joint Effect 10.3737 0.01564 **
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 27.25 0.09906
2 30 41.89 0.05744
3 40 49.85 0.11431
4 50 53.92 0.29176
Elapsed time : 1.565172
Having fitted the model with the skewed Student's t-distribution, let us now produce the 12 diagnostic plots and examine the distribution of the residuals.
plot(M,which ='all')
please wait...calculating quantiles...
The 12 plots for residual distribution
The plots above show that this model is an improvement over the previous one, as indicated by the distribution of the residuals: in the sstd Q-Q plot the residuals align relatively well with the diagonal line. The information criteria have also improved. Additionally, the skew estimate of about 1.01 is very close to 1, indicating that the distribution is approximately symmetric.
V_1<- sqrt(252) * sigma(M)
We shall also create an object W_2 representing the weight assigned to the risky asset, again targeting an annualized volatility of 5%, that is, 0.05.
W_2<-0.05/V
plot(merge(V_1,W_2),
multi.panel=T)
Weight assigned to risky and non-risky assets
As before, the upper panel shows volatility and the lower panel shows the weight assigned to the risky asset: the weight is low when volatility spikes (2008, 2020) and high when volatility is low. View the weights at the tail end of the series below.
tail(W_2,20)
[,1]
2023-02-09 0.1640393
2023-02-10 0.1710039
2023-02-13 0.1801228
2023-02-14 0.1797704
2023-02-15 0.1875874
2023-02-16 0.1911529
2023-02-17 0.1940037
2023-02-21 0.1991621
2023-02-22 0.1772824
2023-02-23 0.1864209
2023-02-24 0.1954879
2023-02-27 0.1884858
2023-02-28 0.1959915
2023-03-01 0.2037025
2023-03-02 0.2003178
2023-03-03 0.2090398
2023-03-06 0.1746004
2023-03-07 0.1752742
2023-03-08 0.1760180
2023-03-09 0.1837988
Similarly, if we assume that Apple has two million dollars ($2,000,000) on the trading day 2022-08-12, when the weight assigned to the risky asset was approximately 18.3%, the model suggests allocating the following amount to the risky asset and the remainder to risk-free assets.
2000000*0.1831701
[1] 366340.2
We are going to copy the model we used above and modify it to estimate the GJR-GARCH model, developed by Glosten, Jagannathan and Runkle. Consider the results below.
GJR<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
variance.model = list(model="gjrGARCH"),
distribution.model = 'sstd')
M_1<- ugarchfit(data=Returns,spec = GJR)
M_1
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : sstd
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001423 0.000244 5.8404 0
omega 0.000010 0.000001 8.0677 0
alpha1 0.037218 0.001654 22.5009 0
beta1 0.874450 0.009426 92.7727 0
gamma1 0.137875 0.019465 7.0831 0
skew 1.004774 0.021448 46.8466 0
shape 5.434271 0.393137 13.8228 0
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001423 0.000267 5.3386 0.00000
omega 0.000010 0.000002 5.4622 0.00000
alpha1 0.037218 0.010454 3.5603 0.00037
beta1 0.874450 0.010689 81.8099 0.00000
gamma1 0.137875 0.024481 5.6319 0.00000
skew 1.004774 0.020432 49.1765 0.00000
shape 5.434271 0.495804 10.9605 0.00000
LogLikelihood : 10740.52
Information Criteria
------------------------------------
Akaike -5.2706
Bayes -5.2597
Shibata -5.2706
Hannan-Quinn -5.2667
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 1.885 0.1698
Lag[2*(p+q)+(p+q)-1][2] 2.319 0.2150
Lag[4*(p+q)+(p+q)-1][5] 5.179 0.1394
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.1851 0.6671
Lag[2*(p+q)+(p+q)-1][5] 1.3287 0.7819
Lag[4*(p+q)+(p+q)-1][9] 2.3390 0.8609
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.7895 0.500 2.000 0.3743
ARCH Lag[5] 2.1081 1.440 1.667 0.4478
ARCH Lag[7] 2.3698 2.315 1.543 0.6394
Nyblom stability test
------------------------------------
Joint Statistic: 21.7937
Individual Statistics:
mu 0.49535
omega 5.11751
alpha1 1.44995
beta1 1.43854
gamma1 0.97126
skew 0.03061
shape 1.36371
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.69 1.9 2.35
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.2770 0.2017
Negative Sign Bias 0.5576 0.5771
Positive Sign Bias 0.1498 0.8809
Joint Effect 2.6469 0.4493
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 23.23 0.2274
2 30 34.73 0.2135
3 40 40.62 0.3990
4 50 48.86 0.4786
Elapsed time : 2.870686
We can notice a very big change in the last of the 12 plots, as shown below.
plot(M_1,which ='all')
please wait...calculating quantiles...
Plot number 12, the news impact curve, changed drastically under the GJR-GARCH model. According to this model, when news has a positive impact on the stock price, the increase in volatility is gradual; when news has a negative impact, the effect is significantly larger. Therefore, as per this model, the impact of news on the stock market is not symmetric: it is larger for negative news than for positive news.
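The asymmetry described above can also be inspected directly with rugarch's newsimpact() function, which returns the news impact curve of a fitted model. A minimal sketch, using the fitted object M_1 from above:
ni <- newsimpact(z = NULL, object = M_1)
plot(ni$zx, ni$zy, type = "l",
     xlab = ni$xexpr, ylab = ni$yexpr,
     main = "News impact curve for the GJR-GARCH model")
# shocks to the left of zero (bad news) raise the conditional variance more than
# equally sized shocks to the right of zero (good news), reflecting the gamma1 term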
The model parameters are all significant. The mean model is written as follows: \[ Returns_{t} = \mu + e_{t} \]
When we plug in the estimated mean, the model becomes: \[ Returns_{t} = 0.001423 + e_{t} \]
Remember that in this model the error term e_t has mean zero and conditional variance sigma squared (here with a skewed Student's t distribution). For the conditional variance at time t we have two cases, depending on whether the previous shock was negative or positive. When the news is negative, that is when e_{t-1} < 0, the conditional variance is \[ \sigma_{t}^2 = \omega + (\alpha + \gamma)e_{t-1}^2 + \beta\sigma_{t-1}^2 \] Plugging in the estimates from the output above gives \[ \sigma_{t}^2 = 0.000010 + (0.037218 + 0.137875)e_{t-1}^2 + 0.874450 \sigma_{t-1}^2 \]
However, when the news has a positive impact on the stock market, that is when e_{t-1} > 0, we get the standard GARCH variance equation we have seen earlier:
\[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 + \beta\sigma_{t-1}^2 \]
When we plug in the estimates, we obtain the following model.
\[ \sigma_{t}^2 = 0.000010 + 0.037218e_{t-1}^2 + 0.874450\sigma_{t-1}^2 \]
Generally, the third model is better than the first two. To validate this claim we focus on the following:
* the skewness statistic
* the information criteria
The skewness estimate for this model is closer to one than in the first two models, and its information criteria values are lower than those of the other two models, as can be confirmed with the comparison below. This is enough evidence to show that this model performs better than the first two.
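A quick way to place the information criteria of the three fitted models side by side is to call rugarch's infocriteria() extractor on each fitted object (m, M and M_1 were created above); lower values indicate a better trade-off between fit and complexity.
infocriteria(m)    # sGARCH(1,1) with normal errors
infocriteria(M)    # sGARCH(1,1) with skewed Student's t errors
infocriteria(M_1)  # gjrGARCH(1,1) with skewed Student's t errors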
We shall edit the specification used above to obtain the AR(1) GJR-GARCH model.
AR_GJR<-ugarchspec(mean.model = list(armaOrder =c(1,0)),
variance.model = list(model="gjrGARCH"),
distribution.model = 'sstd')
M_2<- ugarchfit(data=Returns,spec = AR_GJR)
M_2
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model : ARFIMA(1,0,0)
Distribution : sstd
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001413 0.000246 5.74103 0.00000
ar1 0.010777 0.015836 0.68056 0.49615
omega 0.000010 0.000001 8.19257 0.00000
alpha1 0.036718 0.001790 20.51119 0.00000
beta1 0.874146 0.009433 92.66428 0.00000
gamma1 0.139487 0.019706 7.07832 0.00000
skew 1.005077 0.021486 46.77744 0.00000
shape 5.458812 0.396628 13.76304 0.00000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001413 0.000267 5.29314 0.000000
ar1 0.010777 0.014465 0.74508 0.456226
omega 0.000010 0.000002 5.55291 0.000000
alpha1 0.036718 0.010358 3.54498 0.000393
beta1 0.874146 0.010674 81.89345 0.000000
gamma1 0.139487 0.024749 5.63612 0.000000
skew 1.005077 0.020616 48.75308 0.000000
shape 5.458812 0.497708 10.96789 0.000000
LogLikelihood : 10740.75
Information Criteria
------------------------------------
Akaike -5.2702
Bayes -5.2578
Shibata -5.2702
Hannan-Quinn -5.2658
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 0.6056 0.4365
Lag[2*(p+q)+(p+q)-1][2] 1.0428 0.7140
Lag[4*(p+q)+(p+q)-1][5] 3.8959 0.2425
d.o.f=1
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.2268 0.6339
Lag[2*(p+q)+(p+q)-1][5] 1.3456 0.7778
Lag[4*(p+q)+(p+q)-1][9] 2.3370 0.8612
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.7797 0.500 2.000 0.3772
ARCH Lag[5] 2.0551 1.440 1.667 0.4592
ARCH Lag[7] 2.3206 2.315 1.543 0.6497
Nyblom stability test
------------------------------------
Joint Statistic: 22.3087
Individual Statistics:
mu 0.49114
ar1 0.26770
omega 5.18582
alpha1 1.44965
beta1 1.43671
gamma1 0.97868
skew 0.03063
shape 1.36409
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.89 2.11 2.59
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.2416 0.2145
Negative Sign Bias 0.5609 0.5749
Positive Sign Bias 0.1590 0.8737
Joint Effect 2.5170 0.4722
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 25.73 0.1378
2 30 34.63 0.2170
3 40 44.41 0.2544
4 50 51.59 0.3729
Elapsed time : 3.45856
plot(M_2,which='all')
please wait...calculating quantiles...
There is nothing new in the plots above; however, what is readily noticeable in the estimates is that the AR(1) term we have just included is not statistically significant. The implication is that adding this term to the GARCH model is not helpful, and we should therefore consider removing it. In other words, we are no longer going to use this model.
As usual, we shall edit the specification used above to obtain the GJR-GARCH-in-mean model.
GJR_mean<-ugarchspec(mean.model = list(armaOrder =c(0,0),
archm=T,
archpow=2),
variance.model = list(model="gjrGARCH"),
distribution.model = 'sstd')
M_3<- ugarchfit(data=Returns,spec = GJR_mean)
M_3
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : sstd
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001456 0.000327 4.45794 0.000008
archm -0.123404 0.941008 -0.13114 0.895664
omega 0.000010 0.000001 8.51083 0.000000
alpha1 0.037186 0.004869 7.63665 0.000000
beta1 0.874803 0.009427 92.80199 0.000000
gamma1 0.137798 0.019951 6.90689 0.000000
skew 1.004600 0.021629 46.44748 0.000000
shape 5.436195 0.403567 13.47038 0.000000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001456 0.000382 3.81118 0.000138
archm -0.123404 0.974138 -0.12668 0.899194
omega 0.000010 0.000002 6.17301 0.000000
alpha1 0.037186 0.009680 3.84166 0.000122
beta1 0.874803 0.010795 81.03857 0.000000
gamma1 0.137798 0.024679 5.58354 0.000000
skew 1.004600 0.020274 49.55001 0.000000
shape 5.436195 0.474629 11.45355 0.000000
LogLikelihood : 10740.53
Information Criteria
------------------------------------
Akaike -5.2701
Bayes -5.2577
Shibata -5.2701
Hannan-Quinn -5.2657
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 1.837 0.1753
Lag[2*(p+q)+(p+q)-1][2] 2.286 0.2195
Lag[4*(p+q)+(p+q)-1][5] 5.150 0.1415
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.1814 0.6701
Lag[2*(p+q)+(p+q)-1][5] 1.3255 0.7827
Lag[4*(p+q)+(p+q)-1][9] 2.3359 0.8613
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.7881 0.500 2.000 0.3747
ARCH Lag[5] 2.1081 1.440 1.667 0.4478
ARCH Lag[7] 2.3712 2.315 1.543 0.6391
Nyblom stability test
------------------------------------
Joint Statistic: 22.2572
Individual Statistics:
mu 0.50324
archm 0.32431
omega 4.97450
alpha1 1.44914
beta1 1.43572
gamma1 0.97241
skew 0.03083
shape 1.36238
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.89 2.11 2.59
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.2901 0.1971
Negative Sign Bias 0.5520 0.5810
Positive Sign Bias 0.1468 0.8833
Joint Effect 2.6969 0.4408
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 22.96 0.2389
2 30 35.57 0.1864
3 40 40.64 0.3982
4 50 50.58 0.4108
Elapsed time : 4.48175
Again, the additional term (archm) in our model is not statistically significant. Thus we shall ignore this model and not use it to make our prediction.
In this part we shall make use of the best of the five models above to simulate the stock prices for Apple. Of the five models that we ran, the best was the GJR-GARCH model with skewed Student's t errors, the third model we estimated. We shall therefore copy that specification and edit it to run our simulation.
GJR<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
variance.model = list(model="gjrGARCH"),
distribution.model = 'sstd')
M_1<- ugarchfit(data=Returns,spec = GJR)
We shall treat this model as our final model. Run the two command lines above and run the command lines below as well.
GJR_final<-GJR
setfixed(GJR_final)<-as.list(coef(M_1))
Run the command below to obtain the coefficients of the model M_1.
coef(M_1)
mu omega alpha1 beta1 gamma1
0.00142344481 0.00001015981 0.03721790057 0.87445039679 0.13787499650
skew shape
1.00477372417 5.43427134460
Now we want to compare one-year-ahead forecasts made at the end of two different years, 2008 and 2022, using the commands below. Each forecast covers the next year, which has approximately 252 trading days.
f2008<-ugarchforecast(data = Returns["/2008-12"],
fitORspec = GJR_final,
n.ahead = 252)
f2022<-ugarchforecast(data = Returns["/2022-12"],
fitORspec = GJR_final,
n.ahead = 252)
par(mfrow= c(2,1))
plot(sigma(f2008))
plot(sigma(f2022))
At the end of 2008 volatility was high, and the model forecast that over the following year it would fall. At the end of 2022 volatility was low, and the model forecast that it would increase during 2023. In the long run, the forecast volatility reverts towards its average level. Now let us simulate return paths from the fitted model and store the simulation in the object SIM; consider the command below.
SIM <- ugarchpath(spec = GJR_final,
m.sim = 3,
n.sim = 1*250,
rseed = 123)
plot.zoo(fitted(SIM))
The graph above shows three different simulated return paths based on our model. We can also plot the simulated volatility for the 250 simulated trading days.
plot.zoo(sigma(SIM))
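The simulated series are returns, not prices. To turn them into simulated price paths we can compound the returns starting from the last observed closing price; a minimal sketch, assuming the objects SIM and AAPL created above (the returns were computed as simple returns, so we compound with cumulative products):
last_price  <- as.numeric(tail(AAPL$AAPL.Close, 1))            # last observed close
sim_returns <- fitted(SIM)                                      # matrix of simulated returns, one column per path
sim_prices  <- last_price * apply(1 + sim_returns, 2, cumprod)  # compound each path into a price series
matplot(sim_prices, type = "l", lty = 1,
        xlab = "Days ahead", ylab = "Simulated price",
        main = "Simulated AAPL price paths")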
The One Sample t Test examines whether the mean of a population is statistically different from a known or hypothesized value. The One Sample t Test is a parametric test. This test is also known as single Sample t Test. The variable used in this test is known as the test variable. In a One Sample t Test, the test variable’s mean is compared against a “test value”, which is a known or hypothesized value of the mean in the population. Test values may come from a literature review, a trusted research organization, legal requirements, or industry standards. For example:
A particular factory’s machines are supposed to fill bottles with 150 milliliters of product. A plant manager wants to test a random sample of bottles to ensure that the machines are not under- or over-filling the bottles. The United States Environmental Protection Agency (EPA) sets clearance levels for the amount of lead present in homes: no more than 10 micrograms per square foot on floors and no more than 100 micrograms per square foot on window sills (as of December 2020). An inspector wants to test if samples taken from units in an apartment building exceed the clearance level.
Enter the following data
score<-c(50,65,72,77,73,85,88,80,65,56,66,78,82,90)
score
[1] 50 65 72 77 73 85 88 80 65 56 66 78 82 90
length(score)
[1] 14
mean(score)
[1] 73.35714
t.test(score,mu=80)
One Sample t-test
data: score
t = -2.0988, df = 13, p-value = 0.05593
alternative hypothesis: true mean is not equal to 80
95 percent confidence interval:
66.51943 80.19486
sample estimates:
mean of x
73.35714
In the above output, the p-value of the test is 0.05593, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is not enough evidence to conclude that the true mean differs from 80.
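To see where this t statistic comes from, we can reproduce it by hand from the one-sample formula t = (sample mean - test value) / (standard error); the sketch below uses the score vector entered above.
xbar <- mean(score)                        # sample mean
s    <- sd(score)                          # sample standard deviation
n    <- length(score)                      # sample size
t_stat <- (xbar - 80) / (s / sqrt(n))      # test statistic against the test value 80
t_stat                                     # should reproduce t = -2.0988
2 * pt(-abs(t_stat), df = n - 1)           # two-sided p-value, about 0.056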
t.test(score,mu=80,alternative="two.sided",conf.level=0.95)
One Sample t-test
data: score
t = -2.0988, df = 13, p-value = 0.05593
alternative hypothesis: true mean is not equal to 80
95 percent confidence interval:
66.51943 80.19486
sample estimates:
mean of x
73.35714
As before, the p-value of the test is 0.05593, which is greater than the significance level alpha = 0.05, so we fail to reject the null hypothesis that the true mean is 80; this call simply makes the default two-sided alternative and 95% confidence level explicit.
t.test(score,mu=80,alternative="less",conf.level=0.95)
One Sample t-test
data: score
t = -2.0988, df = 13, p-value = 0.02797
alternative hypothesis: true mean is less than 80
95 percent confidence interval:
-Inf 78.96227
sample estimates:
mean of x
73.35714
In the above output, the p-value of the test is 0.02797, which is less than the significance level alpha = 0.05. We reject the null hypothesis and conclude that the true mean is less than 80.
t.test(score,mu=80,alternative="greater",conf.level=0.95)
One Sample t-test
data: score
t = -2.0988, df = 13, p-value = 0.972
alternative hypothesis: true mean is greater than 80
95 percent confidence interval:
67.75202 Inf
sample estimates:
mean of x
73.35714
In the above output, the p-value of the test is 0.972, which is greater than the significance level alpha = 0.05. We fail to reject the null hypothesis: there is no evidence that the true mean is greater than 80.
Paired T-Test. The paired sample t-test, sometimes called the dependent sample t-test, is a statistical procedure used to determine whether the mean difference between two sets of observations is zero. In a paired sample t-test, each subject or entity is measured twice, resulting in pairs of observations.
weight_before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)
weight_after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)
myDatta<-data.frame(weight_before,weight_after)
head(myDatta,10)
weight_before weight_after
1 190.1 392.9
2 190.9 313.2
3 172.7 345.1
4 213.0 393.0
5 231.4 434.0
6 196.9 227.9
7 172.2 422.0
8 285.5 383.9
9 225.2 392.3
10 113.7 352.2
t.test(weight_before,weight_after, alternative="two.sided",paired = TRUE)
Paired t-test
data: weight_before and weight_after
t = -7.9059, df = 9, p-value = 0.00002433
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-214.1284 -118.8516
sample estimates:
mean difference
-166.49
In the above output, the p-value of the test is approximately 0.00002, which is far less than the significance level alpha = 0.05. We can conclude that the true mean difference is not equal to zero.
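A paired t-test is equivalent to a one-sample t-test on the within-pair differences; the sketch below uses the two weight vectors entered above.
d <- weight_before - weight_after          # within-pair differences
t.test(d, mu = 0)                          # same t, df and p-value as the paired test above
mean(d) / (sd(d) / sqrt(length(d)))        # the t statistic by hand, should reproduce t = -7.9059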
t.test(weight_before,weight_after, alternative="less",paired = TRUE)
Paired t-test
data: weight_before and weight_after
t = -7.9059, df = 9, p-value = 0.00001216
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
-Inf -127.8868
sample estimates:
mean difference
-166.49
In the above output, the p-value of the test is approximately 0.00001, which is less than the significance level alpha = 0.05. We conclude that the true mean difference is less than zero, that is, the weight after is greater than the weight before.
t.test(weight_before,weight_after, alternative="greater",paired = TRUE)
Paired t-test
data: weight_before and weight_after
t = -7.9059, df = 9, p-value = 1
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
-205.0932 Inf
sample estimates:
mean difference
-166.49
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We fail to reject the null hypothesis: there is no evidence that the true mean difference is greater than zero.
weight_before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)
weight_after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)
myData <- data.frame(
weight = c(weight_before, weight_after),
group = rep(c("weight_before", "weight_after"), each = 10)
)
myData
weight group
1 190.1 weight_before
2 190.9 weight_before
3 172.7 weight_before
4 213.0 weight_before
5 231.4 weight_before
6 196.9 weight_before
7 172.2 weight_before
8 285.5 weight_before
9 225.2 weight_before
10 113.7 weight_before
11 392.9 weight_after
12 313.2 weight_after
13 345.1 weight_after
14 393.0 weight_after
15 434.0 weight_after
16 227.9 weight_after
17 422.0 weight_after
18 383.9 weight_after
19 392.3 weight_after
20 352.2 weight_after
t.test(weight~group, alternative="two.sided",paired = TRUE, data=myData)
Paired t-test
data: weight by group
t = 7.9059, df = 9, p-value = 0.00002433
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
118.8516 214.1284
sample estimates:
mean difference
166.49
In the above output, the p-value of the test is approximately 0.00002, which is less than the significance level alpha = 0.05. We conclude that the true mean difference is not equal to zero.
t.test(weight~group, alternative="less",paired = TRUE, data=myData)
Paired t-test
data: weight by group
t = 7.9059, df = 9, p-value = 1
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
-Inf 205.0932
sample estimates:
mean difference
166.49
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We fail to reject the null hypothesis: there is no evidence that the true mean difference is less than zero.
t.test(weight~group, alternative="greater",paired = TRUE, data=myData)
Paired t-test
data: weight by group
t = 7.9059, df = 9, p-value = 0.00001216
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
127.8868 Inf
sample estimates:
mean difference
166.49
In the above output, the p-value of the test is approximately 0.00001 (1.216e-05), which is less than the significance level alpha = 0.05. We reject the null hypothesis and conclude that the true mean difference (weight_after minus weight_before) is greater than zero.
my_tbl2020 <- tibble::tribble(
~S.n, ~mathematics, ~arts,
1, 22, 53,
2, 37, 68,
3, 36, 42,
4, 38, 49,
5, 42, 51,
6, 58, 65,
7, 58, 51,
8, 60, 71,
9, 62, 55,
10, 65, 74,
11, 66, 68,
12, 56, 64,
13, 66, 67,
14, 67, 73,
15, 62, 65
)
require(knitr)
kable(my_tbl2020, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| S.n | mathematics | arts |
|---|---|---|
| 1 | 22 | 53 |
| 2 | 37 | 68 |
| 3 | 36 | 42 |
| 4 | 38 | 49 |
| 5 | 42 | 51 |
| 6 | 58 | 65 |
| 7 | 58 | 51 |
| 8 | 60 | 71 |
| 9 | 62 | 55 |
| 10 | 65 | 74 |
| 11 | 66 | 68 |
| 12 | 56 | 64 |
| 13 | 66 | 67 |
| 14 | 67 | 73 |
| 15 | 62 | 65 |
attach(my_tbl2020)
head(my_tbl2020,10)
# A tibble: 10 × 3
S.n mathematics arts
<dbl> <dbl> <dbl>
1 1 22 53
2 2 37 68
3 3 36 42
4 4 38 49
5 5 42 51
6 6 58 65
7 7 58 51
8 8 60 71
9 9 62 55
10 10 65 74
TTEST<-t.test(mathematics,arts, alternative = "two.sided",paired = TRUE,data=my_tbl2020, conf.level=.95)
TTEST
Paired t-test
data: mathematics and arts
t = -2.8805, df = 14, p-value = 0.0121
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-14.073042 -2.060291
sample estimates:
mean difference
-8.066667
In the above output, the p-value of the test is approximately 0.0121, which is less than the significance level alpha = 0.05. We can therefore reject the null hypothesis and conclude that the true mean difference between the mathematics and arts scores is not equal to zero at the 5% level of significance.
before<-c(117,111,98,104,105,100,81,89,78)
after<-c(83,85,75,82,82,77,62,69,64)
My_framed_data<-data.frame(before,after)
My_framed_data
before after
1 117 83
2 111 85
3 98 75
4 104 82
5 105 82
6 100 77
7 81 62
8 89 69
9 78 64
MY_TTEST<-t.test(before,after, alternative = "two.sided",paired = TRUE,data=My_framed_data, conf.level = 0.99)
MY_TTEST
Paired t-test
data: before and after
t = 12.52, df = 8, p-value = 0.000001551
alternative hypothesis: true mean difference is not equal to 0
99 percent confidence interval:
16.59186 28.74147
sample estimates:
mean difference
22.66667
In the above output, the p-value of the test is approximately 0.0000016 (1.551e-06), which is less than alpha = 0.05 (and also less than 0.01, consistent with the 99% confidence interval requested). We can therefore reject the null hypothesis and conclude that the true mean difference between the before and after measurements is not equal to zero.
MY_TTEST1<-t.test(before,after, alternative = "less",paired = TRUE,data=My_framed_data)
MY_TTEST1
Paired t-test
data: before and after
t = 12.52, df = 8, p-value = 1
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
-Inf 26.03331
sample estimates:
mean difference
22.66667
In the above output, the p-value of the test is approximately 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true mean difference (before minus after) is less than zero.
MY_TTEST2<-t.test(before,after, alternative = "greater",paired = TRUE,data=My_framed_data)
MY_TTEST2
Paired t-test
data: before and after
t = 12.52, df = 8, p-value = 0.0000007754
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
19.30002 Inf
sample estimates:
mean difference
22.66667
In the above output, the p-value of the test is approximately 0.0000008 (7.754e-07), which is far less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean difference (before minus after) is greater than zero.
What is an unpaired t-test? An unpaired t-test (also known as an independent t-test) is a statistical procedure that compares the averages/means of two independent or unrelated groups to determine if there is a significant difference between the two.
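Note that, unless told otherwise, t.test() in R does not assume equal variances and therefore reports a Welch two-sample t-test, as in the outputs below; setting var.equal = TRUE requests the classical pooled-variance Student's t-test instead. A minimal sketch with two made-up vectors x and y (hypothetical data, not taken from this document):
x <- c(5.1, 4.8, 5.6, 5.0, 4.9)   # hypothetical group 1
y <- c(5.9, 6.1, 5.7, 6.3, 6.0)   # hypothetical group 2
t.test(x, y, var.equal = TRUE)    # pooled-variance (Student's) two-sample t-test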
data("gapminder")
data2021<-gapminder%>%
dplyr::select(country, lifeExp)%>%
filter(country == "Kenya"|
country == "United States")
head(data2021,14)
# A tibble: 14 × 2
country lifeExp
<fct> <dbl>
1 Kenya 42.3
2 Kenya 44.7
3 Kenya 47.9
4 Kenya 50.7
5 Kenya 53.6
6 Kenya 56.2
7 Kenya 58.8
8 Kenya 59.3
9 Kenya 59.3
10 Kenya 54.4
11 Kenya 51.0
12 Kenya 54.1
13 United States 68.4
14 United States 69.5
t.test(lifeExp ~ country,alternative="two.sided",data = data2021)
Welch Two Sample t-test
data: lifeExp by country
t = -11.051, df = 17.966, p-value = 0.000000001917
alternative hypothesis: true difference in means between group Kenya and group United States is not equal to 0
95 percent confidence interval:
-24.75174 -16.84326
sample estimates:
mean in group Kenya mean in group United States
52.6810 73.4785
In the above output, the p-value of the test is approximately 2e-09, which is far less than the significance level alpha = 0.05. We can conclude that the true difference in mean life expectancy between Kenya and the United States is not equal to zero.
data2021<-gapminder %>%
dplyr::select(country, lifeExp)%>%
filter(country == "Kenya"|
country == "United States")
head(data2021,14)
# A tibble: 14 × 2
country lifeExp
<fct> <dbl>
1 Kenya 42.3
2 Kenya 44.7
3 Kenya 47.9
4 Kenya 50.7
5 Kenya 53.6
6 Kenya 56.2
7 Kenya 58.8
8 Kenya 59.3
9 Kenya 59.3
10 Kenya 54.4
11 Kenya 51.0
12 Kenya 54.1
13 United States 68.4
14 United States 69.5
t.test(lifeExp ~ country,alternative="less",data = data2021)
Welch Two Sample t-test
data: lifeExp by country
t = -11.051, df = 17.966, p-value = 0.0000000009583
alternative hypothesis: true difference in means between group Kenya and group United States is less than 0
95 percent confidence interval:
-Inf -17.53385
sample estimates:
mean in group Kenya mean in group United States
52.6810 73.4785
In the above output, the p-value of the test is approximately 1e-09, which is less than the significance level alpha = 0.05. We reject the null hypothesis and conclude that the true difference in mean life expectancy (Kenya minus United States) is less than zero, i.e. mean life expectancy in Kenya is lower than in the United States.
data2021<-gapminder %>%
dplyr::select(country, lifeExp)%>%
filter(country == "Kenya"|
country == "United States")
head(data2021,14)
# A tibble: 14 × 2
country lifeExp
<fct> <dbl>
1 Kenya 42.3
2 Kenya 44.7
3 Kenya 47.9
4 Kenya 50.7
5 Kenya 53.6
6 Kenya 56.2
7 Kenya 58.8
8 Kenya 59.3
9 Kenya 59.3
10 Kenya 54.4
11 Kenya 51.0
12 Kenya 54.1
13 United States 68.4
14 United States 69.5
t.test(lifeExp ~ country,alternative="greater",data = data2021)
Welch Two Sample t-test
data: lifeExp by country
t = -11.051, df = 17.966, p-value = 1
alternative hypothesis: true difference in means between group Kenya and group United States is greater than 0
95 percent confidence interval:
-24.06115 Inf
sample estimates:
mean in group Kenya mean in group United States
52.6810 73.4785
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We fail to reject the null hypothesis: there is no evidence that the true difference in mean life expectancy (Kenya minus United States) is greater than zero.
male<-c(135,180,108,128,160,143,175,170)
female<-c(205,195,185,150,175,190,180,220)
daatta<-data.frame(male,female)
daatta
male female
1 135 205
2 180 195
3 108 185
4 128 150
5 160 175
6 143 190
7 175 180
8 170 220
TTEST1<-t.test(male,female, alternative = "two.sided",paired = FALSE,data=daatta)
TTEST1
Welch Two Sample t-test
data: male and female
t = -3.2304, df = 13.477, p-value = 0.006308
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-62.69717 -12.55283
sample estimates:
mean of x mean of y
149.875 187.500
In the above output, the p-value of the test is approximately 0.006308, which is less than the significance level alpha = 0.05. We can conclude that the true difference in mean weight between males and females is not equal to zero at the 5% level of significance.
TTEST2<-t.test(male,female, alternative = "less",paired = FALSE,data=daatta)
TTEST2
Welch Two Sample t-test
data: male and female
t = -3.2304, df = 13.477, p-value = 0.003154
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -17.05415
sample estimates:
mean of x mean of y
149.875 187.500
In the above output, the p-value of the test is approximately 0.003154, which is less than the significance level alpha = 0.05. We can conclude that the true difference in mean weight (male minus female) is less than zero at the 5% level of significance.
TTEST3<-t.test(male,female, alternative = "greater",paired = FALSE,data=daatta)
TTEST3
Welch Two Sample t-test
data: male and female
t = -3.2304, df = 13.477, p-value = 0.9968
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-58.19585 Inf
sample estimates:
mean of x mean of y
149.875 187.500
In the above output, the p-value of the test is approximately 0.9968, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true difference in mean weight (male minus female) is greater than zero.
weight<-c(135,180,108,128,160,143,175,170,205,195,185,150,175,190,180,220)
weight
[1] 135 180 108 128 160 143 175 170 205 195 185 150 175 190 180 220
gender<-gl(2,8,labels=c("male","female"))
gender
[1] male male male male male male male male female female
[11] female female female female female female
Levels: male female
paired_data<-data.frame(weight,gender)
paired_data
weight gender
1 135 male
2 180 male
3 108 male
4 128 male
5 160 male
6 143 male
7 175 male
8 170 male
9 205 female
10 195 female
11 185 female
12 150 female
13 175 female
14 190 female
15 180 female
16 220 female
TTEST4<-t.test(weight~gender, alternative = "two.sided",paired = FALSE,data=paired_data)
TTEST4
Welch Two Sample t-test
data: weight by gender
t = -3.2304, df = 13.477, p-value = 0.006308
alternative hypothesis: true difference in means between group male and group female is not equal to 0
95 percent confidence interval:
-62.69717 -12.55283
sample estimates:
mean in group male mean in group female
149.875 187.500
In the above output, the p-value of the test is approximately 0.006308, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true difference in mean weight between males and females is not equal to zero at the 5% level of significance.
TTEST5<-t.test(weight~gender, alternative = "less",paired = FALSE,data=paired_data)
TTEST5
Welch Two Sample t-test
data: weight by gender
t = -3.2304, df = 13.477, p-value = 0.003154
alternative hypothesis: true difference in means between group male and group female is less than 0
95 percent confidence interval:
-Inf -17.05415
sample estimates:
mean in group male mean in group female
149.875 187.500
In the above output, the p-value of the test is approximately 0.003154, which is less than the significance level alpha = 0.05. We can conclude that the true difference in mean weight (male minus female) is less than zero at the 5% level of significance.
TTEST6<-t.test(weight~gender, alternative = "greater",paired = FALSE,data=paired_data)
TTEST6
Welch Two Sample t-test
data: weight by gender
t = -3.2304, df = 13.477, p-value = 0.9968
alternative hypothesis: true difference in means between group male and group female is greater than 0
95 percent confidence interval:
-58.19585 Inf
sample estimates:
mean in group male mean in group female
149.875 187.500
In the above output, the p-value of the test is approximately 0.9968, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true difference in mean weight (male minus female) is greater than zero.
Analysis of Variance (ANOVA) is a statistical approach used to compare the means (or averages) of different groups by partitioning the variability in the data. It is used in a range of scenarios to determine whether there is any difference between the means of three or more groups.
head(gapminder,10)
# A tibble: 10 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
attach(gapminder)
ANOVA<-aov(gdpPercap~continent)
summary(ANOVA)
Df Sum Sq Mean Sq F value Pr(>F)
continent 4 37990228667 9497557167 126.6 <0.0000000000000002 ***
Residuals 1699 127489276662 75037832
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 126.6 with an associated p-value of less than 2e-16, which is far below 0.05. This indicates that there exists a significant difference in mean GDP per capita across continents at the 5% level of significance.
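It can also help to inspect the group means that the ANOVA is comparing. A minimal sketch, assuming gapminder is still attached as above:
# Mean GDP per capita by continent (the group means compared by the ANOVA)
tapply(gdpPercap, continent, mean)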
library(agricolae)
TK<-TukeyHSD(ANOVA)
TK
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = gdpPercap ~ continent)
$continent
diff lwr upr p adj
Americas-Africa 4942.3558 3280.4806 6604.231 0.0000000
Asia-Africa 5708.3958 4188.6340 7228.158 0.0000000
Europe-Africa 12275.7210 10710.1623 13841.280 0.0000000
Oceania-Africa 16427.8546 11507.4034 21348.306 0.0000000
Asia-Americas 766.0401 -1044.5150 2576.595 0.7767582
Europe-Americas 7333.3652 5484.2011 9182.529 0.0000000
Oceania-Americas 11485.4989 6467.6035 16503.394 0.0000000
Europe-Asia 6567.3251 4844.7530 8289.897 0.0000000
Oceania-Asia 10719.4588 5746.8215 15692.096 0.0000000
Oceania-Europe 4152.1337 -834.6909 9138.958 0.1539474
From the results above, a significant difference in mean GDP per capita exists between every pair of continents except Asia vs Americas and Oceania vs Europe.
plot(TK)
plot(TK, las=1)
ANOVA2<-aov(lifeExp~continent)
summary(ANOVA2)
Df Sum Sq Mean Sq F value Pr(>F)
continent 4 139343 34836 408.7 <0.0000000000000002 ***
Residuals 1699 144805 85
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 408.7 with an associated p-value of less than 2e-16, which is far below 0.05. This indicates that there exists a significant difference in mean life expectancy across continents at the 5% level of significance.
TK2<-TukeyHSD(ANOVA2)
TK2
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = lifeExp ~ continent)
$continent
diff lwr upr p adj
Americas-Africa 15.793407 14.022263 17.564550 0.0000000
Asia-Africa 11.199573 9.579887 12.819259 0.0000000
Europe-Africa 23.038356 21.369862 24.706850 0.0000000
Oceania-Africa 25.460878 20.216908 30.704848 0.0000000
Asia-Americas -4.593833 -6.523432 -2.664235 0.0000000
Europe-Americas 7.244949 5.274203 9.215696 0.0000000
Oceania-Americas 9.667472 4.319650 15.015293 0.0000086
Europe-Asia 11.838783 10.002952 13.674614 0.0000000
Oceania-Asia 14.261305 8.961718 19.560892 0.0000000
Oceania-Europe 2.422522 -2.892185 7.737230 0.7250559
From the results above, a significant difference in mean life expectancy exists between every pair of continents except Oceania vs Europe.
plot(TK2)
plot(TK2, las=1)
library(haven)
MAIZE_YIELDS_two_way_ANOVA <- read_dta("D:/Data/data and other training stuff/STATA training DATA set/MAIZE YIELDS two way ANOVA.dta")
head(MAIZE_YIELDS_two_way_ANOVA,10)
# A tibble: 10 × 3
maize_yields treatment blocks
<dbl> <dbl+lbl> <dbl+lbl>
1 18.5 1 [TR 1] 1 [BLOCK 1]
2 11.7 1 [TR 1] 2 [BLOCK 2]
3 15.4 1 [TR 1] 3 [BLOCK 3]
4 16.5 1 [TR 1] 4 [BLOCK 4]
5 15.7 2 [TR 2] 1 [BLOCK 1]
6 14.2 2 [TR 2] 2 [BLOCK 2]
7 16.6 2 [TR 2] 3 [BLOCK 3]
8 18.5 2 [TR 2] 4 [BLOCK 4]
9 16.2 3 [TR 3] 1 [BLOCK 1]
10 12.9 3 [TR 3] 2 [BLOCK 2]
my_2anova<-MAIZE_YIELDS_two_way_ANOVA
str(my_2anova)
tibble [24 × 3] (S3: tbl_df/tbl/data.frame)
$ maize_yields: num [1:24] 18.5 11.7 15.4 16.5 15.7 14.2 16.6 18.5 16.2 12.9 ...
..- attr(*, "format.stata")= chr "%10.0g"
$ treatment : dbl+lbl [1:24] 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5...
..@ format.stata: chr "%10.0g"
..@ labels : Named num [1:6] 1 2 3 4 5 6
.. ..- attr(*, "names")= chr [1:6] "TR 1" "TR 2" "TR 3" "TR 4" ...
$ blocks : dbl+lbl [1:24] 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3...
..@ format.stata: chr "%10.0g"
..@ labels : Named num [1:4] 1 2 3 4
.. ..- attr(*, "names")= chr [1:4] "BLOCK 1" "BLOCK 2" "BLOCK 3" "BLOCK 4"
my_2anova
# A tibble: 24 × 3
maize_yields treatment blocks
<dbl> <dbl+lbl> <dbl+lbl>
1 18.5 1 [TR 1] 1 [BLOCK 1]
2 11.7 1 [TR 1] 2 [BLOCK 2]
3 15.4 1 [TR 1] 3 [BLOCK 3]
4 16.5 1 [TR 1] 4 [BLOCK 4]
5 15.7 2 [TR 2] 1 [BLOCK 1]
6 14.2 2 [TR 2] 2 [BLOCK 2]
7 16.6 2 [TR 2] 3 [BLOCK 3]
8 18.5 2 [TR 2] 4 [BLOCK 4]
9 16.2 3 [TR 3] 1 [BLOCK 1]
10 12.9 3 [TR 3] 2 [BLOCK 2]
# … with 14 more rows
attach(my_2anova)
my_2anova$treatment <- as.factor(my_2anova$treatment)
my_2anova$blocks <- as.factor(my_2anova$blocks)
str(my_2anova)
tibble [24 × 3] (S3: tbl_df/tbl/data.frame)
$ maize_yields: num [1:24] 18.5 11.7 15.4 16.5 15.7 14.2 16.6 18.5 16.2 12.9 ...
..- attr(*, "format.stata")= chr "%10.0g"
$ treatment : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
$ blocks : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
modell<-lm(maize_yields~treatment+blocks,data=my_2anova)
modell
Call:
lm(formula = maize_yields ~ treatment + blocks, data = my_2anova)
Coefficients:
(Intercept) treatment2 treatment3 treatment4 treatment5 treatment6
14.921 0.725 -1.200 0.575 0.675 0.900
blocks2 blocks3 blocks4
-1.467 2.767 1.117
summary(modell)
Call:
lm(formula = maize_yields ~ treatment + blocks, data = my_2anova)
Residuals:
Min 1Q Median 3Q Max
-2.5958 -1.7688 0.0292 1.2313 3.5792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.921 1.410 10.585 0.0000000235 ***
treatment2 0.725 1.628 0.445 0.6624
treatment3 -1.200 1.628 -0.737 0.4723
treatment4 0.575 1.628 0.353 0.7288
treatment5 0.675 1.628 0.415 0.6842
treatment6 0.900 1.628 0.553 0.5884
blocks2 -1.467 1.329 -1.104 0.2872
blocks3 2.767 1.329 2.082 0.0549 .
blocks4 1.117 1.329 0.840 0.4140
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.302 on 15 degrees of freedom
Multiple R-squared: 0.4681, Adjusted R-squared: 0.1843
F-statistic: 1.65 on 8 and 15 DF, p-value: 0.192
two_way<-anova(modell)
two_way
Analysis of Variance Table
Response: maize_yields
Df Sum Sq Mean Sq F value Pr(>F)
treatment 5 12.377 2.4754 0.4672 0.79478
blocks 3 57.555 19.1849 3.6208 0.03802 *
Residuals 15 79.478 5.2985
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic for treatment is 0.4672 with an associated p-value of approximately 0.795, which is greater than 0.05. This indicates that there is no significant difference in mean maize yield across treatments at the 5% level of significance. On the other hand, the F statistic for blocks is 3.6208 with an associated p-value of 0.038, which is less than 0.05. This is a clear indication that there exists a significant difference in mean maize yield across blocks at the 5% level of significance.
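The same two-way (randomized block) ANOVA table can also be obtained directly with aov(), which is equivalent to fitting the linear model and passing it to anova() as above. A minimal sketch:
# Two-way ANOVA fitted directly with aov(); should reproduce the table above
two_way_aov <- aov(maize_yields ~ treatment + blocks, data = my_2anova)
summary(two_way_aov)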
A Pearson correlation is used when the two variables we want to analyze are both quantitative. This means we will be comparing quantitative variables to find a linear relationship (if the variables have a nonlinear relationship, a Pearson correlation is not appropriate). The Pearson correlation measures the strength of the linear relationship between two variables. It takes a value between -1 and 1, with -1 meaning a perfect negative linear correlation, 0 meaning no linear correlation, and +1 meaning a perfect positive linear correlation.
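For reference, the sample Pearson correlation coefficient between two variables x and y with n paired observations is
\[ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \]
which is the quantity that cor() and cor.test() report below.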
data("gapminder")
head(gapminder,10)
# A tibble: 10 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
cor.test(gdpPercap, lifeExp, method = "pearson")
Pearson's product-moment correlation
data: gdpPercap and lifeExp
t = 29.658, df = 1702, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5515065 0.6141690
sample estimates:
cor
0.5837062
“t” is the value of the test statistic (t = 29.658). The p-value is the significance level of the test statistic (p-value < 2.2e-16). The alternative hypothesis is a character string describing the alternative hypothesis (true correlation is not equal to 0). The sample estimate is the correlation coefficient; for the Pearson correlation coefficient it is named cor (cor = 0.5837). From the results we therefore reject the null hypothesis and conclude that there is a moderate positive correlation between life expectancy and GDP per capita at the 5% level of significance.
library(haven)
Pearson <- read_sav("C:/Users/user/Downloads/Pearson.sav")
head(Pearson,10)
# A tibble: 10 × 2
mpg weight
<dbl> <dbl>
1 18 3504
2 15 3693
3 18 3436
4 16 3433
5 17 3449
6 15 4341
7 14 4354
8 14 4312
9 14 4425
10 15 3850
length(Pearson)
[1] 2
str(Pearson)
tibble [300 × 2] (S3: tbl_df/tbl/data.frame)
$ mpg : num [1:300] 18 15 18 16 17 15 14 14 14 15 ...
..- attr(*, "format.spss")= chr "F4.1"
$ weight: num [1:300] 3504 3693 3436 3433 3449 ...
..- attr(*, "format.spss")= chr "F4.0"
attach(Pearson)
cor(Pearson)
mpg weight
mpg 1.0000000 -0.8782815
weight -0.8782815 1.0000000
cor.test(Pearson$mpg, Pearson$weight, method = "pearson", data=Pearson)
Pearson's product-moment correlation
data: Pearson$mpg and Pearson$weight
t = -31.709, df = 298, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9018288 -0.8495329
sample estimates:
cor
-0.8782815
“t” is the value of the test statistic (t = -31.709). The p-value is the significance level of the test statistic (p-value < 2.2e-16). The alternative hypothesis is a character string describing the alternative hypothesis (true correlation is not equal to 0). The sample estimate is the correlation coefficient; for the Pearson correlation coefficient it is named cor (cor = -0.8783). From the results we therefore reject the null hypothesis and conclude that the true correlation is not equal to zero. Further, the correlation coefficient shows a strong negative correlation between mpg and weight at the 5% level of significance.
There is sufficient evidence to conclude that there is a significant linear relationship between mpg and weight because the correlation coefficient is significantly different from zero.
plot(Pearson$mpg, Pearson$weight)
knitr::include_graphics("C:/Users/user/Downloads/my_scatter.png")
My scatter plot showing negative correlation
library(devtools)
my_tbl <- tibble::tribble(
~ln_population, ~ln_life_exp, ~ln_gdp_cap,
6.925587075, 1.45940757, 2.891786,
6.965715868, 1.48190105, 2.914265,
7.011447073, 1.50510926, 2.931,
7.062129255, 1.53173431, 2.922309,
7.116589814, 1.55736281, 2.869221,
7.172613788, 1.58476078, 2.895485,
7.109977092, 1.60047192, 2.990344,
7.142012486, 1.61089428, 2.930641,
7.212664826, 1.61986519, 2.812473,
7.346888958, 1.62079169, 2.803007,
7.402577829, 1.62458115, 2.861376,
7.503653471, 1.64175165, 2.988818,
6.108124079, 1.74217504, 3.204407,
6.169234922, 1.77290819, 3.288313,
6.237578169, 1.81170903, 3.364155,
6.297554802, 1.82098918, 3.44094,
6.35479086, 1.83052451, 3.520277,
6.39950897, 1.83840828, 3.548144,
6.444059949, 1.84769602, 3.560012
)
require(knitr)
kable(my_tbl, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| ln_population | ln_life_exp | ln_gdp_cap |
|---|---|---|
| 6.926 | 1.459 | 2.892 |
| 6.966 | 1.482 | 2.914 |
| 7.011 | 1.505 | 2.931 |
| 7.062 | 1.532 | 2.922 |
| 7.117 | 1.557 | 2.869 |
| 7.173 | 1.585 | 2.895 |
| 7.110 | 1.600 | 2.990 |
| 7.142 | 1.611 | 2.931 |
| 7.213 | 1.620 | 2.812 |
| 7.347 | 1.621 | 2.803 |
| 7.403 | 1.625 | 2.861 |
| 7.504 | 1.642 | 2.989 |
| 6.108 | 1.742 | 3.204 |
| 6.169 | 1.773 | 3.288 |
| 6.238 | 1.812 | 3.364 |
| 6.298 | 1.821 | 3.441 |
| 6.355 | 1.831 | 3.520 |
| 6.400 | 1.838 | 3.548 |
| 6.444 | 1.848 | 3.560 |
attach(my_tbl)
my_model<-lm(ln_life_exp~ln_gdp_cap,data=my_tbl)
stargazer(my_model,report = "vc*stp",type = "text",out = "./q7results.txt")
===============================================
Dependent variable:
---------------------------
ln_life_exp
-----------------------------------------------
ln_gdp_cap 0.430***
(0.050)
t = 8.585
p = 0.00000
Constant 0.328**
(0.155)
t = 2.111
p = 0.050
-----------------------------------------------
Observations 19
R2 0.813
Adjusted R2 0.802
Residual Std. Error 0.058 (df = 17)
F Statistic 73.702*** (df = 1; 17)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
exp(0.328)
[1] 1.388189
my_tbl1 <- tibble::tribble(
~Year, ~Income, ~Consumption,
1950, 791.8, 733.2,
1951, 819, 748.7,
1952, 844.3, 771.4,
1953, 880, 802.5,
1954, 894, 822.7,
1955, 944.5, 873.8,
1956, 989.4, 899.8,
1957, 1012.1, 919.7,
1958, 1028.8, 932.9,
1959, 1067.2, 979.4,
1960, 1091.1, 1005.1,
1961, 1123.2, 1025.2,
1962, 1170.2, 1069,
1963, 1207.3, 1108.4,
1964, 1291, 1170.6,
1965, 1365.7, 1236.4,
1966, 1431.3, 1298.9,
1967, 1493.2, 1337.7,
1968, 1551.3, 1405.9,
1969, 1599.8, 1456.7,
1970, 1688.1, 1492,
1971, 1728.4, 1538.8,
1972, 1797.4, 1621.9,
1973, 1916.3, 1689.6,
1974, 1896.6, 1674,
1975, 1931.7, 1711.9,
1976, 2001, 1803.9,
1977, 2066.6, 1883.8,
1978, 2167.4, 1961,
1979, 2216.2, 2004.4
)
require(knitr)
kable(my_tbl1, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| Year | Income | Consumption |
|---|---|---|
| 1950 | 791.8 | 733.2 |
| 1951 | 819.0 | 748.7 |
| 1952 | 844.3 | 771.4 |
| 1953 | 880.0 | 802.5 |
| 1954 | 894.0 | 822.7 |
| 1955 | 944.5 | 873.8 |
| 1956 | 989.4 | 899.8 |
| 1957 | 1012.1 | 919.7 |
| 1958 | 1028.8 | 932.9 |
| 1959 | 1067.2 | 979.4 |
| 1960 | 1091.1 | 1005.1 |
| 1961 | 1123.2 | 1025.2 |
| 1962 | 1170.2 | 1069.0 |
| 1963 | 1207.3 | 1108.4 |
| 1964 | 1291.0 | 1170.6 |
| 1965 | 1365.7 | 1236.4 |
| 1966 | 1431.3 | 1298.9 |
| 1967 | 1493.2 | 1337.7 |
| 1968 | 1551.3 | 1405.9 |
| 1969 | 1599.8 | 1456.7 |
| 1970 | 1688.1 | 1492.0 |
| 1971 | 1728.4 | 1538.8 |
| 1972 | 1797.4 | 1621.9 |
| 1973 | 1916.3 | 1689.6 |
| 1974 | 1896.6 | 1674.0 |
| 1975 | 1931.7 | 1711.9 |
| 1976 | 2001.0 | 1803.9 |
| 1977 | 2066.6 | 1883.8 |
| 1978 | 2167.4 | 1961.0 |
| 1979 | 2216.2 | 2004.4 |
attach(my_tbl1)
tail(my_tbl1)
# A tibble: 6 × 3
Year Income Consumption
<dbl> <dbl> <dbl>
1 1974 1897. 1674
2 1975 1932. 1712.
3 1976 2001 1804.
4 1977 2067. 1884.
5 1978 2167. 1961
6 1979 2216. 2004.
head(my_tbl1)
# A tibble: 6 × 3
Year Income Consumption
<dbl> <dbl> <dbl>
1 1950 792. 733.
2 1951 819 749.
3 1952 844. 771.
4 1953 880 802.
5 1954 894 823.
6 1955 944. 874.
str(my_tbl)
tibble [19 × 3] (S3: tbl_df/tbl/data.frame)
$ ln_population: num [1:19] 6.93 6.97 7.01 7.06 7.12 ...
$ ln_life_exp : num [1:19] 1.46 1.48 1.51 1.53 1.56 ...
$ ln_gdp_cap : num [1:19] 2.89 2.91 2.93 2.92 2.87 ...
Statistical techniques are tools that enable us to answer questions about possible patterns in empirical data. It is not surprising, then, to learn that many important techniques of statistical analysis were developed by scientists who were interested in answering very specific empirical questions. So it was with regression analysis. The history of this particular statistical technique can be traced back to late nineteenth-century England and the pursuits of a gentleman scientist, Francis Galton. Galton was born into a wealthy family that produced more than its share of geniuses; he and Charles Darwin, the famous biologist, were first cousins. During his lifetime, Galton studied everything from fingerprint classification to meteorology, but he gained widespread recognition primarily for his work on inheritance. His most important insight came to him while he was studying the inheritance of one of the most obvious of all human characteristics: height. In order to understand how the characteristic of height was passed from one generation to the next, Galton collected data on the heights of individuals and the heights of their parents. After constructing frequency tables that classified these individuals both by their height and by the average height of their parents, Galton came to the unremarkable conclusion that tall people usually had tall parents and short people usually had short parents.
myy_model<-lm(Consumption~Income,data = my_tbl1)
stargazer(myy_model,report = "vc*stp",type = "text",out = "./q7results.txt")
===============================================
Dependent variable:
---------------------------
Consumption
-----------------------------------------------
Income 0.881***
(0.006)
t = 143.727
p = 0.000
Constant 31.935***
(9.004)
t = 3.547
p = 0.002
-----------------------------------------------
Observations 30
R2 0.999
Adjusted R2 0.999
Residual Std. Error 14.858 (df = 28)
F Statistic 20,657.590*** (df = 1; 28)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
myy_model<-lm(log(Consumption)~log(Income),data = my_tbl1)
stargazer(myy_model,report = "vc*stp",type = "text",out = "./q7results.txt")
===============================================
Dependent variable:
---------------------------
log(Consumption)
-----------------------------------------------
log(Income) 0.972***
(0.006)
t = 174.576
p = 0.000
Constant 0.107**
(0.040)
t = 2.667
p = 0.013
-----------------------------------------------
Observations 30
R2 0.999
Adjusted R2 0.999
Residual Std. Error 0.010 (df = 28)
F Statistic 30,476.920*** (df = 1; 28)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
According to Relative Income Theory (RIT), the level of consumption expenditures is not determined by the absolute level of income but by the relative level of income, with the APC declining as relative income increases. More specifically, the RIT argues that the level of consumption spending is determined by the household’s level of current income relative to the highest level of income previously earned. Symbolically, it is shown:
knitr::include_graphics("consumption.png")
Symbolic Representation of RIT
where C represents the current level of consumption expenditures, Y, the current level of income, Y^d, the highest level of income previously earned, and a and b, represent numerical constants which relate income to consumption. From the above equation, we find that when the households experience a temporary and short-run increase in current income above its previous peak level of income, it increases its consumption expenditures by an amount which is less than proportional to the increase in current income.
Consequently, when current income rises relative to peak income, the APC declines and the increase in total consumption expenditures is not proportional to the increase in total income. Again, when a household experiences current and peak income growing by the same percentage amount, it increases its consumption expenditures by an amount which is proportional to the increase in current income.
Consequently, the APC remains constant and the increase in total consumption expenditure is proportional to the increase in total income. Thus, according to the RIT, changes in current consumption are not proportional to the changes in current income only when current income increases relative to previous peak income.
If current and peak income grow together, changes in consumption are always proportional to the changes in income. However, it must be noted that RIT works for decreases as well as increases in the level of current income. The RIT is fundamentally different from AIT. The RIT explains away the short-run consumption function as a result of temporary deviations in current income, while the AIT explains away the long-run consumption function as the result of factors other than income on consumption.
From the results above, autonomous consumption (the intercept) is estimated at about 31.9, in the same units as the consumption and income series. The slope of 0.881 means that a one-unit increase in income is associated with an increase of about 0.881 units in consumption, i.e. a marginal propensity to consume of roughly 0.88. The effect is statistically significant at the 1% level of significance.
Pre<-predict(myy_model,data.frame(Income=2000))
Pre
1
7.491265
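Because myy_model was last refitted on log(Consumption) and log(Income), this prediction is on the log scale. A minimal sketch of the back-transformation (ignoring any retransformation bias correction):
# Back-transform the log-scale prediction to the level of Consumption
exp(Pre)   # approximately exp(7.49) = 1792, in the same units as Consumption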
my_tbl2021 <- tibble::tribble(
~YEAR, ~GDP, ~PIV, ~ICT, ~VAT, ~IMPTT,
1973, 15790, 3645, 1124.8, 639.8, 795.44,
1974, 18776, 4075, 1531.4, 937.26, 842.24,
1975, 21140, 4837, 1796.8, 1185.48, 983.62,
1976, 25562, 5808, 2149.4, 1308.44, 1057.18,
1977, 32699, 7800, 2846.8, 1855.26, 2083.94,
1978, 35601, 10280, 3121.4, 1995.38, 2025.48,
1979, 39543, 10809, 3437, 3098.14, 2049.64,
1980, 44648, 12451, 3951.6, 3588, 2919,
1981, 51641, 14508, 3993.4, 3895.9, 3674.24,
1982, 58214, 13364, 4624.6, 3917.5, 3305.84,
1983, 66218, 14349, 5023, 5062.38, 3424.38,
1984, 72550, 16143, 6019.4, 5471, 3043.58,
1985, 100831, 17631, 7162.4, 6065.86, 4236.8,
1986, 117472, 23064, 7714.6, 7950.4, 4934.2,
1987, 131169, 25735, 9089.6, 10399.14, 5473.72,
1988, 151194, 30359, 10240.4, 9774.92, 10240.5,
1989, 171589, 33056, 11983, 9461.94, 11983.06,
1990, 195536, 40560, 14261.6, 12140.28, 14261.56,
1991, 224232, 42671, 17027.8, 18555.4, 5118.78,
1992, 264476, 43777, 19970.4, 22142.72, 9183,
1993, 333616, 56580, 36767.2, 28994.34, 14792.78,
1994, 400700, 75616, 43505.8, 24533.86, 18598.28,
1995, 465654, 99497, 48082.3, 28403.72, 21175.68,
1996, 687998, 110142, 48375.02, 29850.08, 22594.06,
1997, 770312, 118535, 55577.9, 34468.1, 24567.06,
1998, 850808, 133366, 55234.9, 90204.76, 28443.93,
1999, 906928, 141403, 53316.99, 40944.11, 28605.16,
2000, 967838, 161714, 53428.93, 50220.92, 28803.74,
2001, 1020020, 185186, 55861.95, 50871.68, 27302,
2002, 1035370, 178466, 70140.28, 56135.25, 24936.09,
2003, 1138060, 179254, 77409.73, 58853.08, 25214,
2004, 1286460, 207196, 99312.47, 75995.66, 23531.72,
2005, 1445480, 264912, 114629.06, 79925.91, 20511.43,
2006, 1642400, 309402, 130719, 96497.01, 27927,
2007, 1833511, 309781, 165078, 111904.51, 32944.35,
2008, 2111173, 239525, 194155, 126854, 36181,
2009, 2365453, 292488, 288168, 146791, 41372,
2010, 2551161, 238938, 268291, 172360, 48903,
2011, 3725918, 213845, 290621.6, 163940.39, 48436.93,
2012, 4261370, 242365, 353693.71, 174716.17, 55397.81,
2013, 4745594, 234586, 422357.86, 218297.19, 61692.72,
2014, 5403471, 273641, 470101.98, 243156.2, 70245.12,
2015, 6284191, 293645, 540440.43, 276504.4, 75410.29,
2016, 7194163, 261548, 589921.37, 287766.52, 86329.96,
2017, 7749435, 269542, 557959.32, 309977.4, 85243.79,
2018, 7946248, 256345, 683377.33, 381419.9, 95354.98
)
require(knitr)
kable(my_tbl2021, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| YEAR | GDP | PIV | ICT | VAT | IMPTT |
|---|---|---|---|---|---|
| 1973 | 15790 | 3645 | 1124.80 | 639.80 | 795.44 |
| 1974 | 18776 | 4075 | 1531.40 | 937.26 | 842.24 |
| 1975 | 21140 | 4837 | 1796.80 | 1185.48 | 983.62 |
| 1976 | 25562 | 5808 | 2149.40 | 1308.44 | 1057.18 |
| 1977 | 32699 | 7800 | 2846.80 | 1855.26 | 2083.94 |
| 1978 | 35601 | 10280 | 3121.40 | 1995.38 | 2025.48 |
| 1979 | 39543 | 10809 | 3437.00 | 3098.14 | 2049.64 |
| 1980 | 44648 | 12451 | 3951.60 | 3588.00 | 2919.00 |
| 1981 | 51641 | 14508 | 3993.40 | 3895.90 | 3674.24 |
| 1982 | 58214 | 13364 | 4624.60 | 3917.50 | 3305.84 |
| 1983 | 66218 | 14349 | 5023.00 | 5062.38 | 3424.38 |
| 1984 | 72550 | 16143 | 6019.40 | 5471.00 | 3043.58 |
| 1985 | 100831 | 17631 | 7162.40 | 6065.86 | 4236.80 |
| 1986 | 117472 | 23064 | 7714.60 | 7950.40 | 4934.20 |
| 1987 | 131169 | 25735 | 9089.60 | 10399.14 | 5473.72 |
| 1988 | 151194 | 30359 | 10240.40 | 9774.92 | 10240.50 |
| 1989 | 171589 | 33056 | 11983.00 | 9461.94 | 11983.06 |
| 1990 | 195536 | 40560 | 14261.60 | 12140.28 | 14261.56 |
| 1991 | 224232 | 42671 | 17027.80 | 18555.40 | 5118.78 |
| 1992 | 264476 | 43777 | 19970.40 | 22142.72 | 9183.00 |
| 1993 | 333616 | 56580 | 36767.20 | 28994.34 | 14792.78 |
| 1994 | 400700 | 75616 | 43505.80 | 24533.86 | 18598.28 |
| 1995 | 465654 | 99497 | 48082.30 | 28403.72 | 21175.68 |
| 1996 | 687998 | 110142 | 48375.02 | 29850.08 | 22594.06 |
| 1997 | 770312 | 118535 | 55577.90 | 34468.10 | 24567.06 |
| 1998 | 850808 | 133366 | 55234.90 | 90204.76 | 28443.93 |
| 1999 | 906928 | 141403 | 53316.99 | 40944.11 | 28605.16 |
| 2000 | 967838 | 161714 | 53428.93 | 50220.92 | 28803.74 |
| 2001 | 1020020 | 185186 | 55861.95 | 50871.68 | 27302.00 |
| 2002 | 1035370 | 178466 | 70140.28 | 56135.25 | 24936.09 |
| 2003 | 1138060 | 179254 | 77409.73 | 58853.08 | 25214.00 |
| 2004 | 1286460 | 207196 | 99312.47 | 75995.66 | 23531.72 |
| 2005 | 1445480 | 264912 | 114629.06 | 79925.91 | 20511.43 |
| 2006 | 1642400 | 309402 | 130719.00 | 96497.01 | 27927.00 |
| 2007 | 1833511 | 309781 | 165078.00 | 111904.51 | 32944.35 |
| 2008 | 2111173 | 239525 | 194155.00 | 126854.00 | 36181.00 |
| 2009 | 2365453 | 292488 | 288168.00 | 146791.00 | 41372.00 |
| 2010 | 2551161 | 238938 | 268291.00 | 172360.00 | 48903.00 |
| 2011 | 3725918 | 213845 | 290621.60 | 163940.39 | 48436.93 |
| 2012 | 4261370 | 242365 | 353693.71 | 174716.17 | 55397.81 |
| 2013 | 4745594 | 234586 | 422357.86 | 218297.19 | 61692.72 |
| 2014 | 5403471 | 273641 | 470101.98 | 243156.20 | 70245.12 |
| 2015 | 6284191 | 293645 | 540440.43 | 276504.40 | 75410.29 |
| 2016 | 7194163 | 261548 | 589921.37 | 287766.52 | 86329.96 |
| 2017 | 7749435 | 269542 | 557959.32 | 309977.40 | 85243.79 |
| 2018 | 7946248 | 256345 | 683377.33 | 381419.90 | 95354.98 |
head(my_tbl2021,10)
# A tibble: 10 × 6
YEAR GDP PIV ICT VAT IMPTT
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1973 15790 3645 1125. 640. 795.
2 1974 18776 4075 1531. 937. 842.
3 1975 21140 4837 1797. 1185. 984.
4 1976 25562 5808 2149. 1308. 1057.
5 1977 32699 7800 2847. 1855. 2084.
6 1978 35601 10280 3121. 1995. 2025.
7 1979 39543 10809 3437 3098. 2050.
8 1980 44648 12451 3952. 3588 2919
9 1981 51641 14508 3993. 3896. 3674.
10 1982 58214 13364 4625. 3918. 3306.
summary(my_tbl2021)
YEAR GDP PIV ICT
Min. :1973 Min. : 15790 Min. : 3645 Min. : 1125
1st Qu.:1984 1st Qu.: 79620 1st Qu.: 16515 1st Qu.: 6305
Median :1996 Median : 576826 Median :104820 Median : 48229
Mean :1996 Mean :1542657 Mean :124401 Mean :128339
3rd Qu.:2007 3rd Qu.:1785733 3rd Qu.:237850 3rd Qu.:156488
Max. :2018 Max. :7946248 Max. :309781 Max. :683377
VAT IMPTT
Min. : 639.8 Min. : 795.4
1st Qu.: 5619.7 1st Qu.: 3814.9
Median : 29422.2 Median :20843.6
Mean : 75848.5 Mean :25351.1
3rd Qu.:108052.6 3rd Qu.:31909.2
Max. :381419.9 Max. :95355.0
attach(my_tbl2021)
MDL<-lm(log(PIV)~log(ICT)+log(VAT)+log(IMPTT),data=my_tbl2021)
stargazer(MDL,report = "vc*stp",type = "text",out = "./q7results.txt")
===============================================
Dependent variable:
---------------------------
log(PIV)
-----------------------------------------------
log(ICT) -0.135
(0.177)
t = -0.761
p = 0.452
log(VAT) 0.647***
(0.211)
t = 3.060
p = 0.004
log(IMPTT) 0.356**
(0.162)
t = 2.193
p = 0.034
Constant 2.532***
(0.398)
t = 6.357
p = 0.00000
-----------------------------------------------
Observations 46
R2 0.958
Adjusted R2 0.955
Residual Std. Error 0.303 (df = 42)
F Statistic 317.211*** (df = 3; 42)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
From the analysis it was found that value added tax has a significantly positive effect on the level of private investment in Kenya. This conclusion rests on the positive and statistically significant coefficient on value added tax. Other studies show that an increase in value added tax increases the level of private investment (Adejare, 2017); according to Gitahi (2010), by contrast, value added tax negatively influences the level of private investment. In the output above, value added tax has a statistically significant effect on the level of private investment, as opposed to the null hypothesis that it has no statistically significant effect; we therefore reject the null hypothesis and conclude that value added tax has a statistically significant effect on the level of private investment at the 5% level of significance. Because the model is estimated in logarithms, the coefficient of 0.647 is an elasticity: a one percent increase in value added tax is associated with approximately a 0.647 percent increase in the level of private investment.
Turning to income tax, the coefficient on log(ICT) in the output above is negative (-0.135), which is consistent with the argument that an increase in the rate of income tax constrains private investment. According to Keynesian economists, an increase in income tax reduces disposable income; a decrease in disposable income implies a decrease in saving, and whenever saving decreases, private investment decreases, because Keynes (1936) argued that in the long run savings equal investment. An increase in income tax has been found to result in a decrease in the level of private investment in a country (Gitahi, 2010). In the estimated model above, however, the p-value for income tax is 0.452, which is greater than 0.05, so we fail to reject the null hypothesis and cannot conclude that income tax has a statistically significant effect on the level of private investment at the 5% level of significance. To the extent that the Kenyan government subjects its citizens to a higher income tax rate than other countries in the region, this may accelerate capital flight to other countries and shrink the level of private investment in Kenya.
The effect of import tax on private investment was found to be positive and statistically significant. In light of the existing literature, this suggests that import tax encouraged rather than deterred private investment. Newbery (1987) and Coady (1997) argued that import taxes and barriers on international trade of any sort tend to encourage domestic production of final consumer goods while permitting relatively free imports of capital or intermediate goods. This tends to be associated with high rates of effective protection, a high cost of domestic production, and a bias against exports. Import tax charged on final consumer goods leads to the production of import substitutes, which enables the expansion of local business enterprises and an increase in the level of private investment (Gitahi, 2010). Consequently, while reducing the country's dependence on imports of final consumption goods, the economy becomes highly dependent on imports of intermediate goods. This scenario leads to the expansion of local business enterprises and consequently increases aggregate demand, which is reflected in the positive effect of import tax on private investment. In the output above, import tax is statistically significant, shown by the p-value of 0.034, which is less than 0.05, implying that we reject the null hypothesis and conclude that import tax has a statistically significant effect on the level of private investment in Kenya at the 5% level of significance.
RESID<-residuals(MDL)
mean(RESID)
[1] -0.00000000000000001027399
cov(RESID,log(ICT))
[1] 0.0000000000000000004384995
cov(RESID,log(VAT))
[1] -0.000000000000000001207078
cov(RESID,log(IMPTT))
[1] 0.00000000000000000800864
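Beyond checking that the residuals have (numerically) zero mean and zero covariance with each regressor, the standard graphical diagnostics for the fitted model can be produced with base R. A minimal sketch:
# Residuals vs fitted, normal Q-Q, scale-location and leverage plots for MDL
par(mfrow = c(2, 2))
plot(MDL)
par(mfrow = c(1, 1))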
Rehearsal_and_non_rehearsal_groups <- read_sav("D:/Training/Data/Rehearsal and non rehearsal groups.sav")
head(Rehearsal_and_non_rehearsal_groups,10)
# A tibble: 10 × 2
No.ofwordsrecalled Groups
<dbl> <dbl+lbl>
1 24 1 [Rehearsal group]
2 21 1 [Rehearsal group]
3 19 1 [Rehearsal group]
4 27 1 [Rehearsal group]
5 16 1 [Rehearsal group]
6 15 1 [Rehearsal group]
7 27 1 [Rehearsal group]
8 18 1 [Rehearsal group]
9 22 1 [Rehearsal group]
10 26 1 [Rehearsal group]
data222<-Rehearsal_and_non_rehearsal_groups
head(data222,10)
# A tibble: 10 × 2
No.ofwordsrecalled Groups
<dbl> <dbl+lbl>
1 24 1 [Rehearsal group]
2 21 1 [Rehearsal group]
3 19 1 [Rehearsal group]
4 27 1 [Rehearsal group]
5 16 1 [Rehearsal group]
6 15 1 [Rehearsal group]
7 27 1 [Rehearsal group]
8 18 1 [Rehearsal group]
9 22 1 [Rehearsal group]
10 26 1 [Rehearsal group]
attach(data222)
var.test(No.ofwordsrecalled ~ Groups, ratio = 1,
         alternative = "two.sided",
         conf.level = 0.95, data = data222)
F test to compare two variances
data: No.ofwordsrecalled by Groups
F = 1.1019, num df = 9, denom df = 9, p-value = 0.8875
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2736844 4.4360491
sample estimates:
ratio of variances
1.101852
In the above output, the p-value of the test is 0.8875, which is greater than the significance level alpha = 0.05. We fail to reject the null hypothesis of a variance ratio of 1 and conclude that there is no evidence against equal variances across the two groups.
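Since the variance ratio test gives no evidence against equal variances, a natural follow-up (not run in the original output) is the pooled-variance t-test on the same grouping. A minimal sketch:
# Pooled-variance two-sample t-test, justified by the variance test above
t.test(No.ofwordsrecalled ~ Groups, var.equal = TRUE, data = data222)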
The Kruskal-Wallis test, proposed by Kruskal and Wallis in 1952, is a nonparametric method for testing whether samples originate from the same distribution. It extends the Mann-Whitney U test to more than two groups. The null hypothesis of the Kruskal-Wallis test is that the mean ranks of the groups are the same. As the nonparametric equivalent of one-way ANOVA, the Kruskal-Wallis test is sometimes called one-way ANOVA on ranks. Unlike one-way ANOVA, the Kruskal-Wallis test does not assume a normal distribution of the underlying data, which makes it more suitable for the analysis of microbiome data: post-sequencing microbiome data are often not normally distributed and contain strong outliers, so it is more appropriate to use ranks rather than actual values to avoid the test being affected by outliers or by the non-normal distribution of the data.
The Kruskal-Wallis test is the nonparametric alternative to the one-way ANOVA. Nonparametric means that the test does not assume your data come from a particular distribution. The H test is used when the assumptions for ANOVA are not met (such as the assumption of normality). It is sometimes called the one-way ANOVA on ranks, as the ranks of the data values are used in the test rather than the actual data points.
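For reference, in the absence of ties the Kruskal-Wallis statistic is
\[ H = \frac{12}{N(N+1)}\sum_{i=1}^{k}\frac{R_i^2}{n_i} - 3(N+1) \]
where k is the number of groups, n_i and R_i are the size and rank sum of group i, and N is the total sample size; H is compared against a chi-squared distribution with k - 1 degrees of freedom.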
H_test_miles_per_gassolin <- read_sav("D:/Data/data and other training stuff/SPSS_training data sets/H-test ,miles per gassolin.sav")
data333<-H_test_miles_per_gassolin
head(data333,10)
# A tibble: 10 × 2
miles gassolin
<dbl> <dbl+lbl>
1 20 1 [A]
2 31 1 [A]
3 24 1 [A]
4 33 1 [A]
5 23 1 [A]
6 24 1 [A]
7 28 1 [A]
8 16 1 [A]
9 19 1 [A]
10 26 1 [A]
attach(data333)
gassolin<-as.factor(data333$gassolin)
kruskal.test(miles~gassolin, data=data333)
Kruskal-Wallis rank sum test
data: miles by gassolin
Kruskal-Wallis chi-squared = 0.86831, df = 2, p-value = 0.6478
In the above output, the p-value of the test is 0.6478, which is greater than the significance level alpha = 0.05. We fail to reject the null hypothesis and conclude that the median miles covered does not differ significantly across the three types of gasoline.
The paired samples Wilcoxon test (also known as Wilcoxon signed-rank test) is a non-parametric alternative to paired t-test used to compare paired data. It’s used when your data are not normally distributed. This tutorial describes how to compute paired samples Wilcoxon test in R.
before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)
after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)
myData <- data.frame(
group = rep(c("before", "after"), each = 10),
weight = c(before, after)
)
head(myData,10)
group weight
1 before 190.1
2 before 190.9
3 before 172.7
4 before 213.0
5 before 231.4
6 before 196.9
7 before 172.2
8 before 285.5
9 before 225.2
10 before 113.7
attach(myData)
wilcox.test(weight~group, data=myData)
Wilcoxon rank sum exact test
data: weight by group
W = 98, p-value = 0.0000433
alternative hypothesis: true location shift is not equal to 0
In the above output, note that because paired = TRUE was not specified, R performed a Wilcoxon rank sum (Mann-Whitney) test rather than the signed-rank test. The p-value of the test is 0.0000433 (4.33e-05), which is less than the significance level alpha = 0.05, so we conclude that the weights of the mice before treatment differ significantly from the weights after treatment.
result = wilcox.test(weight ~ group,
data = myData,
paired = TRUE,
alternative = "less")
print(result)
Wilcoxon signed rank exact test
data: weight by group
V = 55, p-value = 1
alternative hypothesis: true location shift is less than 0
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. With the groups ordered alphabetically, the paired differences are after minus before; we fail to reject the null hypothesis, so there is no evidence that the median weight after treatment is less than the median weight before treatment.
result2 = wilcox.test(weight ~ group,
data = myData,
paired = TRUE,
alternative = "greater")
print(result2)
Wilcoxon signed rank exact test
data: weight by group
V = 55, p-value = 0.0009766
alternative hypothesis: true location shift is greater than 0
In the above output, the p-value of the test is 0.0009766, which is less than the significance level alpha = 0.05. We can conclude that the median weight after treatment is significantly greater than the median weight before treatment.
weight <-c(27.6,30.6,32.2,25.3,30.9,31.0,28.9,28.9,28.9,28.2)
print(weight)
[1] 27.6 30.6 32.2 25.3 30.9 31.0 28.9 28.9 28.9 28.2
result3 = wilcox.test(weight, mu = 25)
print(result3)
Wilcoxon signed rank test with continuity correction
data: weight
V = 55, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
wilcox.test(myData$weight, mu = 25,
alternative = "less")
Wilcoxon signed rank exact test
data: myData$weight
V = 210, p-value = 1
alternative hypothesis: true location is less than 25
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We fail to reject the null hypothesis: there is no evidence that the median weight in myData is less than 25.
wilcox.test(myData$weight, mu = 25,
alternative = "greater")
Wilcoxon signed rank exact test
data: myData$weight
V = 210, p-value = 0.0000009537
alternative hypothesis: true location is greater than 25
In the above output, the p-value of the test is approximately 0.000001 (9.537e-07), which is less than the significance level alpha = 0.05. We can conclude that the median weight in myData is significantly greater than 25.
The sign test is used to test the null hypothesis that the median of a distribution is equal to some value. It can be used a) in place of a one-sample t-test b) in place of a paired t-test or c) for ordered categorical data where a numerical scale is inappropriate but where it is possible to rank the observations.
weight <-c(27.6,30.6,32.2,25.3,30.9,31.0,28.9,28.9,28.9,28.2)
print(weight)
[1] 27.6 30.6 32.2 25.3 30.9 31.0 28.9 28.9 28.9 28.2
library(BSDA)
result4 = SIGN.test(weight, md = 25)   # BSDA::SIGN.test uses md, not mu, for the hypothesized median
print(result4)
One-sample Sign-Test
data: weight
s = 10, p-value = 0.001953
alternative hypothesis: true median is not equal to 25
95 percent confidence interval:
27.79467 30.96756
sample estimates:
median of x
28.9
Achieved and Interpolated Confidence Intervals:
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.8906 28.2000 30.9000
Interpolated CI 0.9500 27.7947 30.9676
Upper Achieved CI 0.9785 27.6000 31.0000
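The sign-test p-value here is just a binomial tail probability: all s = 10 observations lie above the hypothesized median, so the two-sided p-value is twice P(X >= 10) for X ~ Binomial(10, 0.5). A minimal check:
2 * pbinom(9, size = 10, prob = 0.5, lower.tail = FALSE)   # = 0.001953125, matching the output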
before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)
after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)
myDatta<-data.frame(before,after)
head(myDatta,10)
before after
1 190.1 392.9
2 190.9 313.2
3 172.7 345.1
4 213.0 393.0
5 231.4 434.0
6 196.9 227.9
7 172.2 422.0
8 285.5 383.9
9 225.2 392.3
10 113.7 352.2
SIGN.test(before,after,md=0, alternative ="greater",paired=T, data=myDatta)
Dependent-samples Sign-Test
data: before and after
S = 0, p-value = 1
alternative hypothesis: true median difference is greater than 0
95 percent confidence interval:
-206.608 Inf
sample estimates:
median of x-y
-176.2
Achieved and Interpolated Confidence Intervals:
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.9453 -202.800 Inf
Interpolated CI 0.9500 -206.608 Inf
Upper Achieved CI 0.9893 -238.500 Inf
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We fail to reject the null hypothesis: there is no evidence that the true median difference (before minus after) is greater than zero.
SIGN.test(before,after,md=0, alternative ="less",paired=T, data=myDatta)
Dependent-samples Sign-Test
data: before and after
S = 0, p-value = 0.0009766
alternative hypothesis: true median difference is less than 0
95 percent confidence interval:
-Inf -119.7507
sample estimates:
median of x-y
-176.2
Achieved and Interpolated Confidence Intervals:
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.9453 -Inf -122.3000
Interpolated CI 0.9500 -Inf -119.7507
Upper Achieved CI 0.9893 -Inf -98.4000
In the above output, the p-value of the test is 0.0009766, which is less than the significance level alpha = 0.05. We can conclude that the true median difference (before minus after) is less than zero, i.e. the weight before is lower than the weight after.
SIGN.test(before,after,md=0, alternative ="two.sided",paired=T, data=myDatta)
Dependent-samples Sign-Test
data: before and after
S = 0, p-value = 0.001953
alternative hypothesis: true median difference is not equal to 0
95 percent confidence interval:
-226.9173 -106.1542
sample estimates:
median of x-y
-176.2
Achieved and Interpolated Confidence Intervals:
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.8906 -202.8000 -122.3000
Interpolated CI 0.9500 -226.9173 -106.1542
Upper Achieved CI 0.9785 -238.5000 -98.4000
In the above output, the p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We can conclude that the true median difference between weight before and after is not equal to zero with a p-value = 0.001953.
The Mann-Whitney U test is used to compare differences between two independent groups when the dependent variable is either ordinal or continuous, but not normally distributed. For example, you could use the Mann-Whitney U test to understand whether attitudes towards pay discrimination, where attitudes are measured on an ordinal scale, differ based on gender (i.e., your dependent variable would be “attitudes towards pay discrimination” and your independent variable would be “gender”, which has two groups: “male” and “female”). Alternately, you could use the Mann-Whitney U test to understand whether salaries, measured on a continuous scale, differed based on educational level (i.e., your dependent variable would be “salary” and your independent variable would be “educational level”, which has two groups: “high school” and “university”). The Mann-Whitney U test is often considered the nonparametric alternative to the independent t-test although this is not always the case.
Unlike the independent-samples t-test, the Mann-Whitney U test allows you to draw different conclusions about your data depending on the assumptions you make about your data’s distribution. These conclusions can range from simply stating whether the two populations differ through to determining if there are differences in medians between groups. These different conclusions hinge on the shape of the distributions of your data, which we explain more about later.
red_bulb <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8)
orange_bulb <- c(47.8, 60, 63.4, 76, 89.4, 67.3, 61.3, 62.4)
BULB_PRICE = c(red_bulb, orange_bulb)
BULB_TYPE = rep(c("red", "orange"), each = 8)
DATASET <- data.frame(BULB_TYPE, BULB_PRICE, stringsAsFactors = TRUE)
DATASET
BULB_TYPE BULB_PRICE
1 red 38.9
2 red 61.2
3 red 73.3
4 red 21.8
5 red 63.4
6 red 64.6
7 red 48.4
8 red 48.8
9 orange 47.8
10 orange 60.0
11 orange 63.4
12 orange 76.0
13 orange 89.4
14 orange 67.3
15 orange 61.3
16 orange 62.4
RSLT<-wilcox.test(BULB_PRICE~ BULB_TYPE,
data = DATASET,
exact = FALSE)
RSLT
Wilcoxon rank sum test with continuity correction
data: BULB_PRICE by BULB_TYPE
W = 44.5, p-value = 0.2072
alternative hypothesis: true location shift is not equal to 0
Here the p-value is 0.2072, which is greater than the significance level of 0.05, so we fail to reject the null hypothesis. We can therefore conclude that there is no evidence of a difference between the prices of red and orange bulbs, i.e., it is plausible that prices for both bulb types come from the same distribution.
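As a hedged cross-check using the DATASET object created above, the W statistic is the sum of the ranks of the first factor level ("orange", since factor levels are ordered alphabetically) minus n1(n1+1)/2:
rnk <- rank(DATASET$BULB_PRICE)                            # ranks of all 16 prices (ties get average ranks)
sum(rnk[DATASET$BULB_TYPE == "orange"]) - 8 * (8 + 1) / 2  # 44.5, matching the W reported by wilcox.test()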
Spearman’s rank correlation is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.
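For reference, when there are no tied ranks, Spearman’s rho can be written in terms of the rank differences \(d_i\) for each pair of observations (a standard textbook formula; with ties, as in most real data, software computes rho as the Pearson correlation of the ranks):
\[ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]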
My_data<-read.csv("C:\\Users\\user\\Downloads\\gapminder.csv")
head(My_data,10)
country continent year lifeExp pop gdpPercap
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
7 Afghanistan Asia 1982 39.854 12881816 978.0114
8 Afghanistan Asia 1987 40.822 13867957 852.3959
9 Afghanistan Asia 1992 41.674 16317921 649.3414
10 Afghanistan Asia 1997 41.763 22227415 635.3414
attach(My_data)
res <- cor.test(gdpPercap, lifeExp, method = "spearman")
res
Spearman's rank correlation rho
data: gdpPercap and lifeExp
S = 143096490, p-value < 0.00000000000000022
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8264712
S is the value of the test statistic (S = 143096490). The p-value is the significance level of the test statistic (p-value < 2.2e-16). The alternative hypothesis is a character string describing the alternative hypothesis (true rho is not equal to 0). Sample estimates is the correlation coefficient; for the Spearman correlation coefficient it is named rho (rho = 0.8265). From the results we reject the null hypothesis and conclude that life expectancy and GDP per capita are dependent, and therefore that the true rho is not equal to 0.
A matrix is a set of numbers arranged in rows and columns so as to form a rectangular array. The numbers are called the elements, or entries, of the matrix. Matrices have wide applications in engineering, physics, economics, and statistics, as well as in various branches of mathematics. Matrices also have important applications in computer graphics, where they are used to represent rotations and other transformations of images.
A<-matrix(c(4,19,16,21,14,8,10,18,14),nrow=3,ncol=3,byrow=T)
A
[,1] [,2] [,3]
[1,] 4 19 16
[2,] 21 14 8
[3,] 10 18 14
The determinant is a special number that can be calculated from a matrix. The matrix has to be square (same number of rows and columns) like this one:
Mat<-matrix(c(3,8,4,6),nrow=2,byrow = T)
Mat
[,1] [,2]
[1,] 3 8
[2,] 4 6
\[ (3 \times 6) - (4 \times 8) = 18 - 32 = -14 \]
det(Mat)
[1] -14
mtrx <- matrix( c(-1, -3, -5, -7, -8, 9, 11, 13, 15),nrow = 3,byrow = T)
mtrx
[,1] [,2] [,3]
[1,] -1 -3 -5
[2,] -7 -8 9
[3,] 11 13 15
det(mtrx)
[1] -360
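As an illustrative sketch of what det() computes for a 3x3 matrix, the determinant can be obtained by cofactor expansion along the first row; the helper function below is purely for illustration (it is not how R computes determinants internally).
det3 <- function(M) {                                   # cofactor expansion along row 1
  M[1, 1] * (M[2, 2] * M[3, 3] - M[2, 3] * M[3, 2]) -
  M[1, 2] * (M[2, 1] * M[3, 3] - M[2, 3] * M[3, 1]) +
  M[1, 3] * (M[2, 1] * M[3, 2] - M[2, 2] * M[3, 1])
}
det3(mtrx)                                              # -360, matching det(mtrx)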
A<-matrix(c(4,19,16,21,14,8,10,18,14),nrow=3,ncol=3,byrow=T)
A
[,1] [,2] [,3]
[1,] 4 19 16
[2,] 21 14 8
[3,] 10 18 14
B<-matrix(c(161,191,123,165,125,155,100,200,155),nrow=3,ncol=3,byrow=T)
B
[,1] [,2] [,3]
[1,] 161 191 123
[2,] 165 125 155
[3,] 100 200 155
solve(B)
[,1] [,2] [,3]
[1,] 0.009121582 0.003927184 -0.011165601
[2,] 0.007905371 -0.009929774 0.003656479
[3,] -0.016085370 0.010278944 0.008937189
zapsmall(B%*%solve(B))
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
A+B
[,1] [,2] [,3]
[1,] 165 210 139
[2,] 186 139 163
[3,] 110 218 169
B-A
[,1] [,2] [,3]
[1,] 157 172 107
[2,] 144 111 147
[3,] 90 182 141
A%*%B
[,1] [,2] [,3]
[1,] 5379 6339 5917
[2,] 6491 7361 5993
[3,] 5980 6960 6190
B/A
[,1] [,2] [,3]
[1,] 40.250000 10.052632 7.68750
[2,] 7.857143 8.928571 19.37500
[3,] 10.000000 11.111111 11.07143
M1<-matrix(c(16,121,166,121,129,177,163,135,198),nrow=3,ncol=3,byrow=T)
M1
[,1] [,2] [,3]
[1,] 16 121 166
[2,] 121 129 177
[3,] 163 135 198
M2<-matrix(c(100,170,188,131,175,178,151,199,502),nrow=3,ncol=3,byrow=T)
M2
[,1] [,2] [,3]
[1,] 100 170 188
[2,] 131 175 178
[3,] 151 199 502
S<-solve(A)
S
[,1] [,2] [,3]
[1,] -1.04 -0.44 1.44
[2,] 4.28 2.08 -6.08
[3,] -4.76 -2.36 6.86
solve(B)
[,1] [,2] [,3]
[1,] 0.009121582 0.003927184 -0.011165601
[2,] 0.007905371 -0.009929774 0.003656479
[3,] -0.016085370 0.010278944 0.008937189
cbind(M1,M2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 16 121 166 100 170 188
[2,] 121 129 177 131 175 178
[3,] 163 135 198 151 199 502
rbind(M1,M2)
[,1] [,2] [,3]
[1,] 16 121 166
[2,] 121 129 177
[3,] 163 135 198
[4,] 100 170 188
[5,] 131 175 178
[6,] 151 199 502
zapsmall(A%*%S)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
S<-matrix(c(2.879,10.002,-1.810,10.002,199.798,-5.627,-1.810,-5.627,3.628),nrow=3)
S
[,1] [,2] [,3]
[1,] 2.879 10.002 -1.810
[2,] 10.002 199.798 -5.627
[3,] -1.810 -5.627 3.628
S_inv<-solve(S)
S_inv
[,1] [,2] [,3]
[1,] 0.58648233 -0.022083814 0.258342724
[2,] -0.02208381 0.006065228 -0.001610437
[3,] 0.25834272 -0.001610437 0.402022712
zapsmall(S%*%S_inv)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
R<-matrix(c(4,3,4,5,8,4,7,8,2,5,3,8,9,5,5,6),nrow = 4)
R
[,1] [,2] [,3] [,4]
[1,] 4 8 2 9
[2,] 3 4 5 5
[3,] 4 7 3 5
[4,] 5 8 8 6
det(R)
[1] -71
solve(R)
[,1] [,2] [,3] [,4]
[1,] -0.87323944 1.5211268 1.6901408 -1.36619718
[2,] 0.33802817 -0.9436620 -0.4929577 0.69014085
[3,] 0.07042254 -0.1549296 -0.3943662 0.35211268
[4,] 0.18309859 0.1971831 -0.2253521 -0.08450704
zapsmall(R%*%solve(R))
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
my_mat<-matrix(c(25,22,18,26,12,16,24,20,26,28,30,28,12,
17,16,17,10,15,16,20,14,19,17,14),ncol=3,nrow = 8)
my_mat
[,1] [,2] [,3]
[1,] 25 26 10
[2,] 22 28 15
[3,] 18 30 16
[4,] 26 28 20
[5,] 12 12 14
[6,] 16 17 19
[7,] 24 16 17
[8,] 20 17 14
colnames(my_mat) <- c("sales", "profit","quantity")
rownames(my_mat)<- c("John","Chang","Michael","Bryan","Jose","McTon","Jayden","Jane")
my_mat
sales profit quantity
John 25 26 10
Chang 22 28 15
Michael 18 30 16
Bryan 26 28 20
Jose 12 12 14
McTon 16 17 19
Jayden 24 16 17
Jane 20 17 14
my_mat<-as.data.frame(my_mat)
my_mat
sales profit quantity
John 25 26 10
Chang 22 28 15
Michael 18 30 16
Bryan 26 28 20
Jose 12 12 14
McTon 16 17 19
Jayden 24 16 17
Jane 20 17 14
Flour<-c(230,181,165,150,97,182,191,199,172,170)
Cooking_oil<-c(125,99,97,115,120,100,80,90,95,125)
Kerosene<-c(200,55,105,85,125,150,85,120,110,130)
Rice<-c(109,107,98,71,82,103,111,93,86,78)
data_frame<-data.frame(Flour,Cooking_oil,Kerosene,Rice)
data_frame
Flour Cooking_oil Kerosene Rice
1 230 125 200 109
2 181 99 55 107
3 165 97 105 98
4 150 115 85 71
5 97 120 125 82
6 182 100 150 103
7 191 80 85 111
8 199 90 120 93
9 172 95 110 86
10 170 125 130 78
cov(data_frame)
Flour Cooking_oil Kerosene Rice
Flour 1196.4556 -127.3556 469.94444 309.60000
Cooking_oil -127.3556 244.2667 316.22222 -101.75556
Kerosene 469.9444 316.2222 1589.16667 69.77778
Rice 309.6000 -101.7556 69.77778 197.06667
var(data_frame)
Flour Cooking_oil Kerosene Rice
Flour 1196.4556 -127.3556 469.94444 309.60000
Cooking_oil -127.3556 244.2667 316.22222 -101.75556
Kerosene 469.9444 316.2222 1589.16667 69.77778
Rice 309.6000 -101.7556 69.77778 197.06667
mean(data_frame$Flour)
[1] 173.7
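Note that var() applied to a data frame returns the same covariance matrix as cov(), which is why the two outputs above agree. As a hedged cross-check of a single entry, using the Flour and Rice vectors defined above:
# sample covariance of Flour and Rice computed by hand
sum((Flour - mean(Flour)) * (Rice - mean(Rice))) / (length(Flour) - 1)   # 309.6, matching cov(data_frame)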
income<-c(25,26,32,31,25,26,24,36,21,26,21,23,25)
expenditure<-c(15,16,14,12,20,16,19,18,14,13,19,18,17)
investment<-c(10,12,14,12,12,14,13,16,21,14,15,19,17)
data.frame(income,expenditure,investment)
income expenditure investment
1 25 15 10
2 26 16 12
3 32 14 14
4 31 12 12
5 25 20 12
6 26 16 14
7 24 19 13
8 36 18 16
9 21 14 21
10 26 13 14
11 21 19 15
12 23 18 19
13 25 17 17
Regression_model=lm(expenditure~income+investment)
summary(Regression_model)
Call:
lm(formula = expenditure ~ income + investment)
Residuals:
Min 1Q Median 3Q Max
-3.5335 -1.4506 -0.2696 1.9866 3.5648
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.284694 6.894461 2.942 0.0147 *
income -0.150291 0.183800 -0.818 0.4326
investment -0.007682 0.259844 -0.030 0.9770
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.67 on 10 degrees of freedom
Multiple R-squared: 0.06582, Adjusted R-squared: -0.121
F-statistic: 0.3523 on 2 and 10 DF, p-value: 0.7115
library(stargazer)
stargazer(Regression_model, type="text")
===============================================
Dependent variable:
---------------------------
expenditure
-----------------------------------------------
income -0.150
(0.184)
investment -0.008
(0.260)
Constant 20.285**
(6.894)
-----------------------------------------------
Observations 13
R2 0.066
Adjusted R2 -0.121
Residual Std. Error 2.670 (df = 10)
F Statistic 0.352 (df = 2; 10)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
In many cancer studies, the main outcome under assessment is the time to an event of interest. The generic name for this time is survival time, although it may be applied equally to the time ‘survived’ from complete remission to relapse or progression as to the time from diagnosis to death. If the event occurred in all individuals, many methods of analysis would be applicable. However, it is usual that at the end of follow-up some of the individuals have not had the event of interest, and thus their true time to event is unknown. Further, survival data are rarely Normally distributed, but are skewed, typically comprising many early events and relatively few late ones. It is these features of the data that make the special methods called survival analysis necessary.
Most survival analyses in cancer journals use some or all of Kaplan–Meier (KM) plots, logrank tests, and Cox (proportional hazards) regression. This section discusses the background to, and interpretation of, each of these methods, as well as other approaches that deserve to be used more often. We first present the basic concepts of survival analysis, including how to produce and interpret survival curves and how to quantify and test survival differences between two or more groups of patients; multivariate analysis with the Cox model follows. More detailed accounts of these methods can be found in books written specifically about survival analysis, for example, Collett (1994), Parmar and Machin (1995) and Kleinbaum (1996). Several introductory texts also describe the basics of survival analysis, for example, Altman (2003) and Piantadosi (1997).
In other words, survival analysis is a branch of statistics for analyzing the expected duration of time until an event happens, such as death in biological organisms or failure in mechanical systems. Today, I have gone through this topic and found it very interesting: one can use it to predict the survival of cancer patients, churn rate, students’ dropout rate, or machine failure.
library(survival)
data(cancer, package="survival")
my_data<-ovarian
head(my_data,10)
futime fustat age resid.ds rx ecog.ps
1 59 1 72.3315 2 1 1
2 115 1 74.4932 2 1 1
3 156 1 66.4658 2 1 2
4 421 0 53.3644 2 2 1
5 431 1 50.3397 2 1 1
6 448 0 56.4301 1 1 2
7 464 1 56.9370 2 2 2
8 475 1 59.8548 2 2 2
9 477 0 64.1753 2 1 1
10 563 1 55.1781 1 2 2
tail(my_data,10)
futime fustat age resid.ds rx ecog.ps
17 1040 0 38.8932 2 1 2
18 1106 0 44.6000 1 1 1
19 1129 0 53.9068 1 2 1
20 1206 0 44.2055 2 2 1
21 1227 0 59.5890 1 2 2
22 268 1 74.5041 2 1 2
23 329 1 43.1370 2 1 1
24 353 1 63.2192 1 2 2
25 365 1 64.4247 2 2 1
26 377 0 58.3096 1 2 1
fustat: 1 = the event occurred (death); 0 = no event (the observation is censored). When fitting a model, make sure you know what 0 and 1 code for. Here the time measured is survival after diagnosis.
futime: survival or censoring time
fustat: censoring status
age: in years
resid.ds: residual disease present (1=no,2=yes)
rx: treatment group
ecog.ps: ECOG performance status (1 is better, see reference)
A patient is scored as censored if he or she did not suffer the outcome of interest. In survival analysis, patients who do not have an “event” during the specified period are said to have censored observations.
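As a small illustration of how censoring is encoded (using three follow-up times taken from the data shown above; Surv() comes from the survival package loaded earlier):
Surv(c(59, 115, 421), c(1, 1, 0))   # prints 59 115 421+ ; the "+" marks the censored observation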
names(my_data)
[1] "futime" "fustat" "age" "resid.ds" "rx" "ecog.ps"
attach(my_data)
The Kaplan-Meier estimate is one of the best options for measuring the fraction of subjects living for a certain amount of time after treatment. In clinical trials or community trials, the effect of an intervention is assessed by measuring the number of subjects who survived, or were saved, after that intervention over a period of time. The time starting from a defined point to the occurrence of a given event, for example death, is called the survival time, and the analysis of such grouped data is called survival analysis. The analysis can be affected by subjects who are uncooperative and refuse to remain in the study, by subjects who do not experience the event before the end of the study although they would have experienced it (or died) had observation continued, or by subjects we lose touch with midway through the study. We label these situations censored observations. The Kaplan-Meier estimate is the simplest way of computing survival over time in spite of all these difficulties associated with subjects or situations. The survival curve can be created under various assumptions. It involves computing the probability of the event occurring at a certain point in time and multiplying these successive probabilities by any earlier computed probabilities to get the final estimate. This can be calculated for two groups of subjects, along with the statistical difference in their survival. This approach is used, for example, in Ayurveda research when comparing two drugs and examining the survival of subjects.
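Formally, the Kaplan-Meier estimate of the survival function multiplies these conditional survival probabilities over the observed event times (standard definition, stated here for reference):
\[ \hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right) \]
where \(d_i\) is the number of events at time \(t_i\) and \(n_i\) is the number of subjects still at risk just before \(t_i\).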
km.model<-survfit(Surv(futime,fustat)~1,
type="kaplan-meier")
Since we have no x-variable, we put a 1 on the right-hand side of the formula: it is just survival, with no variable to relate it to. In the command above we specified the type explicitly; however, if we do not, survfit() fits the Kaplan-Meier model by default.
km.model
Call: survfit(formula = Surv(futime, fustat) ~ 1, type = "kaplan-meier")
n events median 0.95LCL 0.95UCL
[1,] 26 12 638 464 NA
This gives the MEDIAN survival time and a confidence interval (CI) for it. From the results above, we have 26 individuals and 12 events. The estimated median survival time is 638 days, i.e., the time at which the estimated survival probability drops to 50%, so roughly half of the patients are expected to survive beyond 638 days.
summary(km.model)
Call: survfit(formula = Surv(futime, fustat) ~ 1, type = "kaplan-meier")
time n.risk n.event survival std.err lower 95% CI upper 95% CI
59 26 1 0.962 0.0377 0.890 1.000
115 25 1 0.923 0.0523 0.826 1.000
156 24 1 0.885 0.0627 0.770 1.000
268 23 1 0.846 0.0708 0.718 0.997
329 22 1 0.808 0.0773 0.670 0.974
353 21 1 0.769 0.0826 0.623 0.949
365 20 1 0.731 0.0870 0.579 0.923
431 17 1 0.688 0.0919 0.529 0.894
464 15 1 0.642 0.0965 0.478 0.862
475 14 1 0.596 0.0999 0.429 0.828
563 12 1 0.546 0.1032 0.377 0.791
638 11 1 0.497 0.1051 0.328 0.752
From the summary table above, at time 59 days, 26 individuals were at risk and one event occurred; in other words, an individual died. Beyond 59 days, the estimated probability of an individual surviving is 96.2%. Further, we are 95% confident that the chance of survival beyond 59 days is between 89% and 100%. At time 115 days, 25 individuals were at risk and one died. The probability of survival beyond 115 days is 92.3%, with a standard error of 0.0523, and we are 95% confident that the chance of survival beyond 115 days is between 82.6% and 100%.
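As a hedged check of the first two rows of the summary table, each survival estimate multiplies the previous one by (1 - events/at risk):
(1 - 1/26)                # 0.962: estimated survival just beyond 59 days
(1 - 1/26) * (1 - 1/25)   # 0.923: estimated survival just beyond 115 days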
plot(km.model,conf.int = F,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model")
In the command above, we set conf.int = F because we do not want the confidence interval in the plot. Consider the plot below with the confidence interval included.
plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1)
From the graph above, you can see why the median survival time is 638 days: it is where the survival curve crosses 0.5, highlighted by the red reference line added in the next plot.
plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1)
abline(h=0.5,col="red")
Remember, we can also include a ‘tick’ wherever there is a censored observation by using the mark.time argument.
plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1, mark.time = T)
The graph above shows every point where there was a censored observation.
cancer<-read.csv("C:\\Users\\user\\Downloads\\cancer.csv")
head(cancer,10)
futime fustat age resid.ds rx ecog.ps over50 gender
1 59 1 72.3315 2 1 1 1 1
2 115 1 74.4932 2 1 1 1 0
3 156 1 66.4658 2 1 2 1 0
4 421 0 53.3644 2 2 1 1 0
5 431 1 50.3397 2 1 1 1 0
6 448 0 56.4301 1 1 2 1 1
7 464 1 56.9370 2 2 2 1 0
8 475 1 59.8548 2 2 2 1 0
9 477 0 64.1753 2 1 1 1 1
10 563 1 55.1781 1 2 2 1 1
tail(cancer,10)
futime fustat age resid.ds rx ecog.ps over50 gender
17 1040 0 38.8932 2 1 2 0 1
18 1106 0 44.6000 1 1 1 0 0
19 1129 0 53.9068 1 2 1 1 1
20 1206 0 44.2055 2 2 1 0 0
21 1227 0 59.5890 1 2 2 1 0
22 268 1 74.5041 2 1 2 1 1
23 329 1 43.1370 2 1 1 0 1
24 353 1 63.2192 1 2 2 1 0
25 365 1 64.4247 2 2 1 1 0
26 377 0 58.3096 1 2 1 1 1
km.model2<-survfit(Surv(futime,fustat)~over50,
type="kaplan-meier", data = cancer)
plot(km.model2,conf.int = F,xlab = "Time(days)",
ylab="%Alive=S(t)",main="K-M Model: Survival Analysis", las=1)
library(survminer)
ggsurvplot(km.model2,data = cancer, pval = TRUE)
plot(km.model2, conf.int = F,xlab = "Time(days)",ylab = "Survival Probability",main="K-M model: Survival Analysis", col = c("red","blue"),las=1,lwd=2,mark.time = T)
## legend
legend(100, 0.2, legend = c("below50","above50"),lty = 1,lwd = 2,col=c("red","blue"),bty = "n",cex = 0.6)
km.model2
Call: survfit(formula = Surv(futime, fustat) ~ over50, data = cancer,
type = "kaplan-meier")
n events median 0.95LCL 0.95UCL
over50=0 6 1 NA NA NA
over50=1 20 11 563 431 NA
From the summary statistics above, six individuals were under 50 years of age, among whom only one event occurred. On the other hand, 20 individuals at risk were over 50 years, among whom 11 events occurred. This suggests a relationship between age and survival.
summary(km.model2)
Call: survfit(formula = Surv(futime, fustat) ~ over50, data = cancer,
type = "kaplan-meier")
over50=0
time n.risk n.event survival std.err lower 95% CI upper 95% CI
329.000 6.000 1.000 0.833 0.152 0.583 1.000
over50=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
59 20 1 0.950 0.0487 0.859 1.000
115 19 1 0.900 0.0671 0.778 1.000
156 18 1 0.850 0.0798 0.707 1.000
268 17 1 0.800 0.0894 0.643 0.996
353 16 1 0.750 0.0968 0.582 0.966
365 15 1 0.700 0.1025 0.525 0.933
431 12 1 0.642 0.1093 0.460 0.896
464 10 1 0.577 0.1157 0.390 0.855
475 9 1 0.513 0.1193 0.326 0.809
563 7 1 0.440 0.1227 0.255 0.760
638 6 1 0.367 0.1222 0.191 0.705
From the summary table above, at time 329 days, 6 individuals below 50 years were at risk and one event occurred; in other words, an individual died. Beyond 329 days, the estimated probability of an individual below 50 years surviving is 83.3%, and we are 95% confident that the chance of survival beyond 329 days for individuals below 50 years is between 58.3% and 100%. On the other hand, at time 59 days, 20 individuals above 50 years were at risk and one event occurred; in other words, an individual died. Beyond 59 days, the estimated probability of an individual above 50 years surviving is 95%, and we are 95% confident that the chance of survival beyond 59 days for individuals above 50 years is between 85.9% and 100%. It is important to note that the tick marks in the plot above show censored observations.
From the graph, survival appears to differ across the two age categories: individuals below 50 years seem to have a higher survival probability than individuals above 50 years. Whether this difference is statistically significant is tested below with the logrank test.
survdiff(Surv(futime,fustat)~over50, data = cancer)
Call:
survdiff(formula = Surv(futime, fustat) ~ over50, data = cancer)
N Observed Expected (O-E)^2/E (O-E)^2/V
over50=0 6 1 3.6 1.876 2.75
over50=1 20 11 8.4 0.804 2.75
Chisq= 2.7 on 1 degrees of freedom, p= 0.1
From the inferential results above, there is no evidence to reject the null hypothesis. In other words, there is not statistically sufficient evidence to conclude that survival in the two groups differs, as shown by the p-value of 0.1, which is greater than 0.05.
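As a hedged check, the logrank statistic is referred to a chi-squared distribution with 1 degree of freedom:
1 - pchisq(2.7, df = 1)   # ~0.10, matching the reported p-value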
The Cox proportional-hazards model (Cox, 1972) is essentially a regression model commonly used in medical research for investigating the association between the survival time of patients and one or more predictor variables.
In the previous chapter (survival analysis basics), we described the basic concepts of survival analysis and methods for analyzing and summarizing survival data, including Kaplan-Meier curves and the logrank test.
The above-mentioned methods - Kaplan-Meier curves and logrank tests - are examples of univariate analysis. They describe survival according to one factor under investigation, but ignore the impact of any others. Additionally, Kaplan-Meier curves and logrank tests are useful only when the predictor variable is categorical (e.g., treatment A vs treatment B; males vs females). They do not work easily for quantitative predictors such as gene expression, weight, or age. An alternative method is Cox proportional hazards regression, which works for both quantitative and categorical predictor variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time.
The Cox proportional hazards model is called a semi-parametric model because it makes no assumptions about the shape of the baseline hazard function. There are, however, other assumptions (i.e., independence of observations, changes in predictors produce proportional changes in the hazard regardless of time, and a linear association between the natural logarithm of the relative hazard and the predictors). There are other regression models used in survival analysis that assume specific distributions for the survival times, such as the exponential, Weibull, Gompertz and log-normal distributions. The exponential regression survival model, for example, assumes that the hazard function is constant. Other distributions assume that the hazard is increasing over time, decreasing over time, or increasing initially and then decreasing. The examples below illustrate estimation of a Cox proportional hazards regression model and discuss the interpretation of the regression coefficients.
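In symbols, the Cox model writes the hazard for an individual with covariates \(x_1, \ldots, x_p\) as a baseline hazard scaled by an exponential function of the covariates (standard form, included here for reference):
\[ h(t \mid x) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p) \]
Because the baseline hazard \(h_0(t)\) cancels when comparing two individuals, each \(\exp(\beta_j)\) is interpreted as a hazard ratio.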
data("lung")
head(lung)
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1 3 306 2 74 1 1 90 100 1175 NA
2 3 455 2 68 1 0 90 90 1225 15
3 3 1010 1 56 1 0 90 90 NA 15
4 5 210 2 57 1 1 90 60 1150 11
5 1 883 2 60 1 0 100 90 NA 0
6 12 1022 1 74 1 1 50 80 513 0
tail(lung,5)
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
224 1 188 1 77 1 1 80 60 NA 3
225 13 191 1 39 1 0 90 90 2350 -5
226 32 105 1 75 2 2 60 70 1025 5
227 6 174 1 66 1 1 90 100 1075 1
228 22 177 1 58 2 1 80 90 1060 0
names(lung)
[1] "inst" "time" "status" "age" "sex" "ph.ecog"
[7] "ph.karno" "pat.karno" "meal.cal" "wt.loss"
The data set above is built into the survival package and could be used to run a Cox proportional hazards model; however, in this case we are going to use the data below.
cancer<-read.csv("C:\\Users\\user\\Downloads\\cancer.csv")
head(cancer,10)
futime fustat age resid.ds rx ecog.ps over50 gender
1 59 1 72.3315 2 1 1 1 1
2 115 1 74.4932 2 1 1 1 0
3 156 1 66.4658 2 1 2 1 0
4 421 0 53.3644 2 2 1 1 0
5 431 1 50.3397 2 1 1 1 0
6 448 0 56.4301 1 1 2 1 1
7 464 1 56.9370 2 2 2 1 0
8 475 1 59.8548 2 2 2 1 0
9 477 0 64.1753 2 1 1 1 1
10 563 1 55.1781 1 2 2 1 1
cancer$over50<-as.factor(cancer$over50)
cancer$ecog.ps<-as.factor(cancer$ecog.ps)
attach(cancer)
cox.model<-coxph(Surv(futime,fustat)~over50+ecog.ps)
summary(cox.model)
Call:
coxph(formula = Surv(futime, fustat) ~ over50 + ecog.ps)
n= 26, number of events= 12
coef exp(coef) se(coef) z Pr(>|z|)
over501 1.5356 4.6442 1.0541 1.457 0.145
ecog.ps2 0.2628 1.3006 0.5893 0.446 0.656
exp(coef) exp(-coef) lower .95 upper .95
over501 4.644 0.2153 0.5884 36.655
ecog.ps2 1.301 0.7689 0.4098 4.128
Concordance= 0.603 (se = 0.081 )
Likelihood ratio test= 3.64 on 2 df, p=0.2
Wald test = 2.46 on 2 df, p=0.3
Score (logrank) test = 2.95 on 2 df, p=0.2
Remember, the Cox proportional hazards model does not estimate an intercept coefficient; the intercept is absorbed into the baseline hazard.
At any given time, someone who is above 50 years is 4.6442 times as likely to die as someone who is under 50 years, adjusting for ECOG performance status. ECOG performance status describes a patient’s level of functioning in terms of their ability to care for themselves, daily activity, and physical ability (walking, working, etc.). Researchers worldwide consider the ECOG Performance Status Scale when planning cancer clinical trials to study new treatments. The quantity (4.6442 - 1) indicates that someone who is over 50 years has a hazard of death 364.42% higher than someone who is under 50 years, adjusted for ECOG performance status. In other words, if we pick two individuals with the same ECOG performance status, the one above 50 years is estimated to be 4.64 times as likely to die as the one below 50 years. Conversely, someone who is below 50 years is 0.2153 times as likely to die as someone who is above 50 years, adjusting for ECOG performance status.
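As a hedged check, the hazard ratios reported under exp(coef) are simply the exponentiated regression coefficients:
exp(1.5356)   # ~4.644, the exp(coef) reported for over501
exp(0.2628)   # ~1.301, the exp(coef) reported for ecog.ps2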
The likelihood ratio, Wald, and score (logrank) tests at the bottom of the output assess overall model significance. The concordance statistic is a goodness-of-fit measure for survival analysis; it is analogous to the area under the curve in logistic regression.
cox.model<-coxph(Surv(futime,fustat)~over50+ecog.ps)
cox.model2<-coxph(Surv(futime,fustat)~over50)
The test below aims to establish whether dropping ECOG performance status significantly worsens the model. Consider the command below.
anova(cox.model2,cox.model,test = "LRT")
Analysis of Deviance Table
Cox model: response is Surv(futime, fustat)
Model 1: ~ over50
Model 2: ~ over50 + ecog.ps
loglik Chisq Df Pr(>|Chi|)
1 -33.266
2 -33.166 0.2013 1 0.6537
From the results above (p-value > 0.05), we can drop ECOG performance status without a significant loss of predictive power. In other words, ECOG performance status is not very important in this model.
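As a hedged check of where the table comes from, the LRT statistic is twice the difference in log-likelihoods, referred to a chi-squared distribution with 1 degree of freedom:
2 * (-33.166 - (-33.266))    # ~0.20, matching Chisq above (up to rounding)
1 - pchisq(0.2013, df = 1)   # ~0.654, matching Pr(>|Chi|)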
data("lung")
head(lung)
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1 3 306 2 74 1 1 90 100 1175 NA
2 3 455 2 68 1 0 90 90 1225 15
3 3 1010 1 56 1 0 90 90 NA 15
4 5 210 2 57 1 1 90 60 1150 11
5 1 883 2 60 1 0 100 90 NA 0
6 12 1022 1 74 1 1 50 80 513 0
write.csv(lung, "C:\\Users\\user\\Downloads\\lung.csv", row.names=FALSE)
lung_data<-read.csv("C:\\Users\\user\\Downloads\\lung_data.csv")
attach(lung_data)
head(lung_data,10)
inst time status age over50 sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1 3 306 1 74 1 0 1 90 100 1175 NA
2 3 455 1 68 1 0 0 90 90 1225 15
3 3 1010 0 56 1 0 0 90 90 NA 15
4 5 210 1 57 1 0 1 90 60 1150 11
5 1 883 1 60 1 0 0 100 90 NA 0
6 12 1022 0 74 1 0 1 50 80 513 0
7 7 310 1 68 1 1 2 70 60 384 10
8 11 361 1 71 1 1 2 60 80 538 1
9 1 218 1 53 1 0 1 70 80 825 16
10 7 166 1 61 1 0 2 70 70 271 34
cox.model3<-coxph(Surv(time,status)~over50+sex)
summary(cox.model3)
Call:
coxph(formula = Surv(time, status) ~ over50 + sex)
n= 228, number of events= 165
coef exp(coef) se(coef) z Pr(>|z|)
over50 0.2803 1.3236 0.3003 0.933 0.35057
sex -0.5270 0.5904 0.1672 -3.151 0.00162 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
over50 1.3236 0.7555 0.7347 2.3844
sex 0.5904 1.6938 0.4254 0.8193
Concordance= 0.585 (se = 0.022 )
Likelihood ratio test= 11.58 on 2 df, p=0.003
Wald test = 10.94 on 2 df, p=0.004
Score (logrank) test = 11.19 on 2 df, p=0.004
At any given time, someone who is above 50 years is 1.3236 times as likely to die as someone who is under 50 years, adjusting for gender. The quantity (1.3236 - 1) indicates that someone who is over 50 years has a hazard of death 32.36% higher than someone who is under 50 years, adjusted for gender. In other words, if we pick two individuals of the same gender, the one above 50 years is 32.36% more likely to die than the one below 50 years. Additionally, someone who is below 50 years is 0.7555 times as likely to die as someone who is above 50 years, adjusting for gender.
Again, the likelihood ratio, Wald, and score (logrank) tests assess overall model significance, and the concordance statistic is a goodness-of-fit measure analogous to the area under the curve in logistic regression.
cox.model3<-coxph(Surv(time,status)~over50+sex)
cox.model4<-coxph(Surv(time,status)~over50)
The test below aims to establish whether dropping gender significantly worsens the model. Consider the command below.
anova(cox.model4,cox.model3,test = "LRT")
Analysis of Deviance Table
Cox model: response is Surv(time, status)
Model 1: ~ over50
Model 2: ~ over50 + sex
loglik Chisq Df Pr(>|Chi|)
1 -749.35
2 -744.12 10.464 1 0.001217 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above (p-value < 0.05), we cannot drop the X-variable sex from the model; doing so would lead to a loss of predictive power. In other words, the X-variable sex is very important in this model.
Now that we have worked with categorical X-variables, let us add numeric X-variables to the model.
cox.model5<-coxph(Surv(time,status)~age+wt.loss)
summary(cox.model5)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.021697 1.021934 0.009627 2.254 0.0242 *
wt.loss 0.001339 1.001340 0.006239 0.215 0.8301
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.022 0.9785 1.0028 1.041
wt.loss 1.001 0.9987 0.9892 1.014
Concordance= 0.563 (se = 0.026 )
Likelihood ratio test= 5.27 on 2 df, p=0.07
Wald test = 5.12 on 2 df, p=0.08
Score (logrank) test = 5.14 on 2 df, p=0.08
At any given instant in time, the hazard of dying for someone who is one year older is about 2.2% higher than for someone who is one year younger, adjusting for weight loss in the past six months. What this means is that if we pick two individuals with similar weight loss over the past six months, the one who is one year older has a hazard of death about 2.2% higher than the younger one.
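As a hedged check, the per-year hazard ratio is the exponentiated age coefficient, and the hazard ratio for a ten-year age difference follows by multiplying the coefficient by 10 before exponentiating:
exp(0.021697)        # ~1.022: roughly a 2.2% higher hazard per additional year of age
exp(10 * 0.021697)   # ~1.24: roughly a 24% higher hazard for a 10-year age difference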
The Cox proportional hazards model makes two key assumptions:
* survival curves for different strata must have hazard functions that are proportional over the time t;
* the relationship between the log hazard and each covariate is linear, which can be verified with residual plots.
Covariates can be categorical, such as race or treatment group, or continuous, such as biomarker concentrations.
cox.model5<-coxph(Surv(time,status)~age+wt.loss)
summary(cox.model5)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.021697 1.021934 0.009627 2.254 0.0242 *
wt.loss 0.001339 1.001340 0.006239 0.215 0.8301
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.022 0.9785 1.0028 1.041
wt.loss 1.001 0.9987 0.9892 1.014
Concordance= 0.563 (se = 0.026 )
Likelihood ratio test= 5.27 on 2 df, p=0.07
Wald test = 5.12 on 2 df, p=0.08
Score (logrank) test = 5.14 on 2 df, p=0.08
If you recall, we assume that the relationship between the covariates (X-variables) and the log hazard is linear. We can check this assumption in the same way we check linearity in a linear, logistic, or Poisson regression model: by looking at a residual plot, with the fitted values on the x-axis and the residuals on the y-axis.
plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
xlab="fitted values",ylab="Martingale residuals",
main="Residual Plot",las=1)
plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
xlab="fitted values",ylab="Martingale residuals",
main="Residual Plot",las=1)
abline(h=0)
plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
xlab="fitted values",ylab="Martingale residuals",
main="Residual Plot",las=1)
abline(h=0)
lines(smooth.spline(predict(cox.model5),residuals(cox.model5,type = "martingale")),col="red")
plot(predict(cox.model5),residuals(cox.model5,type = "deviance"))
plot(predict(cox.model5),residuals(cox.model5,type = "deviance"))
abline(h=0)
lines(smooth.spline(predict(cox.model5),
residuals(cox.model5,type = "deviance")),col="red")
From the residual plots above, the relationship between the covariates (X-variables) and the log hazard appears approximately linear.
This assumption is tested using the Schoenfeld residuals test, which has the following hypotheses: the null hypothesis is that the hazards are proportional (each coefficient is constant over time), and the alternative is that they are not.
Performing this test returns a test for each X-variable as well as a GLOBAL test for the overall model.
cox.zph(cox.model5)
chisq df p
age 0.6910 1 0.41
wt.loss 0.0121 1 0.91
GLOBAL 0.7209 2 0.70
The p-values for the covariates and for the overall (GLOBAL) model are greater than 0.05, so there is no statistically significant evidence to reject the null hypothesis of proportional hazards. We therefore conclude that the hazards are proportional over time.
We can also plot these tests (one plot for each parameter). These are plots of the estimated coefficient “b” allowed to vary over time: if “b” does vary with time, the proportional hazards assumption does not hold, because the effect is not constant over time. However, pay less attention to the extremes of the time axis, as the smoothed line is sensitive there.
plot(cox.zph(cox.model5))
The smooth line in the plot shows how the coefficient (and hence the hazard) would change over time if we allowed it to vary. The dotted lines on either side of the solid line are the confidence limits. Let us add a horizontal reference line through zero.
plot(cox.zph(cox.model5)[1], main="Plot of Hazards")
abline(h=0,col="red")
From the graph above, the reference line [abline] lies within the confidence bounds for at least a third of the time range. That alone is not enough to show that the coefficient for age is constant over time, so the proportional hazards assumption for age appears to be only partially met.
plot(cox.zph(cox.model5)[2], main="Plot of Hazards")
abline(h=0,col="red")
From the graph above, the reference line [abline] lies within the confidence bounds throughout the time range. This is a clear indication that the coefficient for weight loss is not changing over time. In other words, the proportional hazards assumption is fully met for weight loss.