In terms of selecting a statistical test, the most important question is “what is the main study hypothesis?” In some cases there is no hypothesis; the investigator just wants to “see what is there”. For example, in a prevalence study there is no hypothesis to test, and the size of the study is determined by how accurately the investigator wants to determine the prevalence. If there is no hypothesis, there is no statistical test. It is important to decide a priori which hypotheses are confirmatory (that is, testing some presupposed relationship) and which are exploratory (suggested by the data). No single study can support a whole series of hypotheses, so a sensible plan is to limit severely the number of confirmatory hypotheses. Although it is valid to use statistical tests on hypotheses suggested by the data, the P values should be used only as guidelines, and the results treated as tentative until confirmed by subsequent studies.

A useful guide is the Bonferroni correction, which states simply that if one is testing n independent hypotheses, one should use a significance level of 0.05/n. Thus if there were two independent hypotheses, a result would be declared significant only if P < 0.025. Note that, since tests are rarely independent, this is a very conservative procedure – i.e. one that is unlikely to reject the null hypothesis.

The investigator should then ask “are the data independent?” This can be difficult to decide, but as a rule of thumb results on the same individual, or from matched individuals, are not independent. Thus results from a crossover trial, or from a case-control study in which the controls were matched to the cases by age, sex and social class, are not independent.
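As a minimal sketch of the Bonferroni correction described above (the p-values here are made up purely for illustration), base R's p.adjust() gives the equivalent adjustment:

# Hypothetical p-values from n = 2 independent hypotheses
p_raw <- c(0.030, 0.012)
n <- length(p_raw)
p_raw < 0.05 / n                                 # compare each p-value to 0.05/n
p.adjust(p_raw, method = "bonferroni") < 0.05    # equivalent: adjust, then compare to 0.05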
Parametric tests assume a normal distribution of values, or a “bell-shaped curve”. For example, height is roughly normally distributed: if you were to graph the heights of a group of people, you would see a typical bell-shaped curve. This distribution is also called a Gaussian distribution. Parametric tests are in general more powerful (require a smaller sample size) than nonparametric tests. Common nonparametric counterparts include the Wilcoxon signed-rank test, the Mann–Whitney U test and the Kruskal–Wallis test.
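Before reaching for a parametric test, it can help to check the shape of the distribution first. The sketch below uses simulated heights (purely hypothetical data) with base R's hist() and shapiro.test():

# Simulated, roughly normal heights; a strongly skewed variable would instead
# point towards a nonparametric test.
set.seed(123)
heights <- rnorm(100, mean = 170, sd = 10)
hist(heights, main = "Simulated heights", xlab = "Height (cm)")
shapiro.test(heights)   # a large p-value gives no evidence against normality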
The One Sample t Test examines whether the mean of a population is statistically different from a known or hypothesized value. The One Sample t Test is a parametric test, also known as the Single Sample t Test. The variable used in this test is known as the test variable. In a One Sample t Test, the test variable’s mean is compared against a “test value”, which is a known or hypothesized value of the mean in the population. Test values may come from a literature review, a trusted research organization, legal requirements, or industry standards. For example:

1. A particular factory’s machines are supposed to fill bottles with 150 milliliters of product. A plant manager wants to test a random sample of bottles to ensure that the machines are not under- or over-filling the bottles.
2. The United States Environmental Protection Agency (EPA) sets clearance levels for the amount of lead present in homes: no more than 10 micrograms per square foot on floors and no more than 100 micrograms per square foot on window sills (as of December 2020). An inspector wants to test whether samples taken from units in an apartment building exceed the clearance level.
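As a hedged sketch of example 1 (the data here are simulated, not real bottling measurements), the one-sample t-test against the 150 ml target would look like this:

# Hypothetical sample of 30 bottle fill volumes, tested against mu = 150 ml
set.seed(42)
fill_ml <- rnorm(30, mean = 151, sd = 2)
t.test(fill_ml, mu = 150)   # H0: the true mean fill volume is 150 ml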
knitr::opts_chunk$set(comment = NA)
library(haven)
library(tidyverse)
library(agricolae)
library(stargazer)
library(highcharter)
library(ggplot2)
library(ggthemes)
library(plotrix)
library(tseries)
library(vars)
library(olsrr)
library(car)
library(grid)
library(rmarkdown)
library(ggtext)
library(RColorBrewer)
library(rhandsontable)
library(forecast)
library(quantmod)
library(xts)
library(PerformanceAnalytics)
library(rugarch)
library(dplyr)
library(magrittr)
library(gapminder)
library(plyr)
library(survival)
library(survminer)
library(ggpubr)
Use the command below to ensure that the output values are not written in scientific notation.
options(scipen=999)
theme_set(theme_economist())
theme_set(theme_base())   # note: this second call overrides theme_economist(); keep whichever default theme you prefer
Superstore_data<-read.csv("C:\\Users\\user\\Downloads\\Superstore_data.csv")
attach(Superstore_data)
head(Superstore_data,10)
tail(Superstore_data,10)
Sales<-ts(Superstore_data$Sales,start=1265,frequency = 1)
Profit<-ts(Superstore_data$Profit,start = 1265,frequency = 1)
ts.plot(Sales, main="A time series plot of Sales from Superstore company",xlab="Time in years",ylab="Sales")
A time series plot of Sales from Superstore company
ts.plot(Profit, main="A time series plot of Profit from Superstore company",xlab="Time in years",ylab="Profit")
A time series plot of Profit from Superstore company
The one-sample t-test is a statistical hypothesis test used to determine whether an unknown population mean is different from a specific value.
You can use the test for continuous data. Your data should be a random sample from a normal population.
If your sample sizes are very small, you might not be able to test for normality. You might need to rely on your understanding of the data. When you cannot safely assume normality, you can perform a nonparametric test that doesn’t assume normality.
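When normality is doubtful, the one-sample Wilcoxon signed-rank test is the usual nonparametric counterpart. As a sketch, it can be applied to the same Sales variable and the same test value of 200 used in the t-test that follows:

# Nonparametric counterpart of the one-sample t-test: tests whether Sales
# is centred at 200 without assuming normality.
wilcox.test(Superstore_data$Sales, mu = 200)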
The company assumes that from 2016 to 2019 it made average sales of approximately 200. Test this hypothesis to determine whether the company management was right in this assumption.
t.test(Superstore_data$Sales,mu=200,alternative="two.sided",conf.level=0.95)
One Sample t-test
data: Superstore_data$Sales
t = 4.7893, df = 9993, p-value = 0.000001698
alternative hypothesis: true mean is not equal to 200
95 percent confidence interval:
217.6375 242.0785
sample estimates:
mean of x
229.858
In the above output, the p-value of the test is approximately 0.0000017, which is far less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean of sales from 2016 to 2019 is not equal to 200 at a 5% level of significance.
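As a quick check on the arithmetic, the t statistic above can be reproduced by hand from its formula, t = (x̄ − μ0) / (s / √n):

# Hand computation of the one-sample t statistic and two-sided p-value
x <- Superstore_data$Sales
tval <- (mean(x) - 200) / (sd(x) / sqrt(length(x)))
tval                                     # matches t = 4.7893 in the output above
2 * pt(-abs(tval), df = length(x) - 1)   # matches the reported p-value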
The Independent Samples t Test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. The Independent Samples t Test is a parametric test.
data22<-Superstore_data %>%
dplyr::select(Region, Sales)%>%
filter(Region == "West"|
Region == "East")
head(data22,5)
tail(data22,5)
t.test(Sales ~ Region,alternative="two.sided",data = data22)
Welch Two Sample t-test
data: Sales by Region
t = 0.79611, df = 5603.9, p-value = 0.426
alternative hypothesis: true difference in means between group East and group West is not equal to 0
95 percent confidence interval:
-17.31977 41.00552
sample estimates:
mean in group East mean in group West
238.3361 226.4932
In the above output, the p-value of the test is 0.426, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis and conclude that there is no statistically significant difference in mean sales between the West and East regions at a 5% level of significance.
data23<-Superstore_data %>%
dplyr::select(Category, Sales)%>%
filter(Category == "Technology"|
Category == "Furniture")
head(data23,5)
tail(data23,5)
t.test(Sales ~ Category,alternative="two.sided",data = data23)
Welch Two Sample t-test
data: Sales by Category
t = -3.6721, df = 2497.7, p-value = 0.0002456
alternative hypothesis: true difference in means between group Furniture and group Technology is not equal to 0
95 percent confidence interval:
-157.8094 -47.9394
sample estimates:
mean in group Furniture mean in group Technology
349.8349 452.7093
In the above output, the p-value of the test is approximately 0.0002, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean difference in sales between the Technology and Furniture categories is not equal to zero at a 5% level of significance.
data24<-Superstore_data %>%
dplyr::select(Category, Sales)%>%
filter(Category == "Technology"|
Category == "Office Supplies")
head(data24,5)
t.test(Sales ~ Category,alternative="two.sided",data = data24)
Welch Two Sample t-test
data: Sales by Category
t = -12.694, df = 1982.1, p-value < 0.00000000000000022
alternative hypothesis: true difference in means between group Office Supplies and group Technology is not equal to 0
95 percent confidence interval:
-384.8897 -281.8807
sample estimates:
mean in group Office Supplies mean in group Technology
119.3241 452.7093
In the above output, the p-value of the test is far below 0.0001, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean difference in sales between the Technology and Office Supplies categories is not equal to zero at a 5% level of significance.
One-Way ANOVA (“analysis of variance”) compares the means of two or more independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. One-Way ANOVA is a parametric test.
anova_model<-aov(Sales~Category)
summary(anova_model)
Df Sum Sq Mean Sq F value Pr(>F)
Category 2 195881745 97940872 265.5 <0.0000000000000002 ***
Residuals 9991 3685743767 368906
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 265.5 with an associated p-value far below 0.05. This indicates that there exists a significant difference in average sales across the three categories at a 5% level of significance.
Tukey HSD is the commonly used post hoc test to determine which pairs of categories differ in mean sales from 2016 to 2019.
tukey_hsd<-TukeyHSD(anova_model)
tukey_hsd
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Sales ~ Category)
$Category
diff lwr upr p adj
Office Supplies-Furniture -230.5108 -266.45583 -194.5657 0.0000000
Technology-Furniture 102.8744 57.56303 148.1858 0.0000003
Technology-Office Supplies 333.3852 295.51937 371.2510 0.0000000
From the results above, a significant difference in average sales exists between every pair of categories, as indicated by an adjusted p-value below 0.05 for all pairs. These results can be visualized as shown below.
plot(tukey_hsd, las=1)
Tukey's honest significant difference
anova_model2<-aov(Sales~Sub.Category)
summary(anova_model2)
Df Sum Sq Mean Sq F value Pr(>F)
Sub.Category 16 776253968 48515873 155.9 <0.0000000000000002 ***
Residuals 9977 3105371543 311253
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 155.9 with an associated p-value far below 0.05. This indicates that there exists a significant difference in average sales across the seventeen sub-categories at a 5% level of significance.
tukey_hsd2<-TukeyHSD(anova_model2)
tukey_hsd2
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Sales ~ Sub.Category)
$Sub.Category
diff lwr upr p adj
Appliances-Accessories 14.7811064 -98.352371 127.914584 1.0000000
Art-Accessories -181.9057697 -279.299223 -84.512317 0.0000000
Binders-Accessories -82.4140438 -167.571615 2.743527 0.0707858
Bookcases-Accessories 287.8850290 142.479391 433.290667 0.0000000
Chairs-Accessories 316.3578159 212.227967 420.487665 0.0000000
Copiers-Accessories 1982.9670138 1738.872771 2227.061256 0.0000000
Envelopes-Accessories -151.1068795 -290.643768 -11.569991 0.0187882
Fasteners-Accessories -202.0378297 -350.263756 -53.811903 0.0003125
Furnishings-Accessories -120.1489362 -213.413396 -26.884476 0.0010412
Labels-Accessories -181.6715489 -304.305120 -59.037978 0.0000400
Machines-Accessories 1429.5787092 1236.717765 1622.439653 0.0000000
Paper-Accessories -158.6905119 -245.436934 -71.944090 0.0000000
Phones-Accessories 155.2369304 60.389846 250.084015 0.0000021
Storage-Accessories 48.6159493 -47.347208 144.579107 0.9455486
Supplies-Accessories 29.6755961 -126.561825 185.913017 0.9999997
Tables-Accessories 432.8201673 304.435984 561.204351 0.0000000
Art-Appliances -196.6868761 -309.258580 -84.115172 0.0000002
Binders-Appliances -97.1951502 -199.365184 4.974884 0.0839321
Bookcases-Appliances 273.1039226 117.124105 429.083741 0.0000002
Chairs-Appliances 301.5767095 183.128706 420.024713 0.0000000
Copiers-Appliances 1968.1859073 1717.648678 2218.723136 0.0000000
Envelopes-Appliances -165.8879859 -316.411895 -15.364077 0.0147424
Fasteners-Appliances -216.8189361 -375.431134 -58.206739 0.0002927
Furnishings-Appliances -134.9300426 -243.949139 -25.910946 0.0022497
Labels-Appliances -196.4526554 -331.455976 -61.449334 0.0000641
Machines-Appliances 1414.7976027 1213.844256 1615.750949 0.0000000
Paper-Appliances -173.4716183 -276.969665 -69.973572 0.0000009
Phones-Appliances 140.4558240 30.079770 250.831878 0.0013283
Storage-Appliances 33.8348429 -77.501726 145.171412 0.9998021
Supplies-Appliances 14.8944897 -151.229065 181.018045 1.0000000
Tables-Appliances 418.0390609 277.791414 558.286708 0.0000000
Binders-Art 99.4917259 15.081912 183.901540 0.0052743
Bookcases-Art 469.7907987 324.821821 614.759777 0.0000000
Chairs-Art 498.2635856 394.744359 601.782813 0.0000000
Copiers-Art 2164.8727835 1921.038405 2408.707162 0.0000000
Envelopes-Art 30.7988902 -108.282913 169.880694 0.9999974
Fasteners-Art -20.1320600 -167.929659 127.665539 1.0000000
Furnishings-Art 61.7568335 -30.825370 154.339037 0.6491218
Labels-Art 0.2342208 -121.881288 122.349730 1.0000000
Machines-Art 1611.4844789 1418.952537 1804.016420 0.0000000
Paper-Art 23.2152578 -62.797221 109.227737 0.9999589
Phones-Art 337.1427001 242.966407 431.318994 0.0000000
Storage-Art 230.5217190 135.221496 325.821942 0.0000000
Supplies-Art 211.5813658 55.750251 367.412481 0.0003411
Tables-Art 614.7259370 486.836518 742.615356 0.0000000
Bookcases-Binders 370.2990728 233.250425 507.347720 0.0000000
Chairs-Binders 398.7718597 306.671058 490.872661 0.0000000
Copiers-Binders 2065.3810576 1826.170834 2304.591281 0.0000000
Envelopes-Binders -68.6928357 -199.498322 62.112651 0.9265977
Fasteners-Binders -119.6237859 -259.661129 20.413557 0.2055723
Furnishings-Binders -37.7348924 -117.345140 41.875355 0.9698971
Labels-Binders -99.2575051 -211.856461 13.341450 0.1629283
Machines-Binders 1511.9927530 1325.351719 1698.633787 0.0000000
Paper-Binders -76.2764681 -148.140741 -4.412195 0.0244962
Phones-Binders 237.6509742 156.192387 319.109562 0.0000000
Storage-Binders 131.0299931 48.274572 213.785414 0.0000059
Supplies-Binders 112.0896399 -36.401652 260.580932 0.4181442
Tables-Binders 515.2342111 396.398061 634.070362 0.0000000
Chairs-Bookcases 28.4727869 -121.105104 178.050678 0.9999997
Copiers-Bookcases 1695.0819848 1428.412678 1961.751291 0.0000000
Envelopes-Bookcases -438.9919085 -615.063096 -262.920721 0.0000000
Fasteners-Bookcases -489.9228587 -672.956859 -306.888859 0.0000000
Furnishings-Bookcases -408.0339652 -550.261879 -265.806052 0.0000000
Labels-Bookcases -469.5565779 -632.558150 -306.555006 0.0000000
Machines-Bookcases 1141.6936801 920.954324 1362.433037 0.0000000
Paper-Bookcases -446.5755409 -584.617062 -308.534020 0.0000000
Phones-Bookcases -132.6480986 -275.918784 10.622587 0.1080946
Storage-Bookcases -239.2690797 -383.281050 -95.257110 0.0000013
Supplies-Bookcases -258.2094329 -447.789631 -68.629235 0.0003179
Tables-Bookcases 144.9351383 -22.435763 312.306039 0.1863844
Copiers-Chairs 1666.6091979 1420.006796 1913.211600 0.0000000
Envelopes-Chairs -467.4646954 -611.344120 -323.585271 0.0000000
Fasteners-Chairs -518.3956456 -670.716593 -366.074698 0.0000000
Furnishings-Chairs -436.5067521 -536.151146 -336.862358 0.0000000
Labels-Chairs -498.0293648 -625.582250 -370.476480 0.0000000
Machines-Chairs 1113.2208933 917.195157 1309.246630 0.0000000
Paper-Chairs -475.0483278 -568.620158 -381.476498 0.0000000
Phones-Chairs -161.1208855 -262.248108 -59.993663 0.0000049
Storage-Chairs -267.7418666 -369.916586 -165.567147 0.0000000
Supplies-Chairs -286.6822198 -446.809910 -126.554529 0.0000001
Tables-Chairs 116.4623514 -16.628761 249.553464 0.1723836
Envelopes-Copiers -2134.0738932 -2397.589097 -1870.558690 0.0000000
Fasteners-Copiers -2185.0048435 -2453.222376 -1916.787311 0.0000000
Furnishings-Copiers -2103.1159499 -2345.330687 -1860.901213 0.0000000
Labels-Copiers -2164.6385627 -2419.606624 -1909.670502 0.0000000
Machines-Copiers -553.3883046 -848.625604 -258.151005 0.0000000
Paper-Copiers -2141.6575257 -2381.437969 -1901.877082 0.0000000
Phones-Copiers -1827.7300833 -2070.558600 -1584.901566 0.0000000
Storage-Copiers -1934.3510645 -2177.617681 -1691.084447 0.0000000
Supplies-Copiers -1953.2914176 -2226.018114 -1680.564721 0.0000000
Tables-Copiers -1550.1468465 -1807.930124 -1292.363569 0.0000000
Fasteners-Envelopes -50.9309502 -229.338317 127.476417 0.9999128
Furnishings-Envelopes 30.9579433 -105.264385 167.180272 0.9999962
Labels-Envelopes -30.5646695 -188.353313 127.223974 0.9999996
Machines-Envelopes 1580.6855886 1363.767155 1797.604023 0.0000000
Paper-Envelopes -7.5836324 -139.429015 124.261751 1.0000000
Phones-Envelopes 306.3438099 169.033094 443.654526 0.0000000
Storage-Envelopes 199.7228288 61.638829 337.806829 0.0000749
Supplies-Envelopes 180.7824756 -4.334771 365.899722 0.0644947
Tables-Envelopes 583.9270468 421.628674 746.225420 0.0000000
Furnishings-Fasteners 81.8888935 -63.221082 226.998869 0.8728069
Labels-Fasteners 20.3662808 -145.156039 185.888600 1.0000000
Machines-Fasteners 1631.6165388 1409.009286 1854.223792 0.0000000
Paper-Fasteners 43.3473178 -97.661856 184.356492 0.9997704
Phones-Fasteners 357.2747601 211.142577 503.406943 0.0000000
Storage-Fasteners 250.6537790 103.794754 397.512804 0.0000005
Supplies-Fasteners 231.7134258 39.961562 423.465289 0.0034553
Tables-Fasteners 634.8579970 465.031191 804.684803 0.0000000
Labels-Furnishings -61.5226128 -180.371140 57.325914 0.9347411
Machines-Furnishings 1549.7276453 1359.251076 1740.204214 0.0000000
Paper-Furnishings -38.5415757 -119.849148 42.765997 0.9698815
Phones-Furnishings 275.3858666 185.486205 365.285528 0.0000000
Storage-Furnishings 168.7648855 77.688504 259.841267 0.0000000
Supplies-Furnishings 149.8245323 -3.459881 303.108945 0.0639163
Tables-Furnishings 552.9691034 428.195395 677.742812 0.0000000
Machines-Labels 1611.2502581 1404.799158 1817.701358 0.0000000
Paper-Labels 22.9810370 -90.824299 136.786373 0.9999993
Phones-Labels 336.9084794 216.814007 457.002952 0.0000000
Storage-Labels 230.2874982 109.309647 351.265350 0.0000000
Supplies-Labels 211.3471451 38.613696 384.080594 0.0027717
Tables-Labels 614.4917162 466.474111 762.509322 0.0000000
Paper-Machines -1588.2692211 -1775.640525 -1400.897917 0.0000000
Phones-Machines -1274.3417787 -1465.598238 -1083.085319 0.0000000
Storage-Machines -1380.9627599 -1572.775146 -1189.150374 0.0000000
Supplies-Machines -1399.9031130 -1627.923278 -1171.882949 0.0000000
Tables-Machines -996.7585419 -1206.676532 -786.840552 0.0000000
Phones-Paper 313.9274423 230.809266 397.045619 0.0000000
Storage-Paper 207.3064612 122.916951 291.695972 0.0000000
Supplies-Paper 188.3661080 38.957964 337.774252 0.0015901
Tables-Paper 591.5106792 471.530846 711.490513 0.0000000
Storage-Phones -106.6209811 -199.317352 -13.924610 0.0078092
Supplies-Phones -125.5613343 -279.813794 28.691125 0.2813586
Tables-Phones 277.5832369 151.622179 403.544295 0.0000000
Supplies-Storage -18.9403532 -173.881566 136.000860 1.0000000
Tables-Storage 384.2042180 257.400644 511.007792 0.0000000
Tables-Supplies 403.1445712 226.282053 580.007090 0.0000000
Make an interpretation of the results above.

## Plot
plot(tukey_hsd2, las=1)
A plot of Tukey's honest significant difference
ANOVA stands for analysis of variance and tests for differences in the effects of independent variables on a dependent variable. A two-way ANOVA test is a statistical test used to determine the effect of two nominal predictor variables on a continuous outcome variable.
Superstore_data$Category<-as.factor(Superstore_data$Category)
Superstore_data$Sub.Category<- as.factor((Superstore_data$Sub.Category))
linear_model<- lm(Sales~Category+Sub.Category, data = Superstore_data)
Two_factor_anova<-anova(linear_model)
Two_factor_anova
From the results above, the F statistic for Category is 314.67 with an associated p-value far below 0.01, indicating a significant difference in mean sales across categories at a 1% level of significance. On the other hand, the F statistic for Sub.Category is 133.19 with an associated p-value far below 0.01, a clear indication that there exists a significant difference in average sales across sub-categories at a 1% level of significance.
stargazer(Two_factor_anova, type="text")
=================================================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------------------------------------
Df 3 3,331.000 5,755.608 2 9,977
Sum Sq 3 1,293,875,171.000 1,580,537,112.000 195,881,745.000 3,105,371,543.000
Mean Sq 3 46,569,095.000 49,015,303.000 311,253.000 97,940,872.000
F value 2 223.927 128.325 133.188 314.666
Pr(> F) 2 0.000 0.000 0 0
---------------------------------------------------------------------------------
The Chi-square goodness of fit test is a statistical hypothesis test used to determine whether a variable is likely to come from a specified distribution or not. It is often used to evaluate whether sample data is representative of the full population.
Let us enter some data using the command below. The data, obtained from LinkedIn, gives the number of job openings in four different categories of professions.
jobs<-c(11091,11282,15378,12696)
Assign the following names to the categories:
names(jobs)<-c("Project Management","Supply chain","Services","Quality control")
Run the following command to view the available jobs from various categories of professions
jobs
Project Management Supply chain Services Quality control
11091 11282 15378 12696
It is important to calculate the proportion of job vacancies in various categories.
jobs/sum(jobs)
Project Management Supply chain Services Quality control
0.2198545 0.2236407 0.3048348 0.2516701
These are the proportions we get. Remember that these proportions may change on a daily basis: new jobs keep opening as others close, so the proportions are prone to variation. We would therefore like to find out whether the differences in these proportions are real or due to chance; in other words, whether they are statistically significant. Let us assume that the proportions are the same in all four categories, that is, approximately 25% in each. We create the object probability as shown below.
probability<-c(0.25,0.25,0.25,0.25)
For the chi-square goodness-of-fit test, the hypotheses are as follows:

* Null hypothesis: the proportions in the job categories are not statistically different from 0.25.
* Alternative hypothesis: the proportions in the job categories are statistically different from 0.25.

The alternative hypothesis indicates that proportions in some job categories fall below or above 0.25. Run the following code chunk to run the chi-square test.
chisq.test(jobs,p=probability)
Chi-squared test for given probabilities
data: jobs
X-squared = 930.89, df = 3, p-value < 0.00000000000000022
From the result, the chi-square test statistic is 930.89 with an associated p-value far below 0.05. There is therefore not sufficient evidence to support the null hypothesis, and we conclude that the proportions of jobs are statistically different from 0.25, that is, below or above 0.25, at a 5% level of significance. The degrees of freedom are 3 because there are four categories (n − 1).
Students in a social studies class hypothesize that the literacy rates across the world for every region are 82%. The table below shows the actual literacy rates across the world broken down by region. What are the test statistic and the degrees of freedom?
illit_lavel<-c(99,99.5,67.3,62.5,91,93.8,61.9,91.9,84.5,66.4)
Assign the following region names to the values:
names(illit_lavel)<-c("Developed Regions","Commonwealth of Independent States","Northern Africa", "Sub-saharan Africa","Latin American and the caribbean","Eastern Asia","Southern Asia","South-Eastern Asia","Western Asia","Oceania")
Run the following command to view the literacy rates for the various regions.
illit_lavel
Developed Regions Commonwealth of Independent States
99.0 99.5
Northern Africa Sub-saharan Africa
67.3 62.5
Latin American and the caribbean Eastern Asia
91.0 93.8
Southern Asia South-Eastern Asia
61.9 91.9
Western Asia Oceania
84.5 66.4
It is important to calculate each region's share of the total literacy rates.
illit_lavel/sum(illit_lavel)
Developed Regions Commonwealth of Independent States
0.12105649 0.12166789
Northern Africa Sub-saharan Africa
0.08229396 0.07642455
Latin American and the caribbean Eastern Asia
0.11127415 0.11469797
Southern Asia South-Eastern Asia
0.07569088 0.11237466
Western Asia Oceania
0.10332600 0.08119345
chisq.test(illit_lavel)
Chi-squared test for given probabilities
data: illit_lavel
X-squared = 26.449, df = 9, p-value = 0.001725
1-pchisq(26.449,9)
[1] 0.001724323
From the result, the chi-square test statistic is 26.449 with an associated p-value of approximately 0.002, which is far below 0.05. There is therefore not sufficient evidence to support the null hypothesis, and we conclude that the literacy rates across the regions are statistically different from a uniform 82%, that is, some are below or above 82%, at a 5% level of significance. The degrees of freedom are 9 because there are ten regions (n − 1).
Employers want to know which days of the week employees are absent in a five-day work week. Most employers would like to believe that employees are absent equally during the week. Suppose a random sample of 60 managers were asked on which day of the week they had the highest number of employee absences. The results were distributed as in the table below. For the population of employees, do the days for the highest number of absences occur with equal frequencies during a five-day work week? Test at a 5% significance level.
absenteeism<-c(15,12,9,9,15)
Assign the following day names to the values:
names(absenteeism)<-c("Monday","Tuesday","Wednesday","Thursday","Friday")
Run the following command to view the number of reported absences for each day of the week.
absenteeism
Monday Tuesday Wednesday Thursday Friday
15 12 9 9 15
It is important to calculate the proportion of absences on each day of the week.
absenteeism/sum(absenteeism)
Monday Tuesday Wednesday Thursday Friday
0.25 0.20 0.15 0.15 0.25
prob<-c(0.20,0.20,0.20,0.20,0.20)
chisq.test(absenteeism,p=prob)
Chi-squared test for given probabilities
data: absenteeism
X-squared = 3, df = 4, p-value = 0.5578
1- pchisq(3,4)
[1] 0.5578254
From the result, the chi-square test statistic is 3 with an associated p-value of approximately 0.5578, which is far above 0.05. There is therefore not sufficient evidence to reject the null hypothesis, so at a 5% level of significance we cannot conclude that the absence days occur with unequal frequencies. In other words, the data are consistent with absences occurring with equal frequency across the five days.

## For further practice consider the example below.
knitr::include_graphics("chi.png")
Distribution of voters
The Chi-Square statistic is commonly used for testing relationships between categorical variables. The null hypothesis of the Chi-Square test is that no relationship exists between the categorical variables in the population; they are independent. An example research question that could be answered using a Chi-Square analysis would be:

* Is there a significant relationship between voter intent and political party membership?
The Chi-Square statistic is most commonly used to evaluate Tests of Independence when using a cross tabulation (also known as a bivariate table). Cross tabulation presents the distributions of two categorical variables simultaneously, with the intersections of the categories of the variables appearing in the cells of the table. The Test of Independence assesses whether an association exists between the two variables by comparing the observed pattern of responses in the cells to the pattern that would be expected if the variables were truly independent of each other. Calculating the Chi-Square statistic and comparing it against a critical value from the Chi-Square distribution allows the researcher to assess whether the observed cell counts are significantly different from the expected cell counts.
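The observed and expected counts can be inspected directly from the object that chisq.test() returns. The sketch below uses a small made-up 2 × 2 table, not data from this document:

# Hypothetical 2 x 2 table of counts
tab <- rbind(c(30, 20),
             c(10, 40))
test <- chisq.test(tab)
test$observed   # the observed cell counts
test$expected   # the counts expected if the two variables were independent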
In other words, the chi-square test of independence checks whether two variables are likely to be related or not. We have counts for two categorical (nominal) variables and an idea that the two variables are not related; the test gives us a way to decide whether that idea is plausible. For instance, assume that we have two variables, gender and voting. Because we have two variables, we shall use the rbind function to store our data in the object voters.
voters<-rbind(c(1486,2131),c(2792,3591))
voters
[,1] [,2]
[1,] 1486 2131
[2,] 2792 3591
dimnames(voters)<-list(gender=c("male","female"),
voting=c("voted","didn't vote")
)
Run the code chunk to look at the data we have just entered
voters
voting
gender voted didn't vote
male 1486 2131
female 2792 3591
Run the code chunk below to run the chi square test of independence
chisq.test(voters)
Pearson's Chi-squared test with Yates' continuity correction
data: voters
X-squared = 6.5523, df = 1, p-value = 0.01047
From the result, the chi-square test statistic is 6.5523 with an associated p-value of approximately 0.01, which is below 0.05. There is therefore not sufficient evidence to support the null hypothesis, and we conclude that there is a statistically significant dependence of voting on gender at a 5% level of significance. We can reject the null hypothesis with 95% confidence.
Consider the information below
knitr::include_graphics("chi2.png")
Automobile owners
Let us enter the information into RStudio.
Buyer<-rbind(c(70,130,150),c(30,20,100))
Give names to the dimension columns and rows using the following code chunk
dimnames(Buyer)<-list(Gender=c("Male","Female"),
Reason=c("Styling","Engineering","Fuel economy"))
Buyer
Reason
Gender Styling Engineering Fuel economy
Male 70 130 150
Female 30 20 100
chisq.test(Buyer)
Pearson's Chi-squared test
data: Buyer
X-squared = 31.746, df = 2, p-value = 0.0000001278
From the result, the chi-square test statistic is 31.746 with an associated p-value far below 0.1. There is therefore not sufficient evidence to support the null hypothesis, and we conclude that there is a statistically significant dependence of the reason for purchase on gender at a 10% level of significance. We can reject the null hypothesis with 90% confidence. The output above has 2 degrees of freedom, calculated as follows:
\[ (\text{Number of columns}-1)\times(\text{Number of rows}-1) = (3-1)(2-1) = 2 \]
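The same degrees of freedom can be read straight off the dimensions of the table already stored in Buyer:

# Degrees of freedom for the test of independence
(ncol(Buyer) - 1) * (nrow(Buyer) - 1)   # (3 - 1) * (2 - 1) = 2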
A random sample of 400 people were surveyed and each person was asked to report the highest education level they obtained. The data that resulted from this survey is summarized in the following contingency table.
Education<-rbind(c(60,56,46,42),c(40,46,53,57))
Give names to the dimension columns and rows using the following code chunk
dimnames(Education)<-list(Gender=c("Male","Female"),
Educ_level=c("High school","Bachelors","Masters","PhD"))
Education
Educ_level
Gender High school Bachelors Masters PhD
Male 60 56 46 42
Female 40 46 53 57
chisq.test(Education)
Pearson's Chi-squared test
data: Education
X-squared = 7.5911, df = 3, p-value = 0.05526
From the result, the chi-square test statistic is 7.5911 with an associated p-value of approximately 0.055, which is slightly above 0.05. We therefore fail to reject the null hypothesis and conclude that there is no statistically significant dependence of education level on gender at a 5% level of significance. We retain the null hypothesis with 95% confidence.
Ch_sqt <- read_sav("C:/Users/user/Downloads/Effects of employees voice mechanism on academic staff turnover intentions1.sav")
attach(Ch_sqt)
head(Ch_sqt,10)
TBL<-table(Ch_sqt$Q4, Ch_sqt$Q6)
TBL
1 2 3 4
1 155 16 5 52
2 53 3 2 67
The two-way table above has no column or row names, so it is important to provide column and row names for easier identification of the dimensions.
dimnames(TBL)<-list(Employment_terms=c("Permanent","Contractual"),
Trade_union=c("UASU","KUSU","Any other","Not a member of any"))
TBL
Trade_union
Employment_terms UASU KUSU Any other Not a member of any
Permanent 155 16 5 52
Contractual 53 3 2 67
Run the following code chunk to run the chi square test
chisq.test(TBL)
Warning in chisq.test(TBL): Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: TBL
X-squared = 35.018, df = 3, p-value = 0.0000001208
From the result, the chi-square test statistic is 35.018 with an associated p-value far below 0.05. There is therefore not sufficient evidence to support the null hypothesis, and we conclude that there is a statistically significant dependence of trade union membership on terms of employment at a 5% level of significance. We can reject the null hypothesis with 95% confidence.
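Because of the warning above (some expected cell counts are small, so the chi-square approximation may be unreliable), a Monte Carlo p-value is a reasonable cross-check. This sketch is an addition, not part of the original analysis:

# Monte Carlo p-value for the same table, avoiding the chi-square approximation
set.seed(1)
chisq.test(TBL, simulate.p.value = TRUE, B = 10000)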
TAB<-table(Ch_sqt$Q6)
TAB
1 2 3 4
208 19 7 119
dimnames(TAB)<-list(Trade_Union=c("UASU","KUSU","Member of any other","Not a member of any"))
TAB
Trade_Union
UASU KUSU Member of any other Not a member of any
208 19 7 119
My_TAB<-count(Ch_sqt,'Q6')
My_TAB
rownames(My_TAB)<-c("UASU","KUSU","Member of any other","Not a member of any")
My_TAB
Statistical techniques are tools that enable us to answer questions about possible patterns in empirical data. It is not surprising, then, to learn that many important techniques of statistical analysis were developed by scientists who were interested in answering very specific empirical questions. So it was with regression analysis. The history of this particular statistical technique can be traced back to late nineteenth-century England and the pursuits of a gentleman scientist, Francis Galton. Galton was born into a wealthy family that produced more than its share of geniuses; he and Charles Darwin, the famous biologist, were first cousins. During his lifetime, Galton studied everything from fingerprint classification to meteorology, but he gained widespread recognition primarily for his work on inheritance. His most important insight came to him while he was studying the inheritance of one of the most obvious of all human characteristics: height. In order to understand how the characteristic of height was passed from one generation to the next, Galton collected data on the heights of individuals and the heights of their parents. After constructing frequency tables that classified these individuals both by their height and by the average height of their parents, Galton came to the unremarkable conclusion that tall people usually had tall parents and short people usually had short parents.
In order to run a linear regression, it is important first to establish that there is a linear association between the two variables; for instance, one can create a scatter plot of the two variables to assess the linear association. Consider the scatter plot below for Sales and Profit.
plot(Sales, Profit, main = "A scatter plot of Sales and Profit")
A scatter plot of Profit and Sales
plot(log(Sales),Profit)
plot(Sales, log(Profit))
Warning in log(Profit): NaNs produced
plot(log(Sales),log(Profit))
Warning in log(Profit): NaNs produced
Scatter plot of sales and profit
cor(Sales,Profit)
[1] 0.4790643
reg_model<-lm(Profit~Sales, data = Superstore_data)
summary(reg_model)
Call:
lm(formula = Profit ~ Sales, data = Superstore_data)
Residuals:
Min 1Q Median 3Q Max
-7397.5 2.6 14.6 21.7 5261.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.732867 2.192459 -5.808 0.00000000653 ***
Sales 0.180067 0.003301 54.555 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 205.6 on 9992 degrees of freedom
Multiple R-squared: 0.2295, Adjusted R-squared: 0.2294
F-statistic: 2976 on 1 and 9992 DF, p-value: < 0.00000000000000022
stargazer(reg_model, type="text")
===============================================
Dependent variable:
---------------------------
Profit
-----------------------------------------------
Sales 0.180***
(0.003)
Constant -12.733***
(2.192)
-----------------------------------------------
Observations 9,994
R2 0.230
Adjusted R2 0.229
Residual Std. Error 205.639 (df = 9992)
F Statistic 2,976.247*** (df = 1; 9992)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
anova(reg_model)
plot(reg_model)
Regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage
confint(reg_model)
2.5 % 97.5 %
(Intercept) -17.0305284 -8.4352058
Sales 0.1735967 0.1865366
Resid<-residuals(reg_model)
summary(Resid)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-7397.542 2.581 14.634 0.000 21.658 5261.551
From the results above, the residuals have a mean of 0.00, which indicates that the zero conditional mean condition is met. Next, it is important to check the covariance between the residuals and the independent variable; if the covariance is approximately zero, this is consistent with interpreting changes in the dependent variable as driven by the regressor.
cov(Resid,Sales)
[1] -0.00000000005062127
The covariance is approximately zero, consistent with the model's assumption that changes in Profit are driven by changes in Sales.
Resid<-ts(Resid, start = 1265, frequency = 1)
ts.plot(Resid, main="A time series plot of regression residuals",xlab="Time in years",ylab="Residuals")
abline(0,-0.00000000005062127)
Regression residuals
hist(log(Sales),main="Histogram showing the distribution of sales from Superstore",xlab="log of sales", ylab="Frequencies",breaks = 15)
summary(log(Sales))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.8119 2.8495 3.9980 4.1098 5.3468 10.0274
ggplot(data=Superstore_data, aes(log(Sales))) +
  geom_histogram(aes(y = after_stat(density)), color="red")+
  scale_x_continuous(limits = c(-2,12))+
  stat_function(fun = dnorm,
                args = list(mean = mean(log(Sales)),
                            sd = sd(log(Sales))),
                col = "#1b98e0",
                linewidth = 1)+
  labs(title = "Histogram showing the distribution of
       Sales for Superstore Company",
       subtitle = "Sales from 2016 to 2019",
       caption="Data Collected and compiled by the Superstore Company")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing missing values (`geom_bar()`).
Histogram showing the distribution of sales
ggplot(data=Superstore_data, aes(log(Sales))) +
  geom_histogram(aes(y = after_stat(density)), color="red")+
  scale_x_continuous(limits = c(-2,12))+
  stat_function(fun = dnorm,
                args = list(mean = mean(log(Sales)),
                            sd = sd(log(Sales))),
                col = "#1b98e0",
                linewidth = 1)+
  labs(title = "Histogram showing the distribution of Sales for
       Superstore Company",
       subtitle = "Sales from 2016 to 2019",
       caption="Data Collected and compiled by the Superstore Company")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.subtitle = element_text(hjust = 0.5))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing missing values (`geom_bar()`).
Histogram showing the distribution of Sales
Summarize the average sales from various regions
Average_sales<-Superstore_data %>%
  dplyr::select(Sales, Region)%>%
  filter(Region == "Central") %>%
  group_by(Region)%>%
  dplyr::summarise(Average_sale = mean(Sales))   # dplyr:: prefix avoids plyr's summarise masking the grouped version
Average_sales
Average_sales<-Superstore_data %>%
  dplyr::select(Sales, Region)%>%
  filter(Region == "South") %>%
  group_by(Region)%>%
  dplyr::summarise(Average_sale = mean(Sales))
Average_sales
Superstore_data %>% group_by(Region) %>%
  dplyr::summarize(average_sales=mean(Sales))
Superstore_data %>% group_by(Region) %>%
  drop_na() %>%
  dplyr::summarize(average_sales=mean(Sales))
# Create data for the graph.
average_sales <- c(215.7727, 236.1127, 241.8036,226.4932)
Regions <- c("Central","East","South","West")
piepercent<- round(100*average_sales/sum(average_sales), 1)
# Plot the chart.
pie3D(average_sales,labels = Regions,explode = 0.06, main = "Pie Chart of
      Sales from Various Regions of the United States")
average_sales <- c(215.7727, 236.1127, 241.8036,226.4932)
Regions <- c("Central","East","South","West")
piepercent<- round(100*average_sales/sum(average_sales), 1)
pie(average_sales, labels = piepercent, main = "Pie Chart of Sales from
Various regions of the United States",col = rainbow(length(average_sales)))
legend("topright", c("Central","East","South","West"), cex = 0.8,
fill = rainbow(length(average_sales)))
Pie chart of sales for various regions
Summarize the average sales from various categories
Superstore_data %>% group_by(Category) %>%
  drop_na() %>%
  dplyr::summarize(average_sales=mean(Sales))
# Create data for the graph.
average_sales <- c(347.7488, 119.0760,452.5782)
Sales_category <- c("Furniture","Office Supplies","Technology")
piepercent<- round(100*average_sales/sum(average_sales), 1)
# Plot the chart.
pie3D(average_sales,labels = Sales_category,explode = 0.06, main = "Pie Chart of
Sales from Various Categories")
pie(average_sales, labels = piepercent, main = "Pie Chart of Sales from
Various Categories",col = rainbow(length(average_sales)))
legend("topright", c("Furniture","Office Supplies","Technology"), cex = 0.8,
fill = rainbow(length(average_sales)))
Pie chart of sales for various categories
average_sales1 <- c(347.7488, 119.0760,452.5782)
Sales_category1 <- c("Furniture","Office Supplies","Technology")
barplot(average_sales1,names.arg=Sales_category1,xlab="Category",ylab="Average Sales",
col="blue",
main="Bar graph for the average sales for various category",border="red")
Bar graph for various categories
Superstore_data %>% group_by(Sub.Category) %>%
  drop_na() %>%
  dplyr::summarize(average_sales=mean(Sales))
average_sales2 <- c(216.13882,230.08435,34.10157,133.56056,486.67443,532.03556,2198.94162,65.11606,13.93677,95.82567,34.30305, 1645.55331,57.30044,370.17151,263.05245,245.65020,648.79477 )
sub_catgory_sales <- c("Accessories","Appliances","Art","Binders","Bookcases","Chairs","Copiers","Envelopes","Fasteners","Furnishings","Labels","Machines","Paper","Phones", "Storage","Supplies","Tables")
barplot(average_sales2,names.arg=sub_catgory_sales,xlab="sub_category",ylab="Average Sales",col="grey",main="Bar graph for the average sales for various sub categories",border="blue")
Bar graph for various sub categories
ggplot(Superstore_data,aes(x=Sub.Category, y=Sales))+
geom_bar(stat = "identity")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and
Sub categories")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(Superstore_data,aes(x=Sub.Category, y=mean(Sales)))+
geom_bar(stat = "identity")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")+
theme(axis.text.x = element_markdown(angle = 90))
Grouped Barplot of Sales for various Categories and Sub categories
Superstore_data %>%
filter(Category == "Furniture"|
Category == "Office Supplies"|
Category == "Technology")%>%
ggplot(aes(Category, fill= Sub.Category))+
geom_bar(position = "dodge",
alpha=0.5)+
theme_economist()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
labs(title = "Grouped Bar Graph",
y="Number",
x="Category")
Grouped bar graph
Superstore_data %>%
filter(Category == "Furniture"|
Category == "Office Supplies"|
Category == "Technology")%>%
ggplot(aes(Category, fill= Sub.Category))+
geom_bar(position = "fill",
alpha=0.5)+
theme_economist()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
labs(title = "Grouped Bar Graph",
y="Number",
x="Category")
Stacked Bar graphs 100%
Superstore_data %>%
filter(Category == "Furniture"|
Category == "Office Supplies"|
Category == "Technology")%>%
ggplot(aes(Category, fill= Sub.Category))+
geom_bar(alpha=0.5)+
theme_economist_white()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
labs(title = "Grouped Bar Graph",
y="Frequency",
x="Category")
Stacked bar graph
ggplot(Superstore_data,aes(x=Category, y=mean(Sales), fill=Sub.Category))+
geom_bar(stat = "identity")+
theme_calc()
Stacked Bar Graph for proportion
ggplot(Superstore_data,aes(x=Category, y=Sales, fill=Sub.Category))+
geom_bar(stat = "identity", position = "dodge")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")
Grouped Barplot of Sales for various Categories and Sub categories
ggplot(Superstore_data,aes(x=Category, y=Sales, fill=Sub.Category))+
geom_bar(stat = "identity", position = "dodge")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")+
facet_wrap(Category~.)
Grouped Barplot of Sales for various Categories and Sub categories
ggplot(Superstore_data,aes(x=Category, y=Sales, fill=Sub.Category))+
geom_bar(stat = "identity", position = "dodge")+
theme_bw()+
labs(title = "Grouped Barplot of Sales for various Categories and Sub categories")+
  theme(legend.position = "none")+
  facet_wrap(Category~.)
unemploymnt_data<-read.csv("C:\\Users\\user\\Downloads\\Unemployment.csv")
attach(unemploymnt_data)
head(unemploymnt_data, 10)
tail(unemploymnt_data,10)
This is a quarterly data from 1980 first quarter to 2021 last quarter.
Unemployment<-ts(unemploymnt_data$Unemployment,frequency = 1,start = 1859)
ts.plot(Unemployment,main="Yearly time series plot of Unemployment from 1859 to 2022", xlab="Time in quarters",ylab="Unemployment rate")
Yearly time series plot of Unemployment from 1859 to 2022
Inflation<-ts(unemploymnt_data$Inflation,frequency = 1,start = 1859)
ts.plot(Inflation,main="Yearly time series plot of Inflation from 1859 to 2022", xlab="Time in quarters",ylab="Inflation rate")
Yearly time series plot of Inflation from 1859 to 2022
FedRate<-ts(unemploymnt_data$FedRate,frequency = 1,start = 1859)
ts.plot(FedRate,main="Yearly time series plot of FedRate from 1859 to 2022", xlab="Time in quarters",ylab="Fed Funds rate")
Yearly time series plot of FedRate from 1859 to 2022
plot(Unemployment,Inflation)
Scatter plot of unemployment and inflation
plot(FedRate,Inflation)
Scatter plot of FedRate and Inflation
cor(Inflation,FedRate)
[1] 0.6716283
cor(Inflation,Unemployment)
[1] 0.2172026
hist(Inflation, main="Histogram showing the distribution of Inflation")
Histogram showing the distribution of Inflation
hist(Unemployment, main="Histogram showing the distribution of Unemployment")
Histogram showing the distribution of Unemployment
hist(FedRate, main = "Histogram showing the distribution of FedRate")
Histogram showing the distribution of FedRate
shapiro.test(Inflation)
Shapiro-Wilk normality test
data: Inflation
W = 0.91272, p-value = 0.00000002459
shapiro.test(Unemployment)
Shapiro-Wilk normality test
data: Unemployment
W = 0.96757, p-value = 0.0006884
shapiro.test(FedRate)
Shapiro-Wilk normality test
data: FedRate
W = 0.9102, p-value = 0.00000001704
Estimate the model using the log transformed values of inflation, unemployment and FedRate to bring about linearity between the dependent variable and independent variables.
REG_MODEL<-lm(log(Inflation)~log(Unemployment)+log(FedRate),data = unemploymnt_data)
stargazer(REG_MODEL,type = "text")
===============================================
Dependent variable:
---------------------------
log(Inflation)
-----------------------------------------------
log(Unemployment) 0.176
(0.157)
log(FedRate) 0.976***
(0.085)
Constant -0.902***
(0.291)
-----------------------------------------------
Observations 164
R2 0.473
Adjusted R2 0.466
Residual Std. Error 0.498 (df = 161)
F Statistic 72.167*** (df = 2; 161)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
outlierTest(REG_MODEL)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p
14 -3.172999 0.0018094 0.29673
The Bonferroni-adjusted p-value (0.297, well above 0.05) shows that there are no significant outliers in the data set.
qqPlot(REG_MODEL, main= "QQ Plot Showing the Possible Presence of outliers")
QQ Plot Showing the Possible Presence of outliers
[1] 14 15
leveragePlots(REG_MODEL)
Leverage plots for the regression model
vif(REG_MODEL)
log(Unemployment) log(FedRate)
1.034369 1.034369
A VIF of 1.034 for log(Unemployment) and 1.034 for log(FedRate) indicates that the predictors are essentially uncorrelated, so multicollinearity is not a concern.
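The reported VIFs can also be reproduced by hand from their definition, VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the other; this is just an illustrative check:

# VIF by hand for a model with two predictors
aux <- lm(log(Unemployment) ~ log(FedRate), data = unemploymnt_data)
1 / (1 - summary(aux)$r.squared)   # should match the VIF reported above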
ncvTest(REG_MODEL)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 2.623628, Df = 1, p = 0.10528
The test is not significant (p = 0.105), so we fail to reject the null hypothesis of constant error variance; the results are consistent with homoscedastic errors.
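Since lmtest is already loaded, the studentized Breusch–Pagan test offers a quick cross-check of the same constant-variance assumption; this sketch is an addition to the original analysis:

# Cross-check of homoscedasticity with the Breusch-Pagan test
bptest(REG_MODEL)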
spreadLevelPlot(REG_MODEL)
Warning in spreadLevelPlot.lm(REG_MODEL):
2 negative fitted values removed
Spread-Level Plot for Regression Model
Suggested power transformation: 1.024086
durbinWatsonTest(REG_MODEL)
lag Autocorrelation D-W Statistic p-value
1 0.7001071 0.5736665 0
Alternative hypothesis: rho != 0
The results show that there is serial correlation in the regression residuals across periods (D-W = 0.57, p ≈ 0).
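Because the Durbin–Watson test points to serial correlation, Newey–West (HAC) standard errors are a common remedy alongside the heteroscedasticity-robust HC0 errors used next. This is a sketch using the sandwich package, not part of the original analysis:

# HAC standard errors that allow for the serial correlation detected above
library(sandwich)
coeftest(REG_MODEL, NeweyWest(REG_MODEL))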
coeftest(REG_MODEL, hccm(REG_MODEL, type = "hc0"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.901611 0.272771 -3.3054 0.001169 **
log(Unemployment) 0.176483 0.148486 1.1885 0.236368
log(FedRate) 0.976433 0.074715 13.0688 < 0.00000000000000022 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
stargazer(coeftest(REG_MODEL, hccm(REG_MODEL, type = "hc0")),type="text")
=============================================
Dependent variable:
---------------------------
---------------------------------------------
log(Unemployment) 0.176
(0.148)
log(FedRate) 0.976***
(0.075)
Constant -0.902***
(0.273)
=============================================
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
RSD<-residuals(REG_MODEL)
summary(RSD)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.52073 -0.35028 -0.01588 0.00000 0.35593 1.13292
The zero conditional mean assumption is met, because the residuals mean is approximately zero.
cov(RSD,log(FedRate))
[1] 0.00000000000000007316962
cov(RSD,log(Unemployment))
[1] 0.0000000000000001019151
RSD<-ts(RSD, frequency = 1,start = 1859)
ts.plot(RSD)
Times series of residual
ts.plot(RSD)
abline(0,0.000)
Time series plot of residuals with an abline line
ggplot(data=unemploymnt_data,aes(x=year,y=Inflation))+geom_line()
Times series plot of inflation
ggplot(data=unemploymnt_data,aes(x=year,y=Inflation))+geom_line()+
labs(title="Time Series plot of Inflation Rate",
caption="source:Federal Reserve",
y="Inflation Rate", x="Year",
color=3) + # title and caption
theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 12), # rotate x axis text
axis.title=element_text(size=12,face="bold"),
panel.grid = element_blank())+
#theme(panel.grid.minor = element_blank())+#turn off minor grid(to run remove #be4 theme)
theme(legend.text = element_text(size=12,face="bold"))+
theme_set(theme_economist())
Time Series plot of Inflation Rate
ggplot(data=unemploymnt_data,aes(x=year,y=Unemployment))+geom_line()+
labs(title="Time Series plot of Unemployment Rate",
caption="source:Federal Reserve",
y="Unemployment Rate", x="Year",
color=3) + # title and caption
theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 12), # rotate x axis text
axis.title=element_text(size=12,face="bold"),
panel.grid = element_blank())+
#theme(panel.grid.minor = element_blank())+#turn off minor grid(to run remove #be4 theme)
theme(legend.text = element_text(size=12,face="bold"))+
theme_set(theme_economist())
Time Series plot of Unemployment Rate
ggplot(data=unemploymnt_data,aes(x=year,y=FedRate))+geom_line()+
labs(title="Time Series plot of Fed Fund Rate",
caption="source:Federal Reserve",
y="Fed Fund Rate", x="Year",
color=3) + # title and caption
theme(axis.text.x = element_text(angle = 0, vjust=0.5, size = 12), # rotate x axis text
axis.title=element_text(size=12,face="bold"),
panel.grid = element_blank())+
#theme(panel.grid.minor = element_blank())+#turn off minor grid(to run remove #be4 theme)
theme(legend.text = element_text(size=12,face="bold"))+
theme_set(theme_economist())
Time Series plot of Fed Fund Rate
ggplot(data=unemploymnt_data,aes(x=year))+
geom_line(aes(y=Inflation,colour="Inflation Rate"))+
geom_line(aes(y=FedRate,colour="FedRate"))+
geom_line(aes(y=Unemployment,colour="Unemployment Rate"))+
labs(title="Trends of Inflation, Unemployment and FedRate from 1859 to 2022",
caption="", y="Rate", x="Time in Years", color="Key")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 10))
Trends of Inflation, Unemployment and FedRate from 1859 to 2022
date<-seq(as.Date("1859-01-01"),by="1 year",length.out=length(unemploymnt_data$year))
date
[1] "1859-01-01" "1860-01-01" "1861-01-01" "1862-01-01" "1863-01-01"
[6] "1864-01-01" "1865-01-01" "1866-01-01" "1867-01-01" "1868-01-01"
[11] "1869-01-01" "1870-01-01" "1871-01-01" "1872-01-01" "1873-01-01"
[16] "1874-01-01" "1875-01-01" "1876-01-01" "1877-01-01" "1878-01-01"
[21] "1879-01-01" "1880-01-01" "1881-01-01" "1882-01-01" "1883-01-01"
[26] "1884-01-01" "1885-01-01" "1886-01-01" "1887-01-01" "1888-01-01"
[31] "1889-01-01" "1890-01-01" "1891-01-01" "1892-01-01" "1893-01-01"
[36] "1894-01-01" "1895-01-01" "1896-01-01" "1897-01-01" "1898-01-01"
[41] "1899-01-01" "1900-01-01" "1901-01-01" "1902-01-01" "1903-01-01"
[46] "1904-01-01" "1905-01-01" "1906-01-01" "1907-01-01" "1908-01-01"
[51] "1909-01-01" "1910-01-01" "1911-01-01" "1912-01-01" "1913-01-01"
[56] "1914-01-01" "1915-01-01" "1916-01-01" "1917-01-01" "1918-01-01"
[61] "1919-01-01" "1920-01-01" "1921-01-01" "1922-01-01" "1923-01-01"
[66] "1924-01-01" "1925-01-01" "1926-01-01" "1927-01-01" "1928-01-01"
[71] "1929-01-01" "1930-01-01" "1931-01-01" "1932-01-01" "1933-01-01"
[76] "1934-01-01" "1935-01-01" "1936-01-01" "1937-01-01" "1938-01-01"
[81] "1939-01-01" "1940-01-01" "1941-01-01" "1942-01-01" "1943-01-01"
[86] "1944-01-01" "1945-01-01" "1946-01-01" "1947-01-01" "1948-01-01"
[91] "1949-01-01" "1950-01-01" "1951-01-01" "1952-01-01" "1953-01-01"
[96] "1954-01-01" "1955-01-01" "1956-01-01" "1957-01-01" "1958-01-01"
[101] "1959-01-01" "1960-01-01" "1961-01-01" "1962-01-01" "1963-01-01"
[106] "1964-01-01" "1965-01-01" "1966-01-01" "1967-01-01" "1968-01-01"
[111] "1969-01-01" "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01"
[116] "1974-01-01" "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01"
[121] "1979-01-01" "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01"
[126] "1984-01-01" "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01"
[131] "1989-01-01" "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01"
[136] "1994-01-01" "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01"
[141] "1999-01-01" "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01"
[146] "2004-01-01" "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01"
[151] "2009-01-01" "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01"
[156] "2014-01-01" "2015-01-01" "2016-01-01" "2017-01-01" "2018-01-01"
[161] "2019-01-01" "2020-01-01" "2021-01-01" "2022-01-01"
ggplot(data=unemploymnt_data,aes(x=date))+
geom_line(aes(y=Inflation,colour="Inflation Rate"))+
labs(title="Trends of Inflation Rate from 1859 to 2022",
caption="", y="Inflation Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y", breaks = "4 year")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Trends of Inflation Rate from 1859 to 2022
ggplot(data=unemploymnt_data,aes(x=date))+
geom_line(aes(y=Unemployment,colour="Unemployment Rate"))+
labs(title="Trends of Unemployment Rate from 1859 to 2022",
caption="", y="Unemployment Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y", breaks = "4 year")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
ggplot(data=unemploymnt_data,aes(x=date))+
geom_line(aes(y=FedRate,colour="Fed Fund Rate"))+
labs(title="Trends of Fed Fund Rate from 1859 to 2022",
caption="", y="Fed Fund Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y", breaks = "4 year")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Trends of Fed Fund Rate from 1859 to 2022
ggplot(data=unemploymnt_data,aes(x=date))+
geom_line(aes(y=FedRate,colour="Fed Fund Rate"))+
geom_line(aes(y=Inflation,colour="Inflation Rate"))+
geom_line(aes(y=Unemployment,colour="Unemployment Rate"))+
labs(title="Trends of Fed Fund Rate, Inflation Rate and Unemployment
Rate from 1859 to 2022",
caption="", y="Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y", breaks = "4 year")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Trends of Fed Rate, Inflation and Unemployment from 1859 to 2022
plot_1<-ggplot(data=unemploymnt_data,aes(x=date,y=Inflation))+geom_line()+
labs(title="Inflation Rate from 1859 to 2022",
caption="", y="Inflation Rate", x="Time in Years", color=3)+
scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
plot_2<-ggplot(data=unemploymnt_data,aes(x=date,y=Unemployment))+geom_line()+
labs(title="Unemployment Rate from 1859 to 2022",
caption="", y="Unemployment Rate", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
plot_3<-ggplot(data=unemploymnt_data,aes(x=date,y=FedRate))+geom_line()+
labs(title="Fed Fund Rates from 1859 to 2022",
caption="", y="Fed Fund Rates", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),ggplotGrob(plot_2),ggplotGrob(plot_3),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),ggplotGrob(plot_2),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),ggplotGrob(plot_3),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_2),ggplotGrob(plot_3),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_1),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_2),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(plot_3),size="last"))
data11<-data.frame(unemploymnt_data$year,unemploymnt_data$Inflation,unemploymnt_data$FedRate,unemploymnt_data$Unemployment,RSD)
head(data11,10)
write.csv(data11,"C:\\Users\\user\\Downloads\\residual_data.csv", row.names = FALSE)
ggplot(data=data11,aes(x=year,y=RSD))+geom_line()+
labs(title="Regression Residuals",
caption="Source:Federal Reserve", y="Residuals", x="Time in Years", color=3)
Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.
residual_data<-read.csv("C:\\Users\\user\\Downloads\\residual_data.csv")
attach(residual_data)
The following object is masked _by_ .GlobalEnv:
RSD
head(residual_data,10)
Time<-seq(as.Date("1859-01-01"),by="1 year",length.out=length(residual_data$unemploymnt_data.year))
Time
[1] "1859-01-01" "1860-01-01" "1861-01-01" "1862-01-01" "1863-01-01"
[6] "1864-01-01" "1865-01-01" "1866-01-01" "1867-01-01" "1868-01-01"
[11] "1869-01-01" "1870-01-01" "1871-01-01" "1872-01-01" "1873-01-01"
[16] "1874-01-01" "1875-01-01" "1876-01-01" "1877-01-01" "1878-01-01"
[21] "1879-01-01" "1880-01-01" "1881-01-01" "1882-01-01" "1883-01-01"
[26] "1884-01-01" "1885-01-01" "1886-01-01" "1887-01-01" "1888-01-01"
[31] "1889-01-01" "1890-01-01" "1891-01-01" "1892-01-01" "1893-01-01"
[36] "1894-01-01" "1895-01-01" "1896-01-01" "1897-01-01" "1898-01-01"
[41] "1899-01-01" "1900-01-01" "1901-01-01" "1902-01-01" "1903-01-01"
[46] "1904-01-01" "1905-01-01" "1906-01-01" "1907-01-01" "1908-01-01"
[51] "1909-01-01" "1910-01-01" "1911-01-01" "1912-01-01" "1913-01-01"
[56] "1914-01-01" "1915-01-01" "1916-01-01" "1917-01-01" "1918-01-01"
[61] "1919-01-01" "1920-01-01" "1921-01-01" "1922-01-01" "1923-01-01"
[66] "1924-01-01" "1925-01-01" "1926-01-01" "1927-01-01" "1928-01-01"
[71] "1929-01-01" "1930-01-01" "1931-01-01" "1932-01-01" "1933-01-01"
[76] "1934-01-01" "1935-01-01" "1936-01-01" "1937-01-01" "1938-01-01"
[81] "1939-01-01" "1940-01-01" "1941-01-01" "1942-01-01" "1943-01-01"
[86] "1944-01-01" "1945-01-01" "1946-01-01" "1947-01-01" "1948-01-01"
[91] "1949-01-01" "1950-01-01" "1951-01-01" "1952-01-01" "1953-01-01"
[96] "1954-01-01" "1955-01-01" "1956-01-01" "1957-01-01" "1958-01-01"
[101] "1959-01-01" "1960-01-01" "1961-01-01" "1962-01-01" "1963-01-01"
[106] "1964-01-01" "1965-01-01" "1966-01-01" "1967-01-01" "1968-01-01"
[111] "1969-01-01" "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01"
[116] "1974-01-01" "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01"
[121] "1979-01-01" "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01"
[126] "1984-01-01" "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01"
[131] "1989-01-01" "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01"
[136] "1994-01-01" "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01"
[141] "1999-01-01" "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01"
[146] "2004-01-01" "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01"
[151] "2009-01-01" "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01"
[156] "2014-01-01" "2015-01-01" "2016-01-01" "2017-01-01" "2018-01-01"
[161] "2019-01-01" "2020-01-01" "2021-01-01" "2022-01-01"
residual_plot<-ggplot(data=residual_data,aes(x=Time,y=RSD))+geom_line()+
labs(title="Regression Residuals",
caption="", y="Residual", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "4 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
grid.newpage()
grid.draw(rbind(ggplotGrob(residual_plot),size="last"))
sample_sales<-sample(Superstore_data$Sales,size = 200, replace = FALSE)
sample_profit<-sample(Superstore_data$Profit,size = 200, replace = FALSE)
sample_quantity<-sample(Superstore_data$Quantity,size = 200, replace = FALSE)
sample_data<-data.frame(sample_sales,sample_profit,sample_quantity)
head(sample_data,10)
tail(sample_data,10)
In statistics, omitted-variable bias (OVB) occurs when a statistical model leaves out one or more relevant variables. The bias arises when the omitted variable is correlated with one or more of the included independent variables and with the dependent variable. Consider the data below.
research_data<-read.csv("C:\\Users\\user\\Downloads\\research_data.csv")
attach(research_data)
The following object is masked from unemploymnt_data (pos = 4):
year
The following object is masked from unemploymnt_data (pos = 5):
year
head(research_data,10)
tail(research_data,10)
mlr1 <- lm(Private_investment~income_tax+value_added_tax, data = research_data)
stargazer(mlr1,type = "text")
===============================================
Dependent variable:
---------------------------
Private_investment
-----------------------------------------------
income_tax -1.017***
(0.344)
value_added_tax 2.828***
(0.658)
Constant 40,402.690***
(12,398.920)
-----------------------------------------------
Observations 46
R2 0.712
Adjusted R2 0.698
Residual Std. Error 60,037.290 (df = 43)
F Statistic 53.103*** (df = 2; 43)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
mlr2<- lm(Private_investment~income_tax+value_added_tax+import_tax, data = research_data)
stargazer(mlr2,type = "text")
===============================================
Dependent variable:
---------------------------
Private_investment
-----------------------------------------------
income_tax -1.026***
(0.323)
value_added_tax 1.752**
(0.743)
import_tax 4.249**
(1.630)
Constant 15,523.390
(15,054.700)
-----------------------------------------------
Observations 46
R2 0.752
Adjusted R2 0.734
Residual Std. Error 56,362.240 (df = 42)
F Statistic 42.433*** (df = 3; 42)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
waldtest(mlr1,mlr2,test = "F")
The null hypothesis for this test is that the coefficient of the omitted variable is zero; if we fail to reject it, the variable was correctly omitted. The alternative hypothesis is that the coefficient of the omitted variable is not equal to zero, so rejecting the null hypothesis indicates that the variable was incorrectly omitted. Here the p-value is approximately 0.01, so we reject the null hypothesis and conclude that import_tax was incorrectly omitted: the omitted variable helps to explain the variation in the response variable.
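An equivalent way to run the same nested-model comparison is with anova(), or with a direct linear-hypothesis test on the larger model. A minimal sketch, not part of the original analysis, assuming mlr1 and mlr2 are the models fitted above and the car package is loaded:
#Nested-model F test: is the import_tax coefficient zero in the larger model?
anova(mlr1, mlr2)
#The same restriction tested directly on mlr2 (car package)
linearHypothesis(mlr2, "import_tax = 0")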
fdi_data<-read.csv("C:\\Users\\user\\Downloads\\fdi.csv")
attach(fdi_data)
The following object is masked from research_data:
year
The following object is masked from unemploymnt_data (pos = 5):
year
The following object is masked from unemploymnt_data (pos = 6):
year
head(fdi_data,10)
FDI_USA....of.GDP.<-ts(fdi_data$FDI_USA....of.GDP.,frequency = 1,start = 1970)
ts.plot(FDI_USA....of.GDP.,main="Yearly time series plot of Foreign Direct
Investment Net Flow in the USA from 1970 to 2020", xlab="Time in Years",ylab="Foreign Direct Investment")
Yearly time series of FDI in USA
FDI_UK....of.GDP.<-ts(fdi_data$FDI_UK....of.GDP.,frequency = 1,start = 1970)
ts.plot(FDI_UK....of.GDP.,main="Yearly time series plot of Foreign Direct
Investment Net Flow in the UK from 1970 to 2020", xlab="Time in Years",ylab="Foreign Direct Investment")
Yearly time series of FDI in UK
FDI_Kenya....of.GDP.<-ts(fdi_data$FDI_Kenya....of.GDP.,frequency = 1,start = 1970)
ts.plot(FDI_Kenya....of.GDP.,main="Yearly time series plot of Foreign Direct
Investment Net Flow in Kenya from 1970 to 2020", xlab="Time in Years",ylab="Foreign Direct Investment")
Yearly time series of FDI in Kenya
date_time<-seq(as.Date("1970-01-01"),by="1 year",length.out=length(fdi_data$year))
date_time
[1] "1970-01-01" "1971-01-01" "1972-01-01" "1973-01-01" "1974-01-01"
[6] "1975-01-01" "1976-01-01" "1977-01-01" "1978-01-01" "1979-01-01"
[11] "1980-01-01" "1981-01-01" "1982-01-01" "1983-01-01" "1984-01-01"
[16] "1985-01-01" "1986-01-01" "1987-01-01" "1988-01-01" "1989-01-01"
[21] "1990-01-01" "1991-01-01" "1992-01-01" "1993-01-01" "1994-01-01"
[26] "1995-01-01" "1996-01-01" "1997-01-01" "1998-01-01" "1999-01-01"
[31] "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01" "2004-01-01"
[36] "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01" "2009-01-01"
[41] "2010-01-01" "2011-01-01" "2012-01-01" "2013-01-01" "2014-01-01"
[46] "2015-01-01" "2016-01-01" "2017-01-01" "2018-01-01" "2019-01-01"
[51] "2020-01-01"
USA<-ggplot(data=fdi_data,aes(x=date_time,y=FDI_USA....of.GDP.))+geom_line()+
labs(title="FDI Inflow in the USA from 1970 to 2020",
caption="", y="Foreign Direct Investment", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "2 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
Kenya<-ggplot(data=fdi_data,aes(x=date_time,y=FDI_Kenya....of.GDP.))+geom_line()+
labs(title="FDI Inflow in Kenya from 1970 to 2020",
caption="", y="Foreign Direct Investment", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "2 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
UK<-ggplot(data=fdi_data,aes(x=date_time,y=FDI_UK....of.GDP.))+geom_line()+
labs(title="FDI Inflow in UK from 1970 to 2020",
caption="", y="Foreign Direct Investment", x="Time in Years", color="Key")+
scale_x_date( date_labels = "%Y-%b", breaks = "2 years")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8))
grid.newpage()
grid.draw(rbind(ggplotGrob(USA),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(Kenya),size="last"))
grid.newpage()
grid.draw(rbind(ggplotGrob(UK),size="last"))
ggplot(data=fdi_data,aes(x=year))+
geom_line(aes(y=FDI_USA....of.GDP.,colour="FDI_USA"))+
geom_line(aes(y=FDI_UK....of.GDP.,colour="FDI_UK"))+
geom_line(aes(y=FDI_Kenya....of.GDP.,colour="FDI_Kenya"))+
labs(title="Trends of FDI inflow as a percentage of GDP from 1970 to 2020",
caption="", y="FDI", x="Time in Years",color="Key")+
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 10))
Trends of FDI inflow as a percentage of GDP from 1970 to 2020
ForDI<-read.csv("C:\\Users\\user\\Downloads\\ForDI.csv")
attach(ForDI)
The following object is masked _by_ .GlobalEnv:
Kenya
head(ForDI,10)
ForDI_TR<- ts(data=ForDI[,-1],start = min(ForDI$Year),end =max(ForDI$Year))
plot(ForDI_TR,plot.type = "single",col=1:7)
legend("topleft",legend = colnames(ForDI_TR),ncol = 2,lty = 1, col = 1:7,cex = 0.9)
Time series plot of Foreign Direct Investment
plot(ForDI_TR)
ARIMA models provide another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely used approaches to time series forecasting, and provide complementary approaches to the problem. While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data. Before we introduce ARIMA models, we must first discuss the concept of stationarity and the technique of differencing time series.
A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time. Some cases can be confusing — a time series with cyclic behaviour (but with no trend or seasonality) is stationary. This is because the cycles are not of a fixed length, so before we observe the series we cannot be sure where the peaks and troughs of the cycles will be. In general, a stationary time series will have no predictable patterns in the long-term. Time plots will show the series to be roughly horizontal (although some cyclic behaviour is possible), with constant variance.
One way to determine more objectively whether differencing is required is to use a unit root test. These are statistical hypothesis tests of stationarity that are designed for determining whether differencing is required. A number of unit root tests are available, which are based on different assumptions and may lead to conflicting answers. In our analysis, we use the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test (Kwiatkowski, Phillips, Schmidt, & Shin, 1992). In this test, the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values (e.g., less than 0.05) suggest that differencing is required. The test can be computed using the ur.kpss() function from the urca package.
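A minimal sketch of the KPSS test described above, assuming the urca package is installed. It is shown here on a simulated random walk purely for illustration; the gdp_TS series constructed below would be a natural series to test in the same way.
library(urca)
set.seed(100)
y <- ts(cumsum(rnorm(200)))   # illustrative random-walk (non-stationary) series
kpss_out <- ur.kpss(y)        # null hypothesis: the series is stationary
summary(kpss_out)             # test statistic versus its critical values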
gdp_data<-read.csv("C:\\Users\\user\\Downloads\\ARIMA.csv")
head(gdp_data,5)
str(gdp_data)
'data.frame': 204 obs. of 2 variables:
$ Year : int 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 ...
$ realgdp: num 1610 1659 1723 1754 1774 ...
#Class of the data
class(gdp_data)
[1] "data.frame"
gdp_TS<- ts(data=gdp_data[,-1],start = min(gdp_data$Year),end =max(gdp_data$Year))
class(gdp_TS)
[1] "ts"
#Plot the dataset
plot(gdp_TS)
#Augmented Dickey-Fuller Test (unit root test)
adf.test(gdp_TS)
Warning in adf.test(gdp_TS): p-value greater than printed p-value
Augmented Dickey-Fuller Test
data: gdp_TS
Dickey-Fuller = 0.28354, Lag order = 5, p-value = 0.99
alternative hypothesis: stationary
From the result above, the p-value (0.99) is much larger than 0.05, so we fail to reject the null hypothesis of a unit root: the GDP series is not stationary.
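As a quick cross-check, not part of the original analysis, forecast::ndiffs() estimates how many differences are needed to make the series stationary, and the ADF test can be re-run on the differenced series; compare the result with the d = 2 selected by auto.arima below.
#How many differences are needed for stationarity?
ndiffs(gdp_TS)
#Re-test stationarity after second differencing
adf.test(diff(gdp_TS, differences = 2))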
#Check for autocorrelation (ACF)
acf(gdp_TS)
pacf(gdp_TS)
#Run auto.arima, which automatically converts the non-stationary data to stationary data by differencing
auto.arima(gdp_TS,ic="aic", trace = TRUE)
Fitting models using approximations to speed things up...
ARIMA(2,2,2) : 2039.769
ARIMA(0,2,0) : 2114.799
ARIMA(1,2,0) : 2073.859
ARIMA(0,2,1) : 2049.559
ARIMA(1,2,2) : 2047.647
ARIMA(2,2,1) : 2037.924
ARIMA(1,2,1) : 2046.277
ARIMA(2,2,0) : 2067.973
ARIMA(3,2,1) : 2041.463
ARIMA(3,2,0) : 2064.69
ARIMA(3,2,2) : 2043.092
Now re-fitting the best model(s) without approximations...
ARIMA(2,2,1) : 2054.273
Best model: ARIMA(2,2,1)
Series: gdp_TS
ARIMA(2,2,1)
Coefficients:
ar1 ar2 ma1
0.2633 0.1442 -0.9656
s.e. 0.0725 0.0726 0.0195
sigma^2 = 1477: log likelihood = -1023.14
AIC=2054.27 AICc=2054.48 BIC=2067.51
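As a sanity check on the selected order, the same specification can be fitted directly with forecast::Arima() and compared with a close competitor from the trace above. This is a minimal sketch, not part of the original analysis.
#Fit the chosen specification manually and inspect it
manual_fit <- Arima(gdp_TS, order = c(2, 2, 1))
summary(manual_fit)
#Compare the AIC of a competing specification from the trace
AIC(Arima(gdp_TS, order = c(2, 2, 2)))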
ARIMAMODEL<-auto.arima(gdp_TS,ic="aic", trace = TRUE)
Fitting models using approximations to speed things up...
ARIMA(2,2,2) : 2039.769
ARIMA(0,2,0) : 2114.799
ARIMA(1,2,0) : 2073.859
ARIMA(0,2,1) : 2049.559
ARIMA(1,2,2) : 2047.647
ARIMA(2,2,1) : 2037.924
ARIMA(1,2,1) : 2046.277
ARIMA(2,2,0) : 2067.973
ARIMA(3,2,1) : 2041.463
ARIMA(3,2,0) : 2064.69
ARIMA(3,2,2) : 2043.092
Now re-fitting the best model(s) without approximations...
ARIMA(2,2,1) : 2054.273
Best model: ARIMA(2,2,1)
ARIMAMODEL
Series: gdp_TS
ARIMA(2,2,1)
Coefficients:
ar1 ar2 ma1
0.2633 0.1442 -0.9656
s.e. 0.0725 0.0726 0.0195
sigma^2 = 1477: log likelihood = -1023.14
AIC=2054.27 AICc=2054.48 BIC=2067.51
summary(ARIMAMODEL)
Series: gdp_TS
ARIMA(2,2,1)
Coefficients:
ar1 ar2 ma1
0.2633 0.1442 -0.9656
s.e. 0.0725 0.0726 0.0195
sigma^2 = 1477: log likelihood = -1023.14
AIC=2054.27 AICc=2054.48 BIC=2067.51
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 2.803157 37.96217 29.06838 0.02152401 0.7154865 0.6297229
ACF1
Training set -0.008628346
#Check the model residuals for remaining autocorrelation (ACF/PACF)
acf(ts(ARIMAMODEL$residuals))
pacf(ts(ARIMAMODEL$residuals))
From the many articles we have read, the ACF is used to identify the order of the Moving Average (MA) term, and the PACF the order of the Autoregressive (AR) term. From the results above, the MA term has one lag, with the AR term having two lags.
#Forecasting
MYGDPF<-forecast(ARIMAMODEL,level=c(95), h=50)
MYGDPF
Point Forecast Lo 95 Hi 95
2021 9357.475 9282.142 9432.809
2022 9415.513 9292.099 9538.927
2023 9476.136 9306.002 9646.271
2024 9538.083 9325.159 9751.008
2025 9600.752 9347.949 9853.555
2026 9663.801 9373.544 9954.058
2027 9727.055 9401.215 10052.895
2028 9790.417 9430.445 10150.390
2029 9853.838 9460.849 10246.827
2030 9917.289 9492.143 10342.436
2031 9980.757 9524.114 10437.401
2032 10044.234 9556.599 10531.870
2033 10107.716 9589.474 10625.958
2034 10171.200 9622.640 10719.761
2035 10234.686 9656.019 10813.352
2036 10298.172 9689.550 10906.793
2037 10361.658 9723.181 11000.136
2038 10425.145 9756.870 11093.420
2039 10488.632 9790.583 11186.681
2040 10552.119 9824.290 11279.948
2041 10615.606 9857.967 11373.244
2042 10679.093 9891.594 11466.592
2043 10742.580 9925.152 11560.008
2044 10806.067 9958.626 11653.508
2045 10869.554 9992.003 11747.105
2046 10933.041 10025.271 11840.810
2047 10996.528 10058.422 11934.634
2048 11060.015 10091.445 12028.584
2049 11123.502 10124.335 12122.669
2050 11186.989 10157.083 12216.894
2051 11250.476 10189.685 12311.266
2052 11313.963 10222.136 12405.790
2053 11377.450 10254.430 12500.470
2054 11440.937 10286.565 12595.309
2055 11504.424 10318.536 12690.312
2056 11567.911 10350.341 12785.481
2057 11631.398 10381.976 12880.819
2058 11694.885 10413.441 12976.329
2059 11758.372 10444.733 13072.011
2060 11821.859 10475.849 13167.869
2061 11885.346 10506.789 13263.903
2062 11948.833 10537.552 13360.114
2063 12012.320 10568.135 13456.505
2064 12075.807 10598.538 13553.075
2065 12139.294 10628.761 13649.826
2066 12202.781 10658.803 13746.759
2067 12266.268 10688.663 13843.873
2068 12329.755 10718.340 13941.170
2069 12393.242 10747.834 14038.649
2070 12456.729 10777.146 14136.312
plot(MYGDPF,type="l",main="Time Series plot of Forecasted GDP (ARIMA 2,2,1)",xlab="Time in Years",ylab="MYGDPF")
ARIMA forecast
#Diagnosing the ARIMA Model/Validating the Model
Box.test(MYGDPF$resid, lag=5,type="Ljung-Box")
Box-Ljung test
data: MYGDPF$resid
X-squared = 1.6084, df = 5, p-value = 0.9002
Box.test(MYGDPF$resid, lag=2,type="Ljung-Box")
Box-Ljung test
data: MYGDPF$resid
X-squared = 0.025225, df = 2, p-value = 0.9875
Box.test(MYGDPF$resid, lag=20,type="Ljung-Box")
Box-Ljung test
data: MYGDPF$resid
X-squared = 23.863, df = 20, p-value = 0.2484
From the results, ARIMA(2,2,1) is adequate for prediction and shows no autocorrelation issue: the Box-Ljung tests give large p-values (e.g. 0.9875 at lag 2), so we fail to reject the null hypothesis of no residual autocorrelation.
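forecast::checkresiduals() bundles the same diagnostics (residual time plot, ACF, histogram, and a Ljung-Box test) into a single call; a minimal sketch, assuming the forecast package is loaded:
#One-call residual diagnostics for the fitted ARIMA model
checkresiduals(ARIMAMODEL)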
autoplot(forecast(MYGDPF))
ARIMA 2,2,1 Forecast
RESID<-residuals(ARIMAMODEL)
summary(RESID)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-127.529 -20.085 2.219 2.803 25.356 154.703
plot(RESID)
abline(2.803,0)
Plot of the residuals with a horizontal line at the residual mean (2.803)
In today’s lesson, you’ll learn the basics of the vector autoregressive model. We lay the foundation for getting started with this crucial multivariate time series model and cover the important details, from specification and lag selection to estimation, Granger causality, impulse response functions, and forecast error variance decomposition.
The vector autoregressive (VAR) model is a workhorse multivariate time series model that relates current observations of a variable to past observations of itself and past observations of the other variables in the system. VAR models differ from univariate autoregressive models because they allow feedback between the variables in the model. For example, we could use a VAR model to show how the inflation rate is a function of the unemployment rate and how the unemployment rate is, in turn, a function of the inflation rate. In economics we might also be interested in establishing a bi-directional relationship between personal income and personal consumption spending; a two-equation VAR system can be used to model the relationship between income and consumption over time.
VAR models are widely used in finance and econometrics because they offer a framework for accomplishing important modeling goals such as data description, forecasting, structural inference, and policy analysis (Stock and Watson 2001).
There are three broad types of VAR models: the reduced-form, the recursive, and the structural VAR model. The recursive model contains all the components of the reduced-form model but also allows some variables to be functions of other contemporaneous variables. By imposing these short-run relationships, the recursive model allows us to model structural shocks.
A VAR model is made up of a system of equations that represents the relationships between multiple variables. When referring to VAR models, we use special language to specify the number of variables and the number of lags: in general, a VAR(p) model is composed of n equations (one for each of the n endogenous variables) and includes p lags of the variables.
Lag selection is one of the most important aspects of VAR model specification. In practical applications, we generally choose a maximum number of lags and evaluate the performance of the model for each candidate lag length up to that maximum. The optimal model is then the VAR(p) which minimizes some lag selection criterion. The most commonly used lag selection criteria are the Akaike information criterion (AIC), the Hannan-Quinn criterion (HQ), the Schwarz (Bayesian) criterion (SC), and the final prediction error (FPE).
From an estimation standpoint, it is important to be deliberate about how many variables we include in our VAR model: each additional variable increases the number of coefficients to be estimated in every equation and for every lag.
Deciding what variables to include in a VAR model should be founded in theory, as much as possible. We can use additional tools, like Granger causality or Sims causality, to test the forecasting relevance of variables.
Granger causality tests whether a variable is “helpful” for forecasting the behavior of another variable. It is important to note that Granger causality only allows us to make inferences about forecasting capability, not about true causality.
Despite their seeming complexity, VAR models are quite easy to estimate. Each equation can be estimated using ordinary least squares given a few assumptions:
The error term has a conditional mean of zero.
The variables in the model are stationary.
Large outliers are unlikely.
No perfect multicollinearity.
Under these assumptions, the ordinary least squares estimates:
Will be consistent.
Can be evaluated using traditional t-statistics and p-values.
Can be used to jointly test restrictions across multiple equations.
One of the most important functions of VAR models is to generate forecasts. Forecasts are generated using an iterative algorithm: the one-step-ahead forecast is computed from each estimated equation, that forecast is then treated as an additional observation, and the process is repeated to build up forecasts at longer horizons.
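A minimal sketch of how such forecasts can be produced with the vars package, not part of the original analysis; it assumes a model has already been fitted with VAR(), such as the EST object estimated later in this section.
#Iterative h-step-ahead forecasts from a fitted vars::VAR object
var_fc <- predict(EST, n.ahead = 8, ci = 0.95)
var_fc            # point forecasts with lower/upper bounds for each equation
fanchart(var_fc)  # fan chart of the forecasts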
Often we are more interested in the dynamics that are predicted by our VAR models than in the actual coefficients that are estimated. For this reason, VAR studies most commonly report Granger-causality statistics, impulse response functions, and forecast error variance decompositions.
As we previously discussed, Granger-causality statistics test whether one variable is statistically significant when predicting another variable. In other words, the Granger-causality statistics are F-statistics that test whether the coefficients of all lags of a variable are jointly equal to zero in the equation for another variable. As the p-value of the F-statistic decreases, the evidence that a variable is relevant for predicting another variable increases.
For example, in the Granger-causality test of x on y, if the p-value is 0.02 we would say that x does help predict y at the 5% level. However, if the p-value is 0.3 we would say that there is no evidence that x helps predict y.
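A minimal sketch of a pairwise Granger test with lmtest::grangertest(), using hypothetical simulated series x and y so that the expected outcome is known by construction:
library(lmtest)
set.seed(42)
x <- rnorm(200)
y <- c(0, 0.5 * head(x, -1)) + rnorm(200)   # y depends on lagged x by construction
grangertest(y ~ x, order = 2)   # expected: small p-value, x helps predict y
grangertest(x ~ y, order = 2)   # reverse direction, expected to be insignificant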
The impulse response function traces the dynamic path of the variables in the system in response to a shock to one of the variables: one variable is shocked by one standard deviation (of its orthogonalized error), and the effect on every variable in the system is traced over subsequent periods.
Forecast error decomposition separates the forecast error variance into proportions attributed to each variable in the model. Intuitively, this measure helps us judge how much of an impact one variable has on another variable in the VAR model and how intertwined our variables’ dynamics are. For example, if X is responsible for 85% of the forecast error variance of Y, it is explaining a large amount of the forecast variation in Y. However, if X is only responsible for 20% of the forecast error variance of Y, much of the forecast error variance of Y is left unexplained by X.
VAR models are an essential component of multivariate time series modeling. After today’s blog, you should have a better understanding of the fundamentals of the VAR model, including specification and lag selection, estimation, Granger causality, impulse response functions, and forecast error variance decomposition.
var<-read.csv("C:\\Users\\user\\Downloads\\estimation.csv")
head(var,10)
Inflation<-ts(var$Inflation,start=c(2000,1),frequency = 12)
plot.ts(Inflation,type="l",main="Time Series plot Inflation",xlab="Year",ylab="inflation")
Time series plot of inflation
unemployment<-ts(var$unemployment,start=c(2000,1),frequency = 12)
plot.ts(unemployment,type="l",main="Time Series plot unemployment",xlab="Year",ylab="unemployment")
Time Series plot unemployment
The Augmented Dickey-Fuller test (ADF test) is a common statistical test used to determine whether a given time series is stationary. It is one of the most commonly used statistical tests for analyzing the stationarity of a series.
The null hypothesis, however, is the same as in the Dickey-Fuller test. A key point to remember is that the null hypothesis assumes the presence of a unit root (alpha = 1), so the p-value obtained should be less than the significance level (say 0.05) in order to reject the null hypothesis and infer that the series is stationary. A very common mistake analysts make with this test is to take a p-value below the significance level as evidence that the series is non-stationary, when it in fact indicates stationarity.
ADF<-adf.test(Inflation)
Warning in adf.test(Inflation): p-value smaller than printed p-value
ADF
Augmented Dickey-Fuller Test
data: Inflation
Dickey-Fuller = -5.7974, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
ADF1<-adf.test(unemployment)
Warning in adf.test(unemployment): p-value smaller than printed p-value
ADF1
Augmented Dickey-Fuller Test
data: unemployment
Dickey-Fuller = -4.0617, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
lag<-VARselect(var, lag.max=4)
lag
$selection
AIC(n) HQ(n) SC(n) FPE(n)
1 1 1 1
$criteria
1 2 3 4
AIC(n) -0.093318151 -0.08006220 -0.08372238 -0.07044136
HQ(n) -0.052691480 -0.01235108 0.01107318 0.05143865
SC(n) 0.007032298 0.08718855 0.15042866 0.23060999
FPE(n) 0.910908005 0.92307937 0.91974245 0.93210286
The four criteria above indicate that the appropriate number of lags is one; including more than one lag in the model would be inappropriate.
EST<-VAR(var, p=1, type = "none")
summary(EST)
VAR Estimation Results:
=========================
Endogenous variables: Inflation, unemployment
Deterministic variables: none
Sample size: 199
Log Likelihood: -549.51
Roots of the characteristic polynomial:
0.43 0.43
Call:
VAR(y = var, p = 1, type = "none")
Estimation results for equation Inflation:
==========================================
Inflation = Inflation.l1 + unemployment.l1
Estimate Std. Error t value Pr(>|t|)
Inflation.l1 0.25493 0.06887 3.702 0.000278 ***
unemployment.l1 -0.05589 0.04771 -1.171 0.242875
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9565 on 197 degrees of freedom
Multiple R-Squared: 0.06981, Adjusted R-squared: 0.06037
F-statistic: 7.392 on 2 and 197 DF, p-value: 0.0008023
Estimation results for equation unemployment:
=============================================
unemployment = Inflation.l1 + unemployment.l1
Estimate Std. Error t value Pr(>|t|)
Inflation.l1 0.58693 0.07050 8.325 0.000000000000014 ***
unemployment.l1 0.59648 0.04885 12.211 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9792 on 197 degrees of freedom
Multiple R-Squared: 0.5342, Adjusted R-squared: 0.5295
F-statistic: 113 on 2 and 197 DF, p-value: < 0.00000000000000022
Covariance matrix of residuals:
Inflation unemployment
Inflation 0.9104 -0.0391
unemployment -0.0391 0.9578
Correlation matrix of residuals:
Inflation unemployment
Inflation 1.00000 -0.04187
unemployment -0.04187 1.00000
#Use stargazer to organize the results
stargazer(EST[["varresult"]], type='text')
===========================================================
Dependent variable:
----------------------------
y
(1) (2)
-----------------------------------------------------------
Inflation.l1 0.255*** 0.587***
(0.069) (0.071)
unemployment.l1 -0.056 0.596***
(0.048) (0.049)
-----------------------------------------------------------
Observations 199 199
R2 0.070 0.534
Adjusted R2 0.060 0.530
Residual Std. Error (df = 197) 0.957 0.979
F Statistic (df = 2; 197) 7.392*** 112.979***
===========================================================
Note: *p<0.1; **p<0.05; ***p<0.01
In practice, you usually do not interpret the individual coefficients of a VAR model, and there is a fairly intuitive reason for that. As noted above, VARs assume that all relevant variables affect each other through time as one system, so much so that in practice a VAR estimates as many equations as there are variables. For example, if you estimate a 3-variable VAR, the output consists of 3 estimated equations. This means that each variable gets the “opportunity” of being endogenous while the others act as predictors. Since the VAR assumes this, interpreting individual coefficients would imply that an x-variable is causally affecting the y-variable, which, by construction, a VAR does not assume. That is why most analyses skip coefficient interpretation and jump to other tools, such as impulse-response functions and their graphs.
Strictly speaking, the coefficients of a reduced-form VAR have the same interpretation as the coefficients in any regression, i.e., they are the estimated partial derivatives of the dependent variable with respect to the lagged explanatory variables. However, we are typically not interested in these coefficients, but rather in the dynamic properties of the whole system. After all, a VAR is intended to summarize the dynamic behavior of a group of variables, all of which are endogenous with respect to one another. So, we typically focus on the impulse response functions (IRFs) of the VAR, which provide the (cumulative) total derivatives of the endogenous variables with respect to an exogenous shock to one of the variables. More technically, calculating IRFs is equivalent to inverting the vector autoregressive representation to obtain the vector moving average representation. Note that doing so requires the autoregressive representation to be invertible, meaning no unit roots in the underlying data.
However, there is an additional complication, which gets to the root of the difference between VARs and the linear simultaneous equation models of the 1970s that they replaced. Since only lags appear as regressors in a VAR, any contemporaneous correlations between variables are captured in the residual covariance matrix. But IRFs become invalid when the errors are correlated, since the IRF calculation explicitly assumes that only one variable is being shocked at t=0. Therefore, in addition to the reduced-form coefficient matrix being invertible, we require a means by which the residuals from the VAR can be rendered orthogonal (uncorrelated) with respect to one another. This is how we get from a reduced-form to a structural VAR whose IRFs are valid. Unfortunately, the orthogonality requirement alone is not sufficient to produce a unique structural representation. In other words, there exist infinitely many different structural representations of the VAR that (a) are consistent with the estimated reduced-form parameters of the VAR, and (b) possess orthogonal shocks. One popular example is the Cholesky factorization of the residual covariance matrix.
To summarize, the IRFs of a VAR are much easier to interpret than the parameters of the reduced-form VAR. But calculating valid IRFs requires (a) stationary (I(0)) endogenous variables (or cointegration of the variables) and (b) some form of factorization of the residual covariance matrix that renders the resulting shocks orthogonal with respect to one another. There are other more subtle requirements, such as linearity of the underlying data generating process, but these are the two big ones.
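To make the factorization idea concrete, here is a small sketch, not part of the original analysis, that computes a Cholesky factor from the reduced-form residuals of the EST model fitted above; vars::irf() performs an orthogonalization of this kind internally when ortho = TRUE, its default.
#Cholesky factor of the VAR residual covariance matrix
Sigma <- cov(residuals(EST))
P <- t(chol(Sigma))             # lower-triangular factor: P %*% t(P) = Sigma
P
round(P %*% t(P) - Sigma, 10)   # numerically zero: factorization check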
roots(EST, modulus=TRUE)
[1] 0.4299625 0.4299625
The model is stable since the roots lie inside the unit circle, that is, their moduli are less than one.
GRA<-causality(EST, cause = "Inflation")
GRA
$Granger
Granger causality H0: Inflation do not Granger-cause unemployment
data: VAR object EST
F-Test = 69.304, df1 = 1, df2 = 394, p-value = 0.000000000000001332
$Instant
H0: No instantaneous causality between: Inflation and unemployment
data: VAR object EST
Chi-squared = 0.38572, df = 1, p-value = 0.5346
GRA1<-causality(EST, cause = "unemployment")
GRA1
$Granger
Granger causality H0: unemployment do not Granger-cause Inflation
data: VAR object EST
F-Test = 1.372, df1 = 1, df2 = 394, p-value = 0.2422
$Instant
H0: No instantaneous causality between: unemployment and Inflation
data: VAR object EST
Chi-squared = 0.38572, df = 1, p-value = 0.5346
From the analysis, it was established that inflation Granger-causes unemployment; however, unemployment does not Granger-cause inflation.
IRF1<-irf(EST, impulse = "Inflation", response = "unemployment",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
IRF1
Impulse response coefficients
$Inflation
unemployment
[1,] -0.0431533864513
[2,] 0.5356735487953
[3,] 0.4640598613174
[4,] 0.2960802457858
[5,] 0.1662984086264
[6,] 0.0868538042711
[7,] 0.0432056980416
[8,] 0.0207296508563
[9,] 0.0096622621431
[10,] 0.0043943820861
[11,] 0.0019552162997
[12,] 0.0008523272496
[13,] 0.0003642305110
[14,] 0.0001525446726
[15,] 0.0000625448388
[16,] 0.0000250512251
[17,] 0.0000097665459
[18,] 0.0000036842516
[19,] 0.0000013313196
[20,] 0.0000004524105
[21,] 0.0000001390725
Lower Band, CI= 0.95
$Inflation
unemployment
[1,] -0.169171500436
[2,] 0.376654126141
[3,] 0.279313868516
[4,] 0.145139507875
[5,] 0.063398901744
[6,] 0.021161699257
[7,] -0.005481819418
[8,] -0.013567116845
[9,] -0.011108438313
[10,] -0.007477150215
[11,] -0.004163985660
[12,] -0.002051689657
[13,] -0.001004384246
[14,] -0.000461231348
[15,] -0.000180387432
[16,] -0.000080200060
[17,] -0.000015417320
[18,] -0.000006248758
[19,] -0.000015273511
[20,] -0.000011951222
[21,] -0.000008943788
Upper Band, CI= 0.95
$Inflation
unemployment
[1,] 0.0850139786
[2,] 0.6681237588
[3,] 0.5941520957
[4,] 0.4280824180
[5,] 0.2855338709
[6,] 0.1825943775
[7,] 0.1157773305
[8,] 0.0733199290
[9,] 0.0480346920
[10,] 0.0314476673
[11,] 0.0206537647
[12,] 0.0139051638
[13,] 0.0093772944
[14,] 0.0063337296
[15,] 0.0042801623
[16,] 0.0028938776
[17,] 0.0019362835
[18,] 0.0012880042
[19,] 0.0008568245
[20,] 0.0005700235
[21,] 0.0003792451
plot(IRF1, ylab="Unemployment",
main="Unemployment response to an Inflation shock")
Unemployment response to an inflation shock
Usually, impulse response functions are interpreted along the lines of “a one standard deviation shock to x causes a significant increase (decrease) in y for m periods (determined by the length of the period for which the SE bands stay above or below zero), after which the effect dissipates.” The IRF curve above shows that a one standard deviation shock to inflation causes unemployment to increase for roughly the first three months and then fall back; the effect does not last long, dissipating after about three months.
IRF2<-irf(EST, impulse = "unemployment", response = "Inflation",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF2, ylab="Inflation",
main="Inflation response to an unemployment shock")
Inflation response to an unemployment shock
From the graph above, a one standard deviation shock to unemployment causes inflation to decrease for up to about two months and then to increase. However, the confidence band contains zero, which implies that the effect of an unemployment shock on inflation is not statistically significant. This result is consistent with the Granger-causality results above, which indicated that inflation Granger-causes unemployment but unemployment does not Granger-cause inflation.
IRF3<-irf(EST, impulse = "Inflation", response = "Inflation",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF3, ylab="Inflation",
main="Inflation response to Inflation shock")
Inflation response to Inflation shock
IRF4<-irf(EST, impulse = "unemployment", response = "unemployment",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF4, ylab="unemployment",
main="unemployment response to unemployment shock")
unemployment response to unemployment shock
The variance decomposition indicates the amount of information each variable contributes to the other variables in the autoregression. It determines how much of the forecast error variance of each of the variables can be explained by exogenous shocks to the other variables.
VARD<-fevd(EST, n.ahead=20)
VARD
$Inflation
Inflation unemployment
[1,] 1.0000000 0.000000000
[2,] 0.9969451 0.003054924
[3,] 0.9947479 0.005252059
[4,] 0.9938673 0.006132705
[5,] 0.9935938 0.006406153
[6,] 0.9935202 0.006479764
[7,] 0.9935022 0.006497786
[8,] 0.9934981 0.006501898
[9,] 0.9934972 0.006502784
[10,] 0.9934970 0.006502966
[11,] 0.9934970 0.006503001
[12,] 0.9934970 0.006503008
[13,] 0.9934970 0.006503009
[14,] 0.9934970 0.006503010
[15,] 0.9934970 0.006503010
[16,] 0.9934970 0.006503010
[17,] 0.9934970 0.006503010
[18,] 0.9934970 0.006503010
[19,] 0.9934970 0.006503010
[20,] 0.9934970 0.006503010
$unemployment
Inflation unemployment
[1,] 0.001942077 0.9980579
[2,] 0.182061322 0.8179387
[3,] 0.265135962 0.7348640
[4,] 0.293687415 0.7063126
[5,] 0.302330557 0.6976694
[6,] 0.304683598 0.6953164
[7,] 0.305269961 0.6947300
[8,] 0.305405969 0.6945940
[9,] 0.305435712 0.6945643
[10,] 0.305441899 0.6945581
[11,] 0.305443130 0.6945569
[12,] 0.305443365 0.6945566
[13,] 0.305443408 0.6945566
[14,] 0.305443415 0.6945566
[15,] 0.305443416 0.6945566
[16,] 0.305443417 0.6945566
[17,] 0.305443417 0.6945566
[18,] 0.305443417 0.6945566
[19,] 0.305443417 0.6945566
[20,] 0.305443417 0.6945566
plot(VARD)
Variance decomposition function for inflation and unemployment
my_var<-read.csv("C:\\Users\\user\\Downloads\\my_var.csv")
attach(my_var)
The following objects are masked _by_ .GlobalEnv:
Inflation, Unemployment
The following objects are masked from unemploymnt_data (pos = 7):
Inflation, Unemployment
The following objects are masked from unemploymnt_data (pos = 8):
Inflation, Unemployment
head(my_var,15)
Inflation<-ts(my_var$Inflation,start=c(1960),frequency = 4)
plot.ts(Inflation,type="l",main="Time Series plot Inflation",xlab="Year",ylab="inflation")
Time series plot of inflation
Unemployment<-ts(my_var$Unemployment,start=c(1960),frequency = 4)
plot.ts(Unemployment,type="l",main="Time Series plot Unemployment",xlab="Year",ylab="Unemployment")
Time series plot of unemployment
Fed.Rate<-ts(my_var$Fed.Rate,start=c(1960),frequency = 4)
plot.ts(Fed.Rate,type="l",main="Time Series plot of FedRate",xlab="Year",ylab="FedRate")
Time series plot of FedRate
URT<-adf.test(my_var$Inflation)
URT
Augmented Dickey-Fuller Test
data: my_var$Inflation
Dickey-Fuller = -2.2805, Lag order = 5, p-value = 0.4593
alternative hypothesis: stationary
URT1<-adf.test(my_var$Unemployment)
URT1
Augmented Dickey-Fuller Test
data: my_var$Unemployment
Dickey-Fuller = -2.0581, Lag order = 5, p-value = 0.5521
alternative hypothesis: stationary
URT2<-adf.test(my_var$Fed.Rate)
URT2
Augmented Dickey-Fuller Test
data: my_var$Fed.Rate
Dickey-Fuller = -3.3068, Lag order = 5, p-value = 0.07246
alternative hypothesis: stationary
lag<-VARselect(my_var, lag.max=4)
lag
$selection
AIC(n) HQ(n) SC(n) FPE(n)
3 3 2 3
$criteria
1 2 3 4
AIC(n) -2.6746095 -3.15420316 -3.28434359 -3.25963208
HQ(n) -2.5809554 -2.99030845 -3.05020830 -2.95525620
SC(n) -2.4439714 -2.75058659 -2.70774850 -2.51005846
FPE(n) 0.0689359 0.04267955 0.03748351 0.03844391
VAR_mdl<-VAR(my_var, p=3, type = "none")
VAR_mdl
VAR Estimation Results:
=======================
Estimated coefficients for equation Unemployment:
=================================================
Call:
Unemployment = Unemployment.l1 + Inflation.l1 + Fed.Rate.l1 + Unemployment.l2 + Inflation.l2 + Fed.Rate.l2 + Unemployment.l3 + Inflation.l3 + Fed.Rate.l3
Unemployment.l1 Inflation.l1 Fed.Rate.l1 Unemployment.l2 Inflation.l2
1.540461079 0.036092543 0.015635976 -0.567700896 -0.026670658
Fed.Rate.l2 Unemployment.l3 Inflation.l3 Fed.Rate.l3
0.034688923 -0.011735675 -0.003682298 -0.019977419
Estimated coefficients for equation Inflation:
==============================================
Call:
Inflation = Unemployment.l1 + Inflation.l1 + Fed.Rate.l1 + Unemployment.l2 + Inflation.l2 + Fed.Rate.l2 + Unemployment.l3 + Inflation.l3 + Fed.Rate.l3
Unemployment.l1 Inflation.l1 Fed.Rate.l1 Unemployment.l2 Inflation.l2
-1.01182762 0.68232932 0.14845919 1.56424639 0.09094890
Fed.Rate.l2 Unemployment.l3 Inflation.l3 Fed.Rate.l3
-0.06004687 -0.57197523 0.18422864 -0.05318817
Estimated coefficients for equation Fed.Rate:
=============================================
Call:
Fed.Rate = Unemployment.l1 + Inflation.l1 + Fed.Rate.l1 + Unemployment.l2 + Inflation.l2 + Fed.Rate.l2 + Unemployment.l3 + Inflation.l3 + Fed.Rate.l3
Unemployment.l1 Inflation.l1 Fed.Rate.l1 Unemployment.l2 Inflation.l2
-1.5314206 0.0541425 0.9752619 1.3715074 0.2532385
Fed.Rate.l2 Unemployment.l3 Inflation.l3 Fed.Rate.l3
-0.3649440 0.1113407 -0.1461168 0.3350045
stargazer(VAR_mdl[["varresult"]], type='text')
====================================================================
Dependent variable:
-------------------------------------
y
(1) (2) (3)
--------------------------------------------------------------------
Unemployment.l1 1.540*** -1.012** -1.531***
(0.088) (0.389) (0.343)
Inflation.l1 0.036** 0.682*** 0.054
(0.018) (0.079) (0.070)
Fed.Rate.l1 0.016 0.148 0.975***
(0.022) (0.096) (0.084)
Unemployment.l2 -0.568*** 1.564** 1.372**
(0.147) (0.646) (0.570)
Inflation.l2 -0.027 0.091 0.253***
(0.022) (0.096) (0.085)
Fed.Rate.l2 0.035 -0.060 -0.365***
(0.030) (0.133) (0.117)
Unemployment.l3 -0.012 -0.572 0.111
(0.086) (0.376) (0.332)
Inflation.l3 -0.004 0.184** -0.146**
(0.019) (0.084) (0.074)
Fed.Rate.l3 -0.020 -0.053 0.335***
(0.022) (0.098) (0.087)
--------------------------------------------------------------------
Observations 161 161 161
R2 0.999 0.955 0.986
Adjusted R2 0.999 0.952 0.985
Residual Std. Error (df = 152) 0.231 1.015 0.895
F Statistic (df = 9; 152) 12,729.890*** 356.559*** 1,193.802***
====================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
roots(VAR_mdl, modulus=TRUE)
[1] 0.99513614 0.94079541 0.76174764 0.76174764 0.59992529 0.59992529 0.38418659
[8] 0.38418659 0.08928642
#GRANGER CAUSALITY TEST
GrI<-causality(VAR_mdl, cause = "Inflation")
GrI
$Granger
Granger causality H0: Inflation do not Granger-cause Unemployment
Fed.Rate
data: VAR object VAR_mdl
F-Test = 6.0224, df1 = 6, df2 = 456, p-value = 0.000004455
$Instant
H0: No instantaneous causality between: Inflation and Unemployment
Fed.Rate
data: VAR object VAR_mdl
Chi-squared = 4.219, df = 2, p-value = 0.1213
GrU<-causality(VAR_mdl, cause = "Unemployment")
GrU
$Granger
Granger causality H0: Unemployment do not Granger-cause Inflation
Fed.Rate
data: VAR object VAR_mdl
F-Test = 5.6633, df1 = 6, df2 = 456, p-value = 0.00001088
$Instant
H0: No instantaneous causality between: Unemployment and Inflation
Fed.Rate
data: VAR object VAR_mdl
Chi-squared = 23.73, df = 2, p-value = 0.000007031
GrF<-causality(VAR_mdl, cause = "Fed.Rate")
GrF
$Granger
Granger causality H0: Fed.Rate do not Granger-cause Unemployment
Inflation
data: VAR object VAR_mdl
F-Test = 3.2328, df1 = 6, df2 = 456, p-value = 0.004033
$Instant
H0: No instantaneous causality between: Fed.Rate and Unemployment
Inflation
data: VAR object VAR_mdl
Chi-squared = 26.054, df = 2, p-value = 0.0000022
From the analysis, it was established that inflation, unemployment and the fed rate all Granger-cause one another.
IRF_1<-irf(VAR_mdl, impulse = "Inflation", response = "Unemployment",
n.ahead=20,boot = TRUE, run=100, ci=0.95)
plot(IRF_1, ylab="Unemployment",
main="Unemployment response to an Inflation shock")
Unemployment response to an inflation shock
VARD_1<-fevd(VAR_mdl, n.ahead=5)
VARD_1
$Unemployment
Unemployment Inflation Fed.Rate
[1,] 1.0000000 0.000000000 0.0000000000
[2,] 0.9906356 0.008466579 0.0008978494
[3,] 0.9689308 0.017920429 0.0131487360
[4,] 0.9348733 0.028723840 0.0364029017
[5,] 0.8903648 0.046527944 0.0631072218
$Inflation
Unemployment Inflation Fed.Rate
[1,] 0.001789413 0.9982106 0.000000000
[2,] 0.062362243 0.9290021 0.008635625
[3,] 0.092268984 0.8917762 0.015954800
[4,] 0.100141867 0.8864991 0.013359037
[5,] 0.105821679 0.8826761 0.011502228
$Fed.Rate
Unemployment Inflation Fed.Rate
[1,] 0.1722401 0.02083114 0.8069287
[2,] 0.3326729 0.02515215 0.6421749
[3,] 0.4415973 0.06171751 0.4966851
[4,] 0.5042660 0.07830998 0.4174240
[5,] 0.5349494 0.08422492 0.3808256
plot(VARD_1)
Variance decomposition function
Generalized Auto-Regressive Conditional Heteroskedasticity (GARCH) is a statistical model used in analyzing time-series data where the variance of the error term is believed to be serially autocorrelated. GARCH models assume that the variance of the error term follows an autoregressive moving average process.
Although GARCH models can be used in the analysis of a number of different types of financial data, such as macroeconomic data, financial institutions typically use them to estimate the volatility of returns for stocks, bonds, and market indices. They use the resulting information to help determine pricing and judge which assets will potentially provide higher returns, as well as to forecast the returns of current investments to help in their asset allocation, hedging, risk management, and portfolio optimization decisions.
GARCH models are used when the variance of the error term is not constant. That is, the error term is heteroskedastic. Heteroskedasticity describes the irregular pattern of variation of an error term, or variable, in a statistical model. Essentially, wherever there is heteroskedasticity, observations do not conform to a linear pattern. Instead, they tend to cluster. Therefore, if statistical models that assume constant variance are used on this data, then the conclusions and predictive value one can draw from the model will not be reliable.
The variance of the error term in GARCH models is assumed to vary systematically, conditional on the average size of the error terms in previous periods. In other words, it has conditional heteroskedasticity, and the reason for the heteroskedasticity is that the error term is following an auto-regressive moving average pattern. This means that it is a function of an average of its own past values.
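To illustrate the mechanism, the following is a short, purely illustrative simulation of a GARCH(1,1) process in base R: the conditional variance at time t depends on the previous squared shock and the previous conditional variance (the parameter values are made up for the sketch).
#Illustrative GARCH(1,1) simulation: sigma2[t] = omega + alpha*e[t-1]^2 + beta*sigma2[t-1]
set.seed(123)
n <- 1000
omega <- 0.00001; alpha <- 0.10; beta <- 0.85   # illustrative parameters
e <- numeric(n); sigma2 <- numeric(n)
sigma2[1] <- omega / (1 - alpha - beta)         # unconditional variance as starting value
e[1] <- sqrt(sigma2[1]) * rnorm(1)
for (t in 2:n) {
  sigma2[t] <- omega + alpha * e[t - 1]^2 + beta * sigma2[t - 1]
  e[t] <- sqrt(sigma2[t]) * rnorm(1)
}
plot.ts(e, main = "Simulated GARCH(1,1) returns: volatility clustering")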
GARCH was developed in 1986 by Dr. Tim Bollerslev, a doctoral student at the time, as a way to address the problem of forecasting volatility in asset prices. It built on economist Robert Engle’s breakthrough 1982 work introducing the Autoregressive Conditional Heteroskedasticity (ARCH) model. His model assumed that the variation of financial returns is not constant over time but is autocorrelated, or conditional on and dependent on its own past. For instance, one can see this in stock returns, where periods of volatility tend to be clustered together.
Since the original introduction, many variations of GARCH have emerged. These include Nonlinear (NGARCH), which addresses correlation and observed “volatility clustering” of returns, and Integrated GARCH (IGARCH), which restricts the volatility parameter. All the GARCH model variations seek to incorporate the direction, positive or negative, of returns in addition to the magnitude (addressed in the original model).
Each derivation of GARCH can be used to accommodate the specific qualities of the stock, industry, or economic data. When assessing risk, financial institutions incorporate GARCH models into their Value-at-Risk (VaR) estimates, i.e., the maximum expected loss (whether for a single investment or trading position, a portfolio, or at a division or firm-wide level) over a specified time period. GARCH models are viewed as providing better gauges of risk than can be obtained through tracking standard deviation alone.
Various studies have been conducted on the reliability of various GARCH models during different market conditions, including during the periods leading up to and after the Great Recession.
We are going to download the Apple stock price data from the internet using the command below.
apple<-getSymbols("AAPL",
from="2007-01-01",
to="2022-12-31")
plot(AAPL$AAPL.Close["2020"])
Plot of AAPL closing prices for 2020
head(AAPL,10)
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2007-01-03 3.081786 3.092143 2.925000 2.992857 1238319600 2.551165
2007-01-04 3.001786 3.069643 2.993571 3.059286 847260400 2.607790
2007-01-05 3.063214 3.078571 3.014286 3.037500 834741600 2.589220
2007-01-08 3.070000 3.090357 3.045714 3.052500 797106800 2.602005
2007-01-09 3.087500 3.320714 3.041071 3.306071 3349298400 2.818155
2007-01-10 3.383929 3.492857 3.337500 3.464286 2952880000 2.953020
2007-01-11 3.426429 3.456429 3.396429 3.421429 1440252800 2.916488
2007-01-12 3.378214 3.395000 3.329643 3.379286 1312690400 2.880563
2007-01-16 3.417143 3.473214 3.408929 3.467857 1244076400 2.956063
2007-01-17 3.484286 3.485714 3.386429 3.391071 1646260000 2.890609
tail(AAPL,10)
AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2022-11-18 152.31 152.70 149.97 151.29 74794600 151.29
2022-11-21 150.16 150.37 147.72 148.01 58724100 148.01
2022-11-22 148.13 150.42 146.93 150.18 51804100 150.18
2022-11-23 149.45 151.83 149.34 151.07 58301400 151.07
2022-11-25 148.31 148.88 147.12 148.11 35195900 148.11
2022-11-28 145.14 146.64 143.38 144.22 69246000 144.22
2022-11-29 144.29 144.81 140.35 141.17 83763800 141.17
2022-11-30 141.40 148.72 140.55 148.03 111224400 148.03
2022-12-01 148.21 149.13 146.61 148.31 71250400 148.31
2022-12-02 145.96 148.00 145.65 147.81 65421400 147.81
chartSeries(AAPL)
Chart series of AAPL prices
chartSeries(AAPL["2020-12"])
Chart series showing AAPL prices for December 2020
Note that the yellow candlesticks indicate that AAPL experienced a lower closing price than its opening price. On the other hand, the green candlesticks show that the company experienced a higher closing price than its opening price.
chartSeries(AAPL["2008-12"])
Chart series showing AAPL prices for December 2008
Returns<-CalculateReturns(AAPL$AAPL.Close)
head(Returns,10)
AAPL.Close
2007-01-03 NA
2007-01-04 0.022195848
2007-01-05 -0.007121269
2007-01-08 0.004938272
2007-01-09 0.083069943
2007-01-10 0.047855899
2007-01-11 -0.012371092
2007-01-12 -0.012317368
2007-01-16 0.026209975
2007-01-17 -0.022142205
Returns<-Returns[-1]
head(Returns,10)
AAPL.Close
2007-01-04 0.022195848
2007-01-05 -0.007121269
2007-01-08 0.004938272
2007-01-09 0.083069943
2007-01-10 0.047855899
2007-01-11 -0.012371092
2007-01-12 -0.012317368
2007-01-16 0.026209975
2007-01-17 -0.022142205
2007-01-18 -0.061927338
hist(Returns, breaks = 15)
Histogram for the returns
chart.Histogram(Returns,
methods = c('add.density','add.normal'),
colorset = c('blue','green','red'))
Histogram showing the distribution of returns superimposed with the normal curve
chartSeries(Returns,theme = 'white')
Volatility chart series
The graph above shows that there is volatility in the series but no seasonality. High volatility is seen in 2008, specifically after the stock market crash in the United States triggered by the fall in housing prices. The graph also shows other periods of higher variability in the returns.
We can capture volatility in the market with the help of standard deviation. First, we calculate the standard deviation of our series as shown below
sd(Returns)
[1] 0.02038514
The return’s daily standard deviation is approximately 2.04%. To annualize it, assuming 252 trading days in a year, we multiply this standard deviation by the square root of 252.
sqrt(252)*sd(Returns)
[1] 0.3236041
Generally speaking, the average volatility in returns is approximately 32.36% per year.
sqrt(252)*sd(Returns["2008"])
[1] 0.5820481
Volatility in returns for the Apple company in the year 2008 was approximately 58.20%. Volatility measures how quickly returns on an investment change due to factors such as economic or political shocks. Compare the volatility across the years 2008, 2015 and 2020: in which year did AAPL experience the highest volatility in returns? (A compact way to compute all three at once is shown after the per-year calculations below.)
sqrt(252)*sd(Returns["2015"])
[1] 0.2673398
Volatility in returns for the Apple company in the year 2015 was approximately 26.73%.
sqrt(252)*sd(Returns["2020"])
[1] 0.4665857
Volatility in returns for the Apple company in the year 2020 was approximately 46.66%.
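A compact way to make the comparison asked for above is to compute the annualized volatility for all three years in one call; a small sketch:
# annualized volatility for several years in one call
sapply(c("2008", "2015", "2020"),
       function(y) sqrt(252) * sd(Returns[y]))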
chart.RollingPerformance(R=Returns["2007::2022"],
width = 22,
FUN = "sd.annualized",
scale=252,
main = "Apple's monthly rolling volatility")
Apple’s monthly rolling volatility
chart.RollingPerformance(R=Returns["2007::2022"],
width = 6,
FUN = "sd.annualized",
scale=252,
main = "Apple's monthly rolling volatility")
Apple’s weekly rolling volatility
chart.RollingPerformance(R=Returns["2007::2022"],
width = 252,
FUN = "sd.annualized",
scale=252,
main = "Apple's monthly rolling volatility")
Apple’s Annual rolling volatility
Before we fit the model, we create and store the model specification. This is done as follows;
s<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
variance.model = list(model="sGARCH"),
distribution.model = 'norm')
m<- ugarchfit(data=Returns,spec = s)
m
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : sGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : norm
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001951 0.000250 7.7928 0
omega 0.000014 0.000002 8.1685 0
alpha1 0.108084 0.004180 25.8595 0
beta1 0.857895 0.008443 101.6101 0
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001951 0.000352 5.5489 0.000000
omega 0.000014 0.000005 2.5905 0.009585
alpha1 0.108084 0.023389 4.6212 0.000004
beta1 0.857895 0.014386 59.6333 0.000000
LogLikelihood : 10385.77
Information Criteria
------------------------------------
Akaike -5.1805
Bayes -5.1742
Shibata -5.1805
Hannan-Quinn -5.1783
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 0.8287 0.3626
Lag[2*(p+q)+(p+q)-1][2] 1.3710 0.3921
Lag[4*(p+q)+(p+q)-1][5] 4.4402 0.2040
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.1499 0.6986
Lag[2*(p+q)+(p+q)-1][5] 1.4674 0.7479
Lag[4*(p+q)+(p+q)-1][9] 2.6763 0.8110
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.7406 0.500 2.000 0.3895
ARCH Lag[5] 2.4204 1.440 1.667 0.3856
ARCH Lag[7] 2.5880 2.315 1.543 0.5945
Nyblom stability test
------------------------------------
Joint Statistic: 14.7015
Individual Statistics:
mu 0.1656
omega 2.9902
alpha1 0.3782
beta1 0.4481
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.07 1.24 1.6
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 0.8319 0.40553
Negative Sign Bias 1.4287 0.15316
Positive Sign Bias 0.7851 0.43247
Joint Effect 9.7816 0.02052 **
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 139.0 0.00000000000000000002877
2 30 153.7 0.00000000000000000062105
3 40 177.0 0.00000000000000000017618
4 50 173.4 0.00000000000000082932719
Elapsed time : 0.529916
From the results above, the fitted model is the standard sGARCH(1,1). Remember we are using a constant mean model, ARFIMA(0,0,0), with normally distributed errors. The model has four parameters: mu, omega, alpha1 and beta1. From the output above, all of the parameter estimates are statistically significant.
We can write the mean equation in R Markdown as follows; \[ Returns_{t}= \mu + e_{t} \] With the estimated mean, the equation becomes \[ Returns_{t}= 0.001951 + e_{t} \] Remember the mean is constant. On the other hand, the equation for the conditional variance at time t is given as follows; \[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 + \beta \sigma_{t-1}^2 \]
If we replace the value from the output above into our equation, the new equation will be as shown below.
\[ \sigma_{t}^2 = 0.000014 + 0.108084\, e_{t-1}^2 + 0.857895\, \sigma_{t-1}^2 \]
The model above is our GARCH variance equation. Remember that if we drop the beta term, the model reduces to an ARCH(1) model, which is as shown below; \[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 \]
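As a quick sanity check (an added illustration, not part of the original output), the fitted coefficients imply a persistence of alpha + beta and a long-run daily variance of omega/(1 - alpha - beta); annualizing the implied long-run standard deviation roughly reproduces the 32% annual volatility computed earlier.
# implied persistence and long-run volatility from the sGARCH(1,1) estimates
omega <- 0.000014
alpha <- 0.108084
beta  <- 0.857895
persistence <- alpha + beta               # about 0.966, close to but below 1
longrun_var <- omega / (1 - persistence)  # unconditional daily variance
sqrt(252 * longrun_var)                   # about 0.32, i.e. roughly 32% per year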
This output also provides a lot of other useful information. One of the most useful pieces is the set of information criteria: the Akaike, Bayes, Hannan-Quinn and Shibata information criteria. Information criteria help us choose the best model without choosing an unnecessarily complex one, since they penalize additional parameters. The model with the lowest information criteria is preferred, so we shall choose the model with the lowest values.
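The rugarch package exposes these values through the infocriteria() accessor, which is convenient when comparing several fitted models:
infocriteria(m)   # Akaike, Bayes, Shibata and Hannan-Quinn for the fitted model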
Another important piece of information is the weighted Ljung-Box test on the standardized residuals. The null hypothesis is that there is no serial correlation. From the output above, there is insufficient evidence to reject the null hypothesis, so we conclude that there is no serial correlation in the standardized residuals. Likewise, there is no serial correlation in the standardized squared residuals; they behave more or less like a white noise process. This supports the validity of our GARCH model.
One thing we notice, however, is that the p-values of the adjusted Pearson goodness-of-fit test are all less than 0.05, indicating that the normal distribution we assumed for the residuals is not really a good choice. If the distribution were a good fit, the goodness-of-fit p-values would be greater than 0.05. Thus, there is scope to improve our model.
plot(m,which ='all')
The 12 plots for the distribution of residuals
The first plot shows the series with two conditional standard deviations superimposed; the red lines are drawn at plus and minus two standard deviations and therefore cover roughly 95% of the returns. The second plot shows the series with a 1% value-at-risk limit, so the red lines are drawn at the 1% level. The third plot shows the conditional standard deviation against the absolute returns; the blue line is the conditional standard deviation. The fourth plot shows the autocorrelation of the observations. In the fifth plot there is significant autocorrelation, as shown by the bars rising above the red bounds, and the sixth plot shows a similar situation. The seventh plot shows the cross-correlation between the squared observations and the actual observations. The eighth plot shows the histogram of the residuals; its tails deviate from the normal distribution, which is confirmed by the normal Q-Q plot, i.e. the standardized residuals deviate significantly from the normal distribution in both tails. The remaining plots show the autocorrelation of the standardized and squared standardized residuals, where we can see a clear improvement: essentially all values lie within the red bounds.
We shall therefore use this model to make our forecast, and we will store the forecast in the object f.
f<-ugarchforecast(fitORspec = m,
n.ahead = 20)
plot(fitted(f))
Fitted plot
Remember this model was estimated with a constant mean; in other words, the mean prediction will be constant. We can also plot the forecast volatility using the command below.
plot(sigma(f))
Variability plot
The graph shows that the forecast volatility in returns increases over the next 20 days.
A portfolio allocation example will help guide what percentage of the portfolio should go to a risky asset, such as the stock, and what percentage should go to a risk-free asset. This will be followed by four different variants of the model:
* Model 2: GARCH with the skewed Student's t distribution (sstd)
* Model 3: GJR-GARCH
* Model 4: AR(1) GJR-GARCH
* Model 5: GJR-GARCH in mean
Let us now calculate the annualized volatility using the code below and store the results in V; remember we shall assume that there are 252 trading days.
V<- sqrt(252) * sigma(m)
We shall also create an object W which represents the weight assigned to the risky asset. In this case, we shall use a volatility target of 5%, that is, 0.05.
W<-0.05/V
plot(merge(V,W),
multi.panel=T)
Weight assigned to risky assets
The upper panel shows the annualized volatility and the lower panel shows the weight we would assign to the risky asset. Notice that when volatility is very high, as in 2008 and 2020, the weight assigned to the risky asset is very low; when volatility is low, the weight is relatively high. In other words, we can increase the allocation to the risky asset when volatility is low. View the weights at the tail end of the series below.
tail(W,20)
[,1]
2022-11-04 0.09982105
2022-11-07 0.10669273
2022-11-08 0.11409389
2022-11-09 0.12183158
2022-11-10 0.11730929
2022-11-11 0.08268621
2022-11-14 0.08765183
2022-11-15 0.09345250
2022-11-16 0.09964281
2022-11-17 0.10601643
2022-11-18 0.11246150
2022-11-21 0.12014075
2022-11-22 0.12222863
2022-11-23 0.12846092
2022-11-25 0.13662344
2022-11-28 0.13810258
2022-11-29 0.13475537
2022-11-30 0.13549282
2022-12-01 0.11797602
2022-12-02 0.12593584
Let us assume that Apple has two million dollars ($2,000,000) to allocate on the trading day 2022-08-12, when the weight assigned to the risky asset was approximately 18.3%. In this case the model suggests that the company can allocate the following amount to the risky asset. \[ Amount_{riskyAssets}= 2000000*0.1831701 = 366340.2 \]
The calculation above shows that the company should have allocated $366,340.2 to the risky asset, as suggested by the model, and the remaining amount to the risk-free asset. Suggested risk-free assets include, but are not limited to, bank deposits.
\[ Amount_{freeRisk}=2000000-366340.2= 1633659.8 \]
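The same split can be checked quickly in R; a small sketch, using the capital and the 18.3% weight from the example above:
capital <- 2000000
w_risky <- 0.1831701                 # weight suggested by the model on that day
risky_amount <- capital * w_risky    # about 366340.2
risk_free_amount <- capital - risky_amount
c(risky = risky_amount, risk_free = risk_free_amount)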
Suppose we increase the volatility target assigned to the risky asset to 10%, that is, 0.1. Consider the graph below.
W_1<-0.1/V
plot(merge(V,W_1),
multi.panel=T)
Weight assigned to risky assets
As before, the upper panel shows volatility and the lower panel shows the weight assigned to the risky asset: the weight falls when volatility spikes (2008, 2020) and rises when volatility is low. With the 10% target, however, every weight is twice its value in the previous graph, as the tail end of the series below confirms.
tail(W_1,20)
[,1]
2022-11-04 0.1996421
2022-11-07 0.2133855
2022-11-08 0.2281878
2022-11-09 0.2436632
2022-11-10 0.2346186
2022-11-11 0.1653724
2022-11-14 0.1753037
2022-11-15 0.1869050
2022-11-16 0.1992856
2022-11-17 0.2120329
2022-11-18 0.2249230
2022-11-21 0.2402815
2022-11-22 0.2444573
2022-11-23 0.2569218
2022-11-25 0.2732469
2022-11-28 0.2762052
2022-11-29 0.2695107
2022-11-30 0.2709856
2022-12-01 0.2359520
2022-12-02 0.2518717
Let us again assume that Apple has two million dollars ($2,000,000) on the trading day 2022-08-12, when the weight assigned to the risky asset was approximately 36.6%. In this case the model suggests that the company can allocate the following amount to the risky asset.
2000000*0.3663402
[1] 732680.4
According to this model, the company should therefore allocate $732,680.4 to the risky asset and the remaining amount to the risk-free asset. This allocation only applies if the company sets 10% as the volatility target for the risky asset.
In the previous model we saw that the normal distribution is not a good fit for the residuals. Let us now make use of the skewed Student's t distribution to capture the residual distribution more accurately. We shall edit the specification to use the skewed Student's t distribution.
S<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
variance.model = list(model="sGARCH"),
distribution.model = 'sstd')
M<- ugarchfit(data=Returns,spec = S)
M
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : sGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : sstd
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001646 0.000255 6.4610 0.000000
omega 0.000008 0.000003 2.3373 0.019424
alpha1 0.098038 0.012390 7.9128 0.000000
beta1 0.889328 0.013202 67.3651 0.000000
skew 1.010442 0.021780 46.3936 0.000000
shape 5.039717 0.450983 11.1750 0.000000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001646 0.000257 6.3964 0.0000
omega 0.000008 0.000008 1.0067 0.3141
alpha1 0.098038 0.018950 5.1735 0.0000
beta1 0.889328 0.019562 45.4616 0.0000
skew 1.010442 0.020704 48.8036 0.0000
shape 5.039717 0.708013 7.1181 0.0000
LogLikelihood : 10543.89
Information Criteria
------------------------------------
Akaike -5.2584
Bayes -5.2490
Shibata -5.2584
Hannan-Quinn -5.2551
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 0.8252 0.3637
Lag[2*(p+q)+(p+q)-1][2] 1.3217 0.4047
Lag[4*(p+q)+(p+q)-1][5] 4.5496 0.1930
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.009601 0.9219
Lag[2*(p+q)+(p+q)-1][5] 1.122584 0.8314
Lag[4*(p+q)+(p+q)-1][9] 2.192556 0.8808
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.6271 0.500 2.000 0.4284
ARCH Lag[5] 2.0666 1.440 1.667 0.4567
ARCH Lag[7] 2.3500 2.315 1.543 0.6435
Nyblom stability test
------------------------------------
Joint Statistic: 4.1734
Individual Statistics:
mu 0.19445
omega 0.86119
alpha1 0.77193
beta1 0.90835
skew 0.04426
shape 1.06900
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.49 1.68 2.12
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.0081 0.31348
Negative Sign Bias 1.1927 0.23306
Positive Sign Bias 0.8992 0.36859
Joint Effect 10.3009 0.01617 **
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 28.51 0.07415
2 30 44.57 0.03234
3 40 56.77 0.03277
4 50 57.69 0.18472
Elapsed time : 1.52689
Having fitted the model with the skewed Student's t distribution, let us now look at the 12 diagnostic plots and examine the distribution of the residuals.
plot(M,which ='all')
The 12 plots for residual distribution
The plots above show that this model improved compared to the previous one, as indicated by the distribution of the residuals: in the sstd Q-Q plot the residuals align fairly well with the diagonal line. Besides, the information criteria also improved. Additionally, the skew estimate of approximately 1.01 is close to 1, indicating that the distribution is roughly symmetric.
V_1<- sqrt(252) * sigma(M)
We shall also create an object W_2 which represents the weight assigned to the risky asset. In this case, we again use a volatility target of 5%, that is, 0.05.
W_2<-0.05/V
plot(merge(V_1,W_2),
multi.panel=T)
Weight assigned to risky and non-risky assets
As before, the upper panel shows volatility and the lower panel shows the weight assigned to the risky asset; the weight falls when volatility is high (2008, 2020) and rises when volatility is low. View the weights at the tail end of the series below.
tail(W_2,20)
[,1]
2022-11-04 0.09982105
2022-11-07 0.10669273
2022-11-08 0.11409389
2022-11-09 0.12183158
2022-11-10 0.11730929
2022-11-11 0.08268621
2022-11-14 0.08765183
2022-11-15 0.09345250
2022-11-16 0.09964281
2022-11-17 0.10601643
2022-11-18 0.11246150
2022-11-21 0.12014075
2022-11-22 0.12222863
2022-11-23 0.12846092
2022-11-25 0.13662344
2022-11-28 0.13810258
2022-11-29 0.13475537
2022-11-30 0.13549282
2022-12-01 0.11797602
2022-12-02 0.12593584
Similarly, if we assume that Apple has two million dollars ($2,000,000) on the trading day 2022-08-12, when the weight assigned to the risky asset was approximately 18.3%, the model suggests allocating the following amount to the risky asset and the remainder to the risk-free asset.
2000000*0.1831701
[1] 366340.2
We are going to copy the model used above and modify it so as to estimate the GJR-GARCH model, which was developed by Glosten, Jagannathan and Runkle. Consider the results below.
GJR<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
variance.model = list(model="gjrGARCH"),
distribution.model = 'sstd')
M_1<- ugarchfit(data=Returns,spec = GJR)
M_1
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : sstd
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001439 0.000245 5.8735 0
omega 0.000010 0.000001 8.0910 0
alpha1 0.037620 0.001338 28.1231 0
beta1 0.872453 0.009578 91.0907 0
gamma1 0.141980 0.020040 7.0849 0
skew 1.005834 0.021604 46.5580 0
shape 5.363441 0.387189 13.8522 0
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001439 0.000268 5.3614 0.000000
omega 0.000010 0.000002 5.4874 0.000000
alpha1 0.037620 0.010679 3.5229 0.000427
beta1 0.872453 0.010833 80.5341 0.000000
gamma1 0.141980 0.025049 5.6680 0.000000
skew 1.005834 0.020659 48.6881 0.000000
shape 5.363441 0.491852 10.9046 0.000000
LogLikelihood : 10573.09
Information Criteria
------------------------------------
Akaike -5.2725
Bayes -5.2615
Shibata -5.2725
Hannan-Quinn -5.2686
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 1.770 0.1833
Lag[2*(p+q)+(p+q)-1][2] 2.193 0.2326
Lag[4*(p+q)+(p+q)-1][5] 5.259 0.1337
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.2066 0.6494
Lag[2*(p+q)+(p+q)-1][5] 1.3939 0.7660
Lag[4*(p+q)+(p+q)-1][9] 2.3991 0.8524
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.8719 0.500 2.000 0.3504
ARCH Lag[5] 2.1876 1.440 1.667 0.4312
ARCH Lag[7] 2.4394 2.315 1.543 0.6249
Nyblom stability test
------------------------------------
Joint Statistic: 21.3464
Individual Statistics:
mu 0.50717
omega 5.27138
alpha1 1.50521
beta1 1.49999
gamma1 1.00250
skew 0.03171
shape 1.42563
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.69 1.9 2.35
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.3229 0.1860
Negative Sign Bias 0.6039 0.5460
Positive Sign Bias 0.1287 0.8976
Joint Effect 2.7437 0.4329
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 24.07 0.1936
2 30 37.15 0.1424
3 40 42.72 0.3144
4 50 47.64 0.5284
Elapsed time : 2.497092
We can notice a very big change in the last of the 12 plots, as shown below;
plot(M_1,which ='all')
Plot number 12 changes drastically under the GJR-GARCH model. According to this model, when news has a positive impact on the stock price, the increase in volatility is gradual; when news has a negative impact, the effect on volatility is significantly larger. Therefore, as per this model, the impact of news on the stock is not symmetric: negative news raises volatility more than positive news of the same size.
The model parameters are all significant. The mean equation of this model is written as follows:
\[ Returns_{t} = \mu + e_{t} \]
When we plug in the estimated mean, the model becomes: \[ Returns_{t} = 0.001439 + e_{t} \]
Remember that in this model e_t has mean zero and conditional variance sigma_t squared (here with a skewed Student's t distribution). For the conditional variance we now have two regimes, one for negative news and one for positive news. When the previous shock is negative (e_{t-1} < 0), the variance equation is \[ \sigma_{t}^2 = \omega + (\alpha + \gamma)e_{t-1}^2 + \beta\sigma_{t-1}^2 \]
Plugging in the estimates from the output above gives \[ \sigma_{t}^2 = 0.000010 + (0.037620 + 0.141980)e_{t-1}^2 + 0.872453\, \sigma_{t-1}^2 \]
However, when the previous shock is positive (e_{t-1} > 0), the gamma term drops out and we are back to the standard model seen earlier,
\[ \sigma_{t}^2 = \omega + \alpha e_{t-1}^2 + \beta\sigma_{t-1}^2 \]
which, with the estimated values, becomes
\[ \sigma_{t}^2 = 0.000010 + 0.037620\, e_{t-1}^2 + 0.872453\, \sigma_{t-1}^2 \]
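One way to quantify the asymmetry described above is to compare the coefficient on a negative shock, alpha + gamma, with the coefficient on a positive shock, alpha. A minimal sketch using the estimates from the output above:
# news-impact asymmetry implied by the gjrGARCH(1,1) estimates
alpha <- 0.037620
gamma <- 0.141980
(alpha + gamma) / alpha   # about 4.8: a negative shock raises next-day variance
                          # roughly 4.8 times as much as a positive shock of the same size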
Generally, the third model is better than the first two models. To validate this claim we shall focus on the following:
* the skewness statistic
* the information criteria
The skew estimate for this model is closer to one than in the previous models, and its information criteria are lower than those of the other two models. This is enough evidence to show that this model performs better than the first two.
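A compact way to back this comparison with numbers (a small sketch using the three fitted objects defined above) is to line up their information criteria side by side:
ic <- cbind(infocriteria(m),    # sGARCH(1,1), normal errors
            infocriteria(M),    # sGARCH(1,1), sstd errors
            infocriteria(M_1))  # gjrGARCH(1,1), sstd errors
colnames(ic) <- c("sGARCH_norm", "sGARCH_sstd", "gjrGARCH_sstd")
ic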
We shall edit the model used above to obtain the AR(1) GJR-GARCH model;
AR_GJR<-ugarchspec(mean.model = list(armaOrder =c(1,0)),
variance.model = list(model="gjrGARCH"),
distribution.model = 'sstd')
M_2<- ugarchfit(data=Returns,spec = AR_GJR)
M_2
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model : ARFIMA(1,0,0)
Distribution : sstd
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001420 0.000247 5.74515 0.00000
ar1 0.010494 0.015945 0.65813 0.51045
omega 0.000010 0.000001 8.24188 0.00000
alpha1 0.037181 0.001534 24.23671 0.00000
beta1 0.872005 0.009593 90.89994 0.00000
gamma1 0.143733 0.020281 7.08721 0.00000
skew 1.005822 0.021631 46.49968 0.00000
shape 5.382598 0.389205 13.82971 0.00000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001420 0.000268 5.29040 0.000000
ar1 0.010494 0.014629 0.71735 0.473160
omega 0.000010 0.000002 5.60750 0.000000
alpha1 0.037181 0.010548 3.52510 0.000423
beta1 0.872005 0.010817 80.61469 0.000000
gamma1 0.143733 0.025288 5.68388 0.000000
skew 1.005822 0.020836 48.27366 0.000000
shape 5.382598 0.490608 10.97128 0.000000
LogLikelihood : 10573.3
Information Criteria
------------------------------------
Akaike -5.2721
Bayes -5.2595
Shibata -5.2721
Hannan-Quinn -5.2677
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 0.5763 0.4478
Lag[2*(p+q)+(p+q)-1][2] 1.0005 0.7396
Lag[4*(p+q)+(p+q)-1][5] 4.0600 0.2160
d.o.f=1
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.2509 0.6164
Lag[2*(p+q)+(p+q)-1][5] 1.4228 0.7589
Lag[4*(p+q)+(p+q)-1][9] 2.4147 0.8502
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.8672 0.500 2.000 0.3517
ARCH Lag[5] 2.1471 1.440 1.667 0.4396
ARCH Lag[7] 2.4044 2.315 1.543 0.6322
Nyblom stability test
------------------------------------
Joint Statistic: 21.9003
Individual Statistics:
mu 0.50680
ar1 0.28772
omega 5.39492
alpha1 1.51052
beta1 1.50376
gamma1 1.01348
skew 0.03135
shape 1.42984
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.89 2.11 2.59
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.2808 0.2003
Negative Sign Bias 0.6068 0.5440
Positive Sign Bias 0.1426 0.8866
Joint Effect 2.5962 0.4582
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 25.76 0.1369
2 30 38.74 0.1068
3 40 45.37 0.2235
4 50 50.93 0.3975
Elapsed time : 2.785747
plot(M_2,which='all')
There is nothing dramatically new in the plots above; what is readily noticeable in the output, however, is that the AR(1) term we have just included is not statistically significant. The implication is that adding this term to the GARCH model is not helpful, and we should therefore consider removing it. In other words, we are no longer going to use this model.
As usual, we shall edit the model used above to obtain the GJR-GARCH-in-mean model.
GJR_mean<-ugarchspec(mean.model = list(armaOrder =c(0,0),
archm=T,
archpow=2),
variance.model = list(model="gjrGARCH"),
distribution.model = 'sstd')
M_3<- ugarchfit(data=Returns,spec = GJR_mean)
M_3
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : gjrGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : sstd
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.001444 0.000325 4.443402 0.000009
archm -0.049490 0.932517 -0.053071 0.957675
omega 0.000010 0.000001 8.754815 0.000000
alpha1 0.037684 0.004911 7.673515 0.000000
beta1 0.872433 0.009586 91.011271 0.000000
gamma1 0.142023 0.020514 6.923123 0.000000
skew 1.005493 0.021781 46.164192 0.000000
shape 5.363234 0.397180 13.503282 0.000000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.001444 0.000379 3.813922 0.000137
archm -0.049490 0.958897 -0.051611 0.958839
omega 0.000010 0.000002 6.413839 0.000000
alpha1 0.037684 0.009759 3.861426 0.000113
beta1 0.872433 0.010894 80.081843 0.000000
gamma1 0.142023 0.025209 5.633769 0.000000
skew 1.005493 0.020491 49.069703 0.000000
shape 5.363234 0.467440 11.473630 0.000000
LogLikelihood : 10573.09
Information Criteria
------------------------------------
Akaike -5.2720
Bayes -5.2594
Shibata -5.2720
Hannan-Quinn -5.2675
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 1.756 0.1851
Lag[2*(p+q)+(p+q)-1][2] 2.183 0.2341
Lag[4*(p+q)+(p+q)-1][5] 5.248 0.1344
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 0.209 0.6476
Lag[2*(p+q)+(p+q)-1][5] 1.404 0.7634
Lag[4*(p+q)+(p+q)-1][9] 2.414 0.8503
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.8757 0.500 2.000 0.3494
ARCH Lag[5] 2.1980 1.440 1.667 0.4291
ARCH Lag[7] 2.4523 2.315 1.543 0.6223
Nyblom stability test
------------------------------------
Joint Statistic: 22.1719
Individual Statistics:
mu 0.51432
archm 0.31673
omega 5.27461
alpha1 1.51079
beta1 1.50491
gamma1 1.00694
skew 0.03152
shape 1.42678
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.89 2.11 2.59
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.3120 0.1896
Negative Sign Bias 0.5963 0.5510
Positive Sign Bias 0.1378 0.8904
Joint Effect 2.7252 0.4360
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 24.24 0.1873
2 30 38.00 0.1223
3 40 42.74 0.3137
4 50 50.31 0.4214
Elapsed time : 4.44684
Again, the additional term (archm) in our model is not statistically significant. Thus we shall ignore this model and will not use it to make our prediction.
In this part we shall make use of the best of the five models above and use it to simulate Apple's returns. Of the five models that we ran, the best was the GJR-GARCH model with the skewed Student's t distribution, which was the third model we estimated. We shall therefore copy the specification we used and edit it to make our simulation.
GJR<-ugarchspec(mean.model = list(armaOrder =c(0,0)),
variance.model = list(model="gjrGARCH"),
distribution.model = 'sstd')
M_1<- ugarchfit(data=Returns,spec = GJR)
We shall treat this model as our final model. Run the two command lines above and run the command lines below as well.
GJR_final<-GJR
setfixed(GJR_final)<-as.list(coef(M_1))
Run the command below to obtain the coefficients of the model M_1.
coef(M_1)
mu omega alpha1 beta1 gamma1
0.00143879804 0.00001029444 0.03762037087 0.87245345504 0.14198003252
skew shape
1.00583419409 5.36344066744
Now we want to compare one-year forecasts made at two different points in time: one using the data up to the end of 2008 and one using the data up to the end of 2022. In each case we forecast the next year, which has approximately 252 trading days.
f2008<-ugarchforecast(data = Returns["/2008-12"],
fitORspec = GJR_final,
n.ahead = 252)
f2022<-ugarchforecast(data = Returns["/2022-12"],
fitORspec = GJR_final,
n.ahead = 252)
par(mfrow= c(2,1))
plot(sigma(f2008))
plot(sigma(f2022))
For 2008, volatility was high and the forecast was that over the following year it would fall. At the end of 2022, volatility was low, so the forecast was that it would rise over the following year. In the long run the forecast volatility moves towards its long-run average. Now let us simulate return paths from the fitted model and store the results in the object SIM. Consider the command below;
SIM <- ugarchpath(spec = GJR_final,
m.sim = 3,
n.sim = 1*250,
rseed = 123)
plot.zoo(fitted(SIM))
The graph above shows three different simulated return paths based on our model. We can also plot the simulated volatility for the 250 simulated days ahead.
plot.zoo(sigma(SIM))
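The simulation above is done on returns; a natural extra step (an added sketch, not shown in the original output) is to convert each simulated return path into a price path starting from the last observed close:
# turn the simulated return paths into price paths
last_close <- as.numeric(tail(AAPL$AAPL.Close, 1))   # starting price for the paths
sim_returns <- fitted(SIM)                           # simulated returns, one column per path
sim_prices <- apply(sim_returns, 2,
                    function(r) last_close * cumprod(1 + r))
matplot(sim_prices, type = "l", lty = 1,
        xlab = "Day ahead", ylab = "Simulated AAPL price")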
The One Sample t Test examines whether the mean of a population is statistically different from a known or hypothesized value. The One Sample t Test is a parametric test. This test is also known as single Sample t Test. The variable used in this test is known as the test variable. In a One Sample t Test, the test variable’s mean is compared against a “test value”, which is a known or hypothesized value of the mean in the population. Test values may come from a literature review, a trusted research organization, legal requirements, or industry standards. For example:
A particular factory’s machines are supposed to fill bottles with 150 milliliters of product. A plant manager wants to test a random sample of bottles to ensure that the machines are not under- or over-filling the bottles. The United States Environmental Protection Agency (EPA) sets clearance levels for the amount of lead present in homes: no more than 10 micrograms per square foot on floors and no more than 100 micrograms per square foot on window sills (as of December 2020). An inspector wants to test if samples taken from units in an apartment building exceed the clearance level.
Enter the following data
score<-c(50,65,72,77,73,85,88,80,65,56,66,78,82,90)
score
[1] 50 65 72 77 73 85 88 80 65 56 66 78 82 90
length(score)
[1] 14
mean(score)
[1] 73.35714
t.test(score,mu=80)
One Sample t-test
data: score
t = -2.0988, df = 13, p-value = 0.05593
alternative hypothesis: true mean is not equal to 80
95 percent confidence interval:
66.51943 80.19486
sample estimates:
mean of x
73.35714
In the above output, the p-value of the test is 0.05593, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence, at the 5% level, that the true mean differs from 80.
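The t statistic reported above can be reproduced directly from its definition, t = (sample mean - test value) / (standard error of the mean); a quick check:
# reproduce the one-sample t statistic by hand
n <- length(score)
t_stat <- (mean(score) - 80) / (sd(score) / sqrt(n))
t_stat                       # about -2.0988, matching the output above
2 * pt(t_stat, df = n - 1)   # two-sided p-value, about 0.056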
t.test(score,mu=80,alternative="two.sided",conf.level=0.95)
One Sample t-test
data: score
t = -2.0988, df = 13, p-value = 0.05593
alternative hypothesis: true mean is not equal to 80
95 percent confidence interval:
66.51943 80.19486
sample estimates:
mean of x
73.35714
In the above output, the p-value of the test is 0.05593, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence, at the 5% level, that the true mean differs from 80.
t.test(score,mu=80,alternative="less",conf.level=0.95)
One Sample t-test
data: score
t = -2.0988, df = 13, p-value = 0.02797
alternative hypothesis: true mean is less than 80
95 percent confidence interval:
-Inf 78.96227
sample estimates:
mean of x
73.35714
In the above output, the p-value of the test is 0.02797, which is less than the significance level alpha = 0.05. We reject the null hypothesis and conclude that the true mean is less than 80.
t.test(score,mu=80,alternative="greater",conf.level=0.95)
One Sample t-test
data: score
t = -2.0988, df = 13, p-value = 0.972
alternative hypothesis: true mean is greater than 80
95 percent confidence interval:
67.75202 Inf
sample estimates:
mean of x
73.35714
In the above output, the p-value of the test is 0.972, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true mean is greater than 80.
Paired T-Test. The paired sample t-test, sometimes called the dependent sample t-test, is a statistical procedure used to determine whether the mean difference between two sets of observations is zero. In a paired sample t-test, each subject or entity is measured twice, resulting in pairs of observations.
weight_before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)
weight_after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)
myDatta<-data.frame(weight_before,weight_after)
head(myDatta,10)
t.test(weight_before,weight_after, alternative="two.sided",paired = TRUE)
Paired t-test
data: weight_before and weight_after
t = -7.9059, df = 9, p-value = 0.00002433
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-214.1284 -118.8516
sample estimates:
mean difference
-166.49
In the above output, the p-value of the test is 0.00002433, which is far less than the significance level alpha = 0.05. We can conclude that the true mean difference is not equal to zero.
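Equivalently, a paired t test is just a one-sample t test on the within-pair differences, which is sometimes easier to interpret; a quick check:
# paired t test expressed as a one-sample test on the differences
diffs <- weight_before - weight_after
t.test(diffs, mu = 0)   # same t = -7.9059 and df = 9 as the paired test above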
t.test(weight_before,weight_after, alternative="less",paired = TRUE)
Paired t-test
data: weight_before and weight_after
t = -7.9059, df = 9, p-value = 0.00001216
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
-Inf -127.8868
sample estimates:
mean difference
-166.49
In the above output, the p-value of the test is 0.00001216, which is far less than the significance level alpha = 0.05. We can conclude that the true mean difference is less than zero.
t.test(weight_before,weight_after, alternative="greater",paired = TRUE)
Paired t-test
data: weight_before and weight_after
t = -7.9059, df = 9, p-value = 1
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
-205.0932 Inf
sample estimates:
mean difference
-166.49
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true mean difference is greater than zero.
weight_before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)
weight_after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)
myData <- data.frame(
weight = c(weight_before, weight_after),
group = rep(c("weight_before", "weight_after"), each = 10)
)
myData
t.test(weight~group, alternative="two.sided",paired = TRUE, data=myData)
Paired t-test
data: weight by group
t = 7.9059, df = 9, p-value = 0.00002433
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
118.8516 214.1284
sample estimates:
mean difference
166.49
In the above output, the p-value of the test is 0.00002433, which is far less than the significance level alpha = 0.05. We can conclude that the true mean difference is not equal to zero. (Note that the sign of the difference is reversed here because the group factor is ordered alphabetically, so the reported difference is weight_after minus weight_before.)
t.test(weight~group, alternative="less",paired = TRUE, data=myData)
Paired t-test
data: weight by group
t = 7.9059, df = 9, p-value = 1
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
-Inf 205.0932
sample estimates:
mean difference
166.49
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true mean difference is less than zero.
t.test(weight~group, alternative="greater",paired = TRUE, data=myData)
Paired t-test
data: weight by group
t = 7.9059, df = 9, p-value = 0.00001216
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
127.8868 Inf
sample estimates:
mean difference
166.49
In the above output, the p-value of the test is 0.00001216, which is far less than the significance level alpha = 0.05. We can conclude that the true mean difference is greater than zero.
my_tbl2020 <- tibble::tribble(
~S.n, ~mathematics, ~arts,
1, 22, 53,
2, 37, 68,
3, 36, 42,
4, 38, 49,
5, 42, 51,
6, 58, 65,
7, 58, 51,
8, 60, 71,
9, 62, 55,
10, 65, 74,
11, 66, 68,
12, 56, 64,
13, 66, 67,
14, 67, 73,
15, 62, 65
)
require(knitr)
Loading required package: knitr
kable(my_tbl2020, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| S.n | mathematics | arts |
|---|---|---|
| 1 | 22 | 53 |
| 2 | 37 | 68 |
| 3 | 36 | 42 |
| 4 | 38 | 49 |
| 5 | 42 | 51 |
| 6 | 58 | 65 |
| 7 | 58 | 51 |
| 8 | 60 | 71 |
| 9 | 62 | 55 |
| 10 | 65 | 74 |
| 11 | 66 | 68 |
| 12 | 56 | 64 |
| 13 | 66 | 67 |
| 14 | 67 | 73 |
| 15 | 62 | 65 |
attach(my_tbl2020)
head(my_tbl2020,10)
TTEST<-t.test(mathematics,arts, alternative = "two.sided",paired = TRUE,data=my_tbl2020, conf.level=.95)
TTEST
Paired t-test
data: mathematics and arts
t = -2.8805, df = 14, p-value = 0.0121
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-14.073042 -2.060291
sample estimates:
mean difference
-8.066667
In the above output, the p-value of the test is approximately 0.0121, which is less than the significance level alpha = 0.05. We can conclude that the true mean difference between mathematics and arts is not equal to zero with a p-value which is approximately 0.0121 at a 5% level of significance.
before<-c(117,111,98,104,105,100,81,89,78)
after<-c(83,85,75,82,82,77,62,69,64)
My_framed_data<-data.frame(before,after)
My_framed_data
MY_TTEST<-t.test(before,after, alternative = "two.sided",paired = TRUE,data=My_framed_data, conf.level = 0.99)
MY_TTEST
Paired t-test
data: before and after
t = 12.52, df = 8, p-value = 0.000001551
alternative hypothesis: true mean difference is not equal to 0
99 percent confidence interval:
16.59186 28.74147
sample estimates:
mean difference
22.66667
In the above output, the p-value of the test is approximately 0.0000016, which is less than 0.01 (the test was run with a 99% confidence level, i.e. alpha = 0.01). We can therefore conclude that the true mean difference between the before and after measurements is not equal to zero.
MY_TTEST1<-t.test(before,after, alternative = "less",paired = TRUE,data=My_framed_data)
MY_TTEST1
Paired t-test
data: before and after
t = 12.52, df = 8, p-value = 1
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
-Inf 26.03331
sample estimates:
mean difference
22.66667
In the above output, the p-value of the test is approximately 1, which is far greater than the significance level alpha = 0.05. We therefore retain the null hypothesis: there is no evidence that the true mean difference between the before and after measurements is less than zero.
MY_TTEST2<-t.test(before,after, alternative = "greater",paired = TRUE,data=My_framed_data)
MY_TTEST2
Paired t-test
data: before and after
t = 12.52, df = 8, p-value = 0.0000007754
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
19.30002 Inf
sample estimates:
mean difference
22.66667
In the above output, the p-value of the test is approximately 0.0000008, which is far less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true mean difference between the before and after measurements is greater than zero.
What is an unpaired t-test? An unpaired t-test (also known as an independent t-test) is a statistical procedure that compares the averages/means of two independent or unrelated groups to determine if there is a significant difference between the two.
data("gapminder")
data2021<-gapminder%>%
dplyr::select(country, lifeExp)%>%
filter(country == "Kenya"|
country == "United States")
head(data2021,14)
t.test(lifeExp ~ country,alternative="two.sided",data = data2021)
Welch Two Sample t-test
data: lifeExp by country
t = -11.051, df = 17.966, p-value = 0.000000001917
alternative hypothesis: true difference in means between group Kenya and group United States is not equal to 0
95 percent confidence interval:
-24.75174 -16.84326
sample estimates:
mean in group Kenya mean in group United States
52.6810 73.4785
In the above output, the p-value of the test is approximately 0.000000002, which is far less than the significance level alpha = 0.05. We can conclude that the true difference in mean life expectancy between Kenya and the United States is not equal to zero.
data2021<-gapminder %>%
dplyr::select(country, lifeExp)%>%
filter(country == "Kenya"|
country == "United States")
head(data2021,14)
t.test(lifeExp ~ country,alternative="less",data = data2021)
Welch Two Sample t-test
data: lifeExp by country
t = -11.051, df = 17.966, p-value = 0.0000000009583
alternative hypothesis: true difference in means between group Kenya and group United States is less than 0
95 percent confidence interval:
-Inf -17.53385
sample estimates:
mean in group Kenya mean in group United States
52.6810 73.4785
In the above output, the p-value of the test is approximately 0.000000001, which is far less than the significance level alpha = 0.05. We can conclude that the true difference in mean life expectancy between Kenya and the United States is less than zero, that is, Kenya's mean life expectancy is lower than that of the United States.
data2021<-gapminder %>%
dplyr::select(country, lifeExp)%>%
filter(country == "Kenya"|
country == "United States")
head(data2021,14)
t.test(lifeExp ~ country,alternative="greater",data = data2021)
Welch Two Sample t-test
data: lifeExp by country
t = -11.051, df = 17.966, p-value = 1
alternative hypothesis: true difference in means between group Kenya and group United States is greater than 0
95 percent confidence interval:
-24.06115 Inf
sample estimates:
mean in group Kenya mean in group United States
52.6810 73.4785
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that Kenya's mean life expectancy is greater than that of the United States.
male<-c(135,180,108,128,160,143,175,170)
female<-c(205,195,185,150,175,190,180,220)
daatta<-data.frame(male,female)
daatta
TTEST1<-t.test(male,female, alternative = "two.sided",paired = FALSE,data=daatta)
TTEST1
Welch Two Sample t-test
data: male and female
t = -3.2304, df = 13.477, p-value = 0.006308
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-62.69717 -12.55283
sample estimates:
mean of x mean of y
149.875 187.500
In the above output, the p-value of the test is approximately 0.006308, which is less than the significance level alpha = 0.05. We can conclude that the true difference in mean weight between males and females is not equal to zero at the 5% level of significance.
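Note that t.test() performs the Welch test by default, which does not assume equal variances in the two groups. If you are willing to assume equal variances, the pooled (classical Student) version can be requested with var.equal = TRUE; a small sketch for comparison:
# pooled (equal-variance) two-sample t test for comparison
t.test(male, female, alternative = "two.sided", var.equal = TRUE)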
TTEST2<-t.test(male,female, alternative = "less",paired = FALSE,data=daatta)
TTEST2
Welch Two Sample t-test
data: male and female
t = -3.2304, df = 13.477, p-value = 0.003154
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -17.05415
sample estimates:
mean of x mean of y
149.875 187.500
In the above output, the p-value of the test is approximately 0.003154, which is less than the significance level alpha = 0.05. We can conclude that the true difference in mean weight between males and females is less than zero at the 5% level of significance.
TTEST3<-t.test(male,female, alternative = "greater",paired = FALSE,data=daatta)
TTEST3
Welch Two Sample t-test
data: male and female
t = -3.2304, df = 13.477, p-value = 0.9968
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-58.19585 Inf
sample estimates:
mean of x mean of y
149.875 187.500
In the above output, the p-value of the test is approximately 0.9968, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true difference in mean weight between males and females is greater than zero.
weight<-c(135,180,108,128,160,143,175,170,205,195,185,150,175,190,180,220)
weight
[1] 135 180 108 128 160 143 175 170 205 195 185 150 175 190 180 220
gender<-gl(2,8,labels=c("male","female"))
gender
[1] male male male male male male male male female female
[11] female female female female female female
Levels: male female
paired_data<-data.frame(weight,gender)
paired_data
TTEST4<-t.test(weight~gender, alternative = "two.sided",paired = FALSE,data=paired_data)
TTEST4
Welch Two Sample t-test
data: weight by gender
t = -3.2304, df = 13.477, p-value = 0.006308
alternative hypothesis: true difference in means between group male and group female is not equal to 0
95 percent confidence interval:
-62.69717 -12.55283
sample estimates:
mean in group male mean in group female
149.875 187.500
In the above output, the p-value of the test is approximately 0.006308, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true difference in mean weight between males and females is not equal to zero at the 5% level of significance.
TTEST5<-t.test(weight~gender, alternative = "less",paired = FALSE,data=paired_data)
TTEST5
Welch Two Sample t-test
data: weight by gender
t = -3.2304, df = 13.477, p-value = 0.003154
alternative hypothesis: true difference in means between group male and group female is less than 0
95 percent confidence interval:
-Inf -17.05415
sample estimates:
mean in group male mean in group female
149.875 187.500
In the above output, the p-value of the test is approximately 0.003154, which is less than the significance level alpha = 0.05. We can conclude that the true difference in mean weight between males and females is less than zero at the 5% level of significance.
TTEST6<-t.test(weight~gender, alternative = "greater",paired = FALSE,data=paired_data)
TTEST6
Welch Two Sample t-test
data: weight by gender
t = -3.2304, df = 13.477, p-value = 0.9968
alternative hypothesis: true difference in means between group male and group female is greater than 0
95 percent confidence interval:
-58.19585 Inf
sample estimates:
mean in group male mean in group female
149.875 187.500
In the above output, the p-value of the test is approximately 0.9968, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true difference in mean weight between males and females is greater than zero.
Analysis of Variance (ANOVA) is a statistical approach that compares the means (or averages) of different groups by partitioning the variance. It is used in a range of scenarios to determine whether there is any difference between the means of three or more groups.
head(gapminder,10)
attach(gapminder)
ANOVA<-aov(gdpPercap~continent)
summary(ANOVA)
Df Sum Sq Mean Sq F value Pr(>F)
continent 4 37990228667 9497557167 126.6 <0.0000000000000002 ***
Residuals 1699 127489276662 75037832
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 126.6 with an associated p-value of less than 0.0001, which is well below 0.05. This indicates that there is a significant difference in mean gdp per capita across continents at the 5% level of significance.
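To see the group means being compared, they can also be computed directly; a quick check that is not part of the original code:
# Mean gdp per capita by continent (the group means compared by the ANOVA above)
tapply(gdpPercap, continent, mean)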
#Test the significance of the difference # Load the package below
library(agricolae)
TK<-TukeyHSD(ANOVA)
TK
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = gdpPercap ~ continent)
$continent
diff lwr upr p adj
Americas-Africa 4942.3558 3280.4806 6604.231 0.0000000
Asia-Africa 5708.3958 4188.6340 7228.158 0.0000000
Europe-Africa 12275.7210 10710.1623 13841.280 0.0000000
Oceania-Africa 16427.8546 11507.4034 21348.306 0.0000000
Asia-Americas 766.0401 -1044.5150 2576.595 0.7767582
Europe-Americas 7333.3652 5484.2011 9182.529 0.0000000
Oceania-Americas 11485.4989 6467.6035 16503.394 0.0000000
Europe-Asia 6567.3251 4844.7530 8289.897 0.0000000
Oceania-Asia 10719.4588 5746.8215 15692.096 0.0000000
Oceania-Europe 4152.1337 -834.6909 9138.958 0.1539474
From the results above, a significant difference in mean gdp per capita exists between most pairs of continents, except between Asia and the Americas and between Oceania and Europe.
plot(TK)
plot(TK, las=1)
ANOVA2<-aov(lifeExp~continent)
summary(ANOVA2)
Df Sum Sq Mean Sq F value Pr(>F)
continent 4 139343 34836 408.7 <0.0000000000000002 ***
Residuals 1699 144805 85
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, the F statistic is 408.7 with an associated p-value of less than 0.0001, which is well below 0.05. This indicates that there is a significant difference in mean life expectancy across continents at the 5% level of significance.
TK2<-TukeyHSD(ANOVA2)
TK2
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = lifeExp ~ continent)
$continent
diff lwr upr p adj
Americas-Africa 15.793407 14.022263 17.564550 0.0000000
Asia-Africa 11.199573 9.579887 12.819259 0.0000000
Europe-Africa 23.038356 21.369862 24.706850 0.0000000
Oceania-Africa 25.460878 20.216908 30.704848 0.0000000
Asia-Americas -4.593833 -6.523432 -2.664235 0.0000000
Europe-Americas 7.244949 5.274203 9.215696 0.0000000
Oceania-Americas 9.667472 4.319650 15.015293 0.0000086
Europe-Asia 11.838783 10.002952 13.674614 0.0000000
Oceania-Asia 14.261305 8.961718 19.560892 0.0000000
Oceania-Europe 2.422522 -2.892185 7.737230 0.7250559
From the results above, a significant difference in mean life expectancy exists between all pairs of continents except between Oceania and Europe.
plot(TK2)
plot(TK2, las=1)
library(haven)
MAIZE_YIELDS_two_way_ANOVA <- read_dta("D:/Data/data and other training stuff/STATA training DATA set/MAIZE YIELDS two way ANOVA.dta")
head(MAIZE_YIELDS_two_way_ANOVA,10)
my_2anova<-MAIZE_YIELDS_two_way_ANOVA
str(my_2anova)
tibble [24 × 3] (S3: tbl_df/tbl/data.frame)
$ maize_yields: num [1:24] 18.5 11.7 15.4 16.5 15.7 14.2 16.6 18.5 16.2 12.9 ...
..- attr(*, "format.stata")= chr "%10.0g"
$ treatment : dbl+lbl [1:24] 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5...
..@ format.stata: chr "%10.0g"
..@ labels : Named num [1:6] 1 2 3 4 5 6
.. ..- attr(*, "names")= chr [1:6] "TR 1" "TR 2" "TR 3" "TR 4" ...
$ blocks : dbl+lbl [1:24] 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3...
..@ format.stata: chr "%10.0g"
..@ labels : Named num [1:4] 1 2 3 4
.. ..- attr(*, "names")= chr [1:4] "BLOCK 1" "BLOCK 2" "BLOCK 3" "BLOCK 4"
my_2anova
attach(my_2anova)
my_2anova$treatment <- as.factor(my_2anova$treatment)
my_2anova$blocks <- as.factor(my_2anova$blocks)
str(my_2anova)
tibble [24 × 3] (S3: tbl_df/tbl/data.frame)
$ maize_yields: num [1:24] 18.5 11.7 15.4 16.5 15.7 14.2 16.6 18.5 16.2 12.9 ...
..- attr(*, "format.stata")= chr "%10.0g"
$ treatment : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
$ blocks : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
modell<-lm(maize_yields~treatment+blocks,data=my_2anova)
modell
Call:
lm(formula = maize_yields ~ treatment + blocks, data = my_2anova)
Coefficients:
(Intercept) treatment2 treatment3 treatment4 treatment5 treatment6
14.921 0.725 -1.200 0.575 0.675 0.900
blocks2 blocks3 blocks4
-1.467 2.767 1.117
summary(modell)
Call:
lm(formula = maize_yields ~ treatment + blocks, data = my_2anova)
Residuals:
Min 1Q Median 3Q Max
-2.5958 -1.7688 0.0292 1.2313 3.5792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.921 1.410 10.585 0.0000000235 ***
treatment2 0.725 1.628 0.445 0.6624
treatment3 -1.200 1.628 -0.737 0.4723
treatment4 0.575 1.628 0.353 0.7288
treatment5 0.675 1.628 0.415 0.6842
treatment6 0.900 1.628 0.553 0.5884
blocks2 -1.467 1.329 -1.104 0.2872
blocks3 2.767 1.329 2.082 0.0549 .
blocks4 1.117 1.329 0.840 0.4140
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.302 on 15 degrees of freedom
Multiple R-squared: 0.4681, Adjusted R-squared: 0.1843
F-statistic: 1.65 on 8 and 15 DF, p-value: 0.192
two_way<-anova(modell)
two_way
From the results above, the F statistic for treatment is 0.4672 with an associated p-value of approximately 0.79478, which is greater than 0.05. This indicates that there is no significant difference in mean maize yield across treatments at the 5% level of significance. On the other hand, the F statistic for blocks is 3.6208 with an associated p-value of 0.03802, which is less than 0.05. This indicates that there is a significant difference in mean maize yield across blocks at the 5% level of significance.
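If preferred, the same two-way table can be obtained by fitting the model directly with aov(); a minimal sketch, with the object name two_way_aov chosen only for illustration and its output omitted here:
# Two-way ANOVA fitted directly with aov(); gives the same F tests as anova(modell)
two_way_aov <- aov(maize_yields ~ treatment + blocks, data = my_2anova)
summary(two_way_aov)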
A Pearson’s correlation is used when the two statistics we want to analyze are both quantitative. This means we will be comparing quantitative variables to find a linear relationship (if the variables have a nonlinear relationship, a correlation is not appropriate). The Pearson correlation measures the strength of the linear relationship between two variables. It takes a value between -1 and 1, with -1 meaning a perfect negative linear correlation, 0 meaning no correlation, and +1 meaning a perfect positive linear correlation.
data("gapminder")
head(gapminder,10)
cor.test(gdpPercap, lifeExp, method = "pearson")
Pearson's product-moment correlation
data: gdpPercap and lifeExp
t = 29.658, df = 1702, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5515065 0.6141690
sample estimates:
cor
0.5837062
“t” is the value of the test statistic (t = 29.658). The p-value of the test is extremely small (p-value < 0.00000000000000022). The alternative hypothesis is a character string describing the alternative hypothesis (true correlation is not equal to 0). Sample estimates gives the correlation coefficient; for the Pearson correlation it is named cor (cor = 0.5837). From the results we therefore reject the null hypothesis and conclude that there is a moderate positive correlation between life expectancy and gdp per capita at the 5% level of significance.
library(haven)
Pearson <- read_sav("C:/Users/user/Downloads/Pearson.sav")
head(Pearson,10)
length(Pearson)
[1] 2
str(Pearson)
tibble [300 × 2] (S3: tbl_df/tbl/data.frame)
$ mpg : num [1:300] 18 15 18 16 17 15 14 14 14 15 ...
..- attr(*, "format.spss")= chr "F4.1"
$ weight: num [1:300] 3504 3693 3436 3433 3449 ...
..- attr(*, "format.spss")= chr "F4.0"
attach(Pearson)
The following object is masked _by_ .GlobalEnv:
weight
The following object is masked from package:ggplot2:
mpg
cor(Pearson)
mpg weight
mpg 1.0000000 -0.8782815
weight -0.8782815 1.0000000
cor.test(Pearson$mpg, Pearson$weight, method = "pearson", data=Pearson)
Pearson's product-moment correlation
data: Pearson$mpg and Pearson$weight
t = -31.709, df = 298, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9018288 -0.8495329
sample estimates:
cor
-0.8782815
“t” is the value of the test statistic (t = -31.709). The p-value of the test is extremely small (p-value < 0.00000000000000022). The alternative hypothesis is a character string describing the alternative hypothesis (true correlation is not equal to 0). Sample estimates gives the correlation coefficient; for the Pearson correlation it is named cor (cor = -0.8783). From the results we therefore reject the null hypothesis and conclude that the true correlation is not equal to zero. Further, the correlation coefficient shows a strong negative correlation between mpg and weight at the 5% level of significance.
There is sufficient evidence to conclude that there is a significant linear relationship between mpg and weight because the correlation coefficient is significantly different from zero.
plot(Pearson$mpg, Pearson$weight)
knitr::include_graphics("C:/Users/user/Downloads/my_scatter.png")
Warning in knitr::include_graphics("C:/Users/user/Downloads/my_scatter.png"): It
is highly recommended to use relative paths for images. You had absolute paths:
"C:/Users/user/Downloads/my_scatter.png"
My scatter plot showing negative correlation
knitr::include_graphics("my_scatter.png")
My scatter plot showing negative correlation
library(devtools)
Loading required package: usethis
my_tbl <- tibble::tribble(
~ln_population, ~ln_life_exp, ~ln_gdp_cap,
6.925587075, 1.45940757, 2.891786,
6.965715868, 1.48190105, 2.914265,
7.011447073, 1.50510926, 2.931,
7.062129255, 1.53173431, 2.922309,
7.116589814, 1.55736281, 2.869221,
7.172613788, 1.58476078, 2.895485,
7.109977092, 1.60047192, 2.990344,
7.142012486, 1.61089428, 2.930641,
7.212664826, 1.61986519, 2.812473,
7.346888958, 1.62079169, 2.803007,
7.402577829, 1.62458115, 2.861376,
7.503653471, 1.64175165, 2.988818,
6.108124079, 1.74217504, 3.204407,
6.169234922, 1.77290819, 3.288313,
6.237578169, 1.81170903, 3.364155,
6.297554802, 1.82098918, 3.44094,
6.35479086, 1.83052451, 3.520277,
6.39950897, 1.83840828, 3.548144,
6.444059949, 1.84769602, 3.560012
)
require(knitr)
kable(my_tbl, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| ln_population | ln_life_exp | ln_gdp_cap |
|---|---|---|
| 6.926 | 1.459 | 2.892 |
| 6.966 | 1.482 | 2.914 |
| 7.011 | 1.505 | 2.931 |
| 7.062 | 1.532 | 2.922 |
| 7.117 | 1.557 | 2.869 |
| 7.173 | 1.585 | 2.895 |
| 7.110 | 1.600 | 2.990 |
| 7.142 | 1.611 | 2.931 |
| 7.213 | 1.620 | 2.812 |
| 7.347 | 1.621 | 2.803 |
| 7.403 | 1.625 | 2.861 |
| 7.504 | 1.642 | 2.989 |
| 6.108 | 1.742 | 3.204 |
| 6.169 | 1.773 | 3.288 |
| 6.238 | 1.812 | 3.364 |
| 6.298 | 1.821 | 3.441 |
| 6.355 | 1.831 | 3.520 |
| 6.400 | 1.838 | 3.548 |
| 6.444 | 1.848 | 3.560 |
attach(my_tbl)
my_model<-lm(ln_life_exp~ln_gdp_cap,data=my_tbl)
stargazer(my_model,type="text")
===============================================
Dependent variable:
---------------------------
ln_life_exp
-----------------------------------------------
ln_gdp_cap 0.430***
(0.050)
Constant 0.328**
(0.155)
-----------------------------------------------
Observations 19
R2 0.813
Adjusted R2 0.802
Residual Std. Error 0.058 (df = 17)
F Statistic 73.702*** (df = 1; 17)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
exp(0.328) # back-transform the intercept of the log-log model from the log scale
[1] 1.388189
my_tbl1 <- tibble::tribble(
~Year, ~Income, ~Consumption,
1950, 791.8, 733.2,
1951, 819, 748.7,
1952, 844.3, 771.4,
1953, 880, 802.5,
1954, 894, 822.7,
1955, 944.5, 873.8,
1956, 989.4, 899.8,
1957, 1012.1, 919.7,
1958, 1028.8, 932.9,
1959, 1067.2, 979.4,
1960, 1091.1, 1005.1,
1961, 1123.2, 1025.2,
1962, 1170.2, 1069,
1963, 1207.3, 1108.4,
1964, 1291, 1170.6,
1965, 1365.7, 1236.4,
1966, 1431.3, 1298.9,
1967, 1493.2, 1337.7,
1968, 1551.3, 1405.9,
1969, 1599.8, 1456.7,
1970, 1688.1, 1492,
1971, 1728.4, 1538.8,
1972, 1797.4, 1621.9,
1973, 1916.3, 1689.6,
1974, 1896.6, 1674,
1975, 1931.7, 1711.9,
1976, 2001, 1803.9,
1977, 2066.6, 1883.8,
1978, 2167.4, 1961,
1979, 2216.2, 2004.4
)
require(knitr)
kable(my_tbl1, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| Year | Income | Consumption |
|---|---|---|
| 1950 | 791.8 | 733.2 |
| 1951 | 819.0 | 748.7 |
| 1952 | 844.3 | 771.4 |
| 1953 | 880.0 | 802.5 |
| 1954 | 894.0 | 822.7 |
| 1955 | 944.5 | 873.8 |
| 1956 | 989.4 | 899.8 |
| 1957 | 1012.1 | 919.7 |
| 1958 | 1028.8 | 932.9 |
| 1959 | 1067.2 | 979.4 |
| 1960 | 1091.1 | 1005.1 |
| 1961 | 1123.2 | 1025.2 |
| 1962 | 1170.2 | 1069.0 |
| 1963 | 1207.3 | 1108.4 |
| 1964 | 1291.0 | 1170.6 |
| 1965 | 1365.7 | 1236.4 |
| 1966 | 1431.3 | 1298.9 |
| 1967 | 1493.2 | 1337.7 |
| 1968 | 1551.3 | 1405.9 |
| 1969 | 1599.8 | 1456.7 |
| 1970 | 1688.1 | 1492.0 |
| 1971 | 1728.4 | 1538.8 |
| 1972 | 1797.4 | 1621.9 |
| 1973 | 1916.3 | 1689.6 |
| 1974 | 1896.6 | 1674.0 |
| 1975 | 1931.7 | 1711.9 |
| 1976 | 2001.0 | 1803.9 |
| 1977 | 2066.6 | 1883.8 |
| 1978 | 2167.4 | 1961.0 |
| 1979 | 2216.2 | 2004.4 |
attach(my_tbl1)
The following object is masked from ForDI:
Year
tail(my_tbl1)
head(my_tbl1)
str(my_tbl)
tibble [19 × 3] (S3: tbl_df/tbl/data.frame)
$ ln_population: num [1:19] 6.93 6.97 7.01 7.06 7.12 ...
$ ln_life_exp : num [1:19] 1.46 1.48 1.51 1.53 1.56 ...
$ ln_gdp_cap : num [1:19] 2.89 2.91 2.93 2.92 2.87 ...
Statistical techniques are tools that enable us to answer questions about possible patterns in empirical data. It is not surprising, then, to learn that many important techniques of statistical analysis were developed by scientists who were interested in answering very specific empirical questions. So it was with regression analysis. The history of this particular statistical technique can be traced back to late nineteenth-century England and the pursuits of a gentleman scientist, Francis Galton. Galton was born into a wealthy family that produced more than its share of geniuses; he and Charles Darwin, the famous biologist, were first cousins. During his lifetime, Galton studied everything from fingerprint classification to meteorology, but he gained widespread recognition primarily for his work on inheritance. His most important insight came to him while he was studying the inheritance of one of the most obvious of all human characteristics: height. In order to understand how the characteristic of height was passed from one generation to the next, Galton collected data on the heights of individuals and the heights of their parents. After constructing frequency tables that classified these individuals both by their height and by the average height of their parents, Galton came to the unremarkable conclusion that tall people usually had tall parents and short people usually had short parents.
myy_model<-lm(Consumption~Income,data = my_tbl1)
stargazer(myy_model,type="text")
===============================================
Dependent variable:
---------------------------
Consumption
-----------------------------------------------
Income 0.881***
(0.006)
Constant 31.935***
(9.004)
-----------------------------------------------
Observations 30
R2 0.999
Adjusted R2 0.999
Residual Std. Error 14.858 (df = 28)
F Statistic 20,657.590*** (df = 1; 28)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
myy_model<-lm(log(Consumption)~log(Income),data = my_tbl1)
stargazer(myy_model,type="text")
===============================================
Dependent variable:
---------------------------
log(Consumption)
-----------------------------------------------
log(Income) 0.972***
(0.006)
Constant 0.107**
(0.040)
-----------------------------------------------
Observations 30
R2 0.999
Adjusted R2 0.999
Residual Std. Error 0.010 (df = 28)
F Statistic 30,476.920*** (df = 1; 28)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
According to Relative Income Theory (RIT), the level of consumption expenditures is not determined by the absolute level of income but by the relative level of income, with the APC declining as relative income increases. More specifically, the RIT argues that the level of consumption spending is determined by the household’s level of current income relative to the highest level of income previously earned. Symbolically, it is shown:
knitr::include_graphics("consumption.png")
Symbolic Representation of RIT
where C represents the current level of consumption expenditures, Y, the current level of income, Yh, the highest level of income previously earned, and a and b, represent numerical constants which relate income to consumption. From the above equation, we find that when the households experience a temporary and short-run increase in current income above its previous peak level of income, it increases its consumption expenditures by an amount which is less than proportional to the increase in current income.
Consequently, when current income rises relative to peak income, the APC declines and the increase in total consumption expenditures is not proportional to the increase in total income. Again, when a household experiences current and peak income growing by the same percentage amount, it increases its consumption expenditures by an amount which is proportional to the increase in current income.
Consequently, the APC remains constant and the increase in total consumption expenditure is proportional to the increase in total income. Thus, according to the RIT, changes in current consumption are not proportional to the changes in current income only when current income increases relative to previous peak income.
If current and peak income grow together, changes in consumption are always proportional to the changes in income. However, it must be noted that RIT works for decreases as well as increases in the level of current income. The RIT is fundamentally different from AIT. The RIT explains away the short-run consumption function as a result of temporary deviations in current income, while the AIT explains away the long-run consumption function as the result of factors other than income on consumption.
From the results of the levels model above, autonomous consumption is about 31.935 US dollars. A one-dollar increase in income is associated with about a 0.881-dollar increase in consumption, i.e. a marginal propensity to consume of roughly 0.88. The effect is statistically significant at the 1% level of significance.
Pre<-predict(myy_model,data.frame(Income=2000)) # myy_model now holds the log-log specification, so this predicts log(Consumption) at Income = 2000
Pre
1
7.491265
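Because the prediction above is on the log scale, a small extra step, not in the original script, back-transforms it to consumption in levels:
# Back-transform the log-scale prediction to the original units of Consumption
exp(Pre) # approximately exp(7.49), i.e. about 1792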
my_tbl2021 <- tibble::tribble(
~YEAR, ~GDP, ~PIV, ~ICT, ~VAT, ~IMPTT,
1973, 15790, 3645, 1124.8, 639.8, 795.44,
1974, 18776, 4075, 1531.4, 937.26, 842.24,
1975, 21140, 4837, 1796.8, 1185.48, 983.62,
1976, 25562, 5808, 2149.4, 1308.44, 1057.18,
1977, 32699, 7800, 2846.8, 1855.26, 2083.94,
1978, 35601, 10280, 3121.4, 1995.38, 2025.48,
1979, 39543, 10809, 3437, 3098.14, 2049.64,
1980, 44648, 12451, 3951.6, 3588, 2919,
1981, 51641, 14508, 3993.4, 3895.9, 3674.24,
1982, 58214, 13364, 4624.6, 3917.5, 3305.84,
1983, 66218, 14349, 5023, 5062.38, 3424.38,
1984, 72550, 16143, 6019.4, 5471, 3043.58,
1985, 100831, 17631, 7162.4, 6065.86, 4236.8,
1986, 117472, 23064, 7714.6, 7950.4, 4934.2,
1987, 131169, 25735, 9089.6, 10399.14, 5473.72,
1988, 151194, 30359, 10240.4, 9774.92, 10240.5,
1989, 171589, 33056, 11983, 9461.94, 11983.06,
1990, 195536, 40560, 14261.6, 12140.28, 14261.56,
1991, 224232, 42671, 17027.8, 18555.4, 5118.78,
1992, 264476, 43777, 19970.4, 22142.72, 9183,
1993, 333616, 56580, 36767.2, 28994.34, 14792.78,
1994, 400700, 75616, 43505.8, 24533.86, 18598.28,
1995, 465654, 99497, 48082.3, 28403.72, 21175.68,
1996, 687998, 110142, 48375.02, 29850.08, 22594.06,
1997, 770312, 118535, 55577.9, 34468.1, 24567.06,
1998, 850808, 133366, 55234.9, 90204.76, 28443.93,
1999, 906928, 141403, 53316.99, 40944.11, 28605.16,
2000, 967838, 161714, 53428.93, 50220.92, 28803.74,
2001, 1020020, 185186, 55861.95, 50871.68, 27302,
2002, 1035370, 178466, 70140.28, 56135.25, 24936.09,
2003, 1138060, 179254, 77409.73, 58853.08, 25214,
2004, 1286460, 207196, 99312.47, 75995.66, 23531.72,
2005, 1445480, 264912, 114629.06, 79925.91, 20511.43,
2006, 1642400, 309402, 130719, 96497.01, 27927,
2007, 1833511, 309781, 165078, 111904.51, 32944.35,
2008, 2111173, 239525, 194155, 126854, 36181,
2009, 2365453, 292488, 288168, 146791, 41372,
2010, 2551161, 238938, 268291, 172360, 48903,
2011, 3725918, 213845, 290621.6, 163940.39, 48436.93,
2012, 4261370, 242365, 353693.71, 174716.17, 55397.81,
2013, 4745594, 234586, 422357.86, 218297.19, 61692.72,
2014, 5403471, 273641, 470101.98, 243156.2, 70245.12,
2015, 6284191, 293645, 540440.43, 276504.4, 75410.29,
2016, 7194163, 261548, 589921.37, 287766.52, 86329.96,
2017, 7749435, 269542, 557959.32, 309977.4, 85243.79,
2018, 7946248, 256345, 683377.33, 381419.9, 95354.98
)
require(knitr)
kable(my_tbl2021, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| YEAR | GDP | PIV | ICT | VAT | IMPTT |
|---|---|---|---|---|---|
| 1973 | 15790 | 3645 | 1124.80 | 639.80 | 795.44 |
| 1974 | 18776 | 4075 | 1531.40 | 937.26 | 842.24 |
| 1975 | 21140 | 4837 | 1796.80 | 1185.48 | 983.62 |
| 1976 | 25562 | 5808 | 2149.40 | 1308.44 | 1057.18 |
| 1977 | 32699 | 7800 | 2846.80 | 1855.26 | 2083.94 |
| 1978 | 35601 | 10280 | 3121.40 | 1995.38 | 2025.48 |
| 1979 | 39543 | 10809 | 3437.00 | 3098.14 | 2049.64 |
| 1980 | 44648 | 12451 | 3951.60 | 3588.00 | 2919.00 |
| 1981 | 51641 | 14508 | 3993.40 | 3895.90 | 3674.24 |
| 1982 | 58214 | 13364 | 4624.60 | 3917.50 | 3305.84 |
| 1983 | 66218 | 14349 | 5023.00 | 5062.38 | 3424.38 |
| 1984 | 72550 | 16143 | 6019.40 | 5471.00 | 3043.58 |
| 1985 | 100831 | 17631 | 7162.40 | 6065.86 | 4236.80 |
| 1986 | 117472 | 23064 | 7714.60 | 7950.40 | 4934.20 |
| 1987 | 131169 | 25735 | 9089.60 | 10399.14 | 5473.72 |
| 1988 | 151194 | 30359 | 10240.40 | 9774.92 | 10240.50 |
| 1989 | 171589 | 33056 | 11983.00 | 9461.94 | 11983.06 |
| 1990 | 195536 | 40560 | 14261.60 | 12140.28 | 14261.56 |
| 1991 | 224232 | 42671 | 17027.80 | 18555.40 | 5118.78 |
| 1992 | 264476 | 43777 | 19970.40 | 22142.72 | 9183.00 |
| 1993 | 333616 | 56580 | 36767.20 | 28994.34 | 14792.78 |
| 1994 | 400700 | 75616 | 43505.80 | 24533.86 | 18598.28 |
| 1995 | 465654 | 99497 | 48082.30 | 28403.72 | 21175.68 |
| 1996 | 687998 | 110142 | 48375.02 | 29850.08 | 22594.06 |
| 1997 | 770312 | 118535 | 55577.90 | 34468.10 | 24567.06 |
| 1998 | 850808 | 133366 | 55234.90 | 90204.76 | 28443.93 |
| 1999 | 906928 | 141403 | 53316.99 | 40944.11 | 28605.16 |
| 2000 | 967838 | 161714 | 53428.93 | 50220.92 | 28803.74 |
| 2001 | 1020020 | 185186 | 55861.95 | 50871.68 | 27302.00 |
| 2002 | 1035370 | 178466 | 70140.28 | 56135.25 | 24936.09 |
| 2003 | 1138060 | 179254 | 77409.73 | 58853.08 | 25214.00 |
| 2004 | 1286460 | 207196 | 99312.47 | 75995.66 | 23531.72 |
| 2005 | 1445480 | 264912 | 114629.06 | 79925.91 | 20511.43 |
| 2006 | 1642400 | 309402 | 130719.00 | 96497.01 | 27927.00 |
| 2007 | 1833511 | 309781 | 165078.00 | 111904.51 | 32944.35 |
| 2008 | 2111173 | 239525 | 194155.00 | 126854.00 | 36181.00 |
| 2009 | 2365453 | 292488 | 288168.00 | 146791.00 | 41372.00 |
| 2010 | 2551161 | 238938 | 268291.00 | 172360.00 | 48903.00 |
| 2011 | 3725918 | 213845 | 290621.60 | 163940.39 | 48436.93 |
| 2012 | 4261370 | 242365 | 353693.71 | 174716.17 | 55397.81 |
| 2013 | 4745594 | 234586 | 422357.86 | 218297.19 | 61692.72 |
| 2014 | 5403471 | 273641 | 470101.98 | 243156.20 | 70245.12 |
| 2015 | 6284191 | 293645 | 540440.43 | 276504.40 | 75410.29 |
| 2016 | 7194163 | 261548 | 589921.37 | 287766.52 | 86329.96 |
| 2017 | 7749435 | 269542 | 557959.32 | 309977.40 | 85243.79 |
| 2018 | 7946248 | 256345 | 683377.33 | 381419.90 | 95354.98 |
head(my_tbl2021,10)
summary(my_tbl2021)
YEAR GDP PIV ICT
Min. :1973 Min. : 15790 Min. : 3645 Min. : 1125
1st Qu.:1984 1st Qu.: 79620 1st Qu.: 16515 1st Qu.: 6305
Median :1996 Median : 576826 Median :104820 Median : 48229
Mean :1996 Mean :1542657 Mean :124401 Mean :128339
3rd Qu.:2007 3rd Qu.:1785733 3rd Qu.:237850 3rd Qu.:156488
Max. :2018 Max. :7946248 Max. :309781 Max. :683377
VAT IMPTT
Min. : 639.8 Min. : 795.4
1st Qu.: 5619.7 1st Qu.: 3814.9
Median : 29422.2 Median :20843.6
Mean : 75848.5 Mean :25351.1
3rd Qu.:108052.6 3rd Qu.:31909.2
Max. :381419.9 Max. :95355.0
attach(my_tbl2021)
MDL<-lm(log(PIV)~log(ICT)+log(VAT)+log(IMPTT),data=my_tbl2021)
stargazer(MDL,type="text")
===============================================
Dependent variable:
---------------------------
log(PIV)
-----------------------------------------------
log(ICT) -0.135
(0.177)
log(VAT) 0.647***
(0.211)
log(IMPTT) 0.356**
(0.162)
Constant 2.532***
(0.398)
-----------------------------------------------
Observations 46
R2 0.958
Adjusted R2 0.955
Residual Std. Error 0.303 (df = 42)
F Statistic 317.211*** (df = 3; 42)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
From the analysis it was found that value added tax has a significantly positive effect on the level of private investment in Kenya. This conclusion is based on the positive coefficient of value added tax. Other studies show that an increase in value added tax increases the level of private investment (Adejare, 2017), which is an indicator that an increase in value added tax increases the level of private investment. According to Gitahi (2010), however, value added tax negatively influences the level of private investment. From the table of coefficients, value added tax had a statistically significant effect on the level of private investment, as opposed to the null hypothesis, which stated that value added tax has no statistically significant effect on the level of private investment. This implies that we reject the null hypothesis and conclude that value added tax has a statistically significant effect on the level of private investment at the 5% level of significance. Because the model is specified in logs, the coefficient of 0.647 indicates that a 1% increase in value added tax is associated with approximately a 0.647% increase in the level of private investment.
From table 4.8, the coefficient for income tax shows that income tax has a negative effect on the level of private investment. The argument here is that an increase in the rate of income tax constrains private investment. The assumption, following Keynesian economists, is that an increase in income tax reduces disposable income. A decrease in disposable income implies a decrease in the level of saving, and whenever the level of saving decreases, private investment decreases, because Keynes (1936) argued that in the long run savings equal investment. An increase in the income tax charged results in a decrease in the level of private investment in a country (Gitahi, 2010). The results in the coefficient table show that the effect of income tax on private investment is significant: the p-value for income tax, given as 0.028, is less than 0.05, implying that we reject the null hypothesis and conclude that income tax has a statistically significant effect on the level of private investment at the 5% level of significance. Because the Kenyan government subjects its citizens to higher income tax rates than other countries in the region, capital flight to other countries has accelerated, shrinking the level of private investment in Kenya.
The effect of import tax on private investment was found to be positive and statistically significant. In relation to the existing literature, the results show that import tax encouraged rather than deterred private investment. Newbery (1987) and Coady (1997) argued that import taxes and barriers on international trade of any sort tend to encourage domestic production of final consumer goods while permitting relatively free imports of capital or intermediate goods. This tends to be associated with high rates of effective protection and high costs of domestic production, and it creates a bias against exports. Import tax charged on final consumer goods leads to the production of import substitutes. The production of import substitutes in the country enables the expansion of local business enterprises, leading to an increase in the level of private investment (Gitahi, 2010). Consequently, while reducing the dependence of the country on imports of final consumption goods, the economy becomes highly dependent on imports of intermediate goods. This scenario leads to the expansion of local business enterprises and consequently increases aggregate demand, which is reflected in the positive effect of import tax on private investment. From table 4.8, import tax is statistically significant, as shown by the p-value of 0.034, which is less than 0.05, implying that we reject the null hypothesis and conclude that import tax has a statistically significant effect on the level of private investment in Kenya at the 5% level of significance.
RESID<-residuals(MDL)
mean(RESID)
[1] -0.00000000000000001027399
cov(RESID,log(ICT))
[1] 0.0000000000000000004384995
cov(RESID,log(VAT))
[1] -0.000000000000000001207078
cov(RESID,log(IMPTT))
[1] 0.00000000000000000800864
The residuals of the fitted model have a mean of essentially zero and are uncorrelated with each of the regressors; the tiny non-zero values above are only floating-point rounding, confirming the usual OLS residual properties.
Rehearsal_and_non_rehearsal_groups <- read_sav("D:/Training/Data/Rehearsal and non rehearsal groups.sav")
head(Rehearsal_and_non_rehearsal_groups,10)
data222<-Rehearsal_and_non_rehearsal_groups
head(data222,10)
attach(data222)
var.test(No.ofwordsrecalled~Groups, ratio=1,
alternative=c("two.sided","less","greater"),
conf.level=0.95, data=data222)
F test to compare two variances
data: No.ofwordsrecalled by Groups
F = 1.1019, num df = 9, denom df = 9, p-value = 0.8875
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2736844 4.4360491
sample estimates:
ratio of variances
1.101852
In the above output, the p-value of the test is 0.8875, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis of equal variances: there is no evidence of a difference in variance between the two groups (p-value = 0.8875).
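Because the F test gives no evidence against equal variances, a pooled-variance (Student's) two-sample t-test would be a reasonable follow-up; a minimal sketch using the same variables, output not shown here:
# Pooled-variance t-test, justified by the equal-variance F test above
t.test(No.ofwordsrecalled ~ Groups, var.equal = TRUE, data = data222)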
The Kruskal-Wallis test, proposed by Kruskal and Wallis in 1952, is a nonparametric method for testing whether samples originate from the same distribution. It extends the Mann-Whitney U test to more than two groups. The null hypothesis of the Kruskal-Wallis test is that the mean ranks of the groups are the same. As the nonparametric equivalent of one-way ANOVA, the Kruskal-Wallis test is also called one-way ANOVA on ranks. Unlike the analogous one-way ANOVA, the nonparametric Kruskal-Wallis test does not assume a normal distribution of the underlying data. For this reason the Kruskal-Wallis test is well suited to the analysis of microbiome data: post-sequencing microbiome data are often not normally distributed and contain strong outliers, so it is more appropriate to use ranks rather than actual values to avoid the test being affected by outliers or by the non-normal distribution of the data.
The Kruskal-Wallis test is the nonparametric alternative to the one-way ANOVA. Nonparametric means that the test doesn't assume your data come from a particular distribution. The H test is used when the assumptions for ANOVA aren't met (such as the assumption of normality). It is sometimes called the one-way ANOVA on ranks, as the ranks of the data values are used in the test rather than the actual data points.
H_test_miles_per_gassolin <- read_sav("D:/Data/data and other training stuff/SPSS_training data sets/H-test ,miles per gassolin.sav")
data333<-H_test_miles_per_gassolin
head(data333,10)
attach(data333)
data333$gassolin<-as.factor(data333$gassolin) # convert the grouping variable to a factor inside the data frame used in the formula
kruskal.test(miles~gassolin, data=data333)
Kruskal-Wallis rank sum test
data: miles by gassolin
Kruskal-Wallis chi-squared = 0.86831, df = 2, p-value = 0.6478
In the above output, the p-value of the test is 0.6478, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: the median miles covered using the three types of gasoline are not significantly different (p-value = 0.6478).
The paired samples Wilcoxon test (also known as the Wilcoxon signed-rank test) is a non-parametric alternative to the paired t-test, used to compare paired data. It is used when the data are not normally distributed. This tutorial describes how to compute the paired samples Wilcoxon test in R.
before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)
after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)
myData <- data.frame(
group = rep(c("before", "after"), each = 10),
weight = c(before, after)
)
head(myData,10)
attach(myData)
The following object is masked _by_ .GlobalEnv:
weight
The following object is masked from Pearson:
weight
wilcox.test(weight~group, data=myData)
Wilcoxon rank sum exact test
data: weight by group
W = 98, p-value = 0.0000433
alternative hypothesis: true location shift is not equal to 0
In the above output, the p-value of the test is 0.0000433, which is less than the significance level alpha = 0.05. We can conclude that the median weight of the mice before treatment is significantly different from the median weight after treatment (p-value = 4.33e-05). Note that, because paired = TRUE was not specified, this is the unpaired Wilcoxon rank sum test; the paired versions follow below.
result = wilcox.test(weight ~ group,
data = myData,
paired = TRUE,
alternative = "less")
print(result)
Wilcoxon signed rank exact test
data: weight by group
V = 55, p-value = 1
alternative hypothesis: true location shift is less than 0
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the median weight of the mice after treatment is less than the median weight before treatment (p-value = 1).
result2 = wilcox.test(weight ~ group,
data = myData,
paired = TRUE,
alternative = "greater")
print(result2)
Wilcoxon signed rank exact test
data: weight by group
V = 55, p-value = 0.0009766
alternative hypothesis: true location shift is greater than 0
In the above output, the p-value of the test is 0.0009766, which is less than the significance level alpha = 0.05. We can conclude that the median weight of the mice after treatment is significantly greater than the median weight before treatment (p-value = 0.0009766).
weight <-c(27.6,30.6,32.2,25.3,30.9,31.0,28.9,28.9,28.9,28.2)
print(weight)
[1] 27.6 30.6 32.2 25.3 30.9 31.0 28.9 28.9 28.9 28.2
result3 = wilcox.test(weight, mu = 25)
Warning in wilcox.test.default(weight, mu = 25): cannot compute exact p-value
with ties
print(result3)
Wilcoxon signed rank test with continuity correction
data: weight
V = 55, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
wilcox.test(myData$weight, mu = 25,
alternative = "less")
Wilcoxon signed rank exact test
data: myData$weight
V = 210, p-value = 1
alternative hypothesis: true location is less than 25
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the median of the combined weight measurements is less than 25 (p-value = 1). Note that this is a one-sample test of myData$weight against mu = 25, not a comparison of before and after treatment.
wilcox.test(myData$weight, mu = 25,
alternative = "greater")
Wilcoxon signed rank exact test
data: myData$weight
V = 210, p-value = 0.0000009537
alternative hypothesis: true location is greater than 25
In the above output, the p-value of the test is approximately 0.000001, which is far less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the median of the weight measurements is significantly greater than 25 (p-value ≈ 9.5e-07).
The sign test is used to test the null hypothesis that the median of a distribution is equal to some value. It can be used a) in place of a one-sample t-test b) in place of a paired t-test or c) for ordered categorical data where a numerical scale is inappropriate but where it is possible to rank the observations.
weight <-c(27.6,30.6,32.2,25.3,30.9,31.0,28.9,28.9,28.9,28.2)
print(weight)
[1] 27.6 30.6 32.2 25.3 30.9 31.0 28.9 28.9 28.9 28.2
library(BSDA)
Loading required package: lattice
Attaching package: 'BSDA'
The following object is masked _by_ '.GlobalEnv':
Pearson
The following object is masked from 'data222':
Groups
The following object is masked from 'my_tbl1':
Income
The following object is masked from 'Superstore_data':
Region
The following objects are masked from 'package:carData':
Vocab, Wool
The following object is masked from 'package:datasets':
Orange
result4 = SIGN.test(weight, md = 25) # the hypothesised median is passed via the md argument
print(result4)
One-sample Sign-Test
data: weight
s = 10, p-value = 0.001953
alternative hypothesis: true median is not equal to 25
95 percent confidence interval:
27.79467 30.96756
sample estimates:
median of x
28.9
Achieved and Interpolated Confidence Intervals:
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.8906 28.2000 30.9000
Interpolated CI 0.9500 27.7947 30.9676
Upper Achieved CI 0.9785 27.6000 31.0000
In the above output, the p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis and conclude that the true median weight is not equal to 25; the sample median is 28.9.
before <-c(190.1, 190.9, 172.7, 213, 231.4,
196.9, 172.2, 285.5, 225.2, 113.7)
after <-c(392.9, 313.2, 345.1, 393, 434,
227.9, 422, 383.9, 392.3, 352.2)
myDatta<-data.frame(before,after)
head(myDatta,10)
SIGN.test(before,after,md=0, alternative ="greater",paired=T, data=myDatta)
Dependent-samples Sign-Test
data: before and after
S = 0, p-value = 1
alternative hypothesis: true median difference is greater than 0
95 percent confidence interval:
-206.608 Inf
sample estimates:
median of x-y
-176.2
Achieved and Interpolated Confidence Intervals:
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.9453 -202.800 Inf
Interpolated CI 0.9500 -206.608 Inf
Upper Achieved CI 0.9893 -238.500 Inf
In the above output, the p-value of the test is 1, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no evidence that the true median difference between weight before and after treatment is greater than zero (p-value = 1).
SIGN.test(before,after,md=0, alternative ="less",paired=T, data=myDatta)
Dependent-samples Sign-Test
data: before and after
S = 0, p-value = 0.0009766
alternative hypothesis: true median difference is less than 0
95 percent confidence interval:
-Inf -119.7507
sample estimates:
median of x-y
-176.2
Achieved and Interpolated Confidence Intervals:
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.9453 -Inf -122.3000
Interpolated CI 0.9500 -Inf -119.7507
Upper Achieved CI 0.9893 -Inf -98.4000
In the above output, the p-value of the test is 0.0009766, which is less than the significance level alpha = 0.05. We can conclude that the true median difference between weight before and after treatment is less than zero (p-value = 0.0009766).
SIGN.test(before,after,md=0, alternative ="two.sided",paired=T, data=myDatta)
Dependent-samples Sign-Test
data: before and after
S = 0, p-value = 0.001953
alternative hypothesis: true median difference is not equal to 0
95 percent confidence interval:
-226.9173 -106.1542
sample estimates:
median of x-y
-176.2
Achieved and Interpolated Confidence Intervals:
Conf.Level L.E.pt U.E.pt
Lower Achieved CI 0.8906 -202.8000 -122.3000
Interpolated CI 0.9500 -226.9173 -106.1542
Upper Achieved CI 0.9785 -238.5000 -98.4000
In the above output, the p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We can conclude that the true median difference between weight before and after is not equal to zero with a p-value = 0.001953.
The Mann-Whitney U test is used to compare differences between two independent groups when the dependent variable is either ordinal or continuous, but not normally distributed. For example, you could use the Mann-Whitney U test to understand whether attitudes towards pay discrimination, where attitudes are measured on an ordinal scale, differ based on gender (i.e., your dependent variable would be “attitudes towards pay discrimination” and your independent variable would be “gender”, which has two groups: “male” and “female”). Alternately, you could use the Mann-Whitney U test to understand whether salaries, measured on a continuous scale, differed based on educational level (i.e., your dependent variable would be “salary” and your independent variable would be “educational level”, which has two groups: “high school” and “university”). The Mann-Whitney U test is often considered the nonparametric alternative to the independent t-test although this is not always the case.
Unlike the independent-samples t-test, the Mann-Whitney U test allows you to draw different conclusions about your data depending on the assumptions you make about your data’s distribution. These conclusions can range from simply stating whether the two populations differ through to determining if there are differences in medians between groups. These different conclusions hinge on the shape of the distributions of your data, which we explain more about later.
red_bulb <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8)
orange_bulb <- c(47.8, 60, 63.4, 76, 89.4, 67.3, 61.3, 62.4)
BULB_PRICE = c(red_bulb, orange_bulb)
BULB_TYPE = rep(c("red", "orange"), each = 8)
DATASET <- data.frame(BULB_TYPE, BULB_PRICE, stringsAsFactors = TRUE)
DATASET
RSLT<-wilcox.test(BULB_PRICE~ BULB_TYPE,
data = DATASET,
exact = FALSE)
RSLT
Wilcoxon rank sum test with continuity correction
data: BULB_PRICE by BULB_TYPE
W = 44.5, p-value = 0.2072
alternative hypothesis: true location shift is not equal to 0
Here the p-value is 0.2072, which is greater than the significance level of 0.05, so we fail to reject the null hypothesis. We can conclude that the distribution of prices of red and orange bulbs is the same; that is, prices for both orange and red bulbs can be assumed to come from the same distribution.
Spearman’s rank correlation is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.
My_data<-read.csv("C:\\Users\\user\\Downloads\\gapminder.csv")
head(My_data,10)
attach(My_data)
The following objects are masked from gapminder:
continent, country, gdpPercap, lifeExp, pop, year
The following object is masked from fdi_data:
year
The following object is masked from research_data:
year
The following object is masked from unemploymnt_data (pos = 23):
year
The following object is masked from unemploymnt_data (pos = 24):
year
res <- cor.test(gdpPercap, lifeExp, method = "spearman")
Warning in cor.test.default(gdpPercap, lifeExp, method = "spearman"): Cannot
compute exact p-value with ties
res
Spearman's rank correlation rho
data: gdpPercap and lifeExp
S = 143096490, p-value < 0.00000000000000022
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8264712
S is the value of the test statistic (S = 143096490). The p-value of the test is extremely small (p-value < 0.00000000000000022). The alternative hypothesis is a character string describing the alternative hypothesis (true rho is not equal to 0). Sample estimates gives the correlation coefficient; for the Spearman correlation it is named rho (rho = 0.8265). From the results we therefore reject the null hypothesis and conclude that life expectancy and gdp per capita are dependent, and that the true rho is not equal to 0.
A matrix is a set of numbers arranged in rows and columns so as to form a rectangular array. The numbers are called the elements, or entries, of the matrix. Matrices have wide applications in engineering, physics, economics, and statistics, as well as in various branches of mathematics. Matrices also have important applications in computer graphics, where they have been used to represent rotations and other transformations of images.
A<-matrix(c(4,19,16,21,14,8,10,18,14),nrow=3,ncol=3,byrow=T)
A
[,1] [,2] [,3]
[1,] 4 19 16
[2,] 21 14 8
[3,] 10 18 14
The determinant is a special number that can be calculated from a matrix. The matrix has to be square (same number of rows and columns) like this one:
Mat<-matrix(c(3,8,4,6),nrow=2,byrow = T)
Mat
[,1] [,2]
[1,] 3 8
[2,] 4 6
\[ \det(\mathrm{Mat}) = (3 \times 6) - (8 \times 4) = 18 - 32 = -14 \]
det(Mat)
[1] -14
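In general, the determinant of a 2 × 2 matrix is the product of the main-diagonal entries minus the product of the off-diagonal entries:
\[ \det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc \]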
mtrx <- matrix( c(-1, -3, -5, -7, -8, 9, 11, 13, 15),nrow = 3,byrow = T)
mtrx
[,1] [,2] [,3]
[1,] -1 -3 -5
[2,] -7 -8 9
[3,] 11 13 15
det(mtrx)
[1] -360
A<-matrix(c(4,19,16,21,14,8,10,18,14),nrow=3,ncol=3,byrow=T)
A
[,1] [,2] [,3]
[1,] 4 19 16
[2,] 21 14 8
[3,] 10 18 14
B<-matrix(c(161,191,123,165,125,155,100,200,155),nrow=3,ncol=3,byrow=T)
B
[,1] [,2] [,3]
[1,] 161 191 123
[2,] 165 125 155
[3,] 100 200 155
solve(B) # solve() with a single matrix argument returns the inverse of B
[,1] [,2] [,3]
[1,] 0.009121582 0.003927184 -0.011165601
[2,] 0.007905371 -0.009929774 0.003656479
[3,] -0.016085370 0.010278944 0.008937189
zapsmall(B%*%solve(B)) # B multiplied by its inverse gives the identity matrix; zapsmall() rounds tiny floating-point values to zero
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
A+B
[,1] [,2] [,3]
[1,] 165 210 139
[2,] 186 139 163
[3,] 110 218 169
B-A
[,1] [,2] [,3]
[1,] 157 172 107
[2,] 144 111 147
[3,] 90 182 141
A%*%B
[,1] [,2] [,3]
[1,] 5379 6339 5917
[2,] 6491 7361 5993
[3,] 5980 6960 6190
B/A
[,1] [,2] [,3]
[1,] 40.250000 10.052632 7.68750
[2,] 7.857143 8.928571 19.37500
[3,] 10.000000 11.111111 11.07143
M1<-matrix(c(16,121,166,121,129,177,163,135,198),nrow=3,ncol=3,byrow=T)
M1
[,1] [,2] [,3]
[1,] 16 121 166
[2,] 121 129 177
[3,] 163 135 198
M2<-matrix(c(100,170,188,131,175,178,151,199,502),nrow=3,ncol=3,byrow=T)
M2
[,1] [,2] [,3]
[1,] 100 170 188
[2,] 131 175 178
[3,] 151 199 502
S<-solve(A)
S
[,1] [,2] [,3]
[1,] -1.04 -0.44 1.44
[2,] 4.28 2.08 -6.08
[3,] -4.76 -2.36 6.86
solve(B)
[,1] [,2] [,3]
[1,] 0.009121582 0.003927184 -0.011165601
[2,] 0.007905371 -0.009929774 0.003656479
[3,] -0.016085370 0.010278944 0.008937189
cbind(M1,M2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 16 121 166 100 170 188
[2,] 121 129 177 131 175 178
[3,] 163 135 198 151 199 502
rbind(M1,M2)
[,1] [,2] [,3]
[1,] 16 121 166
[2,] 121 129 177
[3,] 163 135 198
[4,] 100 170 188
[5,] 131 175 178
[6,] 151 199 502
zapsmall(A%*%S)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
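Besides inverting a matrix, solve() can also solve a linear system directly, which is usually preferable to forming the inverse explicitly; a minimal sketch with an illustrative right-hand side b:
b <- c(1, 2, 3) # illustrative right-hand side
x <- solve(A, b) # solves A %*% x = b
zapsmall(A %*% x) # recovers b, up to floating-point rounding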
S<-matrix(c(2.879,10.002,-1.810,10.002,199.798,-5.627,-1.810,-5.627,3.628),nrow=3)
S
[,1] [,2] [,3]
[1,] 2.879 10.002 -1.810
[2,] 10.002 199.798 -5.627
[3,] -1.810 -5.627 3.628
S_inv<-solve(S)
S_inv
[,1] [,2] [,3]
[1,] 0.58648233 -0.022083814 0.258342724
[2,] -0.02208381 0.006065228 -0.001610437
[3,] 0.25834272 -0.001610437 0.402022712
zapsmall(S%*%S_inv)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
R<-matrix(c(4,3,4,5,8,4,7,8,2,5,3,8,9,5,5,6),nrow = 4)
R
[,1] [,2] [,3] [,4]
[1,] 4 8 2 9
[2,] 3 4 5 5
[3,] 4 7 3 5
[4,] 5 8 8 6
det(R)
[1] -71
solve(R)
[,1] [,2] [,3] [,4]
[1,] -0.87323944 1.5211268 1.6901408 -1.36619718
[2,] 0.33802817 -0.9436620 -0.4929577 0.69014085
[3,] 0.07042254 -0.1549296 -0.3943662 0.35211268
[4,] 0.18309859 0.1971831 -0.2253521 -0.08450704
zapsmall(R%*%solve(R))
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
my_mat<-matrix(c(25,22,18,26,12,16,24,20,26,28,30,28,12,
17,16,17,10,15,16,20,14,19,17,14),ncol=3,nrow = 8)
my_mat
[,1] [,2] [,3]
[1,] 25 26 10
[2,] 22 28 15
[3,] 18 30 16
[4,] 26 28 20
[5,] 12 12 14
[6,] 16 17 19
[7,] 24 16 17
[8,] 20 17 14
colnames(my_mat) <- c("sales", "profit","quantity")
rownames(my_mat)<- c("John","Chang","Michael","Bryan","Jose","McTon","Jayden","Jane")
my_mat
sales profit quantity
John 25 26 10
Chang 22 28 15
Michael 18 30 16
Bryan 26 28 20
Jose 12 12 14
McTon 16 17 19
Jayden 24 16 17
Jane 20 17 14
my_mat<-as.data.frame(my_mat)
my_mat
Flour<-c(230,181,165,150,97,182,191,199,172,170)
Cooking_oil<-c(125,99,97,115,120,100,80,90,95,125)
Kerosene<-c(200,55,105,85,125,150,85,120,110,130)
Rice<-c(109,107,98,71,82,103,111,93,86,78)
data_frame<-data.frame(Flour,Cooking_oil,Kerosene,Rice)
data_frame
cov(data_frame)
Flour Cooking_oil Kerosene Rice
Flour 1196.4556 -127.3556 469.94444 309.60000
Cooking_oil -127.3556 244.2667 316.22222 -101.75556
Kerosene 469.9444 316.2222 1589.16667 69.77778
Rice 309.6000 -101.7556 69.77778 197.06667
var(data_frame) # for a data frame, var() returns the same covariance matrix as cov()
Flour Cooking_oil Kerosene Rice
Flour 1196.4556 -127.3556 469.94444 309.60000
Cooking_oil -127.3556 244.2667 316.22222 -101.75556
Kerosene 469.9444 316.2222 1589.16667 69.77778
Rice 309.6000 -101.7556 69.77778 197.06667
mean(data_frame$Flour)
[1] 173.7
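The covariance matrix above can be rescaled to a correlation matrix; one way, not part of the original code, is cov2cor(), which agrees with calling cor() on the data frame:
# Correlation matrix corresponding to the covariance matrix above
cov2cor(cov(data_frame))
cor(data_frame) # same result, computed directly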
income<-c(25,26,32,31,25,26,24,36,21,26,21,23,25)
expenditure<-c(15,16,14,12,20,16,19,18,14,13,19,18,17)
investment<-c(10,12,14,12,12,14,13,16,21,14,15,19,17)
data.frame(income,expenditure,investment)
Regression_model=lm(expenditure~income+investment)
summary(Regression_model)
Call:
lm(formula = expenditure ~ income + investment)
Residuals:
Min 1Q Median 3Q Max
-3.5335 -1.4506 -0.2696 1.9866 3.5648
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.284694 6.894461 2.942 0.0147 *
income -0.150291 0.183800 -0.818 0.4326
investment -0.007682 0.259844 -0.030 0.9770
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.67 on 10 degrees of freedom
Multiple R-squared: 0.06582, Adjusted R-squared: -0.121
F-statistic: 0.3523 on 2 and 10 DF, p-value: 0.7115
stargazer(Regression_model, type="text")
===============================================
Dependent variable:
---------------------------
expenditure
-----------------------------------------------
income -0.150
(0.184)
investment -0.008
(0.260)
Constant 20.285**
(6.894)
-----------------------------------------------
Observations 13
R2 0.066
Adjusted R2 -0.121
Residual Std. Error 2.670 (df = 10)
F Statistic 0.352 (df = 2; 10)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
From the output above, neither income nor investment has a statistically significant effect on expenditure, and the overall F statistic of 0.352 (p-value = 0.7115) indicates that the model as a whole is not significant at the 5% level of significance.
In many cancer studies, the main outcome under assessment is the time to an event of interest. The generic name for the time is survival time, although it may be applied equally to the time ‘survived’ from complete remission to relapse or progression and to the time from diagnosis to death. If the event occurred in all individuals, many methods of analysis would be applicable. However, it is usual that at the end of follow-up some of the individuals have not had the event of interest, and thus their true time to event is unknown. Further, survival data are rarely Normally distributed, but are skewed and typically comprise many early events and relatively few late ones. It is these features of the data that make the special methods called survival analysis necessary.
This chapter introduces and explains the basic concepts of survival analysis. Most survival analyses in cancer journals use some or all of Kaplan–Meier (KM) plots, logrank tests, and Cox (proportional hazards) regression. We will discuss the background to, and interpretation of, each of these methods, as well as other approaches to analysis that deserve to be used more often. We first present the basic concepts of survival analysis, including how to produce and interpret survival curves and how to quantify and test survival differences between two or more groups of patients; multivariate analysis and some more advanced concepts are covered later. More detailed accounts of these methods can be found in books written specifically about survival analysis, for example, Collett (1994), Parmar and Machin (1995) and Kleinbaum (1996). Several introductory texts also describe the basics of survival analysis, for example, Altman (2003) and Piantadosi (1997).
In other words, survival analysis is a branch of statistics for analysing the expected duration of time until an event happens, such as death in biological organisms or failure in mechanical systems. It is a very useful topic with wide application: the same methods can be used to predict the survival of cancer patients, customer churn, students' dropout rates, or machine failures.
my_data<-ovarian
data(cancer, package="survival")
head(my_data,10)
tail(my_data,10)
fustat = 1 means the event (death) occurred; fustat = 0 means no event (the observation is censored). When fitting a model, make sure you know what 0 and 1 code for. The time recorded here is survival after diagnosis.
futime: survival or censoring time
fustat: censoring status
age: in years
resid.ds: residual disease present (1=no,2=yes)
rx: treatment group
ecog.ps: ECOG performance status (1 is better, see reference)
A patient is scored as censored if he or she did not suffer the outcome of interest. In survival analysis, patients who do not have an “event” during a specified period are said to have censored observation.
names(my_data)
[1] "futime" "fustat" "age" "resid.ds" "rx" "ecog.ps"
attach(my_data)
The Kaplan-Meier estimate is one of the best options for measuring the fraction of subjects living for a certain amount of time after treatment. In clinical trials or community trials, the effect of an intervention is assessed by measuring the number of subjects who survived or were saved after that intervention over a period of time. The time starting from a defined point to the occurrence of a given event, for example death, is called the survival time, and the analysis of such grouped data is called survival analysis. This can be complicated by subjects who are uncooperative and refuse to remain in the study, by subjects who do not experience the event or die before the end of the study although they would have done so had observation continued, or by subjects with whom we lose touch midway through the study. We label these situations censored observations. The Kaplan-Meier estimate is the simplest way of computing survival over time in spite of all these difficulties. The survival curve can be created assuming various situations. It involves computing the probabilities of occurrence of the event at certain points in time and multiplying these successive probabilities by any earlier computed probabilities to get the final estimate. This can be calculated for two groups of subjects, along with the statistical difference between their survival curves. This can be used in Ayurveda research when comparing two drugs and looking at the survival of subjects.
km.model<-survfit(Surv(futime,fustat)~1,
type="kaplan-meier")
Since we have no x-variable, we must put a 1 on the right-hand side of the formula: it is just survival, with no variable to relate it to. In the command above we specified the type explicitly; however, if we do not specify the type, survfit fits the Kaplan-Meier model by default.
km.model
Call: survfit(formula = Surv(futime, fustat) ~ 1, type = "kaplan-meier")
n events median 0.95LCL 0.95UCL
[1,] 26 12 638 464 NA
This gives the MEDIAN survival time and a confidence interval (CI) for it. From the results above, we have 26 individuals and 12 events. The estimated median survival time is 638 days; that is, roughly half of the patients are estimated to survive beyond 638 days.
summary(km.model)
Call: survfit(formula = Surv(futime, fustat) ~ 1, type = "kaplan-meier")
time n.risk n.event survival std.err lower 95% CI upper 95% CI
59 26 1 0.962 0.0377 0.890 1.000
115 25 1 0.923 0.0523 0.826 1.000
156 24 1 0.885 0.0627 0.770 1.000
268 23 1 0.846 0.0708 0.718 0.997
329 22 1 0.808 0.0773 0.670 0.974
353 21 1 0.769 0.0826 0.623 0.949
365 20 1 0.731 0.0870 0.579 0.923
431 17 1 0.688 0.0919 0.529 0.894
464 15 1 0.642 0.0965 0.478 0.862
475 14 1 0.596 0.0999 0.429 0.828
563 12 1 0.546 0.1032 0.377 0.791
638 11 1 0.497 0.1051 0.328 0.752
From the summary table above, at time 59 days, 26 individuals were at risk and one event occurred; in other words, an individual died. Beyond 59 days, the estimated probability of an individual surviving is 96.2%. Further, we are 95% confident that the chance of survival beyond 59 days is between 89% and 100%. At time 115 days, 25 individuals were at risk and one died. The probability of survival beyond 115 days is 92.3%, with a standard error of 0.0523, and we are 95% confident that the chance of survival beyond 115 days is between 82.6% and 100%.
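If survival estimates are wanted at specific follow-up times rather than at every event time, summary() on a survfit object accepts a times argument; for example (output not shown here):
# Kaplan-Meier survival estimates at selected follow-up times (in days)
summary(km.model, times = c(180, 365, 730))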
plot(km.model,conf.int = F,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model")
In the command above, we set conf.int = F when we do not want the confidence interval drawn in the plot. Consider the plot below with the confidence interval included.
plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1)
The graph above shows why the median survival time is 638 days: that is where the survival curve crosses a survival probability of 0.5.
plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1)
abline(h=0.5,col="red")
Remember that we can also include a ‘tick’ wherever there is a censored observation by using the mark.time argument.
plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1, mark.time = T)
The graph above shows every point where there was a censored observation.
cancer<-read.csv("C:\\Users\\user\\Downloads\\cancer.csv")
head(cancer,10)
tail(cancer,10)
km.model2<-survfit(Surv(futime,fustat)~over50,
type="kaplan-meier", data = cancer)
plot(km.model2,conf.int = F,xlab = "Time(days)",
ylab="%Alive=S(t)",main="K-M Model: Survival Analysis", las=1)
ggsurvplot(km.model2,data = cancer, pval = TRUE)
plot(km.model2, conf.int = F, xlab = "Time(days)", ylab = "Survival Probability",
     main = "K-M model: Survival Analysis", col = c("red", "blue"),
     las = 1, lwd = 2, mark.time = T)
## legend
legend(100, 0.2, legend = c("below50", "above50"), lty = 1, lwd = 2,
       col = c("red", "blue"), bty = "n", cex = 0.6)
km.model2
Call: survfit(formula = Surv(futime, fustat) ~ over50, data = cancer,
type = "kaplan-meier")
n events median 0.95LCL 0.95UCL
over50=0 6 1 NA NA NA
over50=1 20 11 563 431 NA
From the summary statistics above, six individuals were under 50 years of age, and only one event occurred among them. On the other hand, 20 individuals were at risk in the over-50 group, and 11 events occurred among them. This suggests there may be a relationship between age and survival, which we examine more formally below.
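A quick cross-tabulation makes the 1-out-of-6 versus 11-out-of-20 comparison explicit (a sketch assuming fustat = 1 marks an event):
table(over50 = cancer$over50, event = cancer$fustat)  # events and censorings by age group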
summary(km.model2)
Call: survfit(formula = Surv(futime, fustat) ~ over50, data = cancer,
type = "kaplan-meier")
over50=0
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  329      6       1    0.833   0.152        0.583        1.000
over50=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
59 20 1 0.950 0.0487 0.859 1.000
115 19 1 0.900 0.0671 0.778 1.000
156 18 1 0.850 0.0798 0.707 1.000
268 17 1 0.800 0.0894 0.643 0.996
353 16 1 0.750 0.0968 0.582 0.966
365 15 1 0.700 0.1025 0.525 0.933
431 12 1 0.642 0.1093 0.460 0.896
464 10 1 0.577 0.1157 0.390 0.855
475 9 1 0.513 0.1193 0.326 0.809
563 7 1 0.440 0.1227 0.255 0.760
638 6 1 0.367 0.1222 0.191 0.705
From the summary table above, at time 329 days, six individuals below 50 years were at risk and one event occurred; in other words, one individual died. Beyond 329 days, the estimated probability of survival for an individual below 50 years is 83.3%, and we are 95% confident that the probability of surviving beyond 329 days in this group lies between 58.3% and 100%. On the other hand, at time 59 days, 20 individuals above 50 years were at risk and one died. Beyond 59 days, the estimated probability of survival for an individual above 50 years is 95%, and we are 95% confident that the probability of surviving beyond 59 days in this group lies between 85.9% and 100%. Note that the tick marks in the plot above show censored observations.
From the graph, survival appears to differ across the two age categories: individuals below 50 years seem to have a higher survival probability than individuals above 50 years. Whether this difference is statistically significant is tested below with the log-rank test.
survdiff(Surv(futime,fustat)~over50, data = cancer)
Call:
survdiff(formula = Surv(futime, fustat) ~ over50, data = cancer)
N Observed Expected (O-E)^2/E (O-E)^2/V
over50=0 6 1 3.6 1.876 2.75
over50=1 20 11 8.4 0.804 2.75
Chisq= 2.7 on 1 degrees of freedom, p= 0.1
From the inferential results above, there is no evidence to reject the null hypothesis. In other words, there is no statistically significant evidence that survival differs between the two groups, as shown by the p-value of 0.1, which is greater than 0.05.
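As a side note, the reported p-value is simply the upper tail of a chi-square distribution with 1 degree of freedom evaluated at the log-rank statistic:
pchisq(2.7, df = 1, lower.tail = FALSE)  # about 0.10, matching the survdiff output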
The Cox proportional-hazards model (Cox, 1972) is essentially a regression model, commonly used in medical research, for investigating the association between the survival time of patients and one or more predictor variables.
In the previous chapter (survival analysis basics), we described the basic concepts of survival analysis and methods for analyzing and summarizing survival data, including Kaplan-Meier curves and the log-rank test.
These methods - Kaplan-Meier curves and log-rank tests - are examples of univariate analysis. They describe survival according to one factor under investigation but ignore the impact of any others. Additionally, Kaplan-Meier curves and log-rank tests are useful only when the predictor variable is categorical (e.g. treatment A vs treatment B; males vs females). They do not work easily for quantitative predictors such as gene expression, weight, or age. An alternative is Cox proportional hazards regression, which works for both quantitative and categorical predictor variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time.
The Cox proportional hazards model is called a semi-parametric model because it makes no assumptions about the shape of the baseline hazard function. There are, however, other assumptions (i.e., independence of observations, changes in predictors produce proportional changes in the hazard regardless of time, and a linear association between the natural logarithm of the relative hazard and the predictors). There are other regression models used in survival analysis that assume specific distributions for the survival times, such as the exponential, Weibull, Gompertz and log-normal distributions. The exponential regression survival model, for example, assumes that the hazard function is constant; other distributions assume that the hazard increases over time, decreases over time, or increases initially and then decreases. The example below illustrates estimation of a Cox proportional hazards regression model and discusses the interpretation of the regression coefficients.
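For comparison, such fully parametric survival models can be fitted with survreg() from the survival package. The sketch below reuses the cancer variables from earlier and picks the Weibull distribution purely for illustration; dist = "exponential" would impose a constant hazard instead.
weib.model <- survreg(Surv(futime, fustat) ~ over50,
                      data = cancer, dist = "weibull")
summary(weib.model)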
data("lung")
Warning in data("lung"): data set 'lung' not found
head(lung)
tail(lung,5)
names(lung)
[1] "inst" "time" "status" "age" "sex" "ph.ecog"
[7] "ph.karno" "pat.karno" "meal.cal" "wt.loss"
The data set above is a built-in data set (from the survival package) that can be used to fit a Cox proportional hazards model; in this case, however, we are going to use the data below.
cancer<-read.csv("C:\\Users\\user\\Downloads\\cancer.csv")
head(cancer,10)
cancer$over50<-as.factor(cancer$over50)
cancer$ecog.ps<-as.factor(cancer$ecog.ps)
attach(cancer)
The following object is masked _by_ .GlobalEnv:
gender
The following objects are masked from my_data:
age, ecog.ps, fustat, futime, resid.ds, rx
cox.model<-coxph(Surv(futime,fustat)~over50+ecog.ps)
summary(cox.model)
Call:
coxph(formula = Surv(futime, fustat) ~ over50 + ecog.ps)
n= 26, number of events= 12
coef exp(coef) se(coef) z Pr(>|z|)
over501 1.5356 4.6442 1.0541 1.457 0.145
ecog.ps2 0.2628 1.3006 0.5893 0.446 0.656
exp(coef) exp(-coef) lower .95 upper .95
over501 4.644 0.2153 0.5884 36.655
ecog.ps2 1.301 0.7689 0.4098 4.128
Concordance= 0.603 (se = 0.081 )
Likelihood ratio test= 3.64 on 2 df, p=0.2
Wald test = 2.46 on 2 df, p=0.3
Score (logrank) test = 2.95 on 2 df, p=0.2
Remember, the Cox proportional hazards model does not have an intercept coefficient.
At any given time, someone who is above 50 years is 4.6442 times as likely to die as someone who is under 50 years, adjusting for ECOG performance status. ECOG performance status describes a patient's level of functioning in terms of their ability to care for themselves, daily activity, and physical ability (walking, working, etc.); researchers worldwide consider the ECOG performance status scale when planning cancer clinical trials to study new treatments. The quantity (4.6442 - 1) indicates that someone who is over 50 years has a 364.42% higher hazard of death than someone who is under 50 years, adjusted for ECOG performance status. In other words, if we pick two individuals with the same ECOG performance status, the one above 50 years has a 364.42% higher hazard of dying than the one below 50 years. Conversely, someone who is below 50 years is 0.2153 times as likely to die as someone who is above 50 years, adjusting for ECOG performance status.
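The hazard ratios and confidence intervals quoted above can also be extracted directly from the fitted model, for example:
exp(coef(cox.model))     # hazard ratios, e.g. about 4.64 for over50
exp(confint(cox.model))  # 95% confidence intervals on the hazard-ratio scale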
The likelihood ratio, Wald, and score (log-rank) tests at the bottom of the output assess the overall significance of the model. The concordance statistic is a goodness-of-fit statistic for survival analysis; in logistic regression, the equivalent quantity is the area under the ROC curve.
cox.model<-coxph(Surv(futime,fustat)~over50+ecog.ps)
cox.model2<-coxph(Surv(futime,fustat)~over50)
The likelihood ratio test below aims to establish whether dropping ECOG performance status significantly changes the fit of the model. Consider the command below.
anova(cox.model2,cox.model,test = "LRT")
From the results above (p-value > 0.05), we can drop ECOG performance status without significant loss of predictive power from the model. In other words, ECOG performance status is not very important in this model.
data("lung")
Warning in data("lung"): data set 'lung' not found
head(lung)
write.csv(lung, "C:\\Users\\user\\Downloads\\lung.csv", row.names=FALSE)
# lung_data.csv is assumed to be a separately prepared copy of the lung data with an added over50 indicator
lung_data<-read.csv("C:\\Users\\user\\Downloads\\lung_data.csv")
attach(lung_data)
The following objects are masked from cancer:
age, over50
The following object is masked from my_data:
age
head(lung_data,10)
cox.model3<-coxph(Surv(time,status)~over50+sex)
summary(cox.model3)
Call:
coxph(formula = Surv(time, status) ~ over50 + sex)
n= 228, number of events= 165
coef exp(coef) se(coef) z Pr(>|z|)
over50 0.2803 1.3236 0.3003 0.933 0.35057
sex -0.5270 0.5904 0.1672 -3.151 0.00162 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
over50 1.3236 0.7555 0.7347 2.3844
sex 0.5904 1.6938 0.4254 0.8193
Concordance= 0.585 (se = 0.022 )
Likelihood ratio test= 11.58 on 2 df, p=0.003
Wald test = 10.94 on 2 df, p=0.004
Score (logrank) test = 11.19 on 2 df, p=0.004
At any given time, someone who is above 50 years is 1.3236 times as likely to die as someone who is under 50 years, adjusting for gender. The quantity (1.3236 - 1) indicates that someone who is over 50 years has a 32.36% higher hazard of death than someone who is under 50 years, adjusted for gender. In other words, if we pick two individuals of the same gender, the one above 50 years has a 32.36% higher hazard of dying than the one below 50 years. Conversely, someone who is below 50 years is 0.7555 times as likely to die as someone who is above 50 years, adjusting for gender.
As before, the likelihood ratio, Wald, and score tests assess the overall significance of the model, and the concordance statistic is the goodness-of-fit measure for survival analysis (the analogue of the area under the ROC curve in logistic regression).
cox.model3<-coxph(Surv(time,status)~over50+sex)
cox.model4<-coxph(Surv(time,status)~over50)
The likelihood ratio test below aims to establish whether dropping gender significantly changes the fit of the model. Consider the command below.
anova(cox.model4,cox.model3,test = "LRT")
From the results above (p-value < 0.05), we cannot drop the X-variable sex from the model; doing so would lead to a loss of predictive power. In other words, the X-variable sex is important in this model.
Now that we have worked with categorical X-variables, let us add numeric X-variables to the model.
cox.model5<-coxph(Surv(time,status)~age+wt.loss)
summary(cox.model5)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.021697 1.021934 0.009627 2.254 0.0242 *
wt.loss 0.001339 1.001340 0.006239 0.215 0.8301
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.022 0.9785 1.0028 1.041
wt.loss 1.001 0.9987 0.9892 1.014
Concordance= 0.563 (se = 0.026 )
Likelihood ratio test= 5.27 on 2 df, p=0.07
Wald test = 5.12 on 2 df, p=0.08
Score (logrank) test = 5.14 on 2 df, p=0.08
At any given instant in time, the hazard of dying for someone who is one year older is about 2% higher than for someone who is one year younger, adjusting for weight loss in the past six months. In other words, if we pick two individuals with similar weight loss in the past six months, the hazard of dying for the one who is one year older is about 2% higher than for the one who is one year younger.
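Because the per-year effect compounds multiplicatively, the hazard ratio for a larger age gap is obtained by scaling the coefficient before exponentiating; for example, for a ten-year age difference:
exp(10 * coef(cox.model5)["age"])  # about 1.24, i.e. roughly a 24% higher hazard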
The Cox proportional hazards model makes two assumptions:
* survival curves for different strata must have hazard functions that are proportional over the time t;
* the relationship between the log hazard and each covariate is linear, which can be verified with residual plots.
Covariates can be categorical, such as race or treatment group, or continuous, such as biomarker concentrations.
If you recall, we assume that the relationship between the covariates (X-variables) and the log hazard is linear. We can check this assumption in the same way we check linearity in a linear, logistic, or Poisson regression model: by looking at a residual plot, with the predicted values on the x-axis and the residuals on the y-axis.
plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
xlab="fitted values",ylab="Martingale residuals",
main="Residual Plot",las=1)
plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
xlab="fitted values",ylab="Martingale residuals",
main="Residual Plot",las=1)
abline(h=0)
plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
xlab="fitted values",ylab="Martingale residuals",
main="Residual Plot",las=1)
abline(h=0)
lines(smooth.spline(predict(cox.model5),residuals(cox.model5,type = "martingale")),col="red")
plot(predict(cox.model5),residuals(cox.model5,type = "deviance"))
plot(predict(cox.model5),residuals(cox.model5,type = "deviance"))
abline(h=0)
lines(smooth.spline(predict(cox.model5),
residuals(cox.model5,type = "deviance")),col="red")
From the plots above, the smoothed residual curve stays close to zero, suggesting the relationship between the covariates (X-variables) and the log hazard is approximately linear.
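If you want a more formal check than the residual plot, one option (a sketch, not part of the original analysis) is to let age enter the model through a penalized spline via pspline() from the survival package and inspect the test for the non-linear part:
cox.spline <- coxph(Surv(time, status) ~ pspline(age) + wt.loss, data = lung_data)
cox.spline  # the "nonlin" line tests for departure from linearity in age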
This assumption is tested using the Schoenfeld test, which has the following hypotheses: the null hypothesis is that the hazards are proportional (the coefficient does not change over time), and the alternative hypothesis is that the hazards are not proportional. Performing this test returns a test for each X-variable and for the overall model.
cox.zph(cox.model5)
chisq df p
age 0.6910 1 0.41
wt.loss 0.0121 1 0.91
GLOBAL 0.7209 2 0.70
The p-values for the covariates and for the overall model are greater than 0.05, so there is no statistically significant evidence to reject the null hypothesis. We therefore conclude that the hazards can be assumed proportional over time.
We can also plot these tests (one plot for each parameter). These are plots of the estimated coefficient "b" allowed to vary over time: if "b" changes over time, the proportional hazards assumption does not hold, because the effect is not constant over time. However, pay less attention to the extremes of the time axis, as the smoothed line is sensitive there.
plot(cox.zph(cox.model5))
The smooth line in the plot tells us how the hazard coefficient would change if we allowed it to change over time. The dotted lines on either side of the solid line are the confidence limits. Let us add a horizontal reference line through zero.
plot(cox.zph(cox.model5)[1], main="Plot of Hazards")
abline(h=0,col="red")
From the graph above, the reference line (abline) lies within the confidence bounds for at least a third of the time. However, that is not enough to show that the coefficient for age is not changing over time, so the proportional hazards assumption seems only partially met for age.
plot(cox.zph(cox.model5)[2], main="Plot of Hazards")
abline(h=0,col="red")
From the graph above, the reference line (abline) lies within the confidence bounds throughout the follow-up time. This is a clear indication that the coefficient for weight loss is not changing over time; in other words, the proportional hazards assumption is fully met for weight loss.