Walk attendees through creating an R project using the wizard to organize the code and data used in this workshop
# These only need to be executed once for your machine or RStudio Cloud project
#install.packages("tidyverse")
#install.packages("fpp2")
#install.packages("quantmod")
#install.packages("scales")
#install.packages("vegan")
R comes pre-loaded with many fun data sets to play around with. The dataset “rivers” gives the lengths (in miles) of 141 “major” rivers in North America, as compiled by the US Geological Survey.
rivers
## [1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870
## [16] 906 202 329 290 1000 600 505 1450 840 1243 890 350 407 286 280
## [31] 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600
## [46] 306 390 420 291 710 340 217 281 352 259 250 470 680 570 350
## [61] 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260
## [76] 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735
## [91] 233 435 490 310 460 383 375 1270 545 445 1885 380 300 380 377
## [106] 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540
## [121] 1038 424 310 300 444 301 268 620 215 652 900 525 246 360 529
## [136] 500 720 270 430 671 1770
To start getting a feel for our data, we can ask R for a summary,
summary(rivers)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 135.0 310.0 425.0 591.2 680.0 3710.0
a histogram,
hist(rivers)
and a stem-and-leaf plot to visualize the distribution of the data
stem(rivers, scale = 2) #the optional argument "scale" adjusts the plot length
##
## The decimal point is 2 digit(s) to the right of the |
##
## 1 | 4
## 2 | 0112233345555666677788888999
## 3 | 00001111223333344455555666688888999
## 4 | 111222333445566779
## 5 | 001233344567
## 6 | 000112233578
## 7 | 012234468
## 8 | 04579
## 9 | 0018
## 10 | 045
## 11 | 07
## 12 | 147
## 13 | 1
## 14 | 56
## 15 |
## 16 |
## 17 | 7
## 18 | 9
## 19 |
## 20 |
## 21 |
## 22 |
## 23 | 25
## 24 |
## 25 | 3
## 26 |
## 27 |
## 28 |
## 29 |
## 30 |
## 31 |
## 32 |
## 33 |
## 34 |
## 35 |
## 36 |
## 37 | 1
The Shapiro-Wilk test is a test for normality. It compares the sampled values with the expected values of the order statistics for a sample of the same size. Here, the Shapiro-Wilk test allows us to reject the null hypothesis that the distribution of river lenght is normal.
shapiro.test(rivers)
##
## Shapiro-Wilk normality test
##
## data: rivers
## W = 0.66662, p-value < 2.2e-16
Maybe the length of rivers follow an exponential distrubution? (Knowing nothing about rivers, this seems reasonably plausible – it means that if there are n rivers of a certain length, there will be ~2n rivers of half that length.) We can test this qualitatively using a Q-Q plot:
lambda <- 1 / mean(rivers)
observed <- sort(rivers)
expected <- qexp(ppoints(length(rivers)), rate = lambda)
qqplot(expected, observed)
abline(0,1)
Let’s break this down:
The closer a Q-Q plot is to the line y = x, the more the observed distribution is similar to the expected distribution (think about why this is true if you don’t see it right away!).
The Q-Q plot looks kind-of-okay: it is reasonably close to the line y = x, but it also deviates from this line in a clearly systematic way. So let’s try some statistical testing.
The Kolmogorov-Smirnov test is a nonparametric test of whether a sample comes from a particular hypothesized distribution.
ks.test(rivers, "pexp", rate = lambda)
## Warning in ks.test.default(rivers, "pexp", rate = lambda): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: rivers
## D = 0.2848, p-value = 2.331e-10
## alternative hypothesis: two-sided
Probably not exponential :(
log-normal?
ks.test(rivers, "plnorm", meanlog = mean(log(rivers)), sdlog = sd(log(rivers)))
## Warning in ks.test.default(rivers, "plnorm", meanlog = mean(log(rivers)), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: rivers
## D = 0.092305, p-value = 0.1808
## alternative hypothesis: two-sided
maybe?
shapiro.test(log(rivers))
##
## Shapiro-Wilk normality test
##
## data: log(rivers)
## W = 0.94801, p-value = 3.945e-05
probably not :(
In the above code, we used R’s built-in /d-p-q-r functions/, whose names are a(n abbreviation of a) probability distribution prefixed by one of the letters ‘d’, ‘p’, ‘q’, or ‘r’. The functions that start with:
'd' give the value of probability density function,'p' give the value of the cumulative distribution function,'r' give a randomly generated value from the distribution.The available distributions are:
"norm" (normal), "pois" (Poisson), "binom" (binomial), "nbinom" (negative binomial), "t" (Student's t), "chisq" (chi-squared), "beta" (Beta),"cauchy" (Cauchy), "exp" (exponential), "f" (F), "gamma" (Gamma), "geom" (geometric),"hyper" (hypergeometric), "logis" (logistic), "lnorm" (log-normal), "tukey" (Studentized range),"weibull" (Weibull), "signrank" (Wilcoxon signed rank statistic).For other distributions, one can use the function dpqr(fun, min, max), which returns a list of functions d(x), p(x), q(p), r(n), or the functions d(x,…), p(x,…), qn(x,…), and r(x,…) (note the name of qn to avoid conflicts with the similar function q(p, T.dist, T.dist.par)).
Remark: The function ks.test expects an r-function in the first argument and a p-function in the second argument.
There is a consensus in the literature that river lengths are actually distributed according to a power law distribution (see Fractal River Basins: Chance and Self-Organization by Rinaldo & Rodr’iguez-Iturbe). However it is generally quite difficult to test whether some particular data is distributed according to a power law, and this is an active area of statistical research (see e.g. “Power-Law Distributions in Empirical Data” by Clauset, Shalizi, & Newman)
Exploring Relationships between variables
Bivariate Relationships Estimate and visualize how two numerical variables are positively or negatively related.
A correlation exists between two variables when one is related to the other such that there is co-movement. Positive correlation means as one variable increases, the other variable also increases. Negative correlation means as one variable increases, the other variable decreases.
The code below downloads a CSV file that includes data from 1980 for 935 individuals on variables including their total monthly earnings (MonthlyEarnings) and a number of variables that could influence income, including years of education (YearsEdu). The data set originally comes from textbook website for Stock and Watson’s Introduction to Econometrics.
wages <- read.csv("http://murraylax.org/datasets/wage2.csv", na.strings = ".");
head(wages)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ggplot(data=wages,mapping=aes(x=YearsEdu, y=MonthlyEarnings)) +
geom_point() +
labs(title= "Monthly Earnigs vs. Years of Education",
x= "Years of Education",
y="Monthly Earnings") +
theme_bw()
ggplot(data=wages,mapping=aes(x=YearsEdu, y=MonthlyEarnings)) +
geom_point(position=position_jitter(width=0.4,seed=123)) +
labs(title= "Monthly Earnigs vs. Years of Education",
x= "Years of Education",
y="Monthly Earnings") +
theme_bw()
To learn more about data visualization using ggplot, you can consult the resources in the workshop powerpoint.
Combine with above add regression line to above plot
lmwages <- lm(MonthlyEarnings ~ YearsEdu, data=wages)
#lm() is a formula of the form y ~ x, fits a function of linear form y=a+bx
#Find the equation of the line by calling the coefficients
lmwages$coefficients
## (Intercept) YearsEdu
## 146.95244 60.21428
#Calculate the confidence intervals
confint(lmwages,level=0.95)
## 2.5 % 97.5 %
## (Intercept) -5.56393 299.46881
## YearsEdu 49.03783 71.39074
#Plot the scatter plot with regression line and confidence intervals
ggplot(data=wages, mapping=aes(x=YearsEdu, y=MonthlyEarnings)) +
geom_point(position = position_jitter(width=0.3,seed=123)) +
labs(title="Monthly Earnings Versus Years of Education",
x="Years of Education",
y="Monthly Earnings") +
geom_smooth(method="lm",se = T) +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
Measure the strength of the correlation and Confidence intervals
cor(x=wages$MonthlyEarnings, y=wages$YearsEdu, method='pearson')
## [1] 0.3271087
confint(lmwages,level=0.95)
## 2.5 % 97.5 %
## (Intercept) -5.56393 299.46881
## YearsEdu 49.03783 71.39074
Our sample estimate for the correlation coefficient is positive, but is this enough evidence that there is a relationship between years of schooling and real GDP growth in the population? To answer this, let us conduct a hypothesis test with the following null and alternative hypotheses:
Null hypothesis: rho equals 0
Alternative hypothesis: rho does not equal 0
The not-equal sign in the alternative hypothesis implies that this is a two-tailed test, so either positive or negative Pearson correlation coefficients significantly far away from zero will result in the null hypothesis rejected.
cor.test(x=wages$MonthlyEarnings, y=wages$YearsEdu,
alternative="two.sided", conf.level=0.95, method='pearson')
##
## Pearson's product-moment correlation
##
## data: wages$MonthlyEarnings and wages$YearsEdu
## t = 10.573, df = 933, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2686296 0.3831853
## sample estimates:
## cor
## 0.3271087
#If we have reason to believe that earnings are positively correlated
#to years of education, can do a one.sided test
In our example we are assuming a linear relationship between years of education and monthly earnings, but it may be more appropriate to suggest that each year of education leads to a similar percentage increase in monthly earnings.
To estimate such a relationship, we estimate the following regression equation that includes the natural logarithm of the dependent variable (monthly earnings)
# Estimage the regression equation with the log of monthly earnings as the outcome variable
loglmwages <- lm(log(wages$MonthlyEarnings) ~ wages$YearsEdu)
#view the summary
summary(loglmwages)
##
## Call:
## lm(formula = log(wages$MonthlyEarnings) ~ wages$YearsEdu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.94620 -0.24832 0.03507 0.27440 1.28106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.973062 0.081374 73.40 <2e-16 ***
## wages$YearsEdu 0.059839 0.005963 10.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4003 on 933 degrees of freedom
## Multiple R-squared: 0.09742, Adjusted R-squared: 0.09645
## F-statistic: 100.7 on 1 and 933 DF, p-value: < 2.2e-16
#Coefficient suggests that one additional year of education
#is associated with ~6% higher monthly salary
Extends simple bivariate regression analysis with the inclusion of more than one explanatory variable.
The procedure is used to determine how one or more explanatory variables influence an outcome variable, while holding all other explanatory variables fixed.
We can use the lm() function to estimate the regression. In the code below we call lm() with a single parameter that is a formula specifying how the outcome variable, MonthlyEarnings depends linearly on the seven explanatory variables. The function lm() produces a large list of output, which we assign to an object we name lmwages. We call the summary() function next to display the coefficient estimates to the screen.
lmwagesMR <- lm(MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience
+ Tenure, data=wages)
summary(lmwagesMR)
##
## Call:
## lm(formula = MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience +
## Tenure, data = wages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -826.33 -243.85 -44.83 180.83 2253.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -531.0392 115.0513 -4.616 4.47e-06 ***
## IQ 3.6966 0.9651 3.830 0.000137 ***
## Knowledge 8.2703 1.8273 4.526 6.79e-06 ***
## YearsEdu 47.2698 7.2980 6.477 1.51e-10 ***
## YearsExperience 11.8589 3.2494 3.650 0.000277 ***
## Tenure 6.2465 2.4565 2.543 0.011156 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 365.4 on 929 degrees of freedom
## Multiple R-squared: 0.1878, Adjusted R-squared: 0.1834
## F-statistic: 42.97 on 5 and 929 DF, p-value: < 2.2e-16
The goal here is to identify how much variability in the outcome variable, average monthly earnings, is explained by the all of the explanatory variables, including IQ, knowledge of a worker’s job, years of education, years experience, and tenure.
How much of the variability in income is explained by the variables in our regression, and how much is explained by other factors we are neglecting or unable to account for?
We want to know the following: Sum of Squares Explained (SSE) (sometimes referred to as the Sum of Squares Regression): measure of the variability in the outcome variable that is explained by the explanatory variables, i.e. the x-variables in your regression.
Sum of Squares Residual (SSR) (sometimes referred to as the Sum of Squares Error): measures of the variability in the outcome variable that is not explained by your regression
#An ANOVA table is a common way to summarize these measures of variability.
anova_wages<-anova(lmwagesMR)
anova_wages
#The Sum of Squares for the variables vs. the residuals allows for
#comparison of how much of the variability is explained by the variables
#and how much variability is not explained by our regression
#coefficient of determination, sometimes referred to as the R-Squared value,
#is a measure of what percentage of the variability in your outcome variable
#is explained by your explanatory variables.
summary(lmwagesMR)
##
## Call:
## lm(formula = MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience +
## Tenure, data = wages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -826.33 -243.85 -44.83 180.83 2253.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -531.0392 115.0513 -4.616 4.47e-06 ***
## IQ 3.6966 0.9651 3.830 0.000137 ***
## Knowledge 8.2703 1.8273 4.526 6.79e-06 ***
## YearsEdu 47.2698 7.2980 6.477 1.51e-10 ***
## YearsExperience 11.8589 3.2494 3.650 0.000277 ***
## Tenure 6.2465 2.4565 2.543 0.011156 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 365.4 on 929 degrees of freedom
## Multiple R-squared: 0.1878, Adjusted R-squared: 0.1834
## F-statistic: 42.97 on 5 and 929 DF, p-value: < 2.2e-16
# 18.78% of the variability in people’s monthly earnings is explained by IQ,
#knowledge of one’s job, educational attainment, experience, and tenure. The
#remaining 81.22% of variability in monthly earnings we cannot explain.
#F-statistic also gives an indication of how valid our regression coefficients
#are, with the larger the F-statistic, the larger the ratio of explained
#variability to unexplained variability
#The result of the F-test is an F-statistic equal to 42.97 and a p-value equal
#to 2.2 × 10-16 which is far below a typical significance level of alpha=0.05, so
#we reject the null hypothesis. We conclude that we have statistical evidence
#that at least one explanatory variable helps explain the outcome variable.
Often the best place to start understanding your data is to visualize it
You would like to know if their is a difference in monthly earnings between respondents who identify as Black and those who don’t
#Use ggplot2 to create a box plot to graph the median and distribution of earnings between groups
ggplot(data=wages, mapping=aes(x=as.factor(Black), y=MonthlyEarnings)) +
stat_boxplot(geom="errorbar",width=0.15)+
geom_boxplot() +
scale_y_continuous(labels=scales::label_dollar()) +
labs(title="Distribution of Monthly Earnings by Race",
x="Respondent Race (Black = 1, Other = 0)",
y="Monthly Earnings (Mean)") +
theme_bw()+
coord_flip()
With independent samples, we have two groups of observations, distinguishable by some measured characteristic to divide into these groups, and no member of one group is also in the other group. The outcome of the observations in one group must be independent of the outcome of the observations in the other group.
Testing differences in means between two independent samples is appropriate when a variable measured from two independent samples are in the same units and at the interval or ratio scale.
We want to estimate the difference in Monthly earnings if respondent indicated they were Black or not Black
#Calculate means between groups
wages %>%
group_by(Black) %>%
summarise(Mean_Monthly_Wage = mean(MonthlyEarnings, na.rm=T),
Median_Monthly_Wage=median(MonthlyEarnings, na.rm = T))
#Use function t.test to estimate and test differences between groups
wagestats <- t.test(MonthlyEarnings ~ Black, data=wages, conf.level=0.95, alternative="two.sided")
#Two-Tailed Independent Samples T-Test
wagestats
##
## Welch Two Sample t-test
##
## data: MonthlyEarnings by Black
## t = 8.3373, df = 192.73, p-value = 1.42e-14
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 194.5269 315.0854
## sample estimates:
## mean in group 0 mean in group 1
## 990.6479 735.8417
#Can see from p-value, that there is a statistical difference between
#earnings if the respondent identified as Black or not
#note difference in the means between the groups
#One-tailed Independent Samples T-Test
#Can test if non-Black respondents earned an average of more than $200/mo
t.test(MonthlyEarnings ~ Black, data=wages, conf.level=0.95, alternative="greater",mu=200)
##
## Welch Two Sample t-test
##
## data: MonthlyEarnings by Black
## t = 1.7933, df = 192.73, p-value = 0.03725
## alternative hypothesis: true difference in means between group 0 and group 1 is greater than 200
## 95 percent confidence interval:
## 204.2931 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 990.6479 735.8417
# can also set paired=TRUE if the variables are considered dependent - e.g., same variable two different conditions, or same variable at different times, two variables same individual
A nonparametric test that allows two groups or conditions or treatments to be compared without making the assumption that values are normally distributed.
The null hypothesis asserts that the medians of the two samples are identical.
# independent 2-group Mann-Whitney U Test
wilcox.test(wages$MonthlyEarnings ~ wages$Black)
##
## Wilcoxon rank sum test with continuity correction
##
## data: wages$MonthlyEarnings by wages$Black
## W = 68401, p-value = 1.657e-12
## alternative hypothesis: true location shift is not equal to 0
# where y is numeric and A is A binary factor
# dependent 2-group Wilcoxon Signed Rank Test
wilcox.test(wages$DadEdu,wages$MomEdu,paired=TRUE, alternative="less")
##
## Wilcoxon signed rank test with continuity correction
##
## data: wages$DadEdu and wages$MomEdu
## V = 43334, p-value = 7.045e-08
## alternative hypothesis: true location shift is less than 0
# where y1 and y2 are numeric, and dependent variables as they are measurements
#in education of the parents of the same individual, one has the potential to influence the other
Load packages needed
library(fpp2)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.5 ──
## ✔ forecast 8.24.0 ✔ expsmooth 2.3
## ✔ fma 2.5
##
library(quantmod) #reinstall this if getting an error fetching data
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
## Loading required package: TTR
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
Download Data We will use R code to download latest available macroeconomic data from the Federal Reserve Economic Database (FRED). It is useful to browse https://fred.stlouisfed.org to identify variables you are interested in. At each variable’s page, there is a code next to the name, which is used in the call below.
Below we use the getSymbols() function in the quantmod package to download and load into memory monthly unemployment rate that is not seasonally adjusted. The code for this variable in FRED is.
getSymbols("UNRATENSA", src="FRED")
## [1] "UNRATENSA"
#downloading monthly unemployment rate that is not seasonally adjusted from the Federal Reserve Economic Database
Unemployment rate over time - not seasonally adjusted
str(UNRATENSA) #note the object contains an underlying matrix and time index
## An xts object on 1948-01-01 / 2026-01-01 containing:
## Data: double [937, 1]
## Columns: UNRATENSA
## Index: Date [937] (TZ: "UTC")
## xts Attributes:
## $ src : chr "FRED"
## $ updated: POSIXct[1:1], format: "2026-03-03 18:25:25"
# A time series object can be created with xts() - https://rpubs.com/mpfoley73/504487
head(UNRATENSA)
## UNRATENSA
## 1948-01-01 4.0
## 1948-02-01 4.7
## 1948-03-01 4.5
## 1948-04-01 4.0
## 1948-05-01 3.4
## 1948-06-01 3.9
autoplot(UNRATENSA) #uses ggplot2 to draw
plot(UNRATENSA) #uses generic R plotting to draw
head(UNRATENSA)
## UNRATENSA
## 1948-01-01 4.0
## 1948-02-01 4.7
## 1948-03-01 4.5
## 1948-04-01 4.0
## 1948-05-01 3.4
## 1948-06-01 3.9
frequency(UNRATENSA)
## [1] 1
#frequency is wrong, this is monthly data, this number should be 12
unrate <- ts(UNRATENSA/100, frequency=12, start=c(1948,1))
#creates a new time series object with the appropriate adjustments
#unemployment data as decimal not percentage
#frequency =12
#start date = January 1948
head(unrate)
## Jan Feb Mar Apr May Jun
## 1948 0.040 0.047 0.045 0.040 0.034 0.039
str(unrate)
## Time-Series [1:937, 1] from 1948 to 2026: 0.04 0.047 0.045 0.04 0.034 0.039 0.039 0.036 0.034 0.029 ...
## - attr(*, "src")= chr "FRED"
## - attr(*, "updated")= POSIXct[1:1], format: "2026-03-03 18:25:25"
## - attr(*, "index")= num [1:937] -6.94e+08 -6.92e+08 -6.89e+08 -6.86e+08 -6.84e+08 ...
## ..- attr(*, "tzone")= chr "UTC"
## ..- attr(*, "tclass")= chr "Date"
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "UNRATENSA"
#Plot new time series object
autoplot(unrate) +
theme_bw() +
labs(title="Unemployment Rate",y="Unemployment Rate") +
scale_y_continuous(label=percent) +
theme_bw()
An autoregression model is a linear regression forecasting model where the time series variable is regressed against one or more lags of the same variable.
In the code below, we will estimate an autoregression model, create a forecast, and plot the forecast. We will use the Arima() function, which can be used to estimate a large class of models.
We will incorporate seasonality and estimate a model that includes the last two lags and the last two observations from the same month. This allows us to use information from the most recent months and and information from the same month from the most recent years.
horizon = 24 # setting horizon equal to 24 months
p = 2 # Autoregressive order (last number of lags)
k = 2 # Seasonal autoregressive order (last number of observations for same month)
#?Arima
sarmodel <- Arima(unrate, order=c(p,0,0),
seasonal=list(order=c(k,0,0),period=12))
#order -c(p,d,q)
#p = number of lagged autoregressive terms used,
#d = "the degree of difference" indicates how many times the data is
#differenced to achieve stationarity,
#q = "MA order" represents the number of lagged moving average error terms
#included in the model
#seasonal - order is same but for seasonality, somtimes seen as PDQ
sarmodel
## Series: unrate
## ARIMA(2,0,0)(2,0,0)[12] with non-zero mean
##
## Coefficients:
## ar1 ar2 sar1 sar2 mean
## 1.0327 -0.0989 0.3245 0.2986 0.0566
## s.e. 0.0320 0.0363 0.0320 0.0339 0.0059
##
## sigma^2 = 2.238e-05: log likelihood = 3681.17
## AIC=-7350.34 AICc=-7350.25 BIC=-7321.28
#Coefficients - represents impact of past values and errors
#SE - measures uncertainty of each coefficient
#Sigma^2 - variance of errors, lower indicates better fit
#log likelihood - how well a model fits data, higher better
#AIC/BIC goodness of fit, smaller better
fsar <- forecast(sarmodel, h=horizon)
#use the model to forecast 24 months (horizon)
autoplot(fsar) +
theme_bw() +
scale_y_continuous(labels=percent) +
coord_cartesian(xlim=c(2005,2028))+
theme(text = element_text(size=12)) +
labs(title="Unemployment Rate AR(2)(2) Forecasts",
x="Time (years)", y="Unemployment Rate")
We can look at autocorrelation for choosing lag length. The autocorrelation coefficient is simply the Pearson correlation coefficient of the series compared to a lag of the same series.
The ggAcf() function produces a plot of the autocorrelation function for a number of lag lengths
ggAcf(unrate, lag.max = 48)
#The horizontal blue dotted line shows where the correlation coefficients are
#statistically significantly greater than equal to zero. All the
#autocorrelation coefficients between 1 month lag and 48 months lag are
#statistically significant.
#That does not mean that all 48 lags should go into the regression though. Just
#persistence at one lag is enough to cause correlation at longer lags.
#partial autocorrelation function nets out explanatory power already explained
#by later lags.
ggPacf(unrate, lag.max = 48)
#Many of the lags between 1 and 11 are statistically significant. We also see
#statistically significant seasonality, going back at least 2 years.
#This plot suggests it may be appropriate to use more lags. We need to be
#careful not to include too many lags
p = 4 # Autoregressive order
k = 2 # Seasonal autoregressive order
sarmodel_4_2 <- Arima(unrate, order=c(p,0,0),
seasonal=list(order=c(k,0,0),period=12))
#compare the models
sarmodel
## Series: unrate
## ARIMA(2,0,0)(2,0,0)[12] with non-zero mean
##
## Coefficients:
## ar1 ar2 sar1 sar2 mean
## 1.0327 -0.0989 0.3245 0.2986 0.0566
## s.e. 0.0320 0.0363 0.0320 0.0339 0.0059
##
## sigma^2 = 2.238e-05: log likelihood = 3681.17
## AIC=-7350.34 AICc=-7350.25 BIC=-7321.28
sarmodel_4_2
## Series: unrate
## ARIMA(4,0,0)(2,0,0)[12] with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 sar1 sar2 mean
## 1.0359 -0.1292 0.0484 -0.0210 0.3255 0.2997 0.0551
## s.e. 0.0327 0.0471 0.0472 0.0334 0.0315 0.0314 0.0060
##
## sigma^2 = 2.24e-05: log likelihood = 3681.74
## AIC=-7347.48 AICc=-7347.32 BIC=-7308.73
#Forecast with new model
fsar <- forecast(sarmodel_4_2, h=horizon)
autoplot(fsar) +
theme_bw() +
scale_y_continuous(labels=percent) +
coord_cartesian(xlim=c(2005,2028)) +
theme(text = element_text(size=15)) +
labs(title="Unemployment Rate AR(4)(2) Forecasts",
x="Time (years)", y="Unemployment Rate")
Ordination is a collective term for multivariate techniques which summarize a multidimensional dataset in such a way that when it is projected onto a low dimensional space, any intrinsic pattern the data may possess becomes apparent upon visual inspection
We will mainly use the vegan package to introduce you to the unconstrained ordination technique: Non-metric Multidimensional Scaling (NMDS).
Generally, ordination techniques are used in ecology to describe relationships between species composition patterns and the underlying environmental gradients.
library(vegan)
## Loading required package: permute
# Load a community abundance matrix and sample data
community<-read.csv("https://tinyurl.com/sim-community",header = T,row.names = 1)
samp_data<-read.csv("https://tinyurl.com/sample-data-sim",header = T, na.strings = "NA")
head(community)
head(samp_data)
# Calculate a distance matrix and nmds
dist_mat <- vegdist(t(community), method = "bray") #note need to transpose
nmds <- metaMDS(dist_mat, k = 2)
## Run 0 stress 0.1240372
## Run 1 stress 0.1256942
## Run 2 stress 0.1302118
## Run 3 stress 0.1312815
## Run 4 stress 0.1319521
## Run 5 stress 0.1344407
## Run 6 stress 0.1302696
## Run 7 stress 0.132393
## Run 8 stress 0.1343721
## Run 9 stress 0.1351043
## Run 10 stress 0.1362542
## Run 11 stress 0.1281839
## Run 12 stress 0.1261206
## Run 13 stress 0.1228673
## ... New best solution
## ... Procrustes: rmse 0.01860581 max resid 0.1300421
## Run 14 stress 0.1369991
## Run 15 stress 0.1335358
## Run 16 stress 0.1351451
## Run 17 stress 0.1361127
## Run 18 stress 0.1387353
## Run 19 stress 0.1272472
## Run 20 stress 0.1272718
## *** Best solution was not repeated -- monoMDS stopping criteria:
## 3: no. of iterations >= maxit
## 17: stress ratio > sratmax
str(nmds)
## List of 36
## $ nobj : int 100
## $ nfix : int 0
## $ ndim : num 2
## $ ndis : int 4950
## $ ngrp : int 1
## $ diss : num [1:4950] 0.226 0.231 0.235 0.241 0.246 ...
## $ iidx : int [1:4950] 85 33 3 85 9 81 99 6 4 83 ...
## $ jidx : int [1:4950] 83 31 1 76 3 80 92 1 3 81 ...
## $ xinit : num [1:200] 0.765 0.119 0.914 0.767 0.688 ...
## $ istart : int 1
## $ isform : int 1
## $ ities : int 1
## $ iregn : int 1
## $ iscal : int 1
## $ maxits : int 200
## $ sratmx : num 1
## $ strmin : num 1e-04
## $ sfgrmn : num 1e-07
## $ dist : num [1:4950] 0.0415 0.0771 0.0632 0.0915 0.0521 ...
## $ dhat : num [1:4950] 0.0415 0.0645 0.0645 0.0645 0.0645 ...
## $ points : num [1:100, 1:2] -0.914 -0.799 -0.871 -0.963 -0.747 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:100] "S002" "S003" "S004" "S005" ...
## .. ..$ : chr [1:2] "MDS1" "MDS2"
## ..- attr(*, "centre")= logi TRUE
## ..- attr(*, "pc")= logi TRUE
## ..- attr(*, "halfchange")= logi TRUE
## ..- attr(*, "internalscaling")= num 1.4
## $ stress : num 0.123
## $ grstress : num 0.123
## $ iters : int 183
## $ icause : int 3
## $ call : language metaMDS(comm = dist_mat, k = 2)
## $ model : chr "global"
## $ distmethod: chr "bray"
## $ distcall : chr "vegdist(x = t(community), method = \"bray\")"
## $ distance : chr "bray"
## $ converged : num 0
## $ tries : num 20
## $ bestry : num 13
## $ engine : chr "monoMDS"
## $ species : logi NA
## $ data : chr "dist_mat"
## - attr(*, "class")= chr [1:2] "metaMDS" "monoMDS"
# Extract NMDS scores
scores_df <- as.data.frame(scores(nmds, display = "sites"))
scores_df$sample_id <- rownames(scores_df)
# Add metadata to the NMDS scores
nmds_df <- merge(scores_df, samp_data, by = "sample_id")
head(nmds_df)
# Establish the factors for each of the grouping variables for plotting
nmds_df$location <- factor(nmds_df$location)
nmds_df$depth <- factor(nmds_df$depth)
# Use ggplot2 to create an NMDS
ggplot(nmds_df, aes(x = NMDS1, y = NMDS2,
color = location,
shape = depth)) +
geom_point(size = 3, alpha = 0.85) +
scale_color_brewer(palette = "Dark2",type="seq")+
stat_ellipse(aes(group = interaction(location, depth),
color = location),
level = 0.95,
linewidth = 0.8,
linetype = 2) +
theme_bw(base_size = 14) +
labs(title = "NMDS (Bray-Curtis)",
subtitle = "Simulated microbial community",
color = "Location",
shape = "Depth (m)") +
theme(panel.grid = element_blank())