Create R Project

Walk attendees through creating an R project using the wizard to organize the code and data used in this workshop

Install Packages

# These only need to be executed once for your machine or RStudio Cloud project
#install.packages("tidyverse")
#install.packages("fpp2")
#install.packages("quantmod")
#install.packages("scales")
#install.packages("vegan")

Distribution analysis

R comes pre-loaded with many fun data sets to play around with. The dataset “rivers” gives the lengths (in miles) of 141 “major” rivers in North America, as compiled by the US Geological Survey.

rivers

##   [1]  735  320  325  392  524  450 1459  135  465  600  330  336  280  315  870
##  [16]  906  202  329  290 1000  600  505 1450  840 1243  890  350  407  286  280
##  [31]  525  720  390  250  327  230  265  850  210  630  260  230  360  730  600
##  [46]  306  390  420  291  710  340  217  281  352  259  250  470  680  570  350
##  [61]  300  560  900  625  332 2348 1171 3710 2315 2533  780  280  410  460  260
##  [76]  255  431  350  760  618  338  981 1306  500  696  605  250  411 1054  735
##  [91]  233  435  490  310  460  383  375 1270  545  445 1885  380  300  380  377
## [106]  425  276  210  800  420  350  360  538 1100 1205  314  237  610  360  540
## [121] 1038  424  310  300  444  301  268  620  215  652  900  525  246  360  529
## [136]  500  720  270  430  671 1770

To start getting a feel for our data, we can ask R for a summary,

summary(rivers)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   135.0   310.0   425.0   591.2   680.0  3710.0

a histogram,

hist(rivers)

and a stem-and-leaf plot to visualize the distribution of the data

stem(rivers, scale = 2) #the optional argument "scale" adjusts the plot length

## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##    1 | 4
##    2 | 0112233345555666677788888999
##    3 | 00001111223333344455555666688888999
##    4 | 111222333445566779
##    5 | 001233344567
##    6 | 000112233578
##    7 | 012234468
##    8 | 04579
##    9 | 0018
##   10 | 045
##   11 | 07
##   12 | 147
##   13 | 1
##   14 | 56
##   15 | 
##   16 | 
##   17 | 7
##   18 | 9
##   19 | 
##   20 | 
##   21 | 
##   22 | 
##   23 | 25
##   24 | 
##   25 | 3
##   26 | 
##   27 | 
##   28 | 
##   29 | 
##   30 | 
##   31 | 
##   32 | 
##   33 | 
##   34 | 
##   35 | 
##   36 | 
##   37 | 1

Shapiro-Wilk Test

The Shapiro-Wilk test is a test for normality. It compares the sampled values with the expected values of the order statistics for a sample of the same size. Here, the Shapiro-Wilk test allows us to reject the null hypothesis that the distribution of river lenght is normal.

shapiro.test(rivers)

## 
##  Shapiro-Wilk normality test
## 
## data:  rivers
## W = 0.66662, p-value < 2.2e-16

Maybe the length of rivers follow an exponential distrubution? (Knowing nothing about rivers, this seems reasonably plausible – it means that if there are n rivers of a certain length, there will be ~2n rivers of half that length.) We can test this qualitatively using a Q-Q plot:

lambda <- 1 / mean(rivers)
observed <- sort(rivers)
expected <- qexp(ppoints(length(rivers)), rate = lambda)

qqplot(expected, observed)
abline(0,1)

Let’s break this down:

ppoints(n) generates a list of n evenly spaced numbers between 0 and 1.
qexp generates a list of quantiles from an exponential distribution with mean = 1 / rate (the /rate/ of an exponential distribution is usually denoted by lambda, and is so-named because of exponential distribution is often used to model growth or decay over time, but that’s not how we’re using it here)
qqplot then takes these quantiles of a theoretical exponential distribution, and plots them against the observed data, which are their own quantiles (think about why this is true if you don’t see it right away!).
abline(a, b) adds a straight line with intercept a and slope b, so we simply added the line y = x.

The closer a Q-Q plot is to the line y = x, the more the observed distribution is similar to the expected distribution (think about why this is true if you don’t see it right away!).

The Q-Q plot looks kind-of-okay: it is reasonably close to the line y = x, but it also deviates from this line in a clearly systematic way. So let’s try some statistical testing.

Kolgomorov-Smirnov test:

The Kolmogorov-Smirnov test is a nonparametric test of whether a sample comes from a particular hypothesized distribution.

ks.test(rivers, "pexp", rate = lambda)

## Warning in ks.test.default(rivers, "pexp", rate = lambda): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  rivers
## D = 0.2848, p-value = 2.331e-10
## alternative hypothesis: two-sided

Probably not exponential :(

log-normal?

ks.test(rivers, "plnorm", meanlog = mean(log(rivers)), sdlog = sd(log(rivers)))

## Warning in ks.test.default(rivers, "plnorm", meanlog = mean(log(rivers)), :
## ties should not be present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  rivers
## D = 0.092305, p-value = 0.1808
## alternative hypothesis: two-sided

maybe?

shapiro.test(log(rivers))

## 
##  Shapiro-Wilk normality test
## 
## data:  log(rivers)
## W = 0.94801, p-value = 3.945e-05

probably not :(

In the above code, we used R’s built-in /d-p-q-r functions/, whose names are a(n abbreviation of a) probability distribution prefixed by one of the letters ‘d’, ‘p’, ‘q’, or ‘r’. The functions that start with:

 'd' give the value of probability density function,

'p' give the value of the cumulative distribution function,

‘q’ give the value of the quantile function, and

'r' give a randomly generated value from the distribution.

The available distributions are:

```
   "norm"        (normal),
```
```
  "pois"        (Poisson),
```
```
 "binom"       (binomial),
```
```
  "nbinom"      (negative binomial),
```
```
   "t"           (Student's t),
```
```
  "chisq"       (chi-squared),
```
```
 "beta"        (Beta),
```
```
"cauchy"      (Cauchy),
```
```
  "exp"         (exponential),
```
```
   "f"           (F),
```
```
  "gamma"       (Gamma),
```
```
 "geom"        (geometric),
```
```
"hyper"       (hypergeometric),
```
```
 "logis"       (logistic),
```
```
  "lnorm"       (log-normal),
```
```
 "tukey"       (Studentized range),
```
```
"weibull"     (Weibull),
```
“wilcox” (Wilcoxon rank sum statistic), and

 "signrank"    (Wilcoxon signed rank statistic).

For other distributions, one can use the function dpqr(fun, min, max), which returns a list of functions d(x), p(x), q(p), r(n), or the functions d(x,…), p(x,…), qn(x,…), and r(x,…) (note the name of qn to avoid conflicts with the similar function q(p, T.dist, T.dist.par)).

Remark: The function ks.test expects an r-function in the first argument and a p-function in the second argument.

There is a consensus in the literature that river lengths are actually distributed according to a power law distribution (see Fractal River Basins: Chance and Self-Organization by Rinaldo & Rodr’iguez-Iturbe). However it is generally quite difficult to test whether some particular data is distributed according to a power law, and this is an active area of statistical research (see e.g. “Power-Law Distributions in Empirical Data” by Clauset, Shalizi, & Newman)

Exploring Relationships between variables

Correlations & Regressions

Bivariate Relationships Estimate and visualize how two numerical variables are positively or negatively related.

A correlation exists between two variables when one is related to the other such that there is co-movement. Positive correlation means as one variable increases, the other variable also increases. Negative correlation means as one variable increases, the other variable decreases.

Download the Data

The code below downloads a CSV file that includes data from 1980 for 935 individuals on variables including their total monthly earnings (MonthlyEarnings) and a number of variables that could influence income, including years of education (YearsEdu). The data set originally comes from textbook website for Stock and Watson’s Introduction to Econometrics.

wages <- read.csv("http://murraylax.org/datasets/wage2.csv", na.strings = ".");

head(wages)

Scatter plot to visually inspect the relationship between years of school and monthly earnings

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

ggplot(data=wages,mapping=aes(x=YearsEdu, y=MonthlyEarnings)) +
  geom_point() +
  labs(title= "Monthly Earnigs vs. Years of Education",
       x= "Years of Education",
       y="Monthly Earnings") +
  theme_bw()

ggplot(data=wages,mapping=aes(x=YearsEdu, y=MonthlyEarnings)) +
  geom_point(position=position_jitter(width=0.4,seed=123)) +
  labs(title= "Monthly Earnigs vs. Years of Education",
       x= "Years of Education",
       y="Monthly Earnings") +
  theme_bw()

To learn more about data visualization using ggplot, you can consult the resources in the workshop powerpoint.

Compute best fitting straight line & add it to the plot

Combine with above add regression line to above plot

lmwages <- lm(MonthlyEarnings ~ YearsEdu, data=wages)
#lm() is a formula of the form y ~ x, fits a function of linear form y=a+bx

#Find the equation of the line by calling the coefficients
lmwages$coefficients

## (Intercept)    YearsEdu 
##   146.95244    60.21428

#Calculate the confidence intervals
confint(lmwages,level=0.95)

##                2.5 %    97.5 %
## (Intercept) -5.56393 299.46881
## YearsEdu    49.03783  71.39074

#Plot the scatter plot with regression line and confidence intervals
ggplot(data=wages, mapping=aes(x=YearsEdu, y=MonthlyEarnings)) +
  geom_point(position = position_jitter(width=0.3,seed=123)) +
  labs(title="Monthly Earnings Versus Years of Education",
       x="Years of Education",
       y="Monthly Earnings") +
  geom_smooth(method="lm",se = T) +
  theme_bw()

## `geom_smooth()` using formula = 'y ~ x'

Measure the strength of the correlation and Confidence intervals

cor(x=wages$MonthlyEarnings, y=wages$YearsEdu, method='pearson')

## [1] 0.3271087

confint(lmwages,level=0.95)

##                2.5 %    97.5 %
## (Intercept) -5.56393 299.46881
## YearsEdu    49.03783  71.39074

Our sample estimate for the correlation coefficient is positive, but is this enough evidence that there is a relationship between years of schooling and real GDP growth in the population? To answer this, let us conduct a hypothesis test with the following null and alternative hypotheses:

Null hypothesis: rho equals 0

Alternative hypothesis: rho does not equal 0

The not-equal sign in the alternative hypothesis implies that this is a two-tailed test, so either positive or negative Pearson correlation coefficients significantly far away from zero will result in the null hypothesis rejected.

cor.test(x=wages$MonthlyEarnings, y=wages$YearsEdu, 
         alternative="two.sided", conf.level=0.95, method='pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  wages$MonthlyEarnings and wages$YearsEdu
## t = 10.573, df = 933, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2686296 0.3831853
## sample estimates:
##       cor 
## 0.3271087

#If we have reason to believe that earnings are positively correlated 
#to years of education, can do a one.sided test

Non-linear Regression

In our example we are assuming a linear relationship between years of education and monthly earnings, but it may be more appropriate to suggest that each year of education leads to a similar percentage increase in monthly earnings.

To estimate such a relationship, we estimate the following regression equation that includes the natural logarithm of the dependent variable (monthly earnings)

# Estimage the regression equation with the log of monthly earnings as the outcome variable
loglmwages <- lm(log(wages$MonthlyEarnings) ~ wages$YearsEdu)

#view the summary
summary(loglmwages)

## 
## Call:
## lm(formula = log(wages$MonthlyEarnings) ~ wages$YearsEdu)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.94620 -0.24832  0.03507  0.27440  1.28106 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.973062   0.081374   73.40   <2e-16 ***
## wages$YearsEdu 0.059839   0.005963   10.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4003 on 933 degrees of freedom
## Multiple R-squared:  0.09742,    Adjusted R-squared:  0.09645 
## F-statistic: 100.7 on 1 and 933 DF,  p-value: < 2.2e-16

#Coefficient suggests that one additional year of education 
#is associated with ~6% higher monthly salary

Multiple Regression Analysis

Extends simple bivariate regression analysis with the inclusion of more than one explanatory variable.

The procedure is used to determine how one or more explanatory variables influence an outcome variable, while holding all other explanatory variables fixed.

We can use the lm() function to estimate the regression. In the code below we call lm() with a single parameter that is a formula specifying how the outcome variable, MonthlyEarnings depends linearly on the seven explanatory variables. The function lm() produces a large list of output, which we assign to an object we name lmwages. We call the summary() function next to display the coefficient estimates to the screen.

lmwagesMR <- lm(MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience 
              + Tenure, data=wages)
summary(lmwagesMR)

## 
## Call:
## lm(formula = MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience + 
##     Tenure, data = wages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -826.33 -243.85  -44.83  180.83 2253.35 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -531.0392   115.0513  -4.616 4.47e-06 ***
## IQ                 3.6966     0.9651   3.830 0.000137 ***
## Knowledge          8.2703     1.8273   4.526 6.79e-06 ***
## YearsEdu          47.2698     7.2980   6.477 1.51e-10 ***
## YearsExperience   11.8589     3.2494   3.650 0.000277 ***
## Tenure             6.2465     2.4565   2.543 0.011156 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 365.4 on 929 degrees of freedom
## Multiple R-squared:  0.1878, Adjusted R-squared:  0.1834 
## F-statistic: 42.97 on 5 and 929 DF,  p-value: < 2.2e-16

Variance Decomposition

The goal here is to identify how much variability in the outcome variable, average monthly earnings, is explained by the all of the explanatory variables, including IQ, knowledge of a worker’s job, years of education, years experience, and tenure.

How much of the variability in income is explained by the variables in our regression, and how much is explained by other factors we are neglecting or unable to account for?

We want to know the following: Sum of Squares Explained (SSE) (sometimes referred to as the Sum of Squares Regression): measure of the variability in the outcome variable that is explained by the explanatory variables, i.e. the x-variables in your regression.

Sum of Squares Residual (SSR) (sometimes referred to as the Sum of Squares Error): measures of the variability in the outcome variable that is not explained by your regression

Analysis of Variance (ANOVA)

#An ANOVA table is a common way to summarize these measures of variability. 
anova_wages<-anova(lmwagesMR)
anova_wages

#The Sum of Squares for the variables vs. the residuals allows for 
#comparison of how much of the variability is explained by the variables
#and how much variability is not explained by our regression

#coefficient of determination, sometimes referred to as the R-Squared value, 
#is a measure of what percentage of the variability in your outcome variable 
#is explained by your explanatory variables.

summary(lmwagesMR)

## 
## Call:
## lm(formula = MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience + 
##     Tenure, data = wages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -826.33 -243.85  -44.83  180.83 2253.35 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -531.0392   115.0513  -4.616 4.47e-06 ***
## IQ                 3.6966     0.9651   3.830 0.000137 ***
## Knowledge          8.2703     1.8273   4.526 6.79e-06 ***
## YearsEdu          47.2698     7.2980   6.477 1.51e-10 ***
## YearsExperience   11.8589     3.2494   3.650 0.000277 ***
## Tenure             6.2465     2.4565   2.543 0.011156 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 365.4 on 929 degrees of freedom
## Multiple R-squared:  0.1878, Adjusted R-squared:  0.1834 
## F-statistic: 42.97 on 5 and 929 DF,  p-value: < 2.2e-16

# 18.78% of the variability in people’s monthly earnings is explained by IQ,
#knowledge of one’s job, educational attainment, experience, and tenure. The 
#remaining 81.22% of variability in monthly earnings we cannot explain.

#F-statistic also gives an indication of how valid our regression coefficients 
#are, with the larger the F-statistic, the larger the ratio of explained 
#variability to unexplained variability

#The result of the F-test is an F-statistic equal to 42.97 and a p-value equal 
#to 2.2 × 10-16 which is far below a typical significance level of alpha=0.05, so
#we reject the null hypothesis. We conclude that we have statistical evidence 
#that at least one explanatory variable helps explain the outcome variable.

Comparison of Means

Data Visualization of Means

Often the best place to start understanding your data is to visualize it

You would like to know if their is a difference in monthly earnings between respondents who identify as Black and those who don’t

#Use ggplot2 to create a box plot to graph the median and distribution of earnings between groups
ggplot(data=wages, mapping=aes(x=as.factor(Black), y=MonthlyEarnings)) + 
  stat_boxplot(geom="errorbar",width=0.15)+
  geom_boxplot() +
  scale_y_continuous(labels=scales::label_dollar()) + 
  labs(title="Distribution of Monthly Earnings by Race", 
       x="Respondent Race (Black = 1, Other = 0)",
       y="Monthly Earnings (Mean)") +
  theme_bw()+
  coord_flip()

Estimated Differences in Means between Independent Samples

With independent samples, we have two groups of observations, distinguishable by some measured characteristic to divide into these groups, and no member of one group is also in the other group. The outcome of the observations in one group must be independent of the outcome of the observations in the other group.

Testing differences in means between two independent samples is appropriate when a variable measured from two independent samples are in the same units and at the interval or ratio scale.

We want to estimate the difference in Monthly earnings if respondent indicated they were Black or not Black

#Calculate means between groups
wages %>%
  group_by(Black) %>%
  summarise(Mean_Monthly_Wage = mean(MonthlyEarnings, na.rm=T),
            Median_Monthly_Wage=median(MonthlyEarnings, na.rm = T))

#Use function t.test to estimate and test differences between groups
wagestats <- t.test(MonthlyEarnings ~ Black, data=wages, conf.level=0.95, alternative="two.sided")


#Two-Tailed Independent Samples T-Test
wagestats

## 
##  Welch Two Sample t-test
## 
## data:  MonthlyEarnings by Black
## t = 8.3373, df = 192.73, p-value = 1.42e-14
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  194.5269 315.0854
## sample estimates:
## mean in group 0 mean in group 1 
##        990.6479        735.8417

  #Can see from p-value, that there is a statistical difference between
  #earnings if the respondent identified as Black or not
  #note difference in the means between the groups

#One-tailed Independent Samples T-Test
  #Can test if non-Black respondents earned an average of more than $200/mo
t.test(MonthlyEarnings ~ Black, data=wages, conf.level=0.95, alternative="greater",mu=200)

## 
##  Welch Two Sample t-test
## 
## data:  MonthlyEarnings by Black
## t = 1.7933, df = 192.73, p-value = 0.03725
## alternative hypothesis: true difference in means between group 0 and group 1 is greater than 200
## 95 percent confidence interval:
##  204.2931      Inf
## sample estimates:
## mean in group 0 mean in group 1 
##        990.6479        735.8417

# can also set paired=TRUE if the variables are considered dependent - e.g., same variable two different conditions, or same variable at different times, two variables same individual

Non-parametric tests for group differences

Mann-Whitney Utest

A nonparametric test that allows two groups or conditions or treatments to be compared without making the assumption that values are normally distributed.

The null hypothesis asserts that the medians of the two samples are identical.

# independent 2-group Mann-Whitney U Test
wilcox.test(wages$MonthlyEarnings ~ wages$Black)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  wages$MonthlyEarnings by wages$Black
## W = 68401, p-value = 1.657e-12
## alternative hypothesis: true location shift is not equal to 0

# where y is numeric and A is A binary factor

Wilcoxon Signed Rank Test

# dependent 2-group Wilcoxon Signed Rank Test
wilcox.test(wages$DadEdu,wages$MomEdu,paired=TRUE, alternative="less")

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  wages$DadEdu and wages$MomEdu
## V = 43334, p-value = 7.045e-08
## alternative hypothesis: true location shift is less than 0

# where y1 and y2 are numeric, and dependent variables as they are measurements 
#in education of the parents of the same individual, one has the potential to influence the other

Time-Series Data

Load packages needed

library(fpp2)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## ── Attaching packages ────────────────────────────────────────────── fpp2 2.5 ──

## ✔ forecast  8.24.0     ✔ expsmooth 2.3   
## ✔ fma       2.5

##

library(quantmod) #reinstall this if getting an error fetching data

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

## Loading required package: TTR

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

Download Data We will use R code to download latest available macroeconomic data from the Federal Reserve Economic Database (FRED). It is useful to browse https://fred.stlouisfed.org to identify variables you are interested in. At each variable’s page, there is a code next to the name, which is used in the call below.

Below we use the getSymbols() function in the quantmod package to download and load into memory monthly unemployment rate that is not seasonally adjusted. The code for this variable in FRED is.

getSymbols("UNRATENSA", src="FRED")

## [1] "UNRATENSA"

#downloading monthly unemployment rate that is not seasonally adjusted from the Federal Reserve Economic Database

Plot time series object

Unemployment rate over time - not seasonally adjusted

str(UNRATENSA) #note the object contains an underlying matrix and time index

## An xts object on 1948-01-01 / 2026-01-01 containing: 
##   Data:    double [937, 1]
##   Columns: UNRATENSA
##   Index:   Date [937] (TZ: "UTC")
##   xts Attributes:
##     $ src    : chr "FRED"
##     $ updated: POSIXct[1:1], format: "2026-03-03 18:25:25"

# A time series object can be created with xts() - https://rpubs.com/mpfoley73/504487
head(UNRATENSA)

##            UNRATENSA
## 1948-01-01       4.0
## 1948-02-01       4.7
## 1948-03-01       4.5
## 1948-04-01       4.0
## 1948-05-01       3.4
## 1948-06-01       3.9

autoplot(UNRATENSA) #uses ggplot2 to draw

plot(UNRATENSA) #uses generic R plotting to draw

Check data from a time-series perspective

head(UNRATENSA)

##            UNRATENSA
## 1948-01-01       4.0
## 1948-02-01       4.7
## 1948-03-01       4.5
## 1948-04-01       4.0
## 1948-05-01       3.4
## 1948-06-01       3.9

frequency(UNRATENSA)

## [1] 1

#frequency is wrong, this is monthly data, this number should be 12

unrate <- ts(UNRATENSA/100, frequency=12, start=c(1948,1))
#creates a new time series object with the appropriate adjustments
#unemployment data as decimal not percentage
#frequency =12
#start date = January 1948

head(unrate)

##        Jan   Feb   Mar   Apr   May   Jun
## 1948 0.040 0.047 0.045 0.040 0.034 0.039

str(unrate)

##  Time-Series [1:937, 1] from 1948 to 2026: 0.04 0.047 0.045 0.04 0.034 0.039 0.039 0.036 0.034 0.029 ...
##  - attr(*, "src")= chr "FRED"
##  - attr(*, "updated")= POSIXct[1:1], format: "2026-03-03 18:25:25"
##  - attr(*, "index")= num [1:937] -6.94e+08 -6.92e+08 -6.89e+08 -6.86e+08 -6.84e+08 ...
##   ..- attr(*, "tzone")= chr "UTC"
##   ..- attr(*, "tclass")= chr "Date"
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr "UNRATENSA"

#Plot new time series object
autoplot(unrate) +
  theme_bw() +
  labs(title="Unemployment Rate",y="Unemployment Rate") +
  scale_y_continuous(label=percent) +
  theme_bw()

Forcasting future observations

Autoregression Models

An autoregression model is a linear regression forecasting model where the time series variable is regressed against one or more lags of the same variable.

In the code below, we will estimate an autoregression model, create a forecast, and plot the forecast. We will use the Arima() function, which can be used to estimate a large class of models.

We will incorporate seasonality and estimate a model that includes the last two lags and the last two observations from the same month. This allows us to use information from the most recent months and and information from the same month from the most recent years.

horizon = 24 # setting horizon equal to 24 months
p = 2 # Autoregressive order (last number of lags)
k = 2 # Seasonal autoregressive order (last number of observations for same month)


#?Arima

sarmodel <- Arima(unrate, order=c(p,0,0), 
                  seasonal=list(order=c(k,0,0),period=12))
#order -c(p,d,q)
#p = number of lagged autoregressive terms used, 
#d = "the degree of difference" indicates how many times the data is 
      #differenced to achieve stationarity, 
#q = "MA order" represents the number of lagged moving average error terms 
      #included in the model
#seasonal - order is same but for seasonality, somtimes seen as PDQ

sarmodel

## Series: unrate 
## ARIMA(2,0,0)(2,0,0)[12] with non-zero mean 
## 
## Coefficients:
##          ar1      ar2    sar1    sar2    mean
##       1.0327  -0.0989  0.3245  0.2986  0.0566
## s.e.  0.0320   0.0363  0.0320  0.0339  0.0059
## 
## sigma^2 = 2.238e-05:  log likelihood = 3681.17
## AIC=-7350.34   AICc=-7350.25   BIC=-7321.28

#Coefficients - represents impact of past values and errors
#SE - measures uncertainty of each coefficient
#Sigma^2 - variance of errors, lower indicates better fit
#log likelihood - how well a model fits data, higher better
#AIC/BIC goodness of fit, smaller better

fsar <- forecast(sarmodel, h=horizon)
  #use the model to forecast 24 months (horizon)

autoplot(fsar) + 
  theme_bw() +
  scale_y_continuous(labels=percent) +
  coord_cartesian(xlim=c(2005,2028))+
  theme(text = element_text(size=12)) +
  labs(title="Unemployment Rate AR(2)(2) Forecasts", 
       x="Time (years)", y="Unemployment Rate")

Autocorrelation

We can look at autocorrelation for choosing lag length. The autocorrelation coefficient is simply the Pearson correlation coefficient of the series compared to a lag of the same series.

The ggAcf() function produces a plot of the autocorrelation function for a number of lag lengths

ggAcf(unrate, lag.max = 48)

#The horizontal blue dotted line shows where the correlation coefficients are 
#statistically significantly greater than equal to zero. All the 
#autocorrelation coefficients between 1 month lag and 48 months lag are 
#statistically significant.

#That does not mean that all 48 lags should go into the regression though. Just
#persistence at one lag is enough to cause correlation at longer lags. 

#partial autocorrelation function nets out explanatory power already explained 
#by later lags.

ggPacf(unrate, lag.max = 48)

#Many of the lags between 1 and 11 are statistically significant. We also see 
#statistically significant seasonality, going back at least 2 years.

#This plot suggests it may be appropriate to use more lags. We need to be 
#careful not to include too many lags
p = 4 # Autoregressive order
k = 2 # Seasonal autoregressive order
sarmodel_4_2 <- Arima(unrate, order=c(p,0,0), 
                  seasonal=list(order=c(k,0,0),period=12))

#compare the models
sarmodel

## Series: unrate 
## ARIMA(2,0,0)(2,0,0)[12] with non-zero mean 
## 
## Coefficients:
##          ar1      ar2    sar1    sar2    mean
##       1.0327  -0.0989  0.3245  0.2986  0.0566
## s.e.  0.0320   0.0363  0.0320  0.0339  0.0059
## 
## sigma^2 = 2.238e-05:  log likelihood = 3681.17
## AIC=-7350.34   AICc=-7350.25   BIC=-7321.28

sarmodel_4_2

## Series: unrate 
## ARIMA(4,0,0)(2,0,0)[12] with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4    sar1    sar2    mean
##       1.0359  -0.1292  0.0484  -0.0210  0.3255  0.2997  0.0551
## s.e.  0.0327   0.0471  0.0472   0.0334  0.0315  0.0314  0.0060
## 
## sigma^2 = 2.24e-05:  log likelihood = 3681.74
## AIC=-7347.48   AICc=-7347.32   BIC=-7308.73

#Forecast with new model
fsar <- forecast(sarmodel_4_2, h=horizon)

autoplot(fsar) + 
  theme_bw() +
  scale_y_continuous(labels=percent) +
  coord_cartesian(xlim=c(2005,2028)) +
  theme(text = element_text(size=15)) +
  labs(title="Unemployment Rate AR(4)(2) Forecasts", 
       x="Time (years)", y="Unemployment Rate")

Ordination plot - NMDS

Ordination is a collective term for multivariate techniques which summarize a multidimensional dataset in such a way that when it is projected onto a low dimensional space, any intrinsic pattern the data may possess becomes apparent upon visual inspection

We will mainly use the vegan package to introduce you to the unconstrained ordination technique: Non-metric Multidimensional Scaling (NMDS).

Generally, ordination techniques are used in ecology to describe relationships between species composition patterns and the underlying environmental gradients.

Load vegan package

library(vegan)

## Loading required package: permute

Load community dataset

# Load a community abundance matrix and sample data
community<-read.csv("https://tinyurl.com/sim-community",header = T,row.names = 1)
samp_data<-read.csv("https://tinyurl.com/sample-data-sim",header = T, na.strings = "NA")

head(community)

head(samp_data)

Calculate distance matrix

# Calculate a distance matrix and nmds

dist_mat <- vegdist(t(community), method = "bray") #note need to transpose
nmds <- metaMDS(dist_mat, k = 2)

## Run 0 stress 0.1240372 
## Run 1 stress 0.1256942 
## Run 2 stress 0.1302118 
## Run 3 stress 0.1312815 
## Run 4 stress 0.1319521 
## Run 5 stress 0.1344407 
## Run 6 stress 0.1302696 
## Run 7 stress 0.132393 
## Run 8 stress 0.1343721 
## Run 9 stress 0.1351043 
## Run 10 stress 0.1362542 
## Run 11 stress 0.1281839 
## Run 12 stress 0.1261206 
## Run 13 stress 0.1228673 
## ... New best solution
## ... Procrustes: rmse 0.01860581  max resid 0.1300421 
## Run 14 stress 0.1369991 
## Run 15 stress 0.1335358 
## Run 16 stress 0.1351451 
## Run 17 stress 0.1361127 
## Run 18 stress 0.1387353 
## Run 19 stress 0.1272472 
## Run 20 stress 0.1272718 
## *** Best solution was not repeated -- monoMDS stopping criteria:
##      3: no. of iterations >= maxit
##     17: stress ratio > sratmax

str(nmds)

## List of 36
##  $ nobj      : int 100
##  $ nfix      : int 0
##  $ ndim      : num 2
##  $ ndis      : int 4950
##  $ ngrp      : int 1
##  $ diss      : num [1:4950] 0.226 0.231 0.235 0.241 0.246 ...
##  $ iidx      : int [1:4950] 85 33 3 85 9 81 99 6 4 83 ...
##  $ jidx      : int [1:4950] 83 31 1 76 3 80 92 1 3 81 ...
##  $ xinit     : num [1:200] 0.765 0.119 0.914 0.767 0.688 ...
##  $ istart    : int 1
##  $ isform    : int 1
##  $ ities     : int 1
##  $ iregn     : int 1
##  $ iscal     : int 1
##  $ maxits    : int 200
##  $ sratmx    : num 1
##  $ strmin    : num 1e-04
##  $ sfgrmn    : num 1e-07
##  $ dist      : num [1:4950] 0.0415 0.0771 0.0632 0.0915 0.0521 ...
##  $ dhat      : num [1:4950] 0.0415 0.0645 0.0645 0.0645 0.0645 ...
##  $ points    : num [1:100, 1:2] -0.914 -0.799 -0.871 -0.963 -0.747 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:100] "S002" "S003" "S004" "S005" ...
##   .. ..$ : chr [1:2] "MDS1" "MDS2"
##   ..- attr(*, "centre")= logi TRUE
##   ..- attr(*, "pc")= logi TRUE
##   ..- attr(*, "halfchange")= logi TRUE
##   ..- attr(*, "internalscaling")= num 1.4
##  $ stress    : num 0.123
##  $ grstress  : num 0.123
##  $ iters     : int 183
##  $ icause    : int 3
##  $ call      : language metaMDS(comm = dist_mat, k = 2)
##  $ model     : chr "global"
##  $ distmethod: chr "bray"
##  $ distcall  : chr "vegdist(x = t(community), method = \"bray\")"
##  $ distance  : chr "bray"
##  $ converged : num 0
##  $ tries     : num 20
##  $ bestry    : num 13
##  $ engine    : chr "monoMDS"
##  $ species   : logi NA
##  $ data      : chr "dist_mat"
##  - attr(*, "class")= chr [1:2] "metaMDS" "monoMDS"

# Extract NMDS scores
scores_df <- as.data.frame(scores(nmds, display = "sites"))
scores_df$sample_id <- rownames(scores_df)

# Add metadata to the NMDS scores
nmds_df <- merge(scores_df, samp_data, by = "sample_id")
head(nmds_df)

Plot NMDS

# Establish the factors for each of the grouping variables for plotting
nmds_df$location <- factor(nmds_df$location)
nmds_df$depth <- factor(nmds_df$depth)

# Use ggplot2 to create an NMDS
ggplot(nmds_df, aes(x = NMDS1, y = NMDS2,
                    color = location,
                    shape = depth)) +
  geom_point(size = 3, alpha = 0.85) +
  scale_color_brewer(palette = "Dark2",type="seq")+
  stat_ellipse(aes(group = interaction(location, depth),
                   color = location),
               level = 0.95,
               linewidth = 0.8,
               linetype = 2) +
  theme_bw(base_size = 14) +
  labs(title = "NMDS (Bray-Curtis)",
       subtitle = "Simulated microbial community",
       color = "Location",
       shape = "Depth (m)") +
  theme(panel.grid = element_blank())

Intro to Data Analysis and Statistics in R

Alex Threlkeld and Alma Parada

2026-03-03