V. Statistical Analysis

Author

Arvind Sharma

Regression Analysis

Setup

Clean the environment

A basic setup routine in R that includes clearing the memory / removing datasets or objects, and setting up an environment to start fresh. This can be useful at the beginning of an analysis or a new session to ensure that no old variables or data interfere with your work.

# Clear memory
rm(list = ls())

# Optionally, remove specific objects individually
# rm(data1, model1)



# Clear all graphics devices
graphics.off()



# Detach packages (replace with actual package names)
# ?detach
# require(splines) # package
# detach(package:splines)



# Clear the console
cat("\014")
  • The default working directory in R is not necessarily the directory where your script is located.
# Set working directory 
getwd() 

setwd("/Users/arvindsharma/Library/CloudStorage/Dropbox/WCAS/bootcamp/R/V. Statistical Analysis")  # adjust to your path, or comment it out (as we use base R data here)
  • Your R session is now clean and ready to go!

Load the packages

Again, there are multiple ways to do this. I am assuming these basic packages are installed already and you just have to load them.

A good practice to list the packages in the order you use them, and put a comment on what it does.

# Load essential packages

packages <- c(
  "conflicted",    # Resolves function name conflicts between packages
  "knitr",         # For dynamic report generation and R Markdown
  "stargazer",     # Formats regression tables and summary statistics
  "visdat",        # Visualizes missing data and data structures
  "tidyverse",     # Collection of packages for data manipulation and analysis
  "ggplot2",       # Creates customizable data visualizations
  "gridExtra",     # Arranges multiple plots on one page
  "car",           # Provides regression diagnostics and statistical tools
  "AER"            # Applied Econometrics with R
)


?library
for (pkg in packages) {
  library(package = pkg,
          character.only = TRUE # whether package can be assumed to be character strings.
          )
}

If you do not have any of the package, you may want to install them, or use the code chunk below to install and load them.

# Install and load packages
for (pkg in packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}
Function Conflict

In R, it’s common to use multiple packages in a single project. However, sometimes different packages contain functions with the same name, leading to conflicts. For example, both the dplyr and plyr packages have a summarize() function, which could cause issues if both packages are loaded simultaneously.

How to Manage Function Conflicts

  1. Namespace Specification: The most direct way to manage function conflicts is by specifying the package namespace explicitly when calling a function. This avoids ambiguity and ensures you’re using the correct function from the desired package.

    result <- dplyr::summarize(mtcars,
                               mean_value = mean(mpg)
                               )
  2. conflicted Package: The conflicted package is a helpful tool for managing function conflicts. When a conflict occurs, this package forces you to choose which function to use, preventing accidental usage of the wrong one.

    # install.packages("conflicted")
    # Load the conflicted package
    library(conflicted)
    
    # Set preference for dplyr's summarize function
    conflict_prefer(name = "summarize",
                    winner = "dplyr"
                    )
    [conflicted] Will prefer dplyr::summarize over any other package.
    # Now, you can use summarize without specifying the namespace
    result <- summarize(mtcars, 
                        mean_value = mean(mpg)
                        )
  3. Load Order: The order in which packages are loaded can influence which function is used by default. If two functions have the same name, R will use the function from the package that was loaded last. However, relying on load order can be error-prone, so it’s generally better to use namespace specification or the conflicted package.

By managing function conflicts carefully, you can avoid unexpected behavior and ensure that your code runs smoothly, even when using multiple packages with overlapping functionality.

Alternatively -

?lapply
# Load each package individually using lapply
lapply(X = packages, 
       FUN = library, 
       character.only = TRUE
       )

Load the data

Can come from multiple ways.

  • APIs. Example FRED API for macroeconomic data.

  • Flat file like .csv or .xlsx.

  • Base R packages like utils::data() or user written R packages like AER (data(package = "AER")). Run the following command to explore data sets in R - data(package = .packages(all.available = TRUE)).

utils::data()
data(package = "AER")

data(package = .packages(all.available = TRUE))

Data

We will use a simple dataset mtcars from base R for now.

?require
require(utils) # designed for use inside other functions - it returns FALSE and gives a warning (rather than an error as library() does by default) if the package does not exist. 

utils::data()

df <- mtcars

To read up on the variable definitions, lets open up mtcars help files-

?mtcars

Motor Trend Car Road Tests

Description

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Format

A data frame with 32 observations on 11 (numeric) variables.

[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors

Summary Statistics

Check for missing values.

vis_dat(df)

Before we run regressions, always a good idea to do summary statistics (helps understand the variables better as well).

With kable -

# Generate descriptive statistics using psych::describe
summary_stats <- psych::describe(mtcars)

# Create a nicely formatted table using kable
kable(x = summary_stats, 
      caption = "Descriptive Statistics for mtcars Dataset", 
      digits = 2 )
Descriptive Statistics for mtcars Dataset
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61 -0.37 1.07
cyl 2 32 6.19 1.79 6.00 6.23 2.97 4.00 8.00 4.00 -0.17 -1.76 0.32
disp 3 32 230.72 123.94 196.30 222.52 140.48 71.10 472.00 400.90 0.38 -1.21 21.91
hp 4 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73 -0.14 12.12
drat 5 32 3.60 0.53 3.70 3.58 0.70 2.76 4.93 2.17 0.27 -0.71 0.09
wt 6 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42 -0.02 0.17
qsec 7 32 17.85 1.79 17.71 17.83 1.42 14.50 22.90 8.40 0.37 0.34 0.32
vs 8 32 0.44 0.50 0.00 0.42 0.00 0.00 1.00 1.00 0.24 -2.00 0.09
am 9 32 0.41 0.50 0.00 0.38 0.00 0.00 1.00 1.00 0.36 -1.92 0.09
gear 10 32 3.69 0.74 4.00 3.62 1.48 3.00 5.00 2.00 0.53 -1.07 0.13
carb 11 32 2.81 1.62 2.00 2.65 1.48 1.00 8.00 7.00 1.05 1.26 0.29

With stargazer -


Summary Statistics for mtcars Dataset
===============================================
Statistic       N   Mean  St. Dev.  Min   Max  
-----------------------------------------------
MPG             32 20.09    6.03   10.40 33.90 
Cylinders       32  6.19    1.79     4     8   
Disp.           32 230.72  123.94  71.10 472.00
HP              32 146.69  68.56    52    335  
Rear Axle Ratio 32  3.60    0.53   2.76   4.93 
Weight          32  3.22    0.98   1.51   5.42 
Q-Mile Time     32 17.85    1.79   14.50 22.90 
V/S             32  0.44    0.50     0     1   
Transmission    32  0.41    0.50     0     1   
Gears           32  3.69    0.74     3     5   
Carbs           32  2.81    1.62     1     8   
-----------------------------------------------

Simple Linear Regression

Estimating Equation

Lets write out the estimating equation before we run the regression -

\[ mpg_i = \beta_0 + \beta_1 hp_i + \epsilon_i \]

Now, lets implement it in R -

?lm
# Simple Linear Regression: mpg vs hp
lm(data = df, formula = mpg ~ hp)

Call:
lm(formula = mpg ~ hp, data = df)

Coefficients:
(Intercept)           hp  
   30.09886     -0.06823  

Usually, we store the lm output as an object in R -

?lm
# Simple Linear Regression: mpg vs hp
model_bivariate <- lm(data = df, 
                   formula = mpg ~ hp
                      )

We saved the output of the model into the object model_bivariate. Lets print it out and explore the output.

summary(object = model_bivariate)

Call:
lm(formula = mpg ~ hp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.7121 -2.1122 -0.8854  1.5819  8.2360 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
hp          -0.06823    0.01012  -6.742 1.79e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.863 on 30 degrees of freedom
Multiple R-squared:  0.6024,    Adjusted R-squared:  0.5892 
F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
  • We can call upon specific parts of the output as well if we want.
model_bivariate$coefficients
(Intercept)          hp 
30.09886054 -0.06822828 

Plot the regression line

Base R

The three lines fit on top of each other, which is why you only see the last line.

plot(x = mtcars$hp, 
     y = mtcars$mpg, 
     main = "Simple Linear Regression: mpg vs hp",
     xlab = "Horsepower (hp)",
     ylab = "Miles per Gallon (mpg)"
     )

?abline # function adds one or more straight lines through the current plot.

# abline(model_bivariate, col = "red")
abline(a = 30.09886,  # intercept
       b = -0.06823,  # slope
       col = "red"
       )

abline(model_bivariate,
       col = "blue")


# abline(model_bivariate, col = "red")
abline(a = mean(mtcars$mpg) - model_bivariate$coefficients[2] * mean(mtcars$hp) , 
       b = cov(mtcars$mpg, mtcars$hp) / var(mtcars$hp),
       col = "green"
       )

You can confirm your logic/intuition of formulas.

\[ \beta_1 = \dfrac{Cov(y,x)}{Var(x)} \]

cov(y = df$mpg, x = df$hp) / var(x = df$hp) # slope parameter
[1] -0.06822828

We can average and rearrange the original estimating equation to solve for \(\beta_0\) if we know \(\beta_1\) -

\[\begin{eqnarray} y_i &=& \beta_0 + \beta_1 * x_i + \epsilon_i \\ => \bar y &=& \beta_0 + \beta_1 * \bar x \\ => \beta_0 &=& \bar y- \beta_1 * \bar x \end{eqnarray}\]

How to type out multiple equations and aligning them under =.

mean(mtcars$mpg) - model_bivariate$coefficients[2] * mean(mtcars$hp) # intercept parameter
      hp 
30.09886 

ggplot

?geom_smooth

# Create a scatter plot with a regression line
ggplot(data = mtcars, 
       mapping = aes(x = hp, y = mpg)) +
  geom_point() +  # Add scatter plot points
  geom_smooth(method = "lm", 
              col = "red") +  # Add a linear regression line
  labs(
    title = "Simple Linear Regression: mpg vs hp",
    x = "Horsepower (hp)",
    y = "Miles per Gallon (mpg)"
  ) +
  theme_minimal()  # Use a minimal theme for cleaner appearance
`geom_smooth()` using formula = 'y ~ x'

Explore the model output

  • List of 12 objects (see Environment tab)
model_bivariate

Call:
lm(formula = mpg ~ hp, data = df)

Coefficients:
(Intercept)           hp  
   30.09886     -0.06823  
  • Coefficients
model_bivariate$coefficients
(Intercept)          hp 
30.09886054 -0.06822828 
  • Residuals - should add to 0.
head(model_bivariate$residuals)
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
       -1.5937500        -1.5937500        -0.9536307        -1.1937500 
Hornet Sportabout           Valiant 
        0.5410881        -4.8348913 
sum(model_bivariate$residuals) # should add to zero
[1] 1.487699e-14
round(x = sum(model_bivariate$residuals),
      digits = 13
      )
[1] 0

All kinds of model fit metrics can be calculated from residuals.

# Calculate Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)
mae  <- mean(abs  (model_bivariate$residuals)   )
rmse <- sqrt(mean(model_bivariate$residuals^2  ))

# Print performance metrics
print(paste("Mean Absolute Error (MAE):",      round(mae,  2)))
[1] "Mean Absolute Error (MAE): 2.91"
print(paste("Root Mean Squared Error (RMSE):", round(rmse, 2)))
[1] "Root Mean Squared Error (RMSE): 3.74"
  • Predicted values - points on the best fit line in a new data.
  • Fitted values - points on the best fit line in the data.

When using a regression model to make predictions on new data, the predicted values represent the points on the best fit line (or plane, in the case of multiple regression) corresponding to the new observations. To ensure accurate predictions, the new data must have variables with the same names as those used to fit the model.

Example 1: Split the original data into testing and training, implement model on training data and predict on testing data

# Set seed for reproducibility
set.seed(123)
?sample

# Define proportion for training data (80% for training, 20% for testing)
train_proportion <- 0.8

# Number of observations in the dataset
num_observations <- nrow(mtcars)

?seq_len
# Create random indices for training set
train_indices <- sample(x    = seq_len(num_observations),  # desired length of the sequence
                        size = floor(train_proportion * num_observations) # floor ensure we have a whole number
                        )


# Split the data into training and testing sets
mtcars_train <- mtcars[ train_indices, ]
mtcars_test  <- mtcars[-train_indices, ]

model_bivariate_train <-
lm(data = mtcars_train,
   formula = mpg ~ hp)


head(predict(object = model_bivariate_train, 
            newdata = mtcars_test)
     )
      Mazda RX4 Wag             Valiant          Merc 450SE          Merc 450SL 
           23.15635            23.49597            18.40169            18.40169 
Lincoln Continental       Toyota Corona 
           16.02436            24.03936 

Multivariate Regression

Lets type our multivariate regression first.

\[ mpg_i = \beta_0 + \beta_1 hp_i + \beta_2 wt_i + \epsilon_i \]

  • Implementation in R is the same, just add more independent variables.
model_mult1 <-
lm(data = df, 
   formula = mpg ~ hp + wt
   )
column.labels = c("Diff-in-Diff")

Quick way to create higher order terms

  • Suppose you think mpg is quadratically related with wt. You do not have to create a new variable and add it in your df data frame, but can simply ask R to create one on the spot with the I() function.
I() function
model_mult2 <-
lm(data = df, 
   formula = mpg ~ hp + wt + I(wt^2)
   )

model_mult2

Call:
lm(formula = mpg ~ hp + wt + I(wt^2), data = df)

Coefficients:
(Intercept)           hp           wt      I(wt^2)  
   47.83728     -0.02728    -10.82217      0.98181  
  • To generate variables “on the go” in R, especially when you want to create polynomial or interaction terms that do not exist as separate variables in your data frame, you can use the I() function within formulas. The I() function allows you to include mathematical expressions, such as squares or products of variables, directly within model formulas or when creating new variables.

  • If you think mpg is more non-linearly related with wt (in specifically is too many powers), you can try the poly function as well.

# Fit a polynomial regression model (quadratic)
model_mult2p <- lm(data = mtcars, 
                  formula = mpg ~ hp + poly(x = wt, degree = 2, raw = TRUE), 
                  )

# Fit a orthogonal polynomial regression model (quadratic)
model_mult2po <- lm(data = mtcars, 
                  formula = mpg ~ hp + poly(x = wt, degree = 2), 
                  )
Warning

Make sure the variables you want to add are being generated !

lm(data = df, 
   formula = mpg ~ hp + wt + wt^2
   ) # does not work !!!

Call:
lm(formula = mpg ~ hp + wt + wt^2, data = df)

Coefficients:
(Intercept)           hp           wt  
   37.22727     -0.03177     -3.87783  
  • Print out the models side by side with stargazer package.
stargazer(model_mult2, model_mult2p, model_mult2po,
          type = "text"
          )

======================================================================
                                            Dependent variable:       
                                      --------------------------------
                                                    mpg               
                                         (1)        (2)        (3)    
----------------------------------------------------------------------
hp                                    -0.027***  -0.027***  -0.027*** 
                                       (0.008)    (0.008)    (0.008)  
                                                                      
wt                                    -10.822***                      
                                       (2.281)                        
                                                                      
I(wt2)                                 0.982***                       
                                       (0.313)                        
                                                                      
poly(x = wt, degree = 2, raw = TRUE)1            -10.822***           
                                                  (2.281)             
                                                                      
poly(x = wt, degree = 2, raw = TRUE)2             0.982***            
                                                  (0.313)             
                                                                      
poly(x = wt, degree = 2)1                                   -22.255***
                                                             (3.039)  
                                                                      
poly(x = wt, degree = 2)2                                    7.240*** 
                                                             (2.307)  
                                                                      
Constant                              47.837***  47.837***  24.093*** 
                                       (3.659)    (3.659)    (1.245)  
                                                                      
----------------------------------------------------------------------
Observations                              32         32         32    
R2                                      0.872      0.872      0.872   
Adjusted R2                             0.858      0.858      0.858   
Residual Std. Error (df = 28)           2.270      2.270      2.270   
F Statistic (df = 3; 28)              63.503***  63.503***  63.503*** 
======================================================================
Note:                                      *p<0.1; **p<0.05; ***p<0.01
  • Can control lots of settings.
stargazer(model_mult2, model_mult2p, model_mult2po,
          type = "text", 
          title = "My regression table",
          omit.stat = c("n"),
          add.lines = list(c("F-Statistics", 
                             round(summary(model_mult2)$fstatistic[1], digits = 1), 
                            round(summary(model_mult2p)$fstatistic[1], digits = 1),
                           round(summary(model_mult2po)$fstatistic[1], digits = 1)))
)

My regression table
======================================================================
                                            Dependent variable:       
                                      --------------------------------
                                                    mpg               
                                         (1)        (2)        (3)    
----------------------------------------------------------------------
hp                                    -0.027***  -0.027***  -0.027*** 
                                       (0.008)    (0.008)    (0.008)  
                                                                      
wt                                    -10.822***                      
                                       (2.281)                        
                                                                      
I(wt2)                                 0.982***                       
                                       (0.313)                        
                                                                      
poly(x = wt, degree = 2, raw = TRUE)1            -10.822***           
                                                  (2.281)             
                                                                      
poly(x = wt, degree = 2, raw = TRUE)2             0.982***            
                                                  (0.313)             
                                                                      
poly(x = wt, degree = 2)1                                   -22.255***
                                                             (3.039)  
                                                                      
poly(x = wt, degree = 2)2                                    7.240*** 
                                                             (2.307)  
                                                                      
Constant                              47.837***  47.837***  24.093*** 
                                       (3.659)    (3.659)    (1.245)  
                                                                      
----------------------------------------------------------------------
F-Statistics                             63.5       63.5       63.5   
R2                                      0.872      0.872      0.872   
Adjusted R2                             0.858      0.858      0.858   
Residual Std. Error (df = 28)           2.270      2.270      2.270   
F Statistic (df = 3; 28)              63.503***  63.503***  63.503*** 
======================================================================
Note:                                      *p<0.1; **p<0.05; ***p<0.01

Quick way to create interaction terms

  • Short hand notation can be useful if using Diff-in-Diff model.
# only interaction term
model_mult5 <- lm(data = df, 
   formula = mpg ~ hp * wt 
   )

#interaction term and the original terms
model_mult5a <- lm(data = df, 
   formula = mpg ~ hp : wt 
   )

stargazer(model_mult5, model_mult5a, type = "text")

=================================================================
                                 Dependent variable:             
                    ---------------------------------------------
                                         mpg                     
                             (1)                    (2)          
-----------------------------------------------------------------
hp                        -0.120***                              
                           (0.025)                               
                                                                 
wt                        -8.217***                              
                           (1.270)                               
                                                                 
hp:wt                      0.028***              -0.015***       
                           (0.007)                (0.002)        
                                                                 
Constant                  49.808***              27.746***       
                           (3.605)                (1.062)        
                                                                 
-----------------------------------------------------------------
Observations                  32                     32          
R2                          0.885                  0.712         
Adjusted R2                 0.872                  0.702         
Residual Std. Error    2.153 (df = 28)        3.288 (df = 30)    
F Statistic         71.660*** (df = 3; 28) 74.136*** (df = 1; 30)
=================================================================
Note:                                 *p<0.1; **p<0.05; ***p<0.01

Declaring factor variables

  • You might have to decide if a quantitative variable needs to be treated as a continuous variable or a categorical variable !
Tip
# continous variable
model_mult7  <- lm(data = df, 
   formula = mpg ~ hp + wt + cyl
   )
# categorical variable
model_mult7f <- lm(data = df, 
   formula = mpg ~ hp + wt + as.factor(cyl)
   )

stargazer(model_mult7 , model_mult7f, type = "text")

=================================================================
                                 Dependent variable:             
                    ---------------------------------------------
                                         mpg                     
                             (1)                    (2)          
-----------------------------------------------------------------
hp                          -0.018                -0.023*        
                           (0.012)                (0.012)        
                                                                 
wt                        -3.167***              -3.181***       
                           (0.741)                (0.720)        
                                                                 
cyl                        -0.942*                               
                           (0.551)                               
                                                                 
as.factor(cyl)6                                   -3.359**       
                                                  (1.402)        
                                                                 
as.factor(cyl)8                                    -3.186        
                                                  (2.170)        
                                                                 
Constant                  38.752***              35.846***       
                           (1.787)                (2.041)        
                                                                 
-----------------------------------------------------------------
Observations                  32                     32          
R2                          0.843                  0.857         
Adjusted R2                 0.826                  0.836         
Residual Std. Error    2.512 (df = 28)        2.440 (df = 27)    
F Statistic         50.171*** (df = 3; 28) 40.525*** (df = 4; 27)
=================================================================
Note:                                 *p<0.1; **p<0.05; ***p<0.01

Residual Analysis

Residual analysis (diagnostics) is a crucial step in evaluating the performance and validity of a regression model. It helps in diagnosing potential issues like non-linearity, heteroscedasticity, and the presence of outliers, which can affect the model’s accuracy and reliability.

  • Common to check if Gauss markov assumptions hold.
plot(model_mult1)

  • par(mfrow=c(nrows, ncols)) function in R, which is used to set up a multi-panel plotting layout. This function divides the plotting area into a grid with nrows rows and ncols columns, allowing you to display multiple plots in a single plotting window.
par(mfrow=c(nrows=2, 2)) # sets up the plotting area into a 2-by-2 grid.

# Residuals vs Fitted Plot
plot(model_mult1, which = 1)

# Normal Q-Q Plot
plot(model_mult1, which = 2)

# Scale-Location Plot
plot(model_mult1, which = 3)

# Cook's Distance
plot(model_mult1, which = 4)

  • After creating the plots, it’s good practice to reset the plotting layout to the default (single plot).
# Reset to default single plot layout
par(mfrow = c(1, 1))
  • If you are using ggplot2, you can use the gridExtra package to achieve a similar layout.
# Residuals vs Fitted
p1 <- ggplot(model_mult1, aes(.fitted, .resid)) +
  geom_point() +
  ggtitle("Residuals vs Fitted") +
  xlab("Fitted values") +
  ylab("Residuals") +
  theme_minimal()

# Normal Q-Q Plot
p2 <- ggplot(model_mult1, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Normal Q-Q") +
  xlab("Theoretical Quantiles") +
  ylab("Sample Quantiles") +
  theme_minimal()

# Scale-Location Plot
p3 <- ggplot(model_mult1, aes(.fitted, sqrt(abs(.resid)))) +
  geom_point() +
  ggtitle("Scale-Location") +
  xlab("Fitted values") +
  ylab("Square Root of |Residuals|") +
  theme_minimal()

# Cook's Distance
p4 <- ggplot(model_mult1, aes(seq_along(.cooksd), .cooksd)) +
  geom_point() +
  ggtitle("Cook's Distance") +
  xlab("Index") +
  ylab("Cook's Distance") +
  theme_minimal()

# Combine plots using gridExtra
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

Statistical tests

  • To check for heteroscedasticity (non-constant variance) in regression models, you can run specific statistical tests as well if you do not like visually inspecting the residuals.
# VIF to check for multicollinearity
car::vif(model_mult1)
      hp       wt 
1.766625 1.766625 

Appendix

1. Writing Math in Rmarkdown

2. Quoting usage of R.

To cite R in publications use:

  R Core Team (2025). _R: A Language and Environment for Statistical
  Computing_. R Foundation for Statistical Computing, Vienna, Austria.
  <https://www.R-project.org/>.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2025},
    url = {https://www.R-project.org/},
  }

We have invested a lot of time and effort in creating R, please cite it
when using it for data analysis. See also 'citation("pkgname")' for
citing R packages.

3. Quoting usage of certain packages in R.

  • To generate citations for multiple R packages using citation(), you can’t directly pass a vector of package names. Instead, you’ll need to loop through each package name and call citation() for each one.
# Loop through each package and print its citation
for (pkg in packages) {
  print(citation(pkg))
}
To cite package 'conflicted' in publications use:

  Wickham H (2023). _conflicted: An Alternative Conflict Resolution
  Strategy_. doi:10.32614/CRAN.package.conflicted
  <https://doi.org/10.32614/CRAN.package.conflicted>, R package version
  1.2.0, <https://CRAN.R-project.org/package=conflicted>.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {conflicted: An Alternative Conflict Resolution Strategy},
    author = {Hadley Wickham},
    year = {2023},
    note = {R package version 1.2.0},
    url = {https://CRAN.R-project.org/package=conflicted},
    doi = {10.32614/CRAN.package.conflicted},
  }
To cite package 'knitr' in publications use:

  Xie Y (2025). _knitr: A General-Purpose Package for Dynamic Report
  Generation in R_. R package version 1.50, <https://yihui.org/knitr/>.

  Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition.
  Chapman and Hall/CRC. ISBN 978-1498716963

  Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible
  Research in R. In Victoria Stodden, Friedrich Leisch and Roger D.
  Peng, editors, Implementing Reproducible Computational Research.
  Chapman and Hall/CRC. ISBN 978-1466561595

To see these entries in BibTeX format, use 'print(<citation>,
bibtex=TRUE)', 'toBibtex(.)', or set
'options(citation.bibtex.max=999)'.
Please cite stargazer in publications as:

  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and
  Summary Statistics Tables. R package version 5.2.3.
  https://CRAN.R-project.org/package=stargazer

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {stargazer: Well-Formatted Regression and Summary Statistics Tables},
    author = {Marek Hlavac},
    year = {2022},
    note = {R package version 5.2.3},
    organization = {Social Policy Institute},
    address = {Bratislava, Slovakia},
    url = {https://CRAN.R-project.org/package=stargazer},
  }
To cite package 'visdat' in publications use:

  Tierney N (2017). "visdat: Visualising Whole Data Frames." _JOSS_,
  *2*(16), 355. doi:10.21105/joss.00355
  <https://doi.org/10.21105/joss.00355>,
  <http://dx.doi.org/10.21105/joss.00355>.

A BibTeX entry for LaTeX users is

  @Article{,
    title = {visdat: Visualising Whole Data Frames},
    author = {Nicholas Tierney},
    doi = {10.21105/joss.00355},
    url = {http://dx.doi.org/10.21105/joss.00355},
    year = {2017},
    publisher = {Journal of Open Source Software},
    volume = {2},
    number = {16},
    pages = {355},
    journal = {JOSS},
  }
To cite package 'tidyverse' in publications use:

  Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
  Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller
  E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V,
  Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to
  the tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
  doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.

A BibTeX entry for LaTeX users is

  @Article{,
    title = {Welcome to the {tidyverse}},
    author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
    year = {2019},
    journal = {Journal of Open Source Software},
    volume = {4},
    number = {43},
    pages = {1686},
    doi = {10.21105/joss.01686},
  }
To cite ggplot2 in publications, please use

  H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
  Springer-Verlag New York, 2016.

A BibTeX entry for LaTeX users is

  @Book{,
    author = {Hadley Wickham},
    title = {ggplot2: Elegant Graphics for Data Analysis},
    publisher = {Springer-Verlag New York},
    year = {2016},
    isbn = {978-3-319-24277-4},
    url = {https://ggplot2.tidyverse.org},
  }
To cite package 'gridExtra' in publications use:

  Auguie B (2017). _gridExtra: Miscellaneous Functions for "Grid"
  Graphics_. doi:10.32614/CRAN.package.gridExtra
  <https://doi.org/10.32614/CRAN.package.gridExtra>, R package version
  2.3, <https://CRAN.R-project.org/package=gridExtra>.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {gridExtra: Miscellaneous Functions for "Grid" Graphics},
    author = {Baptiste Auguie},
    year = {2017},
    note = {R package version 2.3},
    url = {https://CRAN.R-project.org/package=gridExtra},
    doi = {10.32614/CRAN.package.gridExtra},
  }
To cite the car package in publications use:

  Fox J, Weisberg S (2019). _An R Companion to Applied Regression_,
  Third edition. Sage, Thousand Oaks CA.
  <https://www.john-fox.ca/Companion/>.

A BibTeX entry for LaTeX users is

  @Book{,
    title = {An {R} Companion to Applied Regression},
    edition = {Third},
    author = {John Fox and Sanford Weisberg},
    year = {2019},
    publisher = {Sage},
    address = {Thousand Oaks {CA}},
    url = {https://www.john-fox.ca/Companion/},
  }
To cite AER, please use:

  Kleiber C, Zeileis A (2008). _Applied Econometrics with R_.
  Springer-Verlag, New York. doi:10.1007/978-0-387-77318-6
  <https://doi.org/10.1007/978-0-387-77318-6>,
  <https://CRAN.R-project.org/package=AER>.

A BibTeX entry for LaTeX users is

  @Book{,
    title = {Applied Econometrics with {R}},
    author = {Christian Kleiber and Achim Zeileis},
    year = {2008},
    publisher = {Springer-Verlag},
    address = {New York},
    doi = {10.1007/978-0-387-77318-6},
    url = {https://CRAN.R-project.org/package=AER},
  }