II. Basics of R Programming

Author

Arvind Sharma

Getting Started with R

  • R is a versatile language used for statistical computing and graphics.

  • R Environment:

    • RStudio: Integrated Development Environment (IDE) for R.
    • R Console: Where you can directly execute R commands.
    • R Script Editor: Write and save R scripts (.R files).
    • R Markdown / Quarto: Create dynamic documents (.Rmd, .qmd).
  • Running Code:

    • Type code in the Console and press Enter, or write it in the Script Editor and press Ctrl+Enter (Cmd+Return on macOS) to run the current line.
    • We can also run code from within a code chunk in a .qmd or .Rmd file.

Basic Syntax in R

Note
  • R can be used as a calculator, and commands end with a newline or a semicolon.
# Common usage - commands separated by new line
2**1 
2**3            # 2nd command is on a new line

# Alternative usage - commands separated by semicolon 
2**2 ; 2**3    # both commands are on the same line

Commenting on Code

  • Creating Single-Line Comments: Created using the # symbol. Anything following the # on that line is treated as a comment and is ignored by the R interpreter.
    • Often used to explain code.
# This is a single line comment in R

2 * 2  # multiplication
[1] 4
  • Multiple Single-Line Comments: The most common way to create multi-line comments in R is to use multiple single-line comments, each starting with the # symbol.
    • Often used to stop R from interpreting certain parts of the code.
    • Code -> Comment/Uncomment Lines is a shortcut if you prefer.
# This is a multi-line comment
# that spans across several lines.
# Each line starts with the '#' symbol.
  • Purpose of Comments: Commenting your code helps in increasing its readability and maintainability. Comments provide context and explanations for various parts of the code (self-documentation), which is helpful for both yourself and others who may read or work with your code in the future.

Best Practices for Commenting:

  • Be Descriptive: Explain what the code is doing and why, not just what it does.

  • Keep Comments Up-to-Date: Update comments when the code changes to avoid inaccuracies.

  • Avoid Redundant Comments: Skip obvious or trivial details that the code already clearly expresses.
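
As a small illustration of these practices, the sketch below contrasts a redundant comment with a descriptive one. The variable names (price, tax_rate) and the 8% rate are made up for this example.

```r
# Hypothetical values, chosen only for illustration
price    <- 100
tax_rate <- 0.08

# Redundant: the comment only repeats what the code already says
total <- price * (1 + tax_rate)   # multiply price by 1 plus tax_rate

# Descriptive: the comment explains why the calculation is done
# Add sales tax so the total matches what the customer actually pays
total <- price * (1 + tax_rate)
total
```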

Let's comment on some basic arithmetic operations in R.

Looking ahead - common type of operators in R (Datacamp R for Data Science Cheatsheet).

Math Operations in R

Basic math operations are fundamental arithmetic operations that you can perform directly on numeric values. They include:

5 + 3               # Addition
[1] 8
10 - 4              # Subtraction
[1] 6
6 * 7               # Multiplication
[1] 42
8 / 2               # Division
[1] 4
2^3                 # Exponentiation
[1] 8
7 %% 3              # Modulus: 7 %% 3 = 1 (remainder of division; useful for odd/even checks)
[1] 1
7 %/% 3             # Integer Division: 7 %/% 3 = 2 (Quotient without remainder)
[1] 2
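
One common use of the modulus operator is testing whether a number is odd or even. The sketch below uses a made-up value n, and also shows that the quotient and remainder together rebuild the original number.

```r
n <- 7                      # hypothetical value to test

n %% 2 == 0                 # FALSE: 7 is not even
n %% 2 == 1                 # TRUE: 7 is odd

# Integer division and modulus fit together:
quotient  <- 7 %/% 3        # 2
remainder <- 7 %% 3         # 1
quotient * 3 + remainder    # 7, the original number
```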

Mathematical Functions

Mathematical functions in R are built-in functions that perform more complex calculations or transformations. These functions are designed to handle specific mathematical operations and often involve more complex logic than basic arithmetic operations. Some common mathematical functions include:

sqrt(16)           # Square Root: sqrt(16) = 4
[1] 4
abs(-5)            # Absolute Value: abs(-5) = 5
[1] 5
log(10)            # Logarithm (Natural Log): log(10) = 2.302585
[1] 2.302585
log10(100)         # Logarithm with base 10: log10(100) = 2
[1] 2
exp(2)             # Exponential: exp(2) = 7.389056
[1] 7.389056
sin(pi / 2)        # Sine: sin(pi / 2) = 1
[1] 1
cos(0)             # Cosine: cos(0) = 1
[1] 1
tan(pi / 4)        # Tangent: tan(pi / 4) = 1
[1] 1
  • Warning

    In R Markdown, if a code chunk contains an error, the document will not knit (compile) until the error is fixed. Warnings, such as the one produced by log(-1) below, do not stop compilation.

    log(0) 
    [1] -Inf
    log(-1) # returns NaN with a warning, not an error
    Warning in log(-1): NaNs produced
    [1] NaN
    # document will not compile if you uncomment the line below. Try uncommenting it.  
    #10/a # 1>
    1. R will typically show you the line number/chunk lines where you have the error. You will have to troubleshoot that part for the document to compile.
      • Switching to Source helps to identify the part where the error is.
  • Mathematical operations can be performed on numbers, vectors, and other numeric data types. More on data types in a bit, as applying functions to the wrong data type is one important reason why one gets errors.
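
To illustrate that these operations also work on vectors, here is a minimal sketch. The vector named values is made up for this example; each operation is applied to every element.

```r
values <- c(1, 4, 9, 16)   # hypothetical numeric vector

roots <- sqrt(values)      # square root of every element
roots                      # 1 2 3 4

values * 2                 # 2 8 18 32
values %% 2 == 0           # FALSE TRUE FALSE TRUE
```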

Explicitly specify the arguments

  • Explicitly specifying arguments in your code can greatly enhance readability and maintainability.
tan(pi/4) # 1>
[1] 1
?tan 
tan(x = pi/4)
[1] 1
  1. For a single-argument function you can usually guess the argument, or look it up by opening the help file with ?tan. For functions that take several arguments, the role of each one is often not obvious.
?seq
seq(0,10,1)                    # less readable
 [1]  0  1  2  3  4  5  6  7  8  9 10
seq(from = 0, to = 10, by = 1) # more readable
 [1]  0  1  2  3  4  5  6  7  8  9 10
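
A further benefit of naming arguments, sketched below, is that their order no longer matters. The specific numbers are arbitrary.

```r
# Named arguments can be given in any order
seq(by = 2, to = 10, from = 0)       # 0 2 4 6 8 10
seq(from = 0, to = 10, by = 2)       # same result

# Positional arguments must follow the order in the help file
round(3.14159, 2)                    # 3.14
round(digits = 2, x = 3.14159)       # 3.14, order does not matter when named
```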
Getting Help - 3 common ways
  1. In R, you can perform help searches both globally and locally to find information about functions, packages, or keywords.

    • Local help refers to retrieving documentation related to specific functions, objects, or packages that you are currently working with in your R environment.

    • Global help refers to searching across all available R documentation for topics, keywords, or functions related to a specific subject. This is useful when you are looking for information on a broader topic or when you don’t know the exact function name.

      # local help
      ?tanh
      help(tanh)
      
      # global help
      ??tanh
      help.search("tanh") # to find help topics related to specific keywords.
      • You do not leave the RStudio interface.
  2. Google Search: Use Google to search for specific examples or solutions related to your problem by querying: "R - 'what you are trying to do'". This approach helps you find blogs, articles, and tutorials that provide relevant examples and explanations.

    • Example Search: If you are trying to understand how to perform linear regression in R, you might search for: “R - linear regression example”.
      • You have to leave the RStudio interface.
  3. AI Tools: Utilize AI tools like ChatGPT, Bard, or other similar platforms to get tailored advice and examples. These tools can provide quick explanations, code snippets, and solutions based on your specific queries.

    • Example: If you need help with a specific R function or concept, you can ask: `“How do I use the ggplot2 package to create a scatter plot in R?”` and receive a detailed response or code snippet.

      • Can be combined within the RStudio interface.

By combining these three methods, you can effectively find and adapt the code examples you need to complete your tasks in R.
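
Besides ? and ??, a few other base R helpers can be used from the console. The sketch below is one possible workflow, not an exhaustive list.

```r
# Show the arguments of a function without opening the help page
args(sd)

# The same information as a character vector
arg_names <- names(formals(sd))
arg_names                      # "x" "na.rm"

# Find objects whose names start with a pattern
apropos("^median")

# Run the examples from a help page (uncomment to try it)
# example(mean)
```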

Aligning Code

Caution

As much as possible, align your code and explicitly specify function arguments.

  • This practice enhances code readability and maintainability.

When reading code written by others, especially as a beginner, look for and follow these practices to improve your understanding and integration with the code.

See example below. Which is the easiest to understand and maintain?

seq(0,10,1) # hard to understand but concise
 [1]  0  1  2  3  4  5  6  7  8  9 10
seq(from = 0, to = 10, by = 1) # easier to understand but not optimised for maintenance
 [1]  0  1  2  3  4  5  6  7  8  9 10
# easiest to interpret and edit 
seq(from = 0 , 
    to   = 10, 
    by   = 1
    )
 [1]  0  1  2  3  4  5  6  7  8  9 10

Each method has its pros and cons though.

  1. Basic Usage: The seq(0, 10, 1) format is concise but may not immediately convey the role of each argument, especially for someone new to the function.

  2. Named Arguments: Using seq(from = 0, to = 10, by = 1) makes it clear what each argument represents (from, to, by), improving readability.

  3. Formatted for Readability: Placing each argument on a new line (seq(from = 0, to = 10, by = 1)) makes it very clear and easy to edit, especially for more complex sequences with additional parameters.

    • This format is beneficial for maintaining code clarity and avoiding errors, especially for beginners.

Using variables and the Assignment Operator

  • Variables are used to store data in R. Use the <- operator for assignment, though = is also acceptable.
    • Objects should appear in your Environment tab upon successful assignment, which you can examine/print.
x <- 10      # Using <- for assignment, assigns 10 to variable x

y = 5        # Using = for assignment

RStudio Base R Cheat Sheet.

Common Mistakes:

  1. Unassigned Variables: Trying to use a variable in a function that has not been assigned a value will result in an error.
remove(list=ls()) # 1>

tryCatch( # 2>
  {
    # Attempt to print a non-existent object
    print(x)
  },
  error = function(e) {
    # Handle the error
    print("Error: The object does not exist.")
  }
)
[1] "Error: The object does not exist."
  1. Removed all variables from the environment.
  2. tryCatch is used in R to handle errors gracefully by allowing you to specify actions to take if an error occurs, ensuring the code continues running instead of stopping abruptly. It can help prevent the entire script or document from failing due to a single error.
x <- 24   # define x

print(x)  # Explicitly prints the value of x to the console
[1] 24
  • Now, print(x) does not give an error.
x  # Simply typing the variable name in the console or script will display its value (implicitly prints the value of x). This method is often used interactively in the console or within scripts for quick checks.
[1] 24
  2. Overwriting Values: Reassigning a value to an existing variable silently overwrites the previous value, which can cause errors later if you still expect the old value.
  • In lengthy code, it’s easy to accidentally overwrite variables.
  • Always verify the type and contents of your objects to avoid unintended issues.
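
The sketch below shows how an accidental overwrite can change an object's type without any warning, and how to check for it. The variable name score is made up for this example.

```r
score <- 24               # hypothetical variable holding a number
class(score)              # "numeric"

score <- "twenty-four"    # overwrites the old value silently
class(score)              # "character"

# Quick checks before using an object
exists("score")           # TRUE: the object is defined
is.numeric(score)         # FALSE: arithmetic on it would now fail
```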

R is Object-Oriented

R is an object-oriented language, which means:

  • Objects: In R, data and functions are treated as objects. Everything you work with in R, such as vectors, lists, and data frames, is an object.

  • Classes: Objects belong to classes, which define their structure and behavior. For example, a data frame is a class of objects with specific methods for handling tabular data.

  • Methods: R supports method dispatch, meaning that the methods (functions) that operate on objects can vary depending on the object’s class. This allows for more flexible and powerful data manipulation.

  • Inheritance: R uses a system of inheritance, where objects can inherit properties and methods from other classes. This enables the creation of complex data structures built on simpler ones.

Understanding R’s object-oriented nature helps in designing more efficient and modular code, making it easier to work with complex data and perform sophisticated analyses.
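
As a rough illustration of classes and method dispatch, the sketch below uses S3, one of the object systems available in R. The generic describe(), the class "student", and the name "Asha" are all hypothetical and exist only for this example.

```r
# Built-in generic: summary() behaves differently depending on the class
summary(c(1, 2, 3, 4))                             # quartiles for a numeric vector
fac_summary <- summary(factor(c("a", "b", "a")))   # counts per level for a factor

# A hypothetical generic with two methods
describe <- function(obj, ...) UseMethod("describe")
describe.default <- function(obj, ...) "a generic object"
describe.student <- function(obj, ...) paste("student named", obj$name)

# An object becomes a "student" by setting its class attribute
s <- structure(list(name = "Asha"), class = "student")

describe(s)    # "student named Asha"  (uses describe.student)
describe(42)   # "a generic object"    (falls back to describe.default)
```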

Writing and using functions

  • Functions are defined using the function keyword. The basic syntax is function(arg1, arg2, ...) { ... }.

  • Let's create your first function.

my_addition <- function(x, y) {
  return(x + y)
}

result <- my_addition(10, 5)  # Calls the function with arguments 10 and 5
result
[1] 15
  • You can print a function to see how it has been written. Some functions can be very complex. Let's look at how a simple function, sd, is written -
sd # square root of variance
function (x, na.rm = FALSE) 
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
    na.rm = na.rm))
<bytecode: 0x10ebd3d08>
<environment: namespace:stats>
  • na.rm: A logical value indicating whether to remove missing values (NA) before the computation. The default is FALSE, meaning missing values are not removed.

  • R provides a wide range of built-in functions for various tasks, such as data manipulation, statistical analysis, and plotting. These functions, like mean(), lm(), and plot(), offer ready-to-use solutions for common problems and streamline the coding process. Utilizing these functions can significantly simplify and accelerate data analysis workflows.
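
To see how a default argument such as na.rm works in practice, here is a minimal sketch of a user-written function. The function my_mean and the vector scores are hypothetical; in real work you would simply use mean().

```r
# Hypothetical helper: the mean, optionally ignoring missing values
my_mean <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]   # keep only non-missing values
  }
  sum(x) / length(x)
}

scores <- c(10, 20, NA, 30)

my_mean(scores)                  # NA, because the default keeps the missing value
my_mean(scores, na.rm = TRUE)    # 20
sd(scores, na.rm = TRUE)         # 10, the built-in sd() uses the same argument
```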

Packages and Libraries

In R, packages and libraries are essential for extending the functionality of the base R environment.

  • Definition: Packages are collections of R functions, data, and documentation bundled together. They are designed to add specific capabilities or functions to R, such as data manipulation, statistical analysis, or visualization.

  • Installation: To use a package, you first need to install it. This is done using the install.packages() function.

Use install.packages("package_name") to install a package and library(package_name) to load it.

# install.packages("tidyverse")  # Install the tidyverse package
library(tidyverse)               # Load the tidyverse package
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  • I have commented out the install.packages() line because tidyverse is already installed on my machine; reinstalling takes time and is not required.
Note
  • Installation is a one-time setup process where you get all the necessary tools (functions) into your R environment.

  • Loading is done every time you start a new R session and want to use the tools you’ve previously installed.

Think of installing a package like installing a light bulb in a fixture. When you install a light bulb, you’re setting up the bulb in the fixture so that it’s ready to be used whenever you need light.

Loading a package is like turning on the light switch. Once the bulb is installed, you need to flip the switch to actually turn on the light and make use of the bulb’s illumination. Similarly, after installing a package, you need to load it in your current R session to use its functions and data.

Caution
  • Without the internet, you won’t be able to download these packages and their dependencies, just as you wouldn’t be able to get the necessary components for your light fixture without a store.
Package vs Library?
  • A package is a bundle of functions, data, and documentation. A library is the location on your system where installed packages are stored and accessed. Essentially, a package is a unit of functionality, while a library is the storage space for these packages.

  • To find your library, type .libPaths() in your R console and find the address.

.libPaths()
[1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library"
  • You can find all the installed packages in your R library.

  • Again, only loaded packages will have a tick next to them.
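
The difference between installed and loaded packages can also be checked from code. The sketch below uses the stats package, which ships with R, so nothing needs to be downloaded.

```r
# Is the package installed? (does not attach it)
requireNamespace("stats", quietly = TRUE)    # TRUE

# Call a function from an installed package without library()
stats::median(c(1, 3, 10))                   # 3

# Which packages are attached in the current session?
attached <- search()
attached
```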

Data Types and Data Structures

  • Data Type: In R, “data type” refers to the classification of variables or values, such as numeric, integer, character, etc., which determine how data is stored in memory and what operations can be performed on them.

  • Data Structure: A “data structure” refers to the way data is organized and stored in R objects, such as vectors, lists, matrices.

Data Types

In R, data types refer to the classification of variables or values that determine how they are stored in memory and what operations can be performed on them.

Here are the primary data types in R:

Numeric

Used for real numbers, which includes both integers and decimal (floating-point) values.

  • Examples: 1.5, 3, 0.25

    x <- 3
    typeof(x)
    [1] "double"
    typeof(3)
    [1] "double"
    x <- 1.5
    typeof(x)
    [1] "double"
  • R stores these as “double-precision floating-point numbers” by default, which is why typeof() returns "double" for numeric values.

    • “Double-precision floating-point numbers” are a way of representing real numbers (numbers with decimals) in computer memory using 64 bits. This format allows for a wide range of values, including very large and very small numbers, while maintaining precision.
  • Used for quantitative variables. Typically refers to continuous data that can take on any value within a range.

    • Examples include height, weight, and income.
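
One practical consequence of floating-point storage is that some decimal values cannot be represented exactly. The sketch below shows the classic example and a safer way to compare such numbers.

```r
0.1 + 0.2 == 0.3                      # FALSE, due to tiny rounding error
isTRUE(all.equal(0.1 + 0.2, 0.3))     # TRUE, compares within a small tolerance

is.numeric(3)                         # TRUE
is.numeric(3L)                        # TRUE, integers also count as numeric
```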

Integer

Used for integer values (whole numbers, including negative ones).

  • Examples: 1L, 2L, -100L

    x <- 3L
    typeof(x)
    [1] "integer"
    x <- -100L
    typeof(x)
    [1] "integer"
  • The L suffix declares the number to be an integer.

  • Used for count data (discrete or integer data that represents counts or the number of occurrences of an event).

    • Examples include the number of children in a family, the number of cars in a parking lot, or the number of students in a class. These variables are usually whole numbers and do not have decimal points.
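
The sketch below checks when a value stays an integer and when R converts it to double. The numbers are arbitrary.

```r
is.integer(3)        # FALSE, a plain 3 is stored as double
is.integer(3L)       # TRUE

typeof(3L + 2L)      # "integer"
typeof(3L / 2L)      # "double", ordinary division always returns a double
typeof(3L %/% 2L)    # "integer", integer division keeps the type
typeof(1:3)          # "integer", the colon operator creates integers
```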

Logical

Used for Boolean values to represent binary conditions or logic, often as the result of comparisons and logical operations. Boolean values are fundamental in control flow, decision-making, and logical operations within programs.

  • Example: TRUE, FALSE
keep_ob <- TRUE    # FALSE
typeof(keep_ob)
[1] "logical"
keep_ob <- T       # F
typeof(keep_ob)
[1] "logical"
  • T and F do work as shorthand for TRUE and FALSE in R. However, it’s generally recommended to use TRUE and FALSE instead of T and F to avoid potential issues. This is because T and F are just variables that are predefined in R to represent TRUE and FALSE, and they can be reassigned to something else, which could lead to unexpected behavior in your code.
tryCatch(
  {
    keep_obs <- c(False, True)  # This will cause an error
  },
  error = function(e) {
    print("Error: object 'False' not found")
  }
)
[1] "Error: object 'False' not found"
  • The code does not work because R is case-sensitive, and the correct logical values should be FALSE and TRUE, not False and True.
  • Used for comparisons, especially in selecting rows/cases that meet a certain condition (of truth or falsity).
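
As a small sketch of selecting cases with logical values, the example below filters a made-up vector of ages. It also shows, inside local() so that nothing outside is affected, why T is riskier than TRUE.

```r
ages <- c(15, 22, 37, 12)     # hypothetical data

is_adult <- ages >= 18        # FALSE TRUE TRUE FALSE
ages[is_adult]                # 22 37, keeps only elements where the condition is TRUE

sum(is_adult)                 # 2, TRUE counts as 1
mean(is_adult)                # 0.5, the share of TRUE values

# T is an ordinary variable and can be reassigned; TRUE cannot
t_check <- local({
  T <- 0
  isTRUE(T)                   # FALSE inside this block
})
t_check
```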

Character

Used for strings of text.

  • Example: "Hello", "R programming"

    address <- c("Old Road", "Undone Road", "Anton Road")
    typeof(address)
    [1] "character"
    keep_obs <- c("TRUE", "TRUE", "FALSE", "TRUE")
    typeof(keep_obs)
    [1] "character"
  • Sometimes numeric and integer data can be stored as character, and you will not be able to execute mathematical operations on them. You will have to coerce/force the data type into the correct type.

result
[1] 15
tryCatch(
  {
    a <- "10"
    b <- "5"
    my_addition(a, b)  # This will cause an error
  },
  error = function(e) {
    print("! non-numeric argument to binary operator")
  }
)
[1] "! non-numeric argument to binary operator"
  • Used for textual data. Examples include storing words (sentiment analysis), addresses and geographic data, or processing text in tasks like web scraping.
    • Strings are typically enclosed in quotation marks, either single (') or double (").
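
When numbers arrive as text, converting them first makes arithmetic possible. The sketch below uses made-up variables a_chr and b_chr, followed by a few common string helpers from base R.

```r
a_chr <- "10"    # hypothetical numbers stored as text
b_chr <- "5"

total_num <- as.numeric(a_chr) + as.numeric(b_chr)
total_num                       # 15

# A few base R functions for working with strings
nchar("R programming")          # 13 characters
toupper("hello")                # "HELLO"
paste("Old", "Road")            # "Old Road"
```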

Factor

A factor is a specific data type used to represent categorical variables.

  • Example: factor(c("low", "medium", "high"))
colors <- factor(c("red", "blue", "green", "red"))
typeof(colors)
[1] "integer"
  • For factors, typeof() will return "integer", as factors are internally stored as integers with corresponding character labels.

  • To check the class of an object, which provides more descriptive information, use the class() function.

class(colors)
[1] "factor"
  • For a factor, class() will return "factor".

  • Factors can be ordered or unordered, depending on whether the categories have a natural order (e.g., “low,” “medium,” “high”) or not (e.g., “red,” “blue,” “green”).

  • Factors are particularly useful for handling qualitative data, where the values represent categories or groups rather than numerical quantities.

    • Factors are often used in statistical modeling, where categorical variables need to be treated differently from numerical variables.
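
The sketch below shows one way to create an ordered factor. The variable sizes and its values are made up; the levels argument fixes the order of the categories.

```r
sizes <- factor(c("low", "high", "medium", "low"),
                levels  = c("low", "medium", "high"),
                ordered = TRUE)

levels(sizes)        # "low" "medium" "high"
as.integer(sizes)    # 1 3 2 1, the internal integer codes
table(sizes)         # counts per category

# Ordered factors support comparisons
sizes < "high"       # TRUE FALSE TRUE TRUE
```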

Date

Used for representing dates in R.

  • Example: as.Date("2023-01-01") creates a Date object for January 1, 2023.

    date_example <- as.Date("2023-01-01")
    typeof(date_example) 
    [1] "double"
  • The Date type allows you to perform operations related to dates, such as calculating differences between dates, extracting components (e.g., day, month, year), and formatting dates for display.

    • Dates are stored internally as the number of days since January 1, 1970, which is the origin date in R. This allows for efficient date calculations and manipulations.
  • Commonly used in data analysis for handling temporal data, such as event timestamps, transaction dates, or historical data.

    • Examples include tracking the date of sales transactions, scheduling events, or analyzing time series data.
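
A few of the date operations mentioned above can be sketched as follows. The two dates are arbitrary examples.

```r
start_date <- as.Date("2023-01-01")
end_date   <- as.Date("2023-03-01")

# Difference between two dates
end_date - start_date                  # Time difference of 59 days
days_between <- as.numeric(end_date - start_date)

# Date arithmetic and formatting
start_date + 30                        # "2023-01-31"
format(start_date, "%Y")               # "2023"
format(start_date, "%d/%m/%Y")         # "01/01/2023"

# Internal storage: days since 1970-01-01
as.numeric(start_date)                 # 19358
```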

POSIXct and POSIXlt

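Used for representing date-time values. POSIXct stores a date-time as the number of seconds since the Unix epoch (January 1, 1970, UTC), while POSIXlt stores it as a list of components (year, month, day, hour, and so on).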

  • Example: as.POSIXct("2023-01-01 12:00:00") creates a POSIXct object for January 1, 2023, at 12:00 PM.

    datetime_example_ct <- as.POSIXct("2023-01-01 12:00:00")
    typeof(datetime_example_ct) 
    [1] "double"
  • The POSIXct type stores date-time as the number of seconds since the epoch, which allows for efficient arithmetic and comparison operations involving date-time values.

    • This format is useful for handling and manipulating continuous date-time data, such as timestamps in logs or time series data.
  • Commonly used for precise time measurements and calculations, including scheduling events, analyzing temporal data, and managing timestamps.

    • Examples include recording the exact time of transactions, tracking event occurrences, or computing time intervals between events.
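
The sketch below computes a time interval with POSIXct and pulls out components with POSIXlt. The timestamps are arbitrary, and the time zone is set to "UTC" explicitly (an assumption made here so that results do not depend on the computer's local settings).

```r
t1 <- as.POSIXct("2023-01-01 12:00:00", tz = "UTC")
t2 <- as.POSIXct("2023-01-01 14:30:00", tz = "UTC")

# Interval between two timestamps
gap <- difftime(t2, t1, units = "mins")
gap                          # Time difference of 150 mins

t1 + 3600                    # adds 3600 seconds: 13:00:00 on the same day
as.numeric(t1)               # 1672574400 seconds since the epoch

# POSIXlt exposes the individual components
lt <- as.POSIXlt(t1)
lt$hour                      # 12
lt$year + 1900               # 2023, years are stored as years since 1900
```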

Complex

Used for representing complex numbers (in the form a + bi, where a is the real part and b is the coefficient of the imaginary unit i).

  • Example: 1 + 2i creates a complex number where 1 is the real part and 2i is the imaginary part.

    complex_example <- 1 + 2i
    typeof(complex_example) 
    [1] "complex"
    • This format allows for calculations involving complex numbers, such as addition, multiplication, and finding magnitudes or phases.
  • Commonly used in fields such as engineering, physics, and applied mathematics, where complex numbers are used to model phenomena involving oscillations, waves, and other periodic behaviors.

    • Examples include electrical engineering (for AC circuit analysis), quantum mechanics (for wave functions), and signal processing (for Fourier transforms).

    • Not common in Economics.
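
For completeness, a brief sketch of the base R functions for complex numbers. The value z is arbitrary.

```r
z <- 3 + 4i

Re(z)                    # 3, real part
Im(z)                    # 4, imaginary part
Mod(z)                   # 5, magnitude
Conj(z)                  # 3-4i, complex conjugate

z * (1 + 2i)             # -5+10i
sqrt(as.complex(-1))     # 0+1i
```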

Common data types - Geek for Geeks.

Why data types matter?

These data types help R manage different kinds of information and determine how operations such as arithmetic, comparisons, and transformations are performed on them. Understanding these types is crucial for effective data manipulation, analysis, and programming in R.

  1. R Data types are used to specify the kind of data that can be stored in a variable. 

  2. For effective memory consumption and precise computation, the right data type must be selected. 

  3. Each R data type has unique properties and associated operations.

    • Different forms of data that can be saved and manipulated are defined and categorized using data types in computer languages including R.

Common Coercion

Valid coercion

Numeric to Integer
  • Numeric values are coerced to integers by rounding towards zero / dropping everything after the decimal.
as.integer(1.9)
[1] 1
as.integer(-1.9)
[1] -1
  • Logical values can also be coerced to integers: TRUE becomes 1 and FALSE becomes 0.
as.integer(TRUE)  # Results in 1
[1] 1
as.integer(FALSE) # Results in 0
[1] 0

Invalid coercion

Character to Integer

Coercing a character string to an integer works only when the string contains nothing but a number. Any other string results in NA with a warning, because R cannot interpret it as a number.

as.integer("abc")  # Results in NA
Warning: NAs introduced by coercion
[1] NA
as.integer("$123") # Will have to clean your data and remove $
Warning: NAs introduced by coercion
[1] NA
as.integer("123")  # Results in 123 (valid conversion)
[1] 123
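
One way to clean strings such as "$123" before coercion is to strip the unwanted characters with gsub(). The sketch below is a minimal version using made-up values; real data may need more careful cleaning.

```r
raw_prices <- c("$123", "$1,250", "abc")       # hypothetical raw data

# Remove dollar signs and commas
cleaned <- gsub("[$,]", "", raw_prices)        # "123" "1250" "abc"

# Coerce; "abc" still cannot be converted and becomes NA
prices <- suppressWarnings(as.integer(cleaned))
prices                                         # 123 1250 NA

is.na(prices)                                  # FALSE FALSE TRUE
```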

Common Data Structures: Vectors, Lists, Matrices and Dataframes

In R, data structures are fundamental components used to store and organize data types efficiently.

Each data structure has specific characteristics that determine how data is stored in memory and what operations can be performed on it.

Understanding these structures is crucial for effectively manipulating and analyzing data in R.

  1. Vectors:

    • Atomic Vectors: These are one-dimensional arrays that can hold elements of the same data type, such as numeric, character, or logical.

      • Example: c(1, 2, 3, 4) creates a numeric vector.

        my_vector <- c(1, 2, 3, 4)
        class(my_vector) # numeric
        [1] "numeric"
        my_vector <- c("1", 2, 3, 4)
        class(my_vector) # character
        [1] "character"
    • Factors in R are a special type of vector. Factors are stored as integer vectors internally, where each integer corresponds to a level in the factor. The levels themselves are stored as character vectors. Factors are used to handle categorical data efficiently, such as groups or categories. They allow R to manage and sort categorical variables properly.

      my_factor <- colors
      
      class(my_factor) # factor
      [1] "factor"
      typeof(my_factor) # integer
      [1] "integer"
      levels(my_factor) # character
      [1] "blue"  "green" "red"  
    • Lists: Lists are also one-dimensional but can hold elements of different types or structures.

      • Example: list(1, "a", TRUE) creates a list with numeric, character, and logical elements.

      • Versatile data structures.

        my_list   <- list(1, "a", TRUE)
        class(my_list) # list
        [1] "list"
  2. Matrices:

    • Matrices are two-dimensional arrays where all elements are of the same data type (numeric, character, etc.).

      • Example: matrix(data = 1:6, nrow=2, ncol=3) creates a 2x3 matrix.

        my_matrix <- 
        matrix(data = 1:6, # shorthand for - 1 2 3 4 5 6
               nrow = 2, 
               ncol = 3
               )
        class(my_matrix)  # matrix, array
        [1] "matrix" "array" 
        typeof(my_matrix) # integer
        [1] "integer"
        my_matrix <- 
        matrix(data = c("a", 2, 3, 4, 5, 6), # one character value coerces every element to character
               nrow = 2, 
               ncol = 3
               )
        class(my_matrix)  # matrix, array
        [1] "matrix" "array" 
        typeof(my_matrix) # character
        [1] "character"
    • Arrays: Generalized versions of matrices with more than two dimensions.

      • An array in R is a data structure that holds elements in multiple dimensions. Unlike a matrix, which is two-dimensional, an array can have three or more dimensions. Arrays are used to store data in a grid-like format with dimensions that can be specified.

        • Dimensions: Arrays can have any number of dimensions. For example, a three-dimensional array can be thought of as a stack of matrices.

        • Homogeneity: All elements in an array must be of the same type (e.g., all numeric, all character).

        • Creation: Arrays are created using the array() function, where you specify the data, dimensions, and optional dimension names.

          # Create a 3x3x2 array
          data <- 1:18
          my_array <- array(data, dim = c(3, 3, 2))
          
          # Print the array
          my_array
          , , 1
          
               [,1] [,2] [,3]
          [1,]    1    4    7
          [2,]    2    5    8
          [3,]    3    6    9
          
          , , 2
          
               [,1] [,2] [,3]
          [1,]   10   13   16
          [2,]   11   14   17
          [3,]   12   15   18
          typeof(my_array) # integer
          [1] "integer"
          class(my_array)  # array
          [1] "array"

          my_array is a three-dimensional array with dimensions 3x3x2, where the array holds numbers from 1 to 18.

  3. Data Frames:

    • Data frames are two-dimensional structures similar to tables in databases or spreadsheets.

    • Columns can be of different data types (numeric, character, etc.).

    • Example: data.frame(id=c(1, 2, 3), name=c("Alice", "Bob", "Charlie")) creates a data frame with columns “id” and “name”.

      my_df <-
      data.frame(id=c(1, 2, 3),
                 name=c("Alice", "Bob", "Charlie")
                 )
      
      class(my_df)            # data.frame
      [1] "data.frame"
      typeof(class(my_df))    # "character": the type of the class label, not of the data frame (typeof(my_df) is "list")
      [1] "character"

      What are matrices and dataframes? YaRrr

  4. Data Tables (from data.table package):

    • Data tables are enhanced data frames optimized for large datasets and efficient operations.

    • Example: data.table(id=c(1, 2, 3), name=c("Alice", "Bob", "Charlie")) creates a data table with columns “id” and “name”.

Each structure offers different capabilities and efficiencies depending on the nature of the data and the tasks being performed. By leveraging the appropriate data structure, you can optimize your workflow and enhance your ability to work with data in R effectively.

Basic Operations on Common Data Structures

Vectors

Create a vector

numeric_vector <- c(1, 2, 3, 4, 5)
char_vector <- c("apple", "banana", "cherry")

Access elements

numeric_vector[1]      # Returns 1
[1] 1
char_vector[1:2]       # Returns "apple" "banana"
[1] "apple"  "banana"

Modify elements

numeric_vector[2] <- 10

Vector operations

numeric_vector + 5    # Adds 5 to each element
[1]  6 15  8  9 10
numeric_vector * 2    # Multiplies each element by 2
[1]  2 20  6  8 10
numeric_vector > 3    # Returns a logical vector
[1] FALSE  TRUE FALSE  TRUE  TRUE
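
Beyond single positions and ranges, vectors can be indexed in a few other ways. The sketch below uses a separate made-up vector v so the objects above are left unchanged.

```r
v <- c(10, 20, 30, 40, 50)    # hypothetical vector

v[c(1, 5)]       # 10 50, selected positions
v[-1]            # 20 30 40 50, negative index drops an element
v[v > 25]        # 30 40 50, logical condition selects elements
which(v > 25)    # 3 4 5, positions where the condition is TRUE

length(v)        # 5
sum(v)           # 150
```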

Matrices

Create a matrix

?matrix
my_matrix <- matrix(data = 1:9, 
                         nrow = 3, 
                         ncol = 3
                         )
my_matrix
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Access elements

my_matrix[1, 2]   # Returns 4 (row 1, column 2)
[1] 4
my_matrix[1, ]    # First row
[1] 1 4 7
my_matrix[, 2]    # Second column
[1] 4 5 6

Modify elements

my_matrix[2, 3] <- 10

Matrix operations

my_matrix + 5               # Adds 5 to each element
     [,1] [,2] [,3]
[1,]    6    9   12
[2,]    7   10   15
[3,]    8   11   14
t(my_matrix)                # transpose of a matrix
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7   10    9
solve(my_matrix)            # inverse of a matrix - exists only for square, non-singular matrices
      [,1] [,2]       [,3]
[1,] -1.25  0.5  0.4166667
[2,]  1.00 -1.0  0.3333333
[3,] -0.25  0.5 -0.2500000
my_matrix %*% my_matrix     # Matrix multiplication - requires conformable dimensions
     [,1] [,2] [,3]
[1,]   30   66  110
[2,]   42   93  154
[3,]   42   96  162
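
A frequent source of confusion is the difference between * and %*%. The sketch below uses a small made-up matrix m to contrast the two and to check that a matrix times its inverse gives the identity matrix.

```r
# matrix() fills column by column, so m has rows (2, 1) and (0, 3)
m <- matrix(c(2, 0, 1, 3), nrow = 2)

m * m                  # element-wise product: rows (4, 1) and (0, 9)
m %*% m                # matrix product:       rows (4, 5) and (0, 9)

# A matrix multiplied by its inverse gives the identity (up to rounding)
identity_check <- m %*% solve(m)
all.equal(identity_check, diag(2))    # TRUE

dim(m)                 # 2 2
```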

Lists

Create a list

my_list_example <- list(number = 1:5, 
                     character = c("apple", "banana"), 
                     matrix = my_matrix
                     )

Access elements

my_list_example$number       # Returns 1 2 3 4 5
[1] 1 2 3 4 5
my_list_example[[2]]         # Returns "apple" "banana" (second element)
[1] "apple"  "banana"

Modify elements

my_list_example$number[2] <- 10

List Operations

# Add a new element to the list
my_list_example$new_element <- c(TRUE, FALSE, TRUE)

# Remove an element from the list
my_list_example$new_element <- NULL

# Combine lists
another_list <- list(new_matrix = matrix(5:8, nrow = 2))
combined_list <- c(my_list_example, another_list)
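
The difference between single and double brackets on lists is worth a quick sketch. The list person is made up for this example.

```r
person <- list(name = "Alice", scores = c(90, 85))   # hypothetical list

class(person["scores"])      # "list", single brackets return a smaller list
class(person[["scores"]])    # "numeric", double brackets return the element itself

person[["scores"]][1]        # 90
names(person)                # "name" "scores"
length(person)               # 2

str(person)                  # compact overview of the structure
```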

Data Frames

Create a data frame

my_df_example <- data.frame(
  Column1 = c(1, 2, 3),
  Column2 = c("A", "B", "C"),
  Column3 = c(TRUE, FALSE, TRUE)
)
my_df_example
  Column1 Column2 Column3
1       1       A    TRUE
2       2       B   FALSE
3       3       C    TRUE

Access elements

my_df_example$Column1     # Returns 1 2 3
[1] 1 2 3
my_df_example[1, 2]       # Returns "A" (Element at row 1, column 2)
[1] "A"

Modify elements

my_df_example$Column1[2] <- 10
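
Beyond changing a single cell, two common modifications are adding a column and keeping only the rows that meet a condition. This is a minimal sketch on a hypothetical data frame, scores_df, whose names and values are made up for illustration.

```r
# Hypothetical data frame used only for this illustration
scores_df <- data.frame(
  Name  = c("A", "B", "C"),
  Score = c(10, 25, 40)
)

# Add a new column computed from an existing one
scores_df$Doubled <- scores_df$Score * 2

# Keep only the rows that satisfy a condition (note the comma: rows, then columns)
high_scores <- scores_df[scores_df$Score > 20, ]

nrow(high_scores)   # 2
```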

Descriptive Statistics

First, let's load a basic built-in dataset in R as an object named df.

  • More on how to import real data in later tutorials.
df <- mtcars

Check it has been imported.

  • Often you have to check if it has been correctly imported. More on that in later tutorials.
class(df)
[1] "data.frame"
colnames(df)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
nrow(df)
[1] 32

Calculating mean, median, variance, and standard deviation.

  • Function: mean() calculates the average of a numeric vector.

    • Provides a central value of the dataset.

      mean(df$mpg)  # Calculates the mean of the 'mpg' column in the dataframe 'df'
      [1] 20.09062
  • Function: median() finds the middle value in a numeric vector when the values are sorted in ascending order.

    • Offers a robust central value, less affected by outliers compared to the mean.

      median(df$mpg)  # Calculates the median of the 'mpg' column in the dataframe 'df'
      [1] 19.2
  • Function: var() measures the dispersion of a numeric vector around its mean, showing how much the values spread out.

    • Gives an indication of variability but is in squared units of the original data.

      var(df$mpg)  # Calculates the variance of the 'mpg' column in the dataframe 'df'
      [1] 36.3241
  • Function: sd() indicates the amount of variation or dispersion in a numeric vector, providing a measure of how spread out the values are around the mean.

    • Provides variability in the same units as the data, making it more interpretable.

      sd(df$mpg)  # Calculates the standard deviation of the 'mpg' column in the dataframe 'df'
      [1] 6.026948
      all.equal(sd(df$mpg), sqrt(var(df$mpg)))  # sd is the square root of the variance; all.equal() tolerates rounding error
      [1] TRUE
  • When using these functions, ensure that the data is numeric and clean to avoid errors in calculations.
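
To see what var() and sd() compute, the sample variance can be written out by hand as the sum of squared deviations from the mean divided by n - 1. The sketch below does this for the mpg column of mtcars; manual_var and manual_sd are illustrative names.

```r
mpg <- mtcars$mpg
n   <- length(mpg)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((mpg - mean(mpg))^2) / (n - 1)
manual_sd  <- sqrt(manual_var)

all.equal(manual_var, var(mpg))   # TRUE
all.equal(manual_sd, sd(mpg))     # TRUE
```

Dividing by n - 1 rather than n is what makes this the sample (not population) variance, which is the version var() reports.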

NA (Not Available)
  • Type: General purpose placeholder for any kind of missing data.

  • Usage: NA is used to represent missing values across all data types (e.g., numeric, character, logical).

  • Behavior: When you perform operations involving NA, the result is typically NA unless explicitly handled.

    x <- c(1, NA, 3)
    mean(x)  # Result: NA
    [1] NA
    mean(x = x,
         na.rm = TRUE
         )
    [1] 2
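
Before deciding how to handle missing values, it helps to find and count them. A minimal sketch using is.na(), first on the vector above and then on every column of mtcars:

```r
x <- c(1, NA, 3)

is.na(x)          # FALSE  TRUE FALSE
sum(is.na(x))     # 1 missing value
x[!is.na(x)]      # 1 3 (the non-missing values)

# Count missing values in every column of a data frame
missing_per_column <- colSums(is.na(mtcars))
missing_per_column   # all zeros: mtcars has no missing values
```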

Summary statistics for data frames

Instead of applying functions one by one to each column in a dataframe, you can either iteratively apply functions across columns (useful in simulations) or use a package to extend R’s capabilities and create summary statistics for the entire dataframe.

Iteratively apply a function on each column in a dataframe

  • When performing simulations, one often needs to generate multiple sets of results and keep them in a list for further analysis.
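  1. apply returns a vector, array, or list of values obtained by applying a function to the margins (rows or columns) of an array or matrix. A data frame is first converted to a matrix, so this works best when all columns are numeric.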
?apply 

# Calculate the standard deviation for each column in the dataframe
sd_per_column <- apply(X      = df, 
                       MARGIN = 2,  # for a matrix 1 indicates rows, 2 indicates columns
                       FUN    = sd
                       )

sd_per_column # Print the result
        mpg         cyl        disp          hp        drat          wt 
  6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574 
       qsec          vs          am        gear        carb 
  1.7869432   0.5040161   0.4989909   0.7378041   1.6152000 
  • The sapply and lapply functions in R are used for applying functions to elements of a list or dataframe. They differ in their return types and use cases (but are in the same family of commands):
  2. The lapply function applies a function to each element of a list or dataframe and returns a list (regardless of the function’s return type).
# Apply function to each element of the list/dataframe
result_list <- lapply(df, sd)

result_list # Print the result
$mpg
[1] 6.026948

$cyl
[1] 1.785922

$disp
[1] 123.9387

$hp
[1] 68.56287

$drat
[1] 0.5346787

$wt
[1] 0.9784574

$qsec
[1] 1.786943

$vs
[1] 0.5040161

$am
[1] 0.4989909

$gear
[1] 0.7378041

$carb
[1] 1.6152
  3. The sapply function applies a function to each element of a list or dataframe and attempts to simplify the result into a vector or matrix. Use it when you expect a vector or matrix as the output and want automatic simplification of the result.
# Apply function to each element of the list/dataframe
result_vector <- sapply(df, mean)

result_vector # Print the result
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 
  4. Looping - a more direct example for simulations.
  • Initialize a vector: Create a numeric vector sd_results, filled with zeros, with a length equal to the number of columns in df to store the standard deviations.

  • Loop and calculate: Iterate over each column, compute the standard deviation, and store it in the sd_results vector. Assign column names and print the results.

?numeric

# Initialize a vector to store standard deviations (contains only 0)
sd_results <- numeric(ncol(df))

head(sd_results) # check it has only 0, read up on numeric to confirm
[1] 0 0 0 0 0 0
# Loop over each column
for (i in seq_len(ncol(df))) {   # seq_len() is safe even if there are zero columns
  sd_results[i] <- sd( x = df[ , i] )  # Calculate and store the standard deviation
}

head(sd_results) # sd_results has no names yet
[1]   6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574
# Set names for the results
names(sd_results) <- colnames(df)

# Print the results
print(sd_results)
        mpg         cyl        disp          hp        drat          wt 
  6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574 
       qsec          vs          am        gear        carb 
  1.7869432   0.5040161   0.4989909   0.7378041   1.6152000 
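
The four approaches above produce the same numbers. The sketch below checks this directly and adds vapply(), a base R relative of sapply() in which you declare the type and length of each result, so an unexpected return value raises an error instead of silently changing the output shape. The object names (sd_apply, sd_sapply, and so on) are illustrative.

```r
df <- mtcars

sd_apply  <- apply(X = df, MARGIN = 2, FUN = sd)
sd_sapply <- sapply(df, sd)

# vapply() is like sapply(), but each result must be a single numeric value
sd_vapply <- vapply(df, sd, FUN.VALUE = numeric(1))

# The explicit loop from above
sd_loop <- numeric(ncol(df))
for (i in seq_len(ncol(df))) {
  sd_loop[i] <- sd(df[, i])
}
names(sd_loop) <- colnames(df)

all.equal(sd_apply, sd_sapply)    # TRUE
all.equal(sd_sapply, sd_vapply)   # TRUE
all.equal(sd_sapply, sd_loop)     # TRUE
```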

Packages

I will demonstrate three functions for generating summary statistics:

  1. summary from base R

  2. dfSummary from the summarytools package

  3. skim from the skimr package

summary

Purpose: Provides a summary of statistics for each column in a data frame or vector, including measures like minimum, maximum, mean, median, and quartiles.

summary(df)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

summarytools

Purpose: dfSummary from the summarytools package provides a comprehensive summary of a data frame, including statistics for numeric variables and frequency tables for categorical variables.

# install.packages("summarytools")
library(summarytools)

Attaching package: 'summarytools'
The following object is masked from 'package:tibble':

    view
dfSummary(df)
Data Frame Summary  
df  
Dimensions: 32 x 11  
Duplicates: 0  

----------------------------------------------------------------------------------------------------------
No   Variable    Stats / Values              Freqs (% of Valid)   Graph               Valid      Missing  
---- ----------- --------------------------- -------------------- ------------------- ---------- ---------
1    mpg         Mean (sd) : 20.1 (6)        25 distinct values     :                 32         0        
     [numeric]   min < med < max:                                   : .               (100.0%)   (0.0%)   
                 10.4 < 19.2 < 33.9                               . : :                                   
                 IQR (CV) : 7.4 (0.3)                             : : :   .                               
                                                                  : : : : :                               

2    cyl         Mean (sd) : 6.2 (1.8)       4 : 11 (34.4%)       IIIIII              32         0        
     [numeric]   min < med < max:            6 :  7 (21.9%)       IIII                (100.0%)   (0.0%)   
                 4 < 6 < 8                   8 : 14 (43.8%)       IIIIIIII                                
                 IQR (CV) : 4 (0.3)                                                                       

3    disp        Mean (sd) : 230.7 (123.9)   27 distinct values     :                 32         0        
     [numeric]   min < med < max:                                 . :                 (100.0%)   (0.0%)   
                 71.1 < 196.3 < 472                               : : :   : : :                           
                 IQR (CV) : 205.2 (0.5)                           : : :   : : :   .                       
                                                                  : : : . : : : . :                       

4    hp          Mean (sd) : 146.7 (68.6)    22 distinct values   . :                 32         0        
     [numeric]   min < med < max:                                 : :                 (100.0%)   (0.0%)   
                 52 < 123 < 335                                   : : : .                                 
                 IQR (CV) : 83.5 (0.5)                            : : : :                                 
                                                                  : : : : . .                             

5    drat        Mean (sd) : 3.6 (0.5)       22 distinct values       :               32         0        
     [numeric]   min < med < max:                                   : :               (100.0%)   (0.0%)   
                 2.8 < 3.7 < 4.9                                    : : .                                 
                 IQR (CV) : 0.8 (0.1)                             . : : :                                 
                                                                  : : : : .                               

6    wt          Mean (sd) : 3.2 (1)         29 distinct values         :             32         0        
     [numeric]   min < med < max:                                       : :           (100.0%)   (0.0%)   
                 1.5 < 3.3 < 5.4                                        : :                               
                 IQR (CV) : 1 (0.3)                               : : : : :     .                         
                                                                  : : : : : .   :                         

7    qsec        Mean (sd) : 17.8 (1.8)      30 distinct values         :             32         0        
     [numeric]   min < med < max:                                       :             (100.0%)   (0.0%)   
                 14.5 < 17.7 < 22.9                                     : :                               
                 IQR (CV) : 2 (0.1)                                 . : : : :                             
                                                                  : : : : : : :   .                       

8    vs          Min  : 0                    0 : 18 (56.2%)       IIIIIIIIIII         32         0        
     [numeric]   Mean : 0.4                  1 : 14 (43.8%)       IIIIIIII            (100.0%)   (0.0%)   
                 Max  : 1                                                                                 

9    am          Min  : 0                    0 : 19 (59.4%)       IIIIIIIIIII         32         0        
     [numeric]   Mean : 0.4                  1 : 13 (40.6%)       IIIIIIII            (100.0%)   (0.0%)   
                 Max  : 1                                                                                 

10   gear        Mean (sd) : 3.7 (0.7)       3 : 15 (46.9%)       IIIIIIIII           32         0        
     [numeric]   min < med < max:            4 : 12 (37.5%)       IIIIIII             (100.0%)   (0.0%)   
                 3 < 4 < 5                   5 :  5 (15.6%)       III                                     
                 IQR (CV) : 1 (0.2)                                                                       

11   carb        Mean (sd) : 2.8 (1.6)       1 :  7 (21.9%)       IIII                32         0        
     [numeric]   min < med < max:            2 : 10 (31.2%)       IIIIII              (100.0%)   (0.0%)   
                 1 < 2 < 8                   3 :  3 ( 9.4%)       I                                       
                 IQR (CV) : 2 (0.6)          4 : 10 (31.2%)       IIIIII                                  
                                             6 :  1 ( 3.1%)                                               
                                             8 :  1 ( 3.1%)                                               
----------------------------------------------------------------------------------------------------------

skim

Purpose: skim from the skimr package provides a detailed and aesthetically pleasing summary of data, including distributions and missing values.

# install.packages("skimr")
library(skimr)
skim(mtcars)
Data summary
Name mtcars
Number of rows 32
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 20.09 6.03 10.40 15.43 19.20 22.80 33.90 ▃▇▅▁▂
cyl 0 1 6.19 1.79 4.00 4.00 6.00 8.00 8.00 ▆▁▃▁▇
disp 0 1 230.72 123.94 71.10 120.83 196.30 326.00 472.00 ▇▃▃▃▂
hp 0 1 146.69 68.56 52.00 96.50 123.00 180.00 335.00 ▇▇▆▃▁
drat 0 1 3.60 0.53 2.76 3.08 3.70 3.92 4.93 ▇▃▇▅▁
wt 0 1 3.22 0.98 1.51 2.58 3.33 3.61 5.42 ▃▃▇▁▂
qsec 0 1 17.85 1.79 14.50 16.89 17.71 18.90 22.90 ▃▇▇▂▁
vs 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
am 0 1 0.41 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
gear 0 1 3.69 0.74 3.00 3.00 4.00 4.00 5.00 ▇▁▆▁▂
carb 0 1 2.81 1.62 1.00 2.00 2.00 4.00 8.00 ▇▂▅▁▁

stargazer and kable

I will now introduce stargazer, one of the most common packages for summary statistics in Economics journals, and then show how to use the kable function from the knitr package for more flexible table formatting based on your preferred summary statistics command. Together they help you create a professional-looking summary statistics table.

  • stargazer package

The stargazer function is designed primarily for summarizing regression models or data frames. See more examples in Appendix.

Base command

# install.packages("stargazer")
require(stargazer)
Loading required package: stargazer

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
stargazer(df)

% Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
% Date and time: Tue, Jul 29, 2025 - 14:40:16
\begin{table}[!htbp] \centering 
  \caption{} 
  \label{} 
\begin{tabular}{@{\extracolsep{5pt}}lccccc} 
\\[-1.8ex]\hline 
\hline \\[-1.8ex] 
Statistic & \multicolumn{1}{c}{N} & \multicolumn{1}{c}{Mean} & \multicolumn{1}{c}{St. Dev.} & \multicolumn{1}{c}{Min} & \multicolumn{1}{c}{Max} \\ 
\hline \\[-1.8ex] 
mpg & 32 & 20.091 & 6.027 & 10.400 & 33.900 \\ 
cyl & 32 & 6.188 & 1.786 & 4 & 8 \\ 
disp & 32 & 230.722 & 123.939 & 71.100 & 472.000 \\ 
hp & 32 & 146.688 & 68.563 & 52 & 335 \\ 
drat & 32 & 3.597 & 0.535 & 2.760 & 4.930 \\ 
wt & 32 & 3.217 & 0.978 & 1.513 & 5.424 \\ 
qsec & 32 & 17.849 & 1.787 & 14.500 & 22.900 \\ 
vs & 32 & 0.438 & 0.504 & 0 & 1 \\ 
am & 32 & 0.406 & 0.499 & 0 & 1 \\ 
gear & 32 & 3.688 & 0.738 & 3 & 5 \\ 
carb & 32 & 2.812 & 1.615 & 1 & 8 \\ 
\hline \\[-1.8ex] 
\end{tabular} 
\end{table} 
What is the default argument for type in stargazer?

Let's set type = "text" and see if we get something more readable.

# install.packages("stargazer")
require(stargazer)

stargazer(df, 
          type = "text" # default argument is not text
)

============================================
Statistic N   Mean   St. Dev.  Min     Max  
--------------------------------------------
mpg       32 20.091   6.027   10.400 33.900 
cyl       32  6.188   1.786     4       8   
disp      32 230.722 123.939  71.100 472.000
hp        32 146.688  68.563    52     335  
drat      32  3.597   0.535   2.760   4.930 
wt        32  3.217   0.978   1.513   5.424 
qsec      32 17.849   1.787   14.500 22.900 
vs        32  0.438   0.504     0       1   
am        32  0.406   0.499     0       1   
gear      32  3.688   0.738     3       5   
carb      32  2.812   1.615     1       8   
--------------------------------------------

Embellished command.

Let's add arguments one by one to improve the presentation and customize the output. You can also experiment with the omit.summary.stat argument.

library(stargazer)

# Enhanced stargazer table
stargazer(df, 
          type = "text",                 # Output format
          title = "Summary Statistics for mtcars Dataset", # Title of the table
          digits = 2,                    # Number of decimal places
          covariate.labels = c("MPG", "Cylinders", "Disp.", "HP", "Rear Axle Ratio", 
                               "Weight", "Q-Mile Time", "V/S", "Transmission", 
                               "Gears", "Carbs")  # Custom labels for variables
)

Summary Statistics for mtcars Dataset
===============================================
Statistic       N   Mean  St. Dev.  Min   Max  
-----------------------------------------------
MPG             32 20.09    6.03   10.40 33.90 
Cylinders       32  6.19    1.79     4     8   
Disp.           32 230.72  123.94  71.10 472.00
HP              32 146.69  68.56    52    335  
Rear Axle Ratio 32  3.60    0.53   2.76   4.93 
Weight          32  3.22    0.98   1.51   5.42 
Q-Mile Time     32 17.85    1.79   14.50 22.90 
V/S             32  0.44    0.50     0     1   
Transmission    32  0.41    0.50     0     1   
Gears           32  3.69    0.74     3     5   
Carbs           32  2.81    1.62     1     8   
-----------------------------------------------

kable

#  install.packages("knitr")
#  install.packages("kableExtra")
#  install.packages("psych")   # needed for psych::describe() below
library(knitr)
library(kableExtra)

Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':

    group_rows
# detach("package:kableExtra")

# Generate descriptive statistics using psych::describe
summary_stats <- psych::describe(mtcars)
summary_stats
     vars  n   mean     sd median trimmed    mad   min    max  range  skew
mpg     1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61
cyl     2 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17
disp    3 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38
hp      4 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73
drat    5 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27
wt      6 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42
qsec    7 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37
vs      8 32   0.44   0.50   0.00    0.42   0.00  0.00   1.00   1.00  0.24
am      9 32   0.41   0.50   0.00    0.38   0.00  0.00   1.00   1.00  0.36
gear   10 32   3.69   0.74   4.00    3.62   1.48  3.00   5.00   2.00  0.53
carb   11 32   2.81   1.62   2.00    2.65   1.48  1.00   8.00   7.00  1.05
     kurtosis    se
mpg     -0.37  1.07
cyl     -1.76  0.32
disp    -1.21 21.91
hp      -0.14 12.12
drat    -0.71  0.09
wt      -0.02  0.17
qsec     0.34  0.32
vs      -2.00  0.09
am      -1.92  0.09
gear    -1.07  0.13
carb     1.26  0.29
# Create a nicely formatted table using kable
kable(x = summary_stats, 
      caption = "Descriptive Statistics for mtcars Dataset", 
      digits = 2 )
Descriptive Statistics for mtcars Dataset
     vars  n   mean     sd median trimmed    mad   min    max  range  skew kurtosis    se
mpg     1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61    -0.37  1.07
cyl     2 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17    -1.76  0.32
disp    3 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38    -1.21 21.91
hp      4 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73    -0.14 12.12
drat    5 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27    -0.71  0.09
wt      6 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42    -0.02  0.17
qsec    7 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37     0.34  0.32
vs      8 32   0.44   0.50   0.00    0.42   0.00  0.00   1.00   1.00  0.24    -2.00  0.09
am      9 32   0.41   0.50   0.00    0.38   0.00  0.00   1.00   1.00  0.36    -1.92  0.09
gear   10 32   3.69   0.74   4.00    3.62   1.48  3.00   5.00   2.00  0.53    -1.07  0.13
carb   11 32   2.81   1.62   2.00    2.65   1.48  1.00   8.00   7.00  1.05     1.26  0.29

Control Structures

Control structures like if, else, for, and while are used to control the flow of execution.

If-Else statement

  • An if-else statement allows you to make decisions based on certain conditions. This is useful when the code must handle different cases depending on the data.
Baby example
x <- 10

if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is 5 or less")
}
[1] "x is greater than 5"
  • You might use an if-else statement to categorize countries into high, medium, or low income based on their GDP per capita:
More realistic example
# Example GDP per capita
gdp_per_capita <- 55000

# If-else statement to categorize income level
if (gdp_per_capita > 40000) {
  income_level <- "High Income"
} else if (gdp_per_capita > 20000) {
  income_level <- "Medium Income"
} else {
  income_level <- "Low Income"
}

print(paste("Income level:", income_level))
[1] "Income level: High Income"
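
The if-else statement above tests a single value, because if () expects one TRUE or FALSE. To categorize many countries at once, a vectorized alternative is the base R function ifelse(). This is a sketch with made-up GDP figures in a hypothetical vector, gdp_values, using the same thresholds as above.

```r
# Hypothetical GDP per capita values for several countries
gdp_values <- c(55000, 30000, 8000, 41000)

# Nested ifelse() applies the same thresholds to every element
income_levels <- ifelse(gdp_values > 40000, "High Income",
                 ifelse(gdp_values > 20000, "Medium Income", "Low Income"))

income_levels
# "High Income" "Medium Income" "Low Income" "High Income"
```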

For Loop

  • A for loop is useful when you need to perform a repetitive task for a fixed number of iterations.

    • Repetitive Tasks: A for loop automates repetitive work, such as reading several files or applying the same calculation to each variable in a list. This is more efficient and less error-prone than repeating the steps by hand.

    • Scalability: As the number of datasets grows, you can simply add more file names to the list without needing to rewrite the calculation code.

Baby example
for (i in 1:5) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
More realistic (fake) example: the file names are placeholders, so the code only runs once matching files exist
file_names <- c("dataset1.csv", "dataset2.csv", "dataset3.csv") # Vector of file names

datasets <- list()  # Initialize an empty list to store datasets

# Loop over each file name
for (i in seq_along(file_names)) {

  datasets[[i]] <- read.csv(file_names[i])  # Read the dataset and store it in the list

}
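
The example above relies on CSV files that do not exist. A self-contained sketch of the same pattern first writes three small files to a temporary folder and then reads them back in a loop. The file contents (slices of mtcars) and the object names are assumptions made only so the code can run.

```r
# Write three small CSV files to a temporary folder so the loop has real input
temp_dir   <- tempdir()
file_names <- file.path(temp_dir, c("dataset1.csv", "dataset2.csv", "dataset3.csv"))

for (i in seq_along(file_names)) {
  write.csv(head(mtcars, n = i * 5), file = file_names[i], row.names = FALSE)
}

datasets <- list()  # Empty list to store the imported datasets

# Read each file and store it as one element of the list
for (i in seq_along(file_names)) {
  datasets[[i]] <- read.csv(file_names[i])
}

length(datasets)          # 3
sapply(datasets, nrow)    # 5 10 15
```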

While Loop

  • A while loop is useful for situations where you need to repeat a task until a certain condition is met. It’s particularly handy when the number of iterations isn’t known in advance and depends on some condition that changes dynamically.
Baby example
i <- 1

while (i <= 5) {
  print(i)
  i <- i + 1
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
More realistic example: Newton-Raphson method
  • While loops are useful for iterative algorithms, such as numerical optimization routines, iterative estimation, and convergence checks in econometric models.

  • The Newton-Raphson method, applied here to finding a square root, iteratively improves the estimate until it is within a chosen tolerance of the true value.

number <- 25    # Number for which we want to find the square root

# Initial guess of square root
guess <- number / 2

# Threshold for convergence
tolerance <- 0.0001

# Iterative process to refine the guess
while (abs(guess^2 - number) > tolerance) {   
  guess <- (guess + number / guess) / 2       # improves the guess
}

# Print the final guess
print(paste("Approximate square root:", round(x = guess,
                                              digits = 4)
            )
      )
[1] "Approximate square root: 5"
  • Initial Guess: Start with an initial guess for the square root.

  • While Loop Condition: Continue the loop as long as the difference between the squared guess and the original number is greater than the tolerance level.

  • Update Guess: Use the average of the guess and the number divided by the guess to refine the estimate.

  • Convergence Check: The loop stops when the difference is smaller than the tolerance, meaning the guess is close enough to the actual square root.
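
The steps above can be wrapped in a reusable function. The sketch below is one possible version, not a standard routine: newton_sqrt is a hypothetical helper name, it assumes a positive input, and it adds a max_iter argument so the loop cannot run forever if convergence fails.

```r
# newton_sqrt is a hypothetical helper name, not a base R function
newton_sqrt <- function(number, tolerance = 1e-4, max_iter = 100) {
  guess      <- number / 2   # initial guess
  iterations <- 0

  # Stop when the guess is close enough or the iteration cap is reached
  while (abs(guess^2 - number) > tolerance && iterations < max_iter) {
    guess      <- (guess + number / guess) / 2
    iterations <- iterations + 1
  }

  list(root = guess, iterations = iterations)
}

result <- newton_sqrt(25)
result$root         # close to 5
result$iterations   # number of updates that were needed
```

Returning the iteration count alongside the root makes it easy to see how quickly the method converges for different inputs and tolerances.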

Appendix

  1. Data Types - GeeksforGeeks; Programiz
  2. Data Structures - Software Carpentry
  3. Stargazer basics.