# Common usage - commands separated by new line
2**1
2**3 # 2nd command is on a new (second) line
# Alternative usage - commands separated by semicolon
2**2 ; 2**3 # 2nd command is on the same (first) line
II. Basics of R Programming
Getting Started with R
R is a versatile language used for statistical computing and graphics.
R Environment:
- RStudio: Integrated Development Environment (IDE) for R.
- R Console: Where you can directly execute R commands.
- R Script Editor: Write and save R scripts (.R files).
- R Markdown / Quarto: Create dynamic documents (.Rmd, .qmd).
Running Code:
- Type code in the Console and press Enter, or run the current line from the Script Editor with Ctrl+Enter (Cmd+Enter on macOS).
- We can run code from within a code chunk in a .qmd or .Rmd file as well.
Basic Syntax in R
- R can be used as a calculator, and commands end with a newline or semicolon.
Commenting on Code
- Creating Single-Line Comments: Created using the # symbol. Anything following the # on that line is treated as a comment and is ignored by the R interpreter.
  - Often used to explain code.
# This is a single line comment in R
2 * 2 # multiplication
[1] 4
- Multiple Single-Line Comments: The most common way to create multi-line comments in R is to use multiple single-line comments, each starting with the # symbol.
  - Often used to stop R from interpreting certain parts of the code.
  - Code -> Comment/Uncomment Lines is a shortcut if you prefer.
# This is a multi-line comment
# that spans across several lines.
# Each line starts with the '#' symbol.
- Purpose of Comments: Commenting your code helps in increasing its readability and maintainability. Comments provide context and explanations for various parts of the code (self-documentation), which is helpful for both yourself and others who may read or work with your code in the future.
Best Practices for Commenting:
Be Descriptive: Explain what the code is doing and why, not just what it does.
Keep Comments Up-to-Date: Update comments when the code changes to avoid inaccuracies.
Avoid Redundant Comments: Skip obvious or trivial details that the code already clearly expresses.
Let's comment on some basic arithmetic operations in R.
Math Operations in R
Basic math operations are fundamental arithmetic operations that you can perform directly on numeric values. They include:
5 + 3 # Addition
[1] 8
10 - 4 # Subtraction
[1] 6
6 * 7 # Multiplication
[1] 42
8 / 2 # Division
[1] 4
2^3 # Exponentiation
[1] 8
7 %% 3 # Modulus: 7 %% 3 = 1 (remainder of division; useful for testing odd/even numbers)
[1] 1
7 %/% 3 # Integer Division: 7 %/% 3 = 2 (Quotient without remainder)
[1] 2
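The modulus operator is commonly used to test whether a number is odd or even. A minimal sketch, where `n` is an illustrative variable that does not appear in the tutorial:

```r
# Use the modulus operator to test whether a number is even or odd
n <- 7
n %% 2 == 0   # FALSE: 7 leaves a remainder of 1, so it is odd
n %% 2 == 1   # TRUE

# Integer division and modulus together rebuild the original number
(n %/% 3) * 3 + (n %% 3)   # 7
```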
Mathematical Functions
Mathematical functions in R are built-in functions that perform more complex calculations or transformations. These functions are designed to handle specific mathematical operations and often involve more complex logic than basic arithmetic operations. Some common mathematical functions include:
sqrt(16) # Square Root: sqrt(16) = 4
[1] 4
abs(-5) # Absolute Value: abs(-5) = 5
[1] 5
log(10) # Logarithm (Natural Log): log(10) = 2.302585
[1] 2.302585
log10(100) # Logarithm with base 10: log10(100) = 2
[1] 2
exp(2) # Exponential: exp(2) = 7.389056
[1] 7.389056
sin(pi / 2) # Sine: sin(pi / 2) = 1
[1] 1
cos(0) # Cosine: cos(0) = 1
[1] 1
tan(pi / 4) # Tangent: tan(pi / 4) = 1
[1] 1
- Warning
In R Markdown, if a code chunk contains an error, it can prevent the document from knitting or compiling properly. Warnings and unusual results such as -Inf or NaN do not stop compilation.
log(0)
[1] -Inf
log(-1) # R returns NaN and issues a warning instead of an error
Warning in log(-1): NaNs produced
[1] NaN
# document will not compile if you uncomment the line below. Try uncommenting it.
# 10/a
- R will typically show you the line number/chunk lines where you have the error. You will have to troubleshoot that part for the document to compile.
  - Switching to Source helps to identify the part where the error is.
- Mathematical operations can be performed on numbers, vectors, and other numeric data types. More on data types in a bit, as applying functions to incorrect data types is one important reason why one gets errors.
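As a rough illustration of both points, the sketch below applies arithmetic to a whole vector and then shows what happens when a math function receives text. The names `values` and `outcome` are illustrative only.

```r
# Arithmetic and math functions are applied element by element to a vector
values <- c(1, 4, 9, 16)   # illustrative data
values * 2                 # 2 8 18 32
sqrt(values)               # 1 2 3 4

# Applying a math function to the wrong data type raises an error,
# which is caught here so the script keeps running
outcome <- tryCatch(
  sqrt("16"),
  error = function(e) "error: non-numeric argument"
)
outcome
```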
Explicitly specify the arguments
- Explicitly specifying arguments in your code can greatly enhance readability and maintainability.
tan(pi/4)
[1] 1
?tan
tan(x = pi/4)
[1] 1
- You might be able to guess, or understand what the argument is by opening the help file on the function with the ?tan command, but with functions that require multiple arguments it is often not obvious.
?seq
seq(0,10,1) # less readable
[1] 0 1 2 3 4 5 6 7 8 9 10
seq(from = 0, to = 10, by = 1) # more readable
[1] 0 1 2 3 4 5 6 7 8 9 10
In R, you can perform help searches both globally and locally to find information about functions, packages, or keywords.
Local help refers to retrieving documentation related to specific functions, objects, or packages that you are currently working with in your R environment.
Global help refers to searching across all available R documentation for topics, keywords, or functions related to a specific subject. This is useful when you are looking for information on a broader topic or when you don’t know the exact function name.
# local help
?tanh
help(tanh)
# global help
??tanh
help.search("tanh") # to find help topics related to specific keywords.
- You do not leave the RStudio interface.
Google Search: Use Google to search for specific examples or solutions related to your problem by querying: "R - 'what you are trying to do'". This approach helps you find blogs, articles, and tutorials that provide relevant examples and explanations.
- Example Search: If you are trying to understand how to perform linear regression in R, you might search for: "R - linear regression example".
- You have to leave the RStudio interface.
AI Tools: Utilize AI tools like ChatGPT, Bard, or other similar platforms to get tailored advice and examples. These tools can provide quick explanations, code snippets, and solutions based on your specific queries.
- Example: If you need help with a specific R function or concept, you can ask: "How do I use the ggplot2 package to create a scatter plot in R?" and receive a detailed response or code snippet.
- Can be combined within the RStudio interface.
By combining these three methods, you can effectively find and adapt the code examples you need to complete your tasks in R.
Aligning Code
As much as possible, align your code and explicitly specify function arguments.
- This practice enhances code readability and maintainability.
When reading code written by others, especially as a beginner, look for and follow these practices to improve your understanding and integration with the code.
See example below. Which is the easiest to understand and maintain?
seq(0,10,1) # hard to understand but concise
[1] 0 1 2 3 4 5 6 7 8 9 10
seq(from = 0, to = 10, by = 1) # easier to understand but not optimised for maintenance
[1] 0 1 2 3 4 5 6 7 8 9 10
# easiest to interpret and edit
seq(from = 0,
    to   = 10,
    by   = 1
    )
[1] 0 1 2 3 4 5 6 7 8 9 10
Each method has its pros and cons, though.
Basic Usage: The seq(0, 10, 1) format is concise but may not immediately convey the role of each argument, especially for someone new to the function.
Named Arguments: Using seq(from = 0, to = 10, by = 1) makes it clear what each argument represents (from, to, by), improving readability.
Formatted for Readability: Placing each named argument on its own line makes the call very clear and easy to edit, especially for more complex sequences with additional parameters.
- This format is beneficial for maintaining code clarity and avoiding errors, especially for beginners.
Using variables and the Assignment Operator
- Variables are used to store data in R. Use the <- operator for assignment, though = is also acceptable.
  - Objects should appear in your Environment tab upon successful assignment, which you can examine/print.
x <- 10 # Using <- for assignment, assigns 10 to variable x
y = 5 # Using = for assignment
Common Mistakes:
- Unassigned Variables: Trying to use a variable in a function that has not been assigned a value will result in an error.
remove(list = ls()) # Remove all variables from the environment
tryCatch(
  {
    # Attempt to print a non-existent object
    print(x)
  },
  error = function(e) {
    # Handle the error
    print("Error: The object does not exist.")
  }
)
[1] "Error: The object does not exist."
- remove(list = ls()) removes all variables from the environment.
- tryCatch is used in R to handle errors gracefully by allowing you to specify actions to take if an error occurs, ensuring the code continues running instead of stopping abruptly. It can help prevent the entire script or document from failing due to a single error.
x <- 24 # define x
print(x) # Explicitly prints the value of x to the console
[1] 24
- Now, print(x) does not give an error.
# Simply typing the variable name in the console or script will display its value (implicitly prints the value of x).
# This method is often used interactively in the console or within scripts for quick checks.
x
[1] 24
- Overwriting Values: Reassigning a value to an existing variable will overwrite the previous value and may cause errors if not handled properly.
- In lengthy code, it’s easy to accidentally overwrite variables.
- Always verify the type and contents of your objects to avoid unintended issues.
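A small sketch of how an accidental overwrite can change an object's type, and how a quick check catches it. The variable `total` is hypothetical.

```r
# Reassigning a name silently replaces the old value, and possibly its type
total <- 100            # hypothetical variable
class(total)            # "numeric"

total <- "one hundred"  # overwrites the number with text
class(total)            # "character"

# A quick check before doing arithmetic avoids a confusing error later
is.numeric(total)       # FALSE
```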
R is Object-Oriented
R is an object-oriented language, which means:
Objects: In R, data and functions are treated as objects. Everything you work with in R, such as vectors, lists, and data frames, is an object.
Classes: Objects belong to classes, which define their structure and behavior. For example, a data frame is a class of objects with specific methods for handling tabular data.
Methods: R supports method dispatch, meaning that the methods (functions) that operate on objects can vary depending on the object’s class. This allows for more flexible and powerful data manipulation.
Inheritance: R uses a system of inheritance, where objects can inherit properties and methods from other classes. This enables the creation of complex data structures built on simpler ones.
Understanding R’s object-oriented nature helps in designing more efficient and modular code, making it easier to work with complex data and perform sophisticated analyses.
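To make classes and method dispatch more concrete, here is a minimal sketch using R's S3 system. The class name `bank_account` and the object `account` are made up for illustration; only `structure()`, `print()`, `cat()`, and `inherits()` are base R functions.

```r
# Create an object and give it a (hypothetical) class name
account <- structure(list(owner = "Alice", balance = 250),
                     class = "bank_account")

# Define a print method for that class; print() dispatches on class(account)
print.bank_account <- function(x, ...) {
  cat("Account of", x$owner, "with balance", x$balance, "\n")
  invisible(x)
}

print(account)                      # uses print.bank_account()
inherits(account, "bank_account")   # TRUE
```

The same generic function, `print()`, behaves differently depending on the class of the object it receives, which is what method dispatch means in practice.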
Writing and using functions
Functions are defined using the function keyword. The basic syntax is function(arg1, arg2, ...) { ... }.
Let's create your first function.
my_addition <- function(x, y) {
  return(x + y)
}
result <- my_addition(10, 5) # Calls the function with arguments 10 and 5
result
[1] 15
- You can open up functions to see how they have been written. Some functions can be very complex. Let's look at how a simple function, sd, is written:
# square root of variance
sd
function (x, na.rm = FALSE)
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm))
<bytecode: 0x10ebd3d08>
<environment: namespace:stats>
na.rm: A logical value indicating whether to remove missing values (NA) before the computation. The default is FALSE, meaning missing values are not removed.
R provides a wide range of built-in functions for various tasks, such as data manipulation, statistical analysis, and plotting. These functions, like mean(), lm(), and plot(), offer ready-to-use solutions for common problems and streamline the coding process. Utilizing these functions can significantly simplify and accelerate data analysis workflows.
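Default arguments such as na.rm = FALSE can be used in your own functions too. The sketch below defines a hypothetical helper, `my_range_width()`, that passes na.rm through to the built-in `max()` and `min()`.

```r
# A hypothetical helper with a default argument, mirroring sd()'s na.rm
my_range_width <- function(x, na.rm = FALSE) {
  max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
}

scores <- c(4, 9, NA, 1)               # illustrative data with a missing value
my_range_width(scores)                 # NA, because the default keeps missing values
my_range_width(scores, na.rm = TRUE)   # 8
```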
Packages and Libraries
In R, packages and libraries are essential for extending the functionality of the base R environment.
Definition: Packages are collections of R functions, data, and documentation bundled together. They are designed to add specific capabilities or functions to R, such as data manipulation, statistical analysis, or visualization.
Installation: To use a package, you first need to install it. This is done using the install.packages() function.
Use install.packages("package_name") to install a package and library(package_name) to load it.
- Let's install tidyverse, which is a set of R packages for data science.
# install.packages("tidyverse") # Install the tidyverse package
library(tidyverse) # Load the tidyverse package
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
- I am commenting out the install.packages() line as I do not want to install tidyverse again (it can take time and is not required, as I have already installed it in the past).
Installation is a one-time setup process where you get all the necessary tools (functions) into your R environment.
Loading is done every time you start a new R session and want to use the tools you’ve previously installed.
Think of installing a package like installing a light bulb in a fixture. When you install a light bulb, you’re setting up the bulb in the fixture so that it’s ready to be used whenever you need light.
Loading a package is like turning on the light switch. Once the bulb is installed, you need to flip the switch to actually turn on the light and make use of the bulb’s illumination. Similarly, after installing a package, you need to load it in your current R session to use its functions and data.
- Without the internet, you won’t be able to download these packages and their dependencies, just as you wouldn’t be able to get the necessary components for your light fixture without a store.
A library, on the other hand, refers to the location on your system where installed packages are stored and accessed. Essentially, a package is a unit of functionality, while a library is the storage space for these packages.
To find your library, type .libPaths() in your R console and find the address.
.libPaths()
[1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library"
- You can find all the installed packages in your R library.
- Only loaded packages will have a tick next to them in the list of installed packages.
Data Types and Data Structures
Data Type: In R, “data type” refers to the classification of variables or values, such as numeric, integer, character, etc., which determine how data is stored in memory and what operations can be performed on them.
Data Structure: A “data structure” refers to the way data is organized and stored in R objects, such as vectors, lists, matrices.
Data Types
In R, data types refer to the classification of variables or values that determine how they are stored in memory and what operations can be performed on them.
Here are the primary data types in R:
Numeric
Used for real numbers, which include both whole-number and decimal (floating-point) values.
Examples: 1.5, 3, 0.25
x <- 3
typeof(x)
[1] "double"
typeof(3)
[1] "double"
x <- 1.5
typeof(x)
[1] "double"
Typically, R stores these as “double-precision floating-point numbers”, which is why the typeof() function often returns "double" for the “double-precision” numeric type.
- “Double-precision floating-point numbers” are a way of representing real numbers (numbers with decimals) in computer memory using 64 bits. This format allows for a wide range of values, including very large and very small numbers, while maintaining precision.
Used for quantitative variables. Typically refers to continuous data that can take on any value within a range.
- Examples include height, weight, and income.
Integer
Used for integer values (whole numbers, including negative ones).
Examples: 1L, 2L, -100L
x <- 3L
typeof(x)
[1] "integer"
x <- -100L
typeof(x)
[1] "integer"
The L suffix declares the number to be an integer.
Used for count data (discrete or integer data that represents counts or the number of occurrences of an event).
- Examples include the number of children in a family, the number of cars in a parking lot, or the number of students in a class. These variables are usually whole numbers and do not have decimal points.
Logical
Used for Boolean values to represent binary conditions or logic, often as the result of comparisons and logical operations. Boolean values are fundamental in control flow, decision-making, and logical operations within programs.
- Example: TRUE, FALSE
keep_ob <- TRUE # FALSE
typeof(keep_ob)
[1] "logical"
keep_ob <- T # F
typeof(keep_ob)
[1] "logical"
T and F do work as shorthand for TRUE and FALSE in R. However, it’s generally recommended to use TRUE and FALSE instead of T and F to avoid potential issues. This is because T and F are just variables that are predefined in R to represent TRUE and FALSE, and they can be reassigned to something else, which could lead to unexpected behavior in your code.
tryCatch(
  {
    keep_obs <- c(False, True) # This will cause an error
  },
  error = function(e) {
    print("Error: object 'False' not found")
  }
)
[1] "Error: object 'False' not found"
- The code does not work because R is case-sensitive, and the correct logical values should be FALSE and TRUE, not False and True.
- Used for comparisons, especially in selecting rows/cases that meet a certain condition (of truth or falsity).
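The risk of relying on T and F can be shown in a few lines. This is only a sketch: it deliberately overwrites T and then removes the copy again so later code is unaffected.

```r
# T is an ordinary variable, so it can be overwritten (TRUE cannot)
T <- 0
isTRUE(T)      # FALSE: T no longer means TRUE
isTRUE(TRUE)   # TRUE: the reserved word is unaffected

rm(T)          # remove our copy so T falls back to the built-in value
isTRUE(T)      # TRUE again
```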
Character
Used for strings of text.
Example: "Hello", "R programming"
address <- c("Old Road", "Undone Road", "Anton Road")
typeof(address)
[1] "character"
keep_obs <- c("TRUE", "TRUE", "FALSE", "TRUE")
typeof(keep_obs)
[1] "character"
Sometimes numeric and integer data can be stored as character, and you will not be able to execute mathematical operations on them. You will have to coerce/force the data type into the correct type.
result
[1] 15
tryCatch(
  {
    a <- "10"
    b <- "5"
    my_addition(a, b) # This will cause an error
  },
  error = function(e) {
    print("! non-numeric argument to binary operator")
  }
)
[1] "! non-numeric argument to binary operator"
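One way to resolve this error is to coerce the character values to numeric before adding them. In the sketch below, `add_two()` is a hypothetical stand-in for the addition function defined earlier, redefined so the block runs on its own.

```r
# Stand-in for the addition function used above (hypothetical name)
add_two <- function(x, y) x + y

a <- "10"
b <- "5"

# Coerce the character strings to numbers first, then add
add_two(as.numeric(a), as.numeric(b))   # 15
```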
- Used for textual data. Examples include storing words (sentiment analysis), addresses and geographic data, or processing text in tasks like web scraping.
  - Strings are typically enclosed in quotation marks, either single (') or double (").
Factor
A factor is a specific data type used to represent categorical variables.
- Example: factor(c("low", "medium", "high"))
colors <- factor(c("red", "blue", "green", "red"))
typeof(colors)
[1] "integer"
For factors, typeof() will return "integer", as factors are internally stored as integers with corresponding character labels.
To check the class of an object, which provides more descriptive information, use the class() function.
class(colors)
[1] "factor"
For a factor, class() will return "factor".
Factors can be ordered or unordered, depending on whether the categories have a natural order (e.g., “low,” “medium,” “high”) or not (e.g., “red,” “blue,” “green”).
Factors are particularly useful for handling qualitative data, where the values represent categories or groups rather than numerical quantities.
- Factors are often used in statistical modeling, where categorical variables need to be treated differently from numerical variables.
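The difference between ordered and unordered factors can be sketched as follows. The object `sizes` is illustrative; `levels` and `ordered` are standard arguments of `factor()`.

```r
# An ordered factor: the levels argument fixes the order low < medium < high
sizes <- factor(c("low", "high", "medium", "low"),
                levels  = c("low", "medium", "high"),
                ordered = TRUE)

levels(sizes)       # "low" "medium" "high"
sizes < "high"      # TRUE FALSE TRUE TRUE
as.integer(sizes)   # 1 3 2 1: the integer codes behind the labels
```

Comparisons such as `<` are only meaningful for ordered factors; on an unordered factor they return NA with a warning.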
Date
Used for representing dates in R.
Example: as.Date("2023-01-01") creates a Date object for January 1, 2023.
date_example <- as.Date("2023-01-01")
typeof(date_example)
[1] "double"
The Date type allows you to perform operations related to dates, such as calculating differences between dates, extracting components (e.g., day, month, year), and formatting dates for display.
- Dates are stored internally as the number of days since January 1, 1970, which is the origin date in R. This allows for efficient date calculations and manipulations.
Commonly used in data analysis for handling temporal data, such as event timestamps, transaction dates, or historical data.
- Examples include tracking the date of sales transactions, scheduling events, or analyzing time series data.
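A short sketch of the date operations described above, using two made-up dates:

```r
start_date <- as.Date("2023-01-01")   # illustrative dates
end_date   <- as.Date("2023-03-15")

end_date - start_date      # Time difference of 73 days
as.numeric(start_date)     # 19358: days since 1970-01-01
format(start_date, "%Y")   # "2023": extract the year as text
start_date + 30            # "2023-01-31": adding a number adds days
```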
POSIXct and POSIXlt
Used for representing date-time values. POSIXct stores them as the number of seconds since the Unix epoch (January 1, 1970), while POSIXlt stores them as a list of components (seconds, minutes, hours, and so on).
Example: as.POSIXct("2023-01-01 12:00:00") creates a POSIXct object for January 1, 2023, at 12:00 PM.
datetime_example_ct <- as.POSIXct("2023-01-01 12:00:00")
typeof(datetime_example_ct)
[1] "double"
The POSIXct type stores date-time as the number of seconds since the epoch, which allows for efficient arithmetic and comparison operations involving date-time values.
- This format is useful for handling and manipulating continuous date-time data, such as timestamps in logs or time series data.
Commonly used for precise time measurements and calculations, including scheduling events, analyzing temporal data, and managing timestamps.
- Examples include recording the exact time of transactions, tracking event occurrences, or computing time intervals between events.
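A minimal sketch of date-time arithmetic with two made-up timestamps. The time zone is set to "UTC" explicitly (an assumption made here) so that the results do not depend on the local settings of your machine.

```r
# tz = "UTC" keeps the example independent of the local time zone
t_start <- as.POSIXct("2023-01-01 12:00:00", tz = "UTC")
t_end   <- as.POSIXct("2023-01-01 13:30:00", tz = "UTC")

difftime(t_end, t_start, units = "mins")   # Time difference of 90 mins
as.numeric(t_start)                        # seconds since 1970-01-01 00:00:00 UTC
t_start + 3600                             # adding a number adds seconds (one hour later)

# POSIXlt exposes the individual components as list elements
as.POSIXlt(t_start)$hour                   # 12
```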
Complex
Used for representing complex numbers (in the form a + bi, where a is the real part and b is the coefficient of the imaginary unit i).
Example: 1 + 2i creates a complex number where 1 is the real part and 2i is the imaginary part.
complex_example <- 1 + 2i
typeof(complex_example)
[1] "complex"
- This format allows for calculations involving complex numbers, such as addition, multiplication, and finding magnitudes or phases.
Commonly used in fields such as engineering, physics, and applied mathematics, where complex numbers are used to model phenomena involving oscillations, waves, and other periodic behaviors.
Examples include electrical engineering (for AC circuit analysis), quantum mechanics (for wave functions), and signal processing (for Fourier transforms).
Not common in Economics.
Why data types matter?
These data types help R manage different kinds of information and determine how operations such as arithmetic, comparisons, and transformations are performed on them. Understanding these types is crucial for effective data manipulation, analysis, and programming in R.
R Data types are used to specify the kind of data that can be stored in a variable.
For effective memory consumption and precise computation, the right data type must be selected.
Each R data type has unique properties and associated operations.
- Different forms of data that can be saved and manipulated are defined and categorized using data types in computer languages including R.
Common Coercion
Valid coercion
Numeric to Integer
- Numeric values are coerced to integers by rounding towards zero / dropping everything after the decimal.
as.integer(1.9)
[1] 1
as.integer(-1.9)
[1] -1
- Logical values can also be coerced to integers: TRUE becomes 1 and FALSE becomes 0.
as.integer(TRUE) # Results in 1
[1] 1
as.integer(FALSE) # Results in 0
[1] 0
Invalid coercion
Character to Integer
Attempting to coerce characters directly to integers will result in NA values, as R cannot convert non-numeric strings to numbers.
as.integer("abc") # Results in NA
Warning: NAs introduced by coercion
[1] NA
as.integer("$123") # Will have to clean your data and remove $
Warning: NAs introduced by coercion
[1] NA
as.integer("123") # Results in 123 (valid conversion)
[1] 123
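Cleaning values such as "$123" before conversion might look like the sketch below. The vector `price_text` is illustrative, and the pattern keeps digits only, so it would also drop decimal points and minus signs; adjust it if your data contains them.

```r
price_text <- c("$123", "$1,250", "87")   # illustrative raw values

# Strip everything that is not a digit, then convert to integer
price_clean <- as.integer(gsub("[^0-9]", "", price_text))
price_clean   # 123 1250 87
```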
Common Data Structures: Vectors, Lists, Matrices and Dataframes
In R, data structures are fundamental components used to store and organize data types efficiently.
- There are many data structures in R.
Each data structure has specific characteristics that determine how data is stored in memory and what operations can be performed on it.
Understanding these structures is crucial for effectively manipulating and analyzing data in R.
Vectors:
Atomic Vectors: These are one-dimensional arrays that can hold elements of the same data type, such as numeric, character, or logical.
Example: c(1, 2, 3, 4) creates a numeric vector.
my_vector <- c(1, 2, 3, 4)
class(my_vector) # numeric
[1] "numeric"
my_vector <- c("1", 2, 3, 4) # mixing types coerces every element to character
class(my_vector) # character
[1] "character"
Factors in R are a special type of vector. Factors are stored as integer vectors internally, where each integer corresponds to a level in the factor. The levels themselves are stored as character vectors. Factors are used to handle categorical data efficiently, such as groups or categories. They allow R to manage and sort categorical variables properly.
my_factor <- colors
class(my_factor) # factor
[1] "factor"
typeof(my_factor) # integer
[1] "integer"
levels(my_factor) # character
[1] "blue" "green" "red"
Lists: Lists are also one-dimensional but can hold elements of different types or structures.
Example: list(1, "a", TRUE) creates a list with numeric, character, and logical elements. Lists are versatile data structures.
my_list <- list(1, "a", TRUE)
class(my_list) # list
[1] "list"
Matrices:
Matrices are two-dimensional arrays where all elements are of the same data type (numeric, character, etc.).
Example: matrix(data = 1:6, nrow=2, ncol=3) creates a 2x3 matrix.
my_matrix <- matrix(data = 1:6, # shorthand for - 1 2 3 4 5 6
                    nrow = 2,
                    ncol = 3
                    )
class(my_matrix) # matrix, array
[1] "matrix" "array"
typeof(my_matrix) # integer
[1] "integer"
my_matrix <- matrix(data = c(a,2,3,4,5,6), # a is the character value "10" assigned earlier, so all elements become character
                    nrow = 2,
                    ncol = 3
                    )
class(my_matrix) # matrix, array
[1] "matrix" "array"
typeof(my_matrix) # character
[1] "character"
Arrays: Generalized versions of matrices with more than two dimensions.
An array in R is a data structure that holds elements in multiple dimensions. Unlike a matrix, which is two-dimensional, an array can have three or more dimensions. Arrays are used to store data in a grid-like format with dimensions that can be specified.
Dimensions: Arrays can have any number of dimensions. For example, a three-dimensional array can be thought of as a stack of matrices.
Homogeneity: All elements in an array must be of the same type (e.g., all numeric, all character).
Creation: Arrays are created using the array() function, where you specify the data, dimensions, and optional dimension names.
# Create a 3x3x2 array
data <- 1:18
my_array <- array(data, dim = c(3, 3, 2))
# Print the array
my_array
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18
typeof(my_array) # integer
[1] "integer"
class(my_array) # array
[1] "array"
my_array is a three-dimensional array with dimensions 3x3x2, where the array holds numbers from 1 to 18.
Data Frames:
Data frames are two-dimensional structures similar to tables in databases or spreadsheets.
Columns can be of different data types (numeric, character, etc.).
Example: data.frame(id=c(1, 2, 3), name=c("Alice", "Bob", "Charlie")) creates a data frame with columns “id” and “name”.
my_df <- data.frame(id = c(1, 2, 3),
                    name = c("Alice", "Bob", "Charlie")
                    )
class(my_df) # data.frame
[1] "data.frame"
typeof(class(my_df)) # "character"
[1] "character"
Data Tables (from the data.table package):
Data tables are enhanced data frames optimized for large datasets and efficient operations.
Example: data.table(id=c(1, 2, 3), name=c("Alice", "Bob", "Charlie")) creates a data table with columns “id” and “name”.
Each structure offers different capabilities and efficiencies depending on the nature of the data and the tasks being performed. By leveraging the appropriate data structure, you can optimize your workflow and enhance your ability to work with data in R effectively.
Basic Operations on Common Data Structures
Vectors
Create a vector
numeric_vector <- c(1, 2, 3, 4, 5)
char_vector <- c("apple", "banana", "cherry")
Access elements
numeric_vector[1] # Returns 1
[1] 1
char_vector[1:2] # Returns "apple" "banana"
[1] "apple" "banana"
Modify elements
numeric_vector[2] <- 10
Vector operations
numeric_vector + 5 # Adds 5 to each element
[1] 6 15 8 9 10
numeric_vector * 2 # Multiplies each element by 2
[1] 2 20 6 8 10
numeric_vector > 3 # Returns a logical vector
[1] FALSE TRUE FALSE TRUE TRUE
[1] FALSE TRUE FALSE TRUE TRUE
Matrices
Create a matrix
?matrix
my_matrix <- matrix(data = 1:9,
                    nrow = 3,
                    ncol = 3
                    )
my_matrix
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Access elements
my_matrix[1, 2] # Returns 4 (element at row 1, column 2)
[1] 4
my_matrix[1, ] # First row
[1] 1 4 7
my_matrix[, 2] # Second column
[1] 4 5 6
Modify elements
my_matrix[2, 3] <- 10
Matrix operations
my_matrix + 5 # Adds 5 to each element
[,1] [,2] [,3]
[1,] 6 9 12
[2,] 7 10 15
[3,] 8 11 14
t(my_matrix) # transpose of a matrix
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 10 9
solve(my_matrix) # inverse of a matrix - exists only for square, non-singular matrices
[,1] [,2] [,3]
[1,] -1.25 0.5 0.4166667
[2,] 1.00 -1.0 0.3333333
[3,] -0.25 0.5 -0.2500000
my_matrix %*% my_matrix # Matrix multiplication - requires conformability
[,1] [,2] [,3]
[1,] 30 66 110
[2,] 42 93 154
[3,] 42 96 162
Lists
Create a list
my_list_example <- list(number = 1:5,
                        character = c("apple", "banana"),
                        matrix = my_matrix
                        )
Access elements
my_list_example$number # Returns 1 2 3 4 5
[1] 1 2 3 4 5
my_list_example[[2]] # Returns "apple" "banana" (second element)
[1] "apple" "banana"
Modify elements
my_list_example$number[2] <- 10
List Operations
# Add a new element to the list
my_list_example$new_element <- c(TRUE, FALSE, TRUE)
# Remove an element from the list
my_list_example$new_element <- NULL
# Combine lists
another_list <- list(new_matrix = matrix(5:8, nrow = 2))
combined_list <- c(my_list_example, another_list)
Data Frames
Create a data frame
my_df_example <- data.frame(
  Column1 = c(1, 2, 3),
  Column2 = c("A", "B", "C"),
  Column3 = c(TRUE, FALSE, TRUE)
)
my_df_example
Column1 Column2 Column3
1 1 A TRUE
2 2 B FALSE
3 3 C TRUE
Access elements
my_df_example$Column1 # Returns 1 2 3
[1] 1 2 3
my_df_example[1, 2] # Returns "A" (Element at row 1, column 2)
[1] "A"
Modify elements
my_df_example$Column1[2] <- 10
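Logical vectors are often used to select the rows of a data frame that meet a condition. A minimal sketch with a made-up data frame, `people`:

```r
people <- data.frame(
  id   = c(1, 2, 3),
  name = c("Alice", "Bob", "Charlie"),
  paid = c(TRUE, FALSE, TRUE)
)

people[people$paid, ]           # rows where paid is TRUE
people[people$id > 1, "name"]   # "Bob" "Charlie"
nrow(people[people$paid, ])     # 2
```

The expression before the comma selects rows, and the one after it selects columns; leaving the column part empty keeps all columns.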
Descriptive Statistics
First, let's load a basic inbuilt dataset in R as an object named df.
- More on how to import real data in later tutorials.
df <- mtcars
Check it has been imported.
- Often you have to check if it has been correctly imported. More on that in later tutorials.
class(df)
[1] "data.frame"
colnames(df)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
nrow(df)
[1] 32
Calculating mean, median, variance, and standard deviation.
Function: mean() calculates the average of a numeric vector. Provides a central value of the dataset.
mean(df$mpg) # Calculates the mean of the 'mpg' column in the dataframe 'df'
[1] 20.09062
Function: median() finds the middle value in a numeric vector when the values are sorted in ascending order. Offers a robust central value, less affected by outliers compared to the mean.
median(df$mpg) # Calculates the median of the 'mpg' column in the dataframe 'df'
[1] 19.2
Function: var() measures the dispersion of a numeric vector around its mean, showing how much the values spread out. Gives an indication of variability but is in squared units of the original data.
var(df$mpg) # Calculates the variance of the 'mpg' column in the dataframe 'df'
[1] 36.3241
Function: sd() indicates the amount of variation or dispersion in a numeric vector, providing a measure of how spread out the values are around the mean. Provides variability in the same units as the data, making it more interpretable.
sd(df$mpg) # Calculates the standard deviation of the 'mpg' column in the dataframe 'df'
[1] 6.026948
sd(df$mpg) == var(df$mpg)^.5
[1] TRUE
When using these functions, ensure that the data is numeric and clean to avoid errors in calculations.
NA (Not Available)
Type: General purpose placeholder for any kind of missing data.
Usage: NA is used to represent missing values across all data types (e.g., numeric, character, logical).
Behavior: When you perform operations involving NA, the result is typically NA unless explicitly handled.
x <- c(1, NA, 3)
mean(x) # Result: NA
[1] NA
mean(x = x, na.rm = TRUE )
[1] 2
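Before deciding how to handle missing values, it helps to locate and count them. A short sketch with an illustrative vector, `x_missing`:

```r
x_missing <- c(1, NA, 3, NA)   # illustrative vector with two missing values

is.na(x_missing)               # FALSE TRUE FALSE TRUE
sum(is.na(x_missing))          # 2 missing values
x_missing[!is.na(x_missing)]   # 1 3: keep only the non-missing values
```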
Summary statistics for data frames
Instead of applying functions one by one to each column in a dataframe, you can either iteratively apply functions across columns (useful in simulations) or use a package to extend R’s capabilities and create summary statistics for the entire dataframe.
Iteratively apply a function on each column in a dataframe
- When performing simulations, one often needs to generate multiple sets of results and keep them in a list for further analysis.
apply
returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
?apply
# Calculate the standard deviation for each column in the dataframe
sd_per_column <- apply(X = df,
                       MARGIN = 2, # for a matrix 1 indicates rows, 2 indicates columns
                       FUN = sd
)
# Print the result
sd_per_column
mpg cyl disp hp drat wt
6.0269481 1.7859216 123.9386938 68.5628685 0.5346787 0.9784574
qsec vs am gear carb
1.7869432 0.5040161 0.4989909 0.7378041 1.6152000
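To see what the MARGIN argument does, it can help to try apply on a matrix small enough to check by hand. The matrix below is a made-up illustration, not part of the lesson data. Note that apply first coerces a data frame to a matrix, so it is safest when all columns are numeric, as they are in mtcars.

```r
# A small 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

# MARGIN = 1 applies the function to each row
row_sums <- apply(X = m, MARGIN = 1, FUN = sum)

# MARGIN = 2 applies the function to each column
col_sums <- apply(X = m, MARGIN = 2, FUN = sum)
```

Here row_sums has one value per row (two values) and col_sums has one value per column (three values).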
- The sapply and lapply functions in R are used for applying functions to elements of a list or dataframe. They belong to the same family of commands but differ in their return types and use cases:
lapply
function applies a function to each element of a list or dataframe and returns a list (regardless of the function’s return type).
# Apply function to each element of the list/dataframe
result_list <- lapply(df, sd)
# Print the result
result_list
$mpg
[1] 6.026948
$cyl
[1] 1.785922
$disp
[1] 123.9387
$hp
[1] 68.56287
$drat
[1] 0.5346787
$wt
[1] 0.9784574
$qsec
[1] 1.786943
$vs
[1] 0.5040161
$am
[1] 0.4989909
$gear
[1] 0.7378041
$carb
[1] 1.6152
sapply
function applies a function to each element of a list or dataframe and attempts to simplify the result into a vector or matrix. Used when you expect a vector or matrix as the output and want automatic simplification of the result.
# Apply function to each element of the list/dataframe
result_vector <- sapply(df, mean)
# Print the result
result_vector
mpg cyl disp hp drat wt qsec
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
vs am gear carb
0.437500 0.406250 3.687500 2.812500
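The difference in return types is easiest to see side by side. The sketch below uses a small hypothetical list (scores) and also shows vapply, a base R relative of sapply that requires you to declare the expected type of each result and raises an error if a result does not match.

```r
# Hypothetical list of numeric vectors (illustration only)
scores <- list(a = c(1, 2, 3), b = c(10, 20))

# lapply always returns a list, one element per input element
as_list <- lapply(scores, mean)

# sapply simplifies to a named numeric vector when each result has length 1
as_vector <- sapply(scores, mean)

# sapply simplifies to a matrix when each result has the same length > 1
# (here: one column per list element, rows hold the min and max)
as_matrix <- sapply(scores, range)

# vapply behaves like sapply but checks each result against FUN.VALUE
checked <- vapply(scores, mean, FUN.VALUE = numeric(1))
```

In scripts where the shape of the result matters, vapply can be a safer choice than sapply because the output type does not depend on the input.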
- Looping - a more direct example for simulations.
Initialize a vector: Create a numeric vector of zeros, sd_results, with a length equal to the number of columns in df to store the standard deviations.
Loop and calculate: Iterate over each column, compute the standard deviation, and store it in the sd_results vector. Assign column names and print the results.
?numeric
# Initialize a vector to store standard deviations (contains only 0)
sd_results <- numeric(ncol(df))
head(sd_results) # check it has only 0, read up on numeric to confirm
[1] 0 0 0 0 0 0
# Loop over each column
for (i in 1:ncol(df)) {
  sd_results[i] <- sd( x = df[ , i] ) # Calculate and store the standard deviation
}
head(sd_results) # sd_result has no set names
[1] 6.0269481 1.7859216 123.9386938 68.5628685 0.5346787 0.9784574
# Set names for the results
names(sd_results) <- colnames(df)
# Print the results
print(sd_results)
mpg cyl disp hp drat wt
6.0269481 1.7859216 123.9386938 68.5628685 0.5346787 0.9784574
qsec vs am gear carb
1.7869432 0.5040161 0.4989909 0.7378041 1.6152000
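The same pattern carries over to simulations, where each iteration produces a set of results that is kept in a list. The following is a minimal sketch under assumed settings: the number of simulations, the sample size, and the standard normal distribution are all arbitrary choices for illustration.

```r
set.seed(123)   # make the random draws reproducible

n_sims <- 5     # number of simulated samples (arbitrary choice)

# Preallocate a list with one slot per simulation
sim_results <- vector("list", n_sims)

for (s in seq_len(n_sims)) {
  # Draw a sample of 30 observations from a standard normal distribution
  draw <- rnorm(n = 30, mean = 0, sd = 1)

  # Store the summary statistics of this draw in slot s
  sim_results[[s]] <- c(mean = mean(draw), sd = sd(draw))
}

# Stack the list elements into a matrix: one row per simulation
sim_table <- do.call(rbind, sim_results)
```

Preallocating the list and filling it by index mirrors the sd_results example above; do.call(rbind, ...) then turns the list into a table for further analysis.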
Packages
I will demonstrate three functions for generating summary statistics:
- summary from base R
- dfSummary from the summarytools package
- skim from the skimr package
summary
Purpose: Provides a summary of statistics for each column in a data frame or vector, including measures like minimum, maximum, mean, median, and quartiles.
summary(df)
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
summarytools
Purpose: dfSummary
from the summarytools
package provides a comprehensive summary of a data frame, including statistics for numeric variables and frequency tables for categorical variables.
# install.packages("summarytools")
library(summarytools)
Attaching package: 'summarytools'
The following object is masked from 'package:tibble':
view
dfSummary(df)
Data Frame Summary
df
Dimensions: 32 x 11
Duplicates: 0
----------------------------------------------------------------------------------------------------------
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
---- ----------- --------------------------- -------------------- ------------------- ---------- ---------
1 mpg Mean (sd) : 20.1 (6) 25 distinct values : 32 0
[numeric] min < med < max: : . (100.0%) (0.0%)
10.4 < 19.2 < 33.9 . : :
IQR (CV) : 7.4 (0.3) : : : .
: : : : :
2 cyl Mean (sd) : 6.2 (1.8) 4 : 11 (34.4%) IIIIII 32 0
[numeric] min < med < max: 6 : 7 (21.9%) IIII (100.0%) (0.0%)
4 < 6 < 8 8 : 14 (43.8%) IIIIIIII
IQR (CV) : 4 (0.3)
3 disp Mean (sd) : 230.7 (123.9) 27 distinct values : 32 0
[numeric] min < med < max: . : (100.0%) (0.0%)
71.1 < 196.3 < 472 : : : : : :
IQR (CV) : 205.2 (0.5) : : : : : : .
: : : . : : : . :
4 hp Mean (sd) : 146.7 (68.6) 22 distinct values . : 32 0
[numeric] min < med < max: : : (100.0%) (0.0%)
52 < 123 < 335 : : : .
IQR (CV) : 83.5 (0.5) : : : :
: : : : . .
5 drat Mean (sd) : 3.6 (0.5) 22 distinct values : 32 0
[numeric] min < med < max: : : (100.0%) (0.0%)
2.8 < 3.7 < 4.9 : : .
IQR (CV) : 0.8 (0.1) . : : :
: : : : .
6 wt Mean (sd) : 3.2 (1) 29 distinct values : 32 0
[numeric] min < med < max: : : (100.0%) (0.0%)
1.5 < 3.3 < 5.4 : :
IQR (CV) : 1 (0.3) : : : : : .
: : : : : . :
7 qsec Mean (sd) : 17.8 (1.8) 30 distinct values : 32 0
[numeric] min < med < max: : (100.0%) (0.0%)
14.5 < 17.7 < 22.9 : :
IQR (CV) : 2 (0.1) . : : : :
: : : : : : : .
8 vs Min : 0 0 : 18 (56.2%) IIIIIIIIIII 32 0
[numeric] Mean : 0.4 1 : 14 (43.8%) IIIIIIII (100.0%) (0.0%)
Max : 1
9 am Min : 0 0 : 19 (59.4%) IIIIIIIIIII 32 0
[numeric] Mean : 0.4 1 : 13 (40.6%) IIIIIIII (100.0%) (0.0%)
Max : 1
10 gear Mean (sd) : 3.7 (0.7) 3 : 15 (46.9%) IIIIIIIII 32 0
[numeric] min < med < max: 4 : 12 (37.5%) IIIIIII (100.0%) (0.0%)
3 < 4 < 5 5 : 5 (15.6%) III
IQR (CV) : 1 (0.2)
11 carb Mean (sd) : 2.8 (1.6) 1 : 7 (21.9%) IIII 32 0
[numeric] min < med < max: 2 : 10 (31.2%) IIIIII (100.0%) (0.0%)
1 < 2 < 8 3 : 3 ( 9.4%) I
IQR (CV) : 2 (0.6) 4 : 10 (31.2%) IIIIII
6 : 1 ( 3.1%)
8 : 1 ( 3.1%)
----------------------------------------------------------------------------------------------------------
skim
Purpose: skim
from the skimr
package provides a detailed and aesthetically pleasing summary of data, including distributions and missing values.
# install.packages("skimr")
library(skimr)
skim(mtcars)
Name | mtcars |
Number of rows | 32 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
numeric | 11 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
cyl | 0 | 1 | 6.19 | 1.79 | 4.00 | 4.00 | 6.00 | 8.00 | 8.00 | ▆▁▃▁▇ |
disp | 0 | 1 | 230.72 | 123.94 | 71.10 | 120.83 | 196.30 | 326.00 | 472.00 | ▇▃▃▃▂ |
hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
drat | 0 | 1 | 3.60 | 0.53 | 2.76 | 3.08 | 3.70 | 3.92 | 4.93 | ▇▃▇▅▁ |
wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
qsec | 0 | 1 | 17.85 | 1.79 | 14.50 | 16.89 | 17.71 | 18.90 | 22.90 | ▃▇▇▂▁ |
vs | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
am | 0 | 1 | 0.41 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
gear | 0 | 1 | 3.69 | 0.74 | 3.00 | 3.00 | 4.00 | 4.00 | 5.00 | ▇▁▆▁▂ |
carb | 0 | 1 | 2.81 | 1.62 | 1.00 | 2.00 | 2.00 | 4.00 | 8.00 | ▇▂▅▁▁ |
stargazer
and kable
package
I will now introduce stargazer, one of the most common packages used for summary statistics in Economics journals, and then show how to use the kable function (from the knitr package) for more flexible table formatting based on your preferred summary statistics command. This will help you create a professional-looking summary statistics table.
stargazer
package
The stargazer
function is designed primarily for summarizing regression models or data frames. See more examples in Appendix.
Base command
# install.packages("stargazer")
require(stargazer)
Loading required package: stargazer
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(df)
% Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
% Date and time: Tue, Jul 29, 2025 - 14:40:16
\begin{table}[!htbp] \centering
\caption{}
\label{}
\begin{tabular}{@{\extracolsep{5pt}}lccccc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
Statistic & \multicolumn{1}{c}{N} & \multicolumn{1}{c}{Mean} & \multicolumn{1}{c}{St. Dev.} & \multicolumn{1}{c}{Min} & \multicolumn{1}{c}{Max} \\
\hline \\[-1.8ex]
mpg & 32 & 20.091 & 6.027 & 10.400 & 33.900 \\
cyl & 32 & 6.188 & 1.786 & 4 & 8 \\
disp & 32 & 230.722 & 123.939 & 71.100 & 472.000 \\
hp & 32 & 146.688 & 68.563 & 52 & 335 \\
drat & 32 & 3.597 & 0.535 & 2.760 & 4.930 \\
wt & 32 & 3.217 & 0.978 & 1.513 & 5.424 \\
qsec & 32 & 17.849 & 1.787 & 14.500 & 22.900 \\
vs & 32 & 0.438 & 0.504 & 0 & 1 \\
am & 32 & 0.406 & 0.499 & 0 & 1 \\
gear & 32 & 3.688 & 0.738 & 3 & 5 \\
carb & 32 & 2.812 & 1.615 & 1 & 8 \\
\hline \\[-1.8ex]
\end{tabular}
\end{table}
stargazer
?
The default output is LaTeX code. Let's set type = "text" and see if we get something more readable.
# install.packages("stargazer")
require(stargazer)
stargazer(df,
          type = "text" # the default type is "latex"
)
============================================
Statistic N Mean St. Dev. Min Max
--------------------------------------------
mpg 32 20.091 6.027 10.400 33.900
cyl 32 6.188 1.786 4 8
disp 32 230.722 123.939 71.100 472.000
hp 32 146.688 68.563 52 335
drat 32 3.597 0.535 2.760 4.930
wt 32 3.217 0.978 1.513 5.424
qsec 32 17.849 1.787 14.500 22.900
vs 32 0.438 0.504 0 1
am 32 0.406 0.499 0 1
gear 32 3.688 0.738 3 5
carb 32 2.812 1.615 1 8
--------------------------------------------
Embellished command
Let's add arguments one by one to improve the presentation and customize the output. You can experiment further with the omit.summary.stat argument.
library(stargazer)
# Enhanced stargazer table
stargazer(df,
type = "text", # Output format
title = "Summary Statistics for mtcars Dataset", # Title of the table
digits = 2, # Number of decimal places
covariate.labels = c("MPG", "Cylinders", "Disp.", "HP", "Rear Axle Ratio",
"Weight", "Q-Mile Time", "V/S", "Transmission",
"Gears", "Carbs") # Custom labels for variables
)
Summary Statistics for mtcars Dataset
===============================================
Statistic N Mean St. Dev. Min Max
-----------------------------------------------
MPG 32 20.09 6.03 10.40 33.90
Cylinders 32 6.19 1.79 4 8
Disp. 32 230.72 123.94 71.10 472.00
HP 32 146.69 68.56 52 335
Rear Axle Ratio 32 3.60 0.53 2.76 4.93
Weight 32 3.22 0.98 1.51 5.42
Q-Mile Time 32 17.85 1.79 14.50 22.90
V/S 32 0.44 0.50 0 1
Transmission 32 0.41 0.50 0 1
Gears 32 3.69 0.74 3 5
Carbs 32 2.81 1.62 1 8
-----------------------------------------------
kable
# install.packages("knitr")
# install.packages("kableExtra")
library(knitr)
library(kableExtra)
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
# detach(kableExtra)
# Generate descriptive statistics using psych::describe
summary_stats <- psych::describe(mtcars)
summary_stats
vars n mean sd median trimmed mad min max range skew
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61
cyl 2 32 6.19 1.79 6.00 6.23 2.97 4.00 8.00 4.00 -0.17
disp 3 32 230.72 123.94 196.30 222.52 140.48 71.10 472.00 400.90 0.38
hp 4 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73
drat 5 32 3.60 0.53 3.70 3.58 0.70 2.76 4.93 2.17 0.27
wt 6 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42
qsec 7 32 17.85 1.79 17.71 17.83 1.42 14.50 22.90 8.40 0.37
vs 8 32 0.44 0.50 0.00 0.42 0.00 0.00 1.00 1.00 0.24
am 9 32 0.41 0.50 0.00 0.38 0.00 0.00 1.00 1.00 0.36
gear 10 32 3.69 0.74 4.00 3.62 1.48 3.00 5.00 2.00 0.53
carb 11 32 2.81 1.62 2.00 2.65 1.48 1.00 8.00 7.00 1.05
kurtosis se
mpg -0.37 1.07
cyl -1.76 0.32
disp -1.21 21.91
hp -0.14 12.12
drat -0.71 0.09
wt -0.02 0.17
qsec 0.34 0.32
vs -2.00 0.09
am -1.92 0.09
gear -1.07 0.13
carb 1.26 0.29
# Create a nicely formatted table using kable
kable(x = summary_stats,
caption = "Descriptive Statistics for mtcars Dataset",
digits = 2 )
 | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mpg | 1 | 32 | 20.09 | 6.03 | 19.20 | 19.70 | 5.41 | 10.40 | 33.90 | 23.50 | 0.61 | -0.37 | 1.07 |
cyl | 2 | 32 | 6.19 | 1.79 | 6.00 | 6.23 | 2.97 | 4.00 | 8.00 | 4.00 | -0.17 | -1.76 | 0.32 |
disp | 3 | 32 | 230.72 | 123.94 | 196.30 | 222.52 | 140.48 | 71.10 | 472.00 | 400.90 | 0.38 | -1.21 | 21.91 |
hp | 4 | 32 | 146.69 | 68.56 | 123.00 | 141.19 | 77.10 | 52.00 | 335.00 | 283.00 | 0.73 | -0.14 | 12.12 |
drat | 5 | 32 | 3.60 | 0.53 | 3.70 | 3.58 | 0.70 | 2.76 | 4.93 | 2.17 | 0.27 | -0.71 | 0.09 |
wt | 6 | 32 | 3.22 | 0.98 | 3.33 | 3.15 | 0.77 | 1.51 | 5.42 | 3.91 | 0.42 | -0.02 | 0.17 |
qsec | 7 | 32 | 17.85 | 1.79 | 17.71 | 17.83 | 1.42 | 14.50 | 22.90 | 8.40 | 0.37 | 0.34 | 0.32 |
vs | 8 | 32 | 0.44 | 0.50 | 0.00 | 0.42 | 0.00 | 0.00 | 1.00 | 1.00 | 0.24 | -2.00 | 0.09 |
am | 9 | 32 | 0.41 | 0.50 | 0.00 | 0.38 | 0.00 | 0.00 | 1.00 | 1.00 | 0.36 | -1.92 | 0.09 |
gear | 10 | 32 | 3.69 | 0.74 | 4.00 | 3.62 | 1.48 | 3.00 | 5.00 | 2.00 | 0.53 | -1.07 | 0.13 |
carb | 11 | 32 | 2.81 | 1.62 | 2.00 | 2.65 | 1.48 | 1.00 | 8.00 | 7.00 | 1.05 | 1.26 | 0.29 |
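If you prefer to choose the statistics yourself, a summary table can also be assembled with base R and then formatted. The sketch below is one possible approach, not the only one: the choice of statistics and the object names (stats_matrix, custom_summary) are illustrative assumptions.

```r
# Compute a chosen set of statistics for every column of mtcars.
# sapply returns a matrix: one row per statistic, one column per variable.
stats_matrix <- sapply(mtcars, function(col) {
  c(mean = mean(col), sd = sd(col), min = min(col), max = max(col))
})

# Transpose so that each variable is a row, round, and convert to a data frame
custom_summary <- as.data.frame(t(round(stats_matrix, 2)))

# Inspect the first rows
head(custom_summary)
```

Because custom_summary is an ordinary data frame, it can be passed to kable in the same way as summary_stats above, for example with a caption and digits = 2.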
Control Structures
Control structures like if, else, for, and while are used to control the flow of execution.
If-Else statement
- An
if-else
statement allows you to make decisions based on certain conditions. This is useful in scenarios where you need to handle different cases or scenarios based on the data.
Baby example
x <- 10
if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is 5 or less")
}
[1] "x is greater than 5"
- You might use an
if-else
statement to categorize countries into high, medium, or low income based on their GDP per capita:
More realistic example
# Example GDP per capita
gdp_per_capita <- 55000
# If-else statement to categorize income level
if (gdp_per_capita > 40000) {
  income_level <- "High Income"
} else if (gdp_per_capita > 20000) {
  income_level <- "Medium Income"
} else {
  income_level <- "Low Income"
}
print(paste("Income level:", income_level))
[1] "Income level: High Income"
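An if statement checks a single condition, so it handles one country at a time. To categorize a whole vector of countries at once, base R offers the vectorized ifelse function, or cut for binning into ranges. The GDP values below are hypothetical, and the thresholds are the same as in the example above.

```r
# Hypothetical GDP per capita for three countries (illustration only)
gdp <- c(55000, 30000, 8000)

# Nested ifelse() evaluates the conditions element by element
income_ifelse <- ifelse(gdp > 40000, "High Income",
                 ifelse(gdp > 20000, "Medium Income", "Low Income"))

# cut() bins the values into intervals; by default each interval is
# open on the left and closed on the right, e.g. (20000, 40000]
income_cut <- cut(gdp,
                  breaks = c(-Inf, 20000, 40000, Inf),
                  labels = c("Low Income", "Medium Income", "High Income"))
```

ifelse returns a character vector here, while cut returns a factor, which is often convenient for tabulation and plotting.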
For Loop
A for loop is useful when you need to perform a repetitive task for a fixed number of iterations.
Repetitive Tasks: The for loop automates repetitive tasks, such as reading multiple files and performing calculations (for example, applying a function to each element in a list of variables). This is efficient and reduces the risk of errors compared to manually repeating these tasks.
Scalability: As the number of datasets grows, you can simply add more file names to the list without needing to rewrite the calculation code.
Baby example
for (i in 1:5) {
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
More realistic (fake*) example
file_names <- c("dataset1.csv", "dataset2.csv", "dataset3.csv") # List of file names
datasets <- list() # Initialize an empty list to store datasets
# Loop over each file name
for (i in seq_along(file_names)) {
  datasets[[i]] <- read.csv(file_names[i]) # Read the dataset and store it in the list
}
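The file names in this example do not exist on disk, so the loop will fail if run as written. A self-contained variant, sketched below, first writes three small CSV files to a temporary folder and then reads them back with the same loop. The folder, the file contents (the first five rows of mtcars), and the object names are assumptions made purely for illustration.

```r
# Create a temporary folder to hold the example files
tmp_dir <- tempfile("datasets_")
dir.create(tmp_dir)

# Build full paths for three hypothetical CSV files
file_names <- file.path(tmp_dir,
                        c("dataset1.csv", "dataset2.csv", "dataset3.csv"))

# Write a small dataset to each file so there is something to read
for (f in file_names) {
  write.csv(head(mtcars, 5), file = f, row.names = FALSE)
}

# Preallocate a list and read each file into its own slot
datasets <- vector("list", length(file_names))
for (i in seq_along(file_names)) {
  datasets[[i]] <- read.csv(file_names[i])
}

# Label each list element with its file name and check the row counts
names(datasets) <- basename(file_names)
n_rows <- sapply(datasets, nrow)

# Remove the temporary folder when done
unlink(tmp_dir, recursive = TRUE)
```

Using seq_along(file_names) rather than 1:length(file_names) avoids an unintended iteration when the vector of file names is empty.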
While Loop
- A while loop is useful for situations where you need to repeat a task until a certain condition is met. It’s particularly handy when the number of iterations isn’t known in advance and depends on some condition that changes dynamically.
Baby example
i <- 1
while (i <= 5) {
  print(i)
  i <- i + 1
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
More realistic example: Newton-Raphson method
This is useful for iterative algorithms, such as numerical optimization routines, iterative estimation and convergence checks in econometric models.
Here the Newton-Raphson method is used to approximate a square root: it iteratively improves the estimate until it converges to the true square root.
number <- 25 # Number for which we want to find the square root
# Initial guess of square root
guess <- number / 2
# Threshold for convergence
tolerance <- 0.0001
# Iterative process to refine the guess
while (abs(guess^2 - number) > tolerance) {
  guess <- (guess + number / guess) / 2 # improves the guess
}
# Print the final guess
print(paste("Approximate square root:", round(x = guess, digits = 4)))
[1] "Approximate square root: 5"
Initial Guess: Start with an initial guess for the square root.
While Loop Condition: Continue the loop as long as the difference between the squared guess and the original number is greater than the tolerance level.
Update Guess: Use the average of the guess and the number divided by the guess to refine the estimate.
Convergence Check: The loop stops when the difference is smaller than the tolerance, meaning the guess is close enough to the actual square root.
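The update rule follows from applying Newton-Raphson to f(g) = g^2 - number: the new guess is g - f(g)/f'(g) = g - (g^2 - number)/(2g), which simplifies to (g + number/g)/2. The steps above can also be wrapped in a reusable function. The sketch below is one way to do so; the function name newton_sqrt and the max_iter safeguard (which stops the loop if convergence is not reached) are additions for illustration and are not part of the original example.

```r
# Approximate the square root of a positive number with Newton-Raphson.
# newton_sqrt and max_iter are illustrative names, not from the lesson.
newton_sqrt <- function(number, tolerance = 1e-4, max_iter = 100) {
  # Guard against inputs the update rule cannot handle
  stopifnot(is.numeric(number), length(number) == 1, number > 0)

  guess <- number / 2   # initial guess
  iter  <- 0            # iteration counter

  # Repeat until the squared guess is close enough to the number,
  # or until the iteration limit is reached
  while (abs(guess^2 - number) > tolerance && iter < max_iter) {
    guess <- (guess + number / guess) / 2   # Newton-Raphson update
    iter  <- iter + 1
  }

  # Return both the estimate and the number of iterations used
  list(root = guess, iterations = iter)
}

res <- newton_sqrt(25)
res$root         # approximate square root of 25
res$iterations   # number of updates performed
```

An iteration limit is a common safeguard for while loops: if the stopping condition can never be met, the loop still terminates.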
Appendix
- Data Types - Geek for Geeks ; Programiz
- Many Data Structure - Software Carpentry
- Stargazer basics.